Enhancing Railway Operations: Rule-Based Event Log Extraction for ProRail Traffic Control
Summary
Current research in process mining and event detection offers excellent interfaces and algorithms for discovering and improving real-world processes. However, these methods assume that the very activities that make up the processes are already captured in data. Algorithms capable of extracting event log data from generic tabular data sources are limited and often require significant manual effort or input from domain experts. A generic, easy-to-use approach that can transform tabular data into event log data is currently lacking.
This thesis addresses the gap through a case study conducted in collaboration with the Dutch railway infrastructure manager ProRail. ProRail oversees the maintenance, traffic control, management, and expansion of the Dutch railway network, with safety as its top priority. To ensure and enhance safety, it is crucial to understand the operational processes on the railway network, such as driving trains. Currently, ProRail does not monitor these user processes automatically and in real time.
The main research question addressed by this thesis is: ‘To what extent is it possible to automatically detect user processes in the data of ProRail?’
This thesis proposes a comprehensive solution by:
1. Formalising a methodology to convert any generic tabular data source into event log data.
2. Formalising operational processes and their activities as process models.
3. Implementing a rule-based system (RBS) to extract activities from data.
4. Matching extracted activity sequences to predefined process models using the longest common subsequence (LCS) and token-based replay (TBR) algorithms.
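Steps 3 and 4 can be illustrated with a minimal sketch. It is not the thesis implementation: the column names (`speed`, `at_station`), the three rules, the model names, and the normalisation of the LCS length by the longer of the two sequences are all assumptions made for illustration. An activity is emitted for a row only when all of its rules hold, and a trace is matched to every process model that shares the top LCS score, so ties are reported rather than broken arbitrarily.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Rule = Callable[[Row], bool]

# Hypothetical rule base: an activity fires when ALL of its rules hold for a row.
RULES: Dict[str, List[Rule]] = {
    "depart": [lambda r: r["speed"] > 0, lambda r: r["at_station"]],
    "drive": [lambda r: r["speed"] > 0, lambda r: not r["at_station"]],
    "stop": [lambda r: r["speed"] == 0],
}


def extract_activities(rows: List[Row]) -> List[str]:
    """Scan the rows in order and emit every activity whose rules all hold."""
    activities = []
    for row in rows:
        for name, rules in RULES.items():
            if all(rule(row) for rule in rules):
                activities.append(name)
    return activities


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_models(trace: List[str], models: Dict[str, List[str]]) -> List[str]:
    """Return all models sharing the highest normalised LCS score (assumed scoring)."""
    scores = {
        name: lcs_length(trace, seq) / max(len(trace), len(seq), 1)
        for name, seq in models.items()
    }
    top = max(scores.values())
    return [name for name, score in scores.items() if score == top]
```

A trace has a single best match when `best_models` returns a list of length one. Token-based replay is not sketched here because it requires a Petri net representation of each process model.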
The performance of the proof of concept (PoC) is assessed using the macro-averaged F1-score of fitness and precision. This metric accounts for the imbalance in the number of instances per process model.
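As a rough illustration of this metric, the sketch below computes one F1-score per process model as the harmonic mean of its fitness and precision and then takes the unweighted mean over models. How fitness and precision are aggregated per model is not specified in this summary, so the per-model values are assumed to be given, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, Tuple


def f1(fitness: float, precision: float) -> float:
    """Harmonic mean of fitness and precision; defined as 0 when both are 0."""
    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)


def macro_f1(per_model: Dict[str, Tuple[float, float]]) -> float:
    """Unweighted mean of the per-model F1-scores.

    Each model counts equally, regardless of how many instances it has.
    """
    return mean(f1(fit, prec) for fit, prec in per_model.values())
```

Because every process model contributes one term to the mean, a model with few train journeys weighs as much as a model with many.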
The RBS extracted over 70,000 activities from 2,506 process instances (train journeys) present in the data. The created dataset annotated by the RBS serves as a ground truth for future research, such as training supervised learning models. The TBR algorithm found a single best process model for 2,448 out of 2,506 train journeys, while the LCS algorithm found 1,156 single best matches. The macro-averaged F1-score of the LCS algorithm is 0.120 ± 0.087, while the TBR algorithm scores 0.104 ± 0.048.
A two-sided paired t-test (t = 0.564, p = 0.629) indicates no significant performance difference between the LCS and TBR algorithms. However, the TBR algorithm is preferred as it matches more than twice as many activity sequences to process models and supports multiple paths, loops, and parallel activities within a process model, which will be beneficial for future improvements to the processes. Inter-annotator agreement between two domain experts resulted in a Cohen’s Kappa coefficient of κ = 0.49, indicating moderate agreement, while the agreement between each algorithm-expert pair ranged from poor to slight with κ-values between −0.06 and 0.09.
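For reference, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The following is a minimal sketch with a hypothetical function name and made-up labels, not the evaluation code used in the thesis.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement below chance level.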
To conclude, it is possible to automatically detect user processes in the data of ProRail to some extent. Future improvements include refining the current process models, incorporating additional data sources to capture activities that cannot currently be implemented, and enhancing the RBS by allowing a percentage of rules to be true for an activity rather than requiring all rules to be true. The annotation of more data by the experts will further improve the quality of the constructed event log dataset. More research is needed to improve the ordering of activities within an event and the detection of multiple processes within a single train journey. The use of semantic similarity measures to match activities to dataset columns is very promising and should be further explored.
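The proposed relaxation of the RBS, in which an activity requires only a percentage of its rules to hold, could look roughly like the sketch below. The function name, the `min_fraction` parameter, and the example rules are hypothetical and are not part of the current PoC.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Rule = Callable[[Row], bool]


def activity_fires(row: Row, rules: List[Rule], min_fraction: float = 1.0) -> bool:
    """Return True when at least `min_fraction` of the rules hold for the row.

    min_fraction=1.0 corresponds to the strict behaviour in which all rules must hold.
    """
    if not rules:
        return False  # assumption: an activity without rules never fires
    satisfied = sum(1 for rule in rules if rule(row))
    return satisfied / len(rules) >= min_fraction


# Hypothetical rules for a "depart" activity.
depart_rules: List[Rule] = [
    lambda r: r["speed"] > 0,
    lambda r: r["at_station"],
    lambda r: r["signal"] == "green",
]
row = {"speed": 5, "at_station": True, "signal": "unknown"}
strict = activity_fires(row, depart_rules)                      # all rules required
relaxed = activity_fires(row, depart_rules, min_fraction=0.6)   # two of three suffice
```

Lowering the threshold trades precision for recall, so a suitable value would have to be chosen per activity and checked against expert annotations.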