Estimation and Analysis of the Quality of Event Log Samples for Process Discovery
Summary
Background:
Process discovery aims to learn a process model from event logs. As data volumes grow, event logs quickly become too large to analyse efficiently with current process discovery tools. Sampling has recently been proposed as one way to address this challenge. However, little is known about sampling techniques in the context of process discovery.
Method:
This study examines the effect of sampling on event logs and on the process models discovered from them. First, literature on process discovery and on sampling for process discovery was studied. Next, new sampling methods were created based on insights from this literature, and new measures that indicate the quality of event log samples were introduced. Finally, an evaluation using two real-life event logs was conducted to study the effect of different sampling techniques and sample ratios on event logs and discovered process models. The samples were assessed using the newly introduced quality measures. The models discovered from the samples with the Inductive Miner were evaluated using quality dimensions from the literature and a qualitative comparison.
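To make the idea of sampling an event log at a given sample ratio concrete, the following is a minimal sketch and not the method used in this study. It assumes an event log represented as a mapping from case id to an ordered list of activity names, and it uses simple random sampling of cases. The function names (`sample_traces`, `directly_follows`, `df_coverage`) and the coverage measure are hypothetical illustrations; they are not the sampling techniques or quality measures introduced in this work.

```python
import random
from typing import Dict, List, Set, Tuple

# Assumption: an event log is modelled as a mapping from case id to its trace,
# i.e. the ordered list of activity names recorded for that case.
EventLog = Dict[str, List[str]]


def sample_traces(log: EventLog, ratio: float, seed: int = 0) -> EventLog:
    """Draw a simple random sample of cases, without replacement."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    rng = random.Random(seed)
    # Keep at least one case for a non-empty log, never more than the log holds.
    k = min(len(log), max(1, round(ratio * len(log))))
    chosen = rng.sample(sorted(log), k)
    return {case_id: log[case_id] for case_id in chosen}


def directly_follows(log: EventLog) -> Set[Tuple[str, str]]:
    """Collect every pair (a, b) where b immediately follows a in some trace."""
    return {(a, b) for trace in log.values() for a, b in zip(trace, trace[1:])}


def df_coverage(original: EventLog, sample: EventLog) -> float:
    """Hypothetical quality measure: the share of directly-follows pairs of the
    original log that also occur in the sample."""
    full = directly_follows(original)
    if not full:
        return 1.0
    return len(directly_follows(sample) & full) / len(full)


if __name__ == "__main__":
    log = {
        "c1": ["register", "check", "approve"],
        "c2": ["register", "check", "reject"],
        "c3": ["register", "approve"],
        "c4": ["register", "check", "approve"],
    }
    sample = sample_traces(log, ratio=0.5, seed=42)
    print(len(sample), df_coverage(log, sample))
```

Under these assumptions, repeating the call with different values of `ratio` gives a rough view of how much behaviour of the original log a sample of that size retains.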
Key Findings:
The measures that indicate the quality of the samples showed an increase in quality as the sample size increased. In contrast, the measures that indicate the quality of the models discovered from these samples showed a decrease in quality as the sample size increased. Furthermore, no large differences were found between the sampling techniques, except for one technique that produced results similar to those of the original event log using only 1% of the data.
Discussion:
Process discovery practitioners may benefit from the finding that sampling can yield models of equal or better quality while reducing the data volume. Future research should study the effect of sampling on more real-life event logs and with different process discovery algorithms.