Measuring the Behavioral Quality of Log Sampling
Summary
Process mining is a discipline that combines techniques from data mining and process analysis. It has three main goals, realized by analysing event logs: process discovery, conformance checking and process enhancement. In process discovery especially, the size of these event logs is becoming a problem for the tools currently available. This is compounded by the exploratory nature of many process mining algorithms, which are often run several times with different parameters. A solution is to reduce the data by sampling the event log. Yet although many sampling approaches exist, the quality of these approaches is unknown.
This thesis studies the quality of random samples from an event log by introducing six quality measures based on the behavior of the event log. The approach is backed by theory and has been implemented in the tool ProM. Experiments show that sampling very quickly introduces under- and oversampled behavior into the event log, for both high and low sample rates. Unsampled behavior, where behavior is completely absent from a sample, also occurs in all samples, which can be problematic for frequency-based algorithms. Future research should study the sampling of event logs more thoroughly, to help practitioners choose sample rates and techniques.
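To make the notions of under-, over- and unsampled behavior concrete, the following sketch illustrates the general idea in Python. It is not the thesis's ProM implementation or its six quality measures; the log representation (traces as tuples of activity labels), the `sample_log` and `classify_variants` helpers, and the `tolerance` threshold are all hypothetical and chosen for illustration only.

```python
import random
from collections import Counter

def sample_log(log, rate, seed=0):
    """Randomly keep each trace with probability `rate` (trace-level sampling)."""
    rng = random.Random(seed)
    return [trace for trace in log if rng.random() < rate]

def classify_variants(log, sample, tolerance=0.5):
    """Compare relative variant frequencies between the full log and a sample.

    A trace variant is 'unsampled' if it is absent from the sample, and
    'undersampled' or 'oversampled' if its relative frequency in the sample
    deviates by more than `tolerance` (a hypothetical threshold) from its
    relative frequency in the full log.
    """
    full_counts = Counter(log)
    sample_counts = Counter(sample)
    result = {}
    for variant, count in full_counts.items():
        freq_full = count / len(log)
        freq_sample = sample_counts[variant] / len(sample) if sample else 0.0
        if sample_counts[variant] == 0:
            result[variant] = "unsampled"
        elif freq_sample < freq_full * (1 - tolerance):
            result[variant] = "undersampled"
        elif freq_sample > freq_full * (1 + tolerance):
            result[variant] = "oversampled"
        else:
            result[variant] = "ok"
    return result

# A toy event log: each trace is a tuple of activity labels.
log = [("a", "b", "c")] * 50 + [("a", "c", "b")] * 10 + [("a", "d")] * 2
sample = sample_log(log, rate=0.2)
print(classify_variants(log, sample))
```

Even in this toy setting, rare variants such as `("a", "d")` are easily lost entirely at low sample rates, which mirrors the unsampled behavior the experiments observe.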