Embedded Clustering by Intent: an Active Network Traffic Clustering method
Summary
Collected data sources are increasingly important to law enforcement agencies seeking evidence in criminal investigations. Network traffic is one such source, offering detailed insight into individuals’ behaviors and interests. Because intercepted network traffic can be very large, clustering methods are needed to discover which activities are present in the data, yet no such methods currently exist. Unlike text data, internet sessions lack inherent coherence and often include noise from unrelated traffic. Traditional unsupervised methods may yield suboptimal results because of this noise, while supervised methods are infeasible because labeled data is absent. To address these challenges, we propose an adaptation of the ‘Clustering By Intent’ (CBI) technique to network traffic, named Embedded Clustering By Intent (ECBI). ECBI uses Word2vec embeddings to generate queries that identify synonymous features in internet traffic, enabling clustering based on actual activities. The proposed method is validated on specifically sampled internet traffic from the Dutch National Police. In an evaluation consisting of expert interviews and technical experiments, ECBI is compared against CBI and Embedded Topic Modelling. Notably, ECBI clusters contain sessions from a significantly larger number of entities, indicating a focus on activities rather than on the specifics of an individual entity. Furthermore, we show that Word2vec can be applied effectively to network traffic.
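To illustrate the idea of using embeddings to generate queries over synonymous features, the following is a minimal sketch and not the ECBI implementation. It assumes that each traffic feature (here, a hostname) already has an embedding vector, for example from a Word2vec model trained on sessions treated as sentences of features. The toy vectors, the hostnames, the similarity threshold, and the function names expand_query and match_sessions are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy embeddings. In practice these would come from a
# Word2vec model trained on sessions; the values here are made up.
EMBEDDINGS = {
    "mail.example.com": [0.9, 0.1, 0.0],
    "smtp.example.net": [0.8, 0.2, 0.1],
    "imap.example.org": [0.85, 0.15, 0.05],
    "video.example.com": [0.1, 0.9, 0.2],
    "stream.example.net": [0.05, 0.95, 0.1],
    "cdn.example.org": [0.4, 0.5, 0.6],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(seed, embeddings, threshold=0.9):
    """Return features whose embedding is close to the seed feature.

    The threshold is an assumed tuning parameter, not a value from the paper.
    """
    seed_vec = embeddings[seed]
    return sorted(
        feature
        for feature, vec in embeddings.items()
        if feature != seed and cosine(seed_vec, vec) >= threshold
    )


def match_sessions(sessions, query):
    """Return ids of sessions containing at least one query feature."""
    return [sid for sid, features in sessions.items() if query & set(features)]


# Hypothetical sessions, each a list of observed features.
SESSIONS = {
    "s1": ["smtp.example.net", "cdn.example.org"],
    "s2": ["video.example.com"],
    "s3": ["imap.example.org"],
    "s4": ["mail.example.com", "stream.example.net"],
}

seed = "mail.example.com"
query = {seed} | set(expand_query(seed, EMBEDDINGS))
cluster = match_sessions(SESSIONS, query)
```

In this sketch, a single seed feature is expanded into a query of near-synonymous features, and every session matching the query is grouped together regardless of which entity produced it. This mirrors, in simplified form, why embedding-based queries can group sessions by activity rather than by the exact features one entity happens to use.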