Multi-label text classification of news articles for ASDMedia
Meeuwen, F.W. van
With the continuously increasing amount of online information, there is a pressing need to structure this information. Text classification (TC) is a technique that assigns textual information to a predefined set of categories. This thesis describes a case study on classifying news articles from two different datasets collected by the business-to-business news publisher ASDMedia. The goal is to find out whether a machine learning (ML) approach to TC can be used to construct a classification system suitable for a semi-automatic setting. The two main challenges are that news articles may be labeled with multiple categories (multi-label) and that the datasets are highly imbalanced. For analytical purposes, we restrict ourselves to ML algorithms that generate human-interpretable models, namely decision trees. We applied state-of-the-art techniques to address the above-mentioned challenges and conducted various experiments. Our focus is on 1) finding the best feature representation of news articles and 2) trying out techniques that exploit structure within the class labels, namely classifier chains (CC) and hierarchical top-down classification (HTC). By using the optimized feature representation and applying the CC technique, we managed to improve the results substantially for both datasets relative to a default setup. The best settings reached Micro-F1 values of .625 and .752 for the two ASDMedia datasets, respectively. We conclude that the constructed classification system is suited to be part of a semi-automated system. However, it is advisable to collect more data for the minority categories. Although HTC looked promising and saves a lot of CPU time, its actual performance was considerably lower than that of the setup without it.
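To make the reported Micro-F1 metric concrete, the following is a minimal sketch of how micro-averaged F1 is commonly computed for multi-label predictions: true positives, false positives, and false negatives are pooled over all articles and categories before precision and recall are combined. The function name `micro_f1` and the category names are hypothetical and chosen for illustration only; this is not the thesis's evaluation code.

```python
from typing import Iterable, Set


def micro_f1(y_true: Iterable[Set[str]], y_pred: Iterable[Set[str]]) -> float:
    """Micro-averaged F1 over label sets, one set of categories per article."""
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        tp += len(true_labels & pred_labels)   # categories correctly assigned
        fp += len(pred_labels - true_labels)   # categories wrongly assigned
        fn += len(true_labels - pred_labels)   # categories that were missed
    denominator = 2 * tp + fp + fn
    # By convention here, return 0.0 when there is nothing to score.
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical category names, for illustration only.
y_true = [{"defence", "aerospace"}, {"defence"}, {"energy"}, {"defence"}]
y_pred = [{"defence"}, {"defence"}, {"defence"}, {"defence", "energy"}]

# Pooled counts: 3 true positives, 2 false positives, 2 false negatives.
print(micro_f1(y_true, y_pred))  # 0.6
```

Because the counts are pooled across categories, frequent categories contribute most of the score, so on imbalanced data a Micro-F1 value can look reasonable even when minority categories are predicted poorly.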