Multi-label text classification of news articles for ASDMedia

Meeuwen, F.W. van

View/Open

thesis.pdf (3.736Mb)

Publication date

2013

Author

Meeuwen, F.W. van

Summary

With the continuously increasing amount of online information, there is a pressing need to structure it. Text classification (TC) is a technique that assigns textual information to a predefined set of categories. This thesis describes a case study on classifying news articles from two different datasets collected by the business-to-business news publisher ASDMedia. The goal is to find out whether a machine learning (ML) approach to TC can be used to construct a classification system suitable for a semi-automatic setting. The two main challenges are that news articles can be labeled with multiple categories (multi-label) and that the data are highly imbalanced. For analytical purposes, we restrict ourselves to ML algorithms that generate human-interpretable models, namely decision trees. We applied state-of-the-art techniques to address these challenges and conducted various experiments. Our focus is on 1) finding the best feature representation of news articles and 2) trying out techniques that exploit structure within the class labels, namely classifier chains (CC) and hierarchical top-down classification (HTC). By using the optimized feature representation and applying the CC technique, we improved the results substantially over a default setup for both datasets. The best settings reached Micro-F1 values of .625 and .752 on the two ASDMedia datasets. We conclude that the constructed classification system is suited to be part of a semi-automated system. However, it is advisable to collect more data for the minority categories. Although HTC looked promising and saves a lot of CPU time, its actual performance was considerably lower than that of the setup without it.
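The summary reports results as Micro-F1, which pools every (article, category) decision before computing precision and recall. The following is a minimal sketch of that metric in plain Python, not the thesis's own evaluation code. The function name `micro_f1`, the representation of labels as sets, and the example categories are all assumptions made for illustration and are not taken from the ASDMedia data.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (article, label) decisions.

    y_true, y_pred: lists of sets of labels, one set per article
    (a hypothetical representation chosen for this sketch).
    """
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        tp += len(true_labels & pred_labels)   # labels correctly predicted
        fp += len(pred_labels - true_labels)   # labels predicted but wrong
        fn += len(true_labels - pred_labels)   # labels missed
    if tp == 0:
        # No correct predictions: define the score as 0 to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for four articles, for illustration only.
y_true = [{"aviation", "defence"}, {"energy"}, {"aviation"}, {"defence", "security"}]
y_pred = [{"aviation"}, {"energy", "security"}, {"aviation"}, {"defence"}]

score = micro_f1(y_true, y_pred)
print(f"Micro-F1: {score:.3f}")
```

Because the counts are pooled across categories, frequent categories dominate the score, which is worth keeping in mind when the data are imbalanced and the minority categories have few examples.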

URI

https://studenttheses.uu.nl/handle/20.500.12932/14907

Collections

Theses