Multi-label text classification of news
articles for ASDMedia

Meeuwen, F.W. van

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Feelders, A.J.
dc.contributor.advisor	Siebes, A.P.J.M.
dc.contributor.advisor	Hoeve, J.
dc.contributor.author	Meeuwen, F.W. van
dc.date.accessioned	2013-09-19T17:02:01Z
dc.date.available	2013-09-19
dc.date.available	2013-09-19T17:02:01Z
dc.date.issued	2013
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/14907
dc.description.abstract	With the continuously increasing amount of online information, there is a pressing need to structure it. Text classification (TC) is a technique that assigns textual information to a predefined set of categories. This thesis describes a case study on classifying news articles from two different datasets collected by the business-to-business news publisher ASDMedia. The goal is to find out whether a machine learning (ML) approach to TC can be used to construct a classification system suitable for a semi-automatic setting. The two main challenges are that news articles can be labeled with multiple categories (multi-label) and that the datasets are very imbalanced. For analytical purposes, we restrict ourselves to ML algorithms that generate human-interpretable models, namely decision trees. We applied state-of-the-art techniques to address these challenges and conducted various experiments. Our focus is on 1) finding the best feature representation of news articles and 2) trying out techniques that exploit structure within the class labels, namely classifier chains (CC) and hierarchical top-down classification (HTC). By using the optimized feature representation and applying the CC technique, we improved the results substantially over a default setup for both datasets. The best settings reached Micro-F1 values of .625 and .752 on the two ASDMedia datasets, respectively. We conclude that the constructed classification system is suited to be part of a semi-automated system. However, it is advisable to collect more data for the minority categories. Although HTC looked promising and saves a lot of CPU time, its actual performance was considerably lower than that of the setup without it.
dc.description.sponsorship	Utrecht University
dc.language.iso	en_US
dc.title	Multi-label text classification of news articles for ASDMedia
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Classification, Machine Learning, Data Mining, Multi-label Classification, Hierarchical Classification, Text Classification, Chaining
dc.subject.courseuu	Technical Artificial Intelligence

Files in this item

Name: thesis.pdf
Size: 3.736Mb
Format: PDF

This item appears in the following Collection(s)

Theses
