Show simple item record

dc.rights.licenseCC-BY-NC-ND
dc.contributor.advisorFeelders, A.J.
dc.contributor.advisorSiebes, A.P.J.M.
dc.contributor.advisorHoeve, J.
dc.contributor.authorMeeuwen, F.W. van
dc.date.accessioned2013-09-19T17:02:01Z
dc.date.available2013-09-19T17:02:01Z
dc.date.issued2013
dc.identifier.urihttps://studenttheses.uu.nl/handle/20.500.12932/14907
dc.description.abstractWith the continuously increasing amount of online information, there is a pressing need to structure this information. Text classification (TC) is a technique that assigns textual information to a predefined set of categories. This thesis describes a case study on classifying news articles from two different datasets collected by the business-to-business news publisher ASDMedia. The goal is to determine whether a machine learning (ML) approach to TC can be used to construct a classification system suitable for a semi-automatic setting. The two main challenges are that news articles may be labeled with multiple categories (multi-label) and that the data is highly imbalanced. For analytical purposes, we restrict ourselves to ML algorithms that generate human-interpretable models, namely decision trees. We applied state-of-the-art techniques to address the above-mentioned challenges and conducted various experiments. Our focus is on 1) finding the best feature representation of news articles and 2) trying out techniques that exploit structure within the class labels, namely classifier chains (CC) and hierarchical top-down classification (HTC). By using the optimized feature representation and applying the CC technique, we improved the results substantially for both datasets relative to a default setup. The best settings reached Micro-F1 values of .625 and .752 on the two ASDMedia datasets. We conclude that the resulting classification system is suitable for use as part of a semi-automated system. However, it is advisable to collect more data for the minority categories. Although HTC looked promising and saves a lot of CPU time, its actual performance was considerably lower than that achieved without it.
dc.description.sponsorshipUtrecht University
dc.language.isoen_US
dc.titleMulti-label text classification of news articles for ASDMedia
dc.type.contentMaster Thesis
dc.rights.accessrightsOpen Access
dc.subject.keywordsClassification, Machine Learning, Data Mining, Multi-label Classification, Hierarchical Classification, Text Classification, Chaining
dc.subject.courseuuTechnical Artificial Intelligence

