        Utrecht University Student Theses Repository
        Multi-label text classification of news articles for ASDMedia

        thesis.pdf (3.736Mb)
        Publication date
        2013
        Author
        Meeuwen, F.W. van
        Summary
        With the continuously increasing amount of online information, there is a pressing need to structure it. Text classification (TC) is a technique that assigns textual information to a predefined set of categories. This thesis describes a case study on classifying news articles from two different datasets collected by the business-to-business news publisher ASDMedia. The goal is to determine whether a machine learning (ML) approach to TC can be used to construct a classification system suitable for a semi-automatic setting. The two main challenges are that news articles may be labeled with multiple categories (multi-label) and that the data is very imbalanced. For analytical purposes, we restrict ourselves to ML algorithms that generate human-interpretable models, namely decision trees. We applied state-of-the-art techniques to address these challenges and conducted various experiments. Our focus is on 1) finding the best feature representation of news articles and 2) exploiting structure within the class labels by means of classifier chains (CC) and hierarchical top-down classification (HTC). By using the optimized feature representation and applying the CC technique, we improved the results substantially over a default setup on both datasets. The best settings reached Micro-F1 values of .625 and .752 on the two ASDMedia datasets. We conclude that the constructed classification system is suited to be part of a semi-automated system. However, it is advisable to collect more data for the minority categories. Although HTC looked promising and saves a lot of CPU time, its actual performance was considerably lower than that of the setup without it.
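The summary reports results as Micro-F1, which pools true positives, false positives, and false negatives over all articles and categories before computing precision and recall. The following is a minimal sketch of that metric for multi-label output, not the thesis's own evaluation code; the function name `micro_f1` and the toy category names are hypothetical and chosen only for illustration.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 for multi-label predictions.

    Both arguments are lists of label sets, one set per article.
    Counts are pooled over all articles and categories before
    precision and recall are computed.
    """
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        tp += len(truth & predicted)   # labels correctly assigned
        fp += len(predicted - truth)   # labels assigned but wrong
        fn += len(truth - predicted)   # labels that were missed
    if tp == 0:
        return 0.0                     # no correct label at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: three articles, category names invented.
truth = [{"energy", "finance"}, {"defence"}, {"energy"}]
predicted = [{"energy"}, {"defence", "finance"}, {"energy"}]

score = micro_f1(truth, predicted)
print(score)
```

Because the counts are pooled, frequent categories dominate the score, which is worth keeping in mind when reading Micro-F1 values on imbalanced data such as the datasets described above.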
        URI
        https://studenttheses.uu.nl/handle/20.500.12932/14907
        Collections
        • Theses