dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Hage, J.
dc.contributor.advisor: Feelders, A.
dc.contributor.author: Kreuzer, R.A.
dc.date.accessioned: 2013-09-19T17:01:53Z
dc.date.available: 2013-09-19
dc.date.available: 2013-09-19T17:01:53Z
dc.date.issued: 2013
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/14898
dc.description.abstract: This thesis explores the effectiveness of different semantic Web page segmentation algorithms on modern websites. We compare four algorithms: BlockFusion, PageSegmenter, VIPS, and the novel WebTerrain algorithm, which was developed as part of this thesis. We introduce a new testing framework that makes it possible to selectively run different algorithms on different datasets and then automatically compares the generated results to the ground truth. We used it to run each algorithm in eight different configurations, varying the dataset, the evaluation metric, and the type of input HTML document, for a total of 32 combinations. We found that, on average, all algorithms performed better on random pages than on popular pages, most likely because popular pages are more complex. Furthermore, the results are better when running the algorithms on the HTML obtained from the DOM than on the plain HTML. Of the different algorithms, BlockFusion has the lowest average F-score and WebTerrain the highest. Overall, there is still room for improvement, as the best average F-score we found is 0.49.
dc.description.sponsorship: Utrecht University
dc.format.extent: 578691 bytes
dc.format.mimetype: application/pdf
dc.language.iso: en_US
dc.title: A Quantitative Comparison of Semantic Web Page Segmentation Algorithms
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Semantic Web, Web Page segmentation, Segmentation algorithms, Full-text Extraction
dc.subject.courseuu: Computing Science

