A Quantitative Comparison of Semantic Web Page Segmentation Algorithms
This thesis explores the effectiveness of different semantic Web page segmentation algorithms on modern websites. We compare four algorithms: BlockFusion, PageSegmenter, VIPS, and the novel WebTerrain algorithm, which was developed as part of this thesis. We introduce a new testing framework that can selectively run different algorithms on different datasets and then automatically compares the generated results to the ground truth. We used it to run each of the four algorithms in eight different configurations, varying the dataset, the evaluation metric, and the type of input HTML document, for a total of 32 combinations. On average, all algorithms performed better on random pages than on popular pages, most likely because popular pages are more complex. Furthermore, results are better when the algorithms are run on the HTML obtained from the DOM than on the plain HTML. Of the algorithms, BlockFusion has the lowest average F-score and WebTerrain the highest. Overall, there is still room for improvement, as the best average F-score is only 0.49.
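
The arithmetic behind the 32 combinations can be sketched as follows. This is a minimal illustration, not the thesis framework: it assumes the eight configurations arise from two datasets, two evaluation metrics, and two input types. The metric names `metric_a` and `metric_b` are hypothetical placeholders, since the abstract does not name the metrics. The `f_score` helper likewise assumes the balanced F1 form (harmonic mean of precision and recall), which the abstract does not specify.

```python
from itertools import product

# Algorithm names are taken from the abstract.
ALGORITHMS = ["BlockFusion", "PageSegmenter", "VIPS", "WebTerrain"]

# Assumed two-valued dimensions; labels are illustrative only.
DATASETS = ["random", "popular"]
INPUT_TYPES = ["plain_html", "dom_html"]
METRICS = ["metric_a", "metric_b"]  # hypothetical placeholder names

# Every (dataset, metric, input type) triple is one configuration: 2 * 2 * 2 = 8.
configurations = list(product(DATASETS, METRICS, INPUT_TYPES))

# Each algorithm is run in every configuration: 4 * 8 = 32.
runs = list(product(ALGORITHMS, configurations))


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1), assumed here: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was matched
    return 2 * precision * recall / (precision + recall)


print(len(configurations), len(runs))
```

Under these assumptions, averaging `f_score` over the eight configurations of one algorithm would give the kind of per-algorithm average that the abstract reports.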