Enhancing table discovery and similarity evaluation in data lakes

Díaz De Burgos Llaberia, Víctor

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Velegrakis, Ioannis
dc.contributor.author	Díaz De Burgos Llaberia, Víctor
dc.date.accessioned	2023-07-25T00:02:24Z
dc.date.available	2023-07-25T00:02:24Z
dc.date.issued	2023
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/44314
dc.description.abstract	The discovery of relevant tables within a data lake is a crucial task for users seeking to expand their available data and gain deeper insights into specific topics. In this thesis, we propose several approaches to address the problem of table discovery using similarity metrics. Our main objective is to find the tables in a data lake that are most similar to a given query table. We begin by comparing the columns of the query table with those of the candidate tables using Jaccard similarity. This pairwise comparison yields a similarity score, but it is computationally expensive and therefore scales poorly. To overcome this limitation, we investigate keyword-based approaches. We propose using the YAKE and LDA algorithms to extract the keywords that best represent each table, and we determine the similarity score with weighted Jaccard similarity. These keyword-based approaches yield accuracy comparable to the column-based methods while offering improved efficiency. However, comparing keywords directly performs worse when tables contain data about similar but not identical topics. To address this, we transform the keywords into embeddings using Word2Vec and BERT so that the semantics of the words can be analysed. The similarity score in this case is determined by the weighted cosine similarity between the vectorized keywords. We evaluate our models using the NDCG@10 metric, which assesses the ranking of the top tables against a labelled data lake that we annotated. We show that LDA combined with Word2Vec is the most efficient and accurate model when tables contain sufficient natural-language textual data. In conclusion, our research presents a comprehensive exploration of table discovery in data lakes, focusing on similarity-based approaches. We provide insights into the efficiency and accuracy of various methods, emphasising the use of keywords and embeddings for table comparison. Our findings contribute to the broader field of data discovery and serve as a foundation for future research in improving table discovery techniques.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	The proposed research aims to improve the current state-of-the-art algorithm for dataset discovery by example. The input to the algorithm is a query table, and the goal is to find the most similar tables and their corresponding similarity scores from a data lake. The research aims to identify gaps in the current algorithm and develop new techniques to address those gaps. The overall objective is to enhance the efficiency and accuracy of the dataset discovery process, which is essential for vario
dc.title	Enhancing table discovery and similarity evaluation in data lakes
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Dataset discovery, text processing, keyword extraction, word embeddings
dc.subject.courseuu	Applied Data Science
dc.thesis.id	20044
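
Illustrative sketch (not taken from the thesis): the abstract mentions column-wise Jaccard similarity and NDCG@10 evaluation, which could look roughly like the Python below. The function names, the choice to average each query column's best match, and the computation of the ideal ranking from the returned list only are assumptions made for illustration; the thesis may aggregate and normalise differently.

```python
import math


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty columns count as dissimilar
    return len(a & b) / len(a | b)


def table_similarity(query_cols, cand_cols):
    """Assumed aggregation: mean over query columns of the best-matching
    candidate column. Every query column is compared with every candidate
    column, which is the pairwise cost the abstract refers to."""
    if not query_cols or not cand_cols:
        return 0.0
    best = [max(jaccard(q, c) for c in cand_cols) for q in query_cols]
    return sum(best) / len(best)


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for relevance labels listed in the order the model ranked them.
    Assumption: the ideal ranking is built from the same list of labels."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```

For example, `table_similarity(query_cols, cand_cols)` would be called once per candidate table, the candidates sorted by score, and the annotated relevance labels of that ordering passed to `ndcg_at_k`.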

Files in this item

Name: Thesis_DiazdeBurgos_Victor.pdf
Size: 472.3Kb
Format: PDF

This item appears in the following Collection(s)

Theses