Enhancing table discovery and similarity evaluation in data lakes
Publication date
2023
Author
Díaz De Burgos Llaberia, Víctor
Summary
The discovery of relevant tables within a data lake is a crucial task for users
seeking to expand their available data and gain deeper insights into specific
topics. In this thesis, we propose several approaches to address the
problem of table discovery using similarity metrics. Our main objective is
to find the most similar tables in a data lake to a given query table.
We begin by comparing the columns of the query table with those of the
candidate tables using Jaccard similarity. This pairwise comparison yields
a similarity score for each candidate table, but it presents efficiency
challenges due to its extensive computational requirements. To overcome these limitations,
we investigate the use of keyword-based approaches. We propose using
the Yake and LDA algorithms to extract the keywords that best represent
the tables and determine the similarity score with weighted Jaccard similarity.
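
The two set-based scores mentioned above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes the common definition of weighted Jaccard (sum of per-keyword minima over sum of maxima), and the function names, sample columns, and keyword weights are hypothetical.

```python
def jaccard(a, b):
    """Plain Jaccard similarity between two collections of values."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_jaccard(w1, w2):
    """Weighted Jaccard over keyword -> weight mappings.

    Assumed definition: sum of minima divided by sum of maxima,
    treating a missing keyword as weight 0.
    """
    keys = set(w1) | set(w2)
    num = sum(min(w1.get(k, 0.0), w2.get(k, 0.0)) for k in keys)
    den = sum(max(w1.get(k, 0.0), w2.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


# Column-level comparison: two shared values out of four distinct ones.
query_col = ["Paris", "Rome", "Madrid"]
candidate_col = ["Rome", "Madrid", "Lisbon"]
col_score = jaccard(query_col, candidate_col)

# Keyword-level comparison with illustrative (made-up) weights.
query_kw = {"city": 0.5, "population": 0.3, "europe": 0.2}
candidate_kw = {"city": 0.4, "country": 0.4, "europe": 0.2}
kw_score = weighted_jaccard(query_kw, candidate_kw)
```

Comparing one short keyword list per table, rather than every pair of columns, is what makes the keyword-based variant cheaper to compute.
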
These keyword-based approaches yield comparable accuracy scores
to the column-based methods while offering improved efficiency. However,
comparing keywords performs worse when tables contain data about
similar but not identical topics. To address this, we transform the keywords into embeddings
using Word2Vec and BERT so that we can analyse the semantics of the
words. The similarity score in this case is determined by the weighted cosine
similarity between the vectorized keywords. Furthermore, we evaluate
our models using the NDCG@10 evaluation metric, which assesses the
ranking of the top tables based on a labelled data lake we annotated. We
show that LDA combined with Word2Vec is the most efficient and accurate
model when tables contain sufficient natural language textual data.
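
For readers unfamiliar with the metric, NDCG@10 can be sketched as below. This is a minimal illustration under stated assumptions, not the evaluation code used in the thesis: it uses linear gain with a log2 rank discount, and it computes the ideal ordering from the relevance labels of the returned list only. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ranking.

    Assumption: the ideal ranking is the same labels sorted in
    descending order; returns 0.0 when no item is relevant.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the tables returned for one query, in ranked order.
perfect = ndcg_at_k([3, 2, 1, 0])
swapped = ndcg_at_k([0, 1])
```

A score of 1.0 means the returned tables are already in the best possible order; placing a relevant table lower in the ranking reduces the score.
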
In conclusion, our research presents a comprehensive exploration of table
discovery in data lakes, focusing on similarity-based approaches. We
provide insights into the efficiency and accuracy of various methods, emphasising
the use of keywords and embeddings for table comparison. Our
findings contribute to the broader field of data discovery and serve as a
foundation for future research in improving table discovery techniques.