TripLan2vec: Leveraging Pre-Trained Language Models for Inductive Triple Embeddings

Kisjes, Adriaan

View/Open

Thesis_Adriaan_Kisjes.pdf (1.582Mb)

Publication date

2023

Summary

Many organizations and data-dependent applications deal with the fact that data is often incomplete and siloed across multiple knowledge bases. The semantic web and knowledge graphs are powerful tools that mitigate this by allowing rule-based systems to complete and connect different knowledge bases. To enable the use of more advanced machine-learning algorithms such as logistic regression or neural networks, knowledge graphs need to be transformed into some kind of numeric input. In the field of natural language processing this has been solved with vector embeddings, where for each word a vector is learned that captures its semantic meaning. There exist many knowledge graph embedding techniques inspired by natural language processing, one of which is Triple2vec, where triples (two entities and the relation connecting them) are embedded as a whole. Triple2vec is innovative because it captures both the graph topology and the heterogeneity of knowledge graphs, where other methods often focus on just one of those aspects. This thesis proposes to build on triple embeddings by developing TripLan2vec: a triple embedding technique that uses a pre-trained language model to generate embeddings based on textual descriptions. This enriches the embeddings by capturing both graph structure and natural language semantics. Moreover, it enables the triple embeddings to be generated inductively with just a description as input. This means that triples that were not part of the training process can still be embedded, unlike with Triple2vec. During evaluation it was shown that TripLan2vec performs well at discriminating between true and false triples, and at predicting whether two triples are neighbours. In inductive evaluation, where just part of the training data was available, TripLan2vec outperforms most other methods.
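The inductive idea in the summary (embedding a triple from nothing but its textual description) can be illustrated with a minimal sketch. This is not the thesis's implementation. The `Triple` class, the `verbalize` template, and `toy_encoder` are hypothetical names introduced here, and `toy_encoder` is a hashed bag-of-words stand-in for a pre-trained language model, used only so the example runs without external dependencies.

```python
import hashlib
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Triple:
    """A knowledge-graph triple: two entities and the relation connecting them."""
    head: str
    relation: str
    tail: str


def verbalize(triple: Triple) -> str:
    """Turn a triple into a plain-text description (hypothetical template)."""
    return f"{triple.head} {triple.relation} {triple.tail}"


def toy_encoder(text: str, dim: int = 16) -> List[float]:
    """ASSUMPTION: stand-in for a pre-trained language model.

    Hashes each token into one of `dim` buckets and L2-normalises the counts.
    A real system would call a language model's text encoder here instead.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def embed_triple(
    triple: Triple,
    encoder: Callable[[str], List[float]] = toy_encoder,
) -> List[float]:
    """Embed a triple from its description alone, so unseen triples work too."""
    return encoder(verbalize(triple))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; a plain dot product because vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    seen = Triple("Utrecht University", "located in", "Utrecht")
    unseen = Triple("Utrecht", "located in", "Netherlands")
    print(cosine(embed_triple(seen), embed_triple(unseen)))
```

Because `embed_triple` depends only on the description and the encoder, no per-triple lookup table is needed. That is the property that lets triples outside the training data be embedded, in contrast to a transductive method that learns one vector per known triple.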

URI

https://studenttheses.uu.nl/handle/20.500.12932/43907

Collections

Theses