Vectorized UDFs in Column-Stores

Raasveldt, M.

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Philippi, Hans
dc.contributor.advisor	Mühleisen, Hannes
dc.contributor.author	Raasveldt, M.
dc.date.accessioned	2016-01-19T18:01:00Z
dc.date.available	2016-01-19T18:01:00Z
dc.date.issued	2016
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/21713
dc.description.abstract	An ever-increasing amount of data is gathered by companies, government entities and individuals. Analysing and making sense of that data is a crucial task, and one that is becoming more important as society shifts towards more data-driven decision making. The tools used by data scientists for ad-hoc data analysis are scripting languages, of which R and Python are the most popular. Scripting languages are flexible, easy to use and have a large existing code base for data analytics. Relational Database Management Systems (RDBMS) are the de facto standard for storing tabular data. RDBMS have numerous advantages when storing tabular data, among which are ACID properties, scalability, data validation and automatic parallelization. If the user wants to use data stored in an RDBMS in a scripting language, the data has to be transferred from the RDBMS to the scripting language. The standard solution is a loosely coupled approach between the scripting language and the database using an ODBC connector. To transfer the data to the scripting language, the data is exported from the database and copied, often over a network connection, and converted between the differing formats of the database and the scripting language. This loosely coupled approach has significant performance implications, especially when transferring data over the network. In addition, the lack of tight integration has led to data management features being re-implemented from scratch within scripting languages, by libraries such as Pandas and Dplyr. The main contribution of this thesis is research into how a scripting language can be tightly integrated into a columnar data management system. We present a new system, MonetDB/Python, which deeply integrates the Python scripting language into MonetDB, an open-source relational column store. By using this system, users can execute arbitrary Python functions as part of relational SQL queries inside the database process.
This significantly reduces the cost of data transfer and allows for automatic parallelization of scripting language functions. We show that our method is not only faster than current RDBMS connectors, but also faster than native storage solutions in Python. MonetDB/Python allows us to combine the scalability and power of an RDBMS with the flexibility of a scripting language without the drawback of slow transfer speed.
dc.description.sponsorship	Utrecht University
dc.format.extent	907897
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.title	Vectorized UDFs in Column-Stores
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	MonetDB,Python,User-Defined Functions,UDF,In-Database Processing
dc.subject.courseuu	Computing Science

Files in this item

Name: MasterThesis.pdf
Size: 886.6Kb
Format: PDF


This item appears in the following Collection(s)

Theses
