dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Philippi, Hans
dc.contributor.advisor: Mühleisen, Hannes
dc.contributor.author: Raasveldt, M.
dc.date.accessioned: 2016-01-19T18:01:00Z
dc.date.available: 2016-01-19T18:01:00Z
dc.date.issued: 2016
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/21713
dc.description.abstract: An ever-increasing amount of data is gathered by companies, government entities and individuals. Analysing and making sense of that data is a crucial task, and one that is growing in importance as society shifts towards more data-driven decision making. The tools data scientists use for ad-hoc data analysis are scripting languages, of which R and Python are the most popular. Scripting languages are flexible, easy to use and have a large existing code base for data analytics. Relational Database Management Systems (RDBMSs) are the de facto standard for storing tabular data. They offer numerous advantages for that purpose, including ACID properties, scalability, data validation and automatic parallelization. If a user wants to work with data stored in an RDBMS from a scripting language, the data has to be transferred from the RDBMS to the scripting language. The standard solution is a loosely coupled approach in which the scripting language talks to the database through an ODBC connector. To reach the scripting language, the data is exported from the database, copied (often over a network connection) and converted between the differing formats of the database and the scripting language. This loose coupling has significant performance implications, especially when data is transferred over the network. In addition, the lack of tight integration has led to data management features being re-implemented from scratch within scripting languages, by libraries such as Pandas and Dplyr. The main contribution of this thesis is an investigation of how a scripting language can be tightly integrated into a columnar data management system. We present a new system, MonetDB/Python, which deeply integrates the Python scripting language into MonetDB, an open-source relational column store. Using this system, users can execute arbitrary Python functions as part of relational SQL queries inside the database process. This significantly reduces the cost of data transfer and allows for automatic parallelization of scripting language functions. We show that our method is not only faster than current RDBMS connectors, but also faster than native storage solutions in Python. MonetDB/Python allows us to combine the scalability and power of an RDBMS with the flexibility of a scripting language without the drawback of slow transfer speed.
dc.description.sponsorship: Utrecht University
dc.format.extent: 907897
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.title: Vectorized UDFs in Column-Stores
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: MonetDB, Python, User-Defined Functions, UDF, In-Database Processing
dc.subject.courseuu: Computing Science

