Mapping Research Software Landscapes through Exploratory Studies of GitHub Data

Quach, Keven

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Lamprecht, Anna-Lena
dc.contributor.author	Quach, Keven
dc.date.accessioned	2022-11-08T00:00:40Z
dc.date.available	2022-11-08T00:00:40Z
dc.date.issued	2022
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/43162
dc.description.abstract	Research software enables data processing and plays a vital role in academia and industry. As such, it is essential that research software be findable, accessible, interoperable, and reusable (FAIR). However, what the research software landscape actually looks like is unknown. We therefore aim to understand this landscape better and to use that understanding to infer actionable recommendations for Research Software Engineer (RSE) practice. This study provides insights into the research software landscape at Utrecht University through an exploratory analysis that also considers the different scientific domains. We achieve this by collecting GitHub data and analyzing repository FAIRness and characteristics through heatmaps, histograms, statistical tables, and tests. Our method retrieved 176 users with 1521 repositories, of which 823 are considered research software. The method is designed to be reproducible and reusable, so others can adopt it to gain insights into their own organization. The analysis showed significant differences between faculties in their characteristics and in how the application of FAIR variables can be supported. Among other things, our results showed that Geosciences has the highest percentage of unlicensed repositories, at 57%. Social Sciences is also an outlier in language usage: it is the only faculty that primarily uses R, while the other faculties primarily use Python. We also develop a first classification model that identifies research software with 70% accuracy and can be used for future labelling tasks. Our recommendations include expanding the R café, creating FAIR reference documents, featuring and highlighting high-impact and FAIR research software, and creating yearly reports. We conclude that our labelled GitHub dataset allows us to infer actionable recommendations for RSE practice.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	Exploratory data analysis of GitHub data in the context of FAIRness and research domains
dc.title	Mapping Research Software Landscapes through Exploratory Studies of GitHub Data
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	exploratory data analysis; FAIR; research software; GitHub
dc.subject.courseuu	Business Informatics
dc.thesis.id	11858

Files in this item

Name: Master_thesis.pdf
Size: 2.279Mb
Format: PDF

This item appears in the following Collection(s)

Theses
