Mapping Research Software Landscapes through Exploratory Studies of GitHub Data
MetadataShow full item record
Research software enables data processing and plays a vital role in academia and industry. As such, it is essential to have findable, accessible, interoperable, and reusable (FAIR) research software. However, what precisely the landscape of research software looks like is unknown. Thus, we would like to understand the research software landscape better and utilize this information to infer actionable recommendations for the Research Software Engineer (RSE) practice. This study provides insights into the research software landscape at Utrecht University through an exploratory analysis while also considering the different scientific domains. We achieve this by collecting GitHub data and analyzing repository FAIRness and characteristics through heatmaps, histograms, statistical tables, and tests. Our method retrieved 176 users with 1521 repositories, of which 823 are considered research software. Others can adopt the proposed method to gain insights into their specific organization, as it is designed to be reproducible and reusable. The analysis showed significant differences between faculty characteristics and how to support the application of FAIR variables. Among other things, our results showed that Geosciences have the highest percentage of unlicensed repositories with 57%. Also, Social Sciences are an outlier in language usage, as they are the only faculty to primarily use R, while other faculties primarily use Python. A first classification model is developed that achieves 70% accuracy in identifying research software that can be used for future labelling tasks. Our recommendations include expanding the R café, creating FAIR reference documents, featuring and highlighting high impact and FAIR research software, and creating yearly reports. We conclude that our labelled GitHub dataset allows us to infer actionable recommendations on RSE practice.