Towards Automated Literature Based Innovation Output (ALBIO) Indicators

Raedts, Cas

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Frenken, Koen
dc.contributor.author	Raedts, Cas
dc.date.accessioned	2025-09-22T23:02:03Z
dc.date.available	2025-09-22T23:02:03Z
dc.date.issued	2025
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/50426
dc.description.abstract	This thesis develops and evaluates Automated Literature-Based Innovation Output (ALBIO), a modular pipeline that runs local large language models (LLMs) to extract, structure, enrich, and deduplicate innovation records from unstructured text. The study asks: How can LLMs automate and enhance literature-based innovation output (LBIO) measurement while preserving data control and measurement quality? ALBIO is applied to 6,568 Dutch agricultural trade-journal articles (2015–2025) and is informally compared with SWINNO. The pipeline combines schema-validated JSON generation for innovation extraction, an LLM-as-judge validation step, targeted (but currently generic) web enrichment, and hybrid duplicate resolution (lexical + semantic). A held-out ground truth is used for evaluation. Results show high precision (~81%) and moderate recall (~56%) on innovation identification (F1 ≈ 0.67), positioning ALBIO as a precision-first complement to manual LBIO rather than a full substitute at present. A ~14% discovery rate indicates that ALBIO surfaces valid innovations initially missed by human coders. Substantively, the approach counters biases of traditional innovation output indicators, such as technology bias and dependence on respondents. Furthermore, it broadens coverage beyond patents, surveys, and traditional LBIO by explicitly recording process and service innovations and by classifying innovator types beyond incumbent firms. A comparison with SWINNO shows broadly similar output patterns but flags overconfidence where evidence is scarce (for example, much lower “unknown” rates for novelty-to-market), underscoring the need for cautious interpretation and better enrichment. The main bottleneck is the availability and clarity of information in source texts (and web data), not model capacity; variables that require external enrichment (e.g., location, industry, size, finance) are most affected.
A manual audit suggests that deduplication is acceptable overall but degrades for larger clusters, where drift effects appear. This thesis contributes a transparent, local LLM pipeline for LBIO, empirical evidence on precision-first scalability with discovery benefits, and a roadmap to next-generation indicators. Priorities for future work include structured enrichment (registries, patent databases, sector statistics), multi-label/probabilistic and temporal/relational representations (e.g., knowledge graphs), and more robust, auditable deduplication. Together, these advances can move automated, text-based indicators beyond counting toward richer characterisation of innovation.
dc.description.sponsorship	Utrecht University
dc.language.iso	EN
dc.subject	This thesis explores the possibilities of using local LLMs to automatically extract innovation output indicators from trade journal texts.
dc.title	Towards Automated Literature Based Innovation Output (ALBIO) Indicators
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Innovation Output; Local LLM application; Literature Based Innovation Output (LBIO); Automated Literature Based Innovation Output (ALBIO)
dc.subject.courseuu	Innovation Sciences
dc.thesis.id	54116

Files in this item

Name: Thesis.final.pdf
Size: 5.025 MB
Format: PDF

This item appears in the following Collection(s)

Theses
