Towards Automated Literature-Based Innovation Output (ALBIO) Indicators
Summary
This thesis develops and evaluates Automated Literature-Based Innovation Output (ALBIO), a modular pipeline that runs large language models (LLMs) locally to extract, structure, enrich, and deduplicate innovation records from unstructured text. The research question is: how can LLMs automate and enhance literature-based innovation output (LBIO) measurement while preserving data control and measurement quality? ALBIO is applied to 6,568 Dutch agricultural trade-journal articles (2015–2025) and is informally compared with SWINNO. The pipeline combines schema-validated JSON generation for innovation extraction, an LLM-as-judge validation step, targeted (but currently generic) web enrichment, and hybrid duplicate resolution (lexical + semantic). A held-out ground truth is used for evaluation. Results show high precision (~81%) and moderate recall (~56%) on innovation identification (F1 ≈ 0.67), positioning ALBIO as a precision-first complement to manual LBIO rather than a full substitute at present. A ~14% discovery rate indicates that ALBIO surfaces valid innovations initially missed by human coders. Substantively, the approach counters biases of traditional innovation output indicators, such as technology bias and dependency on respondents. It also broadens coverage beyond patents, surveys, and traditional LBIO by explicitly recording process and service innovations and by classifying innovator types beyond incumbent firms. A comparison with SWINNO shows broadly similar output patterns but flags overconfidence where evidence is scarce (for example, much lower “unknown” rates for novelty-to-market), underscoring the need for cautious interpretation and better enrichment. The main bottleneck is the availability and clarity of information in source texts (and web data), not model capacity; variables that require external enrichment (e.g., location, industry, size, finance) are most affected. A manual audit suggests that deduplication is acceptable overall but degrades for larger clusters, where drift effects appear.
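To make the idea of hybrid (lexical + semantic) duplicate resolution concrete, the following is a minimal sketch and not the thesis's actual implementation. It assumes that each innovation record has a name and a precomputed embedding vector; the function names, the equal weighting, the 0.8 threshold, and the toy two-dimensional vectors are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from math import sqrt


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two lightly normalised names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def hybrid_score(name_a: str, name_b: str,
                 emb_a: list[float], emb_b: list[float],
                 weight_lexical: float = 0.5) -> float:
    """Weighted mix of lexical and semantic similarity (weight is an assumption)."""
    lex = lexical_similarity(name_a, name_b)
    sem = cosine_similarity(emb_a, emb_b)
    return weight_lexical * lex + (1.0 - weight_lexical) * sem


def is_duplicate(name_a: str, name_b: str,
                 emb_a: list[float], emb_b: list[float],
                 threshold: float = 0.8) -> bool:
    """Flag a pair as a likely duplicate when the hybrid score reaches the threshold."""
    return hybrid_score(name_a, name_b, emb_a, emb_b) >= threshold


# Toy vectors stand in for real sentence embeddings (hypothetical values).
same = is_duplicate("Milking Robot X", "milking robot x", [1.0, 0.0], [1.0, 0.0])
different = is_duplicate("Milking Robot X", "Drone sprayer", [1.0, 0.0], [0.0, 1.0])
```

In this sketch, pairwise scores would still need to be grouped into clusters, and that grouping step is where drift in larger clusters can arise: chaining many pairwise matches can link records that are only indirectly similar.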
This thesis contributes a transparent, local LLM pipeline for LBIO, empirical evidence on precision-first scalability with discovery benefits, and a roadmap to next-generation indicators. Priorities for future work include structured enrichment (registries, patent databases, sector statistics), multi-label/probabilistic and temporal/relational representations (e.g., knowledge graphs), and more robust, auditable deduplication. Together, these advances can move automated, text-based indicators beyond counting toward richer characterisation of innovation.