Large foundation models to investigate entrepreneurial pitch videos
Summary
This thesis explores how foundation models with multimodal capabilities, integrating visual, auditory, and textual data, can be applied to predict investment outcomes from the Q&A sessions of startup pitch videos. Unlike prior research, which has focused mainly on pitch content, this study examines the interactive social dynamics that influence investor decision-making, such as body language, vocal tone, and response quality.
Using a dataset from the Jheronimus Academy of Data Science (JADS) that includes both offline and online presentations, the research evaluates unimodal (single-modality) and multimodal fusion approaches, comparing different integration techniques (early, late, and neural fusion).
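To make the distinction between these integration techniques concrete, the minimal sketch below contrasts early fusion (concatenating per-modality features before training a single classifier) with late fusion (training one classifier per modality and averaging their predicted probabilities). All feature names, dimensions, base learners, and the synthetic data are illustrative assumptions, not the pipeline used in the thesis; neural fusion, which learns a joint representation, is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-modality feature matrices for the same set of pitches
# (names and dimensions are placeholders, not taken from the thesis).
rng = np.random.default_rng(0)
n_pitches = 200
X_visual = rng.normal(size=(n_pitches, 32))   # e.g. body-language embeddings
X_audio = rng.normal(size=(n_pitches, 16))    # e.g. vocal-tone features
X_text = rng.normal(size=(n_pitches, 64))     # e.g. transcript embeddings
y = rng.integers(0, 2, size=n_pitches)        # 1 = invested, 0 = not invested

# Early fusion: concatenate all modality features, then train one classifier.
X_early = np.hstack([X_visual, X_audio, X_text])
early_clf = LogisticRegression(max_iter=1000).fit(X_early, y)

# Late fusion: train one classifier per modality and average their
# predicted probabilities at decision time.
modalities = (X_visual, X_audio, X_text)
clfs = [LogisticRegression(max_iter=1000).fit(X, y) for X in modalities]
proba_late = np.mean(
    [clf.predict_proba(X)[:, 1] for clf, X in zip(clfs, modalities)], axis=0)
late_pred = (proba_late >= 0.5).astype(int)
```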
The results show that multimodal models significantly outperform unimodal ones, with the XGBoost dual-fusion model achieving an F1-score of 0.789 and demonstrating strong cross-domain generalization to online settings (F1 = 0.848). These findings highlight the potential of multimodal AI systems for real-time behavioral and social signal analysis, offering valuable insights for entrepreneurial decision-making and AI-assisted investment evaluation in both academic and practical contexts.
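As an illustration of how such a fused model might be trained and scored, the sketch below fits an XGBoost classifier on a fused feature matrix and reports the F1-score on a held-out split. The data, split, and hyperparameters are placeholders chosen for the example; they do not reproduce the thesis's dual-fusion setup or its reported 0.789 and 0.848 results.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Placeholder fused feature matrix (e.g. the early-fused features from the
# previous sketch); real features would come from visual, audio, and text
# extractors applied to the Q&A recordings.
rng = np.random.default_rng(1)
X_fused = rng.normal(size=(200, 112))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_fused, y, test_size=0.25, random_state=42, stratify=y)

# Gradient-boosted trees on the fused representation; hyperparameters are
# illustrative defaults, not those tuned in the thesis.
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      eval_metric="logloss")
model.fit(X_train, y_train)

print("F1-score:", f1_score(y_test, model.predict(X_test)))
```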