Investment prediction on entrepreneurial pitch videos using large vision and language models
Summary
This research explores the potential of predicting investment decisions in entrepreneurial pitches from multimodal signals, particularly visual and linguistic features. Entrepreneurial pitches are critical for securing funding, and signals from both verbal content and non-verbal cues, such as gestures and facial expressions, play a crucial role in shaping investors' decisions. However, existing studies have largely focused on isolated forms of signaling, leaving a gap in understanding how multimodal features interact to influence investment outcomes. This study proposes a machine-learning approach that leverages visual and linguistic cues from pitch videos to predict the likelihood of investment. Using the "Data Management Entrepreneurial Pitches" dataset, the research addresses several key questions, including the efficacy of visual and linguistic unimodal models, the benefits of combining modalities into a unified linguistic space, and the performance of multimodal fusion models. To this end, a series of neural network models is designed and tested, drawing on advanced techniques in Natural Language Processing (NLP) and Computer Vision, such as BERT, MEGA, VideoMAE, and VideoLLaVA. The thesis investigates the effectiveness of visual and linguistic multimodal models in predicting the probability of entrepreneurial investment and compares the performance of unimodal models with that of multimodal fusion models.
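To make the multimodal fusion idea concrete, the sketch below shows one common way such a model can be assembled: a pretrained text encoder (here BERT) embeds the pitch transcript, a pretrained video encoder (here VideoMAE) embeds the sampled video frames, and the two embeddings are concatenated and passed to a small classification head that outputs an investment probability. This is a minimal, illustrative late-fusion example under assumed model checkpoints, pooling choices, and head dimensions; it is not the exact architecture developed in the thesis.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel, VideoMAEModel

class LateFusionInvestmentModel(nn.Module):
    """Illustrative late-fusion model: BERT transcript embedding + VideoMAE clip embedding."""

    def __init__(self,
                 text_model="bert-base-uncased",          # assumed checkpoint
                 video_model="MCG-NJU/videomae-base"):    # assumed checkpoint
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(text_model)
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.video_encoder = VideoMAEModel.from_pretrained(video_model)
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.video_encoder.config.hidden_size)
        # Small classification head over the concatenated embeddings (assumed sizes).
        self.head = nn.Sequential(nn.Linear(fused_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, transcripts, pixel_values):
        # transcripts: list of pitch transcripts (strings)
        # pixel_values: video frames, shape (batch, num_frames, channels, height, width)
        tokens = self.tokenizer(transcripts, return_tensors="pt",
                                truncation=True, padding=True)
        text_emb = self.text_encoder(**tokens).last_hidden_state[:, 0]          # [CLS] token
        video_emb = self.video_encoder(pixel_values=pixel_values).last_hidden_state.mean(dim=1)  # mean-pooled patches
        fused = torch.cat([text_emb, video_emb], dim=-1)
        return torch.sigmoid(self.head(fused))  # predicted probability of investment

A unimodal baseline corresponds to using only one of the two encoders before the head, which is the comparison the thesis draws between unimodal and fusion models.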