Data Mining for Tweet Sentiment Classification
Summary
The goal of this master's thesis is to classify short Twitter messages by sentiment using data mining techniques. Twitter messages, or tweets, are limited to 140 characters. This limit makes it harder for people to express their sentiment and, as a consequence, makes the sentiment harder to classify. Sentiment can be of two types: emotions and opinions. This research focuses solely on opinions, which are divided into three classes: positive, neutral and negative. An algorithm then assigns each tweet to one of these three classes.
Well-known supervised learning algorithms, such as support vector machines and naive Bayes, are used to create a prediction model. Before the model can be created, the data has to be pre-processed from text into fixed-length feature vectors. The features consist of sentiment words and frequently occurring words that are predictive of the sentiment. The learned model is then applied to a test set to validate it.
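The pre-processing step described above can be illustrated with a minimal sketch. It assumes a small hypothetical sentiment lexicon (`SENTIMENT_WORDS`), a simple whitespace tokenizer, and selection of the additional words by raw frequency only; the thesis itself selects frequent words that are predictive of the sentiment, which this sketch does not attempt to reproduce. All function names are illustrative.

```python
from collections import Counter

# Hypothetical toy lexicon; the actual sentiment words used in the thesis are not listed here.
SENTIMENT_WORDS = ["good", "great", "love", "bad", "awful", "hate"]


def tokenize(tweet):
    """Lowercase a tweet, split on whitespace, and strip surrounding punctuation."""
    stripped = (tok.strip(".,!?:;\"'()") for tok in tweet.lower().split())
    return [tok for tok in stripped if tok]


def build_vocabulary(train_tweets, n_frequent=3):
    """Return the sentiment words followed by the most frequent other training words."""
    counts = Counter(tok for tweet in train_tweets for tok in tokenize(tweet))
    frequent = [w for w, _ in counts.most_common() if w not in SENTIMENT_WORDS]
    return SENTIMENT_WORDS + frequent[:n_frequent]


def to_feature_vector(tweet, vocabulary):
    """Map a tweet to a fixed-length vector of word counts, one entry per vocabulary word."""
    counts = Counter(tokenize(tweet))
    return [counts[word] for word in vocabulary]


train_tweets = ["I love this phone", "I hate this phone", "this phone is great"]
vocabulary = build_vocabulary(train_tweets)
vector = to_feature_vector("I love love this!", vocabulary)
```

Every tweet maps to a vector of the same length, `len(vocabulary)`, regardless of how many words it contains. Vectors of this form are what a classifier such as a support vector machine or naive Bayes would take as input.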
This research uses two datasets: one from Sananalytics and a self-built Twitter dataset.
When the models were applied and tested in-sample, the results were quite acceptable. Out-of-sample, with cross-validation, however, the results were disappointing. The sparsity of the dataset seems to be the cause: the features in the training set do not cover the data in the test set well.
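The coverage problem can be made concrete with a small sketch. For each cross-validation fold, it builds the vocabulary from the training tweets only and measures the share of test tweets that contain at least one known word. The contiguous fold split, the tokenizer, and the toy tweets are assumptions for illustration; they are not the thesis's data or evaluation code.

```python
def tokenize(tweet):
    """Lowercase a tweet, split on whitespace, and strip surrounding punctuation."""
    stripped = (tok.strip(".,!?:;\"'()") for tok in tweet.lower().split())
    return [tok for tok in stripped if tok]


def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds; assumes k <= n_items."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


def feature_coverage(tweets, k=3):
    """Per fold, return the share of test tweets with at least one word seen in training."""
    coverages = []
    for train_idx, test_idx in k_fold_indices(len(tweets), k):
        vocabulary = {tok for i in train_idx for tok in tokenize(tweets[i])}
        covered = sum(
            1 for i in test_idx
            if any(tok in vocabulary for tok in tokenize(tweets[i]))
        )
        coverages.append(covered / len(test_idx))
    return coverages


# Toy data: "meh" and "lol" each occur in one tweet only, so they are never covered.
toy_tweets = ["great phone", "awful service", "great service", "meh", "phone awful", "lol"]
coverage_per_fold = feature_coverage(toy_tweets, k=3)
```

A test tweet with no known words maps to an all-zero feature vector, so the classifier has nothing to base its prediction on. In-sample evaluation hides this effect, because every tweet contributes its own words to the vocabulary.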