
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Feelders, A.J.
dc.contributor.advisor: Siebes, A.P.J.M.
dc.contributor.advisor: Hoogstrate, H.
dc.contributor.author: Man, J.T.K.
dc.date.accessioned: 2016-10-18T17:00:34Z
dc.date.available: 2016-10-18T17:00:34Z
dc.date.issued: 2016
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/24605
dc.description.abstract: As video consumption rises and more households use set-top boxes, cable operators can unobtrusively collect viewing data from households. Advertisers can in turn use this data to predict household characteristics and deliver targeted advertisements. This thesis examines whether household composition can be predicted from program-level television viewing data. Household structures are split into eight categories: family (F), family with children (FC), household (H), household with children (HC), single female (SF), single male (SM), single female parent (SPF) and single male parent (SPM). Three machine learning algorithms are used for this task: regularized multinomial logistic regression, stochastic gradient boosting, and support vector machines with a linear, polynomial or radial basis kernel. The models are trained on the Nielsen National People Meter (NPPM) data set. Because households in this data set are continuously added and removed after a fixed period, we pre-process them with a procedure called unification. For each algorithm we examine three models, each selected by a different performance measure using 5-fold cross-validation: accuracy, the Kappa statistic (Kappa) and the area under the curve (AUC). The stochastic gradient boosting model performs best, with an accuracy of 0.488 on the validation set; the resampling distributions of the models and statistical tests support this result. In addition, we use a regularized multinomial logistic regression model and a stochastic gradient boosting model to perform dimension reduction prior to training the support vector machines, which improves the support vector machines slightly. We also create a cascading classification model consisting of two models, each trained on a different aspect of the target variable. In the first stage, the first model classifies whether a child is present and the second model classifies the number of adults in the household, using the following categories: family (F), household (H), single female (SF) and single male (SM). The resulting cascading model has a validation accuracy of 0.479, slightly lower than that of the stochastic gradient boosting model, even though both use the same machine learning algorithms.
dc.description.sponsorship: Utrecht University
dc.format.extent: 5989731
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.title: Statistical Learning of Household Composition from Television Viewing Behaviour
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: Household; Composition; Prediction; Television; Viewing; Behaviour; SVM; Regularization; Stochastic; Gradient; Boosting; USA; Nielsen; Big Data
dc.subject.courseuu: Computing Science
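
The cascading model described in the abstract can be illustrated with a minimal Python sketch. It assumes that the two models' outputs are merged by mapping each adult category to its "with children" counterpart when a child is predicted; the abstract names the categories but does not state the merge rule, so this rule is an assumption. The names combine, cascade_predict, accuracy and Features are hypothetical, and the two models are passed in as plain callables rather than the trained classifiers used in the thesis.

```python
from typing import Callable, Dict, Sequence

# Hypothetical feature representation: viewing statistics keyed by name.
Features = Dict[str, float]

# Adult-only label -> label used when a child is present.
# Label names come from the abstract; the merge rule itself is an assumption.
WITH_CHILD = {"F": "FC", "H": "HC", "SF": "SPF", "SM": "SPM"}


def combine(child_present: bool, adults: str) -> str:
    """Merge the two model outputs into one of the eight household categories."""
    if adults not in WITH_CHILD:
        raise ValueError(f"unknown adult category: {adults!r}")
    return WITH_CHILD[adults] if child_present else adults


def cascade_predict(
    features: Features,
    child_model: Callable[[Features], bool],
    adult_model: Callable[[Features], str],
) -> str:
    """Run both models on the same features and merge their predictions."""
    return combine(child_model(features), adult_model(features))


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of households whose predicted category matches the true one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

Under this assumed rule, a household predicted as single female (SF) with a child present is labelled SPF, while the same adult prediction without a child stays SF. Because an error in either model yields a wrong final label, a cascade of this kind does not necessarily beat a single eight-class model.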

