Statistical Learning of Household Composition from Television Viewing Behaviour

Man, J.T.K.

dc.rights.license	CC-BY-NC-ND
dc.contributor.advisor	Feelders, A.J.
dc.contributor.advisor	Siebes, A.P.J.M.
dc.contributor.advisor	Hoogstrate, H.
dc.contributor.author	Man, J.T.K.
dc.date.accessioned	2016-10-18T17:00:34Z
dc.date.available	2016-10-18T17:00:34Z
dc.date.issued	2016
dc.identifier.uri	https://studenttheses.uu.nl/handle/20.500.12932/24605
dc.description.abstract	As video consumption rises and more households use set-top boxes, cable operators are able to collect data from households unobtrusively. Advertisers can in turn use this data to predict household characteristics and deliver targeted advertisements. This thesis examines whether household composition can be predicted from program-level television viewing data. The household structures are split into eight categories: family (F), family with children (FC), household (H), household with children (HC), single female (SF), single male (SM), single female parent (SPF) and single male parent (SPM). Three machine learning algorithms are used to perform this task: regularized multinomial logistic regression, stochastic gradient boosting, and support vector machines with a linear, polynomial or radial basis kernel. The models are trained on the Nielsen National People Meter (NPPM) data set. Because households are continuously added to this data set and removed after a fixed period, we pre-process them with a procedure called unification. For each algorithm we examine three models, each selected by a different performance measure using 5-fold cross-validation. The three performance measures are accuracy, the Kappa statistic (Kappa) and the area under the curve (AUC). We observe that the stochastic gradient boosting model performs best, with an accuracy of 0.488 on the validation set. This is also supported by the resampling distributions of the models and by statistical tests. In addition, we use a regularized multinomial logistic regression model and a stochastic gradient boosting model to perform dimension reduction prior to training the support vector machines, which slightly improves their performance. We also create a cascading classification model consisting of two models, each trained on a different aspect of the target variable. In the first stage, the first model classifies whether a child is present; the second model then classifies the adult composition of the household, which falls into one of the following categories: family (F), household (H), single female (SF) and single male (SM). The resulting cascading model has a validation accuracy of 0.479, which is slightly lower than that of the stochastic gradient boosting model, even though both use the same machine learning algorithms.
dc.description.sponsorship	Utrecht University
dc.format.extent	5989731
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.title	Statistical Learning of Household Composition from Television Viewing Behaviour
dc.type.content	Master Thesis
dc.rights.accessrights	Open Access
dc.subject.keywords	Household; Composition; Prediction; Television; Viewing; Behaviour; SVM; Regularization; Stochastic; Gradient; Boosting; USA; Nielsen; Big Data
dc.subject.courseuu	Computing Science
