
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Broersen, Jan
dc.contributor.author: Rouwmaat, Coen
dc.date.accessioned: 2023-07-20T00:01:58Z
dc.date.available: 2023-07-20T00:01:58Z
dc.date.issued: 2023
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/44216
dc.description.abstract: This thesis examines the problem of AI alignment and specific instances of misalignment. Current and future problems are discussed to stress the increasing importance of alignment, and both reward misspecification and goal misgeneralisation are discussed as difficulties in aligning an agent's behaviour with the intended objective of its designer. Original research is conducted by eliciting and studying properties of goal misgeneralisation in a novel collection of toy environments. Furthermore, rule induction algorithms are implemented as an interpretability tool to generate multiple different explanations of an agent's behaviour, which can aid in detecting goal misgeneralisation.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This thesis examines the alignment problem of goal misgeneralisation by designing toy environments to elicit and study this phenomenon, and implements rule induction algorithms to generate rule-based explanations for this behaviour.
dc.title: Detecting and Mitigating Goal Misgeneralisation with Logical Interpretability Tools
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: AI Safety; Alignment; Interpretability; Misalignment; Goal Misgeneralisation; Logical Induction Algorithms
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 19494

