
dc.rights.license: CC-BY-NC-ND
dc.contributor.advisor: Broersen, Jan
dc.contributor.author: Rouwmaat, Coen
dc.date.accessioned: 2023-07-20T00:01:58Z
dc.date.available: 2023-07-20T00:01:58Z
dc.date.issued: 2023
dc.identifier.uri: https://studenttheses.uu.nl/handle/20.500.12932/44216
dc.description.abstract: This thesis examines the problem of AI alignment and specific instances of misalignment. Current and future problems are discussed to stress the increasing importance of alignment, and both reward misspecification and goal misgeneralisation are discussed as difficulties in aligning an agent's behaviour with the intended objective of its designer. Original research is conducted by eliciting and studying properties of goal misgeneralisation in a novel collection of toy environments. Furthermore, rule induction algorithms are implemented as an interpretability tool to generate multiple different explanations of an agent's behaviour, which can aid in detecting goal misgeneralisation.
dc.description.sponsorship: Utrecht University
dc.language.iso: EN
dc.subject: This thesis examines the alignment problem of goal misgeneralisation by designing toy environments to elicit and study this phenomenon, and implements rule induction algorithms to generate rule-based explanations for this behaviour.
dc.title: Detecting and Mitigating Goal Misgeneralisation with Logical Interpretability Tools
dc.type.content: Master Thesis
dc.rights.accessrights: Open Access
dc.subject.keywords: AI Safety; Alignment; Interpretability; Misalignment; Goal Misgeneralisation; Logical Induction Algorithms
dc.subject.courseuu: Artificial Intelligence
dc.thesis.id: 19494

