
dc.rights.license  CC-BY-NC-ND
dc.contributor.advisor  Paperno, Denis
dc.contributor.author  Kirch, Nathalie
dc.date.accessioned  2024-08-07T23:02:05Z
dc.date.available  2024-08-07T23:02:05Z
dc.date.issued  2024
dc.identifier.uri  https://studenttheses.uu.nl/handle/20.500.12932/47119
dc.description.abstract  This work is a contribution to the field of Machine Ethics (ME) benchmarking, in which tests are developed to measure whether intelligent systems have accurate representations of human values and whether they reliably act in accordance with these values. We identify four issues with current ME benchmarks. Firstly, their ecological validity is limited because the ethical dilemmas they include are insufficiently realistic. Secondly, their question-answer pairs are often generated in an unstructured manner, without clear inclusion and exclusion criteria. Thirdly, the benchmarks are often not scalable and rely too heavily on human annotation. Lastly, they do not include sufficient syntactic variation, which limits the robustness of their findings. To address these issues, we develop two novel ME benchmarks: the Triage Benchmark and the Medical Law (MedLaw) Benchmark, both of which include real-world ethical dilemmas from the medical context that have, to some extent, been resolved through rules and regulations. The MedLaw Benchmark was entirely AI-generated and thus constitutes a scalable alternative to previous methods. We add multiple context perturbations to the question sets of our benchmarks, which allows us to include models’ approximate worst-case performance in our evaluations. With these novel aspects of our benchmarks, we test hypotheses that have been proposed on the basis of previous ME test results. Our first finding is that ethics prompting does not always positively affect ethical decision-making. Further, we find that context perturbations not only substantially reduce the performance of our models, but also change their relative performance and sometimes even reverse their error patterns. Lastly, when comparing the approximate worst-case performance of models, we find that general capability is not always a good predictor of good ethical decision-making.
We argue that, given the safety focus of ME benchmarks, it is pivotal to develop them in such a way that they approximate the real-world and worst-case performance of the models under scrutiny.
dc.description.sponsorship  Utrecht University
dc.language.iso  EN
dc.subject  We develop two novel ME benchmarks: the Triage Benchmark and the MedLaw Benchmark, both of which include real-world ethical dilemmas from the medical context and do not rely on human annotators for gold-standard solutions. We test the extent to which SOTA LLMs accurately represent human values and whether they reliably act in accordance with these values. We add multiple context perturbations to the question sets of our benchmarks.
dc.title  Ethical Benchmarking in Large Language Models
dc.type.content  Master Thesis
dc.rights.accessrights  Open Access
dc.subject.keywords  Intelligent Systems; Large Language Models; Machine Ethics Benchmarking; Human Values Representation; Ethical Decision-Making; Ethical Dilemmas; Scalable Benchmarking; AI-Generated Benchmarks; Ethics Prompting; Jailbreaking; Jailbreaks; Machine Ethics
dc.subject.courseuu  Artificial Intelligence
dc.thesis.id  36213
