
dc.rights.license  CC-BY-NC-ND
dc.contributor.advisor  Paperno, Denis
dc.contributor.author  Kirch, Nathalie
dc.date.accessioned  2024-08-07T23:02:05Z
dc.date.available  2024-08-07T23:02:05Z
dc.date.issued  2024
dc.identifier.uri  https://studenttheses.uu.nl/handle/20.500.12932/47119
dc.description.abstract  This work is a contribution to the field of Machine Ethics (ME) benchmarking, in which tests are developed to measure whether intelligent systems have accurate representations of human values and whether they reliably act in accordance with these values. We identify four issues with current ME benchmarks. Firstly, their ecological validity is limited because the ethical dilemmas they include are insufficiently realistic. Secondly, their question-answer pairs are often generated in an unstructured manner, without clear inclusion and exclusion criteria. Thirdly, the benchmarks are often not scalable and rely too heavily on human annotation. Lastly, they do not include sufficient syntactic variation, which limits the robustness of their findings. To address these issues, we develop two novel ME benchmarks: the Triage Benchmark and the Medical Law (MedLaw) Benchmark, both of which include real-world ethical dilemmas from the medical context that have, to some extent, been resolved through rules and regulations. The MedLaw Benchmark was entirely AI-generated and thus constitutes a scalable alternative to previous methods. We add multiple context perturbations to the question sets of our benchmarks, which allows us to include models’ approximate worst-case performance in our evaluations. With these novel aspects of our benchmarks, we test hypotheses that have been proposed on the basis of previous ME test results. Our first finding is that ethics prompting does not always positively affect ethical decision-making. Further, we find that context perturbations not only substantially reduce the performance of our models, but also change their relative performance and sometimes even reverse their error patterns. Lastly, when comparing the approximate worst-case performance of models, we find that general capability is not always a good predictor of good ethical decision-making.
We argue that, given the safety focus of ME benchmarks, it is pivotal to develop them in such a way that they approximate the real-world and worst-case performance of the models under scrutiny.
dc.description.sponsorship  Utrecht University
dc.language.iso  EN
dc.subject  We develop two novel ME benchmarks: the Triage Benchmark and the MedLaw Benchmark, both of which include real-world ethical dilemmas from the medical context and do not rely on human annotators for gold-standard solutions. We test the extent to which SOTA LLMs accurately represent human values and whether they reliably act in accordance with these values. We add multiple context perturbations to the question sets of our benchmarks.
dc.title  Ethical Benchmarking in Large Language Models
dc.type.content  Master Thesis
dc.rights.accessrights  Open Access
dc.subject.keywords  Intelligent Systems; Large Language Models; Machine Ethics Benchmarking; Human Values Representation; Ethical Decision-Making; Ethical Dilemmas; Scalable Benchmarking; AI-Generated Benchmarks; Ethics Prompting; Jailbreaking; Jailbreaks; Machine Ethics
dc.subject.courseuu  Artificial Intelligence
dc.thesis.id  36213
