Ethical Benchmarking in Large Language Models
Summary
This work contributes to the field of Machine Ethics (ME) benchmarking, in which tests are developed to measure whether intelligent systems hold accurate representations of human values and reliably act in accordance with them. We identify four issues with current ME benchmarks. First, their ecological validity is limited by the insufficient realism of the included ethical dilemmas. Second, the question-answer pairs are often generated in an unstructured manner, without clear inclusion and exclusion criteria. Third, benchmarks are often not scalable because they rely too heavily on human annotation. Fourth, benchmarks do not include sufficient syntactic variation, which limits the robustness of findings. To address these issues, we develop two novel ME benchmarks, the Triage Benchmark and the Medical Law (MedLaw) Benchmark, both of which draw on real-world ethical dilemmas from the medical domain that have, to some extent, been resolved through rules and regulations. The MedLaw Benchmark was entirely AI-generated and thus constitutes a scalable alternative to previous methods. We add multiple context perturbations to the questions in our benchmarks, which allows us to include models' approximate worst-case performance in our evaluations. With these novel features, we test hypotheses that have been proposed on the basis of previous ME test results. Our first finding is that ethics prompting does not always positively affect ethical decision-making. Further, we find that context perturbations not only substantially reduce model performance but also change models' relative performance and sometimes even reverse their error patterns. Lastly, when comparing the approximate worst-case performance of models, we find that general capability is not always a good predictor of good ethical decision-making. We argue that, given the safety focus of ME benchmarks, it is pivotal to develop them so that they approximate the real-world and worst-case performance of the models under scrutiny.
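To make the worst-case evaluation concrete, the sketch below shows one plausible way to score a model under context perturbations: a question counts as correct only if the model answers every perturbed variant of it correctly, so the aggregate approximates worst-case performance. This is a minimal illustration, not the thesis's actual implementation; the item structure and the model.answer interface are assumptions introduced here for clarity.

    from statistics import mean

    def approx_worst_case_accuracy(model, items):
        """Score each question by its worst result across all context-perturbed
        variants, then average over questions.

        `items` is assumed (hypothetically; the thesis does not specify its data
        format) to be a list of dicts, each with a gold "answer" and a list of
        perturbed question texts under "variants".
        """
        per_item = []
        for item in items:
            # 1.0 if a perturbed variant is answered correctly, else 0.0
            scores = [
                1.0 if model.answer(variant) == item["answer"] else 0.0
                for variant in item["variants"]
            ]
            # Taking the minimum over variants yields the worst case
            # for this question under the applied perturbations.
            per_item.append(min(scores))
        return mean(per_item)

Averaging the per-question minima rather than all variant scores is what distinguishes this worst-case estimate from ordinary accuracy: a model that fails even one variant per question scores zero, however well it does on the unperturbed originals.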
Related items
Showing items related by title, author, creator and subject.
- CEO Compensation Benchmarking in a European Context
  Eikmans, Emile (2024) Fairness and equality are important matters to people. However, inequality has long been rising both in Europe, and particularly the US. This led to increasing levels of scrutiny on the high salaries of top-level CEOs. Within ...
- How do storm sequences impact dune erosion? Modelling a 2022 storm group at Egmond aan Zee in XBeach
  Niemeijer, Merijn (2024) Sandy beaches and dunes form the primary flood defences along a large part of the Dutch coast. Dune safety assessments are formulated in terms of the morphological response to a benchmark storm occurring in isolation. ...
- FoodWasteAI: A Multi-Task Transformer Framework For Food Waste Image Processing
  Abdalla Mohamed Salama Sayed Moustafa, Abdalla (2023)