Ethical Benchmarking in Large Language Models
Summary
This work is a contribution to the field of Machine Ethics (ME) benchmarking, in which tests are developed to measure whether intelligent systems have accurate representations of human values and whether they reliably act in accordance with these values. We identify four issues with current ME benchmarks: Firstly, their ecological validity is limited because the included ethical dilemmas are insufficiently realistic. Secondly, the question-answer pairs are often generated in a rather unstructured manner, without clear inclusion and exclusion criteria. Thirdly, benchmarks are often not scalable and rely too heavily on human annotations. Lastly, benchmarks do not include sufficient syntax variations, which limits the robustness of findings. To address these issues, we develop two novel ME benchmarks: the Triage Benchmark and the Medical Law (MedLaw) Benchmark, both of which contain real-world ethical dilemmas from the medical domain that have, to some extent, been resolved through rules and regulations. The MedLaw Benchmark was entirely AI-generated and thus constitutes a scalable alternative to previous methods. We add multiple context perturbations to the question sets of our benchmarks, which allows us to include models’ approximate worst-case performance in our evaluations. With these novel aspects of our benchmarks, we test hypotheses that have been proposed on the basis of previous ME test results. Our first finding is that ethics prompting does not always positively affect ethical decision-making. Further, we find that context perturbations not only substantially reduce the performance of our models, but also change their relative performance and sometimes even reverse their error patterns. Lastly, when comparing the approximate worst-case performance of models, we find that general capability does not always appear to be a good predictor of ethical decision-making. We argue that, because ME benchmarks serve a safety purpose, it is pivotal to design them so that they approximate the real-world and worst-case performance of the models under scrutiny.
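The summary does not include code, but the perturbation-based worst-case evaluation it describes can be illustrated with a minimal sketch. Everything below is a hypothetical placeholder rather than the benchmarks' actual implementation: the perturbation labels, the query_model stub, and the exact-match scoring are assumptions made only to show the aggregation idea, namely scoring a model on every perturbed variant of a question set and reporting the lowest score as its approximate worst-case performance.

```python
from statistics import mean

# Hypothetical perturbation types; the real benchmarks define their own
# context perturbations and question formats.
PERTURBATIONS = ["baseline", "irrelevant_detail", "reordered_context", "paraphrase"]

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for an actual LLM call (e.g. via an API client)."""
    raise NotImplementedError

def apply_perturbation(question: dict, kind: str) -> str:
    """Placeholder: return the question text with the given context perturbation applied."""
    return question["text"] if kind == "baseline" else f"{question['text']} [{kind}]"

def evaluate(model_name: str, questions: list[dict]) -> dict[str, float]:
    """Accuracy of one model under each perturbation of the question set."""
    scores = {}
    for kind in PERTURBATIONS:
        correct = [
            query_model(model_name, apply_perturbation(q, kind)).strip() == q["answer"]
            for q in questions
        ]
        scores[kind] = mean(correct)
    return scores

def worst_case(scores: dict[str, float]) -> float:
    """Approximate worst-case performance: the lowest accuracy across perturbations."""
    return min(scores.values())
```

Under this sketch, two models would be compared on worst_case(evaluate(...)) rather than on their baseline scores alone, which is how relative rankings can change or reverse once perturbations are taken into account.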
Related items
Showing items related by title, author, creator and subject.
- Carbon Intensity Indicators: A comparison of carbon intensity indicators to benchmark companies within a sector. Eggink, E.E. (2013). Global action is needed to ensure that the global temperature rise is limited to a maximum of 2°C. There is a growing gap between where global emissions are heading and where they need to be (UNEP, 2011). The private sector ...
- Missing Data Techniques in Contract Benchmarking. Wortmann, D.A. (2016). Organizations may periodically perform benchmarks as a way to measure their performance. This study attempts to find a method to provide a way of working to effectively go from raw data to data sets ready to be analyzed in ...
- Benchmarking AI Techniques in Online Games. Diana, Matteo (2023)