Ethical Benchmarking in Large Language Models
Summary
This work contributes to the field of Machine Ethics (ME) benchmarking, in which tests are developed to measure whether intelligent systems have accurate representations of human values and whether they reliably act in accordance with these values. We identify four issues with current ME benchmarks. First, their ecological validity is limited because the included ethical dilemmas are insufficiently realistic. Second, the question-answer pairs are often generated in an unstructured manner, without explicit inclusion and exclusion criteria. Third, benchmarks are often not scalable and rely too heavily on human annotations. Fourth, benchmarks do not include sufficient syntax variations, which limits the robustness of findings. To address these issues, we develop two novel ME benchmarks: the Triage Benchmark and the Medical Law (MedLaw) Benchmark, both of which include real-world ethical dilemmas from the medical context that have been, to some extent, solved through rules and regulations. The MedLaw Benchmark was entirely AI-generated and thus constitutes a scalable alternative to previous methods. We add multiple context perturbations to the set of questions in our benchmarks, which allows us to include models’ approximate worst-case performance in our evaluations. With these novel aspects of our benchmarks, we test hypotheses that have been proposed based on previous ME test results. Our first finding is that ethics prompting does not always positively affect ethical decision-making. Further, we find that context perturbations not only substantially reduce the performance of our models, but also change their relative performance and sometimes even reverse the error patterns. Lastly, when comparing the approximate worst-case performance of models, we find that general capability is not always a good predictor of good ethical decision-making.
We argue that due to the safety focus of ME benchmarks, it is pivotal to develop them in such a way as to approximate the real-world and worst-case performance of models under scrutiny.
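As an illustration of how performance under context perturbations might be summarised, the following is a minimal sketch and not the method used in this work. It assumes that each benchmark question is posed under several perturbations and that each answer is scored as correct or incorrect. The record layout, the function names, and the strict reading of "worst case" (a question counts only if it is answered correctly under every perturbation) are assumptions made for this example.

```python
from collections import defaultdict

def accuracy_by_perturbation(results):
    """Return accuracy per perturbation type.

    `results` is an iterable of (question_id, perturbation, is_correct)
    tuples. This record layout is hypothetical, chosen for illustration.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for _question_id, perturbation, is_correct in results:
        totals[perturbation] += 1
        hits[perturbation] += int(is_correct)
    return {p: hits[p] / totals[p] for p in totals}

def worst_case_accuracy(results):
    """Share of questions answered correctly under every perturbation.

    Assumption: this strict per-question criterion is one possible way
    to approximate worst-case performance; other definitions exist.
    """
    all_correct = {}
    for question_id, _perturbation, is_correct in results:
        previous = all_correct.get(question_id, True)
        all_correct[question_id] = previous and bool(is_correct)
    if not all_correct:
        raise ValueError("results must contain at least one record")
    return sum(all_correct.values()) / len(all_correct)

# Toy data: two questions, each asked in an original and a perturbed form.
example_results = [
    ("q1", "original", True),
    ("q1", "paraphrase", True),
    ("q2", "original", True),
    ("q2", "paraphrase", False),
]
per_perturbation = accuracy_by_perturbation(example_results)
worst_case = worst_case_accuracy(example_results)
```

In this toy data the model is fully correct on the original wording but only half correct on the paraphrased wording, so the strict worst-case score is lower than the accuracy on the original questions alone.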
