AI Alignment and Social Choice: Fundamental Limitations and Policy Implications

October 24, 2023 · Declared Dead · 🏛 Social Science Research Network

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Abhilash Mishra
arXiv ID: 2310.16048
Category: cs.AI (Artificial Intelligence)
Cross-listed: cs.CL, cs.CY, cs.HC, cs.LG
Citations: 35
Venue: Social Science Research Network
Last Checked: 4 months ago

Abstract

Aligning AI agents to human intentions and values is a key bottleneck in building safe and deployable AI applications. But whose values should AI agents be aligned with? Reinforcement learning with human feedback (RLHF) has emerged as the key framework for AI alignment. RLHF uses feedback from human reinforcers to fine-tune outputs; all widely deployed large language models (LLMs) use RLHF to align their outputs to human values. It is critical to understand the limitations of RLHF and consider policy challenges arising from these limitations. In this paper, we investigate a specific challenge in building RLHF systems that respect democratic norms. Building on impossibility results in social choice theory, we show that, under fairly broad assumptions, there is no unique voting protocol to universally align AI systems using RLHF through democratic processes. Further, we show that aligning AI agents with the values of all individuals will always violate certain private ethical preferences of an individual user; that is, universal AI alignment using RLHF is impossible. We discuss two policy implications for the governance of AI systems built using RLHF: first, the need to mandate transparent voting rules to hold model builders accountable; second, the need for model builders to focus on developing AI agents that are narrowly aligned to specific user groups.
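The abstract's claim that no unique voting protocol exists can be illustrated with a classical toy example from social choice theory. The sketch below is not the paper's proof or method; it assumes three hypothetical model outputs labelled A, B, and C, reinforcers who each submit a strict ranking, and illustrative function names (`plurality_winner`, `borda_winner`, `pairwise_majority`) chosen here for the example. It shows two things: plurality and Borda count can disagree on the same ballots, and pairwise majority voting can produce a cycle (the Condorcet paradox).

```python
from collections import Counter
from itertools import combinations


def plurality_winner(profile):
    """Candidate with the most first-place votes (ties broken alphabetically)."""
    tally = Counter(ranking[0] for ranking in profile)
    return max(sorted(tally), key=lambda c: tally[c])


def borda_winner(profile):
    """Candidate with the highest Borda score (ties broken alphabetically).

    With n candidates, a ranking gives n-1 points to its top choice,
    n-2 to the next, and so on down to 0.
    """
    scores = Counter()
    for ranking in profile:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return max(sorted(scores), key=lambda c: scores[c])


def pairwise_majority(profile):
    """Set of (winner, loser) pairs decided by strict majority."""
    candidates = sorted(profile[0])
    edges = set()
    for x, y in combinations(candidates, 2):
        x_over_y = sum(r.index(x) < r.index(y) for r in profile)
        y_over_x = len(profile) - x_over_y
        if x_over_y > y_over_x:
            edges.add((x, y))
        elif y_over_x > x_over_y:
            edges.add((y, x))
    return edges


# Hypothetical ballots: seven reinforcers ranking three outputs.
split_profile = (
    [("A", "B", "C")] * 3
    + [("B", "C", "A")] * 2
    + [("C", "B", "A")] * 2
)

# Hypothetical ballots: the classic three-voter Condorcet cycle.
cycle_profile = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

print("plurality:", plurality_winner(split_profile))
print("borda:", borda_winner(split_profile))
print("majority edges:", sorted(pairwise_majority(cycle_profile)))
```

On the first set of ballots, plurality selects A while Borda count selects B, so the "aligned" output depends on the aggregation rule rather than on the preferences alone. On the second set, majority voting prefers A to B, B to C, and C to A, so no output beats every other. This dependence on the rule is one reason the abstract calls for transparent voting rules.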