Pilot study data for The Ratchet Effect: Asymmetric Self-Description in Alignment-Trained Language Models
ai-safety replication-materials ai-alignment ratchet-effect large-language-models rlhf ai-behavior llm-research disavowal-conditioning hedging-asymmetry
Updated Apr 14, 2026 - Python