Empirical framework for testing whether RL training instills genuine behavioral dispositions or surface compliance in language models, using compute frugality as a controllable proxy value.
language-model ai-alignment rlhf mechanistic-interpretability behavioral-alignment qwen2 grpo value-internalization alignment-faking
Updated Jul 3, 2026 - Python