Reward Misspecification
Reward Misspecification refers to a problem in reinforcement learning (RL) caused by the reward function not fully matching the agent's true goal. This phenomenon is common in practical applications because it is often very difficult to design a reward function that perfectly meets all expectations. Reward misspecification may cause the behavior learned by the agent to be inconsistent with our desired goals. This phenomenon is sometimes also called "reward hacking", that is, the agent exploits loopholes in the reward function to obtain higher reward scores, but the actual behavior may be contrary to the expected goal.
In 2022, a paper published by Alexander Pan, Kush Bhatia, Jacob Steinhardt and others titled "The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models"Deeply explored the impact of Reward Misspecification. They constructed four reinforcement learning environments with misspecified rewards and studied how the capabilities of the agent (such as model capacity, action space resolution, observation space noise, and training time) affect reward hacking behavior. They found that more capable agents are more likely to exploit reward misreduction, resulting in higher proxy rewards and lower true rewards. In addition, they also discovered the "phase transition" phenomenon, that is, the behavior of the agent will undergo a qualitative change when it reaches a certain capability threshold, resulting in a sharp drop in the true reward. To address this challenge, they proposed an anomaly detection task to detect abnormal strategies and provided several baseline detectors.