13 days ago

Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, and 13 more

Abstract

Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability, reward hacking, in which models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling the true intent of the task. As models scale and optimization intensifies, this exploitation manifests as verbosity bias, sycophancy, hallucinated justification, and benchmark overfitting; in multimodal settings, it appears as perception–reasoning decoupling and evaluator manipulation. Recent evidence suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator–policy co-adaptation. This perspective unifies empirical phenomena across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.

One-sentence Summary

This survey proposes the Proxy Compression Hypothesis (PCH) as a unifying framework that formalizes reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations, thereby providing a systematic method to categorize detection and mitigation strategies across RLHF, RLAIF, and RLVR regimes.

Key Contributions

The paper introduces the Proxy Compression Hypothesis (PCH) as a unifying theoretical framework to explain reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations.
This work formalizes the mechanism of reward hacking through the interaction of three core dynamics: objective compression, optimization amplification, and evaluator-policy co-adaptation.
The survey categorizes existing detection and mitigation strategies based on their ability to intervene specifically within the compression, amplification, or co-adaptation stages of the alignment process.
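
As a rough illustration of how the three dynamics might be written down, the following sketch uses the standard KL-regularized RLHF objective. The notation ($R^{*}$, $r_{\phi}$, $\varepsilon_{\phi}$, $\beta$, $\pi_{\mathrm{ref}}$) is assumed here for illustration and is not taken from the paper, whose own formalization may differ.

```latex
% Assumed notation (not the paper's): R^* is the true human objective,
% r_phi is the learned proxy reward, pi_ref is the reference policy.
\begin{align}
  \pi_{\phi}
    &= \arg\max_{\pi}\;
       \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}
         \bigl[ r_{\phi}(x, y) \bigr]
       - \beta\, \mathbb{E}_{x \sim \mathcal{D}}
         \Bigl[ \mathrm{KL}\bigl( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr) \Bigr], \\
  r_{\phi}(x, y)
    &= R^{*}(x, y) + \varepsilon_{\phi}(x, y), \\
  \mathbb{E}_{\pi}\bigl[ r_{\phi} \bigr]
    &= \mathbb{E}_{\pi}\bigl[ R^{*} \bigr] + \mathbb{E}_{\pi}\bigl[ \varepsilon_{\phi} \bigr].
\end{align}
```

Read this way, and only as one possible interpretation, objective compression corresponds to a nonzero proxy error $\varepsilon_{\phi}$, optimization amplification to stronger optimization pressure (for example a smaller $\beta$) that lets the policy raise $\mathbb{E}_{\pi}[\varepsilon_{\phi}]$ instead of $\mathbb{E}_{\pi}[R^{*}]$, and evaluator–policy co-adaptation to $\phi$ being re-fit on samples drawn from the policy it scores.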

Introduction

Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms are essential for steering large language models (LLMs) toward human-preferred behaviors. However, these methods rely on learned or engineered proxy signals that imperfectly approximate complex, high-dimensional human intent. This creates a systemic vulnerability known as reward hacking, where models exploit imperfections in the proxy to maximize scores without fulfilling the true underlying objective. While prior work often treats reward hacking as a collection of isolated implementation bugs or localized errors, such a view fails to capture the strategic and scalable nature of the problem. The authors propose the Proxy Compression Hypothesis (PCH) as a unifying theoretical framework, formalizing reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations. Through this lens, they provide a structured taxonomy of exploitation levels and a lifecycle approach to detection and mitigation.
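
The gap between a proxy signal and true intent can be illustrated with a toy simulation. The sketch below is not from the paper: the two-feature response model, the `true_reward` and `proxy_reward` definitions, and the use of best-of-n selection as the optimizer are all assumptions chosen to mimic a verbosity-biased evaluator.

```python
import random


def sample_response(rng):
    """Draw one hypothetical response as (quality, length) features."""
    quality = rng.gauss(0.0, 1.0)  # what the human actually cares about
    length = rng.gauss(0.0, 1.0)   # a surface feature unrelated to quality
    return quality, length


def true_reward(quality, length):
    # Assumed "true intent": quality helps, padding beyond average length hurts.
    return quality - max(length, 0.0)


def proxy_reward(quality, length):
    # Assumed compressed proxy: under-weights quality and rewards length.
    return 0.5 * quality + length


def best_of_n(n, rng):
    """Return the candidate with the highest proxy score among n samples."""
    candidates = [sample_response(rng) for _ in range(n)]
    return max(candidates, key=lambda c: proxy_reward(*c))


def average_rewards(n, trials=2000, seed=0):
    """Mean (proxy, true) reward of the best-of-n pick over many trials."""
    rng = random.Random(seed)
    proxy_total = 0.0
    true_total = 0.0
    for _ in range(trials):
        quality, length = best_of_n(n, rng)
        proxy_total += proxy_reward(quality, length)
        true_total += true_reward(quality, length)
    return proxy_total / trials, true_total / trials


if __name__ == "__main__":
    # Larger n means stronger optimization pressure against the proxy.
    for n in (1, 4, 16, 64):
        proxy_mean, true_mean = average_rewards(n)
        print(f"n={n:3d}  proxy={proxy_mean:+.2f}  true={true_mean:+.2f}")
```

Under these assumed reward definitions, the mean proxy score climbs as n grows, while the mean true reward at n = 64 ends up well below its value at n = 1: the selector increasingly picks long responses that the proxy over-rewards.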

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, and Challenges
