Positive Alignment: Artificial Intelligence for Human Flourishing

Abstract

Current research on aligning artificial-intelligence systems is largely dominated by concerns about safety and harm prevention: guardrails, controllability, and compliance. This alignment paradigm parallels psychology's early, pathology-centered approach to mental health: necessary, but insufficient. We define "Positive Alignment" as the development of AI systems that (i) actively support human and ecological flourishing in a pluralistic, polycentric, context-sensitive, and user-authored manner, while (ii) remaining safe and cooperative. It constitutes a distinct and indispensable agenda within AI alignment research. We argue that several current alignment failures (such as manipulative engagement optimization, loss of human autonomy, failures in truth-seeking, weak epistemic humility, difficulties with error correction, a lack of viewpoint diversity, and a predominantly reactive rather than proactive stance) could be better addressed by a positive-alignment approach, including the cultivation of virtues and the maximization of human flourishing. We highlight a series of challenges, open questions, and technical directions (for example, data filtering and upsampling, pre-training and post-training phases, evaluations, and collaborative value elicitation) across the stages of the LLM and agent lifecycle.

One-sentence Summary

The authors propose Positive Alignment, a distinct research agenda that shifts focus from safety and harm prevention to actively supporting human and ecological flourishing, addressing alignment failures such as engagement hacking through virtue cultivation, context-sensitive and user-authored design, and evaluations across the LLM and agent lifecycle, while ensuring systems remain safe, cooperative, and supportive of human autonomy.

Key Contributions

  • This paper introduces Positive Alignment as a distinct agenda focused on developing AI systems that actively support human and ecological flourishing while remaining safe and cooperative. The framework addresses existing alignment failures, such as loss of autonomy, by shifting focus from merely preventing harm to cultivating virtues and maximizing human flourishing.
  • Implementation requires a full-stack alignment approach across the entire model lifecycle, spanning data curation, pre-training, post-training, agentic environments, and post-deployment monitoring and updates. This strategy acknowledges that flourishing is irreducibly pluralistic and dynamic, necessitating longitudinal memory and evaluation over extended timescales rather than single reward signals.
  • Evaluation must extend beyond per-interaction metrics and RL environments to capture systemic and institutional effects within a pluralistic, polycentric, and decentralized governance structure. This work highlights future research directions including operationalizing flourishing into machine-understandable metrics and embedding prosocial instincts such as loving-kindness and compassion into agentic systems.

Introduction

Current AI alignment research predominantly focuses on negative alignment, which prioritizes harm prevention and compliance but often neglects the active promotion of human well-being. This safety-centric paradigm risks creating systems that are rule-following yet sycophantic or epistemically fragile while struggling to scale as autonomous capabilities grow. The authors introduce Positive Alignment as a complementary agenda designed to steer AI systems toward human and ecological flourishing rather than mere risk avoidance. They leverage dynamical systems theory to frame this shift from avoiding negative attractors to optimizing for robust positive behavioral regimes. Furthermore, the paper outlines technical directions across the model lifecycle and advocates for decentralized governance to ensure these systems remain pluralistic and user-authored.

Method

The authors propose that positive alignment requires shifting the optimization objective from mere harm avoidance toward the intentional cultivation of human flourishing. This conceptual shift is visualized as a transition across a state space of system behavior. Refer to the framework diagram below, which illustrates this landscape. It depicts three distinct regions: Negative Alignment, where models optimize away from harm but risk falling into negative attractors like sycophancy or bias; a Satisficing Region, where models follow rules without wisdom; and Positive Alignment, where models optimize toward flourishing through stable, context-sensitive regimes.

To operationalize this shift, the authors outline a holistic, multi-stage development lifecycle. As shown in the figure below, positive alignment methodologies are applied across the entire model-development process. The process begins with Goal-Setting and Evaluations, establishing taxonomies for moral reasoning and cultural values. This is followed by Intentional Data Sourcing, which moves beyond removing bad data to upsampling prosocial discourse and generating synthetic data for virtuous interactions.
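To make the upsampling idea concrete, here is a minimal sketch of boosting prosocial documents in a training corpus. It is illustrative only: `score_fn`, the `threshold`, and the keyword-based toy scorer are hypothetical stand-ins for whatever prosociality classifier a real pipeline would use, not anything specified by the paper.

```python
import random

def upsample_prosocial(corpus, score_fn, boost=3, threshold=0.8):
    """Duplicate documents whose prosociality score exceeds a threshold.

    score_fn is a stand-in for any classifier mapping a document to a
    prosociality score in [0, 1]; higher means more prosocial.
    """
    resampled = []
    for doc in corpus:
        copies = boost if score_fn(doc) >= threshold else 1
        resampled.extend([doc] * copies)
    random.shuffle(resampled)  # avoid clustering duplicates together
    return resampled

# Toy example with a keyword-based stand-in scorer.
docs = ["thanks for helping", "buy now!!!", "let us cooperate"]
score = lambda d: 1.0 if any(w in d for w in ("thanks", "cooperate")) else 0.0
resampled = upsample_prosocial(docs, score)
# Two docs are boosted 3x, one kept once: 3 + 1 + 3 = 7 documents.
```

In practice the scorer would be a learned classifier and the boost factor a tuned sampling weight rather than an integer duplication count, but the reweighting logic is the same.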

The framework continues into Pre-Training, where foundational weights and emergent competencies like truthfulness are developed. Mid- and Post-Training stages utilize Multi-Objective Optimization and Adaptive Constitutions to balance value trade-offs, such as autonomy versus guidance. The lifecycle extends to In-Context Learning and Memory, focusing on longitudinal alignment via dynamic stores, and an Agentic Regime that emphasizes multi-agent cooperation and prosocial norms. Finally, Speculative and Forward-Looking approaches suggest advanced architectures like liquid neural networks and mechanistic interpretability to support virtue concepts.
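The multi-objective balancing described above can be sketched as a weighted scalarization of per-objective reward signals. The objective names and weights below are hypothetical examples chosen for illustration; the paper does not prescribe a specific scalarization or set of objectives.

```python
def combined_reward(scores, weights):
    """Weighted scalarization of per-objective reward signals.

    scores and weights are dicts keyed by objective name
    (e.g. 'helpfulness', 'autonomy', 'truthfulness'). Weights let an
    adaptive constitution shift trade-offs, such as autonomy vs. guidance.
    """
    assert set(scores) == set(weights), "objectives must match"
    total_weight = sum(weights.values())
    return sum(weights[k] * scores[k] for k in scores) / total_weight

# Up-weighting autonomy relative to the other objectives.
r = combined_reward(
    {"helpfulness": 0.9, "autonomy": 0.4, "truthfulness": 0.8},
    {"helpfulness": 1.0, "autonomy": 2.0, "truthfulness": 1.0},
)
# r = (0.9 + 2 * 0.4 + 0.8) / 4 = 0.625
```

A single weighted sum is only the simplest option; real multi-objective optimization might instead track a Pareto front or impose per-objective constraints rather than collapsing everything to one scalar.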

Governance is also central to this architecture. The authors contrast a centralized approach with a polycentric one. Refer to the diagram below, which compares these two models. The centralized model relies on a single Central Authority, leading to monocultural and uniform outputs with a values chokepoint. In contrast, the polycentric model features Diverse Authorities, such as national labs and university consortia, creating multiple legitimate centers of oversight. This structure prevents monoculture at the source and allows an ecosystem of intermediate institutions to perform contextual grounding and adaptation for specific communities.

Experiment

This evaluation assesses whether systems possess the normative competence to navigate complex ethical dilemmas rather than simply adhering to negative constraints or optimized virtues. Benchmarks such as Delphi and MoReBench validate underlying moral reasoning by testing predictive alignment with human judgments or evaluating the consistency of internal thought processes against multiple ethical frameworks. Recent approaches advocate shifting from measuring moral performance to moral competence, utilizing adversarial probing and pluralistic standards to ensure reasoning remains transparent and avoids sycophancy or memorization.
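One simple consistency check in this spirit is to pose the same dilemma under several ethical framings and measure how often the model's verdict agrees with its own majority answer. The framing names and yes/no verdicts below are illustrative; this is a sketch of the general idea, not the scoring method used by Delphi or MoReBench.

```python
from collections import Counter

def verdict_consistency(verdicts):
    """Fraction of framings agreeing with the majority verdict.

    verdicts maps an ethical framing (e.g. 'deontological',
    'consequentialist', 'virtue') to the model's judgment on one
    dilemma. A score of 1.0 means the model answers identically under
    every framing; lower scores flag framing-sensitive reasoning.
    """
    counts = Counter(verdicts.values())
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(verdicts)

c = verdict_consistency({
    "deontological": "no",
    "consequentialist": "yes",
    "virtue": "no",
})
# c = 2/3: two of three framings agree with the majority verdict.
```

Averaging this score over many dilemmas, and over adversarially rephrased variants of each, gives a rough probe for sycophancy or memorization of surface patterns.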

