GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Iman Mirzadeh Keivan Alizadeh Oncel Tuzel Samy Bengio Hooman Shahrokhi Mehrdad Farajtabar

Abstract

Recent progress in large language models (LLMs) has sparked growing interest in their mathematical reasoning capabilities. Although their performance on the well-known GSM8K benchmark has improved, questions remain about the reliability of the reported evaluation metrics and about whether the reasoning capabilities of LLMs have genuinely advanced. To address the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark built from symbolic templates that allow a diverse set of questions to be generated. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our results reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, model performance declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance deteriorates significantly as the number of clauses in a question increases. We hypothesize that this decline occurs because current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.

One-sentence Summary

The authors introduce GSM-Symbolic, a benchmark that uses symbolic templates to generate diverse question variants and so addresses limitations of static evaluations such as GSM8K; it reveals performance variance across numerical instantiations and marked deterioration as clause counts grow, suggesting that models replicate training patterns rather than perform genuine logical reasoning.

Key Contributions

  • This work introduces GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. The benchmark enables more controllable evaluations and provides reliable metrics for measuring the reasoning capabilities of models.
  • Findings reveal that large language models exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of models declines when only the numerical values in the question are altered within the GSM-Symbolic benchmark.
  • The research demonstrates that model performance significantly deteriorates as the number of clauses in a question increases. These findings support the hypothesis that current models lack genuine logical reasoning and instead replicate steps observed in training data.

Introduction

Mathematical reasoning is vital for deploying artificial intelligence in scientific and real-world applications. Current evaluations rely on static benchmarks such as GSM8K, which limit robustness testing and risk data contamination. The authors address these gaps with GSM-Symbolic, a benchmark that uses symbolic templates to generate diverse question variants for controlled evaluation. Their findings show that model performance varies across instantiations of the same question and drops when numerical values are changed or irrelevant clauses are added. These results suggest that large language models depend on pattern matching rather than genuine logical reasoning.

Dataset

  • Dataset Composition and Sources

    • The authors base their work on the GSM8K dataset, which includes over 8000 grade school math questions split into 7473 training and 1319 test examples.
    • They introduce GSM-Symbolic to generate numerous instances with controlled difficulty, addressing risks of data contamination and sensitivity to minor question modifications.
  • Processing and Key Details

    • Templates are created from GSM8K test examples by identifying variables, domains, and conditions such as divisibility to ensure whole number answers.
    • Common proper names are used for persons, foods, and currencies while automated checks verify that original values do not appear in the template.
    • Manual review covers 10 random samples per template, with additional review triggered if fewer than two models answer a question correctly during evaluation.
    • Numerical ranges are kept in line with those of the original GSM8K test set, so that the evaluation targets logical reasoning rather than arithmetic skill.
  • Model Usage and Evaluation

    • The generated data serves as an evaluation framework in which LLM performance is treated as a distribution over many problem instances rather than as a single fixed metric.
    • This approach assesses both mathematical capability and robustness to varying problem difficulty and augmentations.
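
The template steps above (variables, domains, and conditions such as divisibility) can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' implementation: the `SymbolicTemplate` class, the example question, and the value ranges are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SymbolicTemplate:
    """Hypothetical container for one templated question."""
    text: str                           # question text with {placeholders}
    domains: Dict[str, List]            # allowed values for each variable
    condition: Callable[[Dict], bool]   # constraint, e.g. divisibility
    answer: Callable[[Dict], int]       # ground-truth formula

    def sample(self, rng: random.Random, max_tries: int = 1000) -> Tuple[str, int]:
        # Rejection sampling: draw values until the condition holds.
        for _ in range(max_tries):
            values = {name: rng.choice(dom) for name, dom in self.domains.items()}
            if self.condition(values):
                return self.text.format(**values), self.answer(values)
        raise RuntimeError("no assignment satisfied the template conditions")


# Illustrative template (not taken from GSM8K).
template = SymbolicTemplate(
    text=("{name} has {total} {food} and shares them equally among "
          "{k} friends. How many {food} does each friend get?"),
    domains={
        "name": ["Sophie", "Liam", "Maya"],
        "food": ["apples", "cookies"],
        "total": list(range(10, 100)),
        "k": list(range(2, 10)),
    },
    # Divisibility condition guarantees a whole-number answer.
    condition=lambda v: v["total"] % v["k"] == 0,
    answer=lambda v: v["total"] // v["k"],
)

question, answer = template.sample(random.Random(0))
print(question, "->", answer)
```

Each call to `sample` yields a new instance of the same underlying problem, so the reasoning steps are unchanged while names and numbers vary.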

Experiment

This work evaluates the reasoning capabilities of various large language models using GSM-Symbolic, a benchmark generated from symbolic templates of GSM8K questions to test reliability and robustness. Results show significant performance variance across instantiations of the same question, and scores on the original GSM8K often overstate true capability, possibly because of data contamination. Additional experiments show that reasoning is fragile: accuracy declines as question complexity increases or irrelevant information is added, suggesting that models rely on pattern matching rather than formal logical reasoning.
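
Treating performance as a distribution can be sketched as below, assuming each generated question set is scored separately. The function names are hypothetical, and the correctness flags are placeholders for illustration, not results from the paper.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of questions answered correctly in one generated set."""
    return sum(correct) / len(correct)


def accuracy_distribution(runs: List[Sequence[bool]]) -> Dict[str, float]:
    """Summarize accuracy over several instantiated sets of the same templates."""
    accs = [accuracy(r) for r in runs]
    return {
        "mean": mean(accs),
        "std": pstdev(accs),   # population standard deviation across sets
        "min": min(accs),
        "max": max(accs),
    }


# Placeholder flags: one inner list per generated set (illustrative only).
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
summary = accuracy_distribution(runs)
print(summary)
```

Reporting the spread alongside the mean makes it visible when a single benchmark score sits at the favorable end of the range.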

The authors evaluate various large language models on GSM8K and its symbolic variants to assess reasoning reliability. Accuracy is generally higher on the original GSM8K than on the symbolic variations, a consistent drop that suggests potential data contamination or reliance on pattern matching rather than formal reasoning. Adding clauses to increase question difficulty lowers scores further relative to the base symbolic version, and adding seemingly relevant but inconsequential information produces the lowest scores across the board.
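
The idea of adding seemingly relevant but inconsequential information can be illustrated with a small sketch. The helper `add_inconsequential_clause` and the example question are hypothetical; the point is only that the inserted sentence leaves the correct answer unchanged.

```python
def add_inconsequential_clause(question: str, clause: str) -> str:
    """Insert `clause` just before the final sentence (the actual question)."""
    body, sep, final = question.rstrip().rpartition(". ")
    if not sep:
        # Single-sentence question: put the clause in front.
        return f"{clause} {question}"
    return f"{body}. {clause} {final}"


# Illustrative question; the answer is 12 + 8 = 20 in both versions.
base = ("Maya picks 12 apples on Monday and 8 apples on Tuesday. "
        "How many apples does Maya have?")
variant = add_inconsequential_clause(
    base, "Three of the apples are slightly smaller than the rest."
)
print(variant)
```

A model that subtracts the three smaller apples has reacted to the surface pattern of the sentence, not to the logic of the problem.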

The authors compare model performance on the original GSM8K benchmark and three modified variants designed to test reasoning robustness through symbolic changes and increased difficulty. Models score highest overall on the original GSM8K. Adding clauses to raise difficulty produces a measurable drop in accuracy, with GSM-P1 and GSM-P2 scoring below GSM-Symbolic. Accuracy remains high across the benchmarks, but the choice of variant has a noticeable effect on the final results.
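
Raising difficulty by adding clauses can be sketched as follows, assuming for illustration that each level appends one more clause to a base question. The clause representation and the example are hypothetical and do not reproduce the paper's GSM-P1 or GSM-P2 questions.

```python
from typing import Dict, List, Tuple

Clause = Tuple[str, int]  # (sentence, change to the running total)


def build_question(clauses: List[Clause], ask: str) -> Tuple[str, int]:
    """Join clause sentences into a question and compute its answer."""
    text = " ".join(sentence for sentence, _ in clauses) + " " + ask
    answer = sum(delta for _, delta in clauses)
    return text, answer


base = [("Liam buys 5 pens.", 5), ("He then buys 3 more pens.", 3)]
extra = [("He gives 2 pens to a friend.", -2), ("He finds 4 pens in his bag.", 4)]
ask = "How many pens does Liam have?"

# Level n appends the first n extra clauses to the base question.
levels: Dict[int, Tuple[str, int]] = {
    n: build_question(base + extra[:n], ask) for n in range(3)
}
for n, (text, answer) in levels.items():
    print(n, text, "->", answer)
```

Every added clause requires one more reasoning step, which is what makes accuracy at each level comparable against the base version.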


