Command Palette
Search for a command to run...
GSM-Symbolic: Das Verständnis der Einschränkungen der mathematischen Schlussfolgerung in großen Sprachmodellen
GSM-Symbolic: Das Verständnis der Einschränkungen der mathematischen Schlussfolgerung in großen Sprachmodellen
Iman Mirzadeh Keivan Alizadeh Oncel Tuzel Samy Bengio Hooman Shahrokhi Mehrdad Farajtabar
Zusammenfassung
Die jüngsten Fortschritte bei Large Language Models (LLMs) haben das Interesse an ihren mathematischen Reasoning-Fähigkeiten geweckt. Während die Leistung auf dem weit verbreiteten GSM8K-Benchmark verbessert wurde, bleiben Fragen bezüglich der Zuverlässigkeit der berichteten Evaluationsmetriken offen, während sich die Reasoning-Fähigkeiten von LLMs weiterentwickelt haben. Um die Limitationen bestehender Evaluierungen zu überwinden, führen wir GSM-Symbolic ein, einen verbesserten Benchmark, der auf symbolischen Templates basiert und die Generierung eines vielfältigen Satzes von Fragen ermöglicht. GSM-Symbolic ermöglicht kontrolliertere Evaluierungen und bietet wichtige Erkenntnisse sowie zuverlässigere Metriken zur Messung der Reasoning-Fähigkeiten von Modellen.Unsere Ergebnisse zeigen, dass LLMs beim Beantworten unterschiedlicher Instanziierungen derselben Frage eine bemerkenswerte Varianz aufweisen. Insbesondere verschlechtert sich die Leistung der Modelle, wenn im GSM-Symbolic-Benchmark lediglich die numerischen Werte in der Frage geändert werden. Darüber hinaus untersuchen wir die Fragilität des mathematischen Reasoning in diesen Modellen und demonstrieren, dass sich ihre Leistung signifikant verschlechtert, wenn die Anzahl der Klauseln (Clauses) in einer Frage zunimmt. Wir gehen davon aus, dass dieser Rückgang darauf zurückzuführen ist, dass aktuelle LLMs nicht in der Lage sind, echtes logisches Reasoning durchzuführen; stattdessen versuchen sie, die in ihren Trainingsdaten beobachteten Reasoning-Schritte nachzuahmen.
One-sentence Summary
The authors introduce GSM-Symbolic, a benchmark utilizing symbolic templates to generate diverse questions that overcome limitations in existing evaluations such as GSM8K by revealing performance variance across numerical instantiations and significant deterioration with increased clause counts, indicating models replicate training patterns rather than perform genuine logical reasoning.
Key Contributions
- This work introduces GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. The benchmark enables more controllable evaluations and provides reliable metrics for measuring the reasoning capabilities of models.
- Findings reveal that large language models exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of models declines when only the numerical values in the question are altered within the GSM-Symbolic benchmark.
- The research demonstrates that model performance significantly deteriorates as the number of clauses in a question increases. These findings support the hypothesis that current models lack genuine logical reasoning and instead replicate steps observed in training data.
Introduction
Mathematical reasoning is vital for deploying artificial intelligence in scientific and real-world applications. Current evaluations rely on static benchmarks like GSM8K, which limit robustness testing and risk data contamination. The authors address these gaps with GSM-Symbolic, a benchmark utilizing symbolic templates to generate diverse question variants for controlled evaluation. Their findings show that model performance varies across question instantiations and drops when numerical values or irrelevant clauses change. These results suggest Large Language Models depend on pattern matching rather than genuine logical reasoning.
Dataset
-
Dataset Composition and Sources
- The authors base their work on the GSM8K dataset, which includes over 8000 grade school math questions split into 7473 training and 1319 test examples.
- They introduce GSM-Symbolic to generate numerous instances with controlled difficulty, addressing risks of data contamination and sensitivity to minor question modifications.
-
Processing and Key Details
- Templates are created from GSM8K test examples by identifying variables, domains, and conditions such as divisibility to ensure whole number answers.
- Common proper names are used for persons, foods, and currencies while automated checks verify that original values do not appear in the template.
- Manual review covers 10 random samples per template, with additional review triggered if fewer than two models answer a question correctly during evaluation.
- Numerical ranges align with the original GSM8K test set to focus on logical reasoning rather than arithmetic skills within known accuracy boundaries.
-
Model Usage and Evaluation
- The data serves as a reliable evaluation framework to view LLM performance as a distribution across various problem instances.
- This approach assesses mathematical capabilities and robustness to diverse problem difficulties and augmentations beyond a single fixed metric.
Experiment
This work evaluates the reasoning capabilities of various large language models using GSM-Symbolic, a benchmark generated by mutating GSM8K templates to test reliability and robustness. Results indicate significant performance variance across different question instantiations, with original GSM8K metrics often overstating true capability due to potential data contamination. Additional experiments demonstrate that reasoning abilities are fragile, as accuracy declines with increased numerical complexity or irrelevant information, suggesting models rely on pattern matching rather than formal logical understanding.
The authors evaluate various large language models on the GSM8K benchmark and its symbolic variants to assess reasoning reliability. The results indicate a consistent performance drop when models are tested on symbolic variations compared to the original GSM8K dataset, suggesting potential data contamination or reliance on pattern matching rather than formal reasoning. Furthermore, increasing question difficulty or adding irrelevant information leads to significant degradation in accuracy across all tested models. Models generally achieve higher accuracy on the standard GSM8K benchmark compared to the symbolic variations. Increasing question difficulty by adding clauses leads to lower performance scores compared to the base symbolic version. Adding seemingly relevant but inconsequential information results in the lowest performance scores across the board.
The authors compare model performance across the original GSM8K benchmark and three modified variants designed to test reasoning robustness through symbolic changes and increased difficulty. The results indicate that the original GSM8K benchmark consistently achieves higher overall accuracy compared to the modified versions, suggesting that these variations introduce challenges that slightly reduce model performance. The original GSM8K benchmark demonstrates superior overall performance compared to the modified GSM-Symbolic and GSM-P variants. Increasing question difficulty through added clauses results in a measurable drop in accuracy, with GSM-P1 and GSM-P2 scoring lower than GSM-Symbolic. Models maintain high accuracy across the benchmarks, though the specific variant of the test has a noticeable impact on the final results.
The authors evaluate large language models on the GSM8K benchmark and modified variants to assess reasoning reliability and robustness against symbolic changes. Results indicate a consistent performance drop on these variations compared to the original dataset, suggesting a potential reliance on pattern matching rather than formal reasoning. Furthermore, increasing question difficulty or adding irrelevant information leads to significant accuracy degradation across all tested models.