Evaluating Large Language Model Performance Across Diverse Reasoning Tasks and Methods
Published on 2025-11-12 • Avichala Research
Abstract: This paper presents a comprehensive benchmarking study of Large Language Models (LLMs) across a wide spectrum of reasoning tasks and prompting methodologies. Utilizing established benchmarks like ARC-C, GSM8K, MMLU, and GPQA, alongside novel selection methods, the research demonstrates significant variations in LLM performance dependent on both the underlying model architecture and the prompting strategies employed. The findings highlight the critical need for more nuanced evaluation metrics and sophisticated prompting techniques to unlock the full potential of LLMs.
Problem Statement: The rapid proliferation of Large Language Models has created a critical gap in our understanding of their true capabilities and limitations. Initial benchmarks and “headline” performance numbers often fail to capture the subtle differences in how LLMs perform across diverse reasoning tasks and under varying prompting conditions. This lack of granular evaluation hinders the responsible development and deployment of these powerful models, particularly in high-stakes applications where reliable performance is paramount. Furthermore, the absence of standardized methods for probing LLM ‘reasoning’ – moving beyond simply generating correct answers – leaves open critical questions about their genuine comprehension and problem-solving abilities. This research aims to address this need by providing a robust, multi-faceted assessment of LLM performance, focusing on the interplay between model architecture, prompting, and task complexity.
Methodology: The research employs a rigorous experimental design to evaluate LLM performance. The core of the study leverages a suite of established benchmarks (a minimal evaluation-loop sketch follows the list), including:
- ARC-C (AI2 Reasoning Challenge - Challenge Set): Tests multiple-choice science-exam questions that require reasoning and commonsense knowledge.
- GSM8K (Grade School Math 8K): Assesses multi-step mathematical reasoning on grade-school word problems.
- MMLU (Massive Multitask Language Understanding): Evaluates knowledge across a broad range of subjects.
- GPQA (Graduate-Level Google-Proof Q&A): Tests expert-written, graduate-level science questions designed to resist answering by simple lookup.
- IFEval (Instruction-Following Evaluation): Measures how reliably models follow verifiable, fine-grained instructions.
- MATH-HARD: The hard subset of the MATH competition-mathematics benchmark, used to probe advanced numerical and symbolic reasoning.
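The summary does not describe the paper's exact evaluation harness, so the sketch below is only a minimal illustration of the per-benchmark protocol: format each item into a prompt, query the model, and score the response against the reference answer. The names `evaluate`, `exact_match`, `query_model`, and `load_benchmark` are hypothetical stand-ins, not the authors' code.

```python
# Minimal evaluation-loop sketch (hypothetical helpers, not the paper's harness).
from typing import Callable, Dict, List

def exact_match(prediction: str, reference: str) -> bool:
    """Score a prediction by normalized exact match against the reference answer."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(benchmark: List[Dict[str, str]],
             query_model: Callable[[str], str],
             prompt_template: str = "{question}\nAnswer:") -> float:
    """Run one model / prompting strategy over a benchmark and return accuracy."""
    correct = 0
    for item in benchmark:                      # each item: {"question": ..., "answer": ...}
        prompt = prompt_template.format(question=item["question"])
        prediction = query_model(prompt)        # model call (API or local inference)
        correct += exact_match(prediction, item["answer"])
    return correct / max(len(benchmark), 1)

# Usage: repeat over the full suite with a given model and prompting strategy.
# results = {name: evaluate(load_benchmark(name), query_model)
#            for name in ["ARC-C", "GSM8K", "MMLU", "GPQA", "IFEval", "MATH-HARD"]}
```

In practice each benchmark needs its own answer-extraction and scoring logic (multiple-choice letters for ARC-C and MMLU, numeric answers for GSM8K and MATH-HARD, constraint checks for IFEval), but the overall loop has this shape.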
Crucially, the researchers extend beyond standard benchmark usage by comparing several model-adaptation and prompting strategies (a layer-freezing sketch follows this list):
- Full CPT (Conditional Prompt Tuning): Utilizes fine-tuning to adapt the model to specific tasks.
- Full SFT (Full Supervised Fine-Tuning): Supervised fine-tuning of all model parameters on task data.
- SPEAR-MM (Selective Prompting Methods): Employs diverse prompt selection strategies, including Conservative, Balanced, and Aggressive approaches, all focused on improving task-specific performance.
- Random Freeze: A baseline approach involving no model parameter updates.
- Top-8 Freeze: Freezing the top 8 layers of the LLM while updating the remaining layers, to isolate how partial layer freezing affects task performance.
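The summary does not give the freezing recipe, but the difference between Full SFT and the freeze baselines comes down to which parameters receive gradient updates. The sketch below shows one way to implement a Top-8 Freeze for a Llama-style Hugging Face model whose decoder blocks are exposed at `model.model.layers`; the checkpoint name and layer path are assumptions, not details from the paper.

```python
# Top-8 Freeze sketch: freeze the final 8 decoder blocks, fine-tune the rest.
# Assumes a Llama-style architecture exposing decoder blocks at model.model.layers.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder checkpoint

for block in model.model.layers[-8:]:        # the top (last) 8 transformer blocks
    for param in block.parameters():
        param.requires_grad = False          # excluded from gradient updates

# Full SFT corresponds to leaving every parameter trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable}/{total}")
```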
Furthermore, the study investigates selection methods that control the retrieval process for question answering (an SNR-style selection sketch follows this list), focusing on:
- Random (50%): A baseline that retains retrieved candidates at random (at a 50% rate).
- SVDR-only (Sparse Vector Retrieval - only): Leveraging sparse vector retrieval.
- SWCI-only (Semantic Weighted Context Integration – only): Incorporating semantic weighting into the context integration process.
- SNR-only (Signal-to-Noise Ratio - only): Utilizing SNR metrics to filter irrelevant context.
- SPEAR-MM (Combined): Integrates the individual selection signals above into a single combined SPEAR-MM strategy.
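The summary does not spell out the paper's selection criteria, so the sketch below only illustrates the general shape of an SNR-style filter: score each retrieved passage by how far its relevance signal stands out from the noise across the candidate pool, and keep only passages above a threshold. The scoring rule and threshold are illustrative assumptions, not the authors' definitions.

```python
# SNR-style context selection sketch (illustrative scoring, not the paper's method).
from statistics import mean, stdev
from typing import List, Tuple

def snr_filter(candidates: List[Tuple[str, float]], threshold: float = 1.0) -> List[str]:
    """Keep passages whose relevance score exceeds the pool mean by at least
    `threshold` standard deviations (a crude signal-to-noise criterion)."""
    scores = [score for _, score in candidates]
    mu = mean(scores)
    sigma = stdev(scores) if len(scores) > 1 else 0.0
    kept = []
    for passage, score in candidates:
        snr = (score - mu) / sigma if sigma > 0 else 0.0
        if snr >= threshold:
            kept.append(passage)
    return kept

# Usage: candidates are (passage, relevance_score) pairs from a retriever.
contexts = snr_filter([("passage A", 0.92), ("passage B", 0.41), ("passage C", 0.38)])
```

By contrast, the Random (50%) baseline would simply keep each candidate with probability 0.5, independent of its score.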
The experimental setup involved running each model and prompting method across the selected benchmarks, meticulously recording accuracy and retention rates.
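The summary reports both accuracy and retention rates but does not define the latter; one common reading, assumed here, is the fraction of a baseline configuration's accuracy that a given method preserves. The helpers below compute both quantities under that assumption.

```python
# Accuracy and retention-rate helpers. The retention definition is an assumption,
# since the summary does not specify how the paper computes it.
from typing import Sequence

def accuracy(correct_flags: Sequence[bool]) -> float:
    """Fraction of examples answered correctly."""
    return sum(correct_flags) / max(len(correct_flags), 1)

def retention_rate(method_accuracy: float, baseline_accuracy: float) -> float:
    """Share of the baseline's accuracy retained by a given method/configuration."""
    return method_accuracy / baseline_accuracy if baseline_accuracy > 0 else 0.0

# Example: a frozen configuration scoring 0.81 against a Full SFT baseline of 0.90
# retains about 90% of the baseline's accuracy.
print(retention_rate(0.81, 0.90))  # ~0.9
```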
Findings & Results: The results paint a highly variable landscape of LLM performance. Across the core benchmarks (ARC-C, GSM8K, MMLU, GPQA), models adapted with Full CPT and Full SFT consistently outperformed the Random Freeze and Top-8 Freeze baselines, often exceeding 90% accuracy on MMLU and approaching 100% on ARC-C. The SPEAR-MM methods, particularly the 'Aggressive' variant, yielded the largest gains across multiple benchmarks, underscoring the value of targeted prompt engineering. The selection methods (SVDR-only, SWCI-only, SNR-only, and the combined SPEAR-MM) also had a notable influence: SNR-only achieved relatively high retention rates on MATH-HARD, yet its benefit varied markedly by benchmark, and it performed comparatively poorly on more open-ended reasoning tasks such as MMLU. Overall, Full CPT and Full SFT delivered the highest accuracy across all benchmarks, suggesting that pairing full fine-tuning with a well-chosen prompting strategy is what maximizes performance on the investigated set of tasks.
Limitations: This research primarily focuses on a specific set of benchmarks and prompting techniques. The study lacks investigation into other prompting methods beyond those explicitly tested (e.g., Chain-of-Thought prompting, Tree of Thoughts). Furthermore, the evaluation primarily centers around accuracy; it does not delve deeply into the reasoning process itself, leaving open questions about whether the observed performance is due to genuine understanding or simply sophisticated pattern matching. The evaluation largely relies on readily available datasets, potentially limiting the generalization of the findings to more complex or real-world scenarios. Finally, the study’s focus on specific model architectures (while not explicitly stated, the use of 'freeze' methods strongly implies the use of established LLM architectures) may restrict the applicability of the results to models with significantly different designs.
Future Work & Outlook: This research lays a crucial foundation for future investigations. Further work should explore a broader range of prompting techniques, including techniques designed to explicitly elicit reasoning processes from LLMs. Developing more sophisticated metrics beyond accuracy, incorporating measures of 'explainability' and 'confidence,' is critical. Exploring the impact of different model architectures and training strategies on these newly developed prompting methods will be essential. Finally, moving beyond static benchmarks towards dynamically generated, adaptive evaluation environments will be vital for truly assessing the robustness and adaptability of LLMs in real-world applications. The growing emphasis on “agentic” AI, where LLMs are integrated with external tools and memory systems, demands new evaluation frameworks that can measure not just individual performance but also the ability of the system to achieve complex goals.
Avichala Commentary: This research offers a valuable, albeit narrowly focused, contribution to the burgeoning field of LLM evaluation. It reinforces the increasingly clear picture that LLMs are not inherently intelligent but rather excel at pattern recognition and statistical prediction. The diverse findings underscore the urgent need for a more nuanced understanding of how LLMs “reason,” highlighting the crucial role of prompt engineering. As AI agents become more sophisticated, capable of interacting with the world and autonomously achieving goals, the evaluation of LLMs will shift from simple accuracy metrics to encompass measurable reasoning abilities and the ability to adapt to novel situations – a shift that is fundamentally driven by the complexity and unpredictability of the real world. This research provides a key data point in what promises to be a highly dynamic and iterative research landscape.
Link to the Arxiv: https://arxiv.org/abs/2511.08500
© 2025 Avichala Research & Education Team. Explore more summaries at www.avichala.com/research.