The Surprising Effectiveness of Test-Time Training for Abstract Reasoning

Ekin Akyürek, Mehul Damani, Linlu Qiu, Han Guo, Yoon Kim, Jacob Andreas (Massachusetts Institute of Technology)

In recent years, language models have made remarkable advancements in handling a wide array of tasks within their training distribution, from language translation to complex text generation. However, when faced with novel problems that demand higher-order reasoning and abstract thinking, these models often fall short. This white paper explores a transformative approach known as Test-Time Training (TTT), which dynamically updates model parameters during inference to enhance reasoning capabilities. Using the Abstraction and Reasoning Corpus (ARC) as a benchmark, this study examines how language models, with the aid of TTT, can approach human-level reasoning on unfamiliar tasks. By delving into the design and implementation of TTT, the paper highlights a significant step toward achieving generalizable intelligence in AI, pushing the boundaries of what language models can accomplish beyond their initial training.

Abstract

Language models have shown impressive performance on tasks within their training distribution, but often struggle with novel problems requiring complex reasoning. We investigate the effectiveness of test-time training (TTT)—updating model parameters temporarily during inference using a loss derived from input data—as a mechanism for improving models’ reasoning capabilities, using the Abstraction and Reasoning Corpus (ARC) as a benchmark. Through systematic experimentation, we identify three crucial components for successful TTT: (1) initial fine-tuning on similar tasks, (2) auxiliary task format and augmentations, and (3) per-instance training. TTT significantly improves performance on ARC tasks, achieving up to a 6× improvement in accuracy compared to base fine-tuned models; applying TTT to an 8B-parameter language model, we achieve 53% accuracy on ARC’s public validation set, improving the state of the art for public, purely neural approaches by nearly 25%. By ensembling our method with recent program generation approaches, we achieve a state-of-the-art public validation accuracy of 61.9%, matching the average human score. Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models; additional test-time compute applied to continued training on few-shot examples can also be extremely effective.

1 Introduction

Large-scale neural language models (LMs) excel at performing tasks that occur in their training data, and often elementary variations or compositions of those tasks (Brown et al., 2020; Todd et al., 2024). Given natural language task specifications or a small number of examples, LMs often successfully infer the desired task and produce an appropriate output. But can LMs also solve new problems, involving non-trivial reasoning, planning, or string manipulation of a kind very different from their pre-training data? This question is central to understanding the novel skill acquisition capabilities of current AI systems, which has been proposed as a key measure of intelligence (Chollet, 2019).

 

For complex and novel tasks, it is often difficult to obtain a correct answer simply by sampling from an LM (Wu et al., 2023). However, a significant finding in recent years has been that LM performance can be substantially improved by augmenting LM decoding with additional test-time computation. Methods in this category include chain-of-thought prompting (Wei et al., 2022), sampling with majority voting (self-consistency; Wang et al., 2022), code execution (Brown et al., 2024; Snell et al., 2024; Damani et al., 2024), and search (Yao et al., 2024).

One scaling strategy that has gained recent attention is test-time training (TTT), in which models are updated through explicit gradient steps based on test-time inputs (Krause et al., 2018; 2019). This method differs from standard fine-tuning in that it operates in an extremely low-data regime—typically via an unsupervised objective on a single input, or a supervised objective applied to one or two in-context labeled examples. Modern versions of this approach were first proposed for vision models by Sun et al. (2020), and later applied to sequence models by Gandelsman et al. (2022). The design space for TTT approaches is large, and there is currently a limited understanding of which design choices are most effective for LMs (and specifically for novel-task learning). In this paper, we systematically study the impact of various TTT design choices, as well as their interaction with pre-training and sampling schemes.

We evaluate these methods on the Abstraction and Reasoning Corpus (ARC) (Chollet, 2019), a collection of extremely challenging few-shot visual reasoning problems. ARC is an ideal benchmark for testing the limits of LM generalization, as it presents novel tasks in a novel format, requiring nontrivial search and inference capabilities. Current language models perform poorly on ARC. The most successful approaches have relied on program synthesis techniques (Butt et al., 2024; Ainooson et al., 2023; Huang et al., 2023), though recently Cole et al. (2024) reported promising results using TTT on the benchmark.

We identify several crucial ingredients for effective application of TTT to few-shot learning: (1) initial fine-tuning on synthetic tasks similar to those encountered at test time, (2) an augmented, leave-one-out task generation strategy for constructing the test-time dataset, (3) per-instance adapter training, and (4) a self-consistency (Wang et al., 2022) approach under invertible transformations. With careful choices of these components, TTT can significantly improve LM performance on ARC—increasing accuracy by up to a factor of six over a 1B model, and achieving state-of-the-art results for published, purely neural models on the ARC task with an 8B model. Indeed, our results show that when equipped with test-time training, ordinary LMs can match or exceed the performance of many neuro-symbolic approaches on ARC.

Our main contributions¹ are:

    1. We identify and systematically analyze the key components needed for test-time training on ARC tasks, introducing a novel test-time training data generation and self-consistency component.
    2. We achieve state-of-the-art results among published neural approaches on the ARC validation set:
      • 53% accuracy on the public validation set with an 8B-parameter model.
      • 61.9% accuracy when ensembled with program synthesis approaches, comparable to average human performance on the dataset.
    3. We demonstrate that tasks previously solvable only by program synthesis can now be solved by fully neural approaches equipped with our TTT framework.

These results challenge the assumption that symbolic components are strictly necessary for solving such complex tasks. Instead, they suggest that the critical factor in solving novel reasoning problems may be the proper allocation of computational resources at test time, perhaps independently of whether those resources are deployed through symbolic or neural mechanisms.


 

2 Preliminaries

In this section, we first formally describe the ARC challenge. Next, we give an overview of in-context learning and test-time training, which form the foundation of our investigation. Finally, we detail our default experimental setup.
¹ Our implementation can be found at this link.

2.1 ARC Challenge

The Abstraction and Reasoning Corpus (ARC) aims to evaluate the abstract reasoning capabilities of language models through their ability to solve visual puzzles. Each puzzle, henceforth referred to as a task, is composed of input-output pairs of 2-D grids (up to 30 × 30 in size) that contain shapes or patterns made with up to 10 different colors, as displayed in Fig. 1(b). The output of each pair is obtained by applying an intuitive and shared transformation rule or function y = f(x). In practice, these transformations are highly diverse and composite, ranging from simple concepts such as reflection and counting to more complex ones such as the application of gravity and pathfinding.

Each task in ARC is composed of a training and test split, with:

  • Training examples denoted (x_k, y_k), k = 1, …, K (typically K ranges from 2 to 7).
  • Test examples denoted (x^test_m, y^test_m), m = 1, …, M (typically M ranges from 1 to 3).

Given the set of training examples, the goal is to predict the test output y^test for the test input x^test by reasoning about the underlying transformation.

We denote a task as d = (x^train, y^train, x^test, y^test), where d ∈ D_ARC, the collection of such ARC tasks. The original training and validation sets of the ARC dataset, D^train_ARC and D^val_ARC, consist of 400 tasks each. The success criterion requires producing an exact match for all test outputs (if not all match, partial credit is given). Please refer to Johnson et al. (2021) for a taxonomy and analysis of these tasks.
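To make this task structure concrete, below is a minimal sketch of an ARC-style task in the JSON layout used by the public ARC repository, together with the exact-match success criterion. The grids and the flip rule are invented for illustration, and solves is a hypothetical helper, not part of the benchmark code.

```python
# A toy, hypothetical ARC-style task in the official JSON layout
# (github.com/fchollet/ARC): grids are lists of rows, cells are colors 0-9.
# The shared rule f here (invented for illustration) is a horizontal flip.
task = {
    "train": [  # the K demonstration pairs (typically 2-7)
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [   # the M held-out pairs (typically 1-3)
        {"input": [[0, 5, 0], [4, 0, 0]], "output": [[0, 5, 0], [0, 0, 4]]},
    ],
}

def solves(pred_grid, pair):
    """Success criterion for one test pair: exact cell-for-cell match."""
    return pred_grid == pair["output"]

flip = lambda g: [row[::-1] for row in g]   # the underlying y = f(x)
assert all(solves(flip(p["input"]), p) for p in task["train"] + task["test"])
```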

Most approaches to ARC fall into two main categories: program synthesis and fully neural. Program synthesis approaches (Butt et al., 2024; Wang et al., 2024; Li et al., 2024; Greenblatt, 2024) first try to find the transformation function f, and later apply it to the test example. On the other hand, fully neural approaches (Thoms et al., 2023; Bober-Irizar and Banerjee, 2024) try to directly predict the output y^test, only implicitly reasoning about the underlying transformation. In this work, we use a fully neural approach, using an LM to predict the test outputs.

We start with an LM pre-trained on text data (without a vision encoder). To provide ARC examples as input to these models, we thus require a formatting function (denoted str) that converts 2-D grids into their textual representations, as shown in Appendix A.3. Previous work has presented examples as lists of numbers (Wang et al., 2024) or color words, or as lists of connected components labeled with shapes and locations (Greenblatt, 2024). Given any such string representation of a task, we may present it to an LM and perform predictions with few-shot prompting, as explained in the next section.
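As a minimal sketch of such a formatting function, the snippet below uses numpy's default array printing, which Section 2.4 reports as the representation used in this work; the name grid_to_str is ours, standing in for the paper's str.

```python
import numpy as np

def grid_to_str(grid):
    """A str()-style formatter using numpy's default array printing
    (the representation this paper reports using; Section 2.4 / Fig. 8)."""
    return np.array2string(np.array(grid, dtype=int))

print(grid_to_str([[0, 1], [2, 0]]))
# [[0 1]
#  [2 0]]
```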

2.2 In-context Learning
At a certain scale, many LMs exhibit the ability to adapt to new tasks without updating their parameters, simply by conditioning on the input examples or instructions provided. Given a sequence of input-output pairs (x_1, y_1), …, (x_n, y_n) and a new input x_{n+1}, an LM can be used to generate the output ŷ_{n+1} by sampling from:

ŷ_{n+1} ∼ LM(· | x_1, y_1, …, x_n, y_n, x_{n+1})    (1)

The possibility that in-context learning implicitly simulates machine learning algorithms has been discussed in previous work (Akyürek et al., 2022), but empirical evidence shows that in-context learning with language models does not always resemble any standard machine learning algorithm (Zhao et al., 2024; Min et al., 2022), and it does not always work out of the box for novel tasks: for example, small language models (a few billion parameters) perform poorly on ARC (Opielka et al., 2024; Bober-Irizar and Banerjee, 2024).
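To make Eq. (1) concrete, here is a hedged sketch of few-shot prompt construction. The "Input:/Output:" template is an assumption for illustration, not the paper's exact format; fmt is any grid-to-text formatter, such as the grid_to_str sketched above.

```python
# Build the few-shot prompt of Eq. (1): serialize the n demonstration pairs
# followed by the new input x_{n+1}; the LM's sampled continuation is ŷ_{n+1}.
def build_prompt(train_pairs, test_input, fmt=str):
    parts = []
    for x, y in train_pairs:                 # (x_1, y_1), ..., (x_n, y_n)
        parts.append(f"Input:\n{fmt(x)}\nOutput:\n{fmt(y)}\n")
    parts.append(f"Input:\n{fmt(test_input)}\nOutput:\n")  # x_{n+1}
    return "\n".join(parts)

print(build_prompt([([[1, 0]], [[0, 1]])], [[2, 0]]))
```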

2.3 Test-Time Training
Test-time training (TTT) enables parametric models to adapt during inference through dynamic parameter updates, an approach that remains relatively unexplored in the era of large language models. This technique is a form of transductive learning, in which a model leverages the structure of the test data to improve its predictions. The general TTT process works as follows: starting with initial model parameters θ_0, for each test input (or batch of inputs) d_input, we first generate training data D_TTT(d_input) from the test inputs. We then optimize these parameters to minimize a loss function L(D_TTT; θ), producing temporarily updated parameters θ_d for prediction. After generating predictions, the model is restored to the original parameters θ_0 for the next instance or batch. Thus, TTT trains a specialized prediction model for each test input, obtained by fine-tuning a base model on a test-time dataset generated from that test input.

Figure 2: TTT dataset generation for a test task (Section 3.1): We start by creating leave-one-out tasks from the given training examples of the task. These tasks are then augmented through rule-based transformations to obtain the full TTT dataset. Finally, we train task-specific LoRA adapters on top of the base FT model.

In past work (e.g., Sun et al., 2020), D_TTT is typically constructed by applying an unsupervised objective (e.g., masked autoencoding) to the input x alone. However, the in-context learning setting we consider provides richer context in the form of demonstration pairs (x_1, y_1), …, (x_K, y_K). Here, applying test-time tuning involves first constructing an initial language model LM, mapping each test input x to an input-specific dataset D_TTT, fine-tuning the LM to optimize a loss function L over that dataset, Σ_{d ∈ D_TTT} L(LM(d)), and finally sampling from the updated model to obtain a final prediction. Our experiments in this paper characterize each component of this pipeline, describing:

  1. How to construct the augmented TTT dataset D_TTT from the test input (Section 3).
  2. An augmented inference strategy based on self-consistency over transformations (Section 4).
  3. A base model with parameters θ_0 that is fine-tuned on a dataset D_FT of similar tasks (Section 5).
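Before turning to these components, the toy example below illustrates the bare TTT protocol of this section on a one-parameter model: for each instance, fine-tune on an instance-specific dataset, predict, then restore θ_0. It is a didactic stand-in under simplified assumptions, not the paper's LoRA-based setup.

```python
# Toy TTT loop: per-instance gradient updates on y = w * x, then reset to θ_0.
THETA_0 = 0.5                                   # initial (fine-tuned) parameter

def mse_grad(w, data):
    """Gradient of mean squared error over (x, y) pairs with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def ttt_predict(test_instances, lr=0.05, steps=20):
    preds = []
    for demos, x_test in test_instances:        # demos play the role of D_TTT
        w = THETA_0                             # start from θ_0 every time
        for _ in range(steps):
            w -= lr * mse_grad(w, demos)        # θ_0 -> θ_d
        preds.append(w * x_test)                # predict with θ_d
        # w is discarded here: the next instance starts again from θ_0
    return preds

print(ttt_predict([([(1.0, 2.0), (2.0, 4.0)], 3.0)]))  # ≈ [5.99], i.e. y ≈ 2x
```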

2.4 Experimental Setup
To investigate the impact of each TTT component, we conduct experiments by varying one component while holding the others constant at their optimal values (described in their respective sections). Our default configuration in the experiments uses the following settings:

Model Architecture & Optimization We use the 8B-parameter language model from the Llama 3 family and the 1B and 3B models from the Llama 3.2 family (Dubey et al., 2024). We use Low-Rank Adaptation (LoRA) (Hu et al., 2021) for parameter-efficient test-time training. For each task d, we initialize a separate set of LoRA parameters that are trained on the dataset D_TTT. The LoRA rank is set to 128, and adaptations are applied to the MLP, attention, and output layers. We train models with the AdamW optimizer (Loshchilov and Hutter, 2019) for 2 epochs with a batch size of 2.
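A sketch of this per-task adapter setup, assuming the Hugging Face transformers and peft libraries: the rank and target layers follow the defaults above, while the checkpoint id, lora_alpha, and dropout values are illustrative assumptions (the paper's exact hyper-parameters are in its Appendix B.2).

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base fine-tuned model (checkpoint id shown for illustration).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

config = LoraConfig(
    r=128,                     # LoRA rank used in the paper's default setup
    lora_alpha=128,            # assumed; not specified in this excerpt
    lora_dropout=0.0,          # assumed
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention layers
        "gate_proj", "up_proj", "down_proj",      # MLP layers
        "lm_head",                                # output layer
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)  # a fresh adapter is trained per task d
```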

Data & Formatting For efficient evaluation, we randomly pick 80 balanced ARC tasks from the ARC validation set, comprising 20 easy, 20 medium, 20 hard, and 20 expert tasks according to the classification in LeGris et al. (2024a) (see Appendix A.2 for the task list). We use this subset of ARC tasks throughout the paper, except for our final results, which are reported on the full validation set (Section 6). For efficiency, we limit D_TTT to a maximum of 250 examples per task. With that, the whole TTT and inference process takes approximately 12 hours for 100 randomly sampled validation tasks on an NVIDIA A100 GPU. Appendix B.2 provides additional details on the hyper-parameters. Input grids are converted to text using numpy's default array printing format, as shown in Fig. 8.

In the following sections, we investigate the key factors that contribute to successful abstract reasoning with language models. Our analysis covers the impact of the fine-tuning data D_FT, the TTT data D_TTT, training objectives, inference procedures, and model size, providing insights into effective strategies for deploying test-time training.


AI Technologies Discussed

The white paper discusses several advanced AI and ML technologies, particularly in the context of improving reasoning abilities in language models. Here’s a breakdown of the specific technologies and methods involved:

1. Test-Time Training (TTT)

  • Definition: TTT is a method where a model’s parameters are updated during inference (testing) rather than during the standard training phase. This enables real-time adjustments to the model’s understanding of a given task.
  • Application: TTT is used to improve the performance of language models on novel tasks by allowing models to learn from specific examples during test time. It achieves this by applying transformations and using task-specific data to refine predictions.

2. Abstraction and Reasoning Corpus (ARC) Benchmark

  • Purpose: The ARC benchmark is designed to test AI’s ability for abstract reasoning. It consists of visual and logical puzzles that require a mix of pattern recognition, symbolic reasoning, and extrapolation—skills necessary for AGI.
  • Usage: ARC serves as the primary dataset for evaluating the effectiveness of TTT in the study, as it presents reasoning challenges beyond what the models encountered during pre-training.

3. Fine-Tuning with Low-Rank Adaptation (LoRA)

  • Description: LoRA is a parameter-efficient fine-tuning technique that trains only small low-rank update matrices added to the model’s weights, rather than adjusting all model parameters.
  • Benefits: This approach allows efficient fine-tuning at test time, preserving computational resources while enabling the model to adapt to individual tasks dynamically.

4. Data Augmentation with Geometric Transformations

  • Techniques: The paper details geometric transformations, such as rotation, flipping, transposing, and reflecting, applied to training data to generate augmented datasets. These transformations help create diverse views of tasks, enriching the TTT dataset and improving generalization.
  • Purpose: Data augmentation via transformations supports the model’s ability to interpret tasks under different representations, enhancing its adaptability to new reasoning tasks, as sketched below.
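A minimal sketch of such rule-based grid augmentations with numpy; the transform list is illustrative and narrower than the full set used in the paper.

```python
import numpy as np

# Each transform is applied consistently to every input and output grid of a
# task, yielding a new, equivalent task for the TTT dataset.
TRANSFORMS = {
    "identity":  lambda g: g,
    "rot90":     lambda g: np.rot90(g, k=1),
    "rot180":    lambda g: np.rot90(g, k=2),
    "flip_h":    np.fliplr,
    "flip_v":    np.flipud,
    "transpose": lambda g: g.T,
}

def augment_task(pairs, name):
    """Apply one named transform to all (input, output) grids of a task."""
    t = TRANSFORMS[name]
    return [(t(x), t(y)) for x, y in pairs]

pairs = [(np.array([[1, 0], [0, 2]]), np.array([[2, 0], [0, 1]]))]
print(augment_task(pairs, "rot90"))
```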

5. Hierarchical Voting for Inference

  • Methodology: The study employs a hierarchical voting strategy in which predictions generated through transformed task versions are aggregated. This involves two stages:
    • Intra-Transformation Voting: Voting within each set of transformed predictions.
    • Global Voting: A final vote across the top candidates from each transformation.
  • Outcome: This ensemble technique improves the reliability of predictions by choosing the most likely correct answer across multiple perspectives, significantly enhancing accuracy; a sketch follows.
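The sketch below illustrates this two-stage vote. Candidate grids are stood in for by strings (hashable, so they can be counted); the function name and the top-k cutoff are assumed details, not the paper's exact configuration.

```python
from collections import Counter

def hierarchical_vote(preds_by_transform, k=2):
    finalists = []
    for transform, preds in preds_by_transform.items():
        # Stage 1 (intra-transformation): keep the k most frequent candidates.
        finalists.extend(p for p, _ in Counter(preds).most_common(k))
    # Stage 2 (global): majority vote across the finalists of all transforms.
    return Counter(finalists).most_common(1)[0][0]

votes = {
    "identity": ["A", "A", "B"],
    "rot90":    ["A", "C"],
    "flip_h":   ["A", "C", "C"],
}
print(hierarchical_vote(votes))  # "A" wins 3 of the 6 finalist slots
```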

6. Self-Consistency via Augmented Inference

  • Concept: Self-consistency, as applied here, involves generating multiple prediction candidates through transformations and aggregating them to achieve consensus in the final answer.
  • Execution: By creating multiple versions of the task through geometric transformations and using a self-consistent approach, the model can better handle ambiguity and improve decision-making accuracy, as sketched below.
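A sketch of one such round trip: present the model with a transformed version of the task, then map its answer back with the inverse transform, so that all candidates share the original orientation and can be compared. Here predict is a placeholder for sampling from the (test-time-trained) LM.

```python
import numpy as np

# Forward/inverse pairs; geometric grid transforms are invertible by design.
TRANSFORM_PAIRS = {
    "rot90":  (lambda g: np.rot90(g, 1), lambda g: np.rot90(g, -1)),
    "flip_h": (np.fliplr, np.fliplr),    # a horizontal flip is self-inverse
}

def predict_with_transform(predict, pairs, x_test, name):
    fwd, inv = TRANSFORM_PAIRS[name]
    t_pairs = [(fwd(x), fwd(y)) for x, y in pairs]
    y_hat = predict(t_pairs, fwd(x_test))   # model sees the transformed task
    return inv(y_hat)                       # candidate in original orientation

identity_model = lambda pairs, x: x          # dummy stand-in for the LM
x = np.array([[1, 0], [0, 2]])
print(predict_with_transform(identity_model, [], x, "rot90"))  # recovers x
```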

7. Program Synthesis Integration

  • Description: Program synthesis involves creating programs or function-based solutions that the model can follow to solve a problem. Traditionally, program synthesis has been essential for ARC tasks, as it uses structured, rule-based approaches to abstract reasoning.
  • Significance: The study combines program synthesis with TTT to maximize performance on ARC, demonstrating that integrating symbolic (programmatic) and neural methods can yield state-of-the-art results.

8. Synthetic Data Generation Using Language Models

  • Technique: The researchers generate synthetic training tasks using large language models, such as GPT-4, in few-shot settings to produce task descriptions and new generators.
  • Utility: This synthetic data generation enriches the fine-tuning dataset by expanding the variety of tasks, thus allowing the base model to learn and generalize over a broader range of examples.

9. Quantized LoRA (QLoRA) for Memory Efficiency

  • Purpose: QLoRA is a memory-efficient variant of LoRA that reduces memory requirements by quantizing the frozen base model’s weights while training the LoRA adapters in higher precision.
  • Application: This technique allows the model to perform task-specific fine-tuning in memory-constrained environments without significantly compromising performance; a sketch follows.
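A hedged sketch of this setup via the QLoRA recipe (Dettmers et al., 2023), using the Hugging Face BitsAndBytesConfig API; the checkpoint id and quantization settings are illustrative assumptions.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize the frozen base model
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb
)
# Only the small, higher-precision LoRA weights are trained per task.
model = get_peft_model(base, LoraConfig(r=128, task_type="CAUSAL_LM"))
```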

10. Multi-Model Ensembling

  • Approach: By ensembling TTT-augmented neural models with program synthesis-based models, the study achieves a state-of-the-art accuracy comparable to human-level performance.
  • Significance: This ensembling of neural and symbolic approaches highlights the complementary strengths of both methodologies, showing that they can be used together to approach AGI-like capabilities in abstract reasoning.

These technologies collectively point toward a future where language models can perform complex, adaptive reasoning across various domains, suggesting that AGI may emerge through enhanced adaptability and multi-technique integration in AI systems.
