The Surprising Effectiveness of Test-Time Training for Abstract Reasoning

Ekin Akyürek, Mehul Damani, Linlu Qiu, Han Guo, Yoon Kim, Jacob Andreas (Massachusetts Institute of Technology)

In recent years, language models have made remarkable advancements in handling a wide array of tasks within their training distribution, from language translation to complex text generation. However, when faced with novel problems that demand higher-order reasoning and abstract thinking, these models often fall short. This white paper explores a transformative approach known as Test-Time Training (TTT), which dynamically updates model parameters during inference to enhance reasoning capabilities. Using the Abstraction and Reasoning Corpus (ARC) as a benchmark, this study examines how language models, with the aid of TTT, can approach human-level reasoning on unfamiliar tasks. By delving into the design and implementation of TTT, the paper highlights a significant step toward achieving generalizable intelligence in AI, pushing the boundaries of what language models can accomplish beyond their initial training.

Abstract

Language models have shown impressive performance on tasks within their training distribution, but often struggle with novel problems requiring complex reasoning. We investigate the effectiveness of test-time training (TTT)—updating model parameters temporarily during inference using a loss derived from input data—as a mechanism for improving models’ reasoning capabilities, using the Abstraction and Reasoning Corpus (ARC) as a benchmark. Through systematic experimentation, we identify three crucial components for successful TTT: (1) initial fine-tuning on similar tasks, (2) auxiliary task format and augmentations, and (3) per-instance training. TTT significantly improves performance on ARC tasks, achieving up to a 6× improvement in accuracy compared to base fine-tuned models; applying TTT to an 8B-parameter language model, we achieve 53% accuracy on ARC’s public validation set, improving the state of the art for public, purely neural approaches by nearly 25%. By ensembling our method with recent program generation approaches, we achieve a state-of-the-art public validation accuracy of 61.9%, matching the average human score. Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models; additional test-time compute applied to continued training on few-shot examples can also be extremely effective.

1 Introduction

Large-scale neural language models (LMs) excel at performing tasks that occur in their training data, and often elementary variations or compositions of those tasks (Brown et al., 2020; Todd et al., 2024). Given natural language task specifications or a small number of examples, LMs often successfully infer the desired task and produce an appropriate output. But can LMs also solve new problems, involving non-trivial reasoning, planning, or string manipulation of a kind very different from their pre-training data? This question is central to understanding the novel skill acquisition capabilities of current AI systems, which has been proposed as a key measure of intelligence (Chollet, 2019).

 

For complex and novel tasks, it is often difficult to obtain a correct answer simply by sampling from an LM (Wu et al., 2023). However, a significant finding in recent years has been that LM performance can be substantially improved by augmenting LM decoding with additional test-time computation. Methods in this category include chain-of-thought prompting (Wei et al., 2022), sampling with majority voting (self-consistency; Wang et al., 2022), code execution (Brown et al., 2024; Snell et al., 2024; Damani et al., 2024), and search (Yao et al., 2024).

One scaling strategy that has gained recent attention is test-time training (TTT), in which models are updated through explicit gradient steps based on test-time inputs (Krause et al., 2018; 2019). This method differs from standard fine-tuning in that it operates in an extremely low-data regime—typically via an unsupervised objective on a single input, or a supervised objective applied to one or two in-context labeled examples. Modern versions of this approach were first proposed for vision models by Sun et al. (2020), and later applied to sequence models by Gandelsman et al. (2022). The design space for TTT approaches is large, and there is currently a limited understanding of which design choices are most effective for LMs (and specifically for novel-task learning). In this paper, we systematically study the impact of various TTT design choices, as well as their interaction with pre-training and sampling schemes.

We evaluate these methods on the Abstraction and Reasoning Corpus (ARC) (Chollet, 2019), a collection of extremely challenging few-shot visual reasoning problems. ARC is an ideal benchmark for testing the limits of LM generalization, as it presents novel tasks in a novel format, requiring nontrivial search and inference capabilities. Current language models perform poorly on ARC. The most successful approaches have relied on program synthesis techniques (Butt et al., 2024; Ainooson et al., 2023; Huang et al., 2023), though recently Cole et al. (2024) reported promising results using TTT on the benchmark.

We identify several crucial ingredients for effective application of TTT to few-shot learning: (1) initial fine-tuning on synthetic tasks similar to those encountered at test time, (2) an augmented, leave-one-out task generation strategy for constructing the test-time dataset, (3) per-instance adapter training, and (4) a self-consistency (Wang et al., 2022) approach under invertible transformations. With careful choices of these components, TTT can significantly improve LM performance on ARC—increasing accuracy by up to a factor of six over a 1B model, and achieving state-of-the-art results for published, purely neural models on the ARC task with an 8B model. Indeed, our results show that when equipped with test-time training, ordinary LMs can match or exceed the performance of many neuro-symbolic approaches on ARC.

Our main contributions¹ are:

    1. We identify and systematically analyze the key components needed for test-time training on ARC tasks, introducing a novel test-time training data generation and self-consistency component.
    2. We achieve state-of-the-art results among published neural approaches on the ARC validation set:
      • 53% accuracy on the public validation set with an 8B-parameter model.
      • 61.9% accuracy when ensembled with program synthesis approaches, comparable to average human performance on the dataset.
    3. We demonstrate that tasks previously solvable only by program synthesis can now be solved by fully neural approaches equipped with our TTT framework.

These results challenge the assumption that symbolic components are strictly necessary for solving such complex tasks. Instead, they suggest that the critical factor in solving novel reasoning problems may be the proper allocation of computational resources at test time, perhaps independently of whether those resources are deployed through symbolic or neural mechanisms.


 

2 Preliminaries

In this section, we first formally describe the ARC challenge. Next, we give an overview of in-context learning and test-time training, which form the foundation of our investigation. Finally, we detail our default experimental setup.
¹ Our implementation can be found at this link.

2.1 ARC Challenge

The Abstraction and Reasoning Corpus (ARC) aims to evaluate the abstract reasoning capabilities of language models through their ability to solve visual puzzles. Each puzzle, henceforth referred to as a task, is composed of input-output pairs of 2-D grids (up to 30 × 30 in size) that contain shapes or patterns made with up to 10 different colors, as displayed in Fig. 1(b). The output of each pair is obtained by applying an intuitive and shared transformation rule or function y = f(x). In practice, these transformations are highly diverse and composite, ranging from simple concepts such as reflection and counting to more complex ones such as the application of gravity and pathfinding.

Each task in ARC is composed of a training and test split, with:

  • Training examples denoted (x_k, y_k), k = 1, …, K (typically K ranges from 2 to 7).
  • Test examples denoted (x^test_m, y^test_m), m = 1, …, M (typically M ranges from 1 to 3).

Given the set of training examples, the goal is to predict the test output y^test for the test input x^test by reasoning about the underlying transformation.

We denote a task as d = (x^train, y^train, x^test, y^test), where d ∈ D_ARC, the collection of such ARC tasks. The original training and validation sets of the ARC dataset, D^train_ARC and D^val_ARC, consist of 400 tasks each. The success criterion requires producing an exact match for all test outputs (if not all match, partial credit is given). Please refer to Johnson et al. (2021) for a taxonomy and analysis of these tasks.
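To make this task structure concrete, below is a minimal sketch of an ARC-style task in the JSON layout used by the public ARC repository, together with the exact-match success criterion. The grids and the flip rule are invented for illustration, and solves is a hypothetical helper, not part of the benchmark code.

```python
# A toy, hypothetical ARC-style task in the official JSON layout
# (github.com/fchollet/ARC): grids are lists of rows, cells are colors 0-9.
# The shared rule f here (invented for illustration) is a horizontal flip.
task = {
    "train": [  # the K demonstration pairs (typically 2-7)
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [   # the M held-out pairs (typically 1-3)
        {"input": [[0, 5, 0], [4, 0, 0]], "output": [[0, 5, 0], [0, 0, 4]]},
    ],
}

def solves(pred_grid, pair):
    """Success criterion for one test pair: exact cell-for-cell match."""
    return pred_grid == pair["output"]

flip = lambda g: [row[::-1] for row in g]   # the underlying y = f(x)
assert all(solves(flip(p["input"]), p) for p in task["train"] + task["test"])
```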

Most approaches to ARC fall into two main categories: program synthesis and fully neural. Program synthesis approaches (Butt et al., 2024; Wang et al., 2024; Li et al., 2024; Greenblatt, 2024) first try to find the transformation function f, and later apply it to the test example. On the other hand, fully neural approaches (Thoms et al., 2023; Bober-Irizar and Banerjee, 2024) try to directly predict the output y^test, only implicitly reasoning about the underlying transformation. In this work, we use a fully neural approach, using an LM to predict the test outputs.

We start with an LM pre-trained on text data (without a vision encoder). To provide ARC examples as input to these models, we thus require a formatting function (denoted str) that converts 2-D grids into their textual representations, as shown in Appendix A.3. Previous work has presented examples as lists of numbers (Wang et al., 2024) or color words, or as lists of connected components labeled with shapes and locations (Greenblatt, 2024). Given any such string representation of a task, we may present it to an LM and perform predictions with few-shot prompting, as explained in the next section.
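As a minimal sketch of such a formatting function, the snippet below uses numpy's default array printing, which Section 2.4 reports as the representation used in this work; the name grid_to_str is ours, standing in for the paper's str.

```python
import numpy as np

def grid_to_str(grid):
    """A str()-style formatter using numpy's default array printing
    (the representation this paper reports using; Section 2.4 / Fig. 8)."""
    return np.array2string(np.array(grid, dtype=int))

print(grid_to_str([[0, 1], [2, 0]]))
# [[0 1]
#  [2 0]]
```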

2.2 In-context Learning
At a certain scale, many LMs exhibit the ability to adapt to new tasks without updating their parameters, simply by conditioning on the input examples or instructions provided. Given a sequence of input-output pairs (x_1, y_1), …, (x_n, y_n) and a new input x_{n+1}, an LM can be used to generate the output ŷ_{n+1} by sampling from:

ŷ_{n+1} ∼ LM(· | x_1, y_1, …, x_n, y_n, x_{n+1})    (1)

The possibility that in-context learning implicitly simulates machine learning algorithms has been discussed in previous work (Akyürek et al., 2022), but empirical evidence shows that in-context learning with language models does not always resemble any standard machine learning algorithm (Zhao et al., 2024; Min et al., 2022), and it does not always work out of the box for novel tasks: for example, small language models (a few billion parameters) perform poorly on ARC (Opielka et al., 2024; Bober-Irizar and Banerjee, 2024).
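To make Eq. (1) concrete, here is a hedged sketch of few-shot prompt construction. The "Input:/Output:" template is an assumption for illustration, not the paper's exact format; fmt is any grid-to-text formatter, such as the grid_to_str sketched above.

```python
# Build the few-shot prompt of Eq. (1): serialize the n demonstration pairs
# followed by the new input x_{n+1}; the LM's sampled continuation is ŷ_{n+1}.
def build_prompt(train_pairs, test_input, fmt=str):
    parts = []
    for x, y in train_pairs:                 # (x_1, y_1), ..., (x_n, y_n)
        parts.append(f"Input:\n{fmt(x)}\nOutput:\n{fmt(y)}\n")
    parts.append(f"Input:\n{fmt(test_input)}\nOutput:\n")  # x_{n+1}
    return "\n".join(parts)

print(build_prompt([([[1, 0]], [[0, 1]])], [[2, 0]]))
```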

2.3 Test-Time Training
Test-time training (TTT) enables parametric models to adapt during inference through dynamic parameter updates, an approach that remains relatively unexplored in the era of large language models. This technique is a form of transductive learning, in which a model leverages the structure of the test data to improve its predictions. The general TTT process works as follows: starting with initial model parameters θ_0, for each test input (or batch of inputs) d_input, we first generate training data D_TTT(d_input) from the test inputs. We then optimize these parameters to minimize a loss function L(D_TTT; θ), producing temporarily updated parameters θ_d for prediction. After generating predictions, the model is restored to the original parameters θ_0 for the next instance or batch. Thus, TTT trains a specialized prediction model for each test input, obtained by fine-tuning a base model on a test-time dataset generated from that test input.

Figure 2: TTT dataset generation for a test task (Section 3.1): We start by creating leave-one-out tasks from the given training examples of the task. These tasks are then augmented through rule-based transformations to obtain the full TTT dataset. Finally, we train task-specific LoRA adapters on top of the base FT model.

In past work (e.g., Sun et al., 2020), D_TTT is typically constructed by applying an unsupervised objective (e.g., masked autoencoding) to the input x alone. However, the in-context learning setting we consider provides richer context in the form of demonstration pairs (x_1, y_1), …, (x_K, y_K). Here, applying test-time tuning involves first constructing an initial language model LM, mapping each test input x to an input-specific dataset D_TTT, fine-tuning the LM to optimize a loss function L over that dataset, Σ_{d ∈ D_TTT} L(LM(d)), and finally sampling from the updated model to obtain a final prediction. Our experiments in this paper characterize each component of this pipeline, describing:

  1. How to construct the augmented TTT dataset D_TTT from the test input (Section 3).
  2. An augmented inference strategy based on self-consistency over transformations (Section 4).
  3. A base model with parameters θ_0 that is fine-tuned on a dataset D_FT of similar tasks (Section 5).
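Before turning to these components, the toy example below illustrates the bare TTT protocol of this section on a one-parameter model: for each instance, fine-tune on an instance-specific dataset, predict, then restore θ_0. It is a didactic stand-in under simplified assumptions, not the paper's LoRA-based setup.

```python
# Toy TTT loop: per-instance gradient updates on y = w * x, then reset to θ_0.
THETA_0 = 0.5                                   # initial (fine-tuned) parameter

def mse_grad(w, data):
    """Gradient of mean squared error over (x, y) pairs with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def ttt_predict(test_instances, lr=0.05, steps=20):
    preds = []
    for demos, x_test in test_instances:        # demos play the role of D_TTT
        w = THETA_0                             # start from θ_0 every time
        for _ in range(steps):
            w -= lr * mse_grad(w, demos)        # θ_0 -> θ_d
        preds.append(w * x_test)                # predict with θ_d
        # w is discarded here: the next instance starts again from θ_0
    return preds

print(ttt_predict([([(1.0, 2.0), (2.0, 4.0)], 3.0)]))  # ≈ [5.99], i.e. y ≈ 2x
```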

2.4 Experimental Setup
To investigate the impact of each TTT component, we conduct experiments by varying one component while holding the others constant at their optimal values (described in their respective sections). Our default configuration in the experiments uses the following settings:

Model Architecture & Optimization We use the 8B-parameter language model from the Llama 3 family and the 1B and 3B models from the Llama 3.2 family (Dubey et al., 2024). We use Low-Rank Adaptation (LoRA) (Hu et al., 2021) for parameter-efficient test-time training. For each task d, we initialize a separate set of LoRA parameters that are trained on the dataset D_TTT. The LoRA rank is set to 128, and adaptations are applied to the MLP, attention, and output layers. We train models with the AdamW optimizer (Loshchilov and Hutter, 2019) for 2 epochs with a batch size of 2.
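A sketch of this per-task adapter setup, assuming the Hugging Face transformers and peft libraries: the rank and target layers follow the defaults above, while the checkpoint id, lora_alpha, and dropout values are illustrative assumptions (the paper's exact hyper-parameters are in its Appendix B.2).

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base fine-tuned model (checkpoint id shown for illustration).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

config = LoraConfig(
    r=128,                     # LoRA rank used in the paper's default setup
    lora_alpha=128,            # assumed; not specified in this excerpt
    lora_dropout=0.0,          # assumed
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention layers
        "gate_proj", "up_proj", "down_proj",      # MLP layers
        "lm_head",                                # output layer
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)  # a fresh adapter is trained per task d
```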

Data & Formatting For efficient evaluation, we randomly pick 80 balanced ARC tasks from the ARC validation set, comprising 20 easy, 20 medium, 20 hard, and 20 expert tasks according to the classification in LeGris et al. (2024a) (see Appendix A.2 for the task list). We use this subset of ARC tasks throughout the paper, except for our final results, which are reported on the full validation set (Section 6). For efficiency, we limit D_TTT to a maximum of 250 examples per task. With that, the whole TTT and inference process takes approximately 12 hours for 100 randomly sampled validation tasks on an NVIDIA A100 GPU. Appendix B.2 provides additional details on the hyper-parameters. Input grids are converted to text using numpy's default array printing format, as shown in Fig. 8.

In the following sections, we investigate the key factors that contribute to successful abstract reasoning with language models. Our analysis covers the impact of the fine-tuning data D_FT, the TTT data D_TTT, training objectives, inference procedures, and model size, providing insights into effective strategies for deploying test-time training.


AI Technologies Discussed

The white paper discusses several advanced AI and ML technologies, particularly in the context of improving reasoning abilities in language models. Here’s a breakdown of the specific technologies and methods involved:

1. Test-Time Training (TTT)

  • Definition: TTT is a method where a model’s parameters are updated during inference (testing) rather than during the standard training phase. This enables real-time adjustments to the model’s understanding of a given task.
  • Application: TTT is used to improve the performance of language models on novel tasks by allowing models to learn from specific examples during test time. It achieves this by applying transformations and using task-specific data to refine predictions.

2. Abstraction and Reasoning Corpus (ARC) Benchmark

  • Purpose: The ARC benchmark is designed to test AI’s ability for abstract reasoning. It consists of visual and logical puzzles that require a mix of pattern recognition, symbolic reasoning, and extrapolation—skills necessary for AGI.
  • Usage: ARC serves as the primary dataset for evaluating the effectiveness of TTT in the study, as it presents reasoning challenges beyond what the models encountered during pre-training.

3. Fine-Tuning with Low-Rank Adaptation (LoRA)

  • Description: LoRA is a parameter-efficient fine-tuning technique that trains only small low-rank update matrices added to the model’s weights, rather than adjusting all model parameters.
  • Benefits: This approach allows efficient fine-tuning at test time, preserving computational resources while enabling the model to adapt to individual tasks dynamically.

4. Data Augmentation with Geometric Transformations

  • Techniques: The paper details geometric transformations, such as rotation, flipping, transposing, and reflecting, applied to training data to generate augmented datasets. These transformations help create diverse views of tasks, enriching the TTT dataset and improving generalization.
  • Purpose: Data augmentation via transformations supports the model’s ability to interpret tasks under different representations, enhancing its adaptability to new reasoning tasks, as sketched below.
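A minimal sketch of such rule-based grid augmentations with numpy; the transform list is illustrative and narrower than the full set used in the paper.

```python
import numpy as np

# Each transform is applied consistently to every input and output grid of a
# task, yielding a new, equivalent task for the TTT dataset.
TRANSFORMS = {
    "identity":  lambda g: g,
    "rot90":     lambda g: np.rot90(g, k=1),
    "rot180":    lambda g: np.rot90(g, k=2),
    "flip_h":    np.fliplr,
    "flip_v":    np.flipud,
    "transpose": lambda g: g.T,
}

def augment_task(pairs, name):
    """Apply one named transform to all (input, output) grids of a task."""
    t = TRANSFORMS[name]
    return [(t(x), t(y)) for x, y in pairs]

pairs = [(np.array([[1, 0], [0, 2]]), np.array([[2, 0], [0, 1]]))]
print(augment_task(pairs, "rot90"))
```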

5. Hierarchical Voting for Inference

  • Methodology: The study employs a hierarchical voting strategy in which predictions generated through transformed task versions are aggregated. This involves two stages:
    • Intra-Transformation Voting: Voting within each set of transformed predictions.
    • Global Voting: A final vote across the top candidates from each transformation.
  • Outcome: This ensemble technique improves the reliability of predictions by choosing the most likely correct answer across multiple perspectives, significantly enhancing accuracy; a sketch follows.
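The sketch below illustrates this two-stage vote. Candidate grids are stood in for by strings (hashable, so they can be counted); the function name and the top-k cutoff are assumed details, not the paper's exact configuration.

```python
from collections import Counter

def hierarchical_vote(preds_by_transform, k=2):
    finalists = []
    for transform, preds in preds_by_transform.items():
        # Stage 1 (intra-transformation): keep the k most frequent candidates.
        finalists.extend(p for p, _ in Counter(preds).most_common(k))
    # Stage 2 (global): majority vote across the finalists of all transforms.
    return Counter(finalists).most_common(1)[0][0]

votes = {
    "identity": ["A", "A", "B"],
    "rot90":    ["A", "C"],
    "flip_h":   ["A", "C", "C"],
}
print(hierarchical_vote(votes))  # "A" wins 3 of the 6 finalist slots
```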

6. Self-Consistency via Augmented Inference

  • Concept: Self-consistency, as applied here, involves generating multiple prediction candidates through transformations and aggregating them to achieve consensus in the final answer.
  • Execution: By creating multiple versions of the task through geometric transformations and using a self-consistent approach, the model can better handle ambiguity and improve decision-making accuracy, as sketched below.
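A sketch of one such round trip: present the model with a transformed version of the task, then map its answer back with the inverse transform, so that all candidates share the original orientation and can be compared. Here predict is a placeholder for sampling from the (test-time-trained) LM.

```python
import numpy as np

# Forward/inverse pairs; geometric grid transforms are invertible by design.
TRANSFORM_PAIRS = {
    "rot90":  (lambda g: np.rot90(g, 1), lambda g: np.rot90(g, -1)),
    "flip_h": (np.fliplr, np.fliplr),    # a horizontal flip is self-inverse
}

def predict_with_transform(predict, pairs, x_test, name):
    fwd, inv = TRANSFORM_PAIRS[name]
    t_pairs = [(fwd(x), fwd(y)) for x, y in pairs]
    y_hat = predict(t_pairs, fwd(x_test))   # model sees the transformed task
    return inv(y_hat)                       # candidate in original orientation

identity_model = lambda pairs, x: x          # dummy stand-in for the LM
x = np.array([[1, 0], [0, 2]])
print(predict_with_transform(identity_model, [], x, "rot90"))  # recovers x
```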

7. Program Synthesis Integration

  • Description: Program synthesis involves creating programs or function-based solutions that the model can follow to solve a problem. Traditionally, program synthesis has been essential for ARC tasks, as it uses structured, rule-based approaches to abstract reasoning.
  • Significance: The study combines program synthesis with TTT to maximize performance on ARC, demonstrating that integrating symbolic (programmatic) and neural methods can yield state-of-the-art results.

8. Synthetic Data Generation Using Language Models

  • Technique: The researchers generate synthetic training tasks using large language models, such as GPT-4, in few-shot settings to produce task descriptions and new generators.
  • Utility: This synthetic data generation enriches the fine-tuning dataset by expanding the variety of tasks, thus allowing the base model to learn and generalize over a broader range of examples.

9. Quantized LoRA (QLoRA) for Memory Efficiency

  • Purpose: QLoRA is a memory-efficient variant of LoRA that reduces memory requirements by quantizing the frozen base model’s weights while training the LoRA adapters in higher precision.
  • Application: This technique allows the model to perform task-specific fine-tuning in memory-constrained environments without significantly compromising performance; a sketch follows.
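A hedged sketch of this setup via the QLoRA recipe (Dettmers et al., 2023), using the Hugging Face BitsAndBytesConfig API; the checkpoint id and quantization settings are illustrative assumptions.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize the frozen base model
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb
)
# Only the small, higher-precision LoRA weights are trained per task.
model = get_peft_model(base, LoraConfig(r=128, task_type="CAUSAL_LM"))
```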

10. Multi-Model Ensembling

  • Approach: By ensembling TTT-augmented neural models with program synthesis-based models, the study achieves a state-of-the-art accuracy comparable to human-level performance.
  • Significance: This ensembling of neural and symbolic approaches highlights the complementary strengths of both methodologies, showing that they can be used together to approach AGI-like capabilities in abstract reasoning.

These technologies collectively point toward a future where language models can perform complex, adaptive reasoning across various domains, suggesting that AGI may emerge through enhanced adaptability and multi-technique integration in AI systems.
