Task Implementations

Individual task implementations for evaluation and training across multiple domains including reasoning, mathematics, coding, and language understanding.

Overview

The task system provides standardized implementations for popular benchmarks:

  • ARC - AI2 Reasoning Challenge (multiple choice reasoning)
  • MMLU - Massive Multitask Language Understanding
  • GSM8K - Grade School Math with tool use
  • HumanEval - Code generation and execution
  • SpellingBee - Letter counting with mixed reasoning
  • Common Framework - Base classes and utilities

Key Files:

  • tasks/common.py - Base Task class and utilities
  • tasks/arc.py - AI2 Reasoning Challenge
  • tasks/mmlu.py - Multitask Language Understanding
  • tasks/gsm8k.py - Grade School Math
  • tasks/humaneval.py - Code generation benchmark
  • tasks/spellingbee.py - Custom spelling and counting task

Common Framework

Provides base classes and utilities for all task implementations.

Source: tasks/common.py:1-20

python
"""
Base class for all Tasks.
A Task is basically a dataset of conversations, together with some
metadata and often also evaluation criteria.
Example tasks: MMLU, ARC-Easy, ARC-Challenge, GSM8K, HumanEval, SmolTalk.
"""

import random

class Task:
    """
    Base class of a Task. Allows for lightweight slicing of the underlying dataset.
    """

    def __init__(self, start=0, stop=None, step=1):
        # allows a lightweight logical view over a dataset
        assert start >= 0, f"Start must be non-negative, got {start}"
        assert stop is None or stop >= start, f"Stop should be greater than or equal to start, got {stop} and {start}"
        assert step >= 1, f"Step must be strictly positive, got {step}"
        self.start = start
        self.stop = stop # could be None here
        self.step = step

Task Base Class

All tasks inherit from the base Task class and implement a standard interface:

Source: tasks/common.py:25-50

python
@property
def eval_type(self):
    # one of 'generative' | 'categorical'
    raise NotImplementedError

def num_examples(self):
    raise NotImplementedError

def get_example(self, index):
    raise NotImplementedError

def __len__(self):
    start = self.start
    stop = self.num_examples() if self.stop is None else self.stop
    step = self.step
    span = stop - start
    num = (span + step - 1) // step # ceil_div(span, step)
    assert num >= 0, f"Negative number of examples???: {num}" # prevent footguns
    return num

def __getitem__(self, index: int):
    assert isinstance(index, int), f"Index must be an integer, got {type(index)}"
    physical_index = self.start + index * self.step
    conversation = self.get_example(physical_index)
    return conversation

def evaluate(self, problem, completion):
    raise NotImplementedError
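
The slicing parameters make it cheap to carve out evaluation subsets without copying data. As a usage sketch (the MyTask subclass below is hypothetical, not from the repository), a subclass only supplies the dataset-specific methods and inherits slicing from the base class:

python
# Hypothetical subclass, for illustration only.
class MyTask(Task):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.data = [f"example {i}" for i in range(100)]

    @property
    def eval_type(self):
        return "generative"

    def num_examples(self):
        return len(self.data)

    def get_example(self, index):
        return {"messages": [{"role": "user", "content": self.data[index]}]}

# Lightweight logical view: every 10th example of the first 50
subset = MyTask(start=0, stop=50, step=10)
print(len(subset))  # 5
print(subset[2])    # physical index 0 + 2*10 = 20 -> "example 20"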

Task Composition

Supports mixing and sequencing tasks for training:

Source: tasks/common.py:75-95

python
class TaskMixture(Task):
    """
    For SFT Training it becomes useful to train on a mixture of datasets.
    Fun trick: if you wish to oversample any task, just pass it in multiple times in the list.
    """

    def __init__(self, tasks, **kwargs):
        super().__init__(**kwargs)
        # tasks is a list of Task objects
        self.tasks = tasks
        self.lengths = [len(task) for task in self.tasks]
        self.num_conversations = sum(self.lengths)
        # Build list of all (task_idx, local_idx) pairs
        self.index_map = []
        for task_idx, task_length in enumerate(self.lengths):
            for local_idx in range(task_length):
                self.index_map.append((task_idx, local_idx))
        # Deterministically shuffle to mix tasks throughout training
        rng = random.Random(42)
        rng.shuffle(self.index_map)
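
The excerpt stops before the lookup methods; a minimal sketch of how the shuffled index_map would be consumed (assumed continuation, not the verbatim source):

python
# Sketch (assumed, not verbatim source): resolve a global index through the
# shuffled index_map and delegate to the underlying task.
def num_examples(self):
    return self.num_conversations

def get_example(self, index):
    task_idx, local_idx = self.index_map[index]
    return self.tasks[task_idx][local_idx]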

Multiple Choice Rendering

Standardized format for categorical tasks:

Source: tasks/common.py:130-150

python
def render_mc(question, letters, choices):
    """
    The common multiple choice rendering format we will use.

    Note two important design decisions:
    1)
    Bigger models don't care as much, but smaller models prefer to have
    the letter *after* the choice, which results in better binding.
    2)
    There is no whitespace between the delimiter (=) and the letter.
    This is actually critical because the tokenizer has different token ids
    for " A" vs. "A". The assistant responses will be just the letter itself,
    i.e. "A", so it is important that here in the prompt it is the exact same
    token, i.e. "A" with no whitespace before it. Again, bigger models don't care
    about this too much, but smaller models do care about some of these details.
    """
    query = f"Multiple Choice question: {question}\n"
    query += "".join([f"- {choice}={letter}\n" for letter, choice in zip(letters, choices)])
    query += "\nRespond only with the letter of the correct answer."
    return query
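
For example, with a toy question (not from any dataset) the rendered prompt places each letter directly after its choice, with no whitespace before the letter:

python
prompt = render_mc(
    question="What color is the sky?",
    letters=["A", "B"],
    choices=["Red", "Blue"],
)
print(prompt)
# Multiple Choice question: What color is the sky?
# - Red=A
# - Blue=B
#
# Respond only with the letter of the correct answer.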

ARC (AI2 Reasoning Challenge)

Multiple choice reasoning tasks from Allen AI.

Source: tasks/arc.py:1-15

python
"""
The ARC dataset from Allen AI.
https://huggingface.co/datasets/allenai/ai2_arc
"""

from datasets import load_dataset
from tasks.common import Task, render_mc

class ARC(Task):

    def __init__(self, subset, split, **kwargs):
        super().__init__(**kwargs)
        assert subset in ["ARC-Easy", "ARC-Challenge"], "ARC subset must be ARC-Easy or ARC-Challenge"
        assert split in ["train", "validation", "test"], "ARC split must be train|validation|test"
        self.ds = load_dataset("allenai/ai2_arc", subset, split=split).shuffle(seed=42)

Example Generation

Converts ARC format to conversation format:

Source: tasks/arc.py:20-40

python
def get_example(self, index):
    row = self.ds[index]
    question = row["question"] # the question text
    choices = row["choices"]["text"] # the text of each choice
    answer_string = row["answerKey"] # e.g. "A", "B", "C", "D"
    letters = row["choices"]["label"] # e.g. ["A", "B", "C", "D"]
    assert answer_string in letters, f"ARC answer {answer_string} must be one of {letters}" # sanity check
    # create and return the Conversation object
    user_message = render_mc(question, letters, choices)
    messages = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": answer_string}
    ]
    conversation = {
        "messages": messages,
        "letters": letters, # useful during evaluation, so we can narrow and clamp the assistant prediction to one of the letters
    }
    return conversation

ARC uses categorical evaluation where models predict A/B/C/D choices directly.
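
Because the conversation carries the valid letters, categorical evaluation reduces to an exact match against the reference letter. A minimal sketch (assumed shape, not the verbatim source):

python
# Sketch (assumed, not verbatim source): the prediction is clamped to
# conversation["letters"], so evaluation is an exact letter match.
def evaluate(self, conversation, completion):
    reference = conversation["messages"][-1]["content"]  # e.g. "B"
    return completion == reference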

MMLU (Massive Multitask Language Understanding)

Broad knowledge evaluation across 57 academic subjects.

Source: tasks/mmlu.py:1-15

python
"""
The MMLU dataset.
https://huggingface.co/datasets/cais/mmlu
"""

from datasets import load_dataset
from tasks.common import Task, render_mc

class MMLU(Task):

    letters = ('A', 'B', 'C', 'D')
    groups = ('abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', ...)

Subject Coverage

MMLU covers diverse academic domains:

  • STEM: mathematics, physics, chemistry, biology, computer science
  • Humanities: history, philosophy, literature
  • Social Sciences: psychology, economics, political science
  • Professional: law, medicine, business

Source: tasks/mmlu.py:25-45

python
def get_example(self, index):
    row = self.ds[index]
    question = row["question"] # the question text
    choices = row["choices"] # the text of each choice
    answer = row["answer"] # index of the answer, e.g. 0,1,2,3 (for A,B,C,D)
    subject = row["subject"] # e.g. "college_biology", "college_chemistry", etc.
    assert len(choices) == 4, "MMLU should have 4 choices"
    # create and return the Conversation object
    user_message = render_mc(question, self.letters, choices)
    assistant_message = self.letters[answer]
    messages = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_message}
    ]
    conversation = {
        "messages": messages,
        "subject": subject, # might be useful later for grouping metrics by subject
        "letters": self.letters, # useful during evaluation, so we can narrow and clamp the assistant prediction to one of the letters
    }
    return conversation
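
Since each conversation carries its subject, per-subject accuracy can be aggregated downstream. A small sketch of such an aggregation (hypothetical loop; task is an MMLU instance and predict is an assumed helper returning a letter):

python
from collections import defaultdict

# Hypothetical aggregation loop, not from the repository.
correct, total = defaultdict(int), defaultdict(int)
for i in range(len(task)):
    conversation = task[i]
    subject = conversation["subject"]
    prediction = predict(conversation)  # assumed helper, returns e.g. "C"
    reference = conversation["messages"][-1]["content"]
    correct[subject] += int(prediction == reference)
    total[subject] += 1

for subject in sorted(total):
    print(f"{subject}: {correct[subject] / total[subject]:.3f}")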

GSM8K (Grade School Math)

Mathematical word problems with tool use for calculations.

Source: tasks/gsm8k.py:1-20

python
"""
GSM8K evaluation.
https://huggingface.co/datasets/openai/gsm8k

Example problem instance:

Question:
Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
Answer:
Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.
Working 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.
#### 10

Notice that GSM8K uses tool calls inside << >> tags.
"""

Tool Use Integration

GSM8K interleaves natural-language reasoning with embedded calculator tool calls, which are parsed into structured message parts:

Source: tasks/gsm8k.py:40-70

python
def get_example(self, index):
    """ Get a single problem from the dataset. """
    row = self.ds[index]
    question = row['question'] # string of the question prompt
    answer = row['answer'] # string of the full solution and the answer after #### marker
    # Create and return the Conversation object
    # This is tricky because GSM8K uses tool calls, which we need to parse here.
    assistant_message_parts = []
    parts = re.split(r'(<<[^>]+>>)', answer)
    for part in parts:
        if part.startswith('<<') and part.endswith('>>'):
            # This is a calculator tool call
            inner = part[2:-2]  # Remove << >>
            # Split on = to get expression and result
            if '=' in inner:
                expr, result = inner.rsplit('=', 1)
            else:
                expr, result = inner, ""
            # Add the tool call as a part
            assistant_message_parts.append({"type": "python", "text": expr})
            # Add the result as a part
            assistant_message_parts.append({"type": "python_output", "text": result})
        else:
            # Regular text in between tool calls
            assistant_message_parts.append({"type": "text", "text": part})

Answer Extraction

Robust numerical answer parsing:

Source: tasks/gsm8k.py:25-35

python
GSM_RE = re.compile(r"#### (\-?[0-9\.\,]+)")
def extract_answer(completion):
    """
    Extract the numerical answer after #### marker.
    Follows official code for normalization:
    https://github.com/openai/grade-school-math/blob/3101c7d5072418e28b9008a6636bde82a006892c/grade_school_math/dataset.py#L28
    """
    match = GSM_RE.search(completion)
    if match:
        match_str = match.group(1).strip()
        match_str = match_str.replace(",", "")
        return match_str
    return None
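
Evaluation can then compare the number extracted from the model completion against the number extracted from the reference solution. A minimal sketch (assumed shape, not the verbatim source; it assumes the raw reference answer string is available on the conversation):

python
# Sketch (assumed, not verbatim source): compare extracted final answers.
def evaluate(self, conversation, completion):
    reference = extract_answer(conversation["answer"])  # assumes the raw answer string is stored
    predicted = extract_answer(completion)
    return predicted is not None and predicted == reference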

HumanEval (Code Generation)

Python code generation with execution-based evaluation.

Source: tasks/humaneval.py:1-15

python
"""
Evaluate the Chat model on HumanEval dataset.
Btw this dataset is a misnomer and has nothing to do with humans.
It is a coding benchmark.
"""

import re
from datasets import load_dataset
from nanochat.execution import execute_code
from tasks.common import Task

Code Extraction

Robust parsing of code from model responses:

Source: tasks/humaneval.py:20-40

python
def extract_program(completion):
    """
    Extract Python code from LLM completion.

    Handles various output formats:
    - Code wrapped in ```python ... ``` or ``` ... ``` blocks
    - Plain code without markdown blocks
    - Extra text before/after code blocks

    Returns the first code block if found, otherwise returns the whole completion.
    """
    # Try to find markdown code blocks (```python or just ```)
    # Match ```python\n...\n``` or ```\n...\n```
    pattern = r'```(?:python)?\s*\n(.*?)\n```'
    matches = re.findall(pattern, completion, re.DOTALL)

    if matches:
        # Return the first code block found
        return matches[0].strip()

    # No code blocks found, return the whole completion
    return completion.strip()
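
For example, a completion that wraps its solution in a markdown block is reduced to just the code inside the first block:

python
completion = (
    "Sure, here is the function:\n"
    "```python\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "```\n"
    "Hope that helps!"
)
print(extract_program(completion))
# def add(a, b):
#     return a + b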

Execution-Based Evaluation

Uses safe code execution to test solutions:

Source: tasks/humaneval.py:65-85

python
def evaluate(self, conversation, completion):
    """ Given (conversation, completion), return boolean success of the completion. """
    # the prompt will contain the imports and the function signature
    imports = extract_imports(conversation['messages'][0]['content'])
    # the completion will usually contain the whole function
    # but not always with the needed imports, so we manually append them
    completion_code = extract_program(completion)
    program = (
        imports
        + "\\n\\n"
        + completion_code
        + "\\n\\n"
        + conversation['test']
        + "\\n"
        + f"check({conversation['entry_point']})"
    )
    result = execute_code(program)
    success = result.success
    return success

HumanEval tests functional correctness by running the generated code against test cases.
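
The excerpt references extract_imports, which is not shown above; conceptually it pulls the import lines out of the prompt so the assembled program is self-contained. A hypothetical sketch (not the repository's implementation):

python
# Hypothetical sketch, not the actual tasks/humaneval.py implementation:
# keep only the import lines from the prompt text.
def extract_imports(prompt):
    lines = prompt.split("\n")
    kept = [line for line in lines if line.startswith(("import ", "from "))]
    return "\n".join(kept)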

SpellingBee (Custom Task)

Novel task combining letter counting with tool use and manual reasoning.

Source: tasks/spellingbee.py:1-30

python
"""
Task intended to make nanochat better in spelling and counting, for example:

"How many r are in strawberry?" -> 3

An interesting part of this task is that we will get the assistant to
solve the problem using a combination of manual counting and Python.
This is a good problem solving "instinct" to mix into the model and RL
may further refine it to trust one over the other. If we were extra fancy
(which we could/should be) we'd add small errors here and there to allow
the model also learn recoveries. We can do this in future versions.

There are two tasks in this file:
1. SpellingBee: Counting the number of occurrences of a letter in a word
2. SimpleSpelling: Simply spelling words

(1) is the goal, but (2) exists as a highly condensed version of the part
that makes (1) difficult, which is word spelling. This is non-trivial for an
LLM because it has to learn how every token (a little semantic chunk/atom)
maps to the sequence of individual characters that make it up.
"""

Hybrid Problem Solving

The synthetic assistant response combines a manual counting trace with a Python verification step:

Source: tasks/spellingbee.py:110-140

python
# Now create the ideal assistant response - build as parts (text + tool calls)
assistant_parts = []
word_letters = ",".join(list(word))
manual_text = f"""We are asked to find the number '{letter}' in the word '{word}'. Let me try a manual approach first.

First spell the word out:
{word}:{word_letters}

Then count the occurrences of '{letter}':
"""
# Little simulated loop of the solution process
running_count = 0
for i, char in enumerate(word, 1):
    if char == letter:
        running_count += 1
        manual_text += f"{i}:{char} hit! count={running_count}\\n"
    else:
        manual_text += f"{i}:{char}\\n"

manual_text += f"\\nThis gives us {running_count}."
assistant_parts.append({"type": "text", "text": manual_text})
# Part 2: Python verification
assistant_parts.append({"type": "text", "text": "\\n\\nLet me double check this using Python:\\n\\n"})
# Part 3: Python tool call
python_expr = f"'{word}'.count('{letter}')"
assistant_parts.append({"type": "python", "text": python_expr})
# Part 4: Python output
assistant_parts.append({"type": "python_output", "text": str(count)})

Data Augmentation

Extensive template variation for robust training:

Source: tasks/spellingbee.py:40-70

python
# User message templates for data augmentation
USER_MSG_TEMPLATES = [
    "How many {letter} are in the word {word}",
    "How many {letter} are in {word}",
    "Count the number of {letter} in {word}",
    "How many times does {letter} appear in {word}",
    "What's the count of {letter} in {word}",
    "In the word {word}, how many {letter} are there",
    # Spanish
    "¿Cuántas {letter} hay en {word}?",
    "¿Cuántas veces aparece {letter} en {word}?",
    # Chinese (Simplified)
    "{word}中有多少个{letter}",
    "{word}里有几个{letter}",
    # Korean
    "{word}에 {letter}가 몇 개 있나요",
    # French
    "Combien de {letter} dans {word}",
    # German
    "Wie viele {letter} sind in {word}",
    # Japanese
    "{word}に{letter}は何個ありますか",
]
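
Each training example presumably fills one of these templates with a sampled word and letter; a hypothetical rendering step (the actual sampling code is not shown in this excerpt):

python
import random

# Hypothetical rendering, not the verbatim source: pick a template and fill it in.
rng = random.Random(1234)
template = rng.choice(USER_MSG_TEMPLATES)
user_message = template.format(letter="r", word="strawberry")
print(user_message)  # e.g. "Count the number of r in strawberry"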

Evaluation Types

Categorical Evaluation

  • Used for multiple choice tasks (ARC, MMLU)
  • Models predict from fixed set of choices
  • More efficient batched evaluation
  • Focuses logits on valid answer tokens

Generative Evaluation

  • Used for open-ended tasks (GSM8K, HumanEval, SpellingBee)
  • Models generate free-form responses
  • Requires post-processing and answer extraction
  • Tests full generation capabilities
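
A harness can dispatch on eval_type to choose between these two paths; a minimal sketch (hypothetical harness code, with assumed predict_letter and generate helpers):

python
# Hypothetical dispatch, not from the repository: route each task to the
# appropriate evaluation path based on its eval_type.
def evaluate_task(task, predict_letter, generate):
    num_correct = 0
    for i in range(len(task)):
        conversation = task[i]
        if task.eval_type == "categorical":
            # constrain the prediction to the listed letters
            completion = predict_letter(conversation, conversation["letters"])
        else:  # "generative"
            completion = generate(conversation)
        num_correct += int(task.evaluate(conversation, completion))
    return num_correct / max(len(task), 1)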

Sources:

  • tasks/common.py:1-20,25-50,75-95,130-150
  • tasks/arc.py:1-15,20-40
  • tasks/mmlu.py:1-15,25-45
  • tasks/gsm8k.py:1-20,25-35,40-70
  • tasks/humaneval.py:1-15,20-40,65-85
  • tasks/spellingbee.py:1-30,40-70,110-140