Task Implementations
Individual task implementations for evaluation and training across multiple domains including reasoning, mathematics, coding, and language understanding.
Overview
The task system provides standardized implementations for popular benchmarks:
- ARC - AI2 Reasoning Challenge (multiple choice reasoning)
- MMLU - Massive Multitask Language Understanding
- GSM8K - Grade School Math with tool use
- HumanEval - Code generation and execution
- SpellingBee - Letter counting with mixed reasoning
- Common Framework - Base classes and utilities
Key Files:
- tasks/common.py - Base Task class and utilities
- tasks/arc.py - AI2 Reasoning Challenge
- tasks/mmlu.py - Massive Multitask Language Understanding
- tasks/gsm8k.py - Grade School Math
- tasks/humaneval.py - Code generation benchmark
- tasks/spellingbee.py - Custom spelling and counting task
Common Framework
Provides base classes and utilities for all task implementations.
Source: tasks/common.py:1-20
"""
Base class for all Tasks.
A Task is basically a dataset of conversations, together with some
metadata and often also evaluation criteria.
Example tasks: MMLU, ARC-Easy, ARC-Challenge, GSM8K, HumanEval, SmolTalk.
"""
import random
class Task:
"""
Base class of a Task. Allows for lightweight slicing of the underlying dataset.
"""
def __init__(self, start=0, stop=None, step=1):
# allows a lightweight logical view over a dataset
assert start >= 0, f"Start must be non-negative, got {start}"
assert stop is None or stop >= start, f"Stop should be greater than or equal to start, got {stop} and {start}"
assert step >= 1, f"Step must be strictly positive, got {step}"
self.start = start
self.stop = stop # could be None here
self.step = step
Task Base Class
All tasks inherit from the base Task class and implement a standard interface:
Source: tasks/common.py:25-50
@property
def eval_type(self):
# one of 'generative' | 'categorical'
raise NotImplementedError
def num_examples(self):
raise NotImplementedError
def get_example(self, index):
raise NotImplementedError
def __len__(self):
start = self.start
stop = self.num_examples() if self.stop is None else self.stop
step = self.step
span = stop - start
num = (span + step - 1) // step # ceil_div(span, step)
assert num >= 0, f"Negative number of examples???: {num}" # prevent footguns
return num
def __getitem__(self, index: int):
assert isinstance(index, int), f"Index must be an integer, got {type(index)}"
physical_index = self.start + index * self.step
conversation = self.get_example(physical_index)
return conversation
def evaluate(self, problem, completion):
raise NotImplementedError
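To make the interface concrete, here is a minimal hypothetical subclass (the class and its in-memory data are illustrative, not part of the repository) showing how the start/stop/step view interacts with num_examples and get_example:
class ToyTask(Task):
    """Hypothetical task over a tiny in-memory dataset, for illustration only."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.data = [f"question {i}" for i in range(10)]

    @property
    def eval_type(self):
        return "categorical"

    def num_examples(self):
        return len(self.data)

    def get_example(self, index):
        return {"messages": [{"role": "user", "content": self.data[index]}]}

view = ToyTask(start=2, step=2)  # lightweight view: every other example from index 2
print(len(view))   # 4, i.e. ceil((10 - 2) / 2)
print(view[0])     # the conversation at physical index 2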
Task Composition
Supports mixing and sequencing tasks for training:
Source: tasks/common.py:75-95
class TaskMixture(Task):
"""
For SFT Training it becomes useful to train on a mixture of datasets.
Fun trick: if you wish to oversample any task, just pass it in multiple times in the list.
"""
def __init__(self, tasks, **kwargs):
super().__init__(**kwargs)
# tasks is a list of Task objects
self.tasks = tasks
self.lengths = [len(task) for task in self.tasks]
self.num_conversations = sum(self.lengths)
# Build list of all (task_idx, local_idx) pairs
self.index_map = []
for task_idx, task_length in enumerate(self.lengths):
for local_idx in range(task_length):
self.index_map.append((task_idx, local_idx))
# Deterministically shuffle to mix tasks throughout training
rng = random.Random(42)
rng.shuffle(self.index_map)
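A sketch of the oversampling trick mentioned in the docstring, using the ARC constructor shown later on this page (indexing into the mixture relies on num_examples/get_example methods of TaskMixture that are not shown in this excerpt):
from tasks.arc import ARC
from tasks.common import TaskMixture

# List ARC-Challenge twice to oversample it 2x relative to ARC-Easy
mixture = TaskMixture([
    ARC("ARC-Easy", "train"),
    ARC("ARC-Challenge", "train"),
    ARC("ARC-Challenge", "train"),
])

# The deterministic shuffle (seed 42) interleaves the tasks, so consecutive
# indices mix the datasets throughout training
for i in range(3):
    print(mixture[i]["messages"][0]["content"][:60])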
Multiple Choice Rendering
Standardized format for categorical tasks:
Source: tasks/common.py:130-150
def render_mc(question, letters, choices):
"""
The common multiple choice rendering format we will use.
Note two important design decisions:
1)
Bigger models don't care as much, but smaller models prefer to have
the letter *after* the choice, which results in better binding.
2)
There is no whitespace between the delimiter (=) and the letter.
This is actually critical because the tokenizer has different token ids
for " A" vs. "A". The assistant responses will be just the letter itself,
i.e. "A", so it is important that here in the prompt it is the exact same
token, i.e. "A" with no whitespace before it. Again, bigger models don't care
about this too much, but smaller models do care about some of these details.
"""
query = f"Multiple Choice question: {question}\\n"
query += "".join([f"- {choice}={letter}\\n" for letter, choice in zip(letters, choices)])
query += "\\nRespond only with the letter of the correct answer."
return query
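For example, a toy question (values are illustrative) renders like this, with the choice text before the letter and no whitespace between = and the letter:
from tasks.common import render_mc

prompt = render_mc(
    "What color is a ripe banana?",
    letters=("A", "B", "C"),
    choices=("Blue", "Yellow", "Red"),
)
print(prompt)
# Multiple Choice question: What color is a ripe banana?
# - Blue=A
# - Yellow=B
# - Red=C
#
# Respond only with the letter of the correct answer.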
ARC (AI2 Reasoning Challenge)
Multiple choice reasoning tasks from Allen AI.
Source: tasks/arc.py:1-15
"""
The ARC dataset from Allen AI.
https://huggingface.co/datasets/allenai/ai2_arc
"""
from datasets import load_dataset
from tasks.common import Task, render_mc
class ARC(Task):
def __init__(self, subset, split, **kwargs):
super().__init__(**kwargs)
assert subset in ["ARC-Easy", "ARC-Challenge"], "ARC subset must be ARC-Easy or ARC-Challenge"
assert split in ["train", "validation", "test"], "ARC split must be train|validation|test"
self.ds = load_dataset("allenai/ai2_arc", subset, split=split).shuffle(seed=42)
Example Generation
Converts ARC format to conversation format:
Source: tasks/arc.py:20-40
def get_example(self, index):
row = self.ds[index]
question = row["question"] # the question text
choices = row["choices"]["text"] # the text of each choice
answer_string = row["answerKey"] # e.g. "A", "B", "C", "D"
letters = row["choices"]["label"] # e.g. ["A", "B", "C", "D"]
assert answer_string in letters, f"ARC answer {answer_string} must be one of {letters}" # sanity check
# create and return the Conversation object
user_message = render_mc(question, letters, choices)
messages = [
{"role": "user", "content": user_message},
{"role": "assistant", "content": answer_string}
]
conversation = {
"messages": messages,
"letters": letters, # useful during evaluation, so we can narrow and clamp the assistant prediction to one of the letters
}
return conversation
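A short usage sketch (this downloads the dataset through Hugging Face datasets; len() assumes ARC implements num_examples over the loaded split, which is not shown above):
from tasks.arc import ARC

arc = ARC("ARC-Easy", "validation")
conv = arc[0]
print(conv["messages"][0]["content"])  # rendered multiple choice prompt
print(conv["messages"][1]["content"])  # the gold letter, e.g. "B"
print(conv["letters"])                 # letters to clamp predictions to during evaluation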
ARC uses categorical evaluation where models predict A/B/C/D choices directly.
MMLU (Massive Multitask Language Understanding)
Broad knowledge evaluation across 57 academic subjects.
Source: tasks/mmlu.py:1-15
"""
The MMLU dataset.
https://huggingface.co/datasets/cais/mmlu
"""
from datasets import load_dataset
from tasks.common import Task, render_mc
class MMLU(Task):
letters = ('A', 'B', 'C', 'D')
groups = ('abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', ...)
Subject Coverage
MMLU covers diverse academic domains:
- STEM: mathematics, physics, chemistry, biology, computer science
- Humanities: history, philosophy, literature
- Social Sciences: psychology, economics, political science
- Professional: law, medicine, business
Source: tasks/mmlu.py:25-45
def get_example(self, index):
row = self.ds[index]
question = row["question"] # the question text
choices = row["choices"] # the text of each choice
answer = row["answer"] # index of the answer, e.g. 0,1,2,3 (for A,B,C,D)
subject = row["subject"] # e.g. "college_biology", "college_chemistry", etc.
assert len(choices) == 4, "MMLU should have 4 choices"
# create and return the Conversation object
user_message = render_mc(question, self.letters, choices)
assistant_message = self.letters[answer]
messages = [
{"role": "user", "content": user_message},
{"role": "assistant", "content": assistant_message}
]
conversation = {
"messages": messages,
"subject": subject, # might be useful later for grouping metrics by subject
"letters": self.letters, # useful during evaluation, so we can narrow and clamp the assistant prediction to one of the letters
}
return conversation
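The subject field makes it easy to group accuracy by domain. A hypothetical aggregation helper (predict is a stand-in for whatever model call produces a letter; it is not part of the repository):
from collections import defaultdict

def per_subject_accuracy(task, predict):
    """predict(conversation) -> a letter such as "A" (hypothetical stand-in)."""
    correct, total = defaultdict(int), defaultdict(int)
    for i in range(len(task)):
        conv = task[i]
        subject = conv["subject"]
        gold = conv["messages"][-1]["content"]
        total[subject] += 1
        correct[subject] += int(predict(conv) == gold)
    return {s: correct[s] / total[s] for s in total}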
GSM8K (Grade School Math)
Mathematical word problems with tool use for calculations.
Source: tasks/gsm8k.py:1-20
"""
GSM8K evaluation.
https://huggingface.co/datasets/openai/gsm8k
Example problem instance:
Question:
Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
Answer:
Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.
Working 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.
#### 10
Notice that GSM8K uses tool calls inside << >> tags.
"""
Tool Use Integration
GSM8K solutions embed calculator tool calls, which get_example parses into alternating text and Python tool-call message parts:
Source: tasks/gsm8k.py:40-70
def get_example(self, index):
""" Get a single problem from the dataset. """
row = self.ds[index]
question = row['question'] # string of the question prompt
answer = row['answer'] # string of the full solution and the answer after #### marker
# Create and return the Conversation object
# This is tricky because GSM8K uses tool calls, which we need to parse here.
assistant_message_parts = []
parts = re.split(r'(<<[^>]+>>)', answer)
for part in parts:
if part.startswith('<<') and part.endswith('>>'):
# This is a calculator tool call
inner = part[2:-2] # Remove << >>
# Split on = to get expression and result
if '=' in inner:
expr, result = inner.rsplit('=', 1)
else:
expr, result = inner, ""
# Add the tool call as a part
assistant_message_parts.append({"type": "python", "text": expr})
# Add the result as a part
assistant_message_parts.append({"type": "python_output", "text": result})
else:
# Regular text in between tool calls
assistant_message_parts.append({"type": "text", "text": part})
Answer Extraction
Robust numerical answer parsing:
Source: tasks/gsm8k.py:25-35
GSM_RE = re.compile(r"#### (\-?[0-9\.\,]+)")
def extract_answer(completion):
"""
Extract the numerical answer after #### marker.
Follows official code for normalization:
https://github.com/openai/grade-school-math/blob/3101c7d5072418e28b9008a6636bde82a006892c/grade_school_math/dataset.py#L28
"""
match = GSM_RE.search(completion)
if match:
match_str = match.group(1).strip()
match_str = match_str.replace(",", "")
return match_str
return None
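Usage on illustrative completions (commas are stripped per the official normalization):
print(extract_answer("She earns 0.2 per minute, so 0.2 * 50 = 10.\n#### 10"))  # "10"
print(extract_answer("Total raised:\n#### 1,234"))                             # "1234"
print(extract_answer("no final answer marker"))                                # None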
HumanEval (Code Generation)
Python code generation with execution-based evaluation.
Source: tasks/humaneval.py:1-15
"""
Evaluate the Chat model on HumanEval dataset.
Btw this dataset is a misnomer and has nothing to do with humans.
It is a coding benchmark.
"""
import re
from datasets import load_dataset
from nanochat.execution import execute_code
from tasks.common import Task
Code Extraction
Robust parsing of code from model responses:
Source: tasks/humaneval.py:20-40
def extract_program(completion):
"""
Extract Python code from LLM completion.
Handles various output formats:
- Code wrapped in ```python ... ``` or ``` ... ``` blocks
- Plain code without markdown blocks
- Extra text before/after code blocks
Returns the first code block if found, otherwise returns the whole completion.
"""
# Try to find markdown code blocks (```python or just ```)
# Match ```python\n...\n``` or ```\n...\n```
pattern = r'```(?:python)?\s*\n(.*?)\n```'
matches = re.findall(pattern, completion, re.DOTALL)
if matches:
# Return the first code block found
return matches[0].strip()
# No code blocks found, return the whole completion
return completion.strip()
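Usage on illustrative completions:
completion = (
    "Here is my solution:\n"
    "```python\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "```\n"
    "Hope that helps!"
)
print(extract_program(completion))   # "def add(a, b):\n    return a + b"
print(extract_program("x = 1"))      # fallback: whole completion, i.e. "x = 1"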
Execution-Based Evaluation
Uses safe code execution to test solutions:
Source: tasks/humaneval.py:65-85
def evaluate(self, conversation, completion):
""" Given (conversation, completion), return boolean success of the completion. """
# the prompt will contain the imports and the function signature
imports = extract_imports(conversation['messages'][0]['content'])
# the completion will usually contain the whole function
# but not always with the needed imports, so we manually append them
completion_code = extract_program(completion)
program = (
imports
+ "\\n\\n"
+ completion_code
+ "\\n\\n"
+ conversation['test']
+ "\\n"
+ f"check({conversation['entry_point']})"
)
result = execute_code(program)
success = result.success
return success
HumanEval tests functional correctness by running the generated code against test cases.
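For a toy problem (all fields below are illustrative, not an actual HumanEval record), the assembled program handed to execute_code would look roughly like this:
# Illustrative fields standing in for an actual HumanEval record
imports = "from typing import List"
completion_code = (
    "def has_duplicates(xs: List[int]) -> bool:\n"
    "    return len(set(xs)) != len(xs)"
)
test = (
    "def check(candidate):\n"
    "    assert candidate([1, 2, 2]) is True\n"
    "    assert candidate([1, 2, 3]) is False"
)
entry_point = "has_duplicates"

program = imports + "\n\n" + completion_code + "\n\n" + test + "\n" + f"check({entry_point})"
print(program)  # execute_code runs this self-contained script; success == it completes without raising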
SpellingBee (Custom Task)
Novel task combining letter counting with tool use and manual reasoning.
Source: tasks/spellingbee.py:1-30
"""
Task intended to make nanochat better in spelling and counting, for example:
"How many r are in strawberry?" -> 3
An interesting part of this task is that we will get the assistant to
solve the problem using a combination of manual counting and Python.
This is a good problem solving "instinct" to mix into the model and RL
may further refine it to trust one over the other. If we were extra fancy
(which we could/should be) we'd add small errors here and there to allow
the model also learn recoveries. We can do this in future versions.
There are two tasks in this file:
1. SpellingBee: Counting the number of occurrences of a letter in a word
2. SimpleSpelling: Simply spelling words
(1) is the goal, but (2) exists as a highly condensed version of the part
that makes (1) difficult, which is word spelling. This is non-trivial for an
LLM because it has to learn how every token (a little semantic chunk/atom)
maps to the sequence of individual characters that make it up.
"""
Hybrid Problem Solving
Demonstrates a mixed manual-counting and Python tool-use approach:
Source: tasks/spellingbee.py:110-140
# Now create the ideal assistant response - build as parts (text + tool calls)
assistant_parts = []
word_letters = ",".join(list(word))
manual_text = f"""We are asked to find the number '{letter}' in the word '{word}'. Let me try a manual approach first.
First spell the word out:
{word}:{word_letters}
Then count the occurrences of '{letter}':
"""
# Little simulated loop of the solution process
running_count = 0
for i, char in enumerate(word, 1):
if char == letter:
running_count += 1
manual_text += f"{i}:{char} hit! count={running_count}\\n"
else:
manual_text += f"{i}:{char}\\n"
manual_text += f"\\nThis gives us {running_count}."
assistant_parts.append({"type": "text", "text": manual_text})
# Part 2: Python verification
assistant_parts.append({"type": "text", "text": "\\n\\nLet me double check this using Python:\\n\\n"})
# Part 3: Python tool call
python_expr = f"'{word}'.count('{letter}')"
assistant_parts.append({"type": "python", "text": python_expr})
# Part 4: Python output
assistant_parts.append({"type": "python_output", "text": str(count)})
Data Augmentation
Extensive template variation for robust training:
Source: tasks/spellingbee.py:40-70
# User message templates for data augmentation
USER_MSG_TEMPLATES = [
"How many {letter} are in the word {word}",
"How many {letter} are in {word}",
"Count the number of {letter} in {word}",
"How many times does {letter} appear in {word}",
"What's the count of {letter} in {word}",
"In the word {word}, how many {letter} are there",
# Spanish
"¿Cuántas {letter} hay en {word}?",
"¿Cuántas veces aparece {letter} en {word}?",
# Chinese (Simplified)
"{word}中有多少个{letter}",
"{word}里有几个{letter}",
# Korean
"{word}에 {letter}가 몇 개 있나요",
# French
"Combien de {letter} dans {word}",
# German
"Wie viele {letter} sind in {word}",
# Japanese
"{word}に{letter}は何個ありますか",
]
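Each template is filled with a sampled (letter, word) pair; a minimal sketch of how the augmentation might be applied (the sampling here is illustrative, not the repository's exact procedure):
import random

rng = random.Random(0)
word, letter = "strawberry", "r"
template = rng.choice(USER_MSG_TEMPLATES)
user_msg = template.format(letter=letter, word=word)
print(user_msg)  # e.g. "How many r are in strawberry"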
Evaluation Types
Categorical Evaluation
- Used for multiple choice tasks (ARC, MMLU)
- Models predict from fixed set of choices
- More efficient batched evaluation
- Focuses logits on valid answer tokens
Generative Evaluation
- Used for open-ended tasks (GSM8K, HumanEval, SpellingBee)
- Models generate free-form responses
- Requires post-processing and answer extraction
- Tests full generation capabilities
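A hypothetical driver that dispatches on eval_type between the two modes above (model, predict_letter, and generate are stand-ins, not part of the repository):
def run_eval(task, model):
    results = []
    for i in range(len(task)):
        conv = task[i]
        prompt = conv["messages"][:-1]  # drop the gold assistant turn
        if task.eval_type == "categorical":
            # score only the valid answer letters and take the argmax
            completion = model.predict_letter(prompt, conv["letters"])
        else:
            # free-form generation, then task-specific extraction/checking
            completion = model.generate(prompt)
        results.append(task.evaluate(conv, completion))
    return sum(results) / max(len(results), 1)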
Sources:
- tasks/common.py:1-20,25-50,75-95,130-150
- tasks/arc.py:1-15,20-40
- tasks/mmlu.py:1-15,25-45
- tasks/gsm8k.py:1-20,25-35,40-70
- tasks/humaneval.py:1-15,20-40,65-85
- tasks/spellingbee.py:1-30,40-70,110-140