Evaluation Scripts
The evaluation framework provides comprehensive assessment tools for models at different training stages, from tokenizer compression to chat capabilities.
Overview
The evaluation system includes:
- Base Model Evaluation - CORE metric assessment for foundation models
- Chat Model Evaluation - Task-specific evaluation for chat models
- Tokenizer Evaluation - Compression ratio analysis
- Core Evaluation Engine - Unified evaluation infrastructure
- Loss Evaluation - Bits-per-byte metric calculation
Key Files:
- scripts/base_eval.py - Base model CORE evaluation
- scripts/chat_eval.py - Chat model task evaluation
- scripts/tok_eval.py - Tokenizer compression assessment
- nanochat/core_eval.py - CORE metric implementation
- nanochat/loss_eval.py - BPB evaluation utilities
Base Model Evaluation
Evaluates foundation models using the CORE benchmark, as described in the DCLM paper (https://arxiv.org/abs/2406.11794).
Source: scripts/base_eval.py:1-50
"""
Evaluate the CORE metric for a given model.
Run on a single GPU:
python -m scripts.base_eval
Run with torchrun on e.g. 8 GPUs:
torchrun --nproc_per_node=8 -m scripts.base_eval
The script will print the CORE metric to the console.
"""
Key Features
CORE Metric Calculation
- Downloads evaluation bundle (~162MB) automatically
- Centered accuracy per task: (accuracy - random_baseline) / (1.0 - random_baseline) (see the sketch below)
- Aggregates the centered scores across multiple ICL tasks
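As a concrete illustration of the aggregation (a minimal sketch with made-up task names and accuracies, not values from the repository):

def centered_accuracy(accuracy, random_baseline):
    # 0.0 at the random-guessing baseline, 1.0 at perfect accuracy
    return (accuracy - random_baseline) / (1.0 - random_baseline)

# illustrative numbers only
task_results = {
    "hellaswag": (0.45, 0.25),  # (accuracy, baseline for 4-way multiple choice)
    "boolq": (0.68, 0.50),      # (accuracy, baseline for a yes/no task)
}
core_metric = sum(centered_accuracy(a, b) for a, b in task_results.values()) / len(task_results)
print(f"CORE: {core_metric:.3f}")  # mean centered accuracy across tasks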
Multi-Task Evaluation
- Supports multiple choice, schema, and language modeling tasks
- Few-shot learning with configurable examples
- Handles various continuation delimiters
Source: scripts/base_eval.py:75-105
def evaluate_model(model, tokenizer, device, max_per_task=-1):
    """
    Evaluate a base model on the CORE benchmark.
    - max_per_task: crop the data to this many examples per task for testing (-1 = disable)
    """
    # Load config and task metadata
    base_dir = get_base_dir()
    eval_bundle_dir = os.path.join(base_dir, "eval_bundle")
    # Download the eval bundle to disk (and unzip if needed)
    if not os.path.exists(eval_bundle_dir):
        download_file_with_lock(EVAL_BUNDLE_URL, "eval_bundle.zip", postprocess_fn=place_eval_bundle)
Usage Examples
# Single GPU evaluation
python -m scripts.base_eval
# Distributed evaluation
torchrun --nproc_per_node=8 -m scripts.base_eval
# Evaluate HuggingFace model
python -m scripts.base_eval --hf-path openai-community/gpt2
# Limited examples for testing
python -m scripts.base_eval --max-per-task 100
Chat Model Evaluation
Comprehensive evaluation suite for chat models across multiple benchmarks.
Source: scripts/chat_eval.py:1-25
"""
Evaluate the Chat model.
All the generic code lives here, and all the evaluation-specific
code lives in nanochat directory and is imported from here.
Example runs:
python -m scripts.chat_eval -a ARC-Easy
torchrun --nproc_per_node=8 -m scripts.chat_eval -- -a ARC-Easy
"""
Supported Tasks
- ARC-Easy/Challenge - AI2 Reasoning Challenge
- MMLU - Massive Multitask Language Understanding
- GSM8K - Grade School Math Word Problems
- HumanEval - Code generation benchmark
- SpellingBee - Spelling accuracy test
Evaluation Types
Generative Evaluation - For open-ended tasks
- Samples multiple completions
- Checks success criteria for any completion
- Used for HumanEval, GSM8K, SpellingBee
Categorical Evaluation - For multiple choice tasks
- Computes logits for answer choices only
- More efficient batched evaluation
- Used for ARC, MMLU (see the sketch below)
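The categorical path is not excerpted below, so here is a minimal sketch of the idea, assuming single-token answer letters and a model that returns next-token logits (the helper name pick_choice and its signature are illustrative, not the repository's API):

import torch

@torch.no_grad()
def pick_choice(model, tokenizer, prompt_ids, letters=("A", "B", "C", "D")):
    # one forward pass over the rendered prompt
    logits = model(torch.tensor([prompt_ids]))  # (1, T, vocab_size)
    last = logits[0, -1]                        # distribution over the next token
    letter_ids = [tokenizer.encode(l)[0] for l in letters]
    scores = last[letter_ids]                   # compare only the candidate answer tokens
    return letters[int(torch.argmax(scores))]

Because only the answer tokens are compared, no sampling is needed and many questions can be scored per batch, which is what makes this path cheaper than generative evaluation.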
Source: scripts/chat_eval.py:35-70
def run_generative_eval(task_object, tokenizer, model, engine, num_samples, max_new_tokens, temperature, top_k, max_problems=None):
    ddp, ddp_rank, ddp_local_rank, ddp_world_size = get_dist_info()
    device = model.get_device()
    num_problems = len(task_object) if max_problems is None else min(len(task_object), max_problems)
    # Run the evaluation
    num_passed, total = 0, 0
    for i in range(ddp_rank, num_problems, ddp_world_size):
        conversation = task_object[i]
        # Tokenize the prompt
        encoded_prompt = tokenizer.render_for_completion(conversation)
        # Get the completions
        results, _ = engine.generate_batch(
            encoded_prompt,
            num_samples=num_samples,
            max_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k,
        )
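The excerpt ends at sampling. Conceptually, the rest of the loop decodes each completion, checks it against the task's success criterion, and counts the problem as passed if any of the num_samples completions succeeds. A hedged sketch of that step (task_object.evaluate is an illustrative name for the task-specific checker, not necessarily the exact API):

# continues inside the for-loop above (sketch)
completions = [tokenizer.decode(result) for result in results]
outcomes = [task_object.evaluate(conversation, completion) for completion in completions]
num_passed += int(any(outcomes))  # pass@k-style: one success among the samples is enough
total += 1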
ChatCORE Metric
Similar to CORE but for chat models, calculating centered accuracy across all tasks:
Source: scripts/chat_eval.py:180-195
# calculate the ChatCORE metric if we can (similar to CORE, it's the mean centered accuracy)
# this way, ChatCORE ranges from 0 (at random baseline) to 1 (peak performance)
chatcore_metric_dict = {}
if all_tasks_were_evaluated:
    centered_mean = 0
    for task_name, acc in results.items():
        baseline_acc = baseline_accuracies.get(task_name, 0.0)
        centered_acc = (acc - baseline_acc) / (1.0 - baseline_acc)
        centered_mean += centered_acc
    chatcore_metric = centered_mean / len(results)
    chatcore_metric_dict = {"ChatCORE metric": chatcore_metric}
Usage Examples
# Single task evaluation
python -m scripts.chat_eval -i sft -a ARC-Easy
# Multiple tasks
python -m scripts.chat_eval -i sft -a "ARC-Easy|MMLU|GSM8K"
# All tasks with sampling
python -m scripts.chat_eval -i sft -n 5 -t 0.7
# Distributed evaluation
torchrun --nproc_per_node=8 -m scripts.chat_eval -i sft
Tokenizer Evaluation
Analyzes compression performance of tokenizers across different text types.
Source: scripts/tok_eval.py:1-30
"""
Evaluate compression ratio of the tokenizer.
"""
from nanochat.tokenizer import get_tokenizer, RustBPETokenizer
from nanochat.dataset import parquets_iter_batched
# Random text I got from a random website this morning
news_text = r"""
(Washington, D.C., July 9, 2025)- Yesterday, Mexico's National Service of Agro-Alimentary Health...
""".strip()
Evaluation Categories
- News Text - Current events and formal writing
- Korean Text - Non-English language compression
- Code Text - Programming language efficiency
- Math Text - LaTeX mathematical notation
- Science Text - Technical scientific writing
- Training Data - FineWeb-Edu samples
Compression Metrics
- Bytes per Token Ratio - UTF-8 bytes per token; higher means better compression
- Relative Improvement - Percentage difference vs baseline
- Cross-tokenizer Comparison - GPT-2, GPT-4, and custom tokenizer
Source: scripts/tok_eval.py:150-175
for tokenizer_name in ["gpt2", "gpt4", "ours"]:
    if tokenizer_name == "gpt2":
        tokenizer = RustBPETokenizer.from_pretrained("gpt2")  # gpt-2 base model tokenizer
    elif tokenizer_name == "gpt4":
        tokenizer = RustBPETokenizer.from_pretrained("cl100k_base")  # gpt-4 base model tokenizer
    else:
        tokenizer = get_tokenizer()
    vocab_sizes[tokenizer_name] = tokenizer.get_vocab_size()
    tokenizer_results[tokenizer_name] = {}
    for name, text in all_text:
        encoded = tokenizer.encode(text)
        decoded = tokenizer.decode(encoded)
        assert decoded == text
        encoded_bytes = text.encode('utf-8')
        ratio = len(encoded_bytes) / len(encoded)
        tokenizer_results[tokenizer_name][name] = {
            'bytes': len(encoded_bytes),
            'tokens': len(encoded),
            'ratio': ratio
        }
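The relative-improvement figure follows directly from these ratios. A minimal sketch, assuming GPT-4's tokenizer is used as the reference (the choice of reference and the formatting are illustrative):

reference = "gpt4"
for name, _ in all_text:
    ours = tokenizer_results["ours"][name]["ratio"]
    ref = tokenizer_results[reference][name]["ratio"]
    rel = 100.0 * (ours - ref) / ref  # positive = more bytes per token, i.e. better compression
    print(f"{name:12s} ours={ours:.2f} {reference}={ref:.2f} ({rel:+.1f}%)")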
Core Evaluation Engine
Implements the CORE benchmark evaluation logic used by base model evaluation.
Source: nanochat/core_eval.py:1-20
"""
Functions for evaluating the CORE metric, as described in the DCLM paper.
https://arxiv.org/abs/2406.11794
TODOs:
- All tasks ~match except for squad. We get 31% reference is 37%. Figure out why.
"""
import random
from jinja2 import Template
import torch
import torch.distributed as dist
Task Types Supported
- Multiple Choice - Common prefix, different continuations
- Schema Tasks - Different contexts, common suffix
- Language Modeling - Autoregressive next token prediction
Prompt Rendering
Source: nanochat/core_eval.py:25-40
def render_prompts_mc(item, continuation_delimiter, fewshot_examples=None):
    """Render complete prompts for a multiple choice question"""
    template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}
{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}""".strip()
    template = Template(template_str)
    fewshot_examples = fewshot_examples or []
    context = {
        'fewshot_examples': fewshot_examples,
        'continuation_delimiter': continuation_delimiter,
        'item': item
    }
    prompts = [template.render(choice=choice, **context) for choice in item['choices']]
    return prompts
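A usage example with a toy item (the query/choices/gold fields match the template above; the data itself is made up):

item = {
    "query": "What is the capital of France?",
    "choices": ["Paris", "London", "Berlin"],
    "gold": 0,
}
fewshot = [{"query": "2 + 2 =", "choices": ["3", "4"], "gold": 1}]
prompts = render_prompts_mc(item, continuation_delimiter=" ", fewshot_examples=fewshot)
# prompts has one entry per choice; prompts[0] is:
# 2 + 2 = 4
# What is the capital of France? Paris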
Evaluation Process
- Few-shot Sampling - Random examples excluding current item
- Prompt Batching - Efficient sequence handling per task type
- Model Forward Pass - Get logits and predictions
- Accuracy Calculation - Task-specific success criteria (a sketch follows below)
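For the multiple choice and schema tasks, the prediction is the continuation the model finds most likely, i.e. the one with the lowest loss. A hedged end-to-end sketch for one multiple choice item, reusing render_prompts_mc from above (the model signature is assumed, and averaging over the whole prompt is a simplification; a full implementation would score only the continuation tokens):

import torch
import torch.nn.functional as F

@torch.no_grad()
def score_mc_item(model, tokenizer, item, continuation_delimiter, fewshot_examples=None):
    prompts = render_prompts_mc(item, continuation_delimiter, fewshot_examples)
    losses = []
    for prompt in prompts:
        ids = torch.tensor([tokenizer.encode(prompt)])
        logits = model(ids)  # (1, T, vocab), assuming the model returns next-token logits
        losses.append(F.cross_entropy(logits[0, :-1], ids[0, 1:]).item())
    prediction = min(range(len(losses)), key=lambda i: losses[i])  # lowest loss wins
    return prediction == item["gold"]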
Loss Evaluation
Provides bits-per-byte (BPB) metric for tokenization-independent loss measurement.
Source: nanochat/loss_eval.py:1-20
"""
A number of functions that help with evaluating a base model.
"""
import math
import torch
import torch.distributed as dist
@torch.no_grad()
def evaluate_bpb(model, batches, steps, token_bytes):
"""
Instead of the naive 'mean loss', this function returns the bits per byte (bpb),
which is a tokenization vocab size-independent metric, meaning you are still comparing
apples:apples if you change the vocab size.
"""
BPB Calculation
- Sum Loss & Bytes - Accumulate across all target tokens
- Mask Special Tokens - Exclude BOS, padding, ignored tokens
- Normalize by Bytes - Loss per byte rather than per token
- Convert to Bits - Divide by log(2) for bits per byte (see the sketch below)
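The final number is just the accumulated loss converted from nats to bits and divided by the accumulated byte count; a minimal sketch, assuming total_nats and total_bytes have been summed as in the excerpt below (and all-reduced across ranks when running distributed):

import math

bpb = total_nats.item() / math.log(2) / total_bytes.item()  # bits per byte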
Source: nanochat/loss_eval.py:35-50
# record the losses
total_nats = torch.tensor(0.0, dtype=torch.float32, device=model.get_device())
total_bytes = torch.tensor(0, dtype=torch.int64, device=model.get_device())
batch_iter = iter(batches)
for _ in range(steps):
    x, y = next(batch_iter)
    loss2d = model(x, y, loss_reduction='none')  # (B, T)
    loss2d = loss2d.view(-1)  # flatten
    y = y.view(-1)  # flatten
    if (y.int() < 0).any():  # mps does not currently have kernel for < 0 for int64, only int32
        # slightly more complex code path if some target tokens are ignore_index (e.g. -1)
        # any target token < 0 is to be ignored: do NOT index token_bytes with negatives
        valid = y >= 0
        y_safe = torch.where(valid, y, torch.zeros_like(y))
        # map valid targets to their byte length; ignored targets contribute 0 bytes
        num_bytes2d = torch.where(
            valid,
            token_bytes[y_safe],
            torch.zeros_like(y, dtype=token_bytes.dtype)
        )
Sources:
- scripts/base_eval.py
- scripts/chat_eval.py:1-25,35-70,180-195
- scripts/tok_eval.py:1-30,150-175
- nanochat/core_eval.py:1-20,25-40
- nanochat/loss_eval.py:1-20,35-50