Evaluation Scripts

The evaluation framework provides assessment tools for every stage of the pipeline: tokenizer compression, base (foundation) model quality, and chat model capabilities.

Overview

The evaluation system includes:

  • Base Model Evaluation - CORE metric assessment for foundation models
  • Chat Model Evaluation - Task-specific evaluation for chat models
  • Tokenizer Evaluation - Compression ratio analysis
  • Core Evaluation Engine - Unified evaluation infrastructure
  • Loss Evaluation - Bits-per-byte metric calculation

Key Files:

  • scripts/base_eval.py - Base model CORE evaluation
  • scripts/chat_eval.py - Chat model task evaluation
  • scripts/tok_eval.py - Tokenizer compression assessment
  • nanochat/core_eval.py - CORE metric implementation
  • nanochat/loss_eval.py - BPB evaluation utilities

Base Model Evaluation

Evaluates foundation models on the CORE benchmark, a suite of in-context learning tasks described in the DCLM paper.

Source: scripts/base_eval.py:1-50

python
"""
Evaluate the CORE metric for a given model.

Run on a single GPU:
python -m scripts.base_eval

Run with torchrun on e.g. 8 GPUs:
torchrun --nproc_per_node=8 -m scripts.base_eval

The script will print the CORE metric to the console.
"""

Key Features

  1. CORE Metric Calculation

    • Downloads evaluation bundle (~162MB) automatically
    • Centered accuracy: (accuracy - random_baseline) / (1.0 - random_baseline), with a worked example after this list
    • Aggregates scores across multiple ICL tasks
  2. Multi-Task Evaluation

    • Supports multiple choice, schema, and language modeling tasks
    • Few-shot learning with configurable examples
    • Handles various continuation delimiters
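
To make the centered accuracy in item 1 concrete, here is a tiny worked example with made-up numbers:

python
# Hypothetical numbers: a 4-way multiple-choice task with a 25% random baseline.
accuracy = 0.55
random_baseline = 0.25
centered = (accuracy - random_baseline) / (1.0 - random_baseline)
print(f"centered accuracy: {centered:.3f}")  # 0.400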

Source: scripts/base_eval.py:75-105

python
def evaluate_model(model, tokenizer, device, max_per_task=-1):
    """
    Evaluate a base model on the CORE benchmark.
    - max_per_task: crop the data to this many examples per task for testing (-1 = disable)
    """
    # Load config and task metadata
    base_dir = get_base_dir()
    eval_bundle_dir = os.path.join(base_dir, "eval_bundle")
    # Download the eval bundle to disk (and unzip if needed)
    if not os.path.exists(eval_bundle_dir):
        download_file_with_lock(EVAL_BUNDLE_URL, "eval_bundle.zip", postprocess_fn=place_eval_bundle)

Usage Examples

bash
# Single GPU evaluation
python -m scripts.base_eval

# Distributed evaluation
torchrun --nproc_per_node=8 -m scripts.base_eval

# Evaluate HuggingFace model
python -m scripts.base_eval --hf-path openai-community/gpt2

# Limited examples for testing
python -m scripts.base_eval --max-per-task 100

Chat Model Evaluation

Comprehensive evaluation suite for chat models across multiple benchmarks.

Source: scripts/chat_eval.py:1-25

python
"""
Evaluate the Chat model.
All the generic code lives here, and all the evaluation-specific
code lives in nanochat directory and is imported from here.

Example runs:
python -m scripts.chat_eval -a ARC-Easy
torchrun --nproc_per_node=8 -m scripts.chat_eval -- -a ARC-Easy
"""

Supported Tasks

  • ARC-Easy/Challenge - AI2 Reasoning Challenge
  • MMLU - Massive Multitask Language Understanding
  • GSM8K - Grade School Math Word Problems
  • HumanEval - Code generation benchmark
  • SpellingBee - Spelling accuracy test

Evaluation Types

  1. Generative Evaluation - For open-ended tasks

    • Samples multiple completions
    • Checks success criteria for any completion
    • Used for HumanEval, GSM8K, SpellingBee
  2. Categorical Evaluation - For multiple choice tasks (a minimal sketch follows this list)

    • Computes logits for answer choices only
    • More efficient batched evaluation
    • Used for ARC, MMLU
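
For illustration, here is a minimal sketch of the categorical path (item 2). It is hypothetical code, not the implementation in scripts/chat_eval.py; it scores each answer choice by the summed log-probability of its tokens and assumes a model that returns logits of shape (1, T, vocab_size):

python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_choice(model, tokenizer, prompt, choices, device):
    # Hypothetical helper: return the index of the answer choice whose tokens
    # receive the highest total log-probability under the model.
    prompt_ids = tokenizer.encode(prompt)
    scores = []
    for choice in choices:
        choice_ids = tokenizer.encode(choice)
        ids = torch.tensor([prompt_ids + choice_ids], device=device)
        logits = model(ids)  # assumed to return (1, T, vocab_size)
        logprobs = F.log_softmax(logits.float(), dim=-1)
        # log-probability of each choice token given everything before it
        start = len(prompt_ids)
        choice_logprobs = logprobs[0, start - 1:-1].gather(
            1, ids[0, start:].unsqueeze(-1)
        ).squeeze(-1)
        scores.append(choice_logprobs.sum().item())
    return max(range(len(choices)), key=lambda i: scores[i])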

Source: scripts/chat_eval.py:35-70

python
def run_generative_eval(task_object, tokenizer, model, engine, num_samples, max_new_tokens, temperature, top_k, max_problems=None):

    ddp, ddp_rank, ddp_local_rank, ddp_world_size = get_dist_info()
    device = model.get_device()

    num_problems = len(task_object) if max_problems is None else min(len(task_object), max_problems)

    # Run the evaluation
    num_passed, total = 0, 0
    for i in range(ddp_rank, num_problems, ddp_world_size):
        conversation = task_object[i]

        # Tokenize the prompt
        encoded_prompt = tokenizer.render_for_completion(conversation)
        # Get the completions
        results, _ = engine.generate_batch(
            encoded_prompt,
            num_samples=num_samples,
            max_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k,
        )

ChatCORE Metric

The ChatCORE metric mirrors CORE for chat models. It is the mean centered accuracy across all evaluated tasks and ranges from 0 at the random baseline to 1 at perfect performance:

Source: scripts/chat_eval.py:180-195

python
# calculate the ChatCORE metric if we can (similar to CORE, it's the mean centered accuracy)
# this way, ChatCORE ranges from 0 (at random baseline) to 1 (peak performance)
chatcore_metric_dict = {}
if all_tasks_were_evaluated:
    centered_mean = 0
    for task_name, acc in results.items():
        baseline_acc = baseline_accuracies.get(task_name, 0.0)
        centered_acc = (acc - baseline_acc) / (1.0 - baseline_acc)
        centered_mean += centered_acc
    chatcore_metric = centered_mean / len(results)
    chatcore_metric_dict = {"ChatCORE metric": chatcore_metric}

Usage Examples

bash
# Single task evaluation
python -m scripts.chat_eval -i sft -a ARC-Easy

# Multiple tasks
python -m scripts.chat_eval -i sft -a "ARC-Easy|MMLU|GSM8K"

# All tasks with sampling
python -m scripts.chat_eval -i sft -n 5 -t 0.7

# Distributed evaluation
torchrun --nproc_per_node=8 -m scripts.chat_eval -i sft

Tokenizer Evaluation

Analyzes compression performance of tokenizers across different text types.

Source: scripts/tok_eval.py:1-30

python
"""
Evaluate compression ratio of the tokenizer.
"""

from nanochat.tokenizer import get_tokenizer, RustBPETokenizer
from nanochat.dataset import parquets_iter_batched

# Random text I got from a random website this morning
news_text = r"""
(Washington, D.C., July 9, 2025)- Yesterday, Mexico's National Service of Agro-Alimentary Health...
""".strip()

Evaluation Categories

  1. News Text - Current events and formal writing
  2. Korean Text - Non-English language compression
  3. Code Text - Programming language efficiency
  4. Math Text - LaTeX mathematical notation
  5. Science Text - Technical scientific writing
  6. Training Data - FineWeb-Edu samples

Compression Metrics

  • Bytes per Token Ratio - Higher means better compression
  • Relative Improvement - Percentage difference vs. a baseline tokenizer (see the sketch after the source excerpt below)
  • Cross-tokenizer Comparison - GPT-2, GPT-4, and the custom tokenizer

Source: scripts/tok_eval.py:150-175

python
for tokenizer_name in ["gpt2", "gpt4", "ours"]:

    if tokenizer_name == "gpt2":
        tokenizer = RustBPETokenizer.from_pretrained("gpt2") # gpt-2 base model tokenizer
    elif tokenizer_name == "gpt4":
        tokenizer = RustBPETokenizer.from_pretrained("cl100k_base") # gpt-4 base model tokenizer
    else:
        tokenizer = get_tokenizer()

    vocab_sizes[tokenizer_name] = tokenizer.get_vocab_size()
    tokenizer_results[tokenizer_name] = {}

    for name, text in all_text:
        encoded = tokenizer.encode(text)
        decoded = tokenizer.decode(encoded)
        assert decoded == text

        encoded_bytes = text.encode('utf-8')
        ratio = len(encoded_bytes) / len(encoded)
        tokenizer_results[tokenizer_name][name] = {
            'bytes': len(encoded_bytes),
            'tokens': len(encoded),
            'ratio': ratio
        }
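
The relative-improvement numbers listed under Compression Metrics are not part of the excerpt above; a small, self-contained sketch of that arithmetic, using made-up ratios, could look like this:

python
def relative_improvement(ours_ratio: float, baseline_ratio: float) -> float:
    """Percentage change in bytes per token versus a baseline tokenizer (positive = better compression)."""
    return 100.0 * (ours_ratio - baseline_ratio) / baseline_ratio

# hypothetical ratios: 4.8 bytes/token for our tokenizer vs 4.2 for GPT-2
print(f"{relative_improvement(4.8, 4.2):+.1f}% vs gpt2")  # +14.3% vs gpt2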

Core Evaluation Engine

Implements the CORE benchmark evaluation logic used by base model evaluation.

Source: nanochat/core_eval.py:1-20

python
"""
Functions for evaluating the CORE metric, as described in the DCLM paper.
https://arxiv.org/abs/2406.11794

TODOs:
- All tasks ~match except for squad. We get 31% reference is 37%. Figure out why.
"""
import random

from jinja2 import Template
import torch
import torch.distributed as dist

Task Types Supported

  1. Multiple Choice - Common prefix, different continuations
  2. Schema Tasks - Different contexts, common suffix
  3. Language Modeling - Autoregressive next token prediction

Prompt Rendering

Source: nanochat/core_eval.py:25-40

python
def render_prompts_mc(item, continuation_delimiter, fewshot_examples=None):
    """Render complete prompts for a multiple choice question"""
    template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}

{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}""".strip()
    template = Template(template_str)
    fewshot_examples = fewshot_examples or []
    context = {
        'fewshot_examples': fewshot_examples,
        'continuation_delimiter': continuation_delimiter,
        'item': item
    }
    prompts = [template.render(choice=choice, **context) for choice in item['choices']]
    return prompts
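
A quick usage example with a hypothetical item; the field names query, choices, and gold come from the template above, but the data itself is made up:

python
item = {
    'query': 'Question: Which planet is known as the Red Planet?\nAnswer:',
    'choices': [' Venus', ' Mars', ' Jupiter'],
    'gold': 1,
}
prompts = render_prompts_mc(item, continuation_delimiter='')
# one rendered prompt per choice; prompts[1] ends with "Answer: Mars"
for prompt in prompts:
    print(repr(prompt))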

Evaluation Process

  1. Few-shot Sampling - Random examples, excluding the current item (sketch after this list)
  2. Prompt Batching - Efficient sequence handling per task type
  3. Model Forward Pass - Get logits and predictions
  4. Accuracy Calculation - Task-specific success criteria
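
Step 1 can be illustrated with a small, hypothetical helper (not the function used in nanochat/core_eval.py): few-shot examples are drawn at random from the task data, never including the item currently being evaluated.

python
import random

def sample_fewshot(data, current_index, num_fewshot, rng):
    # Hypothetical helper: choose few-shot examples uniformly at random,
    # excluding the item that is currently being evaluated.
    candidates = [j for j in range(len(data)) if j != current_index]
    return [data[j] for j in rng.sample(candidates, num_fewshot)]

# usage with a dummy dataset of 100 items
rng = random.Random(1234)
fewshot = sample_fewshot(list(range(100)), current_index=5, num_fewshot=3, rng=rng)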

Loss Evaluation

Provides the bits-per-byte (BPB) metric, a loss measurement that stays comparable across tokenizers with different vocabulary sizes.

Source: nanochat/loss_eval.py:1-20

python
"""
A number of functions that help with evaluating a base model.
"""
import math
import torch
import torch.distributed as dist

@torch.no_grad()
def evaluate_bpb(model, batches, steps, token_bytes):
    """
    Instead of the naive 'mean loss', this function returns the bits per byte (bpb),
    which is a tokenization vocab size-independent metric, meaning you are still comparing
    apples:apples if you change the vocab size.
    """

BPB Calculation

  1. Sum Loss & Bytes - Accumulate across all target tokens
  2. Mask Special Tokens - Exclude BOS, padding, ignored tokens
  3. Normalize by Bytes - Loss per byte rather than per token
  4. Convert to Bits - Divide by log(2) for bits per byte (see the sketch after the source excerpt below)

Source: nanochat/loss_eval.py:35-50

python
# record the losses
total_nats = torch.tensor(0.0, dtype=torch.float32, device=model.get_device())
total_bytes = torch.tensor(0, dtype=torch.int64, device=model.get_device())
batch_iter = iter(batches)
for _ in range(steps):
    x, y = next(batch_iter)
    loss2d = model(x, y, loss_reduction='none') # (B, T)
    loss2d = loss2d.view(-1) # flatten
    y = y.view(-1) # flatten
    if (y.int() < 0).any(): # mps does not currently have kernel for < 0 for int64, only int32
        # slightly more complex code path if some target tokens are ignore_index (e.g. -1)
        # any target token < 0 is to be ignored: do NOT index token_bytes with negatives
        valid = y >= 0
        y_safe = torch.where(valid, y, torch.zeros_like(y))
        # map valid targets to their byte length; ignored targets contribute 0 bytes
        num_bytes2d = torch.where(
            valid,
            token_bytes[y_safe],
            torch.zeros_like(y, dtype=token_bytes.dtype)
        )
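
The excerpt stops before the final reduction. As a minimal sketch (not the exact code in nanochat/loss_eval.py, and ignoring the distributed all-reduce a multi-GPU run would need), the accumulated nats and byte counts convert to bits per byte like this:

python
import math
import torch

def to_bits_per_byte(total_nats: torch.Tensor, total_bytes: torch.Tensor) -> float:
    """Convert summed cross-entropy (nats) and summed target byte counts into bits per byte."""
    # nats per byte, then divide by ln(2) to turn nats into bits
    return (total_nats.double() / total_bytes.double() / math.log(2)).item()

# hypothetical totals: 1.5M nats of loss over 1.0M target bytes is about 2.16 bpb
print(to_bits_per_byte(torch.tensor(1.5e6), torch.tensor(1.0e6)))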

Sources:

  • scripts/base_eval.py:1-50,75-105
  • scripts/chat_eval.py:1-25,35-70,180-195
  • scripts/tok_eval.py:1-30,150-175
  • nanochat/core_eval.py:1-20,25-40
  • nanochat/loss_eval.py:1-20,35-50