Evaluation Scripts

The evaluation framework provides assessment tools for every stage of the pipeline: tokenizer compression, base (foundation) model quality, and chat model capabilities.

Overview

The evaluation system includes:

  • Base Model Evaluation - CORE metric assessment for foundation models
  • Chat Model Evaluation - Task-specific evaluation for chat models
  • Tokenizer Evaluation - Compression ratio analysis
  • Core Evaluation Engine - Unified evaluation infrastructure
  • Loss Evaluation - Bits-per-byte metric calculation

Key Files:

  • scripts/base_eval.py - Base model CORE evaluation
  • scripts/chat_eval.py - Chat model task evaluation
  • scripts/tok_eval.py - Tokenizer compression assessment
  • nanochat/core_eval.py - CORE metric implementation
  • nanochat/loss_eval.py - BPB evaluation utilities

Base Model Evaluation

Evaluates foundation models on the CORE benchmark, a suite of in-context learning tasks described in the DCLM paper.

Source: scripts/base_eval.py:1-50

python
"""
Evaluate the CORE metric for a given model.

Run on a single GPU:
python -m scripts.base_eval

Run with torchrun on e.g. 8 GPUs:
torchrun --nproc_per_node=8 -m scripts.base_eval

The script will print the CORE metric to the console.
"""

Key Features

  1. CORE Metric Calculation

    • Downloads evaluation bundle (~162MB) automatically
    • Centered accuracy: (accuracy - random_baseline) / (1.0 - random_baseline), with a worked example after this list
    • Aggregates scores across multiple ICL tasks
  2. Multi-Task Evaluation

    • Supports multiple choice, schema, and language modeling tasks
    • Few-shot learning with configurable examples
    • Handles various continuation delimiters
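
To make the centered accuracy in item 1 concrete, here is a tiny worked example with made-up numbers:

python
# Hypothetical numbers: a 4-way multiple-choice task with a 25% random baseline.
accuracy = 0.55
random_baseline = 0.25
centered = (accuracy - random_baseline) / (1.0 - random_baseline)
print(f"centered accuracy: {centered:.3f}")  # 0.400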

Source: scripts/base_eval.py:75-105

python
def evaluate_model(model, tokenizer, device, max_per_task=-1):
    """
    Evaluate a base model on the CORE benchmark.
    - max_per_task: crop the data to this many examples per task for testing (-1 = disable)
    """
    # Load config and task metadata
    base_dir = get_base_dir()
    eval_bundle_dir = os.path.join(base_dir, "eval_bundle")
    # Download the eval bundle to disk (and unzip if needed)
    if not os.path.exists(eval_bundle_dir):
        download_file_with_lock(EVAL_BUNDLE_URL, "eval_bundle.zip", postprocess_fn=place_eval_bundle)

Usage Examples

bash
# Single GPU evaluation
python -m scripts.base_eval

# Distributed evaluation
torchrun --nproc_per_node=8 -m scripts.base_eval

# Evaluate HuggingFace model
python -m scripts.base_eval --hf-path openai-community/gpt2

# Limited examples for testing
python -m scripts.base_eval --max-per-task 100

Chat Model Evaluation

Comprehensive evaluation suite for chat models across multiple benchmarks.

Source: scripts/chat_eval.py:1-25

python
"""
Evaluate the Chat model.
All the generic code lives here, and all the evaluation-specific
code lives in nanochat directory and is imported from here.

Example runs:
python -m scripts.chat_eval -a ARC-Easy
torchrun --nproc_per_node=8 -m scripts.chat_eval -- -a ARC-Easy
"""

Supported Tasks

  • ARC-Easy/Challenge - AI2 Reasoning Challenge
  • MMLU - Massive Multitask Language Understanding
  • GSM8K - Grade School Math Word Problems
  • HumanEval - Code generation benchmark
  • SpellingBee - Spelling accuracy test

Evaluation Types

  1. Generative Evaluation - For open-ended tasks

    • Samples multiple completions
    • Checks success criteria for any completion
    • Used for HumanEval, GSM8K, SpellingBee
  2. Categorical Evaluation - For multiple choice tasks (a minimal sketch follows this list)

    • Computes logits for answer choices only
    • More efficient batched evaluation
    • Used for ARC, MMLU
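
For illustration, here is a minimal sketch of the categorical path (item 2). It is hypothetical code, not the implementation in scripts/chat_eval.py; it scores each answer choice by the summed log-probability of its tokens and assumes a model that returns logits of shape (1, T, vocab_size):

python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_choice(model, tokenizer, prompt, choices, device):
    # Hypothetical helper: return the index of the answer choice whose tokens
    # receive the highest total log-probability under the model.
    prompt_ids = tokenizer.encode(prompt)
    scores = []
    for choice in choices:
        choice_ids = tokenizer.encode(choice)
        ids = torch.tensor([prompt_ids + choice_ids], device=device)
        logits = model(ids)  # assumed to return (1, T, vocab_size)
        logprobs = F.log_softmax(logits.float(), dim=-1)
        # log-probability of each choice token given everything before it
        start = len(prompt_ids)
        choice_logprobs = logprobs[0, start - 1:-1].gather(
            1, ids[0, start:].unsqueeze(-1)
        ).squeeze(-1)
        scores.append(choice_logprobs.sum().item())
    return max(range(len(choices)), key=lambda i: scores[i])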

Source: scripts/chat_eval.py:35-70

python
def run_generative_eval(task_object, tokenizer, model, engine, num_samples, max_new_tokens, temperature, top_k, max_problems=None):

    ddp, ddp_rank, ddp_local_rank, ddp_world_size = get_dist_info()
    device = model.get_device()

    num_problems = len(task_object) if max_problems is None else min(len(task_object), max_problems)

    # Run the evaluation
    num_passed, total = 0, 0
    for i in range(ddp_rank, num_problems, ddp_world_size):
        conversation = task_object[i]

        # Tokenize the prompt
        encoded_prompt = tokenizer.render_for_completion(conversation)
        # Get the completions
        results, _ = engine.generate_batch(
            encoded_prompt,
            num_samples=num_samples,
            max_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k,
        )

ChatCORE Metric

The ChatCORE metric mirrors CORE for chat models. It is the mean centered accuracy across all evaluated tasks and ranges from 0 at the random baseline to 1 at perfect performance:

Source: scripts/chat_eval.py:180-195

python
# calculate the ChatCORE metric if we can (similar to CORE, it's the mean centered accuracy)
# this way, ChatCORE ranges from 0 (at random baseline) to 1 (peak performance)
chatcore_metric_dict = {}
if all_tasks_were_evaluated:
    centered_mean = 0
    for task_name, acc in results.items():
        baseline_acc = baseline_accuracies.get(task_name, 0.0)
        centered_acc = (acc - baseline_acc) / (1.0 - baseline_acc)
        centered_mean += centered_acc
    chatcore_metric = centered_mean / len(results)
    chatcore_metric_dict = {"ChatCORE metric": chatcore_metric}

Usage Examples

bash
# Single task evaluation
python -m scripts.chat_eval -i sft -a ARC-Easy

# Multiple tasks
python -m scripts.chat_eval -i sft -a "ARC-Easy|MMLU|GSM8K"

# All tasks with sampling
python -m scripts.chat_eval -i sft -n 5 -t 0.7

# Distributed evaluation
torchrun --nproc_per_node=8 -m scripts.chat_eval -i sft

Tokenizer Evaluation

Analyzes compression performance of tokenizers across different text types.

Source: scripts/tok_eval.py:1-30

python
"""
Evaluate compression ratio of the tokenizer.
"""

from nanochat.tokenizer import get_tokenizer, RustBPETokenizer
from nanochat.dataset import parquets_iter_batched

# Random text I got from a random website this morning
news_text = r"""
(Washington, D.C., July 9, 2025)- Yesterday, Mexico's National Service of Agro-Alimentary Health...
""".strip()

Evaluation Categories

  1. News Text - Current events and formal writing
  2. Korean Text - Non-English language compression
  3. Code Text - Programming language efficiency
  4. Math Text - LaTeX mathematical notation
  5. Science Text - Technical scientific writing
  6. Training Data - FineWeb-Edu samples

Compression Metrics

  • Bytes per Token Ratio - Higher means better compression
  • Relative Improvement - Percentage difference vs. a baseline tokenizer (see the sketch after the source excerpt below)
  • Cross-tokenizer Comparison - GPT-2, GPT-4, and the custom tokenizer

Source: scripts/tok_eval.py:150-175

python
for tokenizer_name in ["gpt2", "gpt4", "ours"]:

    if tokenizer_name == "gpt2":
        tokenizer = RustBPETokenizer.from_pretrained("gpt2") # gpt-2 base model tokenizer
    elif tokenizer_name == "gpt4":
        tokenizer = RustBPETokenizer.from_pretrained("cl100k_base") # gpt-4 base model tokenizer
    else:
        tokenizer = get_tokenizer()

    vocab_sizes[tokenizer_name] = tokenizer.get_vocab_size()
    tokenizer_results[tokenizer_name] = {}

    for name, text in all_text:
        encoded = tokenizer.encode(text)
        decoded = tokenizer.decode(encoded)
        assert decoded == text

        encoded_bytes = text.encode('utf-8')
        ratio = len(encoded_bytes) / len(encoded)
        tokenizer_results[tokenizer_name][name] = {
            'bytes': len(encoded_bytes),
            'tokens': len(encoded),
            'ratio': ratio
        }
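
The relative-improvement numbers listed under Compression Metrics are not part of the excerpt above; a small, self-contained sketch of that arithmetic, using made-up ratios, could look like this:

python
def relative_improvement(ours_ratio: float, baseline_ratio: float) -> float:
    """Percentage change in bytes per token versus a baseline tokenizer (positive = better compression)."""
    return 100.0 * (ours_ratio - baseline_ratio) / baseline_ratio

# hypothetical ratios: 4.8 bytes/token for our tokenizer vs 4.2 for GPT-2
print(f"{relative_improvement(4.8, 4.2):+.1f}% vs gpt2")  # +14.3% vs gpt2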

Core Evaluation Engine

Implements the CORE benchmark evaluation logic used by base model evaluation.

Source: nanochat/core_eval.py:1-20

python
"""
Functions for evaluating the CORE metric, as described in the DCLM paper.
https://arxiv.org/abs/2406.11794

TODOs:
- All tasks ~match except for squad. We get 31% reference is 37%. Figure out why.
"""
import random

from jinja2 import Template
import torch
import torch.distributed as dist

Task Types Supported

  1. Multiple Choice - Common prefix, different continuations
  2. Schema Tasks - Different contexts, common suffix
  3. Language Modeling - Autoregressive next token prediction

Prompt Rendering

Source: nanochat/core_eval.py:25-40

python
def render_prompts_mc(item, continuation_delimiter, fewshot_examples=None):
    """Render complete prompts for a multiple choice question"""
    template_str = """
{%- for example in fewshot_examples -%}
{{ example.query }}{{ continuation_delimiter }}{{ example.choices[example.gold] }}

{% endfor -%}
{{ item.query }}{{ continuation_delimiter }}{{ choice }}""".strip()
    template = Template(template_str)
    fewshot_examples = fewshot_examples or []
    context = {
        'fewshot_examples': fewshot_examples,
        'continuation_delimiter': continuation_delimiter,
        'item': item
    }
    prompts = [template.render(choice=choice, **context) for choice in item['choices']]
    return prompts
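
A quick usage example with a hypothetical item; the field names query, choices, and gold come from the template above, but the data itself is made up:

python
item = {
    'query': 'Question: Which planet is known as the Red Planet?\nAnswer:',
    'choices': [' Venus', ' Mars', ' Jupiter'],
    'gold': 1,
}
prompts = render_prompts_mc(item, continuation_delimiter='')
# one rendered prompt per choice; prompts[1] ends with "Answer: Mars"
for prompt in prompts:
    print(repr(prompt))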

Evaluation Process

  1. Few-shot Sampling - Random examples, excluding the current item (sketch after this list)
  2. Prompt Batching - Efficient sequence handling per task type
  3. Model Forward Pass - Get logits and predictions
  4. Accuracy Calculation - Task-specific success criteria
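
Step 1 can be illustrated with a small, hypothetical helper (not the function used in nanochat/core_eval.py): few-shot examples are drawn at random from the task data, never including the item currently being evaluated.

python
import random

def sample_fewshot(data, current_index, num_fewshot, rng):
    # Hypothetical helper: choose few-shot examples uniformly at random,
    # excluding the item that is currently being evaluated.
    candidates = [j for j in range(len(data)) if j != current_index]
    return [data[j] for j in rng.sample(candidates, num_fewshot)]

# usage with a dummy dataset of 100 items
rng = random.Random(1234)
fewshot = sample_fewshot(list(range(100)), current_index=5, num_fewshot=3, rng=rng)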

Loss Evaluation

Provides the bits-per-byte (BPB) metric, a loss measurement that stays comparable across tokenizers with different vocabulary sizes.

Source: nanochat/loss_eval.py:1-20

python
"""
A number of functions that help with evaluating a base model.
"""
import math
import torch
import torch.distributed as dist

@torch.no_grad()
def evaluate_bpb(model, batches, steps, token_bytes):
    """
    Instead of the naive 'mean loss', this function returns the bits per byte (bpb),
    which is a tokenization vocab size-independent metric, meaning you are still comparing
    apples:apples if you change the vocab size.
    """

BPB Calculation

  1. Sum Loss & Bytes - Accumulate across all target tokens
  2. Mask Special Tokens - Exclude BOS, padding, ignored tokens
  3. Normalize by Bytes - Loss per byte rather than per token
  4. Convert to Bits - Divide by log(2) for bits per byte (see the sketch after the source excerpt below)

Source: nanochat/loss_eval.py:35-50

python
# record the losses
total_nats = torch.tensor(0.0, dtype=torch.float32, device=model.get_device())
total_bytes = torch.tensor(0, dtype=torch.int64, device=model.get_device())
batch_iter = iter(batches)
for _ in range(steps):
    x, y = next(batch_iter)
    loss2d = model(x, y, loss_reduction='none') # (B, T)
    loss2d = loss2d.view(-1) # flatten
    y = y.view(-1) # flatten
    if (y.int() < 0).any(): # mps does not currently have kernel for < 0 for int64, only int32
        # slightly more complex code path if some target tokens are ignore_index (e.g. -1)
        # any target token < 0 is to be ignored: do NOT index token_bytes with negatives
        valid = y >= 0
        y_safe = torch.where(valid, y, torch.zeros_like(y))
        # map valid targets to their byte length; ignored targets contribute 0 bytes
        num_bytes2d = torch.where(
            valid,
            token_bytes[y_safe],
            torch.zeros_like(y, dtype=token_bytes.dtype)
        )
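
The excerpt stops before the final reduction. As a minimal sketch (not the exact code in nanochat/loss_eval.py, and ignoring the distributed all-reduce a multi-GPU run would need), the accumulated nats and byte counts convert to bits per byte like this:

python
import math
import torch

def to_bits_per_byte(total_nats: torch.Tensor, total_bytes: torch.Tensor) -> float:
    """Convert summed cross-entropy (nats) and summed target byte counts into bits per byte."""
    # nats per byte, then divide by ln(2) to turn nats into bits
    return (total_nats.double() / total_bytes.double() / math.log(2)).item()

# hypothetical totals: 1.5M nats of loss over 1.0M target bytes is about 2.16 bpb
print(to_bits_per_byte(torch.tensor(1.5e6), torch.tensor(1.0e6)))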

Sources:

  • scripts/base_eval.py:1-50,75-105
  • scripts/chat_eval.py:1-25,35-70,180-195
  • scripts/tok_eval.py:1-30,150-175
  • nanochat/core_eval.py:1-20,25-40
  • nanochat/loss_eval.py:1-20,35-50