# Training Your Model
A complete guide to training nanochat models, from tokenizer creation through reinforcement-learning fine-tuning.
## Overview
Nanochat uses a multi-stage training pipeline that progressively builds capabilities:

1. **Tokenizer Training**: Create a custom vocabulary optimized for your data
2. **Base Model Training**: Learn fundamental language patterns
3. **Mid-Training**: Expand knowledge and capabilities
4. **Supervised Fine-Tuning**: Learn conversation format and tasks
5. **Reinforcement Learning**: Optimize for human preferences
Each stage builds on the previous one, creating increasingly capable models.
## Quick Start
For a complete training run, use the provided speedrun script:
```bash
# Complete end-to-end training
bash speedrun.sh

# Train specific stages only
bash speedrun.sh tok   # Tokenizer only
bash speedrun.sh base  # Base model only
bash speedrun.sh sft   # SFT only (requires base model)
```
The speedrun script automatically handles dependencies and stages.
## Stage 1: Tokenizer Training
Create a custom tokenizer optimized for your training data.
### Basic Training

```bash
# Train tokenizer on default data
python -m scripts.tok_train

# Custom vocabulary size
python -m scripts.tok_train --vocab-size 32000

# Train on specific data splits
python -m scripts.tok_train --train-ratio 0.95 --val-ratio 0.05
```
### Advanced Configuration

```bash
# Custom tokenizer with specific settings
python -m scripts.tok_train \
  --vocab-size 32768 \
  --train-ratio 0.9 \
  --val-ratio 0.1 \
  --num-merges 32512 \
  --vocab-dir custom_tokenizer
```
**Key Parameters:**

- `--vocab-size`: Total vocabulary size (default: 32768)
- `--num-merges`: Number of BPE merges (vocab_size - 256)
- `--train-ratio`: Training data ratio (default: 0.9)
- `--val-ratio`: Validation data ratio (default: 0.1)
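The relationship between vocabulary size and merge count can be checked with a quick calculation: a byte-level BPE vocabulary starts from the 256 possible byte values, and each merge rule adds exactly one new token (this sketch ignores any special tokens the tokenizer may reserve).

```python
# A byte-level BPE vocabulary starts from the 256 byte values;
# each merge rule then adds exactly one new token.
BYTE_TOKENS = 256

def num_merges(vocab_size: int) -> int:
    """Number of BPE merges needed to reach the target vocabulary size."""
    return vocab_size - BYTE_TOKENS

print(num_merges(32768))  # matches the --num-merges 32512 example above
```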
### Output Files

Tokenizer training produces:

- `tokenizer.json`: Tokenizer configuration
- `vocab.txt`: Vocabulary mapping
- `merges.txt`: BPE merge rules
## Stage 2: Base Model Training
Train the foundation transformer model on language modeling.
### Single-GPU Training

```bash
# Basic base model training
python -m scripts.base_train

# With custom settings
python -m scripts.base_train \
  --model-size tiny \
  --seq-len 1024 \
  --batch-size 128 \
  --lr 1e-3
```
### Multi-GPU Training

```bash
# Distributed training on 8 GPUs
torchrun --nproc_per_node=8 -m scripts.base_train \
  --model-size small \
  --batch-size 64 \
  --total-steps 50000
```
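As a sanity check on training scale, the total token count for a run like the one above is the product of per-GPU batch size, sequence length, step count, and GPU count (assuming `--batch-size` is per GPU and the default 1024-token context):

```python
# Rough count of tokens processed by the 8-GPU run above.
# Assumes --batch-size is per GPU and a fixed 1024-token sequence length.
batch_size = 64       # per GPU
seq_len = 1024        # default context length
total_steps = 50_000
num_gpus = 8

tokens_processed = batch_size * seq_len * total_steps * num_gpus
print(f"{tokens_processed / 1e9:.1f}B tokens")
```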
### Model Sizes
Available model configurations:
| Size | Parameters | Layers | Hidden | Heads | Context |
|---|---|---|---|---|---|
| tiny | 20M | 12 | 768 | 12 | 1024 |
| small | 124M | 12 | 768 | 12 | 1024 |
| medium | 354M | 24 | 1024 | 16 | 1024 |
| large | 774M | 36 | 1280 | 20 | 1024 |
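A rough parameter count for GPT-style configurations like those in the table can be estimated from depth and width alone. This is a sketch using the standard ~12·L·H² per-block approximation plus tied token embeddings; the vocabulary size used here is an assumption chosen to reproduce the familiar 124M figure, not a nanochat default.

```python
def approx_params(layers: int, hidden: int, vocab: int) -> int:
    # Each transformer block has roughly 12 * hidden^2 parameters
    # (attention projections plus a 4x-wide MLP), and the token
    # embedding matrix (tied with the output head) adds vocab * hidden.
    return 12 * layers * hidden**2 + vocab * hidden

# The "small" row (12 layers, 768 hidden) lands near the familiar
# 124M figure when paired with a GPT-2-sized vocabulary.
print(f"{approx_params(12, 768, 50257) / 1e6:.0f}M")
```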
### Training Parameters

**Core Settings:**

- `--model-size`: Model architecture size
- `--seq-len`: Context length (default: 1024)
- `--batch-size`: Per-GPU batch size
- `--total-steps`: Total training steps
- `--lr`: Peak learning rate

**Optimization:**

- `--optimizer`: `adamw` or `muon` (default: adamw)
- `--weight-decay`: L2 regularization strength
- `--lr-warmup-steps`: Learning rate warmup steps
- `--lr-decay-steps`: Learning rate decay steps
## Stage 3: Mid-Training (Optional)
Specialized training phase for knowledge expansion.
```bash
# Mid-training on base model
python -m scripts.mid_train \
  --base-model-tag base_v1 \
  --total-steps 10000 \
  --lr 5e-4
```
Mid-training uses:
- Lower learning rate than base training
- Specialized datasets (e.g., math, code)
- Continued pretraining approach
## Stage 4: Supervised Fine-Tuning
Train the model to follow conversation format and complete tasks.
### Basic SFT

```bash
# SFT on default tasks
python -m scripts.chat_sft

# Custom task mixture
python -m scripts.chat_sft \
  --tasks "GSM8K,MMLU,HumanEval" \
  --total-steps 5000
```
### Multi-GPU SFT

```bash
# Distributed SFT training
torchrun --nproc_per_node=4 -m scripts.chat_sft \
  --base-model base_model_v1 \
  --batch-size 32 \
  --lr 1e-4
```
### Task Selection
Available tasks for SFT:
- GSM8K: Grade school mathematics
- MMLU: Broad knowledge questions
- HumanEval: Code generation
- ARC: Reasoning challenges
- SpellingBee: Letter counting (custom)
```bash
# Train on specific task subset
python -m scripts.chat_sft --tasks "GSM8K,HumanEval"

# Custom task weights
python -m scripts.chat_sft --task-weights "GSM8K:2,MMLU:1,HumanEval:3"
```
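The weight string above can be turned into sampling probabilities. This is a sketch of how such a spec might be parsed; the actual parsing logic inside `scripts.chat_sft` is an assumption.

```python
def parse_task_weights(spec: str) -> dict:
    """Parse a "Task:weight,..." spec into normalized sampling probabilities."""
    weights = {}
    for item in spec.split(","):
        task, _, weight = item.partition(":")
        weights[task.strip()] = float(weight)
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}

probs = parse_task_weights("GSM8K:2,MMLU:1,HumanEval:3")
print(probs)  # GSM8K sampled twice as often as MMLU
```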
## Stage 5: Reinforcement Learning
Optimize model behavior using human preference data.
### Basic RL Training

```bash
# RL training from SFT checkpoint
python -m scripts.chat_rl \
  --sft-model sft_model_v1 \
  --total-steps 1000
```
### Advanced RL Configuration

```bash
# Custom RL training
python -m scripts.chat_rl \
  --sft-model sft_model_v1 \
  --reward-model reward_v1 \
  --kl-penalty 0.1 \
  --batch-size 16 \
  --ppo-epochs 4
```
**RL Parameters:**

- `--kl-penalty`: KL divergence penalty weight
- `--ppo-epochs`: PPO optimization epochs
- `--clip-range`: PPO clipping parameter
- `--value-loss-coeff`: Value function loss weight
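To make the clipping and KL terms concrete, here is a minimal per-token sketch of the PPO-style objective these flags control. This is scalar math for illustration only; the real implementation operates on batched tensors, and the exact loss composition in `scripts.chat_rl` is an assumption.

```python
import math

def ppo_token_loss(logp_new, logp_old, logp_ref, advantage,
                   clip_range=0.2, kl_penalty=0.1):
    """Clipped PPO surrogate plus a KL penalty toward the reference (SFT) policy."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip_range), 1 - clip_range)
    # Take the pessimistic (min) of the clipped and unclipped surrogates.
    surrogate = min(ratio * advantage, clipped * advantage)
    # Penalize drift away from the reference policy.
    kl = logp_new - logp_ref
    return -surrogate + kl_penalty * kl

loss = ppo_token_loss(logp_new=-1.0, logp_old=-1.2, logp_ref=-1.1, advantage=0.5)
```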
## Training Configuration
### Hardware Requirements

**Minimum Requirements:**
- Single GPU: 8GB+ VRAM for tiny/small models
- Multi-GPU: 4x 8GB GPUs for medium models
- Large models: 8x 16GB+ GPUs recommended
**Memory Optimization:**
- Use gradient checkpointing for large models
- Enable mixed precision (bfloat16)
- Adjust batch size based on VRAM
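A back-of-the-envelope VRAM estimate helps when picking a batch size. This sketch assumes mixed-precision AdamW, which keeps roughly 16 bytes of state per parameter; the exact breakdown varies by implementation, and activation memory (which scales with batch size and sequence length) is not included.

```python
def model_state_gb(params: int, bytes_per_param: int = 16) -> float:
    # Roughly: bf16 weights (2) + bf16 grads (2) + fp32 master copy (4)
    # + two fp32 AdamW moments (8) = 16 bytes per parameter,
    # not counting activations.
    return params * bytes_per_param / 1024**3

print(f"{model_state_gb(124_000_000):.1f} GB")  # "small" model, before activations
```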
### Data Preparation

Training data should be in conversation format:

```json
{
  "messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}
```
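A small validator can catch format drift before a run wastes compute. This is a sketch: the role set and any field requirements beyond `role`/`content` are assumptions.

```python
VALID_ROLES = {"user", "assistant", "system"}

def validate_conversation(record: dict) -> bool:
    """Check that a record matches the conversation schema shown above."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for msg in messages:
        if not isinstance(msg, dict):
            return False
        if msg.get("role") not in VALID_ROLES:
            return False
        if not isinstance(msg.get("content"), str) or not msg["content"]:
            return False
    return True

sample = {"messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]}
print(validate_conversation(sample))
```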
**Data Sources:**
- FineWeb-Edu (base training)
- OpenHermes (SFT)
- Custom conversation datasets
## Monitoring Training

Track training progress with the built-in reporting:

```bash
# View training report
cat ~/nanochat_data/report.md

# Monitor loss in real time
tail -f training.log
```
**Key Metrics:**
- Training loss (should decrease steadily)
- Validation loss (watch for overfitting)
- Learning rate schedule
- Memory usage per GPU
## Resuming Training

All scripts support checkpoint resumption:

```bash
# Resume from latest checkpoint
python -m scripts.base_train --resume

# Resume from specific step
python -m scripts.base_train --resume --resume-step 10000

# Resume with different settings
python -m scripts.base_train --resume --lr 5e-4
```
## Best Practices
### Model Sizing
- Start with tiny/small models for experimentation
- Scale up only when needed
- Consider compute budget vs. performance trade-offs
### Hyperparameter Tuning
- Use learning rate warmup (10% of total steps)
- Decay learning rate in final 10% of training
- Monitor loss curves for optimal stopping
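The warmup/decay guidance above can be expressed as a simple schedule function. This is a sketch with linear ramps; the shape of nanochat's actual scheduler is an assumption.

```python
def lr_at(step: int, total_steps: int, peak_lr: float) -> float:
    """Linear warmup over the first 10% of steps, linear decay over the last 10%."""
    warmup = total_steps // 10
    decay_start = total_steps - warmup
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    if step >= decay_start:
        return peak_lr * (total_steps - step) / warmup
    return peak_lr

# Full peak LR through the middle 80% of a 1000-step run
print(lr_at(500, 1000, 1e-3))
```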
### Data Quality
- Clean and deduplicate training data
- Balance different data sources
- Validate conversation format consistency
### Debugging
- Start with small batch sizes to test setup
- Use gradient clipping (max_norm=1.0)
- Check for NaN losses or gradient explosions
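The clipping and NaN checks above amount to a few lines per optimizer step. Here is a framework-free sketch of clipping by global L2 norm (max_norm=1.0), with gradients represented as a plain list of floats for illustration:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale gradients down so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if math.isnan(total_norm):
        raise ValueError("NaN gradient detected; check data and learning rate")
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0])  # global norm 5.0, scaled down to 1.0
```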
## Troubleshooting

### Common Issues

**Out of Memory:**

```bash
# Reduce batch size
--batch-size 32

# Enable gradient checkpointing
--gradient-checkpointing

# Use a smaller model
--model-size tiny
```

**Slow Training:**

```bash
# Increase batch size if memory allows
--batch-size 256

# Use multiple GPUs
torchrun --nproc_per_node=8

# Enable mixed precision
--dtype bfloat16
```

**Poor Convergence:**

```bash
# Adjust learning rate
--lr 3e-4

# Increase warmup steps
--lr-warmup-steps 1000

# Also check data quality and format
```
### Getting Help

- Check logs in `~/nanochat_data/`
- Review the training report for diagnostics
- Verify data format matches the expected schema
- Test with smaller models first
**Sources:**

- `speedrun.sh`: complete training pipeline
- `scripts/tok_train.py`: tokenizer training
- `scripts/base_train.py`: base model training
- `scripts/chat_sft.py`: supervised fine-tuning
- `scripts/chat_rl.py`: reinforcement learning