Training Your Model

Complete guide to training nanochat models from tokenizer creation through reinforcement learning fine-tuning.

Overview

Nanochat uses a multi-stage training pipeline that progressively builds capabilities:

  1. Tokenizer Training - Create custom vocabulary optimized for your data
  2. Base Model Training - Learn fundamental language patterns
  3. Mid-Training - Expand knowledge and capabilities
  4. Supervised Fine-Tuning - Learn conversation format and tasks
  5. Reinforcement Learning - Optimize for human preferences

Each stage builds on the previous one, creating increasingly capable models.

Quick Start

For a complete training run, use the provided speedrun script:

```bash
# Complete end-to-end training
bash speedrun.sh

# Train specific stages only
bash speedrun.sh tok        # Tokenizer only
bash speedrun.sh base       # Base model only
bash speedrun.sh sft        # SFT only (requires base model)
```

The speedrun script automatically handles dependencies and stages.

Stage 1: Tokenizer Training

Create a custom tokenizer optimized for your training data.

Basic Training

```bash
# Train tokenizer on default data
python -m scripts.tok_train

# Custom vocabulary size
python -m scripts.tok_train --vocab-size 32000

# Train on specific data splits
python -m scripts.tok_train --train-ratio 0.95 --val-ratio 0.05
```

Advanced Configuration

```bash
# Custom tokenizer with specific settings
python -m scripts.tok_train \
    --vocab-size 32768 \
    --train-ratio 0.9 \
    --val-ratio 0.1 \
    --num-merges 32512 \
    --vocab-dir custom_tokenizer
```

Key Parameters:

  • --vocab-size: Total vocabulary size (default: 32768)
  • --num-merges: Number of BPE merges to learn (typically vocab_size - 256, since the 256 base byte tokens are not produced by merges)
  • --train-ratio: Training data ratio (default: 0.9)
  • --val-ratio: Validation data ratio (default: 0.1)
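
The relationship between vocabulary size and merge count can be sanity-checked with a one-liner. This is a sketch assuming a byte-level BPE tokenizer with 256 base byte tokens, where each merge adds exactly one new token:

```python
# For byte-level BPE: vocab = 256 byte tokens + one token per merge
# (special tokens, if any, would come on top of that).
def num_merges(vocab_size: int, num_byte_tokens: int = 256) -> int:
    assert vocab_size > num_byte_tokens, "vocab must exceed the byte alphabet"
    return vocab_size - num_byte_tokens

print(num_merges(32768))  # 32512, matching the default --num-merges above
```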

Output Files

Tokenizer training produces:

  • tokenizer.json - Tokenizer configuration
  • vocab.txt - Vocabulary mapping
  • merges.txt - BPE merge rules

Stage 2: Base Model Training

Train the foundation transformer model on language modeling.

Single GPU Training

```bash
# Basic base model training
python -m scripts.base_train

# With custom settings
python -m scripts.base_train \
    --model-size tiny \
    --seq-len 1024 \
    --batch-size 128 \
    --lr 1e-3
```

Multi-GPU Training

```bash
# Distributed training on 8 GPUs
torchrun --nproc_per_node=8 -m scripts.base_train \
    --model-size small \
    --batch-size 64 \
    --total-steps 50000
```

Model Sizes

Available model configurations:

| Size   | Parameters | Layers | Hidden | Heads | Context |
|--------|------------|--------|--------|-------|---------|
| tiny   | 20M        | 12     | 768    | 12    | 1024    |
| small  | 124M       | 12     | 768    | 12    | 1024    |
| medium | 354M       | 24     | 1024   | 16    | 1024    |
| large  | 774M       | 36     | 1280   | 20    | 1024    |

Training Parameters

Core Settings:

  • --model-size: Model architecture size
  • --seq-len: Context length (default: 1024)
  • --batch-size: Per-GPU batch size
  • --total-steps: Total training steps
  • --lr: Peak learning rate

Optimization:

  • --optimizer: adamw or muon (default: adamw)
  • --weight-decay: L2 regularization
  • --lr-warmup-steps: Learning rate warmup
  • --lr-decay-steps: Learning rate decay

Stage 3: Mid-Training (Optional)

Specialized training phase for knowledge expansion.

```bash
# Mid-training on base model
python -m scripts.mid_train \
    --base-model-tag base_v1 \
    --total-steps 10000 \
    --lr 5e-4
```

Mid-training uses:

  • Lower learning rate than base training
  • Specialized datasets (e.g., math, code)
  • Continued pretraining approach

Stage 4: Supervised Fine-Tuning

Train the model to follow conversation format and complete tasks.

Basic SFT

```bash
# SFT on default tasks
python -m scripts.chat_sft

# Custom task mixture
python -m scripts.chat_sft \
    --tasks "GSM8K,MMLU,HumanEval" \
    --total-steps 5000
```

Multi-GPU SFT

```bash
# Distributed SFT training
torchrun --nproc_per_node=4 -m scripts.chat_sft \
    --base-model base_model_v1 \
    --batch-size 32 \
    --lr 1e-4
```

Task Selection

Available tasks for SFT:

  • GSM8K: Grade school mathematics
  • MMLU: Broad knowledge questions
  • HumanEval: Code generation
  • ARC: Reasoning challenges
  • SpellingBee: Letter counting (custom)

```bash
# Train on specific task subset
python -m scripts.chat_sft --tasks "GSM8K,HumanEval"

# Custom task weights
python -m scripts.chat_sft --task-weights "GSM8K:2,MMLU:1,HumanEval:3"
```
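
Task weights translate into sampling proportions during SFT. The sketch below shows one way the "GSM8K:2,MMLU:1,HumanEval:3" syntax could be parsed and used for weighted task sampling; the parsing and sampling here are illustrative, not the actual chat_sft implementation:

```python
import random

# Hypothetical parser for a "name:weight,name:weight" spec string.
def parse_task_weights(spec: str) -> dict[str, float]:
    weights = {}
    for item in spec.split(","):
        name, w = item.split(":")
        weights[name] = float(w)
    return weights

# Draw the next training task in proportion to its weight.
def sample_task(weights: dict[str, float], rng: random.Random) -> str:
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

weights = parse_task_weights("GSM8K:2,MMLU:1,HumanEval:3")
rng = random.Random(0)
counts = {name: 0 for name in weights}
for _ in range(6000):
    counts[sample_task(weights, rng)] += 1
# Draws land roughly 2000 / 1000 / 3000 across the three tasks.
```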

Stage 5: Reinforcement Learning

Optimize model behavior using human preference data.

Basic RL Training

```bash
# RL training from SFT checkpoint
python -m scripts.chat_rl \
    --sft-model sft_model_v1 \
    --total-steps 1000
```

Advanced RL Configuration

```bash
# Custom RL training
python -m scripts.chat_rl \
    --sft-model sft_model_v1 \
    --reward-model reward_v1 \
    --kl-penalty 0.1 \
    --batch-size 16 \
    --ppo-epochs 4
```

RL Parameters:

  • --kl-penalty: KL divergence penalty weight
  • --ppo-epochs: PPO optimization epochs
  • --clip-range: PPO clipping parameter
  • --value-loss-coeff: Value function loss weight
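
To make the knobs above concrete, here is a toy scalar version of the PPO clipped objective that `--clip-range` and `--kl-penalty` control. It operates on single log-probabilities for clarity; the real training loop works on batches of token-level values:

```python
import math

# Toy PPO loss for one sample: clipped surrogate plus a KL penalty.
def ppo_loss(logp_new, logp_old, advantage, clip_range=0.2, kl_penalty=0.1):
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip_range), 1 - clip_range)
    # Pessimistic bound: take the worse (smaller) of the two surrogate terms.
    surrogate = min(ratio * advantage, clipped * advantage)
    # Single-sample estimate of KL(old policy || new policy).
    approx_kl = logp_old - logp_new
    return -surrogate + kl_penalty * approx_kl

# A large probability ratio with positive advantage gets clipped at 1.2:
loss = ppo_loss(logp_new=0.5, logp_old=0.0, advantage=1.0)
```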

Training Configuration

Hardware Requirements

Minimum Requirements:

  • Single GPU: 8GB+ VRAM for tiny/small models
  • Multi-GPU: 4x 8GB GPUs for medium models
  • Large models: 8x 16GB+ GPUs recommended

Memory Optimization:

  • Use gradient checkpointing for large models
  • Enable mixed precision (bfloat16)
  • Adjust batch size based on VRAM
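
A rough way to size a GPU for a given model is to count bytes per parameter. The estimate below assumes bf16 weights and gradients (2 bytes each) plus fp32 Adam moments (4 bytes each); it deliberately excludes activations, which depend on batch size, sequence length, and whether gradient checkpointing is on:

```python
# Back-of-envelope VRAM estimate for Adam-style training.
def training_vram_gb(num_params: int) -> float:
    bytes_per_param = 2 + 2 + 4 + 4  # bf16 weights + grads, fp32 Adam m + v
    return num_params * bytes_per_param / 1e9

print(f"{training_vram_gb(124_000_000):.1f} GB")  # ~1.5 GB before activations
```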

Data Preparation

Training data should be in conversation format:

```json
{
  "messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}
```
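
Before training, it is worth validating that every record matches this schema. A minimal validator sketch follows; the role set is an assumption (a real dataset may also include "system" turns):

```python
VALID_ROLES = {"system", "user", "assistant"}

# Check that a record has a non-empty "messages" list of
# {"role": ..., "content": ...} dicts with known roles and string content.
def validate_conversation(record: dict) -> bool:
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for msg in messages:
        if msg.get("role") not in VALID_ROLES:
            return False
        if not isinstance(msg.get("content"), str):
            return False
    return True

example = {"messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]}
assert validate_conversation(example)
```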

Data Sources:

  • FineWeb-Edu (base training)
  • OpenHermes (SFT)
  • Custom conversation datasets

Monitoring Training

Track training progress with built-in reporting:

```bash
# View training report
cat ~/nanochat_data/report.md

# Monitor loss in real-time
tail -f training.log
```

Key Metrics:

  • Training loss (should decrease steadily)
  • Validation loss (watch for overfitting)
  • Learning rate schedule
  • Memory usage per GPU

Resuming Training

All scripts support checkpoint resumption:

```bash
# Resume from latest checkpoint
python -m scripts.base_train --resume

# Resume from specific step
python -m scripts.base_train --resume --resume-step 10000

# Resume with different settings
python -m scripts.base_train --resume --lr 5e-4
```

Best Practices

Model Sizing

  • Start with tiny/small models for experimentation
  • Scale up only when needed
  • Consider compute budget vs. performance trade-offs

Hyperparameter Tuning

  • Use learning rate warmup (10% of total steps)
  • Decay learning rate in final 10% of training
  • Monitor loss curves for optimal stopping
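
The warmup/decay rule above can be sketched as a trapezoidal schedule: linear warmup over the first 10% of steps, flat at the peak, then linear decay to zero over the final 10%. The exact schedule in base_train may differ; this is illustrative:

```python
# Trapezoidal learning-rate schedule: 10% warmup, 80% flat, 10% decay.
def lr_at(step: int, total_steps: int, peak_lr: float) -> float:
    warmup = int(0.1 * total_steps)
    decay_start = int(0.9 * total_steps)
    if step < warmup:
        return peak_lr * (step + 1) / warmup       # linear ramp up
    if step >= decay_start:
        remaining = total_steps - step
        return peak_lr * remaining / (total_steps - decay_start)  # ramp down
    return peak_lr                                  # flat mid-training

total, peak = 1000, 1e-3
assert lr_at(500, total, peak) == peak   # flat in the middle
assert lr_at(999, total, peak) < 2e-5    # nearly decayed at the end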

Data Quality

  • Clean and deduplicate training data
  • Balance different data sources
  • Validate conversation format consistency

Debugging

  • Start with small batch sizes to test setup
  • Use gradient clipping (max_norm=1.0)
  • Check for NaN losses or gradient explosions
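
Global-norm gradient clipping and the NaN check can be illustrated in a few lines of plain Python, mirroring what `torch.nn.utils.clip_grad_norm_` does on real tensors (treat this as a sketch, not the training-loop code):

```python
import math

# Scale gradients so their global L2 norm is at most max_norm,
# and fail loudly if any gradient has gone NaN.
def clip_by_global_norm(grads: list[float], max_norm: float = 1.0) -> list[float]:
    if any(math.isnan(g) for g in grads):
        raise ValueError("NaN gradient detected; check data and learning rate")
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / total_norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5 scaled to 1
```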

Troubleshooting

Common Issues

Out of Memory:

```bash
# Reduce batch size
--batch-size 32

# Enable gradient checkpointing
--gradient-checkpointing

# Use smaller model
--model-size tiny
```

Slow Training:

```bash
# Increase batch size if memory allows
--batch-size 256

# Use multiple GPUs
torchrun --nproc_per_node=8

# Enable mixed precision
--dtype bfloat16
```

Poor Convergence:

```bash
# Adjust learning rate
--lr 3e-4

# Increase warmup steps
--lr-warmup-steps 1000

# Check data quality and format
```

Getting Help

  • Check logs in ~/nanochat_data/
  • Review training report for diagnostics
  • Verify data format matches expected schema
  • Test with smaller models first

Sources:

  • speedrun.sh (complete training pipeline)
  • scripts/tok_train.py (tokenizer training)
  • scripts/base_train.py (base model training)
  • scripts/chat_sft.py (supervised fine-tuning)
  • scripts/chat_rl.py (reinforcement learning)
Last updated: 1/10/2026