# Training Your Model
A complete guide to training nanochat models, from tokenizer creation through reinforcement-learning fine-tuning.
## Overview
Nanochat uses a multi-stage training pipeline that progressively builds capabilities:

1. **Tokenizer Training**: Create a custom vocabulary optimized for your data
2. **Base Model Training**: Learn fundamental language patterns
3. **Mid-Training**: Expand knowledge and capabilities
4. **Supervised Fine-Tuning**: Learn conversation format and tasks
5. **Reinforcement Learning**: Optimize for human preferences
Each stage builds on the previous one, creating increasingly capable models.
## Quick Start
For a complete training run, use the provided speedrun script:
```bash
# Complete end-to-end training
bash speedrun.sh

# Train specific stages only
bash speedrun.sh tok   # Tokenizer only
bash speedrun.sh base  # Base model only
bash speedrun.sh sft   # SFT only (requires base model)
```
The speedrun script automatically handles dependencies and stages.
## Stage 1: Tokenizer Training
Create a custom tokenizer optimized for your training data.
### Basic Training

```bash
# Train tokenizer on default data
python -m scripts.tok_train

# Custom vocabulary size
python -m scripts.tok_train --vocab-size 32000

# Train on specific data splits
python -m scripts.tok_train --train-ratio 0.95 --val-ratio 0.05
```
### Advanced Configuration

```bash
# Custom tokenizer with specific settings
python -m scripts.tok_train \
  --vocab-size 32768 \
  --train-ratio 0.9 \
  --val-ratio 0.1 \
  --num-merges 32512 \
  --vocab-dir custom_tokenizer
```
**Key Parameters:**

- `--vocab-size`: Total vocabulary size (default: 32768)
- `--num-merges`: Number of BPE merges (vocab_size - 256)
- `--train-ratio`: Training data ratio (default: 0.9)
- `--val-ratio`: Validation data ratio (default: 0.1)
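The relationship between vocabulary size and merge count can be checked with a quick calculation: a byte-level BPE vocabulary starts from the 256 possible byte values, and each merge rule adds exactly one new token (this sketch ignores any special tokens the tokenizer may reserve).

```python
# A byte-level BPE vocabulary starts from the 256 byte values;
# each merge rule then adds exactly one new token.
BYTE_TOKENS = 256

def num_merges(vocab_size: int) -> int:
    """Number of BPE merges needed to reach the target vocabulary size."""
    return vocab_size - BYTE_TOKENS

print(num_merges(32768))  # matches the --num-merges 32512 example above
```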
### Output Files

Tokenizer training produces:

- `tokenizer.json`: Tokenizer configuration
- `vocab.txt`: Vocabulary mapping
- `merges.txt`: BPE merge rules
## Stage 2: Base Model Training
Train the foundation transformer model on language modeling.
### Single-GPU Training

```bash
# Basic base model training
python -m scripts.base_train

# With custom settings
python -m scripts.base_train \
  --model-size tiny \
  --seq-len 1024 \
  --batch-size 128 \
  --lr 1e-3
```
### Multi-GPU Training

```bash
# Distributed training on 8 GPUs
torchrun --nproc_per_node=8 -m scripts.base_train \
  --model-size small \
  --batch-size 64 \
  --total-steps 50000
```
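As a sanity check on training scale, the total token count for a run like the one above is the product of per-GPU batch size, sequence length, step count, and GPU count (assuming `--batch-size` is per GPU and the default 1024-token context):

```python
# Rough count of tokens processed by the 8-GPU run above.
# Assumes --batch-size is per GPU and a fixed 1024-token sequence length.
batch_size = 64       # per GPU
seq_len = 1024        # default context length
total_steps = 50_000
num_gpus = 8

tokens_processed = batch_size * seq_len * total_steps * num_gpus
print(f"{tokens_processed / 1e9:.1f}B tokens")
```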
### Model Sizes
Available model configurations:
| Size | Parameters | Layers | Hidden | Heads | Context |
|---|---|---|---|---|---|
| tiny | 20M | 12 | 768 | 12 | 1024 |
| small | 124M | 12 | 768 | 12 | 1024 |
| medium | 354M | 24 | 1024 | 16 | 1024 |
| large | 774M | 36 | 1280 | 20 | 1024 |
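A rough parameter count for GPT-style configurations like those in the table can be estimated from depth and width alone. This is a sketch using the standard ~12·L·H² per-block approximation plus tied token embeddings; the vocabulary size used here is an assumption chosen to reproduce the familiar 124M figure, not a nanochat default.

```python
def approx_params(layers: int, hidden: int, vocab: int) -> int:
    # Each transformer block has roughly 12 * hidden^2 parameters
    # (attention projections plus a 4x-wide MLP), and the token
    # embedding matrix (tied with the output head) adds vocab * hidden.
    return 12 * layers * hidden**2 + vocab * hidden

# The "small" row (12 layers, 768 hidden) lands near the familiar
# 124M figure when paired with a GPT-2-sized vocabulary.
print(f"{approx_params(12, 768, 50257) / 1e6:.0f}M")
```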
### Training Parameters

**Core Settings:**

- `--model-size`: Model architecture size
- `--seq-len`: Context length (default: 1024)
- `--batch-size`: Per-GPU batch size
- `--total-steps`: Total training steps
- `--lr`: Peak learning rate

**Optimization:**

- `--optimizer`: `adamw` or `muon` (default: adamw)
- `--weight-decay`: L2 regularization strength
- `--lr-warmup-steps`: Learning rate warmup steps
- `--lr-decay-steps`: Learning rate decay steps
## Stage 3: Mid-Training (Optional)
Specialized training phase for knowledge expansion.
```bash
# Mid-training on base model
python -m scripts.mid_train \
  --base-model-tag base_v1 \
  --total-steps 10000 \
  --lr 5e-4
```
Mid-training uses:
- Lower learning rate than base training
- Specialized datasets (e.g., math, code)
- Continued pretraining approach
## Stage 4: Supervised Fine-Tuning
Train the model to follow conversation format and complete tasks.
### Basic SFT

```bash
# SFT on default tasks
python -m scripts.chat_sft

# Custom task mixture
python -m scripts.chat_sft \
  --tasks "GSM8K,MMLU,HumanEval" \
  --total-steps 5000
```
### Multi-GPU SFT

```bash
# Distributed SFT training
torchrun --nproc_per_node=4 -m scripts.chat_sft \
  --base-model base_model_v1 \
  --batch-size 32 \
  --lr 1e-4
```
### Task Selection
Available tasks for SFT:
- GSM8K: Grade school mathematics
- MMLU: Broad knowledge questions
- HumanEval: Code generation
- ARC: Reasoning challenges
- SpellingBee: Letter counting (custom)
```bash
# Train on specific task subset
python -m scripts.chat_sft --tasks "GSM8K,HumanEval"

# Custom task weights
python -m scripts.chat_sft --task-weights "GSM8K:2,MMLU:1,HumanEval:3"
```
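The weight string above can be turned into sampling probabilities. This is a sketch of how such a spec might be parsed; the actual parsing logic inside `scripts.chat_sft` is an assumption.

```python
def parse_task_weights(spec: str) -> dict:
    """Parse a "Task:weight,..." spec into normalized sampling probabilities."""
    weights = {}
    for item in spec.split(","):
        task, _, weight = item.partition(":")
        weights[task.strip()] = float(weight)
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}

probs = parse_task_weights("GSM8K:2,MMLU:1,HumanEval:3")
print(probs)  # GSM8K sampled twice as often as MMLU
```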
## Stage 5: Reinforcement Learning
Optimize model behavior using human preference data.
### Basic RL Training

```bash
# RL training from SFT checkpoint
python -m scripts.chat_rl \
  --sft-model sft_model_v1 \
  --total-steps 1000
```
### Advanced RL Configuration

```bash
# Custom RL training
python -m scripts.chat_rl \
  --sft-model sft_model_v1 \
  --reward-model reward_v1 \
  --kl-penalty 0.1 \
  --batch-size 16 \
  --ppo-epochs 4
```
**RL Parameters:**

- `--kl-penalty`: KL divergence penalty weight
- `--ppo-epochs`: PPO optimization epochs
- `--clip-range`: PPO clipping parameter
- `--value-loss-coeff`: Value function loss weight
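To make the clipping and KL terms concrete, here is a minimal per-token sketch of the PPO-style objective these flags control. This is scalar math for illustration only; the real implementation operates on batched tensors, and the exact loss composition in `scripts.chat_rl` is an assumption.

```python
import math

def ppo_token_loss(logp_new, logp_old, logp_ref, advantage,
                   clip_range=0.2, kl_penalty=0.1):
    """Clipped PPO surrogate plus a KL penalty toward the reference (SFT) policy."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip_range), 1 - clip_range)
    # Take the pessimistic (min) of the clipped and unclipped surrogates.
    surrogate = min(ratio * advantage, clipped * advantage)
    # Penalize drift away from the reference policy.
    kl = logp_new - logp_ref
    return -surrogate + kl_penalty * kl

loss = ppo_token_loss(logp_new=-1.0, logp_old=-1.2, logp_ref=-1.1, advantage=0.5)
```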
## Training Configuration
### Hardware Requirements

**Minimum Requirements:**
- Single GPU: 8GB+ VRAM for tiny/small models
- Multi-GPU: 4x 8GB GPUs for medium models
- Large models: 8x 16GB+ GPUs recommended
**Memory Optimization:**
- Use gradient checkpointing for large models
- Enable mixed precision (bfloat16)
- Adjust batch size based on VRAM
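A back-of-the-envelope VRAM estimate helps when picking a batch size. This sketch assumes mixed-precision AdamW, which keeps roughly 16 bytes of state per parameter; the exact breakdown varies by implementation, and activation memory (which scales with batch size and sequence length) is not included.

```python
def model_state_gb(params: int, bytes_per_param: int = 16) -> float:
    # Roughly: bf16 weights (2) + bf16 grads (2) + fp32 master copy (4)
    # + two fp32 AdamW moments (8) = 16 bytes per parameter,
    # not counting activations.
    return params * bytes_per_param / 1024**3

print(f"{model_state_gb(124_000_000):.1f} GB")  # "small" model, before activations
```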
### Data Preparation

Training data should be in conversation format:

```json
{
  "messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}
```
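A small validator can catch format drift before a run wastes compute. This is a sketch: the role set and any field requirements beyond `role`/`content` are assumptions.

```python
VALID_ROLES = {"user", "assistant", "system"}

def validate_conversation(record: dict) -> bool:
    """Check that a record matches the conversation schema shown above."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for msg in messages:
        if not isinstance(msg, dict):
            return False
        if msg.get("role") not in VALID_ROLES:
            return False
        if not isinstance(msg.get("content"), str) or not msg["content"]:
            return False
    return True

sample = {"messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]}
print(validate_conversation(sample))
```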
**Data Sources:**
- FineWeb-Edu (base training)
- OpenHermes (SFT)
- Custom conversation datasets
## Monitoring Training

Track training progress with the built-in reporting:

```bash
# View training report
cat ~/nanochat_data/report.md

# Monitor loss in real time
tail -f training.log
```
**Key Metrics:**
- Training loss (should decrease steadily)
- Validation loss (watch for overfitting)
- Learning rate schedule
- Memory usage per GPU
## Resuming Training

All scripts support checkpoint resumption:

```bash
# Resume from latest checkpoint
python -m scripts.base_train --resume

# Resume from specific step
python -m scripts.base_train --resume --resume-step 10000

# Resume with different settings
python -m scripts.base_train --resume --lr 5e-4
```
## Best Practices
### Model Sizing
- Start with tiny/small models for experimentation
- Scale up only when needed
- Consider compute budget vs. performance trade-offs
### Hyperparameter Tuning
- Use learning rate warmup (10% of total steps)
- Decay learning rate in final 10% of training
- Monitor loss curves for optimal stopping
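The warmup/decay guidance above can be expressed as a simple schedule function. This is a sketch with linear ramps; the shape of nanochat's actual scheduler is an assumption.

```python
def lr_at(step: int, total_steps: int, peak_lr: float) -> float:
    """Linear warmup over the first 10% of steps, linear decay over the last 10%."""
    warmup = total_steps // 10
    decay_start = total_steps - warmup
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    if step >= decay_start:
        return peak_lr * (total_steps - step) / warmup
    return peak_lr

# Full peak LR through the middle 80% of a 1000-step run
print(lr_at(500, 1000, 1e-3))
```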
### Data Quality
- Clean and deduplicate training data
- Balance different data sources
- Validate conversation format consistency
### Debugging
- Start with small batch sizes to test setup
- Use gradient clipping (max_norm=1.0)
- Check for NaN losses or gradient explosions
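The clipping and NaN checks above amount to a few lines per optimizer step. Here is a framework-free sketch of clipping by global L2 norm (max_norm=1.0), with gradients represented as a plain list of floats for illustration:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale gradients down so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if math.isnan(total_norm):
        raise ValueError("NaN gradient detected; check data and learning rate")
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0])  # global norm 5.0, scaled down to 1.0
```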
## Troubleshooting

### Common Issues

**Out of Memory:**

```bash
# Reduce batch size
--batch-size 32

# Enable gradient checkpointing
--gradient-checkpointing

# Use a smaller model
--model-size tiny
```

**Slow Training:**

```bash
# Increase batch size if memory allows
--batch-size 256

# Use multiple GPUs
torchrun --nproc_per_node=8

# Enable mixed precision
--dtype bfloat16
```

**Poor Convergence:**

```bash
# Adjust learning rate
--lr 3e-4

# Increase warmup steps
--lr-warmup-steps 1000

# Also check data quality and format
```
### Getting Help

- Check logs in `~/nanochat_data/`
- Review the training report for diagnostics
- Verify data format matches the expected schema
- Test with smaller models first
**Sources:**

- `speedrun.sh`: complete training pipeline
- `scripts/tok_train.py`: tokenizer training
- `scripts/base_train.py`: base model training
- `scripts/chat_sft.py`: supervised fine-tuning
- `scripts/chat_rl.py`: reinforcement learning