Getting Started Guide
This guide walks you through setting up and running nanochat for the first time. Whether you want to train your own model from scratch or simply explore the existing codebase, it provides step-by-step instructions for getting up and running quickly.
Quick Start (Recommended)
The fastest way to experience nanochat is to run the complete $100 training pipeline:
Source: README.md:17-25
# Clone the repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat
# Run the complete training pipeline (~4 hours on 8x H100)
bash speedrun.sh
For longer runs, use a screen session:
screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
After training completes, chat with your model:
source .venv/bin/activate
python -m scripts.chat_web
Visit the URL shown to access the web interface!
System Requirements
Hardware Requirements
Recommended Setup
- 8x H100 GPUs (80GB VRAM each) for full training pipeline
- High-speed interconnect (NVLink/InfiniBand) for multi-GPU training
- Fast storage (NVMe SSD) for dataset streaming
- High memory bandwidth for optimal performance
Alternative Configurations
Source: README.md:87-99
- The code will run just fine on the Ampere 8XA100 GPU node as well, but a bit slower.
- All code will run just fine on even a single GPU by omitting `torchrun`, and will produce ~identical results (code will automatically switch to gradient accumulation), but you'll have to wait 8 times longer.
- If your GPU(s) have less than 80GB, you'll have to tune some of the hyperparameters or you will OOM / run out of VRAM. Look for `--device_batch_size` in the scripts and reduce it until things fit. E.g. from 32 (default) to 16, 8, 4, 2, or even 1.
- Most of the code is fairly vanilla PyTorch so it should run on anything that supports that - xpu, mps, or etc, but I haven't implemented this out of the box so it might take a bit of tinkering.
Memory Guidelines
- 80GB VRAM: Full training with default hyperparameters
- 40GB VRAM: Reduce --device_batch_size to 16
- 24GB VRAM: Reduce --device_batch_size to 8 or 4
- 16GB VRAM: Reduce --device_batch_size to 2 or 1
- <16GB VRAM: Requires significant hyperparameter tuning
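For example, on GPUs with around 40GB of VRAM you could launch the base training stage described later in this guide with a smaller per-device batch (the depth and batch values here are illustrative):
# Base training with a reduced per-device batch size (example for ~40GB GPUs)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=20 --device_batch_size=16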
Software Requirements
- Python 3.10+
- PyTorch 2.9.0+ with CUDA support
- CUDA 12.8+ (for GPU training)
- uv package manager (installed automatically)
Environment Setup
1. Python Environment
nanochat uses uv for fast, reliable Python package management:
Source: speedrun.sh:25-35
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a local virtual environment
uv venv
# Install dependencies with GPU support
uv sync --extra gpu
# Activate the environment
source .venv/bin/activate
2. Base Directory Configuration
By default, nanochat stores data and models in ~/.cache/nanochat:
export NANOCHAT_BASE_DIR="$HOME/.cache/nanochat"
mkdir -p $NANOCHAT_BASE_DIR
You can change this location by setting the NANOCHAT_BASE_DIR environment variable.
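For example, to keep datasets and checkpoints on a larger or faster disk, export a different path before launching training (the path below is purely illustrative):
# Store nanochat data and checkpoints on another disk (example path)
export NANOCHAT_BASE_DIR=/data/nanochat
mkdir -p $NANOCHAT_BASE_DIR
bash speedrun.sh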
3. Hardware Detection
nanochat automatically detects your hardware configuration:
Source: nanochat/common.py:158-170
def autodetect_device_type():
    if torch.cuda.is_available():
        return "cuda"
    elif torch.backends.mps.is_available():
        return "mps"
    else:
        return "cpu"
The training and inference scripts use this detection to pick the compute device automatically, so no manual device configuration is needed in the common case.
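As a minimal sketch, you can call this helper in your own code as well (this assumes importing it from nanochat.common, where the function above is defined; the training scripts do more setup than this):
import torch
from nanochat.common import autodetect_device_type

device_type = autodetect_device_type()  # "cuda", "mps", or "cpu"
device = torch.device(device_type)

# Allocate a tensor on the detected device to confirm everything works
x = torch.randn(4, 4, device=device)
print(f"Detected device type: {device_type}; tensor lives on {x.device}")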
Training Your First Model
Complete Training Pipeline
The speedrun.sh script runs the entire pipeline:
- Environment Setup: Installs dependencies and sets up directories
- Tokenizer Training: Trains a BPE tokenizer on 2B characters
- Data Download: Downloads pretraining data in background
- Base Pretraining: Trains the foundational language model
- Mid-training: Adds conversation structure and tool use
- Supervised Fine-tuning: Optimizes for instruction following
- Evaluation: Tests on multiple benchmarks
- Report Generation: Creates comprehensive training report
Individual Training Stages
You can also run stages individually for experimentation:
1. Tokenizer Training
# Download training data (first 8 shards)
python -m nanochat.dataset -n 8
# Train tokenizer
python -m scripts.tok_train --max_chars=2000000000 --vocab_size=65536
# Evaluate tokenizer
python -m scripts.tok_eval
2. Base Pretraining
# Download full dataset (240 shards for d20 model)
python -m nanochat.dataset -n 240
# Train base model (8 GPUs)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=20 --target_param_data_ratio=20
# Or single GPU (slower)
python -m scripts.base_train --depth=20 --target_param_data_ratio=20 --device_batch_size=4
3. Mid-training
# Download identity conversations
curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
# Run mid-training
torchrun --standalone --nproc_per_node=8 -m scripts.mid_train
4. Supervised Fine-tuning
# Run SFT
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft
# Evaluate chat capabilities
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i sft
Memory Management
Reducing Memory Usage
If you encounter OOM (Out of Memory) errors, try these approaches:
1. Reduce Batch Size
Source: scripts/mid_train.py:45-50
# Default batch size
--device_batch_size=32
# For 40GB VRAM
--device_batch_size=16
# For 24GB VRAM
--device_batch_size=8
# For 16GB VRAM
--device_batch_size=4
2. Reduce Model Size
# Smaller model (fewer layers)
--depth=12 # instead of default 20
# Smaller context length
--max_seq_len=1024 # instead of default 2048
3. Mixed Precision Training
# Use bfloat16 (default on CUDA)
--dtype=bfloat16
# Use float32 for stability (if needed)
--dtype=float32
Single GPU Training
For single GPU setups, simply omit torchrun:
# Single GPU base training
python -m scripts.base_train --depth=12 --device_batch_size=8 --target_param_data_ratio=20
The code automatically adjusts gradient accumulation to maintain the same effective batch size.
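As a rough sketch of the idea (the variable names and token budget below are illustrative, not the exact values in scripts/base_train.py), the script targets a fixed number of tokens per optimizer step and derives the number of gradient-accumulation micro-steps from the per-device batch size, sequence length, and GPU count:
# Illustrative sketch: preserving the effective batch size via gradient accumulation
total_batch_size = 524288     # target tokens per optimizer step (example value)
device_batch_size = 8         # sequences per GPU per forward/backward pass
max_seq_len = 2048            # tokens per sequence (default shown in this guide)
world_size = 1                # number of GPUs; 1 when torchrun is omitted

tokens_per_micro_step = device_batch_size * max_seq_len * world_size
grad_accum_steps = total_batch_size // tokens_per_micro_step  # 32 in this example

print(f"Accumulating gradients over {grad_accum_steps} micro-steps per optimizer step")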
Inference and Chat
Command Line Interface
For quick testing, use the CLI chat interface:
# Interactive chat
python -m scripts.chat_cli
# Single prompt
python -m scripts.chat_cli -p "Why is the sky blue?"
# Adjust generation parameters
python -m scripts.chat_cli --temperature 0.8 --top_k 50
Web Interface
For a ChatGPT-like experience, use the web interface:
# Single GPU serving
python -m scripts.chat_web
# Multi-GPU serving (4 GPUs)
python -m scripts.chat_web --num-gpus 4
# Custom model source
python -m scripts.chat_web --source mid # or sft, rl
The web interface supports:
- Streaming responses with real-time generation
- Tool use with Python calculator integration
- Multi-GPU serving for higher throughput
- Conversation history and context management
Evaluation
Running Evaluations
Evaluate your model on standard benchmarks:
# Single task evaluation
python -m scripts.chat_eval -a ARC-Easy
# Multiple tasks
python -m scripts.chat_eval -a ARC-Easy,MMLU,GSM8K
# Distributed evaluation (8 GPUs)
torchrun --nproc_per_node=8 -m scripts.chat_eval -- -a GSM8K
Available Benchmarks
- ARC-Easy/Challenge: Science reasoning
- MMLU: Academic knowledge
- GSM8K: Mathematical reasoning with tools
- HumanEval: Code generation
- SpellingBee: Letter counting and spelling
Base Model Evaluation
For base models (before chat training), use the CORE metric:
# Evaluate base model quality
torchrun --nproc_per_node=8 -m scripts.base_eval
Customization
Adding Your Own Data
- Create conversation files: Save conversations as JSONL in the format shown below
- Update training scripts: Add your data to TaskMixture
- Adjust mixing ratios: Control how much of your data to include
Example conversation format:
{
  "messages": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you today?"}
  ]
}
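A minimal sketch of writing conversations in this format to a JSONL file inside the nanochat base directory (the filename my_conversations.jsonl is hypothetical, and you still need to register the file with the training mixture as noted above):
import json
import os

# Hypothetical output location inside the nanochat base directory
base_dir = os.environ.get("NANOCHAT_BASE_DIR", os.path.expanduser("~/.cache/nanochat"))
out_path = os.path.join(base_dir, "my_conversations.jsonl")

conversations = [
    {
        "messages": [
            {"role": "user", "content": "Hello!"},
            {"role": "assistant", "content": "Hi there! How can I help you today?"},
        ]
    },
]

# JSONL: one JSON-encoded conversation per line
with open(out_path, "w") as f:
    for conversation in conversations:
        f.write(json.dumps(conversation) + "\n")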
Hyperparameter Tuning
Key hyperparameters to experiment with:
Model Architecture:
--depth=20 # Number of transformer layers
--aspect_ratio=64 # Model dimension scaling
--head_dim=128 # Attention head dimension
--max_seq_len=2048 # Context length
Training:
--matrix_lr=0.02 # Learning rate for Muon optimizer
--embedding_lr=0.2 # Learning rate for AdamW (embeddings)
--weight_decay=0.0 # L2 regularization
--warmup_ratio=0.0 # LR warmup fraction
--warmdown_ratio=0.4 # LR decay fraction
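For instance, assuming scripts.base_train accepts the flags listed above, several of them can be combined into a single launch (the values below are illustrative defaults, not tuned recommendations):
# Example base training launch combining architecture and optimizer flags
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
  --depth=20 --max_seq_len=2048 --device_batch_size=32 \
  --matrix_lr=0.02 --embedding_lr=0.2 --weight_decay=0.0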
Monitoring and Debugging
Weights & Biases Integration
Enable experiment tracking with wandb:
# First, log in to wandb
wandb login
# Run with experiment tracking
WANDB_RUN=my_experiment bash speedrun.sh
Progress Monitoring
Monitor training progress:
# View screen session
screen -r speedrun
# Follow log file
tail -f speedrun.log
# Check model outputs
ls ~/.cache/nanochat/
Training Reports
After training, check the automatically generated report:
# View training summary
cat report.md
# Check final metrics
tail report.md
Troubleshooting
Common Issues
1. CUDA Out of Memory
# Reduce batch size
--device_batch_size=16 # or 8, 4, 2, 1
# Reduce model size
--depth=12 # instead of 20
# Reduce sequence length
--max_seq_len=1024 # instead of 2048
2. Data Download Issues
# Manually download specific shards
python -m nanochat.dataset -n 10
# Check data directory
ls ~/.cache/nanochat/base_data/
3. Tokenizer Issues
# Retrain the tokenizer on more data (keeping the default 65536 vocab size)
python -m scripts.tok_train --max_chars=5000000000 --vocab_size=65536
# Check tokenizer compression
python -m scripts.tok_eval
Getting Help
- Check the logs: Training scripts provide detailed logging
- Review the report: The generated report.md contains comprehensive metrics
- GitHub issues: Search existing issues or create a new one
- Community discussions: Join the discussion threads in the repository
Next Steps
Once you have nanochat running:
- Experiment with hyperparameters to improve performance
- Add your own datasets for domain-specific fine-tuning
- Implement new evaluation tasks for your use cases
- Scale up to larger models with more compute
- Deploy your model for production use cases
The codebase is designed to be hackable and extensible - explore the source code and make it your own!
Sources:
- README.md (quick start and hardware requirements)
- speedrun.sh (complete training pipeline)
- scripts/* (individual training and evaluation scripts)
- nanochat/common.py (hardware detection and setup)