Getting Started Guide

This guide walks you through setting up and running nanochat for the first time. Whether you want to train your own model from scratch or simply explore the existing codebase, it provides step-by-step instructions for getting up and running quickly.

Quick Start

The fastest way to experience nanochat is to run the complete $100 training pipeline:

Source: README.md:17-25

bash
# Clone the repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat

# Run the complete training pipeline (~4 hours on 8x H100)
bash speedrun.sh

For longer runs, use a screen session:

bash
screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh

After training completes, chat with your model:

bash
source .venv/bin/activate
python -m scripts.chat_web

Visit the URL shown to access the web interface!

System Requirements

Hardware Requirements

  • 8x H100 GPUs (80GB VRAM each) for full training pipeline
  • High-speed interconnect (NVLink/InfiniBand) for multi-GPU training
  • Fast storage (NVMe SSD) for dataset streaming
  • High memory bandwidth for optimal performance

Alternative Configurations

Source: README.md:87-99

markdown
- The code will run just fine on the Ampere 8XA100 GPU node as well, but a bit slower.
- All code will run just fine on even a single GPU by omitting `torchrun`, and will produce ~identical results (code will automatically switch to gradient accumulation), but you'll have to wait 8 times longer.
- If your GPU(s) have less than 80GB, you'll have to tune some of the hyperparameters or you will OOM / run out of VRAM. Look for `--device_batch_size` in the scripts and reduce it until things fit. E.g. from 32 (default) to 16, 8, 4, 2, or even 1.
- Most of the code is fairly vanilla PyTorch so it should run on anything that supports that - xpu, mps, or etc, but I haven't implemented this out of the box so it might take a bit of tinkering.

Memory Guidelines

  • 80GB VRAM: Full training with default hyperparameters
  • 40GB VRAM: Reduce --device_batch_size to 16
  • 24GB VRAM: Reduce --device_batch_size to 8 or 4
  • 16GB VRAM: Reduce --device_batch_size to 2 or 1
  • <16GB VRAM: Requires significant hyperparameter tuning
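
If you want to automate this choice, the guideline table above maps directly onto a few lines of code. The helper below is a hypothetical convenience (it is not part of nanochat); it reads total VRAM via PyTorch and suggests a starting --device_batch_size, which you should still lower if you hit OOM.

python
import torch

def suggest_device_batch_size():
    """Map available VRAM to a starting --device_batch_size (per the table above)."""
    if not torch.cuda.is_available():
        return 1  # CPU/MPS: start tiny and increase only if it fits
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 80:
        return 32   # full default batch size
    if vram_gb >= 40:
        return 16
    if vram_gb >= 24:
        return 8
    if vram_gb >= 16:
        return 2
    return 1        # <16GB: expect additional hyperparameter tuning

print(f"suggested: --device_batch_size={suggest_device_batch_size()}")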

Software Requirements

  • Python 3.10+
  • PyTorch 2.9.0+ with CUDA support
  • CUDA 12.8+ (for GPU training)
  • uv package manager (installed automatically)

Environment Setup

1. Python Environment

nanochat uses uv for fast, reliable Python package management:

Source: speedrun.sh:25-35

bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a local virtual environment
uv venv

# Install dependencies with GPU support
uv sync --extra gpu

# Activate the environment
source .venv/bin/activate

2. Base Directory Configuration

By default, nanochat stores data and models in ~/.cache/nanochat:

bash
export NANOCHAT_BASE_DIR="$HOME/.cache/nanochat"
mkdir -p $NANOCHAT_BASE_DIR

You can change this location by setting the NANOCHAT_BASE_DIR environment variable.
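
For reference, resolving this kind of override usually amounts to one environment lookup with a fallback. The snippet below is an illustrative sketch, not a copy of nanochat's own helper in nanochat/common.py:

python
import os

# Fall back to ~/.cache/nanochat when NANOCHAT_BASE_DIR is not set.
base_dir = os.environ.get(
    "NANOCHAT_BASE_DIR",
    os.path.join(os.path.expanduser("~"), ".cache", "nanochat"),
)
os.makedirs(base_dir, exist_ok=True)
print(f"Using base directory: {base_dir}")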

3. Hardware Detection

nanochat automatically detects your hardware configuration:

Source: nanochat/common.py:158-170

python
def autodetect_device_type():
    if torch.cuda.is_available():
        return "cuda"
    elif torch.backends.mps.is_available():
        return "mps" 
    else:
        return "cpu"

The system will configure itself optimally for your available hardware.
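
As a quick illustration of how you might act on the detected device type (illustrative usage; the dtype choice mirrors the bfloat16-on-CUDA default described later in this guide):

python
import torch
from nanochat.common import autodetect_device_type  # the function shown above

device_type = autodetect_device_type()
device = torch.device(device_type)
# bfloat16 is the default on CUDA; fall back to float32 elsewhere.
dtype = torch.bfloat16 if device_type == "cuda" else torch.float32
print(f"Running on {device} with dtype {dtype}")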

Training Your First Model

Complete Training Pipeline

The speedrun.sh script runs the entire pipeline:

  1. Environment Setup: Installs dependencies and sets up directories
  2. Tokenizer Training: Trains a BPE tokenizer on 2B characters
  3. Data Download: Downloads pretraining data in background
  4. Base Pretraining: Trains the foundational language model
  5. Mid-training: Adds conversation structure and tool use
  6. Supervised Fine-tuning: Optimizes for instruction following
  7. Evaluation: Tests on multiple benchmarks
  8. Report Generation: Creates comprehensive training report

Individual Training Stages

You can also run stages individually for experimentation:

1. Tokenizer Training

bash
# Download training data (first 8 shards)
python -m nanochat.dataset -n 8

# Train tokenizer
python -m scripts.tok_train --max_chars=2000000000 --vocab_size=65536

# Evaluate tokenizer
python -m scripts.tok_eval
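
Tokenizer quality here is mostly about compression: how many characters each token covers on average, which is what tok_eval reports. The sketch below shows only that arithmetic; the loader is left hypothetical because the exact loading call lives in scripts/tok_eval.py:

python
# Compression = characters per token; higher means the tokenizer packs
# more text into each token. The tokenizer object is assumed to expose
# encode(text) -> list[int]; consult scripts/tok_eval.py for the real API.
def chars_per_token(tokenizer, text: str) -> float:
    tokens = tokenizer.encode(text)
    return len(text) / len(tokens)

# tok = ...  # load your trained tokenizer here (hypothetical)
# print(f"{chars_per_token(tok, open('sample.txt').read()):.2f} chars/token")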

2. Base Pretraining

bash
# Download full dataset (240 shards for d20 model)
python -m nanochat.dataset -n 240

# Train base model (8 GPUs)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=20 --target_param_data_ratio=20

# Or single GPU (slower)
python -m scripts.base_train --depth=20 --target_param_data_ratio=20 --device_batch_size=4
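
The --target_param_data_ratio flag sets the data budget relative to model size. A back-of-the-envelope check, assuming the flag means "train on roughly ratio × parameter-count tokens" and that the depth-20 model is on the order of 560M parameters (treat both numbers as approximations):

python
# Rough token-budget estimate for --target_param_data_ratio=20.
params = 560e6                      # assumed approximate size of the d20 model
ratio = 20                          # --target_param_data_ratio
tokens_needed = params * ratio
print(f"~{tokens_needed / 1e9:.1f}B training tokens")  # ~11.2B tokens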

3. Mid-training

bash
# Download identity conversations
curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl

# Run mid-training
torchrun --standalone --nproc_per_node=8 -m scripts.mid_train

4. Supervised Fine-tuning

bash
# Run SFT
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft

# Evaluate chat capabilities
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i sft

Memory Management

Reducing Memory Usage

If you encounter OOM (Out of Memory) errors, try these approaches:

1. Reduce Batch Size

Source: scripts/mid_train.py:45-50

bash
# Default batch size
--device_batch_size=32

# For 40GB VRAM
--device_batch_size=16

# For 24GB VRAM  
--device_batch_size=8

# For 16GB VRAM
--device_batch_size=4

2. Reduce Model Size

bash
# Smaller model (fewer layers)
--depth=12  # instead of default 20

# Smaller context length
--max_seq_len=1024  # instead of default 2048

3. Mixed Precision Training

bash
# Use bfloat16 (default on CUDA)
--dtype=bfloat16

# Use float32 for stability (if needed)
--dtype=float32

Single GPU Training

For single GPU setups, simply omit torchrun:

bash
# Single GPU base training
python -m scripts.base_train --depth=12 --device_batch_size=8 --target_param_data_ratio=20

The code automatically adjusts gradient accumulation to maintain the same effective batch size.
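
The bookkeeping behind that statement is simple: the per-step token budget stays fixed, and whatever does not fit into a single forward/backward pass is accumulated over several micro-steps. A minimal sketch with illustrative numbers (the variable names are assumptions, not the script's exact identifiers):

python
# Effective batch size is kept constant by accumulating gradients.
total_batch_size = 524288        # tokens per optimizer step (example target)
device_batch_size = 8            # sequences per GPU per micro-step
max_seq_len = 2048               # tokens per sequence
world_size = 1                   # single GPU when torchrun is omitted

tokens_per_micro_step = device_batch_size * max_seq_len * world_size
grad_accum_steps = total_batch_size // tokens_per_micro_step
print(f"gradient accumulation steps: {grad_accum_steps}")  # 32 here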

Inference and Chat

Command Line Interface

For quick testing, use the CLI chat interface:

bash
# Interactive chat
python -m scripts.chat_cli

# Single prompt
python -m scripts.chat_cli -p "Why is the sky blue?"

# Adjust generation parameters
python -m scripts.chat_cli --temperature 0.8 --top_k 50

Web Interface

For a ChatGPT-like experience, use the web interface:

bash
# Single GPU serving
python -m scripts.chat_web

# Multi-GPU serving (4 GPUs)
python -m scripts.chat_web --num-gpus 4

# Custom model source
python -m scripts.chat_web --source mid  # or sft, rl

The web interface supports:

  • Streaming responses with real-time generation
  • Tool use with Python calculator integration
  • Multi-GPU serving for higher throughput
  • Conversation history and context management

Evaluation

Running Evaluations

Evaluate your model on standard benchmarks:

bash
# Single task evaluation
python -m scripts.chat_eval -a ARC-Easy

# Multiple tasks
python -m scripts.chat_eval -a ARC-Easy,MMLU,GSM8K

# Distributed evaluation (8 GPUs)
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -a GSM8K

Available Benchmarks

  • ARC-Easy/Challenge: Science reasoning
  • MMLU: Academic knowledge
  • GSM8K: Mathematical reasoning with tools
  • HumanEval: Code generation
  • SpellingBee: Letter counting and spelling

Base Model Evaluation

For base models (before chat training), use the CORE metric:

bash
# Evaluate base model quality
torchrun --standalone --nproc_per_node=8 -m scripts.base_eval

Customization

Adding Your Own Data

  1. Create conversation files: Save as JSONL with conversation format
  2. Update training scripts: Add your data to TaskMixture
  3. Adjust mixing ratios: Control how much of your data to include

Example conversation format:

json
{
  "messages": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you today?"}
  ]
}
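
A small helper for producing such a file is straightforward; the sketch below is illustrative (the file name and the user/assistant-only role check are assumptions based on the example above, not requirements pulled from the training scripts):

python
import json

conversations = [
    {"messages": [
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi there! How can I help you today?"},
    ]},
]

# Write one conversation per line (JSONL), with a light sanity check on roles.
with open("my_conversations.jsonl", "w") as f:
    for conv in conversations:
        for msg in conv["messages"]:
            assert msg["role"] in ("user", "assistant"), f"unexpected role: {msg['role']}"
        f.write(json.dumps(conv) + "\n")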

Hyperparameter Tuning

Key hyperparameters to experiment with:

Model Architecture:

bash
--depth=20           # Number of transformer layers
--aspect_ratio=64    # Model dimension scaling
--head_dim=128       # Attention head dimension
--max_seq_len=2048   # Context length

Training:

bash
--matrix_lr=0.02     # Learning rate for Muon optimizer
--embedding_lr=0.2   # Learning rate for AdamW (embeddings)
--weight_decay=0.0   # L2 regularization
--warmup_ratio=0.0   # LR warmup fraction
--warmdown_ratio=0.4 # LR decay fraction
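
To get a feel for how the architecture flags translate into model size, a rough estimate can be computed by hand. This assumes model_dim = depth × aspect_ratio and untied input/output embeddings; verify both against scripts/base_train.py and the model definition before relying on the numbers:

python
# Back-of-the-envelope parameter count from the flags above.
depth = 20
aspect_ratio = 64
vocab_size = 65536

model_dim = depth * aspect_ratio                  # 1280 with the defaults
transformer_params = 12 * depth * model_dim**2    # standard ~12*L*d^2 estimate
embedding_params = 2 * vocab_size * model_dim     # embeddings + lm_head (assumed untied)
total = transformer_params + embedding_params
print(f"~{total / 1e6:.0f}M parameters")          # on the order of 560M for d20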

Monitoring and Debugging

Weights & Biases Integration

Enable experiment tracking with wandb:

bash
# First, log in to wandb
wandb login

# Run with experiment tracking
WANDB_RUN=my_experiment bash speedrun.sh
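
A common way to wire up an environment-variable switch like WANDB_RUN is to gate wandb initialization on it. The sketch below illustrates the pattern in isolation (the project name is an assumption, not nanochat's actual setting):

python
import os
import wandb

run_name = os.environ.get("WANDB_RUN", "")
use_wandb = run_name != ""          # no env var -> no wandb logging

if use_wandb:
    wandb.init(project="nanochat", name=run_name)

for step in range(3):
    loss = 1.0 / (step + 1)         # placeholder metric
    if use_wandb:
        wandb.log({"train/loss": loss}, step=step)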

Progress Monitoring

Monitor training progress:

bash
# View screen session
screen -r speedrun

# Follow log file
tail -f speedrun.log

# Check model outputs
ls ~/.cache/nanochat/

Training Reports

After training, check the automatically generated report:

bash
# View training summary
cat report.md

# Check final metrics
tail report.md

Troubleshooting

Common Issues

1. CUDA Out of Memory

bash
# Reduce batch size
--device_batch_size=16  # or 8, 4, 2, 1

# Reduce model size
--depth=12  # instead of 20

# Reduce sequence length
--max_seq_len=1024  # instead of 2048

2. Data Download Issues

bash
# Manually download specific shards
python -m nanochat.dataset -n 10

# Check data directory
ls ~/.cache/nanochat/base_data/
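
To confirm the download actually completed, you can count the files in the data directory and compare against the number of shards you requested (240 for the full d20 run, 8 for tokenizer training). A small illustrative check:

python
import os

base_dir = os.environ.get("NANOCHAT_BASE_DIR",
                          os.path.expanduser("~/.cache/nanochat"))
data_dir = os.path.join(base_dir, "base_data")

expected = 240   # adjust to however many shards you asked for
have = len(os.listdir(data_dir)) if os.path.isdir(data_dir) else 0
print(f"{have}/{expected} files present in {data_dir}")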

3. Tokenizer Issues

bash
# Retrain the tokenizer with more data and/or a different vocabulary size
python -m scripts.tok_train --max_chars=5000000000 --vocab_size=32768

# Check tokenizer compression
python -m scripts.tok_eval

Getting Help

  1. Check the logs: Training scripts provide detailed logging
  2. Review the report: The generated report.md contains comprehensive metrics
  3. GitHub issues: Search existing issues or create a new one
  4. Community discussions: Join the discussion threads in the repository

Next Steps

Once you have nanochat running:

  1. Experiment with hyperparameters to improve performance
  2. Add your own datasets for domain-specific fine-tuning
  3. Implement new evaluation tasks for your use cases
  4. Scale up to larger models with more compute
  5. Deploy your model for production use cases

The codebase is designed to be hackable and extensible - explore the source code and make it your own!


Sources:

  • README.md (quick start and hardware requirements)
  • speedrun.sh (complete training pipeline)
  • scripts/* (individual training and evaluation scripts)
  • nanochat/common.py (hardware detection and setup)