Getting Started Guide
This guide walks you through setting up and running nanochat for the first time. Whether you want to train your own model from scratch or simply explore the existing codebase, it provides step-by-step instructions for getting up and running quickly.
Quick Start (Recommended)
The fastest way to experience nanochat is to run the complete $100 training pipeline:
Source: README.md:17-25
# Clone the repository
git clone https://github.com/karpathy/nanochat.git
cd nanochat
# Run the complete training pipeline (~4 hours on 8x H100)
bash speedrun.sh
For longer runs, use a screen session:
screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
After training completes, chat with your model:
source .venv/bin/activate
python -m scripts.chat_web
Visit the URL shown to access the web interface!
System Requirements
Hardware Requirements
Recommended Setup
- 8x H100 GPUs (80GB VRAM each) for full training pipeline
- High-speed interconnect (NVLink/InfiniBand) for multi-GPU training
- Fast storage (NVMe SSD) for dataset streaming
- High memory bandwidth for optimal performance
Alternative Configurations
Source: README.md:87-99
- The code will run just fine on the Ampere 8XA100 GPU node as well, but a bit slower.
- All code will run just fine on even a single GPU by omitting `torchrun`, and will produce ~identical results (code will automatically switch to gradient accumulation), but you'll have to wait 8 times longer.
- If your GPU(s) have less than 80GB, you'll have to tune some of the hyperparameters or you will OOM / run out of VRAM. Look for `--device_batch_size` in the scripts and reduce it until things fit. E.g. from 32 (default) to 16, 8, 4, 2, or even 1.
- Most of the code is fairly vanilla PyTorch so it should run on anything that supports that - xpu, mps, or etc, but I haven't implemented this out of the box so it might take a bit of tinkering.
Memory Guidelines
- 80GB VRAM: Full training with default hyperparameters
- 40GB VRAM: Reduce --device_batch_size to 16
- 24GB VRAM: Reduce --device_batch_size to 8 or 4
- 16GB VRAM: Reduce --device_batch_size to 2 or 1
- <16GB VRAM: Requires significant hyperparameter tuning
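For example, on GPUs with around 40GB of VRAM you could launch the base training stage described later in this guide with a smaller per-device batch (the depth and batch values here are illustrative):
# Base training with a reduced per-device batch size (example for ~40GB GPUs)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=20 --device_batch_size=16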
Software Requirements
- Python 3.10+
- PyTorch 2.9.0+ with CUDA support
- CUDA 12.8+ (for GPU training)
- uv package manager (installed automatically)
Environment Setup
1. Python Environment
nanochat uses uv for fast, reliable Python package management:
Source: speedrun.sh:25-35
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a local virtual environment
uv venv
# Install dependencies with GPU support
uv sync --extra gpu
# Activate the environment
source .venv/bin/activate
2. Base Directory Configuration
By default, nanochat stores data and models in ~/.cache/nanochat:
export NANOCHAT_BASE_DIR="$HOME/.cache/nanochat"
mkdir -p $NANOCHAT_BASE_DIR
You can change this location by setting the NANOCHAT_BASE_DIR environment variable.
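For example, to keep datasets and checkpoints on a larger or faster disk, export a different path before launching training (the path below is purely illustrative):
# Store nanochat data and checkpoints on another disk (example path)
export NANOCHAT_BASE_DIR=/data/nanochat
mkdir -p $NANOCHAT_BASE_DIR
bash speedrun.sh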
3. Hardware Detection
nanochat automatically detects your hardware configuration:
Source: nanochat/common.py:158-170
def autodetect_device_type():
    if torch.cuda.is_available():
        return "cuda"
    elif torch.backends.mps.is_available():
        return "mps"
    else:
        return "cpu"
The training and inference scripts use this detection to pick the compute device automatically, so no manual device configuration is needed in the common case.
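As a minimal sketch, you can call this helper in your own code as well (this assumes importing it from nanochat.common, where the function above is defined; the training scripts do more setup than this):
import torch
from nanochat.common import autodetect_device_type

device_type = autodetect_device_type()  # "cuda", "mps", or "cpu"
device = torch.device(device_type)

# Allocate a tensor on the detected device to confirm everything works
x = torch.randn(4, 4, device=device)
print(f"Detected device type: {device_type}; tensor lives on {x.device}")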
Training Your First Model
Complete Training Pipeline
The speedrun.sh script runs the entire pipeline:
- Environment Setup: Installs dependencies and sets up directories
- Tokenizer Training: Trains a BPE tokenizer on 2B characters
- Data Download: Downloads pretraining data in background
- Base Pretraining: Trains the foundational language model
- Mid-training: Adds conversation structure and tool use
- Supervised Fine-tuning: Optimizes for instruction following
- Evaluation: Tests on multiple benchmarks
- Report Generation: Creates comprehensive training report
Individual Training Stages
You can also run stages individually for experimentation:
1. Tokenizer Training
# Download training data (first 8 shards)
python -m nanochat.dataset -n 8
# Train tokenizer
python -m scripts.tok_train --max_chars=2000000000 --vocab_size=65536
# Evaluate tokenizer
python -m scripts.tok_eval
2. Base Pretraining
# Download full dataset (240 shards for d20 model)
python -m nanochat.dataset -n 240
# Train base model (8 GPUs)
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=20 --target_param_data_ratio=20
# Or single GPU (slower)
python -m scripts.base_train --depth=20 --target_param_data_ratio=20 --device_batch_size=4
3. Mid-training
# Download identity conversations
curl -L -o $NANOCHAT_BASE_DIR/identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
# Run mid-training
torchrun --standalone --nproc_per_node=8 -m scripts.mid_train
4. Supervised Fine-tuning
# Run SFT
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft
# Evaluate chat capabilities
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i sft
Memory Management
Reducing Memory Usage
If you encounter OOM (Out of Memory) errors, try these approaches:
1. Reduce Batch Size
Source: scripts/mid_train.py:45-50
# Default batch size
--device_batch_size=32
# For 40GB VRAM
--device_batch_size=16
# For 24GB VRAM
--device_batch_size=8
# For 16GB VRAM
--device_batch_size=4
2. Reduce Model Size
# Smaller model (fewer layers)
--depth=12 # instead of default 20
# Smaller context length
--max_seq_len=1024 # instead of default 2048
3. Mixed Precision Training
# Use bfloat16 (default on CUDA)
--dtype=bfloat16
# Use float32 for stability (if needed)
--dtype=float32
Single GPU Training
For single GPU setups, simply omit torchrun:
# Single GPU base training
python -m scripts.base_train --depth=12 --device_batch_size=8 --target_param_data_ratio=20
The code automatically adjusts gradient accumulation to maintain the same effective batch size.
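As a rough sketch of the idea (the variable names and token budget below are illustrative, not the exact values in scripts/base_train.py), the script targets a fixed number of tokens per optimizer step and derives the number of gradient-accumulation micro-steps from the per-device batch size, sequence length, and GPU count:
# Illustrative sketch: preserving the effective batch size via gradient accumulation
total_batch_size = 524288     # target tokens per optimizer step (example value)
device_batch_size = 8         # sequences per GPU per forward/backward pass
max_seq_len = 2048            # tokens per sequence (default shown in this guide)
world_size = 1                # number of GPUs; 1 when torchrun is omitted

tokens_per_micro_step = device_batch_size * max_seq_len * world_size
grad_accum_steps = total_batch_size // tokens_per_micro_step  # 32 in this example

print(f"Accumulating gradients over {grad_accum_steps} micro-steps per optimizer step")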
Inference and Chat
Command Line Interface
For quick testing, use the CLI chat interface:
# Interactive chat
python -m scripts.chat_cli
# Single prompt
python -m scripts.chat_cli -p "Why is the sky blue?"
# Adjust generation parameters
python -m scripts.chat_cli --temperature 0.8 --top_k 50
Web Interface
For a ChatGPT-like experience, use the web interface:
# Single GPU serving
python -m scripts.chat_web
# Multi-GPU serving (4 GPUs)
python -m scripts.chat_web --num-gpus 4
# Custom model source
python -m scripts.chat_web --source mid # or sft, rl
The web interface supports:
- Streaming responses with real-time generation
- Tool use with Python calculator integration
- Multi-GPU serving for higher throughput
- Conversation history and context management
Evaluation
Running Evaluations
Evaluate your model on standard benchmarks:
# Single task evaluation
python -m scripts.chat_eval -a ARC-Easy
# Multiple tasks
python -m scripts.chat_eval -a ARC-Easy,MMLU,GSM8K
# Distributed evaluation (8 GPUs)
torchrun --nproc_per_node=8 -m scripts.chat_eval -- -a GSM8K
Available Benchmarks
- ARC-Easy/Challenge: Science reasoning
- MMLU: Academic knowledge
- GSM8K: Mathematical reasoning with tools
- HumanEval: Code generation
- SpellingBee: Letter counting and spelling
Base Model Evaluation
For base models (before chat training), use the CORE metric:
# Evaluate base model quality
torchrun --nproc_per_node=8 -m scripts.base_eval
Customization
Adding Your Own Data
- Create conversation files: Save conversations as JSONL in the format shown below
- Update training scripts: Add your data to TaskMixture
- Adjust mixing ratios: Control how much of your data to include
Example conversation format:
{
  "messages": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you today?"}
  ]
}
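A minimal sketch of writing conversations in this format to a JSONL file inside the nanochat base directory (the filename my_conversations.jsonl is hypothetical, and you still need to register the file with the training mixture as noted above):
import json
import os

# Hypothetical output location inside the nanochat base directory
base_dir = os.environ.get("NANOCHAT_BASE_DIR", os.path.expanduser("~/.cache/nanochat"))
out_path = os.path.join(base_dir, "my_conversations.jsonl")

conversations = [
    {
        "messages": [
            {"role": "user", "content": "Hello!"},
            {"role": "assistant", "content": "Hi there! How can I help you today?"},
        ]
    },
]

# JSONL: one JSON-encoded conversation per line
with open(out_path, "w") as f:
    for conversation in conversations:
        f.write(json.dumps(conversation) + "\n")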
Hyperparameter Tuning
Key hyperparameters to experiment with:
Model Architecture:
--depth=20 # Number of transformer layers
--aspect_ratio=64 # Model dimension scaling
--head_dim=128 # Attention head dimension
--max_seq_len=2048 # Context length
Training:
--matrix_lr=0.02 # Learning rate for Muon optimizer
--embedding_lr=0.2 # Learning rate for AdamW (embeddings)
--weight_decay=0.0 # L2 regularization
--warmup_ratio=0.0 # LR warmup fraction
--warmdown_ratio=0.4 # LR decay fraction
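For instance, assuming scripts.base_train accepts the flags listed above, several of them can be combined into a single launch (the values below are illustrative defaults, not tuned recommendations):
# Example base training launch combining architecture and optimizer flags
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
  --depth=20 --max_seq_len=2048 --device_batch_size=32 \
  --matrix_lr=0.02 --embedding_lr=0.2 --weight_decay=0.0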
Monitoring and Debugging
Weights & Biases Integration
Enable experiment tracking with wandb:
# First, log in to wandb
wandb login
# Run with experiment tracking
WANDB_RUN=my_experiment bash speedrun.sh
Progress Monitoring
Monitor training progress:
# View screen session
screen -r speedrun
# Follow log file
tail -f speedrun.log
# Check model outputs
ls ~/.cache/nanochat/
Training Reports
After training, check the automatically generated report:
# View training summary
cat report.md
# Check final metrics
tail report.md
Troubleshooting
Common Issues
1. CUDA Out of Memory
# Reduce batch size
--device_batch_size=16 # or 8, 4, 2, 1
# Reduce model size
--depth=12 # instead of 20
# Reduce sequence length
--max_seq_len=1024 # instead of 2048
2. Data Download Issues
# Manually download specific shards
python -m nanochat.dataset -n 10
# Check data directory
ls ~/.cache/nanochat/base_data/
3. Tokenizer Issues
# Retrain the tokenizer on more data (keeping the default 65536 vocab size)
python -m scripts.tok_train --max_chars=5000000000 --vocab_size=65536
# Check tokenizer compression
python -m scripts.tok_eval
Getting Help
- Check the logs: Training scripts provide detailed logging
- Review the report: The generated report.md contains comprehensive metrics
- GitHub issues: Search existing issues or create a new one
- Community discussions: Join the discussion threads in the repository
Next Steps
Once you have nanochat running:
- Experiment with hyperparameters to improve performance
- Add your own datasets for domain-specific fine-tuning
- Implement new evaluation tasks for your use cases
- Scale up to larger models with more compute
- Deploy your model for production use cases
The codebase is designed to be hackable and extensible - explore the source code and make it your own!
Sources:
- README.md (quick start and hardware requirements)
- speedrun.sh (complete training pipeline)
- scripts/* (individual training and evaluation scripts)
- nanochat/common.py (hardware detection and setup)