API Reference

Complete API reference for nanochat components, interfaces, and web endpoints.

Overview

Nanochat provides multiple API interfaces:

  • Core Python API - Direct access to models, tokenizers, and engines
  • Web API - HTTP endpoints for chat and completions
  • Command Line Interface - Scripts for training and evaluation
  • Task Framework API - Interface for custom evaluation tasks

This reference covers the most commonly used APIs and their parameters.

Core Python API

GPT Model

Main transformer model class for language generation.

python
from nanochat.gpt import GPT, GPTConfig

# Create model configuration
config = GPTConfig(
    vocab_size=32768,
    n_layer=12,
    n_head=12,
    n_embd=768,
    block_size=1024
)

# Initialize model
model = GPT(config)

# Forward pass
logits = model(input_ids)  # (batch_size, seq_len, vocab_size)

GPTConfig Parameters

Parameter Type Default Description
vocab_size int 32768 Vocabulary size
n_layer int 12 Number of transformer layers
n_head int 12 Number of attention heads
n_embd int 768 Hidden dimension
block_size int 1024 Maximum sequence length
dropout float 0.0 Dropout probability
bias bool True Use bias in linear layers
rope_base float 10000.0 RoPE frequency base
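
The table's dimensions also give a rough sense of model size. A back-of-the-envelope sketch (an approximation that assumes a GPT-2-style layout with a 4x MLP expansion; it ignores biases, norms, and the LM head):

python
def approx_params(config):
    # ~12 * n_embd^2 weights per layer (attention ~4x, MLP ~8x), plus the token embedding
    per_layer = 12 * config.n_embd ** 2
    embedding = config.vocab_size * config.n_embd
    return config.n_layer * per_layer + embedding

print(f"~{approx_params(config) / 1e6:.0f}M parameters")  # config from the example above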

GPT Methods

python
# Generate text
def generate(self, input_ids, max_new_tokens=256, temperature=1.0, top_k=None):
    """Generate continuation of input sequence"""
    
# Get model device
def get_device(self):
    """Return device model is on"""
    
# Switch between training and evaluation mode
def train(self, mode=True):
    """Set training mode"""
    
def eval(self):
    """Set evaluation mode"""

Engine

High-level interface for text generation and chat.

python
from nanochat.engine import Engine

# Initialize engine
engine = Engine(model, tokenizer)

# Generate text
results, probabilities = engine.generate_batch(
    prompt_tokens,
    num_samples=1,
    max_tokens=256,
    temperature=0.7,
    top_k=50
)

# Streaming generation
for token_batch, masks in engine.generate(
    prompt_tokens,
    num_samples=1,
    max_tokens=256,
    temperature=0.7
):
    # Process tokens as they're generated
    pass

Generation Parameters

Parameter Type Default Description
num_samples int 1 Number of parallel samples
max_tokens int 256 Maximum tokens to generate
temperature float 1.0 Sampling temperature (0.0-2.0)
top_k int None Top-k sampling parameter
seed int None Random seed for generation
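
The seed parameter is the one knob not used in the examples above; fixing it should make sampling repeatable. A sketch (assuming generate_batch accepts seed as listed and that results are plain lists of token IDs):

python
kwargs = dict(max_tokens=64, temperature=0.7, top_k=50, seed=1234)
results_a, _ = engine.generate_batch(prompt_tokens, **kwargs)
results_b, _ = engine.generate_batch(prompt_tokens, **kwargs)
assert results_a == results_b  # same seed, same samples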

Tokenizer

Text encoding/decoding interface.

python
from nanochat.tokenizer import HuggingFaceTokenizer

# Load tokenizer
tokenizer = HuggingFaceTokenizer.from_pretrained("path/to/tokenizer")

# Encode text
tokens = tokenizer.encode("Hello, world!")  # [123, 456, 789]

# Decode tokens  
text = tokenizer.decode([123, 456, 789])  # "Hello, world!"

# Render conversation
tokens, mask = tokenizer.render_conversation({
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"}
    ]
})

Tokenizer Methods

python
# Basic encoding/decoding
def encode(self, text: str) -> List[int]:
    """Encode text to token IDs"""
    
def decode(self, token_ids: List[int]) -> str:
    """Decode token IDs to text"""

# Special tokens
def get_bos_token_id(self) -> int:
    """Get beginning-of-sequence token ID"""
    
def encode_special(self, special_token: str) -> int:
    """Encode special token to ID"""

# Conversation handling
def render_conversation(self, conversation: dict) -> Tuple[List[int], List[bool]]:
    """Convert conversation to tokens with loss mask"""
    
def render_for_completion(self, conversation: dict) -> List[int]:
    """Render conversation prompt for completion"""

Task Framework

Interface for creating custom evaluation tasks.

python
from tasks.common import Task

class CustomTask(Task):
    @property
    def eval_type(self):
        return 'generative'  # or 'categorical'
    
    def num_examples(self):
        return len(self.dataset)
    
    def get_example(self, index):
        """Return conversation dict for example at index"""
        return {
            "messages": [
                {"role": "user", "content": "..."},
                {"role": "assistant", "content": "..."}
            ]
        }
    
    def evaluate(self, conversation, completion):
        """Evaluate completion against ground truth; return True on success"""
        expected = conversation["messages"][-1]["content"]
        return completion.strip() == expected.strip()

Task Composition

python
from tasks.common import TaskMixture, TaskSequence

# Mix multiple tasks
mixture = TaskMixture([task1, task2, task3])

# Sequential tasks
sequence = TaskSequence([task1, task2, task3])

# Task slicing
subset = task[100:200]  # Examples 100-199
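
Composed tasks are meant to be drop-in replacements for a single Task, so an evaluation loop can stay the same either way. A sketch assuming mixtures expose the same num_examples/get_example/evaluate interface:

python
for i in range(mixture.num_examples()):
    conversation = mixture.get_example(i)
    # ... generate a completion for the conversation, then score it:
    # correct = mixture.evaluate(conversation, completion)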

Web API

RESTful HTTP API for chat completions and health monitoring.

Base URL

text
http://localhost:8000

Chat Completions

Stream chat completions using an OpenAI-style request format.

http
POST /chat/completions
Content-Type: application/json

{
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 100,
  "top_k": 50
}

Request Parameters

Field Type Required Description
messages Array Yes Conversation messages
temperature float No Sampling temperature (0.0-2.0)
max_tokens int No Maximum tokens to generate
top_k int No Top-k sampling parameter

Message Format

json
{
  "role": "user",        // "user" or "assistant"
  "content": "text"      // Message content
}

Response Format

Server-Sent Events (SSE) stream:

text
data: {"token": "Hello", "gpu": 0}

data: {"token": " there", "gpu": 0}

data: {"done": true}

Request Limits

The server validates each request against the following limits (a client-side guard that mirrors them is sketched after the list):

  • Maximum 500 messages per request
  • Maximum 8000 characters per message
  • Maximum 32000 characters total conversation
  • Temperature: 0.0-2.0
  • Top-k: 1-200
  • Max tokens: 1-4096
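
The same limits as a client-side guard (values copied from the list above; the function itself is illustrative):

python
MAX_MESSAGES = 500
MAX_MESSAGE_CHARS = 8000
MAX_TOTAL_CHARS = 32000

def validate_request(messages, temperature=0.7, top_k=50, max_tokens=256):
    """Raise ValueError before sending a request the server would reject."""
    if len(messages) > MAX_MESSAGES:
        raise ValueError("too many messages")
    if any(len(m["content"]) > MAX_MESSAGE_CHARS for m in messages):
        raise ValueError("a message is too long")
    if sum(len(m["content"]) for m in messages) > MAX_TOTAL_CHARS:
        raise ValueError("conversation is too long")
    if not (0.0 <= temperature <= 2.0 and 1 <= top_k <= 200 and 1 <= max_tokens <= 4096):
        raise ValueError("sampling parameter out of range")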

Health Check

Monitor server status and worker availability.

http
GET /health

Response

json
{
  "status": "ok",
  "ready": true,
  "num_gpus": 4,
  "available_workers": 3
}
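
Clients can poll this endpoint and wait until the server reports ready before sending completions. A sketch using requests (the URL matches the base URL above):

python
import time
import requests

def wait_until_ready(base_url="http://localhost:8000", timeout=60):
    """Poll /health until the server reports ready or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            r = requests.get(f"{base_url}/health", timeout=5)
            if r.ok:
                health = r.json()
                if health.get("ready") and health.get("available_workers", 0) > 0:
                    return health
        except requests.RequestException:
            pass  # server not reachable yet
        time.sleep(1)
    raise TimeoutError("server did not become ready")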

Statistics

Get detailed worker pool statistics.

http
GET /stats

Response

json
{
  "total_workers": 4,
  "available_workers": 3,
  "busy_workers": 1,
  "workers": [
    {"gpu_id": 0, "device": "cuda:0"},
    {"gpu_id": 1, "device": "cuda:1"},
    {"gpu_id": 2, "device": "cuda:2"},
    {"gpu_id": 3, "device": "cuda:3"}
  ]
}

Error Responses

Standard HTTP error codes with JSON error messages:

json
{
  "detail": "Error message description"
}

Common error codes (a client-side handling sketch follows the list):

  • 400: Bad Request (invalid parameters)
  • 422: Validation Error (malformed request)
  • 500: Internal Server Error
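
A minimal way to surface these on the client (a sketch; the endpoint and payload follow the earlier examples, and an empty messages list is expected to fail validation):

python
import requests

resp = requests.post(
    "http://localhost:8000/chat/completions",
    json={"messages": [], "temperature": 0.7},  # empty conversation should be rejected
)
if resp.status_code != 200:
    try:
        detail = resp.json().get("detail", "unknown error")
    except ValueError:
        detail = resp.text
    print(f"{resp.status_code}: {detail}")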

Command Line Interface

Training Scripts

Base Model Training

bash
python -m scripts.base_train [OPTIONS]

Options:

  • --model-size {tiny,small,medium,large}: Model architecture
  • --batch-size INT: Batch size per GPU
  • --total-steps INT: Total training steps
  • --lr FLOAT: Peak learning rate
  • --seq-len INT: Context length
  • --dtype {float32,bfloat16}: Precision

Chat Fine-Tuning

bash
python -m scripts.chat_sft [OPTIONS]

Options:

  • --tasks TEXT: Comma-separated task names
  • --base-model TEXT: Base model to fine-tune
  • --total-steps INT: Training steps
  • --lr FLOAT: Learning rate

Tokenizer Training

bash
python -m scripts.tok_train [OPTIONS]

Options:

  • --vocab-size INT: Vocabulary size
  • --train-ratio FLOAT: Training data ratio
  • --val-ratio FLOAT: Validation data ratio

Evaluation Scripts

Base Model Evaluation

bash
python -m scripts.base_eval [OPTIONS]

Options:

  • --hf-path TEXT: HuggingFace model path
  • --max-per-task INT: Examples per task
  • --model-tag TEXT: Model tag
  • --step INT: Training step

Chat Model Evaluation

bash
python -m scripts.chat_eval [OPTIONS]

Options:

  • -i, --source {sft,mid,rl}: Model source
  • -a, --task-name TEXT: Task names (pipe-separated)
  • -t, --temperature FLOAT: Sampling temperature
  • -n, --num-samples INT: Samples per problem
  • -b, --batch-size INT: Evaluation batch size

Inference Scripts

CLI Chat

bash
python -m scripts.chat_cli [OPTIONS]

Options:

  • -i, --source {sft,mid,rl}: Model source
  • -p, --prompt TEXT: Single prompt mode
  • -t, --temperature FLOAT: Generation temperature
  • -k, --top-k INT: Top-k parameter

Web Server

bash
python -m scripts.chat_web [OPTIONS]

Options:

  • -n, --num-gpus INT: Number of GPUs
  • -p, --port INT: Server port
  • --host TEXT: Bind host
  • -i, --source {sft,mid,rl}: Model source

Configuration

Model Loading

python
from nanochat.checkpoint_manager import load_model

# Load specific model
model, tokenizer, meta = load_model(
    source="sft",           # Model type: base, mid, sft, rl
    device="cuda:0",        # Device
    phase="eval",           # Phase: train or eval
    model_tag="v1.0",      # Optional model tag
    step=10000             # Optional specific step
)

Environment Variables

bash
# Data directory
export NANOCHAT_DATA_DIR=/path/to/data

# Model cache directory  
export NANOCHAT_MODEL_DIR=/path/to/models

# Tokenizer path
export NANOCHAT_TOKENIZER_PATH=/path/to/tokenizer
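
These can be read from your own scripts like any other environment variable (a sketch; the fallback locations are illustrative, not nanochat defaults):

python
import os
from pathlib import Path

# Fallbacks are illustrative; export the variables above to override them
data_dir = Path(os.environ.get("NANOCHAT_DATA_DIR", "~/.cache/nanochat/data")).expanduser()
model_dir = Path(os.environ.get("NANOCHAT_MODEL_DIR", "~/.cache/nanochat/models")).expanduser()
tokenizer_path = os.environ.get("NANOCHAT_TOKENIZER_PATH")  # None if unset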

Error Handling

Common Exceptions

python
# Model loading errors
try:
    model, tokenizer, meta = load_model("sft", device)
except FileNotFoundError:
    print("Model checkpoint not found")
except torch.cuda.OutOfMemoryError:
    print("Insufficient GPU memory")

# Generation errors
try:
    results = engine.generate_batch(tokens)
except RuntimeError as e:
    print(f"Generation failed: {e}")

Debugging

python
# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Check model info
print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")
print(f"Model device: {model.get_device()}")

# Validate tokenizer
tokens = tokenizer.encode("test")
decoded = tokenizer.decode(tokens)
assert decoded == "test"

Examples

Basic Text Generation

python
from nanochat.checkpoint_manager import load_model
from nanochat.engine import Engine

# Load model
model, tokenizer, meta = load_model("sft", "cuda:0", phase="eval")
engine = Engine(model, tokenizer)

# Generate text
prompt = "What is the capital of France?"
tokens = tokenizer.encode(prompt)
results, _ = engine.generate_batch(
    tokens,
    max_tokens=50,
    temperature=0.7
)

response = tokenizer.decode(results[0])
print(response)

Custom Task Evaluation

python
from tasks.common import Task

class MathTask(Task):
    def get_example(self, index):
        problem = self.problems[index]
        return {
            "messages": [
                {"role": "user", "content": problem["question"]},
                {"role": "assistant", "content": problem["answer"]}
            ]
        }
    
    def evaluate(self, conversation, completion):
        expected = conversation["messages"][1]["content"]
        return completion.strip() == expected.strip()

# Use in evaluation (evaluate_task stands in for your evaluation harness,
# e.g. the loop driven by scripts.chat_eval)
task = MathTask()
accuracy = evaluate_task(model, tokenizer, task)

Web API Client

python
import requests
import json

def chat_with_api(messages, temperature=0.7):
    response = requests.post(
        "http://localhost:8000/chat/completions",
        json={
            "messages": messages,
            "temperature": temperature,
            "max_tokens": 100
        },
        stream=True
    )
    
    for line in response.iter_lines():
        if line.startswith(b"data: "):
            data = json.loads(line[6:])
            if "token" in data:
                print(data["token"], end="", flush=True)
            elif data.get("done"):
                break

# Usage
messages = [{"role": "user", "content": "Hello!"}]
chat_with_api(messages)

Sources:

  • nanochat/gpt.py (GPT model API)
  • nanochat/engine.py (inference engine API)
  • nanochat/tokenizer.py (tokenizer API)
  • scripts/chat_web.py (web API endpoints)
Last updated: 1/10/2026