API Reference

Complete API reference for nanochat components, interfaces, and web endpoints.

Overview

Nanochat provides multiple API interfaces:

  • Core Python API - Direct access to models, tokenizers, and engines
  • Web API - HTTP endpoints for chat and completions
  • Command Line Interface - Scripts for training and evaluation
  • Task Framework API - Interface for custom evaluation tasks

This reference covers the most commonly used APIs and their parameters.

Core Python API

GPT Model

Main transformer model class for language generation.

python
from nanochat.gpt import GPT, GPTConfig

# Create model configuration
config = GPTConfig(
    vocab_size=32768,
    n_layer=12,
    n_head=12,
    n_embd=768,
    block_size=1024
)

# Initialize model
model = GPT(config)

# Forward pass
logits = model(input_ids)  # (batch_size, seq_len, vocab_size)

GPTConfig Parameters

Parameter Type Default Description
vocab_size int 32768 Vocabulary size
n_layer int 12 Number of transformer layers
n_head int 12 Number of attention heads
n_embd int 768 Hidden dimension
block_size int 1024 Maximum sequence length
dropout float 0.0 Dropout probability
bias bool True Use bias in linear layers
rope_base float 10000.0 RoPE frequency base
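
The table's dimensions also give a rough sense of model size. A back-of-the-envelope sketch (an approximation that assumes a GPT-2-style layout with a 4x MLP expansion; it ignores biases, norms, and the LM head):

python
def approx_params(config):
    # ~12 * n_embd^2 weights per layer (attention ~4x, MLP ~8x), plus the token embedding
    per_layer = 12 * config.n_embd ** 2
    embedding = config.vocab_size * config.n_embd
    return config.n_layer * per_layer + embedding

print(f"~{approx_params(config) / 1e6:.0f}M parameters")  # config from the example above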

GPT Methods

python
# Generate text
def generate(self, input_ids, max_new_tokens=256, temperature=1.0, top_k=None):
    """Generate continuation of input sequence"""
    
# Get model device
def get_device(self):
    """Return device model is on"""
    
# Switch between training and evaluation mode
def train(self, mode=True):
    """Set training mode"""
    
def eval(self):
    """Set evaluation mode"""

Engine

High-level interface for text generation and chat.

python
from nanochat.engine import Engine

# Initialize engine
engine = Engine(model, tokenizer)

# Generate text
results, probabilities = engine.generate_batch(
    prompt_tokens,
    num_samples=1,
    max_tokens=256,
    temperature=0.7,
    top_k=50
)

# Streaming generation
for token_batch, masks in engine.generate(
    prompt_tokens,
    num_samples=1,
    max_tokens=256,
    temperature=0.7
):
    # Process tokens as they're generated
    pass

Generation Parameters

Parameter Type Default Description
num_samples int 1 Number of parallel samples
max_tokens int 256 Maximum tokens to generate
temperature float 1.0 Sampling temperature (0.0-2.0)
top_k int None Top-k sampling parameter
seed int None Random seed for generation
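
The seed parameter is the one knob not used in the examples above; fixing it should make sampling repeatable. A sketch (assuming generate_batch accepts seed as listed and that results are plain lists of token IDs):

python
kwargs = dict(max_tokens=64, temperature=0.7, top_k=50, seed=1234)
results_a, _ = engine.generate_batch(prompt_tokens, **kwargs)
results_b, _ = engine.generate_batch(prompt_tokens, **kwargs)
assert results_a == results_b  # same seed, same samples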

Tokenizer

Text encoding/decoding interface.

python
from nanochat.tokenizer import HuggingFaceTokenizer

# Load tokenizer
tokenizer = HuggingFaceTokenizer.from_pretrained("path/to/tokenizer")

# Encode text
tokens = tokenizer.encode("Hello, world!")  # [123, 456, 789]

# Decode tokens  
text = tokenizer.decode([123, 456, 789])  # "Hello, world!"

# Render conversation
tokens, mask = tokenizer.render_conversation({
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"}
    ]
})

Tokenizer Methods

python
# Basic encoding/decoding
def encode(self, text: str) -> List[int]:
    """Encode text to token IDs"""
    
def decode(self, token_ids: List[int]) -> str:
    """Decode token IDs to text"""

# Special tokens
def get_bos_token_id(self) -> int:
    """Get beginning-of-sequence token ID"""
    
def encode_special(self, special_token: str) -> int:
    """Encode special token to ID"""

# Conversation handling
def render_conversation(self, conversation: dict) -> Tuple[List[int], List[bool]]:
    """Convert conversation to tokens with loss mask"""
    
def render_for_completion(self, conversation: dict) -> List[int]:
    """Render conversation prompt for completion"""

Task Framework

Interface for creating custom evaluation tasks.

python
from tasks.common import Task

class CustomTask(Task):
    @property
    def eval_type(self):
        return 'generative'  # or 'categorical'
    
    def num_examples(self):
        return len(self.dataset)
    
    def get_example(self, index):
        """Return conversation dict for example at index"""
        return {
            "messages": [
                {"role": "user", "content": "..."},
                {"role": "assistant", "content": "..."}
            ]
        }
    
    def evaluate(self, conversation, completion):
        """Evaluate completion against ground truth; return True on success"""
        expected = conversation["messages"][-1]["content"]
        return completion.strip() == expected.strip()

Task Composition

python
from tasks.common import TaskMixture, TaskSequence

# Mix multiple tasks
mixture = TaskMixture([task1, task2, task3])

# Sequential tasks
sequence = TaskSequence([task1, task2, task3])

# Task slicing
subset = task[100:200]  # Examples 100-199
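
Composed tasks are meant to be drop-in replacements for a single Task, so an evaluation loop can stay the same either way. A sketch assuming mixtures expose the same num_examples/get_example/evaluate interface:

python
for i in range(mixture.num_examples()):
    conversation = mixture.get_example(i)
    # ... generate a completion for the conversation, then score it:
    # correct = mixture.evaluate(conversation, completion)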

Web API

RESTful HTTP API for chat completions and health monitoring.

Base URL

text
http://localhost:8000

Chat Completions

Stream chat completions using an OpenAI-style request format.

http
POST /chat/completions
Content-Type: application/json

{
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 100,
  "top_k": 50
}

Request Parameters

Field Type Required Description
messages Array Yes Conversation messages
temperature float No Sampling temperature (0.0-2.0)
max_tokens int No Maximum tokens to generate
top_k int No Top-k sampling parameter

Message Format

json
{
  "role": "user",        // "user" or "assistant"
  "content": "text"      // Message content
}

Response Format

Server-Sent Events (SSE) stream:

text
data: {"token": "Hello", "gpu": 0}

data: {"token": " there", "gpu": 0}

data: {"done": true}

Request Limits

The server validates each request against the following limits (a client-side guard that mirrors them is sketched after the list):

  • Maximum 500 messages per request
  • Maximum 8000 characters per message
  • Maximum 32000 characters total conversation
  • Temperature: 0.0-2.0
  • Top-k: 1-200
  • Max tokens: 1-4096
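
The same limits as a client-side guard (values copied from the list above; the function itself is illustrative):

python
MAX_MESSAGES = 500
MAX_MESSAGE_CHARS = 8000
MAX_TOTAL_CHARS = 32000

def validate_request(messages, temperature=0.7, top_k=50, max_tokens=256):
    """Raise ValueError before sending a request the server would reject."""
    if len(messages) > MAX_MESSAGES:
        raise ValueError("too many messages")
    if any(len(m["content"]) > MAX_MESSAGE_CHARS for m in messages):
        raise ValueError("a message is too long")
    if sum(len(m["content"]) for m in messages) > MAX_TOTAL_CHARS:
        raise ValueError("conversation is too long")
    if not (0.0 <= temperature <= 2.0 and 1 <= top_k <= 200 and 1 <= max_tokens <= 4096):
        raise ValueError("sampling parameter out of range")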

Health Check

Monitor server status and worker availability.

http
GET /health

Response

json
{
  "status": "ok",
  "ready": true,
  "num_gpus": 4,
  "available_workers": 3
}
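
Clients can poll this endpoint and wait until the server reports ready before sending completions. A sketch using requests (the URL matches the base URL above):

python
import time
import requests

def wait_until_ready(base_url="http://localhost:8000", timeout=60):
    """Poll /health until the server reports ready or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            r = requests.get(f"{base_url}/health", timeout=5)
            if r.ok:
                health = r.json()
                if health.get("ready") and health.get("available_workers", 0) > 0:
                    return health
        except requests.RequestException:
            pass  # server not reachable yet
        time.sleep(1)
    raise TimeoutError("server did not become ready")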

Statistics

Get detailed worker pool statistics.

http
GET /stats

Response

json
{
  "total_workers": 4,
  "available_workers": 3,
  "busy_workers": 1,
  "workers": [
    {"gpu_id": 0, "device": "cuda:0"},
    {"gpu_id": 1, "device": "cuda:1"},
    {"gpu_id": 2, "device": "cuda:2"},
    {"gpu_id": 3, "device": "cuda:3"}
  ]
}

Error Responses

Standard HTTP error codes with JSON error messages:

json
{
  "detail": "Error message description"
}

Common error codes (a client-side handling sketch follows the list):

  • 400: Bad Request (invalid parameters)
  • 422: Validation Error (malformed request)
  • 500: Internal Server Error
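
A minimal way to surface these on the client (a sketch; the endpoint and payload follow the earlier examples, and an empty messages list is expected to fail validation):

python
import requests

resp = requests.post(
    "http://localhost:8000/chat/completions",
    json={"messages": [], "temperature": 0.7},  # empty conversation should be rejected
)
if resp.status_code != 200:
    try:
        detail = resp.json().get("detail", "unknown error")
    except ValueError:
        detail = resp.text
    print(f"{resp.status_code}: {detail}")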

Command Line Interface

Training Scripts

Base Model Training

bash
python -m scripts.base_train [OPTIONS]

Options:

  • --model-size {tiny,small,medium,large}: Model architecture
  • --batch-size INT: Batch size per GPU
  • --total-steps INT: Total training steps
  • --lr FLOAT: Peak learning rate
  • --seq-len INT: Context length
  • --dtype {float32,bfloat16}: Precision

Chat Fine-Tuning

bash
python -m scripts.chat_sft [OPTIONS]

Options:

  • --tasks TEXT: Comma-separated task names
  • --base-model TEXT: Base model to fine-tune
  • --total-steps INT: Training steps
  • --lr FLOAT: Learning rate

Tokenizer Training

bash
python -m scripts.tok_train [OPTIONS]

Options:

  • --vocab-size INT: Vocabulary size
  • --train-ratio FLOAT: Training data ratio
  • --val-ratio FLOAT: Validation data ratio

Evaluation Scripts

Base Model Evaluation

bash
python -m scripts.base_eval [OPTIONS]

Options:

  • --hf-path TEXT: HuggingFace model path
  • --max-per-task INT: Examples per task
  • --model-tag TEXT: Model tag
  • --step INT: Training step

Chat Model Evaluation

bash
python -m scripts.chat_eval [OPTIONS]

Options:

  • -i, --source {sft,mid,rl}: Model source
  • -a, --task-name TEXT: Task names (pipe-separated)
  • -t, --temperature FLOAT: Sampling temperature
  • -n, --num-samples INT: Samples per problem
  • -b, --batch-size INT: Evaluation batch size

Inference Scripts

CLI Chat

bash
python -m scripts.chat_cli [OPTIONS]

Options:

  • -i, --source {sft,mid,rl}: Model source
  • -p, --prompt TEXT: Single prompt mode
  • -t, --temperature FLOAT: Generation temperature
  • -k, --top-k INT: Top-k parameter

Web Server

bash
python -m scripts.chat_web [OPTIONS]

Options:

  • -n, --num-gpus INT: Number of GPUs
  • -p, --port INT: Server port
  • --host TEXT: Bind host
  • -i, --source {sft,mid,rl}: Model source

Configuration

Model Loading

python
from nanochat.checkpoint_manager import load_model

# Load specific model
model, tokenizer, meta = load_model(
    source="sft",           # Model type: base, mid, sft, rl
    device="cuda:0",        # Device
    phase="eval",           # Phase: train or eval
    model_tag="v1.0",      # Optional model tag
    step=10000             # Optional specific step
)

Environment Variables

bash
# Data directory
export NANOCHAT_DATA_DIR=/path/to/data

# Model cache directory  
export NANOCHAT_MODEL_DIR=/path/to/models

# Tokenizer path
export NANOCHAT_TOKENIZER_PATH=/path/to/tokenizer
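
These can be read from your own scripts like any other environment variable (a sketch; the fallback locations are illustrative, not nanochat defaults):

python
import os
from pathlib import Path

# Fallbacks are illustrative; export the variables above to override them
data_dir = Path(os.environ.get("NANOCHAT_DATA_DIR", "~/.cache/nanochat/data")).expanduser()
model_dir = Path(os.environ.get("NANOCHAT_MODEL_DIR", "~/.cache/nanochat/models")).expanduser()
tokenizer_path = os.environ.get("NANOCHAT_TOKENIZER_PATH")  # None if unset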

Error Handling

Common Exceptions

python
# Model loading errors
try:
    model, tokenizer, meta = load_model("sft", device)
except FileNotFoundError:
    print("Model checkpoint not found")
except torch.cuda.OutOfMemoryError:
    print("Insufficient GPU memory")

# Generation errors
try:
    results = engine.generate_batch(tokens)
except RuntimeError as e:
    print(f"Generation failed: {e}")

Debugging

python
# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Check model info
print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")
print(f"Model device: {model.get_device()}")

# Validate tokenizer
tokens = tokenizer.encode("test")
decoded = tokenizer.decode(tokens)
assert decoded == "test"

Examples

Basic Text Generation

python
from nanochat.checkpoint_manager import load_model
from nanochat.engine import Engine

# Load model
model, tokenizer, meta = load_model("sft", "cuda:0", phase="eval")
engine = Engine(model, tokenizer)

# Generate text
prompt = "What is the capital of France?"
tokens = tokenizer.encode(prompt)
results, _ = engine.generate_batch(
    tokens,
    max_tokens=50,
    temperature=0.7
)

response = tokenizer.decode(results[0])
print(response)

Custom Task Evaluation

python
from tasks.common import Task

class MathTask(Task):
    def get_example(self, index):
        problem = self.problems[index]
        return {
            "messages": [
                {"role": "user", "content": problem["question"]},
                {"role": "assistant", "content": problem["answer"]}
            ]
        }
    
    def evaluate(self, conversation, completion):
        expected = conversation["messages"][1]["content"]
        return completion.strip() == expected.strip()

# Use in evaluation (evaluate_task stands in for your evaluation harness,
# e.g. the loop driven by scripts.chat_eval)
task = MathTask()
accuracy = evaluate_task(model, tokenizer, task)

Web API Client

python
import requests
import json

def chat_with_api(messages, temperature=0.7):
    response = requests.post(
        "http://localhost:8000/chat/completions",
        json={
            "messages": messages,
            "temperature": temperature,
            "max_tokens": 100
        },
        stream=True
    )
    
    for line in response.iter_lines():
        if line.startswith(b"data: "):
            data = json.loads(line[6:])
            if "token" in data:
                print(data["token"], end="", flush=True)
            elif data.get("done"):
                break

# Usage
messages = [{"role": "user", "content": "Hello!"}]
chat_with_api(messages)

Sources:

  • nanochat/gpt.py (GPT model API)
  • nanochat/engine.py (inference engine API)
  • nanochat/tokenizer.py (tokenizer API)
  • scripts/chat_web.py (web API endpoints)
Last updated: 1/10/2026