API Reference
Complete API reference for nanochat components, interfaces, and web endpoints.
Overview
Nanochat provides multiple API interfaces:
- Core Python API - Direct access to models, tokenizers, and engines
- Web API - HTTP endpoints for chat and completions
- Command Line Interface - Scripts for training and evaluation
- Task Framework API - Interface for custom evaluation tasks
This reference covers the most commonly used APIs and their parameters.
Core Python API
GPT Model
Main transformer model class for language generation.
from nanochat.gpt import GPT, GPTConfig
# Create model configuration
config = GPTConfig(
    vocab_size=32768,
    n_layer=12,
    n_head=12,
    n_embd=768,
    block_size=1024
)
# Initialize model
model = GPT(config)
# Forward pass
logits = model(input_ids) # (batch_size, seq_len, vocab_size)
GPTConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| vocab_size | int | 32768 | Vocabulary size |
| n_layer | int | 12 | Number of transformer layers |
| n_head | int | 12 | Number of attention heads |
| n_embd | int | 768 | Hidden dimension |
| block_size | int | 1024 | Maximum sequence length |
| dropout | float | 0.0 | Dropout probability |
| bias | bool | True | Use bias in linear layers |
| rope_base | float | 10000.0 | RoPE frequency base |
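The earlier example only sets the architectural fields; a hedged sketch of overriding the remaining parameters explicitly, assuming each keyword is accepted exactly as named in the table above:
config = GPTConfig(
    vocab_size=32768,
    n_layer=12,
    n_head=12,
    n_embd=768,
    block_size=1024,
    dropout=0.1,       # enable dropout, e.g. for fine-tuning runs
    bias=False,        # drop biases in linear layers
    rope_base=10000.0  # RoPE frequency base
)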
GPT Methods
# Generate text
def generate(self, input_ids, max_new_tokens=256, temperature=1.0, top_k=None):
    """Generate a continuation of the input sequence"""

# Get model device
def get_device(self):
    """Return the device the model is on"""

# Switch between training and evaluation modes
def train(self, mode=True):
    """Set training mode"""

def eval(self):
    """Set evaluation mode"""
Engine
High-level interface for text generation and chat.
from nanochat.engine import Engine
# Initialize engine
engine = Engine(model, tokenizer)
# Generate text
results, probabilities = engine.generate_batch(
    prompt_tokens,
    num_samples=1,
    max_tokens=256,
    temperature=0.7,
    top_k=50
)
# Streaming generation
for token_batch, masks in engine.generate(
    prompt_tokens,
    num_samples=1,
    max_tokens=256,
    temperature=0.7
):
    # Process tokens as they're generated
    pass
Generation Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_samples | int | 1 | Number of parallel samples |
| max_tokens | int | 256 | Maximum tokens to generate |
| temperature | float | 1.0 | Sampling temperature (0.0-2.0) |
| top_k | int | None | Top-k sampling parameter |
| seed | int | None | Random seed for generation |
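The seed parameter from the table can be used for reproducible sampling; a hedged sketch, assuming generate_batch accepts it as a keyword argument alongside the others:
results, probabilities = engine.generate_batch(
    prompt_tokens,
    num_samples=4,     # draw four parallel samples
    max_tokens=128,
    temperature=0.7,
    top_k=50,
    seed=42            # fix the RNG so repeated calls return identical samples
)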
Tokenizer
Text encoding/decoding interface.
from nanochat.tokenizer import HuggingFaceTokenizer
# Load tokenizer
tokenizer = HuggingFaceTokenizer.from_pretrained("path/to/tokenizer")
# Encode text
tokens = tokenizer.encode("Hello, world!") # [123, 456, 789]
# Decode tokens
text = tokenizer.decode([123, 456, 789]) # "Hello, world!"
# Render conversation
tokens, mask = tokenizer.render_conversation({
    "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"}
    ]
})
Tokenizer Methods
# Basic encoding/decoding
def encode(self, text: str) -> List[int]:
    """Encode text to token IDs"""

def decode(self, token_ids: List[int]) -> str:
    """Decode token IDs to text"""

# Special tokens
def get_bos_token_id(self) -> int:
    """Get beginning-of-sequence token ID"""

def encode_special(self, special_token: str) -> int:
    """Encode special token to ID"""

# Conversation handling
def render_conversation(self, conversation: dict) -> Tuple[List[int], List[bool]]:
    """Convert conversation to tokens with loss mask"""

def render_for_completion(self, conversation: dict) -> List[int]:
    """Render conversation prompt for completion"""
Task Framework
Interface for creating custom evaluation tasks.
from tasks.common import Task
class CustomTask(Task):
    @property
    def eval_type(self):
        return 'generative'  # or 'categorical'

    def num_examples(self):
        return len(self.dataset)

    def get_example(self, index):
        """Return conversation dict for example at index"""
        return {
            "messages": [
                {"role": "user", "content": "..."},
                {"role": "assistant", "content": "..."}
            ]
        }

    def evaluate(self, conversation, completion):
        """Evaluate completion against ground truth"""
        return success_boolean
Task Composition
from tasks.common import TaskMixture, TaskSequence
# Mix multiple tasks
mixture = TaskMixture([task1, task2, task3])
# Sequential tasks
sequence = TaskSequence([task1, task2, task3])
# Task slicing
subset = task[100:200] # Examples 100-199
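Composed tasks can be walked with the same Task interface shown above; this sketch assumes TaskMixture exposes num_examples and get_example unchanged:
mixture = TaskMixture([task1, task2, task3])
for i in range(mixture.num_examples()):
    conversation = mixture.get_example(i)
    # hand the conversation to your evaluation loop
    ...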
Web API
RESTful HTTP API for chat completions and health monitoring.
Base URL
http://localhost:8000
Chat Completions
Stream chat completions using an OpenAI-compatible request format.
POST /chat/completions
Content-Type: application/json
{
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 100,
  "top_k": 50
}
Request Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| messages | Array | Yes | Conversation messages |
| temperature | float | No | Sampling temperature (0.0-2.0) |
| max_tokens | int | No | Maximum tokens to generate |
| top_k | int | No | Top-k sampling parameter |
Message Format
{
  "role": "user",    // "user" or "assistant"
  "content": "text"  // Message content
}
Response Format
Server-Sent Events (SSE) stream:
data: {"token": "Hello", "gpu": 0}
data: {"token": " there", "gpu": 0}
data: {"done": true}
Rate Limits
- Maximum 500 messages per request
- Maximum 8000 characters per message
- Maximum 32000 characters total conversation
- Temperature: 0.0-2.0
- Top-k: 1-200
- Max tokens: 1-4096
Health Check
Monitor server status and worker availability.
GET /health
Response
{
  "status": "ok",
  "ready": true,
  "num_gpus": 4,
  "available_workers": 3
}
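A small readiness probe against this endpoint (a sketch using the requests library; field names follow the response above):
import requests

info = requests.get("http://localhost:8000/health", timeout=5).json()
if info.get("ready"):
    print(f"Ready: {info['available_workers']} of {info['num_gpus']} workers free")
else:
    print("Server not ready yet")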
Statistics
Get detailed worker pool statistics.
GET /stats
Response
{
  "total_workers": 4,
  "available_workers": 3,
  "busy_workers": 1,
  "workers": [
    {"gpu_id": 0, "device": "cuda:0"},
    {"gpu_id": 1, "device": "cuda:1"},
    {"gpu_id": 2, "device": "cuda:2"},
    {"gpu_id": 3, "device": "cuda:3"}
  ]
}
Error Responses
Standard HTTP error codes with JSON error messages:
{
  "detail": "Error message description"
}
Common error codes:
- 400: Bad Request (invalid parameters)
- 422: Validation Error (malformed request)
- 500: Internal Server Error
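Client-side handling is straightforward because every error body carries a detail field; a sketch using requests against a minimal request body:
import requests

payload = {"messages": [{"role": "user", "content": "Hello!"}]}
resp = requests.post("http://localhost:8000/chat/completions", json=payload)
if resp.status_code != 200:
    # Error bodies carry a human-readable "detail" message
    print(f"Request failed ({resp.status_code}): {resp.json().get('detail')}")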
Command Line Interface
Training Scripts
Base Model Training
python -m scripts.base_train [OPTIONS]
Options:
- --model-size {tiny,small,medium,large}: Model architecture
- --batch-size INT: Batch size per GPU
- --total-steps INT: Total training steps
- --lr FLOAT: Peak learning rate
- --seq-len INT: Context length
- --dtype {float32,bfloat16}: Precision
Chat Fine-Tuning
python -m scripts.chat_sft [OPTIONS]
Options:
- --tasks TEXT: Comma-separated task names
- --base-model TEXT: Base model to fine-tune
- --total-steps INT: Training steps
- --lr FLOAT: Learning rate
Tokenizer Training
python -m scripts.tok_train [OPTIONS]
Options:
- --vocab-size INT: Vocabulary size
- --train-ratio FLOAT: Training data ratio
- --val-ratio FLOAT: Validation data ratio
Evaluation Scripts
Base Model Evaluation
python -m scripts.base_eval [OPTIONS]
Options:
- --hf-path TEXT: HuggingFace model path
- --max-per-task INT: Examples per task
- --model-tag TEXT: Model tag
- --step INT: Training step
Chat Model Evaluation
python -m scripts.chat_eval [OPTIONS]
Options:
- -i, --source {sft,mid,rl}: Model source
- -a, --task-name TEXT: Task names (pipe-separated)
- -t, --temperature FLOAT: Sampling temperature
- -n, --num-samples INT: Samples per problem
- -b, --batch-size INT: Evaluation batch size
Inference Scripts
CLI Chat
python -m scripts.chat_cli [OPTIONS]
Options:
- -i, --source {sft,mid,rl}: Model source
- -p, --prompt TEXT: Single prompt mode
- -t, --temperature FLOAT: Generation temperature
- -k, --top-k INT: Top-k parameter
Web Server
python -m scripts.chat_web [OPTIONS]
Options:
- -n, --num-gpus INT: Number of GPUs
- -p, --port INT: Server port
- --host TEXT: Bind host
- -i, --source {sft,mid,rl}: Model source
Configuration
Model Loading
from nanochat.checkpoint_manager import load_model
# Load specific model
model, tokenizer, meta = load_model(
    source="sft",      # Model type: base, mid, sft, rl
    device="cuda:0",   # Device
    phase="eval",      # Phase: train or eval
    model_tag="v1.0",  # Optional model tag
    step=10000         # Optional specific step
)
Environment Variables
# Data directory
export NANOCHAT_DATA_DIR=/path/to/data
# Model cache directory
export NANOCHAT_MODEL_DIR=/path/to/models
# Tokenizer path
export NANOCHAT_TOKENIZER_PATH=/path/to/tokenizer
Error Handling
Common Exceptions
# Model loading errors
try:
    model, tokenizer, meta = load_model("sft", device)
except FileNotFoundError:
    print("Model checkpoint not found")
except torch.cuda.OutOfMemoryError:
    print("Insufficient GPU memory")

# Generation errors
try:
    results = engine.generate_batch(tokens)
except RuntimeError as e:
    print(f"Generation failed: {e}")
Debugging
# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Check model info
print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")
print(f"Model device: {model.get_device()}")
# Validate tokenizer
tokens = tokenizer.encode("test")
decoded = tokenizer.decode(tokens)
assert decoded == "test"
Examples
Basic Text Generation
from nanochat.checkpoint_manager import load_model
from nanochat.engine import Engine
# Load model
model, tokenizer, meta = load_model("sft", "cuda:0", phase="eval")
engine = Engine(model, tokenizer)
# Generate text
prompt = "What is the capital of France?"
tokens = tokenizer.encode(prompt)
results, _ = engine.generate_batch(
    tokens,
    max_tokens=50,
    temperature=0.7
)
response = tokenizer.decode(results[0])
print(response)
Custom Task Evaluation
from tasks.common import Task
class MathTask(Task):
    def get_example(self, index):
        problem = self.problems[index]
        return {
            "messages": [
                {"role": "user", "content": problem["question"]},
                {"role": "assistant", "content": problem["answer"]}
            ]
        }

    def evaluate(self, conversation, completion):
        expected = conversation["messages"][1]["content"]
        return completion.strip() == expected.strip()
# Use in an evaluation loop (evaluate_task is an illustrative helper)
task = MathTask()
accuracy = evaluate_task(model, tokenizer, task)
Web API Client
import requests
import json
def chat_with_api(messages, temperature=0.7):
    response = requests.post(
        "http://localhost:8000/chat/completions",
        json={
            "messages": messages,
            "temperature": temperature,
            "max_tokens": 100
        },
        stream=True
    )
    for line in response.iter_lines():
        if line.startswith(b"data: "):
            data = json.loads(line[6:])
            if "token" in data:
                print(data["token"], end="", flush=True)
            elif data.get("done"):
                break
# Usage
messages = [{"role": "user", "content": "Hello!"}]
chat_with_api(messages)
Sources:
- nanochat/gpt.py (GPT model API)
- nanochat/engine.py (inference engine API)
- nanochat/tokenizer.py (tokenizer API)
- scripts/chat_web.py (web API endpoints)