Glossary

This glossary defines key terms and concepts used throughout the NanoChat documentation and codebase. Terms are organized alphabetically for easy reference.

A

AdamW

Adaptive Moment Estimation optimizer with decoupled Weight decay. Used in nanochat for the embedding and language modeling head parameters. Provides adaptive per-parameter learning rates while regularizing through weight decay applied directly to the parameters, rather than as L2 regularization mixed into the gradient.

ARC (AI2 Reasoning Challenge)

Multiple-choice science reasoning benchmark dataset. Available in two variants:

  • ARC-Easy: Elementary-level questions
  • ARC-Challenge: Middle and high school-level questions

Autocast

PyTorch's automatic mixed precision context manager that automatically converts operations to lower precision (like bfloat16) for efficiency while maintaining numerical stability.
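
A minimal sketch of wrapping a forward pass in an autocast context (the Linear module and shapes are illustrative stand-ins, not nanochat's model):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for the real model
x = torch.randn(8, 1024, device="cuda")

# Eligible ops inside the context run in bfloat16; precision-sensitive ops stay in float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)
    loss = y.float().pow(2).mean()
```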

Autoregressive Generation

Text generation approach where each token is generated sequentially, conditioning on all previously generated tokens. This is the standard approach for large language models.
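
A schematic greedy decoding loop illustrating the idea (the model call, token shapes, and eos_id are placeholders, not the Engine's actual API):

```python
import torch

def generate(model, tokens, max_new_tokens, eos_id):
    # tokens: (1, T) tensor of prompt token ids
    for _ in range(max_new_tokens):
        logits = model(tokens)                                   # (1, T, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick of the next token
        tokens = torch.cat([tokens, next_id], dim=1)             # condition on everything generated so far
        if next_id.item() == eos_id:
            break
    return tokens
```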

B

Base Pretraining

The initial training phase where the model learns to predict the next token in text sequences from a large corpus of unlabeled text. This creates the foundational language understanding.

BFloat16

16-bit brain floating point format developed by Google. Offers similar dynamic range to float32 but uses half the memory. Widely used in modern ML training for efficiency.

BOS (Beginning of Sequence)

Special token (<|bos|>) that marks the beginning of a document or sequence. Used to delimit document boundaries during tokenization and training.

BPE (Byte Pair Encoding)

Subword tokenization algorithm that starts from individual bytes and iteratively merges the most frequent adjacent pairs into new tokens. Used by nanochat for efficient text representation.

C

Causal Self-Attention

Attention mechanism where each position can only attend to previous positions, enforcing the autoregressive property. Prevents the model from "seeing the future" during training.
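
For illustration, PyTorch's fused attention can enforce this with a causal mask (a sketch with made-up sizes, not nanochat's attention module):

```python
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 6, 32, 128)  # (batch, heads, seq, head_dim), sizes illustrative

# is_causal=True masks future positions, so position t attends only to positions <= t.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```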

CORE Metric

Comprehensive evaluation metric from the DCLM (DataComp for Language Models) paper. Used to assess base model quality before chat fine-tuning.

ChatCORE

CORE-style aggregate metric for evaluating the fine-tuned chat model. Summarizes performance across the conversational benchmarks into a single score.

D

DDP (Distributed Data Parallel)

PyTorch's mechanism for training models across multiple GPUs by synchronizing gradients. Each GPU processes a different subset of the data in parallel.

Device Batch Size

Number of examples processed on a single GPU device per forward pass. Distinguished from total batch size, which is device batch size × number of devices.

E

Embedding

Vector representation of tokens. In nanochat, token embeddings convert discrete tokens into continuous vectors that the transformer can process.

Engine

NanoChat's inference system that handles efficient text generation with KV caching, tool use, and multi-sample generation capabilities.

F

FLOP (Floating Point Operation)

Basic unit of computational work. NanoChat estimates FLOPs per token to measure computational efficiency and predict training costs.
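
A common rule of thumb, not necessarily nanochat's exact accounting, is roughly 6 FLOPs per parameter per token of training (about 2 for the forward pass and 4 for the backward pass):

```python
# Back-of-the-envelope training cost estimate; model and token counts are made up.
n_params = 560e6                   # hypothetical parameter count
n_tokens = 11.2e9                  # hypothetical number of training tokens
flops_per_token = 6 * n_params     # rule-of-thumb estimate
total_flops = flops_per_token * n_tokens
print(f"{total_flops:.2e} total training FLOPs")   # ~3.76e19 with these numbers
```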

Forward Pass

Process of computing model outputs given inputs. During training, followed by backward pass to compute gradients.

G

Gradient Accumulation

Technique to simulate larger batch sizes by accumulating gradients over multiple micro-batches before updating parameters. Useful for memory-constrained environments.
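
A self-contained sketch of the pattern (tiny toy model and data; not nanochat's training loop or hyperparameters):

```python
import torch

model = torch.nn.Linear(32, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [(torch.randn(4, 32), torch.randn(4, 1)) for _ in range(8)]
accum_steps = 4  # micro-batches accumulated per optimizer step

optimizer.zero_grad()
for i, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps  # scale so the summed gradient matches a full batch
    loss.backward()                                                 # gradients add up in .grad across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```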

Grouped Query Attention (GQA)

Optimization where multiple query heads share the same key and value heads. Reduces memory usage during inference while maintaining most of the performance.
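
Conceptually, each key/value head is shared by a group of query heads; one way to realize this is to repeat the KV heads before attention (a sketch with illustrative head counts):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 16, 64)   # 8 query heads: (batch, heads, seq, head_dim)
k = torch.randn(1, 2, 16, 64)   # only 2 key/value heads are stored and cached
v = torch.randn(1, 2, 16, 64)

# Each KV head serves a group of 4 query heads.
k = k.repeat_interleave(8 // 2, dim=1)
v = v.repeat_interleave(8 // 2, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```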

GSM8K (Grade School Math 8K)

Dataset of 8,000 grade school math word problems requiring multi-step reasoning. Used to evaluate mathematical reasoning capabilities.

H

Head Dimension

Size of each attention head. In nanochat, calculated as model dimension divided by number of heads (e.g., 768 / 6 = 128).

HumanEval

Code generation benchmark consisting of Python programming problems. Tests the model's ability to generate correct code from natural language descriptions.

I

Identity Conversations

Synthetic conversations designed to give the model a specific personality and background. Used during mid-training to establish the model's character.

K

KV Cache

Key-Value cache that stores attention keys and values from previous tokens during autoregressive generation. Avoids recomputing these values, significantly speeding up inference.
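
A simplified decode loop showing the idea (random projections and shapes, not the Engine's implementation):

```python
import torch
import torch.nn.functional as F

k_cache = torch.empty(1, 6, 0, 128)   # (batch, heads, cached_len, head_dim); sizes illustrative
v_cache = torch.empty(1, 6, 0, 128)

for _ in range(3):  # three decode steps
    q, k, v = (torch.randn(1, 6, 1, 128) for _ in range(3))   # this step's projections only
    k_cache = torch.cat([k_cache, k], dim=2)                   # append rather than recompute past keys/values
    v_cache = torch.cat([v_cache, v], dim=2)
    out = F.scaled_dot_product_attention(q, k_cache, v_cache)  # attend over the full cache
```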

L

Language Modeling Head (lm_head)

Final linear layer that converts model hidden states to vocabulary logits. Predicts the probability distribution over the next token.

Learning Rate Multiplier (LRM)

Scaling factor applied to base learning rates during training. Typically follows a schedule with warmup, constant phase, and warmdown.
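
A sketch of such a schedule as a multiplier on the base learning rate (the warmup and warmdown fractions here are illustrative, not nanochat's settings); see also the Warmup and Warmdown entries:

```python
def lr_multiplier(step, total_steps, warmup_frac=0.02, warmdown_frac=0.2):
    """Warmup -> constant -> linear warmdown; fractions are illustrative."""
    warmup_steps = int(total_steps * warmup_frac)
    warmdown_steps = int(total_steps * warmdown_frac)
    if step < warmup_steps:
        return step / max(warmup_steps, 1)                        # ramp up from 0 to 1
    if step > total_steps - warmdown_steps:
        return (total_steps - step) / max(warmdown_steps, 1)      # ramp back down to 0
    return 1.0                                                    # constant phase
```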

M

Mid-training

Intermediate training phase between base pretraining and supervised fine-tuning. Introduces conversation structure, special tokens, and tool use capabilities.

MMLU (Massive Multitask Language Understanding)

Benchmark covering 57 academic subjects from elementary to professional level. Tests broad knowledge across humanities, social sciences, STEM, and more.

MFU (Model FLOPs Utilization)

Metric measuring how efficiently the hardware is utilized, calculated as actual FLOPs per second divided by theoretical peak FLOPs per second.
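
A back-of-the-envelope calculation of MFU (all numbers below are made up; the per-token FLOP count uses the rule-of-thumb estimate from the FLOP entry):

```python
flops_per_token = 6 * 560e6        # rule-of-thumb training FLOPs per token for a hypothetical model
tokens_per_second = 1.0e6          # measured training throughput (illustrative)
peak_flops = 8 * 989e12            # e.g. 8 GPUs at ~989 TFLOPS bfloat16 each

mfu = flops_per_token * tokens_per_second / peak_flops
print(f"MFU: {mfu:.1%}")           # ~42% with these numbers
```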

Muon

Optimizer used in nanochat for the transformer's 2D matrix parameters. Orthogonalizes momentum-based updates via a Newton-Schulz iteration, providing better convergence on these parameters than plain first-order optimizers like Adam.

P

Parquet

Columnar storage format used for nanochat's pretraining dataset. Provides efficient compression and fast access patterns for streaming data.

Pre-normalization

Applying layer normalization before attention and MLP layers rather than after. Generally provides better training stability than post-normalization.

Q

QK Normalization

Normalization applied to query and key vectors before attention computation. Improves training stability and performance.

R

ReLU² (ReLU Squared)

Activation function where ReLU(x) is squared: max(0, x)². Used in nanochat's MLP layers as an alternative to GELU or Swish activations.
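
The definition in code (a one-liner, equivalent to max(0, x)² elementwise):

```python
import torch
import torch.nn.functional as F

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    return F.relu(x).square()   # max(0, x) ** 2, applied elementwise
```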

Reinforcement Learning (RL)

Training phase that optimizes the model's responses using reward signals. Currently used in nanochat for improving mathematical reasoning on GSM8K.

RMSNorm (Root Mean Square Normalization)

Normalization technique that scales inputs by their RMS value. Simpler than LayerNorm and doesn't require learnable parameters in nanochat's implementation.
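
A parameter-free sketch matching the definition above (the epsilon value is illustrative):

```python
import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Divide each vector by its root-mean-square; no learnable scale or bias.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
```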

RoPE (Rotary Position Embedding)

Position encoding method that rotates query and key vectors based on their position. Provides better length generalization than learned position embeddings.

Row Group

Unit of organization in parquet files, typically containing 1024 documents in nanochat's dataset. Used for efficient distributed data loading.

S

SFT (Supervised Fine-Tuning)

Training phase that fine-tunes the model on curated instruction-following examples. Teaches the model to follow user instructions and engage in helpful dialogue.

Softcap

Technique that smoothly limits the range of logits using the tanh function. Prevents extreme values that could cause numerical instability.
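
In code, softcapping squashes logits into a bounded range (the cap value below is illustrative):

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # tanh maps to (-1, 1), so the result stays within (-cap, cap) while remaining smooth.
    return cap * torch.tanh(logits / cap)
```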

Special Tokens

Dedicated tokens with specific meanings:

  • <|bos|>: Beginning of sequence
  • <|user_start|>, <|user_end|>: User message boundaries
  • <|assistant_start|>, <|assistant_end|>: Assistant message boundaries
  • <|python_start|>, <|python_end|>: Tool invocation boundaries
  • <|output_start|>, <|output_end|>: Tool output boundaries

Speedrun

Complete $100 training pipeline script that trains a nanochat model from scratch in approximately 4 hours on 8x H100 GPUs.

T

Task

Abstract base class representing an evaluation dataset. Provides consistent interface for different benchmarks with support for both categorical and generative evaluation.

TaskMixture

Composition utility that combines multiple tasks with deterministic shuffling. Used during training to mix different types of data.

Temperature

Sampling parameter that controls randomness in generation. Lower values (closer to 0) make output more deterministic, higher values increase creativity.
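
A minimal sampling step showing where temperature enters (vocabulary size and temperature are arbitrary here):

```python
import torch

logits = torch.randn(65536)     # illustrative logits over the vocabulary
temperature = 0.8               # < 1.0 sharpens the distribution, > 1.0 flattens it

probs = torch.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```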

TikToken

Fast BPE tokenizer library from OpenAI. nanochat trains its tokenizer with rustbpe, then uses tiktoken for efficient encoding and decoding at inference time.

Token Buffer

Double-ended queue (deque) used in the data pipeline to stream tokens efficiently from disk to the GPU. Provides O(1) append and pop operations.

Tool Use

Capability that allows the model to invoke external tools (like a Python calculator) during generation. Implemented through special tokens and forced token injection.

Top-k Sampling

Sampling strategy that restricts generation to the k most probable tokens at each step. Balances quality and diversity in generated text.
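
A sketch of top-k sampling over a logits vector (k and temperature are arbitrary defaults, not nanochat's):

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> torch.Tensor:
    # Keep only the k most probable tokens, renormalize, then sample among them.
    topk_logits, topk_ids = torch.topk(logits, k)
    probs = torch.softmax(topk_logits / temperature, dim=-1)
    return topk_ids[torch.multinomial(probs, num_samples=1)]
```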

U

Untied Weights

Architecture choice where input embeddings and output projection use separate weight matrices. Contrasts with tied weights where they share the same matrix.

uv

Fast Python package manager used by nanochat for dependency management and virtual environment creation.

V

Validation Loss

Loss computed on held-out validation data to monitor training progress and detect overfitting.

Vocab Size

Size of the tokenizer vocabulary, typically padded to multiples of 64 for computational efficiency in distributed training.

W

Wandb (Weights & Biases)

Experiment tracking platform used for monitoring training metrics, visualizing progress, and comparing different runs.

Warmdown

Period at the end of training where learning rate is gradually decreased to improve final model quality.

Warmup

Period at the beginning of training where learning rate is gradually increased from zero to the target value. Helps stabilize early training.


This glossary covers the core terminology used throughout nanochat. For more detailed information about specific concepts, refer to the relevant documentation sections linked throughout this wiki.
