Tokenizer

The NanoChat tokenizer implements a GPT-4 style BPE (Byte Pair Encoding) tokenizer with special tokens for conversation structure and tool use. This document covers the tokenizer architecture, training process, and conversation rendering capabilities.

Overview

Source: nanochat/tokenizer.py:1-10

python

"""
BPE Tokenizer in the style of GPT-4.

Two implementations are available:
1) HuggingFace Tokenizer that can do both training and inference but is really confusing
2) Our own RustBPE Tokenizer for training and tiktoken for efficient inference
"""

NanoChat provides two tokenizer implementations:

HuggingFaceTokenizer: Full-featured implementation using HuggingFace tokenizers
RustBPETokenizer: Hybrid approach using rustbpe for training and tiktoken for efficient inference

Special Tokens

The tokenizer includes special tokens for conversation structure and tool use:

Source: nanochat/tokenizer.py:11-21

python

SPECIAL_TOKENS = [
    # every document begins with the Beginning of Sequence (BOS) token that delimits documents
    "<|bos|>",
    # tokens below are only used during finetuning to render Conversations into token ids
    "<|user_start|>", # user messages
    "<|user_end|>",
    "<|assistant_start|>", # assistant messages
    "<|assistant_end|>",
    "<|python_start|>", # assistant invokes python REPL tool
    "<|python_end|>",
    "<|output_start|>", # python REPL outputs back to assistant
    "<|output_end|>",
]

These tokens serve specific purposes:

<|bos|>: Beginning of sequence, delimits document boundaries
<|user_start|> / <|user_end|>: Wrap user messages in conversations
<|assistant_start|> / <|assistant_end|>: Wrap assistant messages
<|python_start|> / <|python_end|>: Delimit tool invocations
<|output_start|> / <|output_end|>: Delimit tool outputs

Text Splitting Pattern

The tokenizer uses a modified GPT-4 regex pattern for text splitting:

Source: nanochat/tokenizer.py:23-27

python

# NOTE: this split pattern deviates from GPT-4 in that we use \p{N}{1,2} instead of \p{N}{1,3}
# I did this because I didn't want to "waste" too many tokens on numbers for smaller vocab sizes.
# I haven't validated that this is actually a good idea, TODO.
SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

The key modification is using \p{N}{1,2} instead of \p{N}{1,3} for number tokenization, optimizing for smaller vocabulary sizes.

HuggingFace Tokenizer Implementation

Initialization and Training

Source: nanochat/tokenizer.py:55-85

python

@classmethod
def train_from_iterator(cls, text_iterator, vocab_size):
    # Configure the HuggingFace Tokenizer
    tokenizer = HFTokenizer(BPE(
        byte_fallback=True, # needed!
        unk_token=None,
        fuse_unk=False,
    ))
    # Normalizer: None
    tokenizer.normalizer = None
    
    # Pre-tokenizer: GPT-4 style
    gpt4_split_regex = Regex(SPLIT_PATTERN)
    tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
        pre_tokenizers.Split(pattern=gpt4_split_regex, behavior="isolated", invert=False),
        pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False)
    ])
    
    # Decoder: ByteLevel (pairs with ByteLevel pre-tokenizer)
    tokenizer.decoder = decoders.ByteLevel()
    
    # Post-processor: None
    tokenizer.post_processor = None
    
    # Trainer: BPE
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        show_progress=True,
        min_frequency=0, # no minimum frequency
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        special_tokens=SPECIAL_TOKENS,
    )
    
    # Kick off the training
    tokenizer.train_from_iterator(text_iterator, trainer)
    return cls(tokenizer)

Key Configuration Elements

Byte Fallback: Ensures any byte sequence can be tokenized
No Normalization: Preserves exact text as provided
Regex Pre-tokenization: GPT-4 style text splitting
ByteLevel Encoding: Handles all possible byte sequences
BPE Training: Learns subword units from training data

RustBPE Tokenizer Implementation

Training with RustBPE

Source: nanochat/tokenizer.py:155-180

python

@classmethod
def train_from_iterator(cls, text_iterator, vocab_size):
    # 1) train using rustbpe
    tokenizer = rustbpe.Tokenizer()
    # the special tokens are inserted later in __init__, we don't train them here
    vocab_size_no_special = vocab_size - len(SPECIAL_TOKENS)
    assert vocab_size_no_special >= 256, f"vocab_size_no_special must be at least 256, got {vocab_size_no_special}"
    tokenizer.train_from_iterator(text_iterator, vocab_size_no_special, pattern=SPLIT_PATTERN)
    
    # 2) construct the associated tiktoken encoding for inference
    pattern = tokenizer.get_pattern()
    mergeable_ranks_list = tokenizer.get_mergeable_ranks()
    mergeable_ranks = {bytes(k): v for k, v in mergeable_ranks_list}
    tokens_offset = len(mergeable_ranks)
    special_tokens = {name: tokens_offset + i for i, name in enumerate(SPECIAL_TOKENS)}
    enc = tiktoken.Encoding(
        name="rustbpe",
        pat_str=pattern,
        mergeable_ranks=mergeable_ranks, # dict[bytes, int] (token bytes -> merge priority rank)
        special_tokens=special_tokens, # dict[str, int] (special token name -> token id)
    )
    return cls(enc, "<|bos|>")

Inference with tiktoken

Source: nanochat/tokenizer.py:210-235

python

def encode(self, text, prepend=None, append=None, num_threads=8):
    # text can be either a string or a list of strings

    if prepend is not None:
        prepend_id = prepend if isinstance(prepend, int) else self.encode_special(prepend)
    if append is not None:
        append_id = append if isinstance(append, int) else self.encode_special(append)

    if isinstance(text, str):
        ids = self.enc.encode_ordinary(text)
        if prepend is not None:
            ids.insert(0, prepend_id)
        if append is not None:
            ids.append(append_id)
    elif isinstance(text, list):
        ids = self.enc.encode_ordinary_batch(text, num_threads=num_threads)
        if prepend is not None:
            for ids_row in ids:
                ids_row.insert(0, prepend_id)
        if append is not None:
            for ids_row in ids:
                ids_row.append(append_id)
    else:
        raise ValueError(f"Invalid input type: {type(text)}")

    return ids

Conversation Rendering

The tokenizer's most sophisticated feature is conversation rendering, which converts chat conversations to token sequences with proper masking for training.

Main Rendering Function

Source: nanochat/tokenizer.py:265-295

python

def render_conversation(self, conversation, max_tokens=2048):
    """
    Tokenize a single Chat conversation (which we call a "doc" or "document" here).
    Returns:
    - ids: list[int] is a list of token ids of this rendered conversation
    - mask: list[int] of same length, mask = 1 for tokens that the Assistant is expected to train on.
    """
    # ids, masks that we will return and a helper function to help build them up.
    ids, mask = [], []
    def add_tokens(token_ids, mask_val):
        if isinstance(token_ids, int):
            token_ids = [token_ids]
        ids.extend(token_ids)
        mask.extend([mask_val] * len(token_ids))

    # sometimes the first message is a system message...
    # => just merge it with the second (user) message
    if conversation["messages"][0]["role"] == "system":
        conversation = copy.deepcopy(conversation) # avoid mutating the original
        messages = conversation["messages"]
        assert messages[1]["role"] == "user", "System message must be followed by a user message"
        messages[1]["content"] = messages[0]["content"] + "\n\n" + messages[1]["content"]
        messages = messages[1:]
    else:
        messages = conversation["messages"]

Training Mask Generation

The mask indicates which tokens the assistant should be trained on:

mask = 0: Tokens the assistant doesn't need to predict (user messages, special tokens)
mask = 1: Tokens the assistant should learn to generate (assistant responses)

Source: nanochat/tokenizer.py:305-340

python

# now we can tokenize the conversation
add_tokens(bos, 0)
for i, message in enumerate(messages):

    # some sanity checking here around assumptions, to prevent footguns
    must_be_from = "user" if i % 2 == 0 else "assistant"
    assert message["role"] == must_be_from, f"Message {i} is from {message['role']} but should be from {must_be_from}"

    # content can be either a simple string or a list of parts (e.g. containing tool calls)
    content = message["content"]

    if message["role"] == "user":
        assert isinstance(content, str), "User messages are simply expected to be strings"
        value_ids = self.encode(content)
        add_tokens(user_start, 0)
        add_tokens(value_ids, 0)
        add_tokens(user_end, 0)
    elif message["role"] == "assistant":
        add_tokens(assistant_start, 0)
        if isinstance(content, str):
            # simple string => simply add the tokens
            value_ids = self.encode(content)
            add_tokens(value_ids, 1)  # mask = 1 for assistant content
        elif isinstance(content, list):
            for part in content:
                value_ids = self.encode(part["text"])
                if part["type"] == "text":
                    add_tokens(value_ids, 1)  # supervised text generation
                elif part["type"] == "python":
                    # python tool call => add tokens inside special markers
                    add_tokens(python_start, 1)
                    add_tokens(value_ids, 1)
                    add_tokens(python_end, 1)
                elif part["type"] == "python_output":
                    # python output => not supervised (comes from Python at test time)
                    add_tokens(output_start, 0)
                    add_tokens(value_ids, 0)
                    add_tokens(output_end, 0)
        add_tokens(assistant_end, 1)

Tool Use Support

The tokenizer handles complex assistant messages with tool invocations:

Source: nanochat/tokenizer.py:325-340

python

elif isinstance(content, list):
    for part in content:
        value_ids = self.encode(part["text"])
        if part["type"] == "text":
            # string part => simply add the tokens
            add_tokens(value_ids, 1)
        elif part["type"] == "python":
            # python tool call => add the tokens inside <|python_start|> and <|python_end|>
            add_tokens(python_start, 1)
            add_tokens(value_ids, 1)
            add_tokens(python_end, 1)
        elif part["type"] == "python_output":
            # python output => add the tokens inside <|output_start|> and <|output_end|>
            # none of these tokens are supervised because the tokens come from Python at test time
            add_tokens(output_start, 0)
            add_tokens(value_ids, 0)
            add_tokens(output_end, 0)
        else:
            raise ValueError(f"Unknown part type: {part['type']}")

Debugging and Visualization

Tokenization Visualization

Source: nanochat/tokenizer.py:350-365

python

def visualize_tokenization(self, ids, mask, with_token_id=False):
    """Small helper function useful in debugging: visualize the tokenization of render_conversation"""
    RED = '\033[91m'
    GREEN = '\033[92m'
    RESET = '\033[0m'
    GRAY = '\033[90m'
    tokens = []
    for i, (token_id, mask_val) in enumerate(zip(ids, mask)):
        token_str = self.decode([token_id])
        color = GREEN if mask_val == 1 else RED
        tokens.append(f"{color}{token_str}{RESET}")
        if with_token_id:
            tokens.append(f"{GRAY}({token_id}){RESET}")
    return '|'.join(tokens)

This function provides color-coded visualization:

Green: Tokens the assistant is trained on (mask=1)
Red: Tokens the assistant doesn't train on (mask=0)
Gray: Token IDs (optional)

Reinforcement Learning Support

Source: nanochat/tokenizer.py:367-385

python

def render_for_completion(self, conversation):
    """
    Used during Reinforcement Learning. In that setting, we want to
    render the conversation priming the Assistant for a completion.
    Unlike the Chat SFT case, we don't need to return the mask.
    """
    # We have some surgery to do: we need to pop the last message (of the Assistant)
    conversation = copy.deepcopy(conversation) # avoid mutating the original
    messages = conversation["messages"]
    assert messages[-1]["role"] == "assistant", "Last message must be from the Assistant"
    messages.pop() # remove the last message (of the Assistant) inplace

    # Now tokenize the conversation
    ids, mask = self.render_conversation(conversation)

    # Finally, to prime the Assistant for a completion, append the Assistant start token
    assistant_start = self.encode_special("<|assistant_start|>")
    ids.append(assistant_start)
    return ids

Utility Functions

Special Token Handling

Source: nanochat/tokenizer.py:200-205

python

@lru_cache(maxsize=32)
def encode_special(self, text):
    return self.enc.encode_single_token(text)

def get_bos_token_id(self):
    return self.bos_token_id

BOS Token Detection

Source: nanochat/tokenizer.py:130-140

python

def get_bos_token_id(self):
    # Different HuggingFace models use different BOS tokens and there is little consistency
    # 1) attempt to find a <|bos|> token
    bos = self.encode_special("<|bos|>")
    # 2) if that fails, attempt to find a <|endoftext|> token (e.g. GPT-2 models)
    if bos is None:
        bos = self.encode_special("<|endoftext|>")
    # 3) if these fail, it's better to crash than to silently return None
    assert bos is not None, "Failed to find BOS token in tokenizer"
    return bos

Integration Examples

CLI Chat Integration

Source: scripts/chat_cli.py:71-85

python

# Add User message to the conversation
conversation_tokens.append(user_start)
conversation_tokens.extend(tokenizer.encode(user_input))
conversation_tokens.append(user_end)

# Kick off the assistant
conversation_tokens.append(assistant_start)
generate_kwargs = {
    "num_samples": 1,
    "max_tokens": 256,
    "temperature": args.temperature,
    "top_k": args.top_k,
}
response_tokens = []
print("\nAssistant: ", end="", flush=True)
with autocast_ctx:
    for token_column, token_masks in engine.generate(conversation_tokens, **generate_kwargs):
        token = token_column[0]
        response_tokens.append(token)
        token_text = tokenizer.decode([token])
        print(token_text, end="", flush=True)

Performance Considerations

Efficient Encoding: RustBPE + tiktoken combo optimizes for training and inference
Batch Processing: Supports multi-threaded batch encoding
Caching: Special token encoding is cached with @lru_cache
Memory Management: Conversation copying avoids mutation of original data

The tokenizer provides a complete solution for converting between text and tokens, with sophisticated support for conversation structure, tool use, and training optimization.

Sources:

nanochat/tokenizer.py (complete tokenizer implementation)
Special token definitions and conversation rendering
Integration examples from CLI and web interfaces

# Tokenizer

# Tokenizer

# Overview

# Special Tokens

# Text Splitting Pattern

# HuggingFace Tokenizer Implementation

# Initialization and Training

# Key Configuration Elements

# RustBPE Tokenizer Implementation

# Training with RustBPE

# Inference with tiktoken

# Conversation Rendering

# Main Rendering Function

# Training Mask Generation

# Tool Use Support

# Debugging and Visualization

# Tokenization Visualization

# Reinforcement Learning Support

# Utility Functions

# Special Token Handling

# BOS Token Detection

# Integration Examples

# CLI Chat Integration

# Performance Considerations

Tokenizer

Tokenizer

Overview

Special Tokens

Text Splitting Pattern

HuggingFace Tokenizer Implementation

Initialization and Training

Key Configuration Elements

RustBPE Tokenizer Implementation

Training with RustBPE

Inference with tiktoken

Conversation Rendering

Main Rendering Function

Training Mask Generation

Tool Use Support

Debugging and Visualization

Tokenization Visualization

Reinforcement Learning Support

Utility Functions

Special Token Handling

BOS Token Detection

Integration Examples

CLI Chat Integration

Performance Considerations