Preface
This short sample gives you a feel for the book “Building a Large Language Model from Scratch — A Step‑by‑Step Guide Using Python and PyTorch.” The book is a practical, hands‑on journey: you will assemble a compact, readable GPT‑style model from first principles, train it on a small corpus, sample from it, and ship it in tiny but useful ways. Every concept is grounded in runnable code—scripts and notebooks—and every chapter keeps shapes and data flow explicit so you always know what’s happening and why.
What You’ll Learn
- Foundations: tensors, automatic differentiation, and tidy training loops in PyTorch.
- Text pipeline: tokenization, vocabularies, batching, and the conventions used by modern LLMs.
- Attention mechanics: queries, keys, values; masking; multi‑head attention and pre‑norm residual blocks.
- Building a minimal GPT: embeddings, transformer blocks, and a language‑model head in clean, testable code.
- Training and sampling: efficient loops, decoding strategies (greedy, temperature, top‑k/top‑p), and sanity checks.
- Evaluation: perplexity plus small, interpretable text metrics to reason about quality beyond a single number.
- Going further: quality‑of‑life training upgrades, LoRA adapters, scaling intuition, and simple deployment paths.
How to Use This Sample
- Skim the Preface to understand the scope and the style of the project.
- Browse Chapter 1 to see how the narrative and the code interact.
- Clone the companion code repository and run the validators to check your environment.
- If the ideas resonate, dive into the full manuscript: each chapter builds on the last, and the code is intentionally approachable for self‑study.
1. Introduction
Large language models feel like magic—until you build one. By the time you finish this book, the spell will be broken in the best possible way: you’ll understand how a GPT‑style model represents text, attends to context, and learns to write the next token. We will not chase scale. We will chase clarity, then turn that clarity into solid, well‑tested code. Our tiny model has a name: attoLLM—small enough to understand, powerful enough to teach.
At a Glance
You’ll Learn
By the end of this chapter you’ll know what we’ll build (and why we’re building a small model), how the book balances explanation with runnable code, and how the journey unfolds—from setup to sampling to deployment. You’ll also run a tiny sanity check so your environment says hello back.
1.1. Why Build One Yourself?
Most people meet LLMs from the outside: a chat box, a spinning cursor, a paragraph of text that sounds suspiciously confident. That view hides the simple, beautiful mechanics underneath. Building even a small model changes how you read papers, debug pipelines, and reason about trade‑offs. You’ll see which knobs matter, and which are superstition.
Along the way, you’ll also pick up the engineering rituals that make research practical: reproducible setups, clean module boundaries, and scripts you can run twice without fear. The result is not just knowledge of transformers, but a way of working.
1.2. How We’ll Work Together
This manuscript alternates between explanation and action. Short, playful experiments appear inline as console or IPython snippets so you can try ideas immediately. Longer code—anything worth reusing—lives in code/ as importable modules. Figures are stored in figures/; several are generated from scripts so that diagrams evolve with the text instead of drifting out of date. Every example is runnable.
Tip
If you are reading a rendered HTML or PDF, you can still run the code. Clone the repository, set up a Python environment, and follow along. Every snippet is designed to be small enough to type or paste, and every script has a single clear purpose.
1.3. The Journey at a Glance
To ground the story, here is a map of what we’ll build and why.
Let’s walk that diagram:
- Repo Setup & Env Checks. We start with scaffolding that keeps effort focused on ideas, not yak‑shaving. A tiny environment check reports your Python version and whether PyTorch sees a GPU or Apple’s MPS backend. Nothing fancy—just enough to prevent “works on my machine” surprises.
- Data & Tokenization. Raw text is messy and, to a computer, meaningless. Tokenization converts characters into integers, and a vocabulary maps those integers back to words or subwords. You’ll write the simplest thing first, then appreciate why more advanced tokenizers exist.
- Embeddings + Transformer Blocks. Tokens become vectors through learned embeddings. Self‑attention lets each position in the sequence look at others and decide what matters. Feed‑forward layers refine those interactions. We’ll build these parts step by step and keep tensor shapes obvious.
- Training (Cross‑Entropy + AdamW). Training is a conversation between predictions and reality. We compute a loss that measures how wrong the model is about the next token, then use an optimizer to nudge millions of parameters in a better direction.
- Sampling (Temperature, Top‑k/p). Once the model knows “what could come next,” we need to decide “what should come next.” You’ll learn how sampling shapes a model’s personality—from pedantic to poetic.
- Evaluation & Deployment. We’ll measure perplexity and sanity‑check outputs, then package the model for a small CLI or a simple app. Not production at hyperscale—production at human scale.
1.4. Your First Run
Before the theory, a friendly handshake from your environment. The env_check module prints a short report about Python and PyTorch so you know where you stand.
!python -m code.env_check
You’ll see your Python version, OS details, and whether CUDA (NVIDIA GPUs) or MPS (Apple Silicon) is available. A GPU is optional for this book; everything runs on CPU, just more slowly. If PyTorch is missing, the message will tell you so—install it when Chapter 5 begins.
Prefer to prod the system inline? Here’s a tiny IPython console session that mirrors what env_check does:
In [1]: import sys, platform
   ...: print(platform.python_version())
   ...: print(platform.platform())
   ...: print(sys.executable)

In [2]: try:
   ...:     import torch
   ...:     print(torch.__version__)
   ...:     print("CUDA available:", torch.cuda.is_available())
   ...: except Exception as e:
   ...:     print("PyTorch not installed yet:", e)
And because every good story begins with a greeting, here’s a tiny script we’ll keep around as a sanity check:
!python code/hello_world.py
1.5. What “Small” Means—and Why It’s Enough
We are going to build a model that is deliberately modest. Its context window will be in the hundreds of tokens, and its parameter count in the millions, not the billions. It will learn from small, curated corpora, not the entire public internet. That restraint is a feature. It keeps training times reasonable, makes mistakes comprehensible, and ensures that improvements feel earned. More importantly, the logic scales: the code you write here is the same code that runs in larger models—only the numbers differ.
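To make “millions, not billions” concrete, here is a rough back-of-the-envelope parameter count for a hypothetical configuration; the numbers (vocabulary size, width, depth) are illustrative rather than attoLLM's final settings, and the estimate ignores positional embeddings, biases, and layer norms.

# Back-of-the-envelope parameter count for a hypothetical tiny GPT-style config.
# The hyperparameters below are illustrative, not the book's final choices.
vocab_size, d_model, n_layers, d_ff = 256, 256, 4, 1024

embeddings = vocab_size * d_model          # token embedding table
attention = 4 * d_model * d_model          # Q, K, V, and output projections per block
feed_forward = 2 * d_model * d_ff          # two linear layers per block
per_block = attention + feed_forward

total = embeddings + n_layers * per_block
print(f"{total:,} parameters")             # 3,211,264 -- a few million, not billions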
1.6. Why Transformers Won (In One Paragraph)
Recurrent neural networks read text from left to right, one step at a time, and struggle to remember far‑away information. Convolutions can see farther, but only through windows that grow awkwardly large. Self‑attention breaks both limitations. Each position computes a weighted view of every other position using queries, keys, and values. The model learns which tokens should talk, and by how much. Add residual connections, layer normalization, and position‑wise feed‑forward networks, and you have a block that stacks cleanly and trains efficiently.
If this paragraph felt swift, don’t worry. We’ll unpack each idea carefully—with diagrams, tiny numerical examples, and your own tensors as witnesses. Understanding emerges from small, complete pieces you can run and trust.
1.7. A Narrative of the Diagram
Diagrams often act like confident strangers; they look helpful but never introduce themselves. Let’s name the parts in the roadmap you saw earlier.
Repo Setup & Env Checks. This is your lab bench. Clean benches make for better experiments. The repository’s layout separates prose, code, data, and figures. The Makefile builds the book. The .gitignore keeps junk out of version control. You can focus on the science.
Data & Tokenization. The first transformation turns text into integers. We begin with a character‑level tokenizer to make the mechanics vivid, then discuss subword tokenization and why it matters for scale and generalization. You’ll see how the choice of vocabulary affects both training speed and model behavior.
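As a preview of that “simplest thing first,” here is a minimal character-level tokenizer sketch; the vocabulary is just the characters of a toy string, and the full chapter treats this more carefully.

# Character-level tokenization: one integer per unique character.
text = "hello world"
vocab = sorted(set(text))                        # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
stoi = {ch: i for i, ch in enumerate(vocab)}     # string -> integer
itos = {i: ch for ch, i in stoi.items()}         # integer -> string

def encode(s: str) -> list[int]:
    return [stoi[ch] for ch in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)              # [3, 2, 4, 4, 5]
print(decode(ids))      # hello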
Embeddings & Blocks. An embedding layer maps each token ID to a vector in a learned space. Self‑attention heads compare tokens using dot products; masks enforce causality so the model can’t peek at the future. Feed‑forward layers add non‑linear capacity. Residual connections help gradients flow. The block diagram will stop looking like a circuit board and start reading like a story.
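To see those ideas as tensors rather than words, here is a single-head causal attention sketch in PyTorch; the shapes and projections are deliberately minimal compared with the multi-head blocks we build later.

import torch
import torch.nn.functional as F

B, T, d = 1, 5, 8                                   # batch size, sequence length, embedding dim
x = torch.randn(B, T, d)                            # stand-in token embeddings

Wq, Wk, Wv = (torch.nn.Linear(d, d, bias=False) for _ in range(3))
q, k, v = Wq(x), Wk(x), Wv(x)                       # queries, keys, values: (B, T, d) each

scores = q @ k.transpose(-2, -1) / d ** 0.5         # (B, T, T) scaled dot products
mask = torch.tril(torch.ones(T, T)).bool()          # lower triangle: position t sees only <= t
scores = scores.masked_fill(~mask, float("-inf"))   # causal mask: no peeking at the future
weights = F.softmax(scores, dim=-1)                 # each row is a probability distribution
out = weights @ v                                   # (B, T, d) weighted mix of values
print(out.shape)                                    # torch.Size([1, 5, 8])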
Training. Cross‑entropy loss quantifies “how surprised” the model should be by the true next token. AdamW adjusts parameters using estimates of first and second moments while decoupling weight decay for better generalization. We’ll log metrics with tqdm and, later, tensorboard so your progress feels tangible.
Sampling & Evaluation. A trained model doesn’t just predict; it converses with uncertainty. Temperature turns the confidence dial. Top‑k sampling chooses from the k most likely tokens; top‑p chooses from the smallest set whose cumulative probability exceeds p. Perplexity gives a coarse numeric sense of quality; your judgment provides the rest.
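Those knobs translate into a few tensor operations. The logits below are made up, and the top-p variant shown is the common “keep tokens until the running total first exceeds p” rule.

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])       # made-up scores over a 5-token vocabulary

# Temperature: divide logits before softmax; < 1 sharpens, > 1 flattens the distribution.
temperature = 0.8
probs = F.softmax(logits / temperature, dim=-1)

# Top-k: keep the k most likely tokens, renormalize, then sample.
k = 3
topk_probs, topk_ids = probs.topk(k)
next_id = topk_ids[torch.multinomial(topk_probs / topk_probs.sum(), num_samples=1)]

# Top-p (nucleus): keep the smallest prefix whose cumulative probability exceeds p.
p = 0.9
sorted_probs, sorted_ids = probs.sort(descending=True)
keep = sorted_probs.cumsum(dim=-1) - sorted_probs < p    # keep until the total first exceeds p
nucleus = sorted_probs[keep] / sorted_probs[keep].sum()
next_id_p = sorted_ids[keep][torch.multinomial(nucleus, num_samples=1)]

print("top-k pick:", next_id.item(), "| top-p pick:", next_id_p.item())

Perplexity, for its part, is simply the exponential of the mean cross-entropy loss (loss.exp() on the value from the training sketch above).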
Deployment. Packaging the model is about ergonomics: a CLI that accepts a prompt and prints a reply, a minimal web UI to share with a colleague, or an API if you’re feeling ambitious. Good packaging makes experiments social.
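As a taste of that ergonomics, here is a skeletal command-line wrapper; generate is a hypothetical stand-in for the sampling function we write later.

import argparse

def generate(prompt: str, max_new_tokens: int) -> str:
    """Hypothetical stand-in for the real sampling call built later in the book."""
    return prompt + " ..."                            # placeholder continuation

def main() -> None:
    parser = argparse.ArgumentParser(description="Generate text from a prompt.")
    parser.add_argument("prompt", help="text for the model to continue")
    parser.add_argument("--max-new-tokens", type=int, default=50)
    args = parser.parse_args()
    print(generate(args.prompt, args.max_new_tokens))

if __name__ == "__main__":
    main()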
1.8. Practicalities and Expectations
You will get the most from this book if you type some code, make at least one mistake, and fix it. The mistakes are not bugs in the text; they are part of learning to think with tensors. When a shape mismatch happens, you’ll learn to read PyTorch tracebacks like maps. Curiosity beats copy‑paste.
Note
All code is tested on Python 3.10+ and PyTorch 2.x. Apple Silicon is supported via the MPS backend. A GPU is optional; patience is mandatory.
1.9. Where We’re Heading Next
The next chapter ensures you can move quickly in the shell and collaborate with AI assistants without outsourcing your thinking. Then we set up dependencies, explore hardware realities, and dive into PyTorch’s model of computation. From there we climb, one transform at a time, until a small model begins to write.
1.10. Sources Worth Knowing
Vaswani et al. (2017), Attention Is All You Need, introduced the transformer, replacing recurrence with attention. Radford et al. (2018), Improving Language Understanding by Generative Pre-Training, and Radford et al. (2019), Language Models are Unsupervised Multitask Learners, demonstrated how generative pretraining scales. Brown et al. (2020), Language Models are Few-Shot Learners, mapped the frontier with GPT‑3 and few‑shot prompts. These papers are not prerequisites, but they are good companions.
1.11. Appendix: Script Listings
Below are the small scripts we used in this chapter. They are included verbatim so you can read them inline; the same files live under code/.
1.11.1. code/hello_world.py
def main() -> None:
    print("Hello, LLM world!")


if __name__ == "__main__":
    main()
1.11.2. code/env_check.py
"""Minimal environment and device sanity check.
Run with: python -m code.env_check
"""
from __future__ import annotations
import os
import platform
import sys
def main() -> None:
print("== Environment ==")
print("Python:", platform.python_version())
print("Platform:", platform.platform())
print("Executable:", sys.executable)
print("CWD:", os.getcwd())
try:
import torch # type: ignore
print("\n== PyTorch ==")
print("torch:", torch.__version__)
cuda = torch.cuda.is_available()
mps = getattr(torch.backends, "mps", None)
print("CUDA available:", cuda)
if cuda:
print("CUDA device count:", torch.cuda.device_count())
if torch.cuda.device_count() > 0:
print("CUDA device 0:", torch.cuda.get_device_name(0))
print("MPS available:", bool(mps and torch.backends.mps.is_available()))
except Exception as e: # pragma: no cover - diagnostics only
print("\nPyTorch not installed or not importable:", e)
if __name__ == "__main__":
main()