Compiler toolchain: from AST to machine code

On this page

Skip to content

    Overview

    The most effective way to understand compilers is to build one. Not a production compiler — a minimal one that parses arithmetic expressions, lowers them to LLVM IR, and emits Mach-O binaries on macOS. The goal is not to create a useful language but to understand the pipeline: lexical analysis, parsing, type checking, IR generation, optimization, and code generation.

    Lexical Analysis

    The lexer tokenizes input into a stream of tokens: numbers, operators, parentheses, and identifiers. The key design decision is whether to use a hand-written lexer or a generator. I chose hand-written because the token set is small (6 token types) and a hand-written lexer gives you complete control over error recovery.

    The lexer produces tokens like NUMBER(42), PLUS, MINUS, MULTIPLY, DIVIDE, LPAREN, RPAREN. Each token carries its position in the source for error reporting.

    Parsing

    The parser uses a recursive descent approach with operator precedence climbing. This is simpler than a table-driven parser for a language with only arithmetic expressions, and it gives you natural error recovery through exception handling.

    The grammar:

    expression → term (('+' | '-') term)*
    term       → factor (('*' | '/') factor)*
    factor     → NUMBER | '(' expression ')'

    The precedence climbing algorithm handles the operator precedence naturally: * and / bind tighter than + and -. The parser builds an AST where each node is either a literal, a binary operation, or a function call (added as an extension).

    Type Checking

    The type checker walks the AST and assigns a type to each node. For arithmetic expressions, this is straightforward: NUMBER is i32, binary operations on i32 produce i32, and division by zero is a compile-time error for literal operands.

    The type checker also handles implicit widening: i32 + i64 produces i64. This is important for LLVM IR generation because LLVM has strict type requirements.

    LLVM IR Generation

    The IR generator walks the AST and emits LLVM IR using the LLVM C API. Each AST node becomes a series of LLVM instructions. A binary operation like 3 + 4 becomes:

    %1 = add i32 3, 4
    ret i32 %1

    The IR generator handles variable bindings by maintaining a symbol table that maps variable names to LLVM values. When a variable is referenced, the generator looks up its current LLVM value in the symbol table.

    Code Generation

    LLVM handles the code generation step. The IR is passed through LLVM’s optimization pipeline (opt -O2) and then linked into a Mach-O executable using ld64. The final binary runs natively on macOS without any runtime or interpreter.

    Lessons Learned

    • Operator precedence is harder than it looks. The first implementation had a bug where 3 + 4 * 5 was parsed as (3 + 4) * 5 instead of 3 + (4 * 5). The precedence climbing algorithm fixed it.
    • LLVM’s C API is verbose. Each LLVM instruction requires multiple API calls to create types, values, and instructions. A higher-level wrapper would save significant code.
    • Error recovery in recursive descent parsing is surprisingly easy. Wrap each production in a try-catch and emit a diagnostic on failure.