Local LLM inference pipeline on M-series Macs

    Overview

    Running a 70B-parameter model on a MacBook Pro with 36GB of unified memory is not something you can just pip install and expect to work. It requires understanding the memory layout, the quantization tradeoffs, and the pipeline orchestration that keeps everything from OOMing mid-generation.

    This project documents the full stack I built to run Llama 3.1 70B quantized on M2/M3 Macs — from GGUF conversion through vRAM-aware batching, prompt routing, and output post-processing.

    The Memory Problem

    Unified memory on Apple Silicon is a double-edged sword. It gives you a single address space for CPU and GPU, which is great for data sharing. It also means the entire model weights, KV cache, and activation tensors all compete for the same pool. A 70B model in FP16 needs ~140GB — nowhere near what a consumer Mac offers.
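    The ~140GB figure is just parameter count times bytes per weight. A minimal sketch of that arithmetic, where the function name and the decimal-gigabyte convention are my own choices for illustration and only the weights are counted (KV cache and activations come on top):

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9  # nominal parameter count of a "70B" model

fp16_gb = weight_footprint_gb(N_PARAMS, 16)  # 2 bytes per weight
int8_gb = weight_footprint_gb(N_PARAMS, 8)   # 1 byte per weight

print(f"FP16: {fp16_gb:.0f} GB, 8-bit: {int8_gb:.0f} GB")
```

    Even at one byte per weight the model is roughly twice the size of a 36GB machine, which is why sub-8-bit formats are the only realistic option here.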

    The solution is quantization. GGUF format with Q4_K_M quantization brings 70B down to ~39GB. That is still slightly more than the 36GB of unified memory, before counting the KV cache, the OS, or anything else running on the machine. The model therefore cannot be fully resident, and the pipeline has to be careful about which layers are left on disk and when to evict cached prompts.
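    The offload decision can be framed as a budget: the memory left after everything else takes its share, divided by the per-layer weight size. The sketch below is an illustration, not the pipeline's actual code. It assumes all 80 transformer blocks of Llama 3.1 70B are the same size (ignoring embeddings and the output head), and the 10GB reservation for the OS, KV cache, and activations is a hypothetical number.

```python
import math


def resident_layers(weights_gb: float, n_layers: int,
                    total_mem_gb: float, reserved_gb: float) -> int:
    """How many equally sized layers fit in memory after reservations."""
    per_layer_gb = weights_gb / n_layers
    budget_gb = max(total_mem_gb - reserved_gb, 0.0)
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


WEIGHTS_GB = 39.0    # Q4_K_M size quoted in the text
N_LAYERS = 80        # transformer blocks in Llama 3.1 70B
TOTAL_MEM_GB = 36.0  # unified memory on the machine
RESERVED_GB = 10.0   # hypothetical: OS, KV cache, activations

in_memory = resident_layers(WEIGHTS_GB, N_LAYERS, TOTAL_MEM_GB, RESERVED_GB)
on_disk = N_LAYERS - in_memory

print(f"{in_memory} layers resident, {on_disk} left memory-mapped on disk")
```

    Raising the reservation (for example, to hold a longer context) directly lowers the number of resident layers, which is the tradeoff the rest of the pipeline has to manage.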

    Architecture

    The pipeline has four stages:

    1. Weight loading — GGUF files are loaded with llama.cpp’s memory mapping. Layers are split across GPU and CPU based on available vRAM, with the GPU getting the attention and feed-forward layers (the compute-heavy ones) and the CPU handling normalization and gating.

    2. Prompt routing — Incoming prompts are classified by type (code generation, question answering, summarization) and routed through different system prompt templates. Each template has a different context budget, which affects how much of the KV cache is reserved upfront.

    3. Generation loop — The core inference loop uses speculative decoding when a smaller draft model is available, falling back to standard autoregressive generation. KV cache eviction runs on a sliding window basis to prevent memory pressure from accumulating over long sessions.

    4. Output post-processing — Generated text is parsed for structured outputs (JSON, code blocks, lists). Malformed outputs trigger a retry with stricter formatting constraints in the prompt.
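    Stages 2 and 4 can be sketched together as a router plus a retry loop. Everything here is hypothetical and for illustration only: the template texts, the context budgets, the keyword classifier, and the `generate(system_prompt, prompt, context_budget)` callback that stands in for the real inference call.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    system_prompt: str
    context_budget: int  # tokens of KV cache reserved up front


# Hypothetical templates and budgets, one per prompt type.
TEMPLATES = {
    "code": Template("You are a careful programmer.", 8192),
    "summarize": Template("Summarize the text.", 4096),
    "qa": Template("Answer concisely.", 2048),
}


def classify(prompt: str) -> str:
    """Toy keyword classifier; a real router could use a small model."""
    text = prompt.lower()
    if "function" in text or "write code" in text:
        return "code"
    if "summarize" in text or "tl;dr" in text:
        return "summarize"
    return "qa"


def generate_json(prompt, generate, max_retries=2):
    """Route the prompt, then retry with a stricter system prompt
    whenever the output is not valid JSON."""
    template = TEMPLATES[classify(prompt)]
    system = template.system_prompt
    for _ in range(max_retries + 1):
        raw = generate(system, prompt, template.context_budget)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Tighten the formatting constraint for the next attempt.
            system = template.system_prompt + " Reply with valid JSON only, no prose."
    raise ValueError("model never produced valid JSON")


# Stand-in for the model: only behaves once the strict instruction is present.
def fake_generate(system, prompt, context_budget):
    return '{"ok": true}' if "JSON only" in system else "Sure! Here you go"


result = generate_json("What is unified memory?", fake_generate)
print(result)
```

    Capping the retries matters on this hardware: at a few tokens per second, every failed attempt is expensive, so a hard limit is safer than looping until the output parses.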

    Results

    On an M2 Max with 36GB:

    • 70B Q4_K_M: ~3.2 tokens/sec sustained, ~2.8 tokens/sec under load
    • 70B Q5_K_M: ~2.4 tokens/sec, runs only with aggressive KV cache eviction
    • 13B Q5_K_M: ~18 tokens/sec, headroom for larger context windows

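    The bottleneck is always memory bandwidth, not compute. Unified memory on the M2 Max delivers ~400GB/s, and every generated token has to stream the model weights through it once. Bandwidth divided by model size is therefore the throughput ceiling, whatever the quantization strategy.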
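    A rough way to put a number on that ceiling, using only the figures above (~400GB/s and ~39GB for Q4_K_M). This is a simplified model: it assumes every weight is read exactly once per generated token from memory, and it ignores KV cache reads and any layers paged in from disk.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/sec if each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


BANDWIDTH_GB_S = 400.0  # memory bandwidth quoted in the text
WEIGHTS_Q4_GB = 39.0    # 70B Q4_K_M size quoted in the text
MEASURED_TPS = 3.2      # sustained rate from the results above

ceiling_q4 = decode_ceiling_tps(BANDWIDTH_GB_S, WEIGHTS_Q4_GB)
efficiency = MEASURED_TPS / ceiling_q4

print(f"ceiling: {ceiling_q4:.1f} tok/s, measured: {efficiency:.0%} of ceiling")
```

    Under these assumptions the ceiling is about 10 tokens/sec, so the measured rate sits at roughly a third of it. The gap is what you would expect when part of the model does not stay resident in memory.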

    Lessons Learned

    • Quantization is not free. Dropping from Q5 to Q4 on 70B shows measurable quality degradation on code generation tasks. The gap is smaller on natural language, but still present.
    • Speculative decoding only helps when the draft model is fast enough not to become the new bottleneck. On a Mac, the draft model runs on the same chip and competes for the same memory bandwidth, so the speedup is marginal.
    • The KV cache is the silent killer. It grows linearly with context length and batch size. Any pipeline that handles long conversations needs an eviction strategy baked in from day one.
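    To make the KV cache point concrete, here is a sketch of how it scales. The shape numbers are the published Llama 3.1 70B configuration (80 layers, 8 KV heads with grouped-query attention, head dimension 128). The FP16 cache and the `SlidingWindowCache` class are assumptions for illustration: the class tracks token ids only and stands in for real eviction, not the pipeline's actual implementation.

```python
from collections import deque


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache added per token (factor 2 = keys and values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


PER_TOKEN = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)


def kv_cache_gib(context_len: int, batch_size: int = 1) -> float:
    """Cache size grows linearly in both context length and batch size."""
    return PER_TOKEN * context_len * batch_size / 2**30


class SlidingWindowCache:
    """Keeps only the most recent max_tokens entries."""

    def __init__(self, max_tokens: int):
        self.tokens = deque(maxlen=max_tokens)

    def append(self, token_id: int) -> None:
        # A full deque silently drops its oldest entry on append.
        self.tokens.append(token_id)


print(f"{PER_TOKEN / 1024:.0f} KiB per token")
print(f"8K context:   {kv_cache_gib(8192):.1f} GiB")
print(f"128K context: {kv_cache_gib(131072):.1f} GiB")

window = SlidingWindowCache(max_tokens=4)
for token_id in range(6):
    window.append(token_id)
print(list(window.tokens))
```

    At 320 KiB per token, an 8K context costs 2.5 GiB and the model's full 128K context would cost 40 GiB, more than the whole machine. That is why a fixed window (or some other eviction policy) has to be part of the design rather than an afterthought.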