Local LLM inference pipeline on M-series Macs

Overview

Running a 70B parameter model on a MacBook Pro with 36GB unified memory is not something you can just pip install and expect to work. It requires understanding the memory layout, the quantization tradeoffs, and the pipeline orchestration that keeps everything from OOMing mid-generation.

This project documents the full stack I built to run a quantized Llama 3.1 70B on M2/M3 Macs, from GGUF conversion through memory-aware batching, prompt routing, and output post-processing.

The Memory Problem

Unified memory on Apple Silicon is a double-edged sword. It gives you a single address space for CPU and GPU, which is great for data sharing. It also means the model weights, KV cache, and activation tensors all compete for the same pool. A 70B model in FP16 needs ~140GB for the weights alone (70B parameters × 2 bytes), far beyond what a 36GB laptop offers.

The solution is quantization. GGUF format with Q4_K_M quantization brings 70B down to ~39GB. That is still a few gigabytes more than the 36GB of physical memory, so the weights cannot all stay resident at once, and almost nothing is left for the KV cache during generation. The pipeline has to be careful about when to offload layers to disk and when to evict cached prompts.
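The arithmetic behind these figures can be sketched in a few lines. This is a back-of-the-envelope estimate, not a measurement: the 4.5 bits per weight is an assumed effective rate chosen to reproduce the ~39GB figure above, and real GGUF files mix quantization types per tensor, so actual file sizes will differ.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in bytes."""
    return n_params * bits_per_weight / 8


GB = 1e9                 # decimal gigabytes, matching the figures in the text
N_PARAMS = 70e9          # nominal parameter count
PHYSICAL_MEMORY_GB = 36  # the machine discussed in this write-up

fp16_gb = weight_bytes(N_PARAMS, 16) / GB
# Assumption: ~4.5 effective bits per weight, picked to match the ~39GB figure.
q4_gb = weight_bytes(N_PARAMS, 4.5) / GB

print(f"FP16 weights: {fp16_gb:.0f} GB")
print(f"Q4 weights:   {q4_gb:.1f} GB")
print(f"Q4 weights minus physical memory: {q4_gb - PHYSICAL_MEMORY_GB:+.1f} GB")
```

The estimate covers weights only; the KV cache and activations come on top of it.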

Architecture

The pipeline has four stages:

1. Weight loading: GGUF files are loaded with llama.cpp’s memory mapping. Layers are split across GPU and CPU based on available GPU memory, with the GPU getting the attention and feed-forward layers (the compute-heavy ones) and the CPU handling normalization and gating.
2. Prompt routing: Incoming prompts are classified by type (code generation, question answering, summarization) and routed through different system prompt templates. Each template has a different context budget, which determines how much of the KV cache is reserved upfront.
3. Generation loop: The core inference loop uses speculative decoding when a smaller draft model is available and falls back to standard autoregressive generation otherwise. KV cache eviction runs on a sliding-window basis so that memory pressure does not accumulate over long sessions.
4. Output post-processing: Generated text is parsed for structured outputs (JSON, code blocks, lists). Malformed outputs trigger a retry with stricter formatting constraints in the prompt.
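To make the prompt routing and output post-processing stages concrete, here is a minimal sketch of how they could fit together. Everything in it is hypothetical: the keyword classifier, the template texts, the context budgets, and the `generate` callable are stand-ins for illustration, not the pipeline's actual code.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Template:
    system_prompt: str
    context_budget: int  # tokens of KV cache to reserve upfront


# Hypothetical templates and budgets, one per prompt type.
TEMPLATES = {
    "code": Template("You are a careful programmer.", 8192),
    "qa": Template("Answer concisely.", 2048),
    "summarize": Template("Summarize the text.", 4096),
}

STRICT_SUFFIX = " Respond with valid JSON only, no prose."


def classify(prompt: str) -> str:
    """Toy keyword classifier standing in for a real prompt classifier."""
    lowered = prompt.lower()
    if any(k in lowered for k in ("def ", "function", "code", "bug")):
        return "code"
    if any(k in lowered for k in ("summarize", "summary", "tl;dr")):
        return "summarize"
    return "qa"


def route(prompt: str) -> Template:
    """Pick the template (and with it the context budget) for a prompt."""
    return TEMPLATES[classify(prompt)]


def generate_json(prompt: str,
                  generate: Callable[[str, str], str],
                  max_retries: int = 2):
    """Call generate(system_prompt, prompt); retry with stricter
    formatting instructions whenever the output is not valid JSON."""
    template = route(prompt)
    system = template.system_prompt
    for _ in range(max_retries + 1):
        raw = generate(system, prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Tighten the formatting constraint for the next attempt.
            system = template.system_prompt + STRICT_SUFFIX
    raise ValueError("model never produced valid JSON")


# Usage with a fake model that only behaves under the strict prompt.
calls = []


def fake_generate(system: str, prompt: str) -> str:
    calls.append(system)
    if "valid JSON only" in system:
        return '{"answer": 42}'
    return "Sure! {answer: 42}"


result = generate_json("What is six times seven?", fake_generate)
print(result, "after", len(calls), "calls")
```

Passing the model call in as a plain callable keeps the routing and retry logic testable without loading any weights.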

Results

On an M2 Max with 36GB:

70B Q4_K_M: ~3.2 tokens/sec sustained, ~2.8 tokens/sec under load
70B Q5_K_M: ~2.4 tokens/sec, fits only with aggressive KV cache eviction
13B Q5_K_M: ~18 tokens/sec, headroom for larger context windows

The bottleneck is always memory bandwidth, not compute. Unified memory on the M2 Max delivers ~400GB/s, and because each generated token requires reading the model weights from memory, that bandwidth sets the throughput ceiling whatever the quantization strategy.
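A rough way to see the bandwidth ceiling: for a dense model decoding one sequence at a time, each token needs roughly one full pass over the weights, so tokens per second cannot exceed bandwidth divided by model size. The sketch below applies that to the numbers quoted on this page. It is a simplification that ignores KV cache reads and any weights paged in from disk, both of which push the real ceiling lower.

```python
def bandwidth_ceiling_tokens_per_sec(bandwidth_gb_s: float,
                                     weights_gb: float) -> float:
    """Upper bound on decode speed if every token streams all weights once."""
    return bandwidth_gb_s / weights_gb


BANDWIDTH_GB_S = 400.0  # memory bandwidth quoted above
WEIGHTS_GB = 39.0       # 70B Q4_K_M size quoted earlier
MEASURED_TOK_S = 3.2    # sustained rate from the results

ceiling = bandwidth_ceiling_tokens_per_sec(BANDWIDTH_GB_S, WEIGHTS_GB)
print(f"ceiling:  {ceiling:.1f} tokens/sec")
print(f"measured: {MEASURED_TOK_S:.1f} tokens/sec "
      f"({MEASURED_TOK_S / ceiling:.0%} of ceiling)")
```

With these inputs the ceiling lands near 10 tokens/sec, so the measured rate sits well under it. That is what you would expect when part of the model does not fit in memory.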

Lessons Learned

Quantization is not free. On 70B, Q4 shows measurable quality degradation relative to Q5 on code generation tasks. The gap is smaller on natural language, but still present.
Speculative decoding only helps when the draft model is fast enough not to become the new bottleneck. On a Mac, the draft model runs on the same hardware as the main model and competes with it for resources, so the speedup is marginal.
The KV cache is the silent killer. It grows linearly with context length and batch size. Any pipeline that handles long conversations needs an eviction strategy baked in from day one.
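The linear growth is easy to quantify. The sketch below estimates KV cache size from the model shape and inverts it to get the longest context that fits in a given memory budget, which is one way to size an eviction window. The shape values (80 layers, 8 KV heads, head dimension 128) are taken from the published Llama 3.1 70B configuration rather than from this project, and the FP16 cache and 1 GiB budget are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    n_layers: int
    n_kv_heads: int
    head_dim: int


# Assumption: shape from the published Llama 3.1 70B config (grouped-query attention).
LLAMA_70B = ModelShape(n_layers=80, n_kv_heads=8, head_dim=128)


def kv_cache_bytes(shape: ModelShape, context_len: int,
                   batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """KV cache size: one K and one V vector per layer, KV head, and token."""
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_elem
    return per_token * context_len * batch_size


def max_context_tokens(shape: ModelShape, budget_bytes: int,
                       batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Longest context whose KV cache fits in budget_bytes."""
    per_token = kv_cache_bytes(shape, 1, batch_size, bytes_per_elem)
    return budget_bytes // per_token


GIB = 2**30
for ctx in (2048, 8192, 32768):
    size = kv_cache_bytes(LLAMA_70B, ctx)
    print(f"{ctx:>6} tokens -> {size / GIB:.2f} GiB")

# Hypothetical budget: how many tokens fit if only 1 GiB is left for the cache?
print("window for 1 GiB:", max_context_tokens(LLAMA_70B, 1 * GIB), "tokens")
```

Doubling either the context length or the batch size doubles the cache, so a fixed memory budget translates directly into a maximum window length.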
