Benchmarks

Head-to-head against the Python stacks (mlx-lm 0.31.3, mlx-optiq 0.2.1), same machine (M4 Pro, 24 GB), same day, same Hugging Face snapshots, preflight-gated clean machine, median-of-N with warmups discarded.
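
The aggregation named above, median-of-N with warmups discarded, can be sketched as follows. This is a minimal illustration and not the project's harness: the function name `median_after_warmup`, the warmup count of two, and the sample values are all assumptions made for the example.

```python
from statistics import median


def median_after_warmup(samples, warmups=2):
    """Return the median of `samples` after dropping the first `warmups` runs.

    Hypothetical helper: the name and the default warmup count are
    illustrative, not taken from the benchmark harness.
    """
    if warmups < 0:
        raise ValueError("warmups must be non-negative")
    measured = list(samples)[warmups:]
    if not measured:
        raise ValueError("no samples left after discarding warmups")
    return median(measured)


# Made-up TTFT samples in milliseconds; the first two runs are cold.
ttft_ms = [310.0, 120.0, 62.0, 58.0, 61.0, 60.0, 59.0]
print(median_after_warmup(ttft_ms, warmups=2))
```

Using the median rather than the mean keeps a single slow outlier run from shifting the reported figure.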

The curated table (parity / performance / quality) with per-row provenance lives in benchmarks/RESULTS.md.

| | mlx-bun | mlx-lm | optiq |
|---|---|---|---|
| TTFT, served (warm) | 45–90 ms | 219–224 ms | 222–331 ms |
| server start → ready | 0.36–0.47 s | 0.76–0.98 s | 0.79–1.00 s |
| decode through HTTP (e4b / 12B / 26B) | 54.5 / 25.2 / 54.9 | 53.5 / — / 52.2 | 53.5 / 25.5 / † |
| server tax vs own direct decode | ≈ 0% | −5…−7% | ≈ 0% |
| direct decode (engine only) | −1.9…−4.4% vs mlx-lm | baseline | −0.8…−1.2% |
| 12B decode @8k context | 23.3 (23.0 kv-mixed) | 24.4 | 23.2 kv-mixed |

In this matrix, our direct decode trailed mlx-lm on every model: 12B −1.9%, 26B −2.9%, and e4b −4.4% at short context, with the 12B gap growing to −4.5% @8k. optiq’s served 12B edges ours by ~1% (25.5 vs 25.2), while paying 3.7× the TTFT.
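
The percentage gaps quoted above follow from the table values with simple arithmetic. The sketch below reproduces two of them; the helper name `relative_gap_pct` is an assumption for illustration, and it treats the decode figures as throughputs where higher is better.

```python
def relative_gap_pct(ours, reference):
    """Signed gap of `ours` against `reference`, as a percentage of the reference."""
    return (ours - reference) / reference * 100.0


# Values copied from the 12B rows of the table above.
print(round(relative_gap_pct(23.3, 24.4), 1))  # 12B decode @8k, ours vs mlx-lm
print(round(relative_gap_pct(25.2, 25.5), 1))  # served 12B, ours vs optiq
```

A negative result means our figure is below the reference; the first call gives the −4.5% gap at 8k context, and the second gives the roughly 1% lead that optiq holds on the served 12B.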

Post-matrix (2026-06-11), the decode gap was root-caused and fixed: a prefill→decode allocator-reclaim stall that mlx-lm clears with mx.clear_cache and bills to prompt time (we billed it to decode). After the reference-faithful fix, same-session paired runs put the 12B ahead at short context (25.1 vs 24.0) and at parity @8k (23.8 vs 23.9). A clean-machine re-measure is pending; e4b retains a ~5% per-step host-overhead residual.
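
The attribution difference described above can be illustrated with a small timing sketch. It assumes nothing about either engine's internals: `prefill`, `reclaim`, and `decode_step` are hypothetical callables standing in for prompt processing, the cache release (the role `mx.clear_cache` plays in mlx-lm), and one generation step. The only point is which timer the reclaim cost lands in.

```python
import time


def timed_generation(prefill, reclaim, decode_step, steps,
                     bill_reclaim_to="prompt", clock=time.perf_counter):
    """Time a generation, charging the reclaim cost to one phase.

    All callables are hypothetical stand-ins, not real engine APIs.
    `bill_reclaim_to` is either "prompt" or "decode".
    """
    if bill_reclaim_to not in ("prompt", "decode"):
        raise ValueError("bill_reclaim_to must be 'prompt' or 'decode'")

    start = clock()
    prefill()
    if bill_reclaim_to == "prompt":
        reclaim()  # stall is paid before the prompt timer stops
    prompt_end = clock()

    if bill_reclaim_to == "decode":
        reclaim()  # stall is paid inside the decode timer instead
    for _ in range(steps):
        decode_step()
    decode_end = clock()

    return {"prompt_s": prompt_end - start, "decode_s": decode_end - prompt_end}
```

Total wall time is the same either way. Only the split changes, so a decode rate computed from `decode_s` looks slower when the reclaim is billed to decode, even though the work done is identical.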

Served through HTTP — how agents actually use a local model — mlx-bun has the fastest decode on e4b and the 26B, and the fastest TTFT and startup everywhere by 2–5×.

† optiq serve produced no output on the 26B (the Metal OOM crash class from Python’s non-lazy load transient — reproduced in isolation; mlx-bun and mlx-lm both served the same model from the same machine state). One further optiq cell is blocked on an upstream optiq bug; both are documented in the results file.

These numbers are the 2026-06-11 cleared-machine re-run with the long-context guard active (every @8k row verified at its requested context).