Why mlx-bun
MLX is Apple’s ML framework: hand-tuned Metal kernels for Apple Silicon, with official bindings for Python, C++, Swift, and C — but no JavaScript story. Today, a JS/TS app that wants local MLX inference must shell out to a Python server (mlx-lm, optiq) and accept that stack’s fragility: venv setup, brittle download tooling, segfaults on exit, monkey-patched HTTP layers.
The performance-neutral layer
Section titled “The performance-neutral layer”The performance-critical work — every matmul, every attention pass — lives in
MLX’s C++/Metal core and is exposed through mlx-c. The Python layer on top is
pure orchestration: model loading, tokenization, the sampling loop, serving.
That layer is performance-neutral (the GPU dominates), so it can be rewritten in
any runtime without losing speed.
Why Bun specifically
Section titled “Why Bun specifically”bun:ffi— the lowest-overhead FFI of any JS runtime; bindsmlx-cdirectly, no node-gyp, no binding compilation.- Lazy native weight loading — MLX’s loader materializes tensors on first use, so opening an 8.9 GB model takes milliseconds, and warm restarts are near-instant on cached pages.
Bun.Image— native OS image codecs (HEIC, AVIF, WebP, JPEG, …) for the vision path, EXIF auto-orientation included.bun:sqlite— built-in storage for the model registry and eval DB.- One binary, one toolchain — runtime, package manager, test runner.
The result is a single binary with no Python anywhere — and, served over HTTP, the fastest startup and TTFT of any stack tested, with bit-exact correctness against the Python reference.