CLI reference

Commands are shown as mlx-bun <verb>. From a clone, the equivalent command is bun src/cli.ts <verb>. Model arguments are substring queries against the registry (e4b, 26B, 12B-it); a query that matches more than one model fails and lists the candidates, so make the query more specific.
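The substring lookup described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the `ModelEntry` shape, the `resolveModel` name, the case-insensitive matching, and the second registry id are all assumptions made for the example.

```typescript
// Hypothetical registry entry; the real registry schema is not shown in this reference.
interface ModelEntry {
  id: string;
}

// Resolve a substring query to exactly one model, or throw with the candidates.
// Assumption: matching is case-insensitive.
function resolveModel(query: string, registry: ModelEntry[]): ModelEntry {
  const needle = query.toLowerCase();
  const matches = registry.filter((m) => m.id.toLowerCase().includes(needle));
  if (matches.length === 0) {
    throw new Error(`no model matches "${query}"`);
  }
  if (matches.length > 1) {
    const candidates = matches.map((m) => `  ${m.id}`).join("\n");
    throw new Error(`"${query}" is ambiguous; candidates:\n${candidates}`);
  }
  return matches[0];
}

const registry: ModelEntry[] = [
  { id: "mlx-community/gemma-4-12B-it-OptiQ-4bit" },
  { id: "example-org/gemma-example-e4b-4bit" }, // made-up id for illustration
];
```

With this registry, `resolveModel("12B-it", registry)` returns the first entry, while `resolveModel("gemma", registry)` throws and lists both ids.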

Start the OpenAI/Anthropic-compatible server. Bare mlx-bun is an alias for mlx-bun serve.

mlx-bun serve gemma --port 8090 # OpenAI-compatible server
mlx-bun serve gemma --memory-budget 18 # ...with admission control (GB)
mlx-bun serve e4b --no-open # don't open the browser chat UI
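Because the server is OpenAI-compatible, a client can probably talk to it the same way it talks to any OpenAI-style endpoint. The sketch below assumes the standard `/v1/chat/completions` path and the default port 8090. `buildChatRequest` is a hypothetical helper, and the model name `gemma` is a placeholder: this reference does not say which model names the HTTP API accepts.

```typescript
// Build a chat-completion request for an OpenAI-compatible server.
// Assumptions: the server follows the OpenAI path /v1/chat/completions and
// listens on the default port 8090; the model name is a placeholder.
function buildChatRequest(baseUrl: string, model: string, prompt: string) {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

const req = buildChatRequest("http://localhost:8090", "gemma", "Say hello.");
// With the server running:
// const res = await fetch(req.url, req.init);
// console.log((await res.json()).choices[0].message.content);
```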

Common flags (full list in Server configuration):

| Flag | Effect |
| --- | --- |
| --port <n> | Listen port (default 8090) |
| --memory-budget <GB> | Reject loads/requests that can’t fit the budget |
| --no-open | Don’t auto-open the chat UI |
| --no-kv-quant / --kv-bits <n> | Control mixed-precision KV |
| --adapter id=dir | Mount a LoRA adapter at startup |
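As a rough mental model of what --memory-budget does, admission control can be read as a single comparison against the budget. The sketch below is an assumption-laden illustration, not the server's logic: `admits` is a hypothetical name, the budget is treated as binary gigabytes, and the way the real server estimates a request's memory is not covered here.

```typescript
const GB = 1024 ** 3; // assumption: the budget is in binary gigabytes

// Return true if a new allocation (a model load or a request's working set)
// fits under the budget given what is already in use.
function admits(budgetGB: number, inUseBytes: number, requestedBytes: number): boolean {
  return inUseBytes + requestedBytes <= budgetGB * GB;
}

// Example with an 18 GB budget and 12 GB already in use.
const smallOk = admits(18, 12 * GB, 4 * GB); // 16 GB total, within budget
const largeOk = admits(18, 12 * GB, 8 * GB); // 20 GB total, over budget
```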

Resumable, checksum-verified download into the standard Hugging Face cache.

mlx-bun get mlx-community/gemma-4-12B-it-OptiQ-4bit

Downloads resume after an interruption, every blob is SHA-verified, and the layout matches huggingface_hub exactly, so an existing HF cache is picked up as-is.
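The three properties above can each be illustrated in a few lines. The directory naming (`models--<org>--<name>` under the hub cache, with `blobs`, `refs`, and `snapshots` inside) is the huggingface_hub convention. The helper names are hypothetical, and the use of SHA-256 and of an HTTP Range header are assumptions about how such a downloader would typically work, not statements about mlx-bun's implementation.

```typescript
import { createHash } from "node:crypto";
import { homedir } from "node:os";
import { join } from "node:path";

// Cache directory huggingface_hub uses for a model repo:
//   <hub>/models--<org>--<name>/{blobs,refs,snapshots}
// The default hub directory applies when HF_HOME / HF_HUB_CACHE are not set.
function repoCacheDir(
  repoId: string,
  hubDir: string = join(homedir(), ".cache", "huggingface", "hub"),
): string {
  return join(hubDir, `models--${repoId.split("/").join("--")}`);
}

// Compare a blob's SHA-256 digest with the expected hex string.
function sha256Matches(data: Uint8Array, expectedHex: string): boolean {
  return createHash("sha256").update(data).digest("hex") === expectedHex.toLowerCase();
}

// Range header to continue a partial download from the bytes already on disk.
function resumeHeaders(bytesOnDisk: number): Record<string, string> {
  return bytesOnDisk > 0 ? { Range: `bytes=${bytesOnDisk}-` } : {};
}
```

For example, `repoCacheDir("mlx-community/gemma-4-12B-it-OptiQ-4bit")` ends in `models--mlx-community--gemma-4-12B-it-OptiQ-4bit`, which is the directory an existing huggingface_hub install would already have created for that repo.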

Index the models in your HF cache into the registry so ls, serve, and fit can find them by substring.

mlx-bun scan

List the indexed models, optionally filtered.

mlx-bun ls # size, params, quant, capabilities
mlx-bun ls --vision --max-size 10GB # filter
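The filter flags can be thought of as predicates over registry entries. The following is a sketch under stated assumptions: the `ListedModel` fields, the `parseSize` and `filterModels` helpers, and the decimal interpretation of "GB" are all hypothetical, and the sample ids and sizes are made up.

```typescript
// Hypothetical shape of a listed model; the real registry fields may differ.
interface ListedModel {
  id: string;
  sizeBytes: number;
  vision: boolean;
}

// Parse a size such as "10GB". Assumption: decimal units (1 GB = 1e9 bytes).
function parseSize(text: string): number {
  const m = /^(\d+(?:\.\d+)?)\s*(MB|GB)$/i.exec(text.trim());
  if (!m) throw new Error(`unrecognized size: ${text}`);
  const unit = m[2].toUpperCase() === "GB" ? 1e9 : 1e6;
  return Number(m[1]) * unit;
}

// Keep models that satisfy every filter that was supplied.
function filterModels(
  models: ListedModel[],
  opts: { vision?: boolean; maxSize?: string },
): ListedModel[] {
  const limit = opts.maxSize === undefined ? Infinity : parseSize(opts.maxSize);
  return models.filter((m) => (!opts.vision || m.vision) && m.sizeBytes <= limit);
}

const sample: ListedModel[] = [
  { id: "example/small-vision", sizeBytes: 6e9, vision: true },
  { id: "example/large-vision", sizeBytes: 16e9, vision: true },
  { id: "example/small-text", sizeBytes: 4e9, vision: false },
];
const shown = filterModels(sample, { vision: true, maxSize: "10GB" });
```

Here `shown` contains only `example/small-vision`, mirroring what `--vision --max-size 10GB` would be expected to select.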

Deterministic memory assessment: does it fit, what’s the max context, predicted tok/s.

mlx-bun fit gemma --ctx 32768 # for this machine
mlx-bun fit gemma --ctx 8192 --skus # across the Apple Silicon lineup

See Choosing a model for how the assessment is computed.
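For intuition about why such an assessment can be deterministic, the dominant terms are the weights plus a KV cache that grows linearly with context. The sketch below uses the textbook KV cache formula and is not the tool's actual method: the `ModelShape` fields and the example numbers are hypothetical, and it ignores effects such as sliding-window attention, mixed-precision KV, and activation memory.

```typescript
// Hypothetical model shape; real values come from a model's config.
interface ModelShape {
  weightBytes: number; // size of the quantized weights
  layers: number;
  kvHeads: number;
  headDim: number;
}

// Bytes of KV cache per token: keys and values (factor 2) for every layer and KV head.
function kvBytesPerToken(m: ModelShape, bytesPerElem = 2): number {
  return 2 * m.layers * m.kvHeads * m.headDim * bytesPerElem;
}

// Total KV cache size for a context of `ctx` tokens.
function kvCacheBytes(m: ModelShape, ctx: number, bytesPerElem = 2): number {
  return kvBytesPerToken(m, bytesPerElem) * ctx;
}

// Does the model plus its KV cache fit in the available memory?
function fits(m: ModelShape, ctx: number, availableBytes: number): boolean {
  return m.weightBytes + kvCacheBytes(m, ctx) <= availableBytes;
}

// Largest context that fits in the memory left after the weights are loaded.
function maxContext(m: ModelShape, availableBytes: number, bytesPerElem = 2): number {
  const spare = availableBytes - m.weightBytes;
  return Math.max(0, Math.floor(spare / kvBytesPerToken(m, bytesPerElem)));
}

// Made-up example: 8 GB of weights, 32 layers, 8 KV heads of dimension 128.
const shape: ModelShape = { weightBytes: 8e9, layers: 32, kvHeads: 8, headDim: 128 };
const kvAt32k = kvCacheBytes(shape, 32768); // 4294967296 bytes (4 GiB) at 16-bit KV
```

Under these made-up numbers, a 32768-token context adds 4 GiB on top of the weights, so the model fits in 16 GB of available memory but not in 12 GB.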

mlx-bun evals

Point your own pi install at the local server.

mlx-bun harness pi

The head-to-head benchmark matrix against mlx-lm/optiq runs as a script. Reboot first for clean numbers. The script is preflight-gated and resumable, and it writes its results to benchmarks-h2h-<date>.md:

./benchmark.sh