Server API

mlx-bun serve (or bun scripts/serve.ts) exposes an OpenAI-compatible HTTP API for a single loaded model. The request’s model field is ignored; the loaded model’s id is echoed back. By default, generation is serialized through a single queue (one GPU, batch = 1); --batch N switches the server to bf16 continuous batching.

For the full set of start flags (--port, --memory-budget, --prompt-cache, --batch, --kv-quant, --thinking, sampling defaults, and the perf levers) and a compatibility matrix of which combinations compose, see server-config.md. This doc covers the request/response wire format. LoRA adapters are mounted at runtime via POST /v1/adapters (below), not a start flag.

Request body (OpenAI chat schema; unknown fields ignored):

{
  "messages": [ /* role: system | user | assistant | tool */ ],
  "stream": false,
  "max_tokens": 1024,            // or max_completion_tokens (wins)
  "temperature": 0.7,            // 0 = greedy
  "top_p": 0, "top_k": 0,        // 0 = off
  "seed": 1234,                  // omit for time-derived
  "repetition_penalty": 1.1,     // optional
  "stop": "\n\n",                // or ["###", "\n\n"] (spec: up to 4)
  "tools": [ /* OpenAI function tools */ ],
  "tool_choice": "auto",         // "none" disables tools
  "chat_template_kwargs": {      // forwarded to the chat template
    "enable_thinking": false     // MiniCPM5 / Qwen3.5: <think> channel on/off
  },
  "reasoning_effort": "medium",  // "none"|"minimal"|"low"|"medium"|"high"
                                 // gates enable_thinking on Qwen3.5/MiniCPM5:
                                 // "none" → off, any other level → on.
                                 // Overrides chat_template_kwargs.enable_thinking.
  "hlg": {                       // HLG tone-curve sampling (per request).
    "enabled": true,             // merged over --hlg-sampling server defaults.
    "width": 4,                  // logit-width of the tone plateau
    "shoulder": 4,               // top shoulder width
    "toe": 6,                    // bottom toe width
    "pivot_offset": 6            // pivot point offset from top
  },
  "adapter": "id"                // LoRA: "id", stacked "a+b", or "none"
}
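
A minimal client sketch for the body above. It assumes the server listens on http://localhost:8090 (the port used in the client configuration later in this document) and that chat requests go to /v1/chat/completions, the usual OpenAI path. The helper names (buildChatRequest, effectiveThinking, chat) are hypothetical and not part of mlx-bun.

```typescript
// Hypothetical helpers; only the field names come from the request schema above.
type Role = "system" | "user" | "assistant" | "tool";
type Effort = "none" | "minimal" | "low" | "medium" | "high";

interface ChatMessage { role: Role; content: string }

interface ChatRequest {
  messages: ChatMessage[];
  stream: boolean;
  max_tokens?: number;
  temperature?: number;
  stop?: string | string[];
  reasoning_effort?: Effort;
  chat_template_kwargs?: { enable_thinking?: boolean };
}

// Build a single-turn, non-streaming request; overrides win over the defaults.
function buildChatRequest(prompt: string, overrides: Partial<ChatRequest> = {}): ChatRequest {
  return { messages: [{ role: "user", content: prompt }], stream: false, ...overrides };
}

// Mirrors the documented rule: reasoning_effort overrides
// chat_template_kwargs.enable_thinking; "none" means off, any other level on.
// undefined means "neither field set", i.e. the model's own default applies.
function effectiveThinking(req: ChatRequest): boolean | undefined {
  if (req.reasoning_effort !== undefined) return req.reasoning_effort !== "none";
  return req.chat_template_kwargs?.enable_thinking;
}

// Assumed base URL and path (OpenAI convention); adjust to your --port.
async function chat(req: ChatRequest, baseUrl = "http://localhost:8090/v1"): Promise<string> {
  const res = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(req),
  });
  const body = (await res.json()) as any;
  if (!res.ok) throw new Error(body?.error?.message ?? `HTTP ${res.status}`);
  return body.choices[0].message.content;
}
```

Calling chat(buildChatRequest("Hello", { max_tokens: 64 })) against a running server should resolve to the assistant text of the first choice.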

Sampling defaults follow the model author’s generation_config.json when a field is omitted (optiq serve’s gen_config behavior); explicit request values always win. MiniCPM5 defaults to the no-think direct answer mode unless chat_template_kwargs.enable_thinking is true.

stop sequences are matched on decoded text, not token ids, so a sequence that spans token boundaries still fires. Generation halts at the first match; the stop sequence itself is excluded from the content and finish_reason is "stop".
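
The observable effect of the stop rule can be reproduced on a finished string. This is an illustration of the documented behavior (earliest match wins, the sequence is excluded), not the server's implementation; applyStop is a hypothetical name.

```typescript
// Illustration only: applies the documented stop semantics to decoded text.
function applyStop(
  text: string,
  stop: string | string[],
): { content: string; finish_reason: "stop" | null } {
  const stops = (Array.isArray(stop) ? stop : [stop]).filter((s) => s.length > 0);
  let cut = -1;
  for (const s of stops) {
    const i = text.indexOf(s);
    // Keep the earliest match across all stop sequences.
    if (i !== -1 && (cut === -1 || i < cut)) cut = i;
  }
  return cut === -1
    ? { content: text, finish_reason: null } // no stop hit; another reason applies
    : { content: text.slice(0, cut), finish_reason: "stop" };
}

console.log(applyStop("Hello###World\n\nmore", ["\n\n", "###"]));
```

Because matching happens on text, it does not matter how the tokenizer happened to split "###" across tokens.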

Message content is a string or an array of parts: { "type": "text", "text": ... } and { "type": "image_url", "image_url": { "url": "data:image/png;base64,..." } } (http/https URLs also accepted; PNG, JPEG, HEIC, AVIF, WebP, TIFF, GIF, BMP via native OS codecs; requires a model with the vision sidecar).
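
A small sketch of building a multi-part user message with an inline image. The part shapes come from the paragraph above; imagePart and userMessageWithImage are hypothetical helper names, and btoa is assumed to be available as a global (it is in Bun, browsers, and current Node).

```typescript
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string } };

// Encode raw image bytes as a data: URL content part.
function imagePart(bytes: Uint8Array, mime = "image/png"): ImagePart {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return { type: "image_url", image_url: { url: `data:${mime};base64,${btoa(binary)}` } };
}

// A user message whose content is an array of parts instead of a string.
function userMessageWithImage(question: string, bytes: Uint8Array, mime?: string) {
  const parts: Array<TextPart | ImagePart> = [
    { type: "text", text: question },
    imagePart(bytes, mime),
  ];
  return { role: "user" as const, content: parts };
}
```

The server rejects such a message with a 400 unless the loaded model has the vision sidecar.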

Non-streaming response:

{
  "id": "chatcmpl-…", "object": "chat.completion", "created": 1760000000,
  "model": "<loaded model id>",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "",                // "" when only tool calls
      "tool_calls": [{              // present when the model called tools
        "id": "call_…", "type": "function",
        "function": { "name": "", "arguments": "{…json…}" }
      }]
    },
    "finish_reason": "stop" | "length" | "tool_calls"
  }],
  "usage": {
    "prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0,
    "prompt_tokens_details": { "cached_tokens": 0 }  // prompt-cache reuse
  }
}

Streaming ("stream": true) is SSE: data: <chunk>\n\n per event, terminated by data: "[DONE]". Chunks are chat.completion.chunk objects whose choices[0].delta carries, in order: {role}; {content} increments; and, for tool calls, a final {tool_calls: [{index, id, type, function}]} delta. The last chunk carries finish_reason and usage. Two hold-back rules apply to content increments: multi-byte sequences are held back until decodable, and text that could begin a stop sequence is held back until disambiguated, so no part of a stop sequence is ever streamed.
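
A client-side sketch of consuming this stream: an incremental parser that tolerates events split across network reads. The function and type names are hypothetical. The terminator check accepts both the bare [DONE] used by OpenAI and the quoted form written above, since this sketch does not assume which one is on the wire.

```typescript
interface StreamState {
  buffer: string;              // unparsed tail of the SSE stream
  content: string;             // accumulated delta.content
  finishReason: string | null;
  done: boolean;
}

function newStreamState(): StreamState {
  return { buffer: "", content: "", finishReason: null, done: false };
}

// Feed one decoded text fragment; complete "data: ...\n\n" events are consumed.
function feed(state: StreamState, text: string): StreamState {
  state.buffer += text;
  let sep = state.buffer.indexOf("\n\n");
  while (sep !== -1) {
    const event = state.buffer.slice(0, sep);
    state.buffer = state.buffer.slice(sep + 2);
    sep = state.buffer.indexOf("\n\n");
    if (!event.startsWith("data:")) continue;
    const payload = event.slice(5).trim();
    if (payload === "[DONE]" || payload === '"[DONE]"') {
      state.done = true;
      continue;
    }
    const choice = JSON.parse(payload).choices?.[0];
    if (choice?.delta?.content) state.content += choice.delta.content;
    if (choice?.finish_reason) state.finishReason = choice.finish_reason;
  }
  return state;
}
```

In practice the fragments would come from a TextDecoder in streaming mode reading response.body, so that multi-byte characters split across reads are also handled on the client side.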

Tool round-trip: send the assistant message with its tool_calls back, followed by { "role": "tool", "tool_call_id": …, "content": … } messages; multi-turn prompt prefixes reuse the KV prompt cache automatically.
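
The round-trip can be sketched as a pure function that extends the message history. Message shapes follow the OpenAI chat schema described above; withToolResults and the ToolRunner callback are hypothetical names, and serializing tool output with JSON.stringify is an assumption about what your tools return.

```typescript
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON string
}

type Msg =
  | { role: "system" | "user"; content: string }
  | { role: "assistant"; content: string; tool_calls?: ToolCall[] }
  | { role: "tool"; tool_call_id: string; content: string };

type ToolRunner = (name: string, args: unknown) => unknown;

// Append the assistant turn (with its tool_calls) and one tool message per call.
function withToolResults(
  history: Msg[],
  assistant: { content: string; tool_calls: ToolCall[] },
  run: ToolRunner,
): Msg[] {
  const results = assistant.tool_calls.map((call): Msg => ({
    role: "tool",
    tool_call_id: call.id,
    content: JSON.stringify(run(call.function.name, JSON.parse(call.function.arguments))),
  }));
  const echoed: Msg = {
    role: "assistant",
    content: assistant.content,
    tool_calls: assistant.tool_calls,
  };
  return [...history, echoed, ...results];
}
```

Sending the returned array as messages in the next request keeps the earlier turns as an unchanged prefix, which is what lets the prompt cache reuse their KV.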

Tool-call parsing is per model family. Gemma 4 uses its native <|tool_call><tool_call|> sentinel tokens. Qwen3.5 emits decoded text wrapped in equals-style XML blocks (<tool_call><function=name><parameter=key>value). MiniCPM5 emits XML in decoded text (<function name="…"><param name="…">…, CDATA-wrapped values supported). For both Qwen3.5 and MiniCPM5, content before the tool markup still streams live; only the markup is withheld and converted to tool_calls. Argument values are decoded against the tool’s JSON schema (string-typed params stay strings); markup that fails to parse falls back to plain content.
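
To make the schema-directed decoding concrete, here is a rough sketch of how raw parameter text could be typed against a tool's JSON schema. This is a guess at the idea, not the server's parser: decodeArguments is a hypothetical name, and both the JSON.parse step for non-string types and the fallbacks (raw text on parse failure, string for undeclared params) are assumptions.

```typescript
type JsonSchemaType = "string" | "number" | "integer" | "boolean" | "object" | "array";

function decodeArgument(raw: string, type: JsonSchemaType): unknown {
  if (type === "string") return raw; // string-typed params stay strings
  try {
    return JSON.parse(raw);          // assumed: other types are read as JSON
  } catch {
    return raw;                      // assumed fallback: keep the raw text
  }
}

function decodeArguments(
  raw: Record<string, string>,
  properties: Record<string, { type: JsonSchemaType }>,
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(raw)) {
    // Assumed: params missing from the schema are treated as strings.
    out[key] = decodeArgument(raw[key], properties[key]?.type ?? "string");
  }
  return out;
}

console.log(
  decodeArguments(
    { zip: "12345", count: "3" },
    { zip: { type: "string" }, count: { type: "integer" } },
  ),
);
```

The zip value shows why the schema matters: without it, "12345" would be indistinguishable from a number.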

All errors are { "error": { "message": …, ... } }.

  • 400 — malformed JSON, empty messages, unknown adapter id, vision request on a model without a sidecar, prompt build failures.
  • 400 with "type": "memory_admission", "code": "context_over_budget" when prompt + max_tokens exceeds the memory budget’s max safe context. This applies only when serving with --memory-budget; the GPU OOM it prevents would kill the process, so the request is refused up front. Lower max_tokens or shorten the prompt; the ceiling is visible at /stats.
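
A sketch of handling the admission error on the client. It assumes type and code sit inside the error object next to message; isContextOverBudget and clampMaxTokens are hypothetical helpers, and shrinking max_tokens is just one possible recovery.

```typescript
interface ApiErrorBody {
  error: { message: string; type?: string; code?: string };
}

// True for the memory-admission refusal described above.
function isContextOverBudget(status: number, body: ApiErrorBody): boolean {
  return (
    status === 400 &&
    body.error.type === "memory_admission" &&
    body.error.code === "context_over_budget"
  );
}

// Largest max_tokens that keeps prompt + max_tokens within the ceiling
// (max_safe_context as reported by /stats). 0 means the prompt alone is too long.
function clampMaxTokens(promptTokens: number, maxTokens: number, maxSafeContext: number): number {
  return Math.max(0, Math.min(maxTokens, maxSafeContext - promptTokens));
}
```

When clampMaxTokens returns 0, lowering max_tokens cannot help and the prompt itself has to be shortened.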

POST /v1/messages (Anthropic Messages API)

Anthropic-protocol surface over the same engine — on by default, like optiq serve. Point any Anthropic-SDK tool at the server (ANTHROPIC_BASE_URL=http://localhost:8090, any x-api-key) — Claude Code works as a client this way.

  • system (string or text blocks), messages with string or content-block arrays; tool_use / tool_result blocks map to the native gemma tool-calling path (better than the optiq shim, which inlines them as text); image blocks (base64 or url source) hit the vision path on sidecar models.
  • tools ({name, description, input_schema}) map to function tools; server-tool types (web_search, …) are dropped silently.
  • max_tokens, temperature, top_p, top_k, stop_sequences, stream as in the Anthropic spec.
  • Response: {id: "msg_…", type: "message", content: [{type: "text"} | {type: "tool_use"}…], stop_reason, usage: {input_tokens, output_tokens, cache_read_input_tokens}}, where cache_read_input_tokens comes from the prompt cache.
  • Streaming follows the Anthropic event grammar exactly: message_start → content_block_start/delta/stop (text_delta, input_json_delta) → message_delta (stop_reason + usage) → message_stop. Errors are event: error frames.
  • Errors: {type: "error", error: {type: "invalid_request_error" | "api_error", message}}.
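
The streaming grammar above can be folded into finished content blocks with a small reducer. The event shapes follow the public Anthropic streaming format as far as this sketch needs them; foldEvents and the Block type are hypothetical names, and the input is assumed to be already-parsed event payloads.

```typescript
type AnthropicEvent =
  | { type: "message_start" }
  | { type: "content_block_start"; index: number;
      content_block: { type: "text" } | { type: "tool_use"; id: string; name: string } }
  | { type: "content_block_delta"; index: number;
      delta: { type: "text_delta"; text: string } | { type: "input_json_delta"; partial_json: string } }
  | { type: "content_block_stop"; index: number }
  | { type: "message_delta"; delta: { stop_reason: string | null } }
  | { type: "message_stop" };

interface Block { type: "text" | "tool_use"; name?: string; text: string; json: string }

function foldEvents(events: AnthropicEvent[]): { blocks: Block[]; stopReason: string | null } {
  const blocks: Block[] = [];
  let stopReason: string | null = null;
  for (const ev of events) {
    if (ev.type === "content_block_start") {
      const cb = ev.content_block;
      blocks[ev.index] = {
        type: cb.type,
        name: cb.type === "tool_use" ? cb.name : undefined,
        text: "",
        json: "",
      };
    } else if (ev.type === "content_block_delta") {
      const d = ev.delta;
      if (d.type === "text_delta") blocks[ev.index].text += d.text;
      else blocks[ev.index].json += d.partial_json; // tool input arrives as JSON fragments
    } else if (ev.type === "message_delta") {
      stopReason = ev.delta.stop_reason;
    }
  }
  return { blocks, stopReason };
}
```

A tool_use block's json string is only valid JSON once its content_block_stop has arrived, so parse it after folding rather than per delta.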

Responses-protocol surface (Codex, Cursor, Continue, Cline, and the OpenAI SDK speak this now). Oracle: optiq responses shim.

  • input (string or item array: message, function_call, function_call_output), instructions (merged with any system/developer items into one leading system message), max_output_tokens, temperature, top_p, top_k, flat tools/tool_choice (built-in tool types dropped), stream.
  • previous_response_id resumption: pass a prior response id instead of resending the conversation; the server splices the stored input + output back in (instructions carry forward when omitted). Store is per-process, 1 h TTL, 32 MiB byte-capped LRU — observable at GET /stats (response_store). Unknown/expired id → 404.
  • Response: {id: "resp_…", object: "response", status: "completed" | "incomplete", output: [{type: "message"|"function_call"…}], usage}.
  • Streaming event chain: response.created → response.in_progress → response.output_item.added → response.content_part.added → response.output_text.delta… → response.output_text.done → response.content_part.done → response.output_item.done → response.completed (+ response.function_call_arguments.delta/.done for tool calls).
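
A sketch of the previous_response_id flow as request bodies. The field names come from the list above; firstTurn, nextTurn, and shouldResendHistory are hypothetical helpers, and the endpoint path is deliberately left out because this page does not state it.

```typescript
interface ResponsesRequest {
  input: string;
  instructions?: string;
  previous_response_id?: string;
  max_output_tokens?: number;
  stream?: boolean;
}

// First turn: full input plus instructions.
function firstTurn(input: string, instructions: string): ResponsesRequest {
  return { input, instructions };
}

// Follow-up turn: only the new input and the prior response id.
// instructions is omitted on purpose so the stored one carries forward.
function nextTurn(previous: { id: string }, input: string): ResponsesRequest {
  return { input, previous_response_id: previous.id };
}

// A 404 means the stored response is unknown or expired (per-process store,
// 1 h TTL, LRU eviction), so the client has to resend the conversation itself.
function shouldResendHistory(status: number): boolean {
  return status === 404;
}
```

Since the store is per-process, a server restart invalidates every id, which makes the 404 fallback worth implementing even for short sessions.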

{ "object": "list", "data": [{ "id": "<model id>", "object": "model", … }] }

{
  "server": { "owner": "serve" | "pi-session" | "embedded", "model": "...", "started_at": 0 },
  "prompt_cache": { "entries": 0, "bytes": 0, "max_bytes": 0, "hits": 0, "misses": 0 },
  "response_store": { "entries": 0, "bytes": 0, "max_bytes": 33554432, "ttl_ms": 3600000 },
  "kv_quant": {
    "mode": "mixed (kv_config.json)" | "uniform-kv8" | "bf16",
    "layers": { "kv4": 8, "bf16": 40 },
    "attention": { "global": 10, "sliding_window": 38 }
  },
  "admission": {
    "max_safe_context": 0,         // tokens; requests above this 400
    "memory_budget_bytes": null,   // explicit budget, or null (machine default)
    "usable_bytes": 0,
    "weights_bytes": 0
  },
  "batch": {
    "configured": 1,               // the --batch N value
    "batched": false,              // batching enabled (N>1)
    "active_rows": 0               // rows currently decoding in the batch
  }
}
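
Two quantities that are handy to derive from this payload, sketched under the assumption that the fields mean what their names and comments suggest. The Stats interface covers only the fields used here; cacheHitRate and wouldBeAdmitted are hypothetical helpers.

```typescript
interface Stats {
  prompt_cache: { hits: number; misses: number };
  admission: { max_safe_context: number };
}

// Fraction of prompt-cache lookups that hit; 0 when nothing has been looked up yet.
function cacheHitRate(stats: Stats): number {
  const lookups = stats.prompt_cache.hits + stats.prompt_cache.misses;
  return lookups === 0 ? 0 : stats.prompt_cache.hits / lookups;
}

// Pre-flight check mirroring the admission rule: requests above the ceiling get a 400.
function wouldBeAdmitted(stats: Stats, promptTokens: number, maxTokens: number): boolean {
  return promptTokens + maxTokens <= stats.admission.max_safe_context;
}
```

The pre-flight check only matters when the server runs with --memory-budget, since the admission refusal is limited to that mode.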

Returns all models found in the local HuggingFace hub cache (via the registry scan), each annotated with a fit assessment for this machine. Response is cached for 30 seconds (registry scan + config reads; no tensor bytes are read).

{
  "models": [{
    "repo_id": "",
    "model_type": "gemma3" | "minicpm5" | "qwen3" | …,
    "size_bytes": 0,
    "quant_bits": 4,
    "vision": false,     // has vision sidecar
    "supported": true,   // recognized model family
    "serving": false,    // currently loaded in this server
    "assessment": {      // null if config unreadable
      "fits": true,
      "max_safe_context": 8192,
      "predicted_decode_tps": 0.0
    }
  }]
}
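
One way a client might use this listing is to shortlist models that are supported, fit the machine, and offer enough context. This is a sketch over the response shape above; usableModels is a hypothetical helper and the ranking by predicted decode speed is an arbitrary choice.

```typescript
interface ModelEntry {
  repo_id: string;
  supported: boolean;
  serving: boolean;
  assessment: { fits: boolean; max_safe_context: number; predicted_decode_tps: number } | null;
}

// Keep models that are supported, fit, and reach minContext; fastest first.
function usableModels(models: ModelEntry[], minContext: number): ModelEntry[] {
  return models
    .filter(
      (m) =>
        m.supported &&
        m.assessment !== null && // null means the config was unreadable
        m.assessment.fits &&
        m.assessment.max_safe_context >= minContext,
    )
    .sort((a, b) => b.assessment!.predicted_decode_tps - a.assessment!.predicted_decode_tps);
}
```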

Snapshot of the last 5 model downloads (active, completed, or errored) initiated via mlx-bun download or the web library panel.

{
  "downloads": [{
    "repo_id": "",
    "state": "active" | "done" | "error",
    "current_file": "model.safetensors" | null,
    "received_bytes": 0,
    "total_bytes": 0,
    "files_done": 0,
    "files_total": 0,
    "bytes_per_sec": 0,        // rolling ~5 s window
    "started_at": 1760000000,
    "finished_at": null,
    "error": ""                // present on state "error"
  }]
}
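
Progress and a rough ETA can be derived from one snapshot entry. The sketch below uses only fields shown above; downloadProgress is a hypothetical helper, and the ETA simply extrapolates the rolling rate, so it will jump around as the rate changes.

```typescript
interface DownloadEntry {
  state: "active" | "done" | "error";
  received_bytes: number;
  total_bytes: number;
  bytes_per_sec: number;
}

function downloadProgress(d: DownloadEntry): { percent: number; etaSeconds: number | null } {
  // total_bytes may be 0 before sizes are known; report 0% rather than dividing by zero.
  const percent = d.total_bytes > 0 ? (100 * d.received_bytes) / d.total_bytes : 0;
  const remaining = Math.max(0, d.total_bytes - d.received_bytes);
  // Only active downloads with a nonzero rate get an ETA.
  const etaSeconds =
    d.state === "active" && d.bytes_per_sec > 0 ? remaining / d.bytes_per_sec : null;
  return { percent, etaSeconds };
}
```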

Fit assessment for the loaded model on this machine, plus a capability matrix across Apple SKUs. Used by the status page. Experts bytes come from the registry so MoE active-parameter predictions match the serve banner and mlx-bun fit. When the eval DB has a real measured decode rate for this model snapshot, it is included and takes precedence over the prediction.

{
  "machine": { "chip": "M4 Pro", "ram_bytes": 0, "bandwidth_gbs": 0.0 },
  "context_tokens": 8192,            // current admission ceiling
  "typical_context_tokens": 8192,    // min(8192, context_tokens)
  "typical_decode_tps": 0.0,         // predicted at typical_context_tokens
  "measured_decode_tps": null,       // real number from eval DB, or null
  "measured_at": null,               // unix ms of measurement, or null
  "report": {
    "fits": true,
    "weights_bytes": 0,
    "kv_bytes": 0,
    "transient_bytes": 0,
    "total_bytes": 0,
    "usable_bytes": 0,
    "max_safe_context": 8192,
    "predicted_decode_tps": 0.0
  },
  "sku_matrix_ctx": 32768,
  "sku_matrix": [{
    "sku": "M4 Pro 24 GB", "ram_gb": 24,
    "fits": true, "max_context": 32768, "decode_tps": 0.0
  }]
}

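
A sketch of reading the fit payload above on the client: prefer the measured decode rate when present, and find the smallest SKU in the matrix that fits. Both helper names are hypothetical, and treating typical_decode_tps as the prediction that the measurement overrides is an assumption based on the field comments.

```typescript
interface SkuRow {
  sku: string;
  ram_gb: number;
  fits: boolean;
  max_context: number;
  decode_tps: number;
}

interface FitResponse {
  typical_decode_tps: number;
  measured_decode_tps: number | null;
  sku_matrix: SkuRow[];
}

// The measured rate takes precedence over the prediction when it exists.
function effectiveDecodeTps(fit: FitResponse): number {
  return fit.measured_decode_tps ?? fit.typical_decode_tps;
}

// Smallest-RAM SKU that fits at sku_matrix_ctx, or null if none does.
function smallestFittingSku(fit: FitResponse): SkuRow | null {
  const fitting = fit.sku_matrix.filter((r) => r.fits).sort((a, b) => a.ram_gb - b.ram_gb);
  return fitting.length > 0 ? fitting[0] : null;
}
```
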
  • GET /v1/adapters returns { adapters: [{ id, path, rank, scale, size_bytes, mounted_layers }] }.
  • POST /v1/adapters with body { "id": "...", "path": "/dir" } mounts the adapter through the generation queue (so it never races a forward pass). Returns 400 on a shape/compat mismatch; validation is all-or-nothing.
  • DELETE /v1/adapters/<id> — unmount; 404 if not mounted.

Select per request with the adapter body field. Prompt-cache entries are namespaced per adapter spec, so switching adapters never reuses another adapter’s KV.
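
A sketch of the two client-side pieces: the mount body and the per-request adapter value. The "a+b" stacking syntax and "none" come from the request schema earlier in this document; mountBody and adapterSpec are hypothetical helpers, and mapping an empty list to "none" is a choice made for this example.

```typescript
// Body for POST /v1/adapters.
function mountBody(id: string, path: string): { id: string; path: string } {
  return { id, path };
}

// Value for the request's adapter field: one id, several stacked with "+",
// or "none" to run the base model.
function adapterSpec(ids: string[]): string {
  return ids.length === 0 ? "none" : ids.join("+");
}

console.log(adapterSpec(["style", "domain"])); // style+domain
```

Since cache entries are namespaced per adapter spec, keeping a conversation on one spec gives it the best chance of reusing its own cached prefix.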

~/.pi/agent/models.json:

{
  "providers": {
    "mlx-bun": {
      "baseUrl": "http://localhost:8090/v1",
      "api": "openai-completions",
      "apiKey": "sk-anything-nonempty",
      "models": [{ "id": "<model id from /v1/models>" }]
    }
  }
}

Any OpenAI SDK works the same way: baseURL: "http://localhost:8090/v1", any non-empty apiKey.