
A hands-on guide to running LLM inference on Apple Silicon using vLLM Metal — via Docker Model Runner or native vLLM serve.
Published on Tuesday, March 3, 2026
Local LLM Inference on Mac: vLLM Metal Quick Start Guide
Two ways to run vLLM inference on Apple Silicon: via Docker Model Runner or native vLLM serve.
Prerequisites
| Requirement | Minimum |
|---|---|
| Mac | macOS on Apple Silicon |
| Memory | 16GB+ recommended (8B models need at least 8GB free) |
| Docker Desktop | 4.62.0+ (only for Docker path) |
What is vLLM Metal?
vLLM Metal is a vLLM plugin that enables Metal GPU-accelerated inference on Apple Silicon. It uses Apple's MLX framework as the compute backend, leveraging unified memory for zero-copy operations.
Key features:
- MLX-Accelerated Inference: Faster than PyTorch MPS on Apple Silicon
- Zero-Copy Operations: CPU and GPU share the same memory pool — no data transfers
- Full vLLM Compatibility: OpenAI-compatible API out of the box
- Rust High-Performance Extensions: Core BlockAllocator 355x faster than Python
- Vision-Language Model Support: LLaVA and other multimodal models via mlx-vlm
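Since the server exposes an OpenAI-compatible API, any OpenAI-style client can talk to it. A minimal stdlib-only Python sketch (`build_chat_request` and `chat` are hypothetical helper names; the endpoint URL and model name assume the Docker Model Runner setup with TCP enabled on port 12434, as configured later in this guide):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request (no network I/O)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires a running endpoint, e.g. Docker Model Runner with TCP enabled):
#   chat("http://localhost:12434/engines/v1",
#        "hf.co/mlx-community/Qwen3-8B-4bit", "Hello")
```

The same `chat` helper works against the native `vllm serve` path by swapping in `http://localhost:30000/v1` as the base URL.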
Path 1: Docker Model Runner
Best for quick setup and Docker workflow integration.
Install the Backend
```shell
docker model install-runner --backend vllm --gpu metal
```
This pulls a self-contained Python 3.12 environment into `~/.docker/model-runner/vllm-metal/`.
Note: vllm-metal runs natively on the host, not inside a container. Metal GPU requires direct hardware access.
Verify
```shell
docker model status
```
Pull and Run Models
```shell
# Pull an MLX model from Hugging Face
docker model pull hf.co/mlx-community/Qwen3-8B-4bit

# Interactive chat
docker model run hf.co/mlx-community/Qwen3-8B-4bit "Hello"

# List downloaded models
docker model list
```
Other models you can try:
```shell
docker model pull hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit
docker model pull hf.co/mlx-community/Mistral-7B-Instruct-v0.3-4bit
```
Enable TCP API
By default, the API is reachable only through the Docker socket. To enable HTTP access over TCP:
```shell
docker desktop enable model-runner --tcp=12434
```
Call the API
```shell
# List models
curl http://localhost:12434/engines/v1/models

# Chat completion
curl -X POST http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/mlx-community/Qwen3-8B-4bit",
    "messages": [
      {"role": "user", "content": "Explain Docker in one sentence"}
    ],
    "max_tokens": 256
  }'
```
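Because the endpoint is OpenAI-compatible, you can also request token-by-token output by adding `"stream": true` to the request body, in which case the server answers with Server-Sent Events. A minimal stdlib-only parser sketch for those `data:` lines (`parse_sse_chunks` is a hypothetical helper name; the chunk layout follows the standard OpenAI streaming schema, which this endpoint is assumed to match):

```python
import json

def parse_sse_chunks(lines):
    """Yield content deltas from OpenAI-style 'data: {...}' SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # sentinel marking the end of the stream
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta  # role-only or empty deltas carry no text
```

Feeding it the decoded lines of a streaming response and joining the yielded pieces reconstructs the full reply as it arrives.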
Nginx Reverse Proxy (LAN Access)
The TCP endpoint only listens on localhost. To expose it to other devices on your network:
```shell
docker run -d --name model-proxy \
  -p 0.0.0.0:12435:12435 \
  --add-host=host.docker.internal:host-gateway \
  nginx:alpine \
  sh -c 'echo "server { listen 12435; location / { proxy_pass http://host.docker.internal:12434; proxy_set_header Host \$host; proxy_buffering off; } }" > /etc/nginx/conf.d/default.conf && nginx -g "daemon off;"'
```
Test from another device: `curl http://192.168.x.x:12435/engines/v1/models`
Stop the proxy: `docker stop model-proxy && docker rm model-proxy`
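A small Python probe can confirm the proxy is reachable from another machine before you wire it into an app. A stdlib-only sketch (`models_url` and `probe` are hypothetical helper names; the port matches the proxy configuration above):

```python
import json
import urllib.error
import urllib.request

def models_url(host: str, port: int = 12435) -> str:
    """URL of the model list as exposed through the nginx proxy."""
    return f"http://{host}:{port}/engines/v1/models"

def probe(host: str, port: int = 12435, timeout: float = 3.0):
    """Return the list of model IDs, or None if the proxy is unreachable."""
    try:
        with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
            return [m["id"] for m in json.load(resp).get("data", [])]
    except (urllib.error.URLError, OSError, ValueError):
        return None
```

Call it with your Mac's LAN address, e.g. `probe("192.168.x.x")`; a `None` result usually means the proxy container is not running or the TCP API was never enabled.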
Path 2: Native vLLM Serve
Best for developers who want full control over configuration.
Install
```shell
# One-line install (creates ~/.venv-vllm-metal)
curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash

# Activate the virtual environment
source ~/.venv-vllm-metal/bin/activate
```
Start Model Service
```shell
vllm serve mlx-community/Qwen3-8B-4bit \
  --port 30000 \
  --served-model-name Qwen3-8B-4bit
```
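`vllm serve` has to download and load the model before the HTTP API starts answering, so scripts should wait for readiness rather than fire requests immediately. A stdlib-only polling sketch (`wait_until_ready` is a hypothetical helper; the `/models` path follows the OpenAI-compatible API used throughout this guide):

```python
import json
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url: str, timeout_s: float = 300.0,
                     interval_s: float = 2.0) -> bool:
    """Poll GET {base_url}/models until the server answers valid JSON."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                json.load(resp)  # a parseable body means the API is up
                return True
        except (urllib.error.URLError, OSError, ValueError):
            time.sleep(interval_s)  # not up yet; back off and retry
    return False
```

For the command above you would call `wait_until_ready("http://localhost:30000/v1")` before sending the first completion request.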
Call the API
```shell
curl -X POST http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-8B-4bit",
    "messages": [
      {"role": "user", "content": "What is vLLM?"}
    ],
    "max_tokens": 512
  }'
```
Environment Variables
```shell
export VLLM_METAL_MEMORY_FRACTION=0.7    # Memory usage fraction (0-1)
export VLLM_METAL_USE_PAGED_ATTENTION=1  # Experimental paged attention
export VLLM_METAL_DEBUG=1                # Debug logging
export VLLM_METAL_PREFIX_CACHE=1         # Prefix caching
```
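A launcher script that respects these knobs can validate them before starting the server. A small sketch for the memory fraction (`memory_fraction` is a hypothetical helper; it falls back to the 0.7 default shown above when the value is missing or out of range):

```python
import os

def memory_fraction(default: float = 0.7) -> float:
    """Read VLLM_METAL_MEMORY_FRACTION; fall back to `default` if unset or invalid."""
    raw = os.environ.get("VLLM_METAL_MEMORY_FRACTION")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default  # non-numeric value, e.g. a typo
    # The fraction must lie in (0, 1]: the share of unified memory vLLM may claim.
    return value if 0.0 < value <= 1.0 else default
```

Validating up front gives a clear failure mode instead of an out-of-memory crash mid-load.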
Architecture
```text
┌─────────────────────────────────────────────────────────────┐
│                         vLLM Core                           │
│        Engine, Scheduler, API Server, Tokenizer             │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│                  vllm_metal Plugin Layer                    │
│  ┌──────────────┐ ┌─────────────┐ ┌───────────────────┐     │
│  │MetalPlatform │ │ MetalWorker │ │ MetalModelRunner  │     │
│  └──────────────┘ └─────────────┘ └───────────────────┘     │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│                      Compute Backend                        │
│  ┌────────────────────┐ ┌───────────────────────────┐       │
│  │    MLX Backend     │ │     PyTorch Backend       │       │
│  │     (Primary)      │ │  (Model Loading/Interop)  │       │
│  └────────────────────┘ └───────────────────────────┘       │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│             Metal GPU / Apple Silicon UMA                   │
└─────────────────────────────────────────────────────────────┘
```
References
- vllm-project/vllm-metal
- Docker Model Runner Documentation
- Docker Model Runner + vLLM on macOS
- MLX Framework
- mlx-community Models