Infinirc
Local LLM Inference on Mac: vLLM Metal Quick Start Guide

A hands-on guide to running LLM inference on Apple Silicon using vLLM Metal — via Docker Model Runner or native vLLM serve.

Published on Tuesday, March 3, 2026


Two ways to run vLLM inference on Apple Silicon: via Docker Model Runner or native vLLM serve.

Prerequisites

Requirement       Minimum
Mac               macOS on Apple Silicon
Memory            16GB+ recommended (8B models need at least 8GB free)
Docker Desktop    4.62.0+ (only for Docker path)

What is vLLM Metal?

vLLM Metal is a vLLM plugin that enables Metal GPU-accelerated inference on Apple Silicon. It uses Apple's MLX framework as the compute backend, leveraging unified memory for zero-copy operations.

Key features:

  • MLX-Accelerated Inference: Faster than PyTorch MPS on Apple Silicon
  • Zero-Copy Operations: CPU and GPU share the same memory pool — no data transfers
  • Full vLLM Compatibility: OpenAI-compatible API out of the box
  • Rust High-Performance Extensions: Core BlockAllocator 355x faster than Python
  • Vision-Language Model Support: LLaVA and other multimodal models via mlx-vlm

Path 1: Docker Model Runner

Best for quick setup and Docker workflow integration.

Install the Backend

docker model install-runner --backend vllm --gpu metal

This pulls a self-contained Python 3.12 environment to ~/.docker/model-runner/vllm-metal/.

Note: vllm-metal runs natively on the host, not inside a container. Metal GPU requires direct hardware access.

Verify

docker model status

Pull and Run Models

# Pull an MLX model from Hugging Face
docker model pull hf.co/mlx-community/Qwen3-8B-4bit

# Interactive chat
docker model run hf.co/mlx-community/Qwen3-8B-4bit "Hello"

# List downloaded models
docker model list

Other models you can try:

docker model pull hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit
docker model pull hf.co/mlx-community/Mistral-7B-Instruct-v0.3-4bit

Enable TCP API

By default the API is only available via Docker socket. To enable HTTP access:

docker desktop enable model-runner --tcp=12434

Call the API

# List models
curl http://localhost:12434/engines/v1/models

# Chat completion
curl -X POST http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/mlx-community/Qwen3-8B-4bit",
    "messages": [
      {"role": "user", "content": "Explain Docker in one sentence"}
    ],
    "max_tokens": 256
  }'
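The same endpoint can be called from Python. A minimal sketch using only the standard library (it assumes the TCP API enabled above on port 12434, and that the model named here has already been pulled; `build_chat_payload` and `chat` are illustrative helpers, not part of any SDK):

```python
import json
import urllib.request

BASE_URL = "http://localhost:12434/engines/v1"  # Model Runner TCP endpoint

def build_chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, model: str = "hf.co/mlx-community/Qwen3-8B-4bit") -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# reply = chat("Explain Docker in one sentence")  # requires the runner to be up
```

Because the server speaks the OpenAI API, the official `openai` client should also work if you point its `base_url` at the same endpoint.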

Nginx Reverse Proxy (LAN Access)

The TCP endpoint only listens on localhost. To expose it to other devices on your network:

docker run -d --name model-proxy \
  -p 0.0.0.0:12435:12435 \
  --add-host=host.docker.internal:host-gateway \
  nginx:alpine \
  sh -c 'echo "server { listen 12435; location / { proxy_pass http://host.docker.internal:12434; proxy_set_header Host \$host; proxy_buffering off; } }" > /etc/nginx/conf.d/default.conf && nginx -g "daemon off;"'

Test: curl http://192.168.x.x:12435/engines/v1/models

Stop: docker stop model-proxy && docker rm model-proxy


Path 2: Native vLLM Serve

Best for developers who want full control over configuration.

Install

# One-line install (creates ~/.venv-vllm-metal)
curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash

# Activate
source ~/.venv-vllm-metal/bin/activate

Start Model Service

vllm serve mlx-community/Qwen3-8B-4bit \
  --port 30000 \
  --served-model-name Qwen3-8B-4bit

Call the API

curl -X POST http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-8B-4bit",
    "messages": [
      {"role": "user", "content": "What is vLLM?"}
    ],
    "max_tokens": 512
  }'
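vLLM's OpenAI-compatible server also supports streaming: add `"stream": true` to the request body and the reply arrives as server-sent events, one `data: {...}` line per token chunk. A hedged sketch of a streaming client (endpoint and model name match the example above; `extract_delta` and `stream_chat` are illustrative helpers):

```python
import json
import urllib.request

def extract_delta(sse_line: str) -> str:
    """Return the text delta carried by one `data: {...}` SSE line, else ''."""
    line = sse_line.strip()
    if not line.startswith("data: ") or line == "data: [DONE]":
        return ""
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content") or ""

def stream_chat(prompt: str, model: str = "Qwen3-8B-4bit",
                url: str = "http://localhost:30000/v1/chat/completions"):
    """Yield reply text incrementally as the server streams it back."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "stream": True,
    }
    req = urllib.request.Request(
        url, data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            delta = extract_delta(raw.decode())
            if delta:
                yield delta

# for piece in stream_chat("What is vLLM?"):  # requires a running server
#     print(piece, end="", flush=True)
```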

Environment Variables

export VLLM_METAL_MEMORY_FRACTION=0.7    # Memory usage (0~1)
export VLLM_METAL_USE_PAGED_ATTENTION=1  # Experimental paged attention
export VLLM_METAL_DEBUG=1                # Debug logging
export VLLM_METAL_PREFIX_CACHE=1         # Prefix caching
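To pick a value for VLLM_METAL_MEMORY_FRACTION, it helps to work out the budget it implies. A back-of-the-envelope sketch (two assumptions of mine: the fraction applies to total unified memory, and 4-bit weights cost roughly 0.55 GB per billion parameters once quantization scales are included):

```python
def metal_budget_gb(total_gb: float, fraction: float = 0.7) -> float:
    """GPU memory budget implied by VLLM_METAL_MEMORY_FRACTION."""
    return total_gb * fraction

def fits(params_b: float, total_gb: float, fraction: float = 0.7,
         gb_per_b_params: float = 0.55) -> bool:
    """Rough check: do the 4-bit weights fit within the budget,
    leaving ~20% headroom for the KV cache and activations?
    gb_per_b_params is an assumption, not a measured figure."""
    weights_gb = params_b * gb_per_b_params
    return weights_gb < metal_budget_gb(total_gb, fraction) * 0.8

# On a 16GB Mac with the default 0.7 fraction the budget is about 11.2 GB,
# so an 8B 4-bit model (~4.4 GB of weights) fits with room for the KV cache.
```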

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        vLLM Core                            │
│           Engine, Scheduler, API Server, Tokenizer          │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│                   vllm_metal Plugin Layer                    │
│  ┌──────────────┐  ┌─────────────┐  ┌───────────────────┐  │
│  │MetalPlatform │  │ MetalWorker │  │ MetalModelRunner  │  │
│  └──────────────┘  └─────────────┘  └───────────────────┘  │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│                  Compute Backend                             │
│  ┌────────────────────┐      ┌───────────────────────────┐  │
│  │    MLX Backend     │      │    PyTorch Backend        │  │
│  │   (Primary)        │      │   (Model Loading/Interop) │  │
│  └────────────────────┘      └───────────────────────────┘  │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│              Metal GPU / Apple Silicon UMA                   │
└─────────────────────────────────────────────────────────────┘
