Infinirc
Local LLM Inference on Mac: vLLM Metal Quick Start Guide

A hands-on guide to running LLM inference on Apple Silicon using vLLM Metal — via Docker Model Runner or native vLLM serve.

Published on Tuesday, March 3, 2026


Two ways to run vLLM inference on Apple Silicon: via Docker Model Runner or native vLLM serve.

Prerequisites

Requirement       Minimum
Mac               macOS on Apple Silicon
Memory            16GB+ recommended (8B models need at least 8GB free)
Docker Desktop    4.62.0+ (only for Docker path)

What is vLLM Metal?

vLLM Metal is a vLLM plugin that enables Metal GPU-accelerated inference on Apple Silicon. It uses Apple's MLX framework as the compute backend, leveraging unified memory for zero-copy operations.

Key features:

  • MLX-Accelerated Inference: Faster than PyTorch MPS on Apple Silicon
  • Zero-Copy Operations: CPU and GPU share the same memory pool — no data transfers
  • Full vLLM Compatibility: OpenAI-compatible API out of the box
  • Rust High-Performance Extensions: Core BlockAllocator 355x faster than Python
  • Vision-Language Model Support: LLaVA and other multimodal models via mlx-vlm

Path 1: Docker Model Runner

Best for quick setup and Docker workflow integration.

Install the Backend

docker model install-runner --backend vllm --gpu metal

This pulls a self-contained Python 3.12 environment to ~/.docker/model-runner/vllm-metal/.

Note: vllm-metal runs natively on the host, not inside a container. Metal GPU requires direct hardware access.

Verify

docker model status

Pull and Run Models

# Pull an MLX model from Hugging Face
docker model pull hf.co/mlx-community/Qwen3-8B-4bit

# Interactive chat
docker model run hf.co/mlx-community/Qwen3-8B-4bit "Hello"

# List downloaded models
docker model list

Other models you can try:

docker model pull hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit
docker model pull hf.co/mlx-community/Mistral-7B-Instruct-v0.3-4bit

Enable TCP API

By default the API is only available via Docker socket. To enable HTTP access:

docker desktop enable model-runner --tcp=12434

Call the API

# List models
curl http://localhost:12434/engines/v1/models

# Chat completion
curl -X POST http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/mlx-community/Qwen3-8B-4bit",
    "messages": [
      {"role": "user", "content": "Explain Docker in one sentence"}
    ],
    "max_tokens": 256
  }'
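The same endpoint can be called from Python. A minimal sketch using only the standard library (it assumes the TCP API enabled above on port 12434, and that the model named here has already been pulled; `build_chat_payload` and `chat` are illustrative helpers, not part of any SDK):

```python
import json
import urllib.request

BASE_URL = "http://localhost:12434/engines/v1"  # Model Runner TCP endpoint

def build_chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, model: str = "hf.co/mlx-community/Qwen3-8B-4bit") -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# reply = chat("Explain Docker in one sentence")  # requires the runner to be up
```

Because the server speaks the OpenAI API, the official `openai` client should also work if you point its `base_url` at the same endpoint.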

Nginx Reverse Proxy (LAN Access)

The TCP endpoint only listens on localhost. To expose it to other devices on your network:

docker run -d --name model-proxy \
  -p 0.0.0.0:12435:12435 \
  --add-host=host.docker.internal:host-gateway \
  nginx:alpine \
  sh -c 'echo "server { listen 12435; location / { proxy_pass http://host.docker.internal:12434; proxy_set_header Host \$host; proxy_buffering off; } }" > /etc/nginx/conf.d/default.conf && nginx -g "daemon off;"'

Test: curl http://192.168.x.x:12435/engines/v1/models

Stop: docker stop model-proxy && docker rm model-proxy


Path 2: Native vLLM Serve

Best for developers who want full control over configuration.

Install

# One-line install (creates ~/.venv-vllm-metal)
curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm-metal/main/install.sh | bash

# Activate
source ~/.venv-vllm-metal/bin/activate

Start Model Service

vllm serve mlx-community/Qwen3-8B-4bit \
  --port 30000 \
  --served-model-name Qwen3-8B-4bit

Call the API

curl -X POST http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-8B-4bit",
    "messages": [
      {"role": "user", "content": "What is vLLM?"}
    ],
    "max_tokens": 512
  }'
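vLLM's OpenAI-compatible server also supports streaming: add `"stream": true` to the request body and the reply arrives as server-sent events, one `data: {...}` line per token chunk. A hedged sketch of a streaming client (endpoint and model name match the example above; `extract_delta` and `stream_chat` are illustrative helpers):

```python
import json
import urllib.request

def extract_delta(sse_line: str) -> str:
    """Return the text delta carried by one `data: {...}` SSE line, else ''."""
    line = sse_line.strip()
    if not line.startswith("data: ") or line == "data: [DONE]":
        return ""
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content") or ""

def stream_chat(prompt: str, model: str = "Qwen3-8B-4bit",
                url: str = "http://localhost:30000/v1/chat/completions"):
    """Yield reply text incrementally as the server streams it back."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "stream": True,
    }
    req = urllib.request.Request(
        url, data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            delta = extract_delta(raw.decode())
            if delta:
                yield delta

# for piece in stream_chat("What is vLLM?"):  # requires a running server
#     print(piece, end="", flush=True)
```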

Environment Variables

export VLLM_METAL_MEMORY_FRACTION=0.7    # Memory usage (0~1)
export VLLM_METAL_USE_PAGED_ATTENTION=1  # Experimental paged attention
export VLLM_METAL_DEBUG=1                # Debug logging
export VLLM_METAL_PREFIX_CACHE=1         # Prefix caching
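To pick a value for VLLM_METAL_MEMORY_FRACTION, it helps to work out the budget it implies. A back-of-the-envelope sketch (two assumptions of mine: the fraction applies to total unified memory, and 4-bit weights cost roughly 0.55 GB per billion parameters once quantization scales are included):

```python
def metal_budget_gb(total_gb: float, fraction: float = 0.7) -> float:
    """GPU memory budget implied by VLLM_METAL_MEMORY_FRACTION."""
    return total_gb * fraction

def fits(params_b: float, total_gb: float, fraction: float = 0.7,
         gb_per_b_params: float = 0.55) -> bool:
    """Rough check: do the 4-bit weights fit within the budget,
    leaving ~20% headroom for the KV cache and activations?
    gb_per_b_params is an assumption, not a measured figure."""
    weights_gb = params_b * gb_per_b_params
    return weights_gb < metal_budget_gb(total_gb, fraction) * 0.8

# On a 16GB Mac with the default 0.7 fraction the budget is about 11.2 GB,
# so an 8B 4-bit model (~4.4 GB of weights) fits with room for the KV cache.
```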

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        vLLM Core                            │
│           Engine, Scheduler, API Server, Tokenizer          │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│                   vllm_metal Plugin Layer                    │
│  ┌──────────────┐  ┌─────────────┐  ┌───────────────────┐  │
│  │MetalPlatform │  │ MetalWorker │  │ MetalModelRunner  │  │
│  └──────────────┘  └─────────────┘  └───────────────────┘  │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│                  Compute Backend                             │
│  ┌────────────────────┐      ┌───────────────────────────┐  │
│  │    MLX Backend     │      │    PyTorch Backend        │  │
│  │   (Primary)        │      │   (Model Loading/Interop) │  │
│  └────────────────────┘      └───────────────────────────┘  │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────┐
│              Metal GPU / Apple Silicon UMA                   │
└─────────────────────────────────────────────────────────────┘
