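If a company does not want to send its data to a third-party API, or needs to cut costs, deploying an LLM locally is the natural route.
This article addresses the following problem.
With plain Hugging Face Transformers inference in a naive serving loop, every request has to:
1. Recompute the keys and values (the KV cache) for every token in its prompt, because nothing is reused across requests
2. Live with GPU memory that fills up with fragmented KV cache
3. Wait its turn, because only one request is processed at a time (serial generation)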
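```python
# Traditional Hugging Face-style greedy decoding (illustrative sketch, batch size 1)
# Limitation: the KV cache lives only inside one call, so every new request
# recomputes the keys and values for its entire prompt.
import torch

@torch.no_grad()
def generate_huggingface(model, prompt_token_ids, max_new_tokens=128, eos_token_id=None):
    """
    Each call has to:
    1. Embed the prompt tokens and run a full prefill pass
    2. Generate token by token; every step
       a. attends over all previous tokens
       b. appends new entries to the KV cache (one contiguous tensor per request)
    3. Stall when GPU memory runs out
    """
    input_ids = prompt_token_ids        # shape: [1, prompt_len]
    past_key_values = None              # always None at the start of a request
    for _ in range(max_new_tokens):
        # The first step feeds the whole prompt (prefill); later steps feed only the newest token
        step_input = input_ids if past_key_values is None else input_ids[:, -1:]
        outputs = model(
            input_ids=step_input,
            past_key_values=past_key_values,
            use_cache=True,
        )
        # The cache grows linearly: one more position per generated token
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if eos_token_id is not None and next_token.item() == eos_token_id:
            break
    return input_ids
```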
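vLLM makes two key optimizations:
PagedAttention: it manages the KV cache the way an operating system manages paged memory. GPU memory is divided into fixed-size "pages" that are allocated on demand and freed page by page, which cuts the share of wasted KV cache memory from roughly 60% to about 5%.
Continuous Batching: as soon as one request emits its EOS token, a new request takes its slot, instead of waiting for every request in the batch to finish before the next batch starts. This can raise GPU utilization from around 30% to 90% or more.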
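To make the scheduling difference concrete, here is a minimal simulation, not vLLM's actual scheduler. It counts decode steps under the simplifying assumption that every active request emits one token per step. The function names and the workload are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch must fully finish before the next one starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch runs until its longest request finishes.
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps

# Hypothetical workload: output lengths (in tokens) of six requests, two batch slots.
lengths = [10, 2, 8, 3, 9, 1]
print(static_batching_steps(lengths, batch_size=2))      # 27
print(continuous_batching_steps(lengths, batch_size=2))  # 19
```

With static batching, short requests sit idle until the longest one in their batch finishes. Refilling slots immediately removes most of that idle time.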
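```python
# How vLLM manages the KV cache internally (simplified illustration, not vLLM's real code)
class PagedAttention:
    """
    The core of vLLM: treat the KV cache like virtual-memory pages.

    Traditional approach (Transformers): the KV cache is one contiguous tensor
        [batch, num_heads, seq_len, head_dim]
    Problem: requests differ in length, so GPU memory fragments badly.

    PagedAttention approach:
        The KV cache is split into fixed-size blocks (for example 16 tokens each).
        Each request is given blocks on demand, and blocks can be shared.
    """
    def __init__(self, block_size=16, num_gpu_blocks=1024):
        self.block_size = block_size
        # Pool of free physical block ids (stands in for real GPU pages)
        self.free_blocks = list(range(num_gpu_blocks))
        # Allocation table for all KV cache blocks
        self.block_manager = {}

    def allocate(self, num_tokens):
        """Allocate KV blocks on demand: only as many as are needed."""
        num_blocks = (num_tokens + self.block_size - 1) // self.block_size  # ceiling division
        if num_blocks > len(self.free_blocks):
            raise MemoryError("not enough free KV blocks")
        return [self.free_blocks.pop() for _ in range(num_blocks)]

    def attention(self, query, key, value, block_mapping):
        """Run attention over physically scattered blocks."""
        # Gather the scattered physical blocks into a logically contiguous sequence,
        # then run standard scaled dot-product attention.
        ...
```

| Metric | Hugging Face | vLLM |
|------|-------------|------|
| Throughput (Qwen-7B) | 1 req/s | 12 req/s |
| GPU memory utilization | ~40% | ~95% |
| Time to first token (P99) | 2.3s | 0.8s |
| Concurrency (7B, 24 GB GPU) | 2 concurrent requests | 20 concurrent requests |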
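The fragmentation argument behind PagedAttention can be checked with a back-of-the-envelope calculation. The sketch below is a simplified model, not a measurement. It compares reserving a contiguous maximum-length buffer for every request against handing out 16-token blocks on demand. The request lengths and the 2048-token reservation are hypothetical.

```python
import math

def contiguous_waste(seq_lens, max_len):
    """Fraction of reserved KV slots left unused when every request
    reserves max_len contiguous token slots up front."""
    reserved = len(seq_lens) * max_len
    return 1 - sum(seq_lens) / reserved

def paged_waste(seq_lens, block_size=16):
    """Fraction of allocated KV slots left unused when memory is handed out
    in fixed-size blocks; only a request's last block can be partly empty."""
    allocated = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / allocated

# Hypothetical request lengths in tokens.
seq_lens = [100, 700, 250, 1500]
print(f"contiguous: {contiguous_waste(seq_lens, max_len=2048):.1%}")
print(f"paged:      {paged_waste(seq_lens):.1%}")
```

For this workload the contiguous scheme leaves about 69% of the reserved slots unused, while block allocation wastes about 1%.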
```bash
# ===== Step 1: check the CUDA version =====
nvcc --version
# Example output: Cuda compilation tools, release 12.1, V12.1.105
# If you are on 11.8, replace cu121 with cu118 in the install command below
```
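If you hit a "No module named _cuda" error:
```bash
# Cause: vLLM could not find CUDA during installation
# Fix: set the environment variables, then reinstall
export CUDA_HOME=/usr/local/cuda-12.1
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
pip uninstall vllm -y
# Pull the PyTorch wheels built for CUDA 12.1 (use cu118 for CUDA 11.8)
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
```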
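2.3 Installing with Docker (simpler)
```bash
# Use NVIDIA's official CUDA image to avoid CUDA version problems
docker pull nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
# Run the container (requires nvidia-container-toolkit on the host)
docker run --gpus all \
    --name vllm-server \
    -it \
    -p 8000:8000 \
    -v /data/models:/root/.cache/huggingface \
    -v /data/vllm:/tmp/vllm \
    nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04 \
    bash
# Inside the container: the CUDA base image ships without Python, so install pip first
apt-get update && apt-get install -y python3-pip
# Then install vLLM
pip install vllm==0.4.0
```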
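3. Starting the vLLM inference server (command line)
3.1 Basic launch command
```bash
# Download the model (Qwen2-7B as the example)
# Model sources: Hugging Face (some models require an access request) or ModelScope
# Here we use hf-mirror.com, a Hugging Face mirror that is faster from mainland China
export HF_ENDPOINT=https://hf-mirror.com
# Skip this step if you have already downloaded the model
# The first download needs access to huggingface.co or a mirror; the command below downloads automatically
huggingface-cli download Qwen/Qwen2-7B-Instruct --local-dir /data/models/Qwen2-7B-Instruct

# Start the vLLM inference server (OpenAI-compatible API)
#   --dtype half                   half precision, saves GPU memory
#   --gpu-memory-utilization 0.85  vLLM may use 85% of GPU memory (weights + KV cache); keep the rest as headroom
#   --max-model-len 8192           maximum context length, adjust to your needs
#   --tensor-parallel-size 1       1 for a single GPU; 2/4/8 for multiple GPUs
# (Comments cannot follow a trailing backslash, so they are listed here.)
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2-7B-Instruct \
    --tokenizer /data/models/Qwen2-7B-Instruct \
    --dtype half \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --tensor-parallel-size 1 \
    --port 8000 \
    --host 0.0.0.0
```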
Output after a successful start:
```
INFO: Started server process [...]
INFO: Uvicorn running on http://0.0.0.0:8000
```
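3.2 Common launch parameters
| Parameter | Description | Suggested value |
|------|------|--------|
| --model | Model path or Hugging Face model ID | Required |
| --dtype | Precision: half=FP16, float16=FP16, bfloat16=BF16 | half |
| --gpu-memory-utilization | Fraction of GPU memory vLLM may use (weights, activations, and KV cache) | 0.85 (single GPU), 0.75 (multi-GPU) |
| --max-model-len | Maximum context length (tokens) | 8192 or 32768 (depends on GPU memory) |
| --tensor-parallel-size | Tensor parallel degree = number of GPUs | 1 (single GPU), 2/4/8 (multi-GPU) |
| --port | HTTP port | 8000 |
| --max-num-seqs | Maximum number of sequences batched together per iteration | 256 (default); lower it if memory is tight |
| --enforce-eager | Disable CUDA Graphs (better compatibility, slightly slower) | Add when you hit errors |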
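To see how `--gpu-memory-utilization` and `--max-model-len` interact, it helps to estimate how many tokens of KV cache fit on a card. The sketch below is a rough estimate under stated assumptions, not how vLLM profiles memory at startup. The architecture values (28 layers, 4 KV heads, head dimension 128) and the 1 GiB overhead are assumptions; read the real values from the model's `config.json`.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache for one token: one key and one value vector
    per layer and per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_token_budget(gpu_gib, utilization, weights_gib, per_token_bytes, overhead_gib=1.0):
    """Rough number of tokens whose KV cache fits after weights and overhead."""
    budget_gib = gpu_gib * utilization - weights_gib - overhead_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * 1024**3 // per_token_bytes)

# Assumed architecture values; check config.json for your model.
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)
tokens = kv_token_budget(gpu_gib=24, utilization=0.85, weights_gib=14,
                         per_token_bytes=per_token)
print(per_token)       # 57344 bytes, i.e. 56 KiB per token
print(tokens)          # total tokens of KV cache that fit
print(tokens // 8192)  # how many full 8192-token sequences fit at once
```

Under these assumptions, a 24 GB card at 0.85 utilization holds about 100,000 tokens of KV cache, enough for roughly a dozen full-length 8192-token sequences. Shorter requests allow proportionally more concurrency.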
3.3 Multi-GPU launch
```bash
# 4-way tensor parallelism (Qwen2-72B as the example; a 72B model needs multiple GPUs)
# --tensor-parallel-size 4 means use 4 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2-72B-Instruct \
    --dtype half \
    --gpu-memory-utilization 0.75 \
    --max-model-len 4096 \
    --tensor-parallel-size 4 \
    --port 8000
```
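How tensor parallelism works: the weight matrices inside each layer are split across the GPUs, and the GPUs cooperate on every forward pass by exchanging partial results. Each GPU stores only about 1/N of the weights, whereas data parallelism keeps a full copy of the model on every GPU, so tensor parallelism is the better fit for models too large for a single card.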
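The idea can be shown with a toy column-parallel matrix multiplication in plain Python. This is a conceptual sketch, not vLLM's implementation: each "GPU" is just a list slice, and the concatenation stands in for the all-gather a real system performs over NCCL. All function names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, parts):
    """Split weight matrix w into `parts` equal column shards, one per GPU."""
    n = len(w[0])
    assert n % parts == 0, "columns must divide evenly across GPUs"
    width = n // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]

def column_parallel_matmul(x, w, parts):
    """Each 'GPU' multiplies x by its own shard; the partial outputs are
    then concatenated column-wise (the all-gather step)."""
    partial = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partial), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]              # activations: 2 tokens, hidden size 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # one layer's weight matrix: 2 x 4
assert column_parallel_matmul(x, w, parts=2) == matmul(x, w)
print(matmul(x, w))               # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

Each shard holds half of the weight columns, yet the gathered result is identical to the single-device product.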
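4. FastAPI wrapper: an OpenAI-compatible API
4.1 Why wrap vLLM in FastAPI
vLLM already ships an OpenAI-compatible API (/v1/chat/completions, /v1/completions), but a production environment usually also needs:
- Authentication (API key)
- Request logging and auditing
- Rate limiting
- Custom business logic
- Monitoring and metrics
So we put a FastAPI proxy layer in front of vLLM:
```
Client request
↓
FastAPI proxy layer (auth + rate limiting + logging + custom logic)
↓
vLLM inference service
↓
Response
```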
4.2 Full code
```python
# file: vllm_proxy.py
"""
Proxy for the vLLM inference service: FastAPI + OpenAI-compatible interface
Features:
1. API key authentication
2. Rate limiting (per client IP)
3. Request logging
4. Error handling and retries
5. A single place to add custom logic such as prompt templates (not implemented here)
"""
from fastapi import FastAPI, HTTPException, Header, Request, Depends
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import Optional, List, Dict
import httpx
import time
import os
import json
from datetime import datetime
from collections import defaultdict
import asyncio

# ============ Configuration ============
VLLM_BASE_URL = os.getenv("VLLM_BASE_URL", "http://localhost:8000/v1")
# vLLM serves /health at the server root, not under /v1
VLLM_ROOT_URL = VLLM_BASE_URL.rsplit("/v1", 1)[0]
API_KEY = os.getenv("API_KEY", "your-secret-api-key-here")  # must be changed in production
RATE_LIMIT = 60   # at most 60 requests per minute
RATE_WINDOW = 60  # sliding window of 60 seconds

# ============ FastAPI setup ============
app = FastAPI(title="LLM Proxy API", version="1.0.0")
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # restrict to specific domains in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# ============ Rate limiting ============
class RateLimiter:
    """Sliding-window rate limiter (in memory, so each uvicorn worker counts separately)."""
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.requests: Dict[str, List[float]] = defaultdict(list)

    async def check(self, key: str) -> bool:
        """Return True if the request is allowed, False if the limit is exceeded."""
        now = time.time()
        # Drop expired timestamps
        self.requests[key] = [
            t for t in self.requests[key]
            if now - t < self.window_seconds
        ]
        if len(self.requests[key]) >= self.max_requests:
            return False
        self.requests[key].append(now)
        return True

rate_limiter = RateLimiter(max_requests=RATE_LIMIT, window_seconds=RATE_WINDOW)

# ============ Data models ============
class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: float = 0.7
    max_tokens: int = 2048
    stream: bool = False
    stop: Optional[List[str]] = None
    top_p: float = 1.0
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0

# ============ Dependencies ============
async def verify_api_key(x_api_key: str = Header(None)):
    """Validate the API key sent in the X-API-Key header."""
    if x_api_key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API Key")
    return x_api_key

async def check_rate_limit(request: Request):
    """Enforce the rate limit (keyed by client IP)."""
    client_ip = request.client.host
    if not await rate_limiter.check(client_ip):
        raise HTTPException(
            status_code=429,
            detail=f"Rate limit exceeded. Max {RATE_LIMIT} requests per minute."
        )

# ============ Logging middleware ============
@app.middleware("http")
async def log_requests(request: Request, call_next):
    """Log details of every LLM request."""
    start_time = time.time()
    # Only log the LLM endpoints
    if "/chat" in request.url.path or "/completions" in request.url.path:
        log_data = {
            "timestamp": datetime.now().isoformat(),
            "method": request.method,
            "path": str(request.url.path),
            "client_ip": request.client.host,
            "user_agent": request.headers.get("user-agent", ""),
        }
        # Read the request body when there is one
        if request.method == "POST":
            body = await request.body()
            if body:
                try:
                    body_json = json.loads(body)
                    # Limit exposure: log only a truncated preview
                    messages = body_json.get("messages", [])
                    log_data["num_messages"] = len(messages)
                    if messages:
                        last_msg = messages[-1]["content"][:100]
                        log_data["last_message_preview"] = last_msg
                except Exception:
                    # A malformed body is rejected later by validation; skip the preview
                    pass
        response = await call_next(request)
        log_data["duration_ms"] = round((time.time() - start_time) * 1000, 1)
        log_data["status_code"] = response.status_code
        # Print the log line (in production, write to a file or a logging service)
        print(f"[LLM Request] {json.dumps(log_data)}")
        return response
    return await call_next(request)

# ============ Core API endpoint ============
@app.post("/v1/chat/completions")
async def chat_completions(
    request_data: ChatCompletionRequest,
    request: Request,
    _: str = Depends(verify_api_key),
    __: None = Depends(check_rate_limit),
):
    """
    OpenAI-compatible Chat Completions endpoint.
    Passes the request through to vLLM, adding retries and error handling.
    """
    if request_data.stream:
        # This simple pass-through returns one JSON body and cannot relay SSE streams
        raise HTTPException(status_code=400, detail="stream=true is not supported by this proxy")
    # Build the vLLM request body
    vllm_payload = {
        "model": request_data.model,
        "messages": [m.dict() for m in request_data.messages],
        "temperature": request_data.temperature,
        "max_tokens": request_data.max_tokens,
        "stream": request_data.stream,
        "stop": request_data.stop,
        "top_p": request_data.top_p,
        "frequency_penalty": request_data.frequency_penalty,
        "presence_penalty": request_data.presence_penalty,
    }
    # Call vLLM (with retries)
    async with httpx.AsyncClient(timeout=120.0) as client:
        for attempt in range(3):
            try:
                response = await client.post(
                    f"{VLLM_BASE_URL}/chat/completions",
                    json=vllm_payload
                )
                if response.status_code == 200:
                    return response.json()
                elif response.status_code == 429:
                    # Throttled upstream: back off exponentially, then retry
                    await asyncio.sleep(2 ** attempt)
                    continue
                else:
                    raise HTTPException(
                        status_code=response.status_code,
                        detail=f"vLLM error: {response.text}"
                    )
            except httpx.TimeoutException:
                if attempt == 2:
                    raise HTTPException(status_code=504, detail="vLLM timeout")
                await asyncio.sleep(1)
                continue
    raise HTTPException(status_code=503, detail="vLLM service unavailable")

# ============ Health check ============
@app.get("/health")
async def health_check():
    """Health check (for load balancer probes)."""
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            resp = await client.get(f"{VLLM_ROOT_URL}/health")
        if resp.status_code == 200:
            return {"status": "healthy", "vllm": "ok"}
        return {"status": "degraded", "vllm": f"status {resp.status_code}"}
    except httpx.HTTPError:
        return {"status": "degraded", "vllm": "unreachable"}

@app.get("/metrics")
async def metrics():
    """Minimal monitoring endpoint (use Prometheus in production)."""
    return {
        "timestamp": datetime.now().isoformat(),
        "rate_limit": {
            "max_per_minute": RATE_LIMIT,
        },
        "vllm_endpoint": VLLM_BASE_URL,
    }

# ============ Entry point ============
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(
        "vllm_proxy:app",
        host="0.0.0.0",
        port=8001,   # the proxy listens on 8001, vLLM on 8000
        workers=4,   # 4 worker processes
        log_level="info"
    )
```
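The sliding-window idea behind the proxy's rate limiter is easier to verify in isolation. The following is a small standalone variant, not the proxy's class, that takes an injectable clock so its behavior is deterministic. The class name and the fake clock are illustrative assumptions.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Sliding-window limiter with an injectable clock (test-friendly variant)."""
    def __init__(self, max_requests, window_seconds, clock):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # key -> timestamps still inside the window

    def allow(self, key):
        now = self.clock()
        hits = self.hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

# A fake clock makes the behavior deterministic.
fake_now = [0.0]
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60,
                               clock=lambda: fake_now[0])

results = [limiter.allow("10.0.0.1") for _ in range(4)]  # four calls at t=0
fake_now[0] = 61.0                                       # the window has passed
results.append(limiter.allow("10.0.0.1"))
print(results)  # [True, True, True, False, True]
```

Because the state lives in process memory, running several workers gives each worker its own counters. A shared store such as Redis would be needed to enforce one global limit.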
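4.3 Running the FastAPI proxy
```bash
# Start vLLM first (in the background; the trailing & does that)
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2-7B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.85 &
# Wait until vLLM reports healthy (loading the model can take well over 10 seconds)
until curl -sf http://localhost:8000/health > /dev/null; do sleep 2; done
# Then start the FastAPI proxy
export VLLM_BASE_URL=http://localhost:8000/v1
export API_KEY=my-secret-key-2024
python vllm_proxy.py
```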
4.4 Testing the API
```bash
# Test the health check
curl http://localhost:8001/health
# Test the chat endpoint (OpenAI-compatible request body)
# "model" must match the name vLLM serves, which defaults to the --model value
curl -X POST http://localhost:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "X-API-Key: my-secret-key-2024" \
    -d '{
        "model": "/data/models/Qwen2-7B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a technical assistant for 投肯智能"},
            {"role": "user", "content": "Explain how PagedAttention works in vLLM"}
        ],
        "temperature": 0.7,
        "max_tokens": 500
    }'
```
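The same call can be made from Python with only the standard library. This is a minimal sketch that assumes the proxy from section 4.2 is listening on port 8001 and that vLLM serves the model under its path, which is the default when no served name is configured. The helper names `build_chat_request` and `send` are hypothetical.

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, model, user_message, max_tokens=500):
    """Build (but do not send) a POST request for the proxy's chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0.7,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )

def send(request, timeout=120):
    """Send the request and return the first choice's text (needs a running proxy)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

req = build_chat_request(
    base_url="http://localhost:8001",
    api_key="my-secret-key-2024",
    model="/data/models/Qwen2-7B-Instruct",  # assumed served model name
    user_message="Explain how PagedAttention works in vLLM",
)
print(req.full_url, req.get_method())
# reply = send(req)  # uncomment once vLLM and the proxy are both running
```

Note that the proxy authenticates with the `X-API-Key` header. The official OpenAI SDK sends `Authorization: Bearer ...` instead, so it would need a custom header or a small change to `verify_api_key`.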
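5. Common errors and troubleshooting
Error 1: CUDA out of memory
Cause: not enough GPU memory, because the model is too large, the batch is too large, or the context is too long.
Troubleshooting steps:
```bash
# 1. Check GPU memory usage
nvidia-smi
# Look at the "Memory-Usage" column: if used + reserved > total capacity, you get an OOM

# 2. Check the model size against available memory (weights only, before KV cache)
# 7B model in FP16: about 14 GB
# 14B model in FP16: about 28 GB
# 72B model in FP16: about 144 GB (needs multiple GPUs)

# 3. Lower --gpu-memory-utilization if other processes share the GPU
# From 0.9 to 0.8, then to 0.7

# 4. Lower --max-model-len
# Going from 8192 to 4096 halves the KV cache a single sequence can need

# 5. Check whether other processes are holding GPU memory
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```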
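Solutions:
```bash
# Option A: reduce memory usage
#   --gpu-memory-utilization 0.7  lower to 70%
#   --max-model-len 4096          shorter context
#   --enforce-eager               disable CUDA Graphs (saves some memory, more stable)
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2-7B-Instruct \
    --gpu-memory-utilization 0.7 \
    --max-model-len 4096 \
    --enforce-eager

# Option B: use a quantized model (slight accuracy loss, much less memory)
# AWQ 4-bit quantization: a 7B model drops from about 14 GB to roughly 5-7 GB
# autoawq is only needed if you quantize the model yourself
pip install autoawq
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2-7B-Instruct-AWQ \
    --quantization awq \
    --dtype half
```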
Error 2: vLLM fails at startup with "AssertionError: tensor parallel size must be less than or equal to..."
Cause: --tensor-parallel-size is larger than the number of GPUs actually available.
Fix:
```bash
# Count the GPUs
nvidia-smi --query-gpu=gpu_name --format=csv,noheader | wc -l
# Set a valid parallel degree (with only 2 GPUs you cannot use 4)
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2-7B-Instruct \
    --tensor-parallel-size 2
```
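A quick pre-flight check can catch this before launch. The sketch below is a hypothetical helper, not part of vLLM. Besides the GPU-count limit described above, it also checks that the attention head count divides evenly by the parallel degree, a second constraint vLLM enforces at startup. The value of 28 attention heads is an assumption for the example; read the real number from the model's `config.json`.

```python
def check_tensor_parallel(tp_size, num_gpus, num_attention_heads):
    """Return a list of problems with a proposed --tensor-parallel-size
    (an empty list means the value looks fine)."""
    problems = []
    if tp_size < 1:
        problems.append("tensor parallel size must be at least 1")
    if tp_size > num_gpus:
        problems.append(f"requested {tp_size} GPUs but only {num_gpus} available")
    if tp_size >= 1 and num_attention_heads % tp_size != 0:
        problems.append(
            f"{num_attention_heads} attention heads not divisible by {tp_size}"
        )
    return problems

# Assumed values: 2 GPUs, and a model with 28 attention heads.
print(check_tensor_parallel(4, num_gpus=2, num_attention_heads=28))
print(check_tensor_parallel(2, num_gpus=2, num_attention_heads=28))  # []
```

With 28 heads, a degree of 8 would be rejected even on an 8-GPU machine, because 28 is not divisible by 8.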
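Error 3: inference is slow, with the first token taking 10 seconds or more
Cause: often a very long context, since the whole prompt has to be processed before the first token appears; a long queue of waiting requests has the same effect.
Diagnosis:
```bash
# See how many requests are being processed right now
# The vLLM log periodically reports running and pending request counts and KV cache usage
# Check whether the context is too long
# As an experiment, start with --enforce-eager and compare
# (it disables CUDA Graphs: slower, but more stable)
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2-7B-Instruct \
    --enforce-eager
```
Fix:
```bash
# Run with CUDA Graphs enabled (faster, but RTX 30-series cards may have compatibility issues)
# CUDA Graphs are on by default, so simply leave out --enforce-eager
# --disable-log-requests reduces logging I/O overhead
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2-7B-Instruct \
    --disable-log-requests
```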
Error 4: model loading fails with "Error in loading HuggingFace tokenizer"
Cause: the tokenizer path is wrong, or the tokenizer files are corrupted.
Fix:
```bash
# Check the model directory layout
ls -la /data/models/Qwen2-7B-Instruct/
# It should contain: config.json, tokenizer.json, tokenizer_config.json, *.safetensors, and so on
# If the tokenizer is missing, download it manually
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download Qwen/Qwen2-7B-Instruct tokenizer.json --local-dir /data/models/Qwen2-7B-Instruct
# Then point vLLM at the tokenizer path explicitly
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2-7B-Instruct \
    --tokenizer /data/models/Qwen2-7B-Instruct
```
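Error 5: httpx times out when calling vLLM
Cause: vLLM inference is too slow, or the vLLM process has died.
```python
import httpx
from fastapi import HTTPException

# Catch the timeout in the proxy and return a meaningful error
# (the helper name call_vllm is illustrative)
async def call_vllm(url: str, payload: dict) -> dict:
    async with httpx.AsyncClient(timeout=120.0) as client:
        try:
            response = await client.post(url, json=payload)
        except httpx.TimeoutException:
            raise HTTPException(
                status_code=504,
                detail="LLM inference timeout. Please try again or reduce your prompt length."
            )
    return response.json()
```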
6. Full production deployment script
```bash
#!/bin/bash
# file: deploy_vllm.sh
# Production deployment script
set -e
MODEL_PATH="/data/models/Qwen2-7B-Instruct"
PORT_VLLM=8000
PORT_PROXY=8001   # must match the port hardcoded in vllm_proxy.py
LOG_DIR="/var/log/vllm"
MAX_MODEL_LEN=8192
GPU_MEM_UTIL=0.85
sudo mkdir -p "$LOG_DIR"

# Run vLLM under systemd so it is restarted automatically
sudo tee /etc/systemd/system/vllm.service > /dev/null <<EOF
[Service]
Type=simple
User=root
WorkingDirectory=/root
ExecStart=/root/miniconda3/envs/vllm/bin/python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --port $PORT_VLLM \
    --gpu-memory-utilization $GPU_MEM_UTIL \
    --max-model-len $MAX_MODEL_LEN \
    --dtype half \
    --tensor-parallel-size 1
Restart=always
RestartSec=10
StandardOutput=append:$LOG_DIR/vllm.log
StandardError=append:$LOG_DIR/vllm_error.log
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm

# Check that the process came up (the model may still be loading at this point)
sleep 5
if systemctl is-active --quiet vllm; then
    echo "vLLM started successfully"
else
    echo "vLLM failed to start. Check logs:"
    tail -50 "$LOG_DIR/vllm_error.log"
    exit 1
fi

# Run the FastAPI proxy under systemd as well
sudo tee /etc/systemd/system/vllm-proxy.service > /dev/null <<EOF
[Service]
Type=simple
User=root
WorkingDirectory=/root
ExecStart=/root/miniconda3/envs/vllm/bin/python /root/vllm_proxy.py
Restart=always
RestartSec=10
Environment="VLLM_BASE_URL=http://localhost:${PORT_VLLM}/v1"
Environment="API_KEY=change-this-in-production"
StandardOutput=append:$LOG_DIR/proxy.log
StandardError=append:$LOG_DIR/proxy_error.log
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable vllm-proxy
sudo systemctl start vllm-proxy
echo "Deployment complete!"
echo "vLLM API: http://localhost:$PORT_VLLM/v1"
echo "Proxy API: http://localhost:$PORT_PROXY/v1"
```
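Summary
1. Why vLLM is fast: PagedAttention (paged management of GPU memory) plus continuous batching (GPU utilization from about 30% to 90%)
2. Installation: the CUDA version must match; after installing, verify with `import vllm`
3. Not enough GPU memory: lower `--gpu-memory-utilization`, lower `--max-model-len`, or switch to a quantized model
4. Multi-GPU: `--tensor-parallel-size=N`, where N must be ≤ the number of GPUs
5. FastAPI proxy: put a proxy layer in front of vLLM for authentication, rate limiting, logging, and monitoring
6. Production: manage the processes with systemd, with automatic restarts, and set up log rotation for the log files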
> Next up: "Building a Local Knowledge Base with LlamaIndex: Choosing Index Types, Chunking Strategies, and Query Optimization in Practice"