Introduction
vLLM is a production-grade large language model inference server. It can extract the full performance of high-end hardware and is suited to heavy workloads such as high-concurrency serving.
By contrast, Ollama is a local LLM service, better suited to lightweight applications and personal development.
Compared with Ollama, vLLM's advantages are:
- High performance and strong concurrent inference capability.
- Full utilization of multi-core CPUs and multiple GPUs.
- A large set of detailed configuration parameters, allowing fine-grained tuning.
- Support for API key authentication. An Ollama server has no authentication and is a security risk if not placed behind a firewall.
Disadvantages:
- More complex deployment. It is not out-of-the-box like Ollama, where a single set of ollama commands pulls models and handles everything.
- Higher hardware requirements: a GPU is expected. A CPU-only environment is possible, but you must compile vLLM yourself.
- Higher resource usage; not recommended for development environments.
- One service instance cannot serve multiple models. If you need that, start multiple vLLM services.
Preparation
Before anything else, download the model.
For ease of demonstration and to lower the hardware requirements, we use a model with a small parameter count. For example, download all files of deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B and place them in /home/paul/Documents/models.
Model download page: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B at main
CPU-only mode
In an environment without a GPU, vLLM cannot be used directly; it must be compiled first.
Compilation
Official documentation: CPU — vLLM
The steps to compile and build a CPU-only vLLM image are:
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .
If the ubuntu:22.04 image cannot be pulled, run the following first:
docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/ubuntu:22.04
docker tag swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/ubuntu:22.04 docker.io/ubuntu:22.04
then run the image build again.
Running
Start the Docker container with the following command:
docker run -d -v /home/paul/Documents/models:/models -p 8000:8000 --env "HF_HUB_OFFLINE=1" --ipc=host vllm-cpu-env --model /models --served_model_name deepseek
Where:
- HF_HUB_OFFLINE=1 enables offline mode.
- --model specifies the model directory inside the container. The -v flag mounts the host's model directory into the container at that path.
- --served_model_name sets the model name the service exposes; it may differ from the actual model name behind --model.
View the startup logs with docker logs -f {container name}. Startup has succeeded once logs similar to the following appear:
INFO 03-13 01:05:19 [loader.py:422] Loading weights took 34.51 seconds
INFO 03-13 01:05:19 [executor_base.py:111] # cpu blocks: 9362, # CPU blocks: 0
INFO 03-13 01:05:19 [executor_base.py:116] Maximum concurrency for 131072 tokens per request: 1.14x
INFO 03-13 01:05:20 [llm_engine.py:441] init engine (profile, create kv cache, warmup model) took 0.59 seconds
INFO 03-13 01:05:20 [api_server.py:958] Starting vLLM API server on http://0.0.0.0:8000
INFO 03-13 01:05:20 [launcher.py:26] Available routes are:
INFO 03-13 01:05:20 [launcher.py:34] Route: /openapi.json, Methods: HEAD, GET
INFO 03-13 01:05:20 [launcher.py:34] Route: /docs, Methods: HEAD, GET
INFO 03-13 01:05:20 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 03-13 01:05:20 [launcher.py:34] Route: /redoc, Methods: HEAD, GET
INFO 03-13 01:05:20 [launcher.py:34] Route: /health, Methods: GET
INFO 03-13 01:05:20 [launcher.py:34] Route: /ping, Methods: POST, GET
INFO 03-13 01:05:20 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 03-13 01:05:20 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 03-13 01:05:20 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 03-13 01:05:20 [launcher.py:34] Route: /version, Methods: GET
INFO 03-13 01:05:20 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 03-13 01:05:20 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 03-13 01:05:20 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 03-13 01:05:20 [launcher.py:34] Route: /pooling, Methods: POST
INFO 03-13 01:05:20 [launcher.py:34] Route: /score, Methods: POST
INFO 03-13 01:05:20 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 03-13 01:05:20 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 03-13 01:05:20 [launcher.py:34] Route: /rerank, Methods: POST
INFO 03-13 01:05:20 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 03-13 01:05:20 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 03-13 01:05:20 [launcher.py:34] Route: /invocations, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
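Rather than tailing logs, readiness can also be probed through the /health route listed above. Below is a minimal sketch using only the Python standard library; the base URL is an assumption matching the docker run command earlier:

```python
import urllib.request
import urllib.error

def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the vLLM server answers its /health route with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is not up yet.
        return False

# Poll is_ready("http://127.0.0.1:8000") until it returns True before sending requests.
```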
Testing:
You can call the service endpoint with curl.
curl --location --request POST 'http://127.0.0.1:8000/v1/chat/completions' \
--header 'User-Agent: Apifox/1.0.0 (https://apifox.com)' \
--header 'Content-Type: application/json' \
--data-raw '{
"model": "deepseek",
"messages": [
{
"role": "user",
"content": "你好!请你介绍一下自己"
}
]
}'
If the service is working normally, you will receive a response similar to:
{"id":"chatcmpl-038e244b71b24f24b7e0d71b30e85648","object":"chat.completion","created":1742366316,"model":"deepseek","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"您好!我是由中国的深度求索(DeepSeek)公司开发的智能助手DeepSeek-R1。如您有任何任何问题,我会尽我所能为您提供帮助。\n</think>\n\n您好!我是由中国的深度求索(DeepSeek)公司开发的智能助手DeepSeek-R1。如您有任何任何问题,我会尽我所能为您提供帮助。","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":83,"completion_tokens":73,"prompt_tokens_details":null},"prompt_logprobs":null}
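As a sketch of how a client might consume this response, the snippet below parses a trimmed copy of the payload above with the standard library and pulls out the generated text and token usage (field names are taken from the response shown, not from a formal schema):

```python
import json

# A trimmed version of the sample response above.
sample = '''{"id":"chatcmpl-038e244b71b24f24b7e0d71b30e85648",
 "object":"chat.completion","model":"deepseek",
 "choices":[{"index":0,
   "message":{"role":"assistant","content":"您好!"},
   "finish_reason":"stop"}],
 "usage":{"prompt_tokens":10,"total_tokens":83,"completion_tokens":73}}'''

resp = json.loads(sample)
answer = resp["choices"][0]["message"]["content"]  # the assistant's reply text
usage = resp["usage"]                              # token accounting for billing/limits
print(answer)
print(usage["completion_tokens"])
```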
Notes
CPU mode does not support GGUF-format model files.
Starting with a GGUF model fails with the following error:
AttributeError("'_OpNamespace' '_C' object has no attribute 'ggml_dequantize'")
Related GitHub issue: [Bug]: vllm-cpu docker gguf: AttributeError: '_OpNamespace' '_C' object has no attribute 'ggml_dequantize' · Issue #8500 · vllm-project/vllm
vLLM cannot run multiple models in one service. If you need several models at once, start multiple vLLM services.
GPU mode
With Nvidia CUDA, vLLM works out of the box with no compilation. With AMD ROCm or Intel GPUs, it must be compiled first, following the official documentation.
Running
Official documentation: GPU — vLLM
Install vLLM with pip install vllm:
mkdir project_path
cd project_path
python -m venv ./
source bin/activate
pip install vllm
Then start the model service with the vllm serve command.
Alternatively, vLLM can be run with Docker. Official documentation: Using Docker — vLLM
Start the image with the following command:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
Single-node multi-GPU support
vLLM can distribute inference across multiple GPUs, making full use of their combined performance.
Example command:
vllm serve /home/paul/Qwen2.5-70B-Instruct/ --tensor-parallel-size 8 --dtype auto --gpu-memory-utilization 0.95 --max-model-len 16384 --enable-auto-tool-choice --tool-call-parser hermes --served-model-name Qwen2.5-70B-Instruct --kv-cache-dtype fp8_e5m2
The performance-related parameters are:
- --tensor-parallel-size: degree of tensor parallelism; usually set equal to the number of GPUs.
- --gpu-memory-utilization: fraction of GPU memory vLLM may use.
- --kv-cache-dtype: KV cache quantization type.
- --max-num-seqs: maximum number of sequences processed in one batch.
- --max-num-batched-tokens: maximum number of tokens processed in one batch.
- --cpu-offload-gb: amount of GPU memory, in GB, to offload to host RAM. Think of it as extending GPU memory with system memory; it requires a fast CPU-GPU interconnect, otherwise tokens per second drops significantly.
These tuning parameters are equally available when running via Docker.
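To build intuition for why --kv-cache-dtype and --max-model-len interact with GPU memory, here is a rough back-of-the-envelope KV cache estimate using the standard 2 × layers × kv_heads × head_dim × bytes-per-element formula. The model dimensions below are illustrative only, and vLLM's real accounting differs in details such as paging overhead:

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_elem: int) -> int:
    # The factor of 2 accounts for the separate K and V tensors per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Illustrative dimensions (not taken from any real config file).
layers, kv_heads, head_dim = 80, 8, 128

fp16 = kv_cache_bytes_per_token(layers, kv_heads, head_dim, 2)  # fp16: 2 bytes/elem
fp8 = kv_cache_bytes_per_token(layers, kv_heads, head_dim, 1)   # fp8_e5m2: 1 byte/elem

tokens = 16384  # --max-model-len in the example command
print(f"fp16 KV cache for one full-length sequence: {fp16 * tokens / 2**30:.1f} GiB")
print(f"fp8  KV cache for one full-length sequence: {fp8 * tokens / 2**30:.1f} GiB")
```

Halving the KV element size with --kv-cache-dtype fp8_e5m2 roughly doubles how many cached tokens fit in the same memory budget, which is why it appears in the multi-GPU example above.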
API Key认证
默认情况vLLM服务是无需认证可以直接使用的。端口暴露在公网存在安全风险。
为了解决这个问题,我们可以添加--api-key
参数指定API Key,这样只有请求携带了正确的API Key之时才能够通过认证。
指定API Key启动服务的方式如下所示:
docker run -d -v /home/paul/Documents/models:/models -p 8000:8000 --env "HF_HUB_OFFLINE=1" --ipc=host vllm-cpu-env --model /models --served_model_name deepseek --api-key abcdefg
The command above starts the vLLM service with abcdefg as the API key.
Once authentication is enabled, API requests must carry a bearer token in the request headers, for example:
curl --location --request POST 'http://127.0.0.1:8000/v1/chat/completions' \
--header 'User-Agent: Apifox/1.0.0 (https://apifox.com)' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer abcdefg' \
--data-raw '{
"model": "deepseek",
"messages": [
{
"role": "user",
"content": "你好!请你介绍一下自己"
}
]
}'
If the API key is wrong, you will receive:
{"error":"Unauthorized"}
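Here is a sketch of the same authenticated call from Python, using only the standard library. The endpoint, model name, and key mirror the curl example above; the request is only constructed here, and actually sending it requires a running server:

```python
import json
import urllib.request

API_KEY = "abcdefg"  # must match the --api-key value used at startup

payload = {
    "model": "deepseek",
    "messages": [{"role": "user", "content": "你好!请你介绍一下自己"}],
}

req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",  # the bearer token vLLM checks
    },
    method="POST",
)
# To actually send the request: urllib.request.urlopen(req)
```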
Full parameter list
All parameters of the vllm serve command:
usage: api_server.py [-h] [--host HOST] [--port PORT]
[--uvicorn-log-level {debug,info,warning,error,critical,trace}]
[--allow-credentials] [--allowed-origins ALLOWED_ORIGINS]
[--allowed-methods ALLOWED_METHODS]
[--allowed-headers ALLOWED_HEADERS] [--api-key API_KEY]
[--lora-modules LORA_MODULES [LORA_MODULES ...]]
[--prompt-adapters PROMPT_ADAPTERS [PROMPT_ADAPTERS ...]]
[--chat-template CHAT_TEMPLATE]
[--chat-template-content-format {auto,string,openai}]
[--response-role RESPONSE_ROLE]
[--ssl-keyfile SSL_KEYFILE] [--ssl-certfile SSL_CERTFILE]
[--ssl-ca-certs SSL_CA_CERTS] [--enable-ssl-refresh]
[--ssl-cert-reqs SSL_CERT_REQS] [--root-path ROOT_PATH]
[--middleware MIDDLEWARE] [--return-tokens-as-token-ids]
[--disable-frontend-multiprocessing]
[--enable-request-id-headers] [--enable-auto-tool-choice]
[--tool-call-parser {granite-20b-fc,granite,hermes,internlm,jamba,llama3_json,mistral,pythonic} or name registered in --tool-parser-plugin]
[--tool-parser-plugin TOOL_PARSER_PLUGIN] [--model MODEL]
[--task {auto,generate,embedding,embed,classify,score,reward,transcription}]
[--tokenizer TOKENIZER] [--hf-config-path HF_CONFIG_PATH]
[--skip-tokenizer-init] [--revision REVISION]
[--code-revision CODE_REVISION]
[--tokenizer-revision TOKENIZER_REVISION]
[--tokenizer-mode {auto,slow,mistral,custom}]
[--trust-remote-code]
[--allowed-local-media-path ALLOWED_LOCAL_MEDIA_PATH]
[--download-dir DOWNLOAD_DIR]
[--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer}]
[--config-format {auto,hf,mistral}]
[--dtype {auto,half,float16,bfloat16,float,float32}]
[--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}]
[--max-model-len MAX_MODEL_LEN]
[--guided-decoding-backend GUIDED_DECODING_BACKEND]
[--logits-processor-pattern LOGITS_PROCESSOR_PATTERN]
[--model-impl {auto,vllm,transformers}]
[--distributed-executor-backend {ray,mp,uni,external_launcher}]
[--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
[--tensor-parallel-size TENSOR_PARALLEL_SIZE]
[--enable-expert-parallel]
[--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS]
[--ray-workers-use-nsight]
[--block-size {8,16,32,64,128}]
[--enable-prefix-caching | --no-enable-prefix-caching]
[--disable-sliding-window] [--use-v2-block-manager]
[--num-lookahead-slots NUM_LOOKAHEAD_SLOTS] [--seed SEED]
[--swap-space SWAP_SPACE]
[--cpu-offload-gb CPU_OFFLOAD_GB]
[--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
[--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE]
[--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS]
[--max-num-partial-prefills MAX_NUM_PARTIAL_PREFILLS]
[--max-long-partial-prefills MAX_LONG_PARTIAL_PREFILLS]
[--long-prefill-token-threshold LONG_PREFILL_TOKEN_THRESHOLD]
[--max-num-seqs MAX_NUM_SEQS]
[--max-logprobs MAX_LOGPROBS] [--disable-log-stats]
[--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,ptpc_fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,quark,moe_wna16,None}]
[--rope-scaling ROPE_SCALING] [--rope-theta ROPE_THETA]
[--hf-overrides HF_OVERRIDES] [--enforce-eager]
[--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE]
[--disable-custom-all-reduce]
[--tokenizer-pool-size TOKENIZER_POOL_SIZE]
[--tokenizer-pool-type TOKENIZER_POOL_TYPE]
[--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG]
[--limit-mm-per-prompt LIMIT_MM_PER_PROMPT]
[--mm-processor-kwargs MM_PROCESSOR_KWARGS]
[--disable-mm-preprocessor-cache] [--enable-lora]
[--enable-lora-bias] [--max-loras MAX_LORAS]
[--max-lora-rank MAX_LORA_RANK]
[--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE]
[--lora-dtype {auto,float16,bfloat16}]
[--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS]
[--max-cpu-loras MAX_CPU_LORAS] [--fully-sharded-loras]
[--enable-prompt-adapter]
[--max-prompt-adapters MAX_PROMPT_ADAPTERS]
[--max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN]
[--device {auto,cuda,neuron,cpu,openvino,tpu,xpu,hpu}]
[--num-scheduler-steps NUM_SCHEDULER_STEPS]
[--multi-step-stream-outputs [MULTI_STEP_STREAM_OUTPUTS]]
[--scheduler-delay-factor SCHEDULER_DELAY_FACTOR]
[--enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]]
[--speculative-model SPECULATIVE_MODEL]
[--speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,ptpc_fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,quark,moe_wna16,None}]
[--num-speculative-tokens NUM_SPECULATIVE_TOKENS]
[--speculative-disable-mqa-scorer]
[--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
[--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN]
[--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE]
[--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
[--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN]
[--spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}]
[--typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD]
[--typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA]
[--disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]]
[--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG]
[--ignore-patterns IGNORE_PATTERNS]
[--preemption-mode PREEMPTION_MODE]
[--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]]
[--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH]
[--show-hidden-metrics-for-version SHOW_HIDDEN_METRICS_FOR_VERSION]
[--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
[--collect-detailed-traces COLLECT_DETAILED_TRACES]
[--disable-async-output-proc]
[--scheduling-policy {fcfs,priority}]
[--scheduler-cls SCHEDULER_CLS]
[--override-neuron-config OVERRIDE_NEURON_CONFIG]
[--override-pooler-config OVERRIDE_POOLER_CONFIG]
[--compilation-config COMPILATION_CONFIG]
[--kv-transfer-config KV_TRANSFER_CONFIG]
[--worker-cls WORKER_CLS]
[--worker-extension-cls WORKER_EXTENSION_CLS]
[--generation-config GENERATION_CONFIG]
[--override-generation-config OVERRIDE_GENERATION_CONFIG]
[--enable-sleep-mode] [--calculate-kv-scales]
[--additional-config ADDITIONAL_CONFIG]
[--enable-reasoning] [--reasoning-parser {deepseek_r1}]
[--disable-log-requests] [--max-log-len MAX_LOG_LEN]
[--disable-fastapi-docs] [--enable-prompt-tokens-details]
A full explanation of every parameter is available at: 引擎参数 | vLLM 中文站 (the "Engine Arguments" page of the vLLM Chinese documentation site).