vLLM is a high-throughput and memory-efficient inference engine for large language models. It provides significant performance improvements over standard PyTorch inference through continuous batching, PagedAttention, and optimized CUDA kernels.

Why vLLM?

High Throughput: 2-3x faster than standard inference with continuous batching

Memory Efficient: PagedAttention reduces memory waste by up to 80%

Easy Integration: Compatible with HuggingFace models and OpenAI API format

Multi-GPU Support: Built-in tensor parallelism for distributed inference

Installation

1. Install vLLM

For CUDA 12.1 and PyTorch 2.1:
pip install vllm
For other CUDA versions, see vLLM Installation Guide

2. Verify Installation

python -c "import vllm; print(vllm.__version__)"

3. Using Docker (Recommended)

docker pull qwenllm/qwen:cu121
docker run --gpus all -it --rm qwenllm/qwen:cu121 bash
vLLM requires CUDA 11.4 or higher and a GPU with compute capability 7.0 or higher.

GPU Requirements

Memory Requirements by Model Size

| Model         | seq_len 2048 | seq_len 8192 | seq_len 16384 | seq_len 32768 |
|---------------|--------------|--------------|---------------|---------------|
| Qwen-1.8B     | 6.22GB       | 7.46GB       | -             | -             |
| Qwen-7B       | 17.94GB      | 20.96GB      | -             | -             |
| Qwen-7B-Int4  | 9.10GB       | 12.26GB      | -             | -             |
| Qwen-14B      | 33.40GB      | -            | -             | -             |
| Qwen-14B-Int4 | 13.30GB      | -            | -             | -             |
| Qwen-72B      | 166.87GB     | 185.50GB     | 210.80GB      | 253.80GB      |
| Qwen-72B-Int4 | 55.37GB      | 73.66GB      | 97.79GB       | 158.80GB      |

Supported Consumer GPUs

| GPU Memory | GPU Models          | Supported Qwen Models                           |
|------------|---------------------|-------------------------------------------------|
| 24GB       | RTX 4090/3090/A5000 | Qwen-1.8B, Qwen-7B, Qwen-7B-Int4, Qwen-14B-Int4 |
| 16GB       | RTX A4000           | Qwen-1.8B, Qwen-7B-Int4, Qwen-14B-Int4          |
| 12GB       | RTX 3080Ti          | Qwen-1.8B, Qwen-14B-Int4                        |
| 11GB       | RTX 2080Ti          | Qwen-1.8B                                       |
Bfloat16 requires GPU compute capability ≥ 8.0. For older GPUs, use --dtype float16.
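
If you are not sure which data type your GPU supports, you can check its compute capability with PyTorch first (a minimal sketch, assuming torch is installed and a CUDA device is visible):
import torch

# bfloat16 needs compute capability >= 8.0 (Ampere or newer);
# older GPUs such as the RTX 2080 Ti should fall back to float16.
major, minor = torch.cuda.get_device_capability(0)
dtype = "bfloat16" if (major, minor) >= (8, 0) else "float16"
print(f"Compute capability {major}.{minor} -> use --dtype {dtype}")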

Quick Start

Standalone OpenAI API Server

Deploy an OpenAI-compatible API server with vLLM:
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --dtype bfloat16 \
  --chat-template template_chatml.jinja

Chat Template Configuration

Download and use the ChatML template for proper formatting:
# Download template
wget https://raw.githubusercontent.com/QwenLM/Qwen/main/examples/template_chatml.jinja

# Use with vLLM
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --chat-template template_chatml.jinja
The chat template file is required for proper message formatting with the Qwen models.
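
For reference, the rendered ChatML prompt looks roughly like the sketch below; the template file above is the authoritative definition, this is only an illustration:
# Illustrative only: template_chatml.jinja defines the exact formatting.
def to_chatml(messages, system="You are a helpful assistant."):
    prompt = f"<|im_start|>system\n{system}<|im_end|>\n"
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # Generation continues after the assistant header; <|im_end|> (token id 151645) closes a turn.
    return prompt + "<|im_start|>assistant\n"

print(to_chatml([{"role": "user", "content": "Hello!"}]))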

Python Wrapper

Use the vLLM wrapper for a Transformers-like interface:

1. Download the Wrapper

wget https://raw.githubusercontent.com/QwenLM/Qwen/main/examples/vllm_wrapper.py

2. Use in Python

from vllm_wrapper import vLLMWrapper

# Single GPU
model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=1)

# Multi-GPU (4 GPUs)
# model = vLLMWrapper('Qwen/Qwen-7B-Chat', tensor_parallel_size=4)

# Int4 model
# model = vLLMWrapper('Qwen/Qwen-7B-Chat-Int4', 
#                     tensor_parallel_size=1, 
#                     dtype="float16")

# Chat interface
response, history = model.chat(query="Hello, who are you?", history=None)
print(response)

response, history = model.chat(
    query="Tell me about quantum computing", 
    history=history
)
print(response)

Wrapper Configuration

from vllm_wrapper import vLLMWrapper

model = vLLMWrapper(
    model_dir='Qwen/Qwen-7B-Chat',
    trust_remote_code=True,
    tensor_parallel_size=1,        # Number of GPUs
    gpu_memory_utilization=0.98,   # GPU memory fraction
    dtype='bfloat16',              # 'bfloat16', 'float16', 'float32'
    max_model_len=8192,            # Maximum sequence length
)

API Usage

Using OpenAI Python Client

import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "What is artificial intelligence?"}
    ],
    stream=False,
    stop_token_ids=[151645]  # Required for vLLM
)

print(response.choices[0].message.content)
For the standalone vLLM API, you must set stop_token_ids=[151645] or stop=["<|im_end|>"] to prevent infinite generation.
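
Streaming works with the same legacy client; a minimal sketch (assuming the pre-1.0 openai package used in the example above):
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

# Stream tokens as they are generated; stop on <|im_end|> (id 151645) as above.
stream = openai.ChatCompletion.create(
    model="Qwen",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
    stop_token_ids=[151645],
)

for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)
print()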

Advanced Configuration

Performance Tuning

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 256 \
  --chat-template template_chatml.jinja

Configuration Parameters

--model (string, required): Model name or path (HuggingFace format)
--tensor-parallel-size (int, default: 1): Number of GPUs for tensor parallelism
--dtype (string, default: auto): Data type: auto, bfloat16, float16, float32
--max-model-len (int): Maximum sequence length (prompt + generation)
--gpu-memory-utilization (float, default: 0.90): Fraction of GPU memory to use (0.0 to 1.0)
--max-num-seqs (int, default: 256): Maximum number of sequences processed in parallel
--max-num-batched-tokens (int): Maximum tokens processed in a batch
--swap-space (int, default: 4): CPU swap space size in GB
--disable-log-requests (boolean): Disable request logging for reduced overhead
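
The same knobs are available when running vLLM offline through its Python API instead of the API server; a minimal sketch (parameter names mirror the flags above):
from vllm import LLM, SamplingParams

# Offline engine configured with the equivalents of the server flags above.
llm = LLM(
    model="Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    tensor_parallel_size=1,       # --tensor-parallel-size
    dtype="bfloat16",             # --dtype
    max_model_len=8192,           # --max-model-len
    gpu_memory_utilization=0.95,  # --gpu-memory-utilization
)

# Stop on <|im_end|> (id 151645) so generation ends cleanly, as with the server.
params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256, stop_token_ids=[151645])
outputs = llm.generate(["<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"], params)
print(outputs[0].outputs[0].text)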

Multi-GPU Deployment

Tensor Parallelism

Split the model's weights across multiple GPUs:
# 2 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-14B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --dtype bfloat16

# 4 GPUs for Qwen-72B
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --dtype bfloat16

# 8 GPUs for maximum performance
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --max-num-seqs 512

GPU Selection

Control which GPUs to use:
# Use specific GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 4

# Use GPUs on different nodes (requires Ray)
ray start --head
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --distributed-executor-backend ray

Production Deployment

Systemd Service

Create /etc/systemd/system/qwen-vllm.service:
[Unit]
Description=Qwen vLLM OpenAI API Server
After=network.target

[Service]
Type=simple
User=qwen
WorkingDirectory=/opt/qwen
Environment="PATH=/opt/qwen/venv/bin:/usr/local/cuda/bin"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
ExecStart=/opt/qwen/venv/bin/python -m vllm.entrypoints.openai.api_server \
  --model /models/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype bfloat16 \
  --chat-template /opt/qwen/template_chatml.jinja
Restart=always
RestartSec=10
StandardOutput=append:/var/log/qwen-vllm/output.log
StandardError=append:/var/log/qwen-vllm/error.log

[Install]
WantedBy=multi-user.target
Manage the service:
sudo systemctl daemon-reload
sudo systemctl enable qwen-vllm
sudo systemctl start qwen-vllm
sudo systemctl status qwen-vllm

Docker Deployment

docker run --gpus all -d \
  --name qwen-vllm \
  --restart always \
  -p 8000:8000 \
  -v /models:/models:ro \
  -v /templates:/templates:ro \
  qwenllm/qwen:cu121 \
  python -m vllm.entrypoints.openai.api_server \
    --model /models/Qwen-7B-Chat \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000 \
    --chat-template /templates/template_chatml.jinja
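
Weight loading can take several minutes after the container starts; a small readiness-check sketch that polls the /health endpoint (assumes the requests package; the endpoint is the same one used under Monitoring below):
import time
import requests

# Poll the server's /health endpoint until the model has finished loading.
def wait_for_vllm(url="http://localhost:8000/health", timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not reachable yet
        time.sleep(5)
    return False

if wait_for_vllm():
    print("vLLM server is ready")
else:
    raise SystemExit("vLLM server did not become healthy in time")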

Load Balancing

Nginx configuration for multiple vLLM instances:
upstream vllm_backend {
    least_conn;
    server 127.0.0.1:8000 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8001 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8002 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name api.example.com;

    location / {
        proxy_pass http://vllm_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}

Performance Benchmarks

Throughput Comparison

Qwen-7B on A100 80GB GPU:
| Method        | Throughput (tokens/s) | Latency (ms/token) | Max Batch Size |
|---------------|-----------------------|--------------------|----------------|
| PyTorch       | 40.93                 | 24.4               | 1-4            |
| vLLM          | 68.5                  | 14.6               | 256+           |
| vLLM (4 GPUs) | 245.2                 | 4.1                | 1024+          |

Memory Efficiency

Qwen-72B memory usage:
| Configuration    | GPU Memory | Supported Batch Size |
|------------------|------------|----------------------|
| PyTorch (2xA100) | 144.69GB   | 1-2                  |
| vLLM (2xA100)    | 165GB      | 64                   |
| vLLM (4xA100)    | 166GB      | 256+                 |

Limitations

Current vLLM Limitations with Qwen:
  1. Dynamic NTK RoPE: vLLM does not support dynamic NTK RoPE scaling, so generation quality may degrade on long sequences.
  2. Context Length: The maximum context length is fixed at engine initialization and cannot be extended beyond max_model_len at runtime.
  3. Repetition Penalty: Requires vLLM ≥ 0.2.2 for repetition penalty support (see the sketch below).
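
On vLLM ≥ 0.2.2, repetition penalty can be set per request as a sampling parameter; a minimal sketch using the offline Python API:
from vllm import SamplingParams

# Available in vLLM >= 0.2.2; values > 1.0 penalize repeated tokens.
params = SamplingParams(
    temperature=0.7,
    repetition_penalty=1.1,
    max_tokens=256,
    stop_token_ids=[151645],
)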

Troubleshooting

Error: torch.cuda.OutOfMemoryError

Solutions:
  • Reduce --gpu-memory-utilization (try 0.85 or 0.80)
  • Decrease --max-model-len
  • Use quantized Int4 model
  • Increase --tensor-parallel-size
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat-Int4 \
  --trust-remote-code \
  --dtype float16 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096
Error: ValueError: trust_remote_code is required

Solution: Always include --trust-remote-code:
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat \
  --trust-remote-code
Issue: Model generates indefinitely

Solution: Set proper stop tokens:
response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[...],
    stop_token_ids=[151645]  # Essential!
)
Issue: Not achieving expected performance

Solutions:
  • Increase --max-num-seqs for more concurrent requests
  • Use --dtype bfloat16 instead of float16/float32
  • Disable request logging with --disable-log-requests
  • Check GPU utilization with nvidia-smi
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --max-num-seqs 512 \
  --dtype bfloat16 \
  --disable-log-requests
Error: Issues with multi-GPU deployment

Solutions:
  • Ensure all GPUs are the same model
  • Check NCCL configuration
  • Verify GPU visibility:
nvidia-smi
echo $CUDA_VISIBLE_DEVICES
  • Test with Ray backend:
ray start --head
python -m vllm.entrypoints.openai.api_server \
  --tensor-parallel-size 4 \
  --distributed-executor-backend ray

Monitoring

Health Checks

# Check if server is running
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models

# Test inference
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen",
    "messages": [{"role": "user", "content": "test"}],
    "max_tokens": 10,
    "stop_token_ids": [151645]
  }'

Metrics Collection

vLLM exposes Prometheus metrics:
curl http://localhost:8000/metrics
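
To quickly inspect the most relevant counters and gauges from that endpoint (a small sketch, assuming the requests package; vLLM's own metric names are prefixed with "vllm:"):
import requests

# Fetch the Prometheus exposition text and print only vLLM's own metrics.
metrics = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in metrics.splitlines():
    if line.startswith("vllm:"):
        print(line)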

Next Steps

FastChat Integration: Add web UI and more features with FastChat

Production Guide: Production deployment best practices

Performance Tuning: Advanced performance optimization

Monitoring Setup: Set up comprehensive monitoring