This guide covers best practices, optimization strategies, and production considerations for deploying Qwen models at scale.
Architecture Design
Deployment Architecture
Recommended Stack
Inference Engine: vLLM for production workloads (high throughput, memory efficient, multi-GPU support)
Orchestration: FastChat for management (model routing, load balancing, optional web UI)
Reverse Proxy: Nginx or Traefik (SSL termination, rate limiting, request routing)
Monitoring: Prometheus + Grafana (metrics collection, alerting, visualization)
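End to end, a client talks to the stack through the reverse proxy, which forwards requests to vLLM's OpenAI-compatible API. A minimal client sketch using the openai Python package; the base URL, API key, and model name are placeholders for your own deployment:
# Client sketch: the proxy terminates TLS and forwards to vLLM's
# OpenAI-compatible endpoint; URL, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # Nginx/Traefik in front of vLLM
    api_key="sk-key1",                      # validated by your auth layer, not vLLM
)

response = client.chat.completions.create(
    model="Qwen/Qwen-7B-Chat-Int4",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(response.choices[0].message.content)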
Model Selection
Choose the Right Model Size
Select based on latency and throughput requirements:

| Model | Use Case | Latency | Quality |
| --- | --- | --- | --- |
| Qwen-1.8B | High-throughput, simple tasks | ~50ms | Good |
| Qwen-7B | Balanced performance | ~100ms | Excellent |
| Qwen-14B | Complex reasoning | ~150ms | Superior |
| Qwen-72B | Mission-critical, highest quality | ~400ms | Best |
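If callers specify a latency budget, a thin routing layer can pick the highest-quality model that still fits it. A hypothetical sketch; the latency figures simply restate the table above, and pick_model is not part of any Qwen or vLLM API:
# Hypothetical routing helper: choose the highest-quality model whose typical
# latency (from the table above) still fits the caller's budget.
LATENCY_MS = {
    "Qwen-1.8B": 50,
    "Qwen-7B": 100,
    "Qwen-14B": 150,
    "Qwen-72B": 400,
}

def pick_model(latency_budget_ms: int) -> str:
    chosen = "Qwen-1.8B"
    for name, latency_ms in LATENCY_MS.items():  # insertion order: small to large
        if latency_ms <= latency_budget_ms:
            chosen = name
    return chosen

print(pick_model(200))  # -> "Qwen-14B"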
Quantization
Use quantization to reduce memory and improve throughput:
# Int4 quantization (recommended)
model = "Qwen/Qwen-7B-Chat-Int4"
# 70% memory reduction, minimal quality loss
# Int8 quantization
model = "Qwen/Qwen-7B-Chat-Int8"
# 40% memory reduction, negligible quality loss
Quality Comparison (MMLU scores):
BF16: 55.8
Int8: 55.4 (-0.4)
Int4: 55.1 (-0.7)
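The Int4/Int8 checkpoints load like any other Hugging Face model. A minimal sketch, assuming transformers plus the optimum and auto-gptq packages are installed as described on the model card:
# Load the GPTQ Int4 checkpoint; requires transformers, optimum, auto-gptq
# (and accelerate for device_map="auto").
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True,
).eval()

response, history = model.chat(tokenizer, "Hello!", history=None)
print(response)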
Context Length Optimization
Set appropriate max_model_len based on use case:
# Short conversations (most use cases)
--max-model-len 4096
# Long documents
--max-model-len 8192
# Extended context (requires more memory)
--max-model-len 16384
Longer context increases memory usage linearly.
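The linear growth comes from the KV cache: every prompt and generated token stores a key and a value vector in each layer. A back-of-the-envelope estimate, assuming Qwen-7B-like dimensions (32 layers, hidden size 4096) and FP16 caches:
# Rough KV-cache sizing; layer count and hidden size are assumptions for a 7B-class model.
num_layers = 32
hidden_size = 4096
bytes_per_value = 2  # FP16
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value  # key + value

for max_len in (4096, 8192, 16384):
    gib = kv_bytes_per_token * max_len / 1024**3
    print(f"max_model_len={max_len}: ~{gib:.0f} GiB of KV cache per full-length sequence")
# 4096 -> ~2 GiB, 8192 -> ~4 GiB, 16384 -> ~8 GiB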
vLLM Configuration
vLLM can be tuned for different profiles: high throughput, low latency, memory optimized, or balanced. For example, a throughput-oriented configuration:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-7B-Chat-Int4 \
--trust-remote-code \
--dtype float16 \
--max-model-len 4096 \
--max-num-seqs 512 \
--gpu-memory-utilization 0.95 \
--disable-log-requests \
--tensor-parallel-size 1
Multi-GPU Strategies
Tensor Parallelism
Model Replication
Hybrid Approach
Split a single model across GPUs:
# Best for large models (72B)
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen-72B-Chat \
--trust-remote-code \
--tensor-parallel-size 4 \
--dtype bfloat16
Pros: Higher throughput per model
Cons: All GPUs serve a single model

Run separate instances, one per GPU:
# GPU 0
CUDA_VISIBLE_DEVICES=0 python -m vllm... --port 8000
# GPU 1
CUDA_VISIBLE_DEVICES=1 python -m vllm... --port 8001
# GPU 2
CUDA_VISIBLE_DEVICES=2 python -m vllm... --port 8002
Pros: Independent failure domains
Cons: Requires a load balancer (see the client-side routing sketch after this section)

Combine both strategies:
# Pair 1: GPUs 0-1
CUDA_VISIBLE_DEVICES=0,1 python -m vllm... \
    --tensor-parallel-size 2 --port 8000
# Pair 2: GPUs 2-3
CUDA_VISIBLE_DEVICES=2,3 python -m vllm... \
    --tensor-parallel-size 2 --port 8001
Pros: Balanced throughput and redundancy
Recommended for 4+ GPU systems
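For the replication and hybrid setups, requests still have to be spread across the instance ports. A minimal client-side round-robin sketch over the replicas started above; a proper load balancer such as Nginx is preferable in production:
# Minimal client-side round-robin over vLLM replicas; ports match the example above.
import itertools
import httpx

BACKENDS = itertools.cycle([
    "http://localhost:8000",
    "http://localhost:8001",
    "http://localhost:8002",
])

def chat(messages):
    backend = next(BACKENDS)
    response = httpx.post(
        f"{backend}/v1/chat/completions",
        json={"model": "Qwen/Qwen-7B-Chat-Int4", "messages": messages},
        timeout=60.0,
    )
    response.raise_for_status()
    return response.json()

print(chat([{"role": "user", "content": "Hello"}]))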
Security
Authentication
API Key Authentication
JWT Authentication
OAuth2
from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

API_KEYS = {"sk-key1", "sk-key2", "sk-key3"}
api_key_header = APIKeyHeader(name="X-API-Key")

async def verify_api_key(api_key: str = Security(api_key_header)):
    if api_key not in API_KEYS:
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key

@app.post("/v1/chat/completions", dependencies=[Depends(verify_api_key)])
async def chat_completion(request: ChatRequest):
    # Your endpoint logic
    pass
from fastapi import Depends, HTTPException, Security
from fastapi.security import HTTPBearer
import jwt  # PyJWT

security = HTTPBearer()
SECRET_KEY = "your-secret-key"

async def verify_token(credentials=Security(security)):
    token = credentials.credentials
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
        return payload
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")

@app.post("/v1/chat/completions", dependencies=[Depends(verify_token)])
async def chat_completion(request: ChatRequest):
    pass
from fastapi import Depends, HTTPException
from fastapi.security import OAuth2PasswordBearer
import httpx

oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

async def verify_oauth_token(token: str = Depends(oauth2_scheme)):
    # Verify with OAuth provider
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://oauth-provider.com/verify",
            headers={"Authorization": f"Bearer {token}"},
        )
        if response.status_code != 200:
            raise HTTPException(status_code=401, detail="Invalid token")
        return response.json()
SSL/TLS Configuration
Nginx SSL configuration:
server {
    listen 443 ssl http2;
    server_name api.example.com;

    # SSL certificates
    ssl_certificate /etc/ssl/certs/api.example.com.crt;
    ssl_certificate_key /etc/ssl/private/api.example.com.key;

    # SSL configuration
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    # Security headers
    add_header Strict-Transport-Security "max-age=31536000" always;
    add_header X-Frame-Options "DENY" always;
    add_header X-Content-Type-Options "nosniff" always;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
Rate Limiting
Rate limits can be enforced in the FastAPI application, at the Nginx proxy, or with a shared Redis counter. An application-level example using slowapi (a Redis-based sketch follows it):
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

app = FastAPI()
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/v1/chat/completions")
@limiter.limit("100/minute")
async def chat_completion(request: Request):
    # Your endpoint logic
    pass
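The Redis-based variant shares one counter across all API replicas, so limits hold even behind a load balancer. A minimal fixed-window sketch with redis-py; the limit, window, and key prefix are illustrative:
# Shared fixed-window limiter: counters live in Redis so every replica sees them.
# Limit, window, and key prefix are illustrative values.
import time
import redis
from fastapi import HTTPException, Request

redis_client = redis.Redis(host="localhost", port=6379, db=0)
LIMIT = 100          # requests per window
WINDOW_SECONDS = 60

async def enforce_rate_limit(request: Request):
    window = int(time.time() // WINDOW_SECONDS)
    key = f"ratelimit:{request.client.host}:{window}"
    count = redis_client.incr(key)
    if count == 1:
        redis_client.expire(key, WINDOW_SECONDS)
    if count > LIMIT:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
Attach it to an endpoint with dependencies=[Depends(enforce_rate_limit)].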
Monitoring & Observability
Prometheus Metrics
Expose metrics for monitoring:
from prometheus_client import Counter, Histogram, Gauge, make_asgi_app
import time
# Define metrics
request_count = Counter(
'qwen_requests_total' ,
'Total requests' ,
[ 'model' , 'status' ]
)
request_duration = Histogram(
'qwen_request_duration_seconds' ,
'Request duration' ,
[ 'model' ]
)
active_requests = Gauge(
'qwen_active_requests' ,
'Active requests' ,
[ 'model' ]
)
tokens_generated = Counter(
'qwen_tokens_generated_total' ,
'Total tokens generated' ,
[ 'model' ]
)
# Add metrics endpoint
metrics_app = make_asgi_app()
app.mount( "/metrics" , metrics_app)
@app.post ( "/v1/chat/completions" )
async def chat_completion ( request : ChatRequest):
start_time = time.time()
active_requests.labels( model = request.model).inc()
try :
response = await generate_response(request)
request_count.labels( model = request.model, status = "success" ).inc()
tokens_generated.labels( model = request.model).inc(response.usage.completion_tokens)
return response
except Exception as e:
request_count.labels( model = request.model, status = "error" ).inc()
raise
finally :
duration = time.time() - start_time
request_duration.labels( model = request.model).observe(duration)
active_requests.labels( model = request.model).dec()
Grafana Dashboard
Key metrics to monitor:
Throughput: requests per second, tokens per second, batch size utilization
Latency: p50/p95/p99 response times, time to first token (TTFT), inter-token latency
Resources: GPU utilization, GPU memory usage, CPU and system memory
Errors: error rate, timeout rate, queue depth
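To spot-check what Prometheus and Grafana will see, the /metrics endpoint exposed above can be parsed directly with prometheus_client's text parser (the URL assumes the API from the previous example):
# Fetch and parse /metrics to verify the qwen_* series are being recorded.
import httpx
from prometheus_client.parser import text_string_to_metric_families

text = httpx.get("http://localhost:8000/metrics", timeout=10.0).text
for family in text_string_to_metric_families(text):
    if family.name.startswith("qwen_"):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)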
Health Checks
Implement comprehensive health checks:
from datetime import datetime
from fastapi import status
from fastapi.responses import JSONResponse
import psutil
import torch

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

@app.get("/health/detailed")
async def detailed_health_check():
    # Check GPU
    gpu_available = torch.cuda.is_available()
    if gpu_available:
        gpu_memory = torch.cuda.get_device_properties(0).total_memory
        gpu_memory_used = torch.cuda.memory_allocated(0)
        gpu_utilization = (gpu_memory_used / gpu_memory) * 100
    else:
        gpu_utilization = None

    # Check system resources
    cpu_percent = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()

    # Check model
    model_loaded = model is not None

    health_status = {
        "status": "healthy" if model_loaded and gpu_available else "degraded",
        "model_loaded": model_loaded,
        "gpu": {
            "available": gpu_available,
            "utilization": gpu_utilization,
        },
        "system": {
            "cpu_percent": cpu_percent,
            "memory_percent": memory.percent,
            "memory_available_gb": memory.available / (1024 ** 3),
        },
        "timestamp": datetime.utcnow().isoformat(),
    }

    if health_status["status"] == "degraded":
        return JSONResponse(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            content=health_status,
        )
    return health_status
Scaling Strategies
Horizontal Scaling
Load Balancer Setup
Configure Nginx for multiple backends:
upstream qwen_backend {
    least_conn;  # Use least connections algorithm
    server 10.0.1.10:8000 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8000 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8000 max_fails=3 fail_timeout=30s;
}

server {
    location / {
        proxy_pass http://qwen_backend;
        proxy_next_upstream error timeout http_502 http_503;
    }
}
Session Affinity
For stateful applications:
upstream qwen_backend {
    ip_hash;  # Sticky sessions based on client IP
    server 10.0.1.10:8000;
    server 10.0.1.11:8000;
}
Auto-scaling
Use Kubernetes HPA or cloud auto-scaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
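CPU utilization is only a rough proxy for GPU inference load. If you prefer to scale on a signal such as request queue depth, a small controller can patch the Deployment's replica count directly. A hedged sketch using the official Kubernetes Python client; get_queue_depth is a placeholder for your own metric source:
# Hypothetical queue-depth autoscaler; get_queue_depth() is a placeholder you
# would back with your own metrics (e.g. the Prometheus counters above).
import time
from kubernetes import client, config

config.load_incluster_config()  # use config.load_kube_config() outside the cluster
apps = client.AppsV1Api()

MIN_REPLICAS, MAX_REPLICAS = 2, 10
TARGET_QUEUE_PER_REPLICA = 8

def get_queue_depth() -> int:
    # Placeholder: query your metrics backend for waiting requests.
    return 0

while True:
    scale = apps.read_namespaced_deployment_scale("qwen-deployment", "default")
    desired = -(-get_queue_depth() // TARGET_QUEUE_PER_REPLICA)  # ceiling division
    desired = max(MIN_REPLICAS, min(MAX_REPLICAS, desired))
    if desired != scale.spec.replicas:
        scale.spec.replicas = desired
        apps.replace_namespaced_deployment_scale("qwen-deployment", "default", scale)
    time.sleep(30)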
Vertical Scaling
Upgrade to larger GPUs or more GPUs per node:
| Current | Upgrade Path | Performance Gain |
| --- | --- | --- |
| 1x RTX 3090 | 1x A100 40GB | 1.5-2x throughput |
| 1x A100 40GB | 1x A100 80GB | Larger models/batches |
| 1x A100 | 2x A100 (TP) | 1.7-1.9x throughput |
| 2x A100 (TP) | 4x A100 (TP) | Support Qwen-72B |
Disaster Recovery
Backup Strategy
Model Checkpoints
Configuration
Database
# Backup to S3
aws s3 sync /models/Qwen-7B-Chat \
s3://your-bucket/models/Qwen-7B-Chat/ \
--exclude "*.git/*"
# Backup to local NAS
rsync -avz --progress /models/ backup-server:/backups/models/
# Version control for configs
git init /etc/qwen
cd /etc/qwen
git add .
git commit -m "Production configuration"
git push origin main
# Backup Redis (if using for caching)
redis-cli --rdb /backup/dump.rdb
# Backup logs
tar -czf logs-$(date +%Y%m%d).tar.gz /var/log/qwen/
Disaster Recovery Plan
Documentation
Maintain runbooks with:
System architecture diagrams
Deployment procedures
Rollback procedures
Contact information
Testing
Regularly test:
Failover procedures
Backup restoration
Load balancer health checks
Monitoring alerts
Automation
Automate recovery:
#!/bin/bash
# Example recovery script
set -e
echo "Starting disaster recovery..."
# Stop failed services
systemctl stop 'qwen-*'
# Restore from backup
aws s3 sync s3://backup/models /models/
# Restart services
systemctl start qwen-controller
sleep 10
systemctl start qwen-worker
sleep 30
systemctl start qwen-api
# Verify health
curl -f http://localhost:8000/health || exit 1
echo "Recovery complete"
Cost Optimization
GPU Utilization
Higher batch sizes improve GPU utilization:
# Before: batch_size=1, 40% GPU util
--max-num-seqs 1
# After: batch_size=128, 85% GPU util
--max-num-seqs 128
Monitor with: nvidia-smi dmon -s u
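The same numbers can be read programmatically, for example from a batch-size tuning script or a sidecar exporter, using NVML bindings. A small sketch with the pynvml package:
# Read GPU utilization and memory via NVML (pip install pynvml).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU util: {util.gpu}%  memory: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GiB")
pynvml.nvmlShutdown()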
For non-critical workloads, consider spot instances:
# AWS Spot Instances: 70% cost savings
# Azure Spot VMs: 60-90% cost savings
# GCP Preemptible VMs: 80% cost savings
Implement graceful shutdown so reclaimed instances can finish in-flight work:
import signal
import sys

def handle_termination(signum, frame):
    print("Received termination signal, graceful shutdown...")
    # Finish current requests
    # Save state if needed
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_termination)
Match GPU to model size:

| Model | Minimum GPU | Recommended GPU | Cost Efficiency |
| --- | --- | --- | --- |
| Qwen-7B-Int4 | RTX 3090 | A10 | High |
| Qwen-7B | RTX 3090 | A100 40GB | Medium |
| Qwen-14B-Int4 | A100 40GB | A100 40GB | High |
| Qwen-72B-Int4 | 2x A100 40GB | 2x A100 80GB | Medium |
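A simple way to compare these options is cost per million generated tokens: hourly GPU cost divided by measured throughput. The inputs below are placeholders rather than benchmarks; substitute your own prices and measurements:
# Cost-per-token comparison; hourly price and throughput are placeholder inputs.
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

print(cost_per_million_tokens(gpu_cost_per_hour=2.0, tokens_per_second=1500))  # ~0.37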
Checklist
Use this checklist before going to production:
Pre-deployment
Deployment
Post-deployment
Troubleshooting
High Latency
Check GPU utilization: nvidia-smi
Review batch size: increase --max-num-seqs
Check network latency between services
Review logs for bottlenecks
Consider tensor parallelism
Low Throughput
Increase concurrent requests
Optimize max-num-batched-tokens
Use quantized models
Enable continuous batching (vLLM default)
Add more workers
Memory Issues
Reduce gpu-memory-utilization
Decrease max-model-len
Use Int4 quantization
Enable swap space
Use tensor parallelism
Additional Resources
vLLM Documentation: official vLLM documentation
FastChat GitHub: FastChat source and examples
Kubernetes Guide: deploy on Kubernetes
Performance Tuning: advanced optimization guide