
FastChat is a comprehensive platform for deploying LLMs with web UI, REST API, and distributed serving capabilities. When combined with vLLM, it provides production-grade performance with an intuitive interface.

Overview

FastChat provides a three-component architecture:

1. Controller: manages distributed workers and routes requests.
2. Model Worker: loads and serves the model (can use the vLLM backend).
3. API/UI Server: provides the web interface or an OpenAI-compatible API.
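
To see how the pieces fit together, the controller can be queried directly. A minimal sketch, assuming the default controller address http://localhost:21001 and FastChat's internal POST endpoints (implementation details that may change between releases):
import requests

CONTROLLER = "http://localhost:21001"

# Ask the controller which models its registered workers serve
models = requests.post(f"{CONTROLLER}/list_models").json()["models"]
print("Registered models:", models)

# Ask the controller which worker should handle a given model
resp = requests.post(f"{CONTROLLER}/get_worker_address", json={"model": models[0]})
print("Worker for", models[0], "->", resp.json()["address"])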

Installation

pip install "fschat[model_worker,webui]==0.2.33" "openai<1.0" vllm
FastChat 0.2.33 is the recommended version for stability with Qwen models.
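
To confirm the pinned versions were actually installed, a quick sanity check using only the standard library:
from importlib.metadata import version

# Expect fschat 0.2.33, openai below 1.0, and whichever vllm release pip resolved
for pkg in ("fschat", "openai", "vllm"):
    print(pkg, version(pkg))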

Quick Start

Web UI Deployment

1. Start the Controller

The controller manages model workers:
python -m fastchat.serve.controller
Runs on http://localhost:21001 by default.
2. Launch Model Worker

Start a vLLM worker for high performance:
python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --dtype bfloat16
3. Start Web Server

Launch the Gradio web interface:
python -m fastchat.serve.gradio_web_server
Access the UI at http://localhost:7860.

OpenAI API Deployment

1. Start Controller

python -m fastchat.serve.controller
2. Launch vLLM Worker

python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --dtype bfloat16
3. Start API Server

python -m fastchat.serve.openai_api_server \
  --host localhost \
  --port 8000

Configuration

Worker Configuration

python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --dtype bfloat16

Worker Parameters

--model-path (string, required): Path to the model checkpoint (Hugging Face ID or local path).
--trust-remote-code (flag, required): Required for Qwen models.
--tensor-parallel-size (int, default 1): Number of GPUs for tensor parallelism.
--dtype (string, default "auto"): Model data type: auto, bfloat16, float16, or float32.
--gpu-memory-utilization (float, default 0.90): Fraction of GPU memory to use.
--max-num-seqs (int, default 256): Maximum number of concurrent sequences.
--worker-address (string): Address the worker registers with the controller (should match the host and port it listens on).
--controller-address (string): Controller address to register with.

API Server Configuration

python -m fastchat.serve.openai_api_server \
  --host 0.0.0.0 \
  --port 8000 \
  --controller-address http://localhost:21001 \
  --api-keys sk-key1 sk-key2
--host (string, default "localhost"): API server bind address.
--port (int, default 8000): API server port.
--controller-address (string): Address of the controller service.
--api-keys (list of strings): Valid API keys for authentication.
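
When --api-keys is set, every request must carry one of the listed keys. With the pinned openai<1.0 client, using the placeholder key sk-key1 from the command above:
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "sk-key1"  # must match one of the values passed via --api-keys

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(response.choices[0].message.content)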

API Usage

OpenAI Python Client

import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"  # Or your API key if configured

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7,
    max_tokens=2048
)

print(response.choices[0].message.content)
Unlike vLLM standalone mode, FastChat handles stop tokens automatically. No need to specify stop_token_ids.
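
Streaming responses work through the same endpoint. A minimal sketch with the pinned openai<1.0 client; each chunk carries an incremental delta in the standard chat-completions streaming format:
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

stream = openai.ChatCompletion.create(
    model="Qwen",
    messages=[{"role": "user", "content": "Write a haiku about the sea."}],
    stream=True,
)

for chunk in stream:
    # the first and last chunks may not contain any content
    delta = chunk.choices[0].delta
    print(delta.get("content", ""), end="", flush=True)
print()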

Multi-Model Deployment

Deploy multiple models simultaneously:
1. Start Controller

python -m fastchat.serve.controller
2. Launch Multiple Workers

Start workers on different ports; give each worker a distinct --port and a matching --worker-address so the controller and API server can reach it:
# Worker 1: Qwen-7B (default port 21002)
python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --dtype bfloat16 \
  --worker-address http://localhost:21002

# Worker 2: Qwen-14B
python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-14B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --port 21003 \
  --worker-address http://localhost:21003

# Worker 3: Qwen-72B
CUDA_VISIBLE_DEVICES=2,3,4,5 python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-72B-Chat \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --port 21004 \
  --worker-address http://localhost:21004
3. Start API Server

python -m fastchat.serve.openai_api_server \
  --host 0.0.0.0 \
  --port 8000

Model Selection

Clients can specify which model to use:
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

# List available models
models = openai.Model.list()
print(models)

# Use specific model
response = openai.ChatCompletion.create(
    model="Qwen-7B-Chat",  # Or "Qwen-14B-Chat", "Qwen-72B-Chat"
    messages=[{"role": "user", "content": "Hello"}]
)
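
Since every deployed model sits behind the same API server, it is easy to send one prompt to each of them and compare the answers. A small sketch building on the listing call above:
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

prompt = [{"role": "user", "content": "Summarize attention in one sentence."}]

for model in openai.Model.list()["data"]:
    name = model["id"]
    reply = openai.ChatCompletion.create(model=name, messages=prompt, max_tokens=64)
    print(f"--- {name} ---")
    print(reply.choices[0].message.content)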

Production Deployment

Systemd Services

Create a systemd service file for each component, for example /etc/systemd/system/fastchat-controller.service:
[Unit]
Description=FastChat Controller
After=network.target

[Service]
Type=simple
User=qwen
WorkingDirectory=/opt/qwen
Environment="PATH=/opt/qwen/venv/bin"
ExecStart=/opt/qwen/venv/bin/python -m fastchat.serve.controller
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
Manage services:
# Enable and start all services
sudo systemctl daemon-reload
sudo systemctl enable fastchat-controller fastchat-worker fastchat-api
sudo systemctl start fastchat-controller
sleep 5   # give the controller a moment to come up
sudo systemctl start fastchat-worker
sleep 10  # let the worker begin loading the model and register with the controller
sudo systemctl start fastchat-api

# Check status
sudo systemctl status fastchat-controller
sudo systemctl status fastchat-worker
sudo systemctl status fastchat-api

# View logs
sudo journalctl -u fastchat-worker -f

Docker Compose

Complete deployment with Docker Compose:
docker-compose.yml
version: '3.8'

services:
  controller:
    image: qwenllm/qwen:cu121
    container_name: fastchat-controller
    command: python -m fastchat.serve.controller
    ports:
      - "21001:21001"
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:21001"]
      interval: 30s
      timeout: 10s
      retries: 3

  worker:
    image: qwenllm/qwen:cu121
    container_name: fastchat-worker
    command: >
      python -m fastchat.serve.vllm_worker
      --model-path /models/Qwen-7B-Chat
      --trust-remote-code
      --dtype bfloat16
      --controller-address http://controller:21001
    volumes:
      - /path/to/models:/models:ro
    depends_on:
      - controller
    restart: always
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  api-server:
    image: qwenllm/qwen:cu121
    container_name: fastchat-api
    command: >
      python -m fastchat.serve.openai_api_server
      --host 0.0.0.0
      --port 8000
      --controller-address http://controller:21001
    ports:
      - "8000:8000"
    depends_on:
      - controller
      - worker
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v1/models"]
      interval: 30s
      timeout: 10s
      retries: 3

  web-server:
    image: qwenllm/qwen:cu121
    container_name: fastchat-web
    command: >
      python -m fastchat.serve.gradio_web_server
      --controller-address http://controller:21001
    ports:
      - "7860:7860"
    depends_on:
      - controller
      - worker
    restart: always
Launch:
docker-compose up -d

# Scale workers (first remove container_name from the worker service; a fixed container name cannot be scaled)
docker-compose up -d --scale worker=3

# View logs
docker-compose logs -f worker

Load Balancing

The FastChat controller automatically load-balances across multiple workers:
# Start controller
python -m fastchat.serve.controller

# Start multiple workers for the same model (horizontal scaling);
# each worker needs its own --port, matching the --worker-address it registers
CUDA_VISIBLE_DEVICES=0 python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --worker-address http://localhost:21002

CUDA_VISIBLE_DEVICES=1 python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --port 21003 \
  --worker-address http://localhost:21003

CUDA_VISIBLE_DEVICES=2 python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code \
  --port 21004 \
  --worker-address http://localhost:21004

# Start API server
python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
The controller distributes requests across workers automatically.
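
A quick way to see the balancing in action is to fire several requests concurrently and watch them land on different workers in their logs. A sketch, assuming the three-worker setup above and the pinned openai<1.0 client:
import concurrent.futures

import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

def ask(i):
    response = openai.ChatCompletion.create(
        model="Qwen-7B-Chat",
        messages=[{"role": "user", "content": f"Request {i}: say hello"}],
        max_tokens=16,
    )
    return i, response.choices[0].message.content

# 12 concurrent requests spread across the registered workers
with concurrent.futures.ThreadPoolExecutor(max_workers=12) as pool:
    for i, text in pool.map(ask, range(12)):
        print(i, text)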

Monitoring

Worker Status

Check registered workers:
curl -X POST http://localhost:21001/list_models

Health Checks

# API server health
curl http://localhost:8000/v1/models

# Test inference
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 10
  }'
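
The same checks can be wrapped in a small Python probe, for example for a cron job or an external monitor. A sketch; adjust the base URL to your deployment:
import sys

import requests

BASE = "http://localhost:8000/v1"

try:
    # list the models currently served behind the API server
    models = requests.get(f"{BASE}/models", timeout=5).json()["data"]
    assert models, "no models registered"

    # minimal end-to-end inference check against the first listed model
    r = requests.post(
        f"{BASE}/chat/completions",
        json={
            "model": models[0]["id"],
            "messages": [{"role": "user", "content": "hello"}],
            "max_tokens": 10,
        },
        timeout=60,
    )
    r.raise_for_status()
    print("OK:", r.json()["choices"][0]["message"]["content"])
except Exception as exc:
    print("FAIL:", exc)
    sys.exit(1)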

Logging

Enable detailed logging:
# Set log level
export FASTCHAT_LOG_LEVEL=DEBUG

# Run with logging
python -m fastchat.serve.vllm_worker \
  --model-path Qwen/Qwen-7B-Chat \
  --trust-remote-code 2>&1 | tee worker.log

Troubleshooting

Error: Worker not appearing in controller
Solutions:
  • Check controller is running: curl http://localhost:21001
  • Verify controller address in worker: --controller-address http://localhost:21001
  • Check network connectivity between services
  • Review logs for connection errors
Error: API returns empty model list
Solutions:
  • Ensure workers have registered successfully
  • Check controller status: curl -X POST http://localhost:21001/list_models
  • Wait for model loading to complete (can take minutes)
  • Check worker logs for errors
Issue: High latency in responses
Solutions:
  • Use vLLM worker instead of model_worker
  • Increase --max-num-seqs on worker
  • Add more workers for horizontal scaling
  • Enable tensor parallelism for large models
  • Use quantized models (Int4/Int8)
Error: Worker process exits unexpectedly
Solutions:
  • Check GPU memory: nvidia-smi
  • Reduce --gpu-memory-utilization
  • Use smaller model or quantized version
  • Check CUDA compatibility
  • Review system logs: dmesg | grep -i error

Advanced Features

Custom System Prompts

Set system prompts in the web UI or API:
import openai

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {
            "role": "system", 
            "content": "You are an expert Python programmer. Always provide code examples."
        },
        {
            "role": "user", 
            "content": "How do I read a CSV file?"
        }
    ]
)

Conversation History

Maintain multi-turn conversations:
import openai

history = []

def chat(user_message):
    history.append({"role": "user", "content": user_message})
    
    response = openai.ChatCompletion.create(
        model="Qwen",
        messages=history
    )
    
    assistant_message = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_message})
    
    return assistant_message

# Multi-turn conversation
print(chat("What is Python?"))
print(chat("What are its main features?"))
print(chat("Give me an example"))

Performance Comparison

FastChat vs Standalone

Feature            | Standalone vLLM | FastChat + vLLM
-------------------|-----------------|----------------
Performance        | Same            | Same
Web UI             | No              | Yes
Multi-model        | Manual          | Automatic
Load Balancing     | External        | Built-in
Setup Complexity   | Low             | Medium
Production Ready   | Yes             | Yes

Next Steps

Production Guide: Best practices for production deployments
Monitoring: Set up comprehensive monitoring
Performance Tuning: Advanced optimization techniques
API Reference: Complete API documentation