This guide covers everything you need to install and configure Qwen models, from basic dependencies to advanced optimizations.
System Requirements
Minimum Requirements
Python: 3.8 or higher
PyTorch: 1.12+ (2.0+ recommended)
CUDA: 11.4+ (for GPU users)
GPU memory: varies by model size (see table below)
GPU Memory Requirements
Minimum GPU memory needed for inference (generating 2048 tokens):
Model        BF16/FP16            Int8                 Int4
Qwen-1.8B    4.23 GB              3.48 GB              2.91 GB
Qwen-7B      16.99 GB             11.20 GB             8.21 GB
Qwen-14B     30.15 GB             18.81 GB             13.01 GB
Qwen-72B     144.69 GB (2xA100)   81.27 GB (2xA100)    48.86 GB
For fine-tuning, memory requirements are higher. Q-LoRA requires at least:
Qwen-1.8B: 5.8GB
Qwen-7B: 11.5GB
Qwen-14B: 18.7GB
Qwen-72B: 61.4GB
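If you are unsure whether a given card clears these numbers, a quick programmatic check helps before downloading tens of gigabytes of weights. The requirements dictionary below is illustrative, copied from the BF16/FP16 inference column above; treat it as a rough sketch, not an official sizing tool.
import torch

# Illustrative minimums (GB), taken from the BF16/FP16 inference column above
REQUIRED_GB = {"Qwen-1.8B": 4.23, "Qwen-7B": 16.99, "Qwen-14B": 30.15}

def fits_on_gpu(model_name, device=0):
    # Compare total GPU memory against the table; ignores memory already in use
    if not torch.cuda.is_available():
        return False
    total_gb = torch.cuda.get_device_properties(device).total_memory / 1024**3
    return total_gb >= REQUIRED_GB[model_name]

print(fits_on_gpu("Qwen-7B"))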
Basic Installation
Step 1: Install Core Dependencies
Install the required Python packages from the requirements file:
pip install "transformers>=4.32.0,<4.38.0" accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy
Understanding the Dependencies
transformers: Hugging Face library for loading and running models
accelerate: Efficient model loading and distributed inference
tiktoken: Fast tokenization library
einops: Tensor operations for attention mechanisms
transformers_stream_generator: Streaming text generation support
scipy: Scientific computing utilities
Step 2: Verify Installation
Test your installation with this simple script:
import torch
import transformers
from transformers import AutoTokenizer
print ( f "PyTorch version: { torch. __version__ } " )
print ( f "Transformers version: { transformers. __version__ } " )
print ( f "CUDA available: { torch.cuda.is_available() } " )
if torch.cuda.is_available():
print ( f "CUDA version: { torch.version.cuda } " )
print ( f "GPU: { torch.cuda.get_device_name( 0 ) } " )
Flash Attention (Recommended)
Flash Attention significantly improves inference speed and reduces memory usage. Installation is optional but highly recommended.
Check Compatibility
Flash Attention requires:
GPU with FP16 or BF16 support
CUDA 11.4 or higher
PyTorch 1.12 or higher
Verify your setup:
import torch

print(f"BF16 supported: {torch.cuda.is_bf16_supported()}")
print(f"FP16 supported: {torch.cuda.get_device_capability()[0] >= 7}")
Install Flash Attention
Qwen supports Flash Attention 2 for optimal performance:
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
pip install .
This installation may take 10-30 minutes as it compiles CUDA kernels. Ensure you have sufficient disk space (~5GB for build files).
Install Optional Components (Flash Attention v2.1.1 and below)
For older versions of Flash Attention, you may optionally install additional components:
# Optional: layer norm optimization
# pip install csrc/layer_norm
# Optional: rotary embedding optimization (not needed for flash-attn > 2.1.1)
# pip install csrc/rotary
These are optional and may slow down the installation. Skip if flash-attention version is higher than 2.1.1.
Verify Flash Attention
Test that Flash Attention is working:
try:
    import flash_attn
    print(f"Flash Attention version: {flash_attn.__version__}")
    print("Flash Attention installed successfully!")
except ImportError:
    print("Flash Attention not available")
With Flash Attention enabled, you can expect:
40% faster batch inference
20-30% lower memory usage
Support for longer sequences without OOM errors
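To get these benefits at inference time, Flash Attention also has to be switched on when the model is loaded. The snippet below is a minimal sketch assuming a Qwen1-series checkpoint whose remote modeling code reads a use_flash_attn flag from its config (the released configs default it to "auto"); adjust or drop the flag for other model families.
from transformers import AutoModelForCausalLM

# Assumption: the checkpoint's remote code honors a use_flash_attn config flag
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_flash_attn=True,  # forwarded to the model config by from_pretrained
).eval()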
Docker Installation
Using Docker is the fastest way to get started with Qwen, as it includes all dependencies pre-configured.
Pre-built Docker Images
Qwen provides official Docker images that skip most environment setup steps:
# Pull the official Qwen Docker image
docker pull qwenllm/qwen:latest
# Run the container with GPU support
docker run --gpus all -it qwenllm/qwen:latest
Custom Dockerfile
If you need a custom setup, create your own Dockerfile:
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04
# Install Python and pip
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
git \
&& rm -rf /var/lib/apt/lists/*
# Install dependencies
RUN pip3 install --no-cache-dir \
    "transformers>=4.32.0,<4.38.0" \
    accelerate \
    tiktoken \
    einops \
    transformers_stream_generator==0.0.4 \
    scipy \
    "torch>=2.0.0"
# Optional: Install Flash Attention
RUN git clone https://github.com/Dao-AILab/flash-attention && \
cd flash-attention && \
pip install . && \
cd .. && \
rm -rf flash-attention
# Set working directory
WORKDIR /workspace
CMD [ "/bin/bash" ]
Build and run:
docker build -t qwen-custom .
docker run --gpus all -it -v $(pwd):/workspace qwen-custom
Quantization Dependencies
To use quantized models (Int4/Int8), install additional libraries:
AutoGPTQ Installation
pip install auto-gptq optimum
Version Compatibility: AutoGPTQ packages are highly dependent on your PyTorch and CUDA versions. If you encounter installation issues, pin compatible versions:
For PyTorch 2.1: auto-gptq>=0.5.1 transformers>=4.35.0 optimum>=1.14.0 peft>=0.6.1
For PyTorch 2.0: auto-gptq<0.5.0 transformers<4.35.0 optimum<1.14.0 peft>=0.5.0,<0.6.0
If pre-compiled wheels don't work, build from source:
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install -e .
Verify Quantization Support
try:
    from auto_gptq import AutoGPTQForCausalLM
    print("AutoGPTQ installed successfully!")
except ImportError as e:
    print(f"AutoGPTQ not available: {e}")
Fine-tuning Dependencies
For training and fine-tuning, install additional packages:
LoRA and Q-LoRA
peft>=0.8.0 has a known issue with loading Qwen tokenizers. Use peft<0.8.0 until the issue is resolved.
DeepSpeed (for distributed training)
Pydantic Compatibility: DeepSpeed may conflict with pydantic>=2.0. If you encounter errors, ensure pydantic<2.0:
pip install "pydantic<2.0"
Full Fine-tuning Requirements
# Install all training dependencies
pip install "peft<0.8.0" deepspeed "pydantic<2.0" tensorboard
Intel CPU/GPU (OpenVINO)
For Intel Core/Xeon processors or Arc GPUs, use OpenVINO for optimized inference:
pip install openvino openvino-dev
See the OpenVINO notebooks for Qwen-specific examples.
Ascend NPU
For Huawei Ascend 910 NPU:
# Install Ascend toolkit first, then:
pip install torch-npu
Refer to the ascend-support directory in the Qwen repository for detailed instructions.
Hygon DCU
For Hygon DCU acceleration:
# Follow DCU-specific installation
# See dcu-support directory for details
Installing from Source
To get the latest development version or contribute to Qwen:
Clone the Repository
git clone https://github.com/QwenLM/Qwen.git
cd Qwen
Install Dependencies
pip install -r requirements.txt
Run from Source
# You can now import and use local model files
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load from local directory
model_path = "./Qwen-7B-Chat" # Your local model path
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=True,
).eval()
Downloading Models
From Hugging Face
Models are automatically downloaded when you first load them:
from transformers import AutoModelForCausalLM, AutoTokenizer
# This will download from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
).eval()
From ModelScope
For users with better access to ModelScope:
from modelscope import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
# Download to local directory
model_dir = snapshot_download('qwen/Qwen-7B-Chat')
# Load from local directory
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    device_map="auto",
    trust_remote_code=True,
).eval()
Manual Download
You can also manually download model files:
Using Git LFS
# Install git-lfs
git lfs install
# Clone the model repository
git clone https://huggingface.co/Qwen/Qwen-7B-Chat

Using Hugging Face CLI
# Install huggingface-cli
pip install huggingface_hub
# Download model
huggingface-cli download Qwen/Qwen-7B-Chat --local-dir ./Qwen-7B-Chat
Environment Variables
Configure these environment variables for optimal performance:
# Set cache directory for models
export HF_HOME=/path/to/cache
export TRANSFORMERS_CACHE=/path/to/cache
# Enable offline mode (use cached models only)
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
# Set number of threads for PyTorch
export OMP_NUM_THREADS=8
# CUDA optimizations
export CUDA_LAUNCH_BLOCKING=0
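If you prefer to configure this from Python (for example in a notebook), set the variables before importing transformers so the cache and offline settings take effect; a minimal sketch:
import os

# Must run before importing transformers, otherwise the library's defaults are used
os.environ["HF_HOME"] = "/path/to/cache"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["OMP_NUM_THREADS"] = "8"

from transformers import AutoTokenizer  # picks up the settings above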
Verification
Run this complete verification script to ensure everything is working:
import sys
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
def verify_installation():
    print("=" * 50)
    print("Qwen Installation Verification")
    print("=" * 50)

    # Python version
    print(f"\nPython version: {sys.version}")

    # PyTorch
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        print(f"cuDNN version: {torch.backends.cudnn.version()}")
        print(f"Number of GPUs: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
            print(f"  Memory: {torch.cuda.get_device_properties(i).total_memory / 1024**3:.2f} GB")

    # Transformers
    print(f"\nTransformers version: {transformers.__version__}")

    # Flash Attention
    try:
        import flash_attn
        print(f"Flash Attention version: {flash_attn.__version__}")
    except ImportError:
        print("Flash Attention: Not installed")

    # AutoGPTQ
    try:
        from auto_gptq import AutoGPTQForCausalLM
        print("AutoGPTQ: Installed")
    except ImportError:
        print("AutoGPTQ: Not installed")

    # PEFT
    try:
        import peft
        print(f"PEFT version: {peft.__version__}")
    except ImportError:
        print("PEFT: Not installed")

    # DeepSpeed
    try:
        import deepspeed
        print(f"DeepSpeed version: {deepspeed.__version__}")
    except ImportError:
        print("DeepSpeed: Not installed")

    print("\n" + "=" * 50)
    print("Verification complete!")
    print("=" * 50)

if __name__ == "__main__":
    verify_installation()
Save this as verify_installation.py and run:
python verify_installation.py
Troubleshooting
Installation fails with CUDA errors
Ensure your CUDA version matches PyTorch requirements:
# Check CUDA version
nvcc --version
# Install matching PyTorch version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Flash Attention compilation fails
Common solutions:
Ensure you have CUDA development tools: sudo apt-get install cuda-toolkit-11-8
Update your GCC compiler: sudo apt-get install build-essential
Set environment variables:
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
Try installing from PyPI: pip install flash-attn --no-build-isolation
AutoGPTQ version conflicts
If you see errors about incompatible versions:
# Uninstall existing versions
pip uninstall auto-gptq optimum transformers peft
# Install compatible versions
pip install torch==2.1.0
pip install "auto-gptq>=0.5.1" "transformers>=4.35.0" "optimum>=1.14.0" "peft>=0.6.1"
ImportError: trust_remote_code
Update transformers to a compatible version:
pip install "transformers>=4.32.0,<4.38.0"
Out of disk space during installation
Flash Attention compilation requires ~5GB of temporary space:
# Set temporary directory to a location with more space
export TMPDIR=/path/to/large/tmp
pip install flash-attn
Model download is slow or fails
Try these solutions:
Use ModelScope instead of Hugging Face (see above)
Set a mirror:
export HF_ENDPOINT=https://hf-mirror.com
Download manually with git-lfs or huggingface-cli
Resume interrupted downloads by re-running the same command (see the sketch below)
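For scripted, resumable downloads, huggingface_hub's snapshot_download is a reasonable option: re-running the same call skips files that already finished and continues with the rest.
from huggingface_hub import snapshot_download

# Re-running this resumes into the same directory instead of starting over
snapshot_download(repo_id="Qwen/Qwen-7B-Chat", local_dir="./Qwen-7B-Chat")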
Next Steps
Quickstart: Get started with Qwen in under 5 minutes
Model Selection: Choose the right model for your use case
Inference: Learn about inference options and optimizations
Docker Setup: Deploy Qwen with Docker in production