This guide covers common issues you may encounter when working with Qwen models, along with their solutions.

Installation Issues

Flash Attention Installation Fails

Symptoms:
  • Compilation errors when installing flash-attention
  • CUDA version mismatch errors
  • Missing CUDA development files
Solutions:
1. Verify GPU compatibility

Flash Attention only works on:
  • Turing architecture: T4, RTX 2080, etc.
  • Ampere architecture: A100, RTX 3090, etc.
  • Ada architecture: RTX 4090, etc.
  • Hopper architecture: H100, etc.
Check your GPU:
nvidia-smi --query-gpu=name --format=csv
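If PyTorch is already installed, a quick programmatic check works too; this is a small sketch, relying on the fact that Turing corresponds to compute capability 7.5 (Ampere 8.0/8.6, Ada 8.9, Hopper 9.0), the oldest architecture listed above:
# Requires PyTorch with CUDA available
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("Flash Attention supported:", (major, minor) >= (7, 5))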
2. Verify CUDA version

Flash Attention requires CUDA 11.4+:
nvidia-smi  # Check Driver Version and CUDA Version
nvcc --version  # Check installed CUDA toolkit
3. Install from source

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
pip install .
4. Alternative: Skip Flash Attention

Flash Attention is optional. If installation continues to fail, proceed without it:
# Models will work fine without flash attention
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_flash_attn=False  # Explicitly disable
).eval()

Package Dependency Conflicts

Error: Version conflicts between transformers, peft, optimum, and auto-gptq.
Recommended versions:
# For torch 2.1+
pip install "torch>=2.1"
pip install "auto-gptq>=0.5.1"
pip install "transformers>=4.35.0"
pip install "optimum>=1.14.0"
pip install "peft>=0.6.1,<0.8.0"

# For torch 2.0.x
pip install "torch>=2.0,<2.1"
pip install "auto-gptq<0.5.0"
pip install "transformers<4.35.0"
pip install "optimum<1.14.0"
pip install "peft>=0.5.0,<0.6.0"

Git LFS Files Not Downloaded

Symptoms:
  • qwen.tiktoken is only a few bytes (text pointer)
  • Model files are text pointers instead of actual binaries
  • “File not found” errors for model checkpoints
Solution:
# Install git-lfs
git lfs install

# Pull LFS files
cd /path/to/Qwen
git lfs pull

# Verify qwen.tiktoken is ~2MB, not a text file
ls -lh qwen.tiktoken

Model Loading Issues

Model Won’t Load Locally

Checklist:
# 1. Check that all required files are present:
ls -lh model_directory/

# Required files:
# - config.json
# - generation_config.json
# - model*.safetensors (or model*.bin)
# - tokenizer_config.json
# - qwen.tiktoken
# - modeling_qwen.py
# - tokenization_qwen.py
# - configuration_qwen.py

# 2. Update to the latest code:
cd Qwen
git pull origin main

# Verify you're on the latest version
git log -1

# 3. Always pass trust_remote_code for Qwen models
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    trust_remote_code=True  # This is required!
)

# 4. Test loading a checkpoint file (safetensors files cannot be opened with torch.load)
from safetensors.torch import load_file

checkpoint = load_file("model.safetensors")
print(f"Checkpoint loaded successfully, {len(checkpoint)} keys")

Out of Memory (OOM) When Loading

Symptoms:
  • RuntimeError: CUDA out of memory
  • System freezes when loading model
  • Model loads but crashes during inference
Solutions:
1. Use quantized models

# Int4 uses ~50% less memory than Int8, ~75% less than BF16
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
2. Enable device_map='auto'

# Automatically distributes model across available devices
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",  # Important for multi-GPU
    trust_remote_code=True
).eval()
3. Use CPU offloading

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    offload_folder="offload",  # Offload to disk
    offload_state_dict=True,
    trust_remote_code=True
).eval()
4. Switch to a smaller model

If none of the above work, use a smaller model size:
  • Qwen-7B → Qwen-1.8B
  • Qwen-14B → Qwen-7B
  • Qwen-72B → Qwen-14B or Qwen-7B
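For example, only the checkpoint name changes when dropping to the 1.8B chat model (on the Hugging Face Hub its id is written with an underscore, Qwen/Qwen-1_8B-Chat); a minimal sketch:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same loading code, smaller checkpoint
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()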

Inference Issues

Gibberish or Garbled Output

Problem 1: Using base model instead of chat model
# Wrong - base model doesn't follow instructions
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", ...)

# Correct - use chat model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", ...)
Problem 2: Incomplete UTF-8 sequences in streaming
# Solution: Update to latest code
cd Qwen
git pull

# Or set error handling
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    errors="ignore"  # or "replace"
)
Problem 3: Wrong decoding parameters
# Use appropriate sampling parameters
response, history = model.chat(
    tokenizer,
    "Your question",
    history=history,
    temperature=0.7,  # Lower = more deterministic
    top_p=0.9,
    top_k=50
)

Model Not Following Instructions

Check 1: Using correct model type
# Verify you loaded the -Chat model
print(model.config.name_or_path)  
# Should contain "-Chat"
Check 2: Using the correct prompt format. For Qwen-Chat, use the chat() method:
# Correct
response, history = model.chat(tokenizer, "Hello", history=None)

# Wrong - don't use generate() directly for chat models
response = model.generate(...)  
Check 3: System prompt (for Qwen-72B-Chat and Qwen-1.8B-Chat)
# Use system prompt for better instruction following
response, history = model.chat(
    tokenizer,
    "Your question",
    history=None,
    system="You are a helpful assistant."
)

Slow Inference Speed

Diagnosis:
import time

start = time.time()
response, history = model.chat(tokenizer, "Hello", history=None)
end = time.time()

print(f"Time: {end - start:.2f}s")
print(f"Tokens: {len(tokenizer.encode(response))}")
print(f"Speed: {len(tokenizer.encode(response)) / (end - start):.2f} tokens/s")
Solutions:
Enable Flash Attention (requires a compatible GPU):
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_flash_attn=True  # Requires compatible GPU
).eval()
Use a quantized model:
# Int4 is faster than BF16
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
Update to the latest code:
cd Qwen
git pull
pip install -r requirements.txt --upgrade
vLLM provides optimized inference:
pip install vllm

# See deployment documentation for details
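A minimal offline-inference sketch with vLLM (the prompt and sampling values are illustrative; see the deployment documentation for the full setup, including the chat prompt format):
from vllm import LLM, SamplingParams

# Load the model with vLLM's optimized runtime
llm = LLM(model="Qwen/Qwen-7B-Chat", trust_remote_code=True)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Hello, who are you?"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)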

Poor Performance on Long Context

Enable NTK and LogN attention:
# Check config.json
import json

with open("config.json") as f:
    config = json.load(f)

print("use_dynamic_ntk:", config.get("use_dynamic_ntk"))  # Should be true
print("use_logn_attn:", config.get("use_logn_attn"))      # Should be true
If they are false, enable them manually:
model.config.use_dynamic_ntk = True
model.config.use_logn_attn = True

Fine-tuning Issues

OOM During Training

Solutions in order of effectiveness:
1. Use Q-LoRA instead of LoRA

# Q-LoRA uses quantized base model
bash finetune/finetune_qlora_single_gpu.sh
Saves ~40-50% memory compared to LoRA.
2. Reduce batch size, increase gradient accumulation

# In training script:
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16
3. Enable gradient checkpointing

--gradient_checkpointing True
4. Use DeepSpeed ZeRO

# For LoRA training
bash finetune/finetune_lora_ds.sh
5. Reduce sequence length

--model_max_length 1024  # Instead of 2048

Training Loss Not Decreasing

Checklist:
1. Verify the data format. Each sample should look like:
{
  "id": "unique_id",
  "conversations": [
    {"from": "user", "value": "Question"},
    {"from": "assistant", "value": "Answer"}
  ]
}
2. Try different learning rates:
--learning_rate 1e-5  # Default
--learning_rate 5e-6  # If loss explodes
--learning_rate 2e-5  # If loss doesn't move
3. Confirm the model is in training mode and that the expected parameters are trainable:
print(model.training)  # Should be True during training
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"Trainable: {name}")

Quantized Model Finetuning Issues

Problem: Can’t load LoRA adapter after Q-LoRA training
# Solution: Load with AutoPeftModelForCausalLM
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "path/to/adapter",
    device_map="auto",
    trust_remote_code=True
).eval()
Problem: Missing .cpp and .cu files after saving. Solution: manually copy these files from the original model directory (a small copy sketch follows the list below):
  • cache_autogptq_cuda_256.cpp
  • cache_autogptq_cuda_kernel_256.cu
  • Other .cpp and .cu files
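A small sketch of that copy step (both directory paths below are placeholders to adjust for your setup):
import shutil
from pathlib import Path

src = Path("/path/to/Qwen-7B-Chat-Int4")   # original quantized model directory (placeholder)
dst = Path("/path/to/finetuned-model")     # directory where the finetuned model was saved (placeholder)

# Copy the CUDA kernel sources so they sit next to the saved weights
for pattern in ("*.cpp", "*.cu"):
    for f in src.glob(pattern):
        shutil.copy(f, dst / f.name)
        print(f"Copied {f.name}")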

Quantization Issues

AutoGPTQ Installation Fails

Check PyTorch and CUDA compatibility:
python -c "import torch; print(torch.__version__); print(torch.version.cuda)"
Install matching auto-gptq wheel:
# For torch 2.1 + CUDA 11.8
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

# For torch 2.0 + CUDA 11.8  
pip install "auto-gptq<0.5.0" --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
See AutoGPTQ repo for more wheels.

Quantized Model Slower Than Expected

Note: Loading a GPTQ checkpoint with AutoModelForCausalLM.from_pretrained() is roughly 20% slower than using the auto-gptq library directly. This is a known issue that has been reported to the Hugging Face team. Workaround: use the auto-gptq library directly for maximum speed (see the sketch below).
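A minimal sketch of loading the Int4 checkpoint through auto-gptq directly (the arguments shown are illustrative; check the AutoGPTQ documentation for your installed version):
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)

# Load the GPTQ checkpoint with auto-gptq's own loader instead of transformers
model = AutoGPTQForCausalLM.from_quantized(
    "Qwen/Qwen-7B-Chat-Int4",
    device="cuda:0",
    trust_remote_code=True,
    use_safetensors=True,
)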

Tool Usage and ReAct Issues

Plugin Not Being Called

Check prompt format:
# Make sure to use proper ReAct prompt format
# See examples/react_prompt.md for details

prompt = """Answer the following questions as best you can. You have access to the following tools:

{tool_descriptions}

Use the following format:

Question: the input question
Thought: think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (repeat Thought/Action/Action Input/Observation as needed)
Thought: I now know the final answer
Final Answer: the final answer

Question: {question}
Thought:"""

HuggingFace Agent Issues

Verify Qwen-Chat is being used:
from transformers import HfAgent

agent = HfAgent(
    "Qwen/Qwen-7B-Chat",  # Must be -Chat model
    trust_remote_code=True
)

Docker Issues

Container Fails to Start

Check GPU availability:
# Test NVIDIA Docker runtime
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Verify sufficient resources:
# Check available memory
free -h

# Check available disk space
df -h

Slow Image Download

Use a Docker registry mirror (especially for users in China):
# Configure Docker daemon.json
sudo vim /etc/docker/daemon.json
Add:
{
  "registry-mirrors": ["https://your-mirror.com"]
}
Restart Docker:
sudo systemctl restart docker

Platform-Specific Issues

Windows

Long path issues:
# Enable long paths in Windows
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 -PropertyType DWORD -Force
WSL2 is recommended for better compatibility:
wsl --install
wsl --set-default-version 2

macOS

Metal/MPS is not officially supported. Use CPU inference or cloud deployment.
# CPU-only on macOS
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="cpu",
    trust_remote_code=True
).eval()

Getting Help

If issues persist after trying these solutions:
  1. Search the existing GitHub issues.
  2. Check the FAQ.
  3. Open a new issue with:
    • Full error traceback
    • Environment details (python --version, pip list, nvidia-smi)
    • Minimal reproducible code
    • Steps already tried
  4. Join the community.
When reporting issues, provide as much context as you can, and write in English where possible so that more people can understand and help.