This guide covers common issues you may encounter when working with Qwen models, along with their solutions.

Installation Issues

Flash Attention Installation Fails

Symptoms:
  • Compilation errors when installing flash-attention
  • CUDA version mismatch errors
  • Missing CUDA development files
Solutions:
1. Verify GPU compatibility

Flash Attention only works on:
  • Turing architecture: T4, RTX 2080, etc.
  • Ampere architecture: A100, RTX 3090, etc.
  • Ada architecture: RTX 4090, etc.
  • Hopper architecture: H100, etc.
Check your GPU:
nvidia-smi --query-gpu=name --format=csv
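If PyTorch is already installed, a quick programmatic check works too; this is a small sketch, relying on the fact that Turing corresponds to compute capability 7.5 (Ampere 8.0/8.6, Ada 8.9, Hopper 9.0), the oldest architecture listed above:
# Requires PyTorch with CUDA available
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("Flash Attention supported:", (major, minor) >= (7, 5))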
2. Verify CUDA version

Flash Attention requires CUDA 11.4+:
nvidia-smi  # Check Driver Version and CUDA Version
nvcc --version  # Check installed CUDA toolkit
3. Install from source

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
pip install .
4. Alternative: Skip Flash Attention

Flash Attention is optional. If installation continues to fail, proceed without it:
# Models will work fine without flash attention
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_flash_attn=False  # Explicitly disable
).eval()

Package Dependency Conflicts

Error: Version conflicts between transformers, peft, optimum, and auto-gptq.
Recommended versions:
# For torch 2.1+
pip install "torch>=2.1"
pip install "auto-gptq>=0.5.1"
pip install "transformers>=4.35.0"
pip install "optimum>=1.14.0"
pip install "peft>=0.6.1,<0.8.0"

# For torch 2.0.x
pip install "torch>=2.0,<2.1"
pip install "auto-gptq<0.5.0"
pip install "transformers<4.35.0"
pip install "optimum<1.14.0"
pip install "peft>=0.5.0,<0.6.0"

Git LFS Files Not Downloaded

Symptoms:
  • qwen.tiktoken is only a few bytes (text pointer)
  • Model files are text pointers instead of actual binaries
  • “File not found” errors for model checkpoints
Solution:
# Install git-lfs
git lfs install

# Pull LFS files
cd /path/to/Qwen
git lfs pull

# Verify qwen.tiktoken is ~2MB, not a text file
ls -lh qwen.tiktoken

Model Loading Issues

Model Won’t Load Locally

Checklist:
# 1. Check that all required files are present:
ls -lh model_directory/

# Required files:
# - config.json
# - generation_config.json
# - model*.safetensors (or model*.bin)
# - tokenizer_config.json
# - qwen.tiktoken
# - modeling_qwen.py
# - tokenization_qwen.py
# - configuration_qwen.py

# 2. Update to the latest code:
cd Qwen
git pull origin main

# Verify you're on the latest version
git log -1

# 3. Always pass trust_remote_code for Qwen models
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model",
    trust_remote_code=True  # This is required!
)

# 4. Test loading a checkpoint file (safetensors files cannot be opened with torch.load)
from safetensors.torch import load_file

checkpoint = load_file("model.safetensors")
print(f"Checkpoint loaded successfully, {len(checkpoint)} keys")

Out of Memory (OOM) When Loading

Symptoms:
  • RuntimeError: CUDA out of memory
  • System freezes when loading model
  • Model loads but crashes during inference
Solutions:
1. Use quantized models

# Int4 uses ~50% less memory than Int8, ~75% less than BF16
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
2. Enable device_map='auto'

# Automatically distributes model across available devices
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",  # Important for multi-GPU
    trust_remote_code=True
).eval()
3. Use CPU offloading

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    offload_folder="offload",  # Offload to disk
    offload_state_dict=True,
    trust_remote_code=True
).eval()
4. Switch to a smaller model

If none of the above work, use a smaller model size:
  • Qwen-7B → Qwen-1.8B
  • Qwen-14B → Qwen-7B
  • Qwen-72B → Qwen-14B or Qwen-7B
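For example, only the checkpoint name changes when dropping to the 1.8B chat model (on the Hugging Face Hub its id is written with an underscore, Qwen/Qwen-1_8B-Chat); a minimal sketch:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same loading code, smaller checkpoint
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()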

Inference Issues

Gibberish or Garbled Output

Problem 1: Using base model instead of chat model
# Wrong - base model doesn't follow instructions
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", ...)

# Correct - use chat model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", ...)
Problem 2: Incomplete UTF-8 sequences in streaming
# Solution: Update to latest code
cd Qwen
git pull

# Or set error handling
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    errors="ignore"  # or "replace"
)
Problem 3: Wrong decoding parameters
# Use appropriate sampling parameters
response, history = model.chat(
    tokenizer,
    "Your question",
    history=history,
    temperature=0.7,  # Lower = more deterministic
    top_p=0.9,
    top_k=50
)

Model Not Following Instructions

Check 1: Using correct model type
# Verify you loaded the -Chat model
print(model.config.name_or_path)  
# Should contain "-Chat"
Check 2: Using the correct prompt format. For Qwen-Chat, use the chat() method:
# Correct
response, history = model.chat(tokenizer, "Hello", history=None)

# Wrong - don't use generate() directly for chat models
response = model.generate(...)  
Check 3: System prompt (for Qwen-72B-Chat and Qwen-1.8B-Chat)
# Use system prompt for better instruction following
response, history = model.chat(
    tokenizer,
    "Your question",
    history=None,
    system="You are a helpful assistant."
)

Slow Inference Speed

Diagnosis:
import time

start = time.time()
response, history = model.chat(tokenizer, "Hello", history=None)
end = time.time()

print(f"Time: {end - start:.2f}s")
print(f"Tokens: {len(tokenizer.encode(response))}")
print(f"Speed: {len(tokenizer.encode(response)) / (end - start):.2f} tokens/s")
Solutions:
Enable Flash Attention (requires a compatible GPU):
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_flash_attn=True  # Requires compatible GPU
).eval()
Use a quantized model:
# Int4 is faster than BF16
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
Update to the latest code:
cd Qwen
git pull
pip install -r requirements.txt --upgrade
vLLM provides optimized inference:
pip install vllm

# See deployment documentation for details
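A minimal offline-inference sketch with vLLM (the prompt and sampling values are illustrative; see the deployment documentation for the full setup, including the chat prompt format):
from vllm import LLM, SamplingParams

# Load the model with vLLM's optimized runtime
llm = LLM(model="Qwen/Qwen-7B-Chat", trust_remote_code=True)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Hello, who are you?"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)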

Poor Performance on Long Context

Enable NTK and LogN attention:
# Check config.json
import json

with open("config.json") as f:
    config = json.load(f)

print("use_dynamic_ntk:", config.get("use_dynamic_ntk"))  # Should be true
print("use_logn_attn:", config.get("use_logn_attn"))      # Should be true
If they are false, enable them manually:
model.config.use_dynamic_ntk = True
model.config.use_logn_attn = True

Fine-tuning Issues

OOM During Training

Solutions in order of effectiveness:
1. Use Q-LoRA instead of LoRA

# Q-LoRA uses quantized base model
bash finetune/finetune_qlora_single_gpu.sh
Saves ~40-50% memory compared to LoRA.
2. Reduce batch size, increase gradient accumulation

# In training script:
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16
3. Enable gradient checkpointing

--gradient_checkpointing True
4. Use DeepSpeed ZeRO

# For LoRA training
bash finetune/finetune_lora_ds.sh
5. Reduce sequence length

--model_max_length 1024  # Instead of 2048

Training Loss Not Decreasing

Checklist:
1. Verify the data format. Each sample should look like:
{
  "id": "unique_id",
  "conversations": [
    {"from": "user", "value": "Question"},
    {"from": "assistant", "value": "Answer"}
  ]
}
2. Try different learning rates:
--learning_rate 1e-5  # Default
--learning_rate 5e-6  # If loss explodes
--learning_rate 2e-5  # If loss doesn't move
3. Confirm the model is in training mode and that the expected parameters are trainable:
print(model.training)  # Should be True during training
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"Trainable: {name}")

Quantized Model Finetuning Issues

Problem: Can’t load LoRA adapter after Q-LoRA training
# Solution: Load with AutoPeftModelForCausalLM
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "path/to/adapter",
    device_map="auto",
    trust_remote_code=True
).eval()
Problem: Missing .cpp and .cu files after saving. Solution: manually copy these files from the original model directory (a small copy sketch follows the list below):
  • cache_autogptq_cuda_256.cpp
  • cache_autogptq_cuda_kernel_256.cu
  • Other .cpp and .cu files
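A small sketch of that copy step (both directory paths below are placeholders to adjust for your setup):
import shutil
from pathlib import Path

src = Path("/path/to/Qwen-7B-Chat-Int4")   # original quantized model directory (placeholder)
dst = Path("/path/to/finetuned-model")     # directory where the finetuned model was saved (placeholder)

# Copy the CUDA kernel sources so they sit next to the saved weights
for pattern in ("*.cpp", "*.cu"):
    for f in src.glob(pattern):
        shutil.copy(f, dst / f.name)
        print(f"Copied {f.name}")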

Quantization Issues

AutoGPTQ Installation Fails

Check PyTorch and CUDA compatibility:
python -c "import torch; print(torch.__version__); print(torch.version.cuda)"
Install matching auto-gptq wheel:
# For torch 2.1 + CUDA 11.8
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

# For torch 2.0 + CUDA 11.8  
pip install "auto-gptq<0.5.0" --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
See AutoGPTQ repo for more wheels.

Quantized Model Slower Than Expected

Note: Loading a GPTQ checkpoint with AutoModelForCausalLM.from_pretrained() is roughly 20% slower than using the auto-gptq library directly. This is a known issue that has been reported to the Hugging Face team. Workaround: use the auto-gptq library directly for maximum speed (see the sketch below).
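A minimal sketch of loading the Int4 checkpoint through auto-gptq directly (the arguments shown are illustrative; check the AutoGPTQ documentation for your installed version):
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True)

# Load the GPTQ checkpoint with auto-gptq's own loader instead of transformers
model = AutoGPTQForCausalLM.from_quantized(
    "Qwen/Qwen-7B-Chat-Int4",
    device="cuda:0",
    trust_remote_code=True,
    use_safetensors=True,
)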

Tool Usage and ReAct Issues

Plugin Not Being Called

Check prompt format:
# Make sure to use proper ReAct prompt format
# See examples/react_prompt.md for details

prompt = """Answer the following questions as best you can. You have access to the following tools:

{tool_descriptions}

Use the following format:

Question: the input question
Thought: think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (repeat Thought/Action/Action Input/Observation as needed)
Thought: I now know the final answer
Final Answer: the final answer

Question: {question}
Thought:"""

HuggingFace Agent Issues

Verify Qwen-Chat is being used:
from transformers import HfAgent

agent = HfAgent(
    "Qwen/Qwen-7B-Chat",  # Must be -Chat model
    trust_remote_code=True
)

Docker Issues

Container Fails to Start

Check GPU availability:
# Test NVIDIA Docker runtime
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Verify sufficient resources:
# Check available memory
free -h

# Check available disk space
df -h

Slow Image Download

Use a Docker registry mirror (especially for users in China):
# Configure Docker daemon.json
sudo vim /etc/docker/daemon.json
Add:
{
  "registry-mirrors": ["https://your-mirror.com"]
}
Restart Docker:
sudo systemctl restart docker

Platform-Specific Issues

Windows

Long path issues:
# Enable long paths in Windows
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 -PropertyType DWORD -Force
WSL2 is recommended for better compatibility:
wsl --install
wsl --set-default-version 2

macOS

Metal/MPS is not officially supported. Use CPU inference or cloud deployment.
# CPU-only on macOS
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="cpu",
    trust_remote_code=True
).eval()

Getting Help

If issues persist after trying these solutions:
  1. Search the existing GitHub issues.
  2. Check the FAQ.
  3. Open a new issue with:
    • Full error traceback
    • Environment details (python --version, pip list, nvidia-smi)
    • Minimal reproducible code
    • Steps already tried
  4. Join the community.
When reporting issues, provide as much context as you can, and write in English where possible so that more people can understand and help.