Qwen supports multiple fine-tuning approaches to adapt the pretrained models to your specific tasks and domains. This guide provides an overview of the available methods and helps you choose the right approach for your use case.
## Available Fine-tuning Methods
Qwen provides three primary fine-tuning methods, each with different memory requirements and training characteristics (a flag-level sketch follows the list):

- Full-Parameter: Update all model parameters for maximum performance
- LoRA: Efficient adapter-based training with low memory usage
- Q-LoRA: LoRA on quantized models for minimal GPU requirements
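As a rough sketch of how the three methods map onto the same entry point, the commands below use the `--use_lora`/`--q_lora` flag names from Qwen's `finetune.py`; the paths and the elided common arguments (`...`) are placeholders:

```bash
# Full-parameter: no adapter flags (all weights are updated)
python finetune.py --model_name_or_path Qwen/Qwen-7B --data_path data.json ...

# LoRA: freeze the base model and train low-rank adapters
python finetune.py --model_name_or_path Qwen/Qwen-7B --data_path data.json ... --use_lora True

# Q-LoRA: LoRA on top of a 4-bit quantized checkpoint
python finetune.py --model_name_or_path Qwen/Qwen-7B-Chat-Int4 --data_path data.json ... --use_lora True --q_lora True
```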
## Method Comparison

Choose your fine-tuning approach based on available resources and requirements:

| Method | GPU Memory (7B) | Training Speed | Performance | Use Case |
|---|---|---|---|---|
| Full-Parameter | ~43.5GB (2 GPUs) | Moderate | Highest | Production models with ample resources |
| LoRA | ~20.1GB (1 GPU) | Fast | High | Balanced approach for most use cases |
| LoRA (emb) | ~33.7GB (1 GPU) | Fast | High | Fine-tuning base models with new tokens |
| Q-LoRA | ~11.5GB (1 GPU) | Slower | Good | Limited GPU memory scenarios |
Memory statistics are for Qwen-7B with sequence length 256. Requirements increase with longer sequences.
## Model Size Considerations

### Memory Requirements by Model Size

Q-LoRA has the lowest memory footprint of the three methods; per-method GPU memory requirements for each model size are listed below.

#### Qwen-1.8B
- Q-LoRA: 5.8GB GPU memory
- LoRA: 6.7GB GPU memory
- Full-parameter: 43.5GB GPU memory (single GPU)
- Suitable for consumer GPUs (RTX 3090, 4090)
#### Qwen-7B
- Q-LoRA: 11.5GB GPU memory
- LoRA: 20.1GB GPU memory
- Full-parameter: Requires 2x A100 GPUs
- Recommended for professional workstations
#### Qwen-14B
- Q-LoRA: 18.7GB GPU memory
- LoRA: Requires multiple GPUs or DeepSpeed ZeRO-3
- Full-parameter: Requires 4+ A100 GPUs
- Enterprise-grade hardware required
#### Qwen-72B
- Q-LoRA: 61.4GB GPU memory (A100-80GB)
- LoRA + DeepSpeed ZeRO-3: 4x A100-80GB GPUs
- Full-parameter: Requires 8+ A100 GPUs
- Large-scale training infrastructure needed
## Key Features

### Training Framework Support

All fine-tuning methods support the following (a launch sketch follows this list):

- DeepSpeed: Distributed training with ZeRO optimization (stages 2 and 3)
- FSDP: Fully Sharded Data Parallel (alternative to DeepSpeed)
- Flash Attention 2: Accelerated training and reduced memory usage
- Gradient Checkpointing: Trade computation for memory savings
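As a sketch of how these options are typically combined in one launch, the command below assumes the repo's `finetune.py` entry point and the ZeRO-2 config shipped in `finetune/`; treat the exact flags as assumptions and defer to the provided shell scripts, and note that the model and data paths are placeholders:

```bash
# Two-GPU DeepSpeed ZeRO-2 run with gradient checkpointing and BF16;
# see the scripts in finetune/ for the authoritative argument lists.
torchrun --nproc_per_node 2 finetune.py \
  --model_name_or_path Qwen/Qwen-7B \
  --data_path data.json \
  --bf16 True \
  --output_dir output_qwen \
  --gradient_checkpointing True \
  --deepspeed finetune/ds_config_zero2.json
```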
### Supported Precision

The provided scripts train in mixed precision: BF16 (the default, recommended on Ampere and newer GPUs) or FP16. For Q-LoRA, the frozen base weights are additionally loaded from a 4-bit quantized (Int4) checkpoint.
## Training Script Overview

Qwen provides production-ready training scripts in the `finetune/` directory: `finetune_ds.sh` (full-parameter with DeepSpeed), `finetune_lora_single_gpu.sh` and `finetune_lora_ds.sh` (LoRA), and `finetune_qlora_single_gpu.sh` and `finetune_qlora_ds.sh` (Q-LoRA). All of them wrap the `finetune.py` entry point.
## Data Format

All fine-tuning methods use the same JSON conversation format, shown below.
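A minimal example (the `id` value and the message contents are illustrative):

```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "What does fine-tuning do?"
      },
      {
        "from": "assistant",
        "value": "Fine-tuning adapts a pretrained model to a specific task using additional training data."
      }
    ]
  }
]
```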
## Quick Start

### 1. Prepare Training Data
Create your training data in JSON format following the conversation structure above.
### 2. Choose Fine-tuning Method

Select the appropriate method based on your GPU memory and requirements (a launch example follows the list):
- Limited GPU memory (< 12GB): Use Q-LoRA with smaller models
- Single GPU (16-40GB): Use LoRA
- Multiple GPUs: Use LoRA or Full-parameter with DeepSpeed
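For example, using the wrapper scripts listed in the overview above (script names are from the Qwen repo's `finetune/` directory; edit the model and data paths inside the script before launching):

```bash
# Single-GPU LoRA fine-tuning
bash finetune/finetune_lora_single_gpu.sh

# Single-GPU Q-LoRA fine-tuning (expects a quantized Int4 base model)
bash finetune/finetune_qlora_single_gpu.sh
```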
## Performance Benchmarks

### Qwen-7B Fine-tuning Performance (Single A100-80GB)
| Sequence Length | LoRA Memory | LoRA Speed | Q-LoRA Memory | Q-LoRA Speed |
|---|---|---|---|---|
| 256 | 20.1GB | 1.2s/iter | 11.5GB | 3.0s/iter |
| 512 | 20.4GB | 1.5s/iter | 11.5GB | 3.0s/iter |
| 1024 | 21.5GB | 2.8s/iter | 12.3GB | 3.5s/iter |
| 2048 | 23.8GB | 5.2s/iter | 13.9GB | 7.0s/iter |
| 4096 | 29.7GB | 10.1s/iter | 16.9GB | 11.6s/iter |
| 8192 | 36.6GB | 21.3s/iter | 23.5GB | 22.3s/iter |
Batch size: 1, Gradient accumulation: 8, Flash Attention 2 enabled
## Special Considerations

### Base Model vs Chat Model

When fine-tuning base models (e.g., Qwen-7B) with LoRA:

- The embedding (`wte`) and output (`lm_head`) layers are automatically set as trainable
- This is necessary for the model to learn the ChatML format tokens
- Requires more memory than fine-tuning chat models
- Cannot use DeepSpeed ZeRO-3 with trainable embeddings

When fine-tuning chat models (e.g., Qwen-7B-Chat) with LoRA:

- No additional trainable parameters are needed
- Lower memory requirements
- Compatible with DeepSpeed ZeRO-3
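To make the base-model case concrete, here is a minimal peft sketch; the target module names and hyperparameters mirror the defaults in Qwen's `finetune.py` as best understood, but treat them as assumptions and defer to the script itself:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base (non-chat) model; trust_remote_code is required for Qwen-1 models.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj", "w1", "w2"],  # Qwen attention/MLP projections (assumed defaults)
    # Base models only: keep the embedding and output layers trainable so the
    # ChatML tokens can be learned. Omit this for chat models; note that
    # trainable embeddings are incompatible with DeepSpeed ZeRO-3.
    modules_to_save=["wte", "lm_head"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports how many parameters LoRA actually trains
```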
## Next Steps

- Data Preparation: Learn how to prepare high-quality training data
- Multi-node Training: Scale training across multiple machines