Full-parameter fine-tuning updates all model weights during training, providing the most comprehensive adaptation to your task. This method achieves the best performance but requires significant GPU resources.
Overview
Full-parameter fine-tuning is recommended when:
- You have access to multiple high-memory GPUs (A100 80GB or similar)
- You need maximum model performance for production deployment
- Your task requires significant deviation from the pretrained behavior
- You can afford longer training times and higher computational costs
Hardware Requirements
Memory Requirements by Model Size
| Model | GPUs Required | Memory per GPU | Total GPU Memory |
|---|---|---|---|
| Qwen-1.8B | 1x A100 | 43.5GB | 43.5GB |
| Qwen-7B | 2x A100 | ~40GB | ~80GB |
| Qwen-14B | 4x A100 | ~30GB | ~120GB |
| Qwen-72B | 8x A100 | ~80GB | ~640GB |
Performance Benchmarks
Qwen-1.8B on a single A100-80GB:

| Sequence Length | GPU Memory | Speed |
|---|---|---|
| 256 | 43.5GB | 2.1s/iter |
| 512 | 43.5GB | 2.2s/iter |
| 1024 | 43.5GB | 2.2s/iter |
| 2048 | 43.5GB | 2.3s/iter |
| 4096 | 47.1GB | 2.8s/iter |
| 8192 | 48.3GB | 5.6s/iter |
Batch size: 1, Gradient accumulation: 8, Flash Attention 2 enabled, BF16 precision
Installation
Install the required dependencies:
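The repository's requirements file is authoritative; as a minimal sketch, the packages referenced elsewhere on this page can be installed with:

```bash
# Versions follow the troubleshooting notes below (torch>=2.0, deepspeed>=0.10).
# transformers and accelerate are assumed here; check the repository's
# requirements.txt for the full, authoritative list.
pip install "torch>=2.0" "deepspeed>=0.10" transformers accelerate

# Optional: Flash Attention 2, used in the benchmarks above
pip install flash-attn --no-build-isolation
```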
Training Configuration
Basic Training Script
The finetune/finetune_ds.sh script provides a complete configuration for distributed full-parameter training:
finetune/finetune_ds.sh
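The script itself is not reproduced here. Below is a representative sketch of the kind of launch it contains, built from the hyperparameters discussed on this page; the Python entry point name and the model/data paths are placeholders, and the repository's finetune/finetune_ds.sh remains authoritative.

```bash
#!/usr/bin/env bash
# Representative sketch only -- flag names follow Hugging Face TrainingArguments;
# consult the real finetune/finetune_ds.sh for the exact invocation.
MODEL="Qwen/Qwen-7B"          # placeholder: base model to fine-tune
DATA="path/to/train.json"     # placeholder: your training data
GPUS_PER_NODE=2

torchrun --nproc_per_node $GPUS_PER_NODE finetune.py \
  --model_name_or_path $MODEL \
  --data_path $DATA \
  --bf16 True \
  --output_dir output_qwen \
  --num_train_epochs 5 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --learning_rate 1e-5 \
  --warmup_ratio 0.01 \
  --lr_scheduler_type cosine \
  --model_max_length 512 \
  --save_steps 1000 \
  --save_total_limit 10 \
  --gradient_checkpointing True \
  --lazy_preprocess True \
  --deepspeed finetune/ds_config_zero3.json
```

In practice the script is typically launched directly, e.g. `bash finetune/finetune_ds.sh`.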
Running Training
Prepare Your Data
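Training data is a JSON file of multi-turn conversations. The field names below follow the format commonly used by the Qwen fine-tuning scripts and are an illustrative sketch; verify them against the repository's data-preparation documentation.

```json
[
  {
    "id": "identity_0",
    "conversations": [
      { "from": "user", "value": "What is full-parameter fine-tuning?" },
      { "from": "assistant", "value": "It updates every weight of the model on your task data." }
    ]
  }
]
```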
Monitor Progress
Training logs will show loss and learning rate. Checkpoints are saved to output_qwen/ every 1000 steps.

DeepSpeed Configuration
Full-parameter training uses DeepSpeed ZeRO-3 to distribute model parameters across GPUs:
finetune/ds_config_zero3.json
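The repository ships a ready-made config file. The sketch below shows only the ZeRO-3 settings relevant to the features listed next; the values are representative, not a copy of the repository file.

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```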
ZeRO-3 Features
- Parameter Sharding: Distributes all model parameters across GPUs
- Gradient Sharding: Distributes gradients across GPUs
- Optimizer State Sharding: Distributes optimizer states across GPUs
- Communication Overlap: Overlaps communication with computation
Hyperparameter Guide
Learning Rate
Recommended Learning Rates
- Conservative: 5e-6 (safer, slower convergence)
- Standard: 1e-5 (recommended starting point)
- Aggressive: 2e-5 (faster convergence, risk of instability)
Learning Rate Scheduling
- Cosine decay with 1% warmup steps
- Gradually reduces learning rate over training
- Helps achieve better convergence
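Assuming the launcher passes standard Hugging Face TrainingArguments flags (as in the script sketch above), these recommendations correspond to:

```bash
# Excerpt of the launch command; 1e-5 is the standard starting point above
--learning_rate 1e-5 \
--lr_scheduler_type cosine \
--warmup_ratio 0.01
```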
Batch Size and Gradient Accumulation
The effective batch size is:
per_device_train_batch_size × gradient_accumulation_steps × num_gpus
For 2 GPUs: 1 × 16 × 2 = 32 effective batch size
Sequence Length
| Max Length | Memory Impact | Use Case |
|---|---|---|
| 512 | Baseline | Short conversations |
| 1024 | +10-20% | Standard conversations |
| 2048 | +30-50% | Long conversations |
| 4096 | +100%+ | Very long contexts |
| 8192 | +200%+ | Maximum context |
Training Duration
Number of Epochs
Dataset Size Guidelines
- Small (less than 1K samples): 10-20 epochs
- Medium (1K-10K samples): 3-5 epochs
- Large (more than 10K samples): 1-3 epochs
Monitoring Overfitting
Watch for these signs of overfitting:
- Training loss continues decreasing while validation loss increases
- Model memorizes training examples verbatim
- Poor generalization to new inputs
If you observe overfitting:
- Reduce the number of epochs
- Increase dataset size
- Add regularization (weight decay)
Checkpointing
- Saves checkpoint every 1000 steps
- Keeps only the last 10 checkpoints
- Automatically deletes older checkpoints to save disk space
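Assuming the TrainingArguments-style flags from the launch sketch above, this behaviour corresponds to:

```bash
# Excerpt of the launch command
--output_dir output_qwen \
--save_steps 1000 \
--save_total_limit 10
```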
Checkpoint Structure
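The original listing is not reproduced here. As an assumption, a Hugging Face Trainer checkpoint saved under DeepSpeed ZeRO-3 typically looks like:

```text
output_qwen/
└── checkpoint-1000/
    ├── config.json
    ├── generation_config.json
    ├── model.safetensors           (or sharded model-0000x-of-0000y files)
    ├── tokenizer files
    ├── trainer_state.json
    └── global_step1000/            (ZeRO-3 partitioned optimizer/parameter states)
```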
Advanced Options
Gradient Checkpointing
- Memory savings: 30-50% reduction
- Speed impact: 20-30% slower training
- Recommended: Always enable for full-parameter training
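Assuming standard TrainingArguments flags, it is enabled with:

```bash
# Excerpt of the launch command
--gradient_checkpointing True
```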
Mixed Precision Training
BF16 is recommended:
- Wider dynamic range than FP16
- No loss scaling required
- Consistent with Qwen pretraining
- Requires Ampere GPUs or newer (A100, RTX 30xx+)
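BF16 is enabled via the standard flag; FP16 remains the fallback on pre-Ampere hardware:

```bash
# Excerpt of the launch command
--bf16 True     # Ampere or newer (A100, RTX 30xx+)
# --fp16 True   # fallback on older GPUs; requires loss scaling
```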
Troubleshooting
Out of Memory Errors
Solutions:
- Reduce model_max_length
- Enable gradient_checkpointing
- Reduce per_device_train_batch_size to 1
- Add more GPUs
- Use DeepSpeed ZeRO-3 with CPU offloading (see the sketch below):
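A minimal sketch of the relevant zero_optimization section, assuming you edit finetune/ds_config_zero3.json; offloading trades GPU memory for slower steps, and only the offload-related keys are shown:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true }
  }
}
```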
Training Divergence (Loss → NaN)
Causes and solutions:
- Learning rate too high: Reduce to 5e-6
- Gradient explosion: Enable gradient clipping (automatic with DeepSpeed)
- Data quality issues: Check for corrupted samples
- Mixed precision issues: Try BF16 instead of FP16
Slow Training Speed
Optimizations:
- Enable Flash Attention 2
- Use --lazy_preprocess True
- Increase gradient_accumulation_steps, reduce save_steps
- Ensure a high-bandwidth inter-GPU connection (NVLink)
- Profile with:
DeepSpeed Initialization Errors
Common fixes:
- Install compatible versions: torch>=2.0, deepspeed>=0.10
- Check CUDA version compatibility
- Verify all GPUs are accessible: nvidia-smi
- Ensure consistent PyTorch versions across all nodes (for multi-node)
Inference After Training
Load and use your fine-tuned model:
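A minimal sketch, assuming a Qwen-1 style checkpoint that exposes its chat() helper through trust_remote_code and was saved to the output_qwen directory used above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the fine-tuned checkpoint written by the training run above
model_path = "output_qwen"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=True,
).eval()

# Qwen-1 checkpoints provide a chat() convenience method via remote code
response, history = model.chat(tokenizer, "Hello! What can you do?", history=None)
print(response)
```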
Next Steps
LoRA Fine-tuning
Learn about memory-efficient LoRA training
Multi-node Training
Scale to multiple machines for even larger models