> ## Documentation Index
> Fetch the complete documentation index at: https://mintlify.com/QwenLM/Qwen/llms.txt
> Use this file to discover all available pages before exploring further.

# Full-Parameter Fine-tuning

> Train Qwen models by updating all parameters for maximum performance

Full-parameter fine-tuning updates all model weights during training, providing the most comprehensive adaptation to your task. This method achieves the best performance but requires significant GPU resources.

## Overview

Full-parameter fine-tuning is recommended when:

* You have access to multiple high-memory GPUs (A100 80GB or similar)
* You need maximum model performance for production deployment
* Your task requires significant deviation from the pretrained behavior
* You can afford longer training times and higher computational costs

<Warning>
  Full-parameter fine-tuning of Qwen-7B requires at least **2x A100-80GB GPUs**. Single-GPU training will result in out-of-memory errors.
</Warning>

## Hardware Requirements

### Memory Requirements by Model Size

| Model     | GPUs Required | Memory per GPU | Total GPU Memory |
| --------- | ------------- | -------------- | ---------------- |
| Qwen-1.8B | 1x A100       | 43.5GB         | 43.5GB           |
| Qwen-7B   | 2x A100       | \~40GB         | \~80GB           |
| Qwen-14B  | 4x A100       | \~30GB         | \~120GB          |
| Qwen-72B  | 8x A100       | \~80GB         | \~640GB          |

### Performance Benchmarks

**Qwen-1.8B on Single A100-80GB:**

| Sequence Length | GPU Memory | Speed     |
| --------------- | ---------- | --------- |
| 256             | 43.5GB     | 2.1s/iter |
| 512             | 43.5GB     | 2.2s/iter |
| 1024            | 43.5GB     | 2.2s/iter |
| 2048            | 43.5GB     | 2.3s/iter |
| 4096            | 47.1GB     | 2.8s/iter |
| 8192            | 48.3GB     | 5.6s/iter |

<Note>
  Batch size: 1, Gradient accumulation: 8, Flash Attention 2 enabled, BF16 precision
</Note>

## Installation

Install the required dependencies:

```bash theme={null}
# Install base requirements
pip install -r requirements.txt

# Install DeepSpeed for distributed training
pip install deepspeed

# Install Flash Attention 2 (optional but recommended)
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
```

<Tip>
  Flash Attention 2 significantly reduces memory usage and improves training speed. Highly recommended for full-parameter training.
</Tip>

## Training Configuration

### Basic Training Script

The `finetune/finetune_ds.sh` script provides a complete configuration for distributed full-parameter training:

```bash finetune/finetune_ds.sh theme={null}
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1

# Number of GPUs per node
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')

# Multi-node settings (for single node, use defaults)
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-localhost}
MASTER_PORT=${MASTER_PORT:-6001}

MODEL="Qwen/Qwen-7B"
DATA="path_to_data.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --bf16 True \
    --output_dir output_qwen \
    --num_train_epochs 5 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 512 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed finetune/ds_config_zero3.json
```

## Running Training

<Steps>
  <Step title="Prepare Your Data">
    Create your training data in the required JSON format:

    ```json theme={null}
    [
      {
        "id": "sample_1",
        "conversations": [
          {"from": "user", "value": "Your question here"},
          {"from": "assistant", "value": "Expected response here"}
        ]
      }
    ]
    ```

    See [Data Preparation](/finetuning/data-preparation) for details.
  </Step>

  <Step title="Launch Training">
    Run the training script with your model and data paths:

    ```bash theme={null}
    bash finetune/finetune_ds.sh \
      -m Qwen/Qwen-7B \
      -d /path/to/your/data.json
    ```
  </Step>

  <Step title="Monitor Progress">
    Training logs will show loss and learning rate:

    ```text theme={null}
    {'loss': 2.345, 'learning_rate': 1e-05, 'epoch': 0.1}
    {'loss': 1.876, 'learning_rate': 9.5e-06, 'epoch': 0.2}
    ```

    Checkpoints are saved to `output_qwen/` every 1000 steps.
  </Step>

  <Step title="Load Fine-tuned Model">
    After training completes, load your model:

    ```python theme={null}
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        "output_qwen",
        device_map="auto",
        trust_remote_code=True
    ).eval()

    tokenizer = AutoTokenizer.from_pretrained(
        "output_qwen",
        trust_remote_code=True
    )

    response, history = model.chat(tokenizer, "Hello!", history=None)
    print(response)
    ```
  </Step>
</Steps>

## DeepSpeed Configuration

Full-parameter training uses **DeepSpeed ZeRO-3** to distribute model parameters across GPUs:

```json finetune/ds_config_zero3.json theme={null}
{
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "offload_param": {
            "device": "none",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
```

### ZeRO-3 Features

* **Parameter Sharding**: Distributes all model parameters across GPUs
* **Gradient Sharding**: Distributes gradients across GPUs
* **Optimizer State Sharding**: Distributes optimizer states across GPUs
* **Communication Overlap**: Overlaps communication with computation

<Tip>
  ZeRO-3 enables training of much larger models than would fit on a single GPU, but requires high inter-GPU bandwidth for best performance.
</Tip>

## Hyperparameter Guide

### Learning Rate

```bash theme={null}
--learning_rate 1e-5
```

Full-parameter fine-tuning uses a **lower learning rate** (1e-5) compared to LoRA (3e-4) because all parameters are being updated.

<AccordionGroup>
  <Accordion title="Recommended Learning Rates">
    * **Conservative**: 5e-6 (safer, slower convergence)
    * **Standard**: 1e-5 (recommended starting point)
    * **Aggressive**: 2e-5 (faster convergence, risk of instability)
  </Accordion>

  <Accordion title="Learning Rate Scheduling">
    ```bash theme={null}
    --lr_scheduler_type "cosine" \
    --warmup_ratio 0.01
    ```

    * Cosine decay with 1% warmup steps
    * Gradually reduces learning rate over training
    * Helps achieve better convergence
  </Accordion>
</AccordionGroup>

### Batch Size and Gradient Accumulation

```bash theme={null}
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16
```

Effective batch size = `per_device_batch_size × gradient_accumulation_steps × num_gpus`

For 2 GPUs: 1 × 16 × 2 = **32 effective batch size**

<Warning>
  Increasing `per_device_train_batch_size` beyond 1 may cause OOM errors. Adjust `gradient_accumulation_steps` instead.
</Warning>

### Sequence Length

```bash theme={null}
--model_max_length 512
```

Longer sequences require more memory:

| Max Length | Memory Impact | Use Case               |
| ---------- | ------------- | ---------------------- |
| 512        | Baseline      | Short conversations    |
| 1024       | +10-20%       | Standard conversations |
| 2048       | +30-50%       | Long conversations     |
| 4096       | +100%+        | Very long contexts     |
| 8192       | +200%+        | Maximum context        |

## Training Duration

### Number of Epochs

```bash theme={null}
--num_train_epochs 5
```

<AccordionGroup>
  <Accordion title="Dataset Size Guidelines">
    * **Small (less than 1K samples)**: 10-20 epochs
    * **Medium (1K-10K samples)**: 3-5 epochs
    * **Large (more than 10K samples)**: 1-3 epochs
  </Accordion>

  <Accordion title="Monitoring Overfitting">
    Watch for these signs of overfitting:

    * Training loss continues decreasing while validation loss increases
    * Model memorizes training examples verbatim
    * Poor generalization to new inputs

    Solutions:

    * Reduce number of epochs
    * Increase dataset size
    * Add regularization (weight decay)
  </Accordion>
</AccordionGroup>

## Checkpointing

```bash theme={null}
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 10
```

* Saves checkpoint every 1000 steps
* Keeps only the last 10 checkpoints
* Automatically deletes older checkpoints to save disk space

### Checkpoint Structure

```bash theme={null}
output_qwen/
├── checkpoint-1000/
│   ├── pytorch_model.bin
│   ├── config.json
│   ├── trainer_state.json
│   └── optimizer.pt
├── checkpoint-2000/
└── ...
```

## Advanced Options

### Gradient Checkpointing

```bash theme={null}
--gradient_checkpointing True
```

Trades computation for memory by recomputing activations during backward pass:

* **Memory savings**: 30-50% reduction
* **Speed impact**: 20-30% slower training
* **Recommended**: Always enable for full-parameter training

### Mixed Precision Training

<CodeGroup>
  ```bash BF16 (Recommended) theme={null}
  --bf16 True
  ```

  ```bash FP16 (Alternative) theme={null}
  --fp16 True \
  --deepspeed finetune/ds_config_zero3.json
  ```
</CodeGroup>

**BF16 advantages**:

* Wider dynamic range than FP16
* No loss scaling required
* Consistent with Qwen pretraining
* Requires Ampere GPUs or newer (A100, RTX 30xx+)

## Troubleshooting

<AccordionGroup>
  <Accordion title="Out of Memory Errors">
    Solutions:

    1. Reduce `model_max_length`
    2. Enable `gradient_checkpointing`
    3. Reduce `per_device_train_batch_size` to 1
    4. Add more GPUs
    5. Use DeepSpeed ZeRO-3 with CPU offloading:

    ```json theme={null}
    "offload_param": {
        "device": "cpu",
        "pin_memory": true
    }
    ```
  </Accordion>

  <Accordion title="Training Divergence (Loss → NaN)">
    Causes and solutions:

    * **Learning rate too high**: Reduce to 5e-6
    * **Gradient explosion**: Enable gradient clipping (automatic with DeepSpeed)
    * **Data quality issues**: Check for corrupted samples
    * **Mixed precision issues**: Try BF16 instead of FP16
  </Accordion>

  <Accordion title="Slow Training Speed">
    Optimizations:

    1. Enable Flash Attention 2
    2. Use `--lazy_preprocess True`
    3. Increase `gradient_accumulation_steps`, reduce `save_steps`
    4. Ensure high-bandwidth inter-GPU connection (NVLink)
    5. Profile with:

    ```bash theme={null}
    --report_to "tensorboard" \
    --logging_dir ./logs
    ```
  </Accordion>

  <Accordion title="DeepSpeed Initialization Errors">
    Common fixes:

    * Install compatible versions: `torch>=2.0`, `deepspeed>=0.10`
    * Check CUDA version compatibility
    * Verify all GPUs are accessible: `nvidia-smi`
    * Ensure consistent PyTorch versions across all nodes (for multi-node)
  </Accordion>
</AccordionGroup>

## Inference After Training

Load and use your fine-tuned model:

```python theme={null}
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "output_qwen",
    device_map="auto",
    trust_remote_code=True,
    bf16=True
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    "output_qwen",
    trust_remote_code=True
)

# Single-turn conversation
response, history = model.chat(
    tokenizer,
    "What can you help me with?",
    history=None
)
print(response)

# Multi-turn conversation
response, history = model.chat(
    tokenizer,
    "Tell me more about that",
    history=history
)
print(response)
```

## Next Steps

<CardGroup cols={2}>
  <Card title="LoRA Fine-tuning" icon="puzzle-piece" href="/finetuning/lora">
    Learn about memory-efficient LoRA training
  </Card>

  <Card title="Multi-node Training" icon="network-wired" href="/finetuning/multinode">
    Scale to multiple machines for even larger models
  </Card>
</CardGroup>
