
Qwen models outperform baseline models of similar sizes across a series of benchmark datasets that evaluate natural language understanding, mathematical problem solving, coding, and more.

Overall Performance

Qwen-72B achieves better performance than LLaMA2-70B on all tasks and outperforms GPT-3.5 on 7 out of 10 tasks.

Figure: Qwen-72B performance radar chart.

Base Model Benchmarks

Performance comparison across major benchmarks for all Qwen base models:
| Model | MMLU (5-shot) | C-Eval (5-shot) | GSM8K (8-shot) | MATH (4-shot) | HumanEval (0-shot) | MBPP (3-shot) | BBH (3-shot) | CMMLU (5-shot) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Yi-34B | 76.3 | 81.8 | 67.9 | 15.9 | 26.2 | 38.2 | 66.4 | 82.6 |
| XVERSE-65B | 70.8 | 68.6 | 60.3 | - | 26.3 | - | - | - |
| Qwen-1.8B | 45.3 | 56.1 | 32.3 | 2.3 | 15.2 | 14.2 | 22.3 | 52.1 |
| Qwen-7B | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| Qwen-14B | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |
| Qwen-72B | 77.4 | 83.3 | 78.9 | 35.2 | 35.4 | 52.2 | 67.7 | 83.6 |
For all compared models, we report the better of their officially reported results and their OpenCompass scores.
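The shot counts in the header indicate how many worked exemplars precede each test question. As a rough illustration only (the exact templates live in the respective evaluation harnesses), a few-shot prompt is assembled along these lines:

```python
# Rough illustration of n-shot prompt assembly; the exact templates used
# for these benchmarks live in the respective evaluation harnesses.
def build_few_shot_prompt(exemplars: list[tuple[str, str]], question: str) -> str:
    """Prepend solved exemplars to the test question, leaving the final
    answer slot empty for the model to complete."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in exemplars)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"

# A 2-shot example in the GSM8K style (the table above uses 8 shots).
demos = [
    ("What is 2 + 3?", "2 + 3 = 5. The answer is 5."),
    ("What is 10 - 4?", "10 - 4 = 6. The answer is 6."),
]
print(build_few_shot_prompt(demos, "What is 7 * 6?"))
```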

World Knowledge

C-Eval Performance

C-Eval is a comprehensive Chinese evaluation benchmark covering 52 subjects across the humanities, social sciences, STEM, and other specialties.

Qwen-7B on C-Eval Validation Set:
| Model | Average |
| --- | --- |
| Alpaca-7B | 28.9 |
| Vicuna-7B | 31.2 |
| ChatGLM-6B | 37.1 |
| Baichuan-7B | 42.7 |
| ChatGLM2-6B | 50.9 |
| InternLM-7B | 53.4 |
| ChatGPT | 53.5 |
| Claude-v1.3 | 55.5 |
| Qwen-7B | 60.8 |
Qwen-7B on C-Eval Test Set:
| Model | Avg. | Avg. (Hard) | STEM | Social Sciences | Humanities | Others |
| --- | --- | --- | --- | --- | --- | --- |
| ChatGLM-6B | 38.9 | 29.2 | 33.3 | 48.3 | 41.3 | 38.0 |
| Chinese-Alpaca-Plus-13B | 41.5 | 30.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| Baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
| ChatGLM2-6B | 51.7 | 37.1 | 48.6 | 60.5 | 51.3 | 49.8 |
| InternLM-7B | 52.8 | 37.1 | 48.0 | 67.4 | 55.4 | 45.8 |
| Qwen-7B | 59.6 | 41.0 | 52.8 | 74.1 | 63.1 | 55.2 |

MMLU Performance

MMLU evaluates English comprehension across 57 subtasks spanning different academic fields and difficulty levels.

5-shot MMLU accuracy:
| Model | Average | STEM | Social Sciences | Humanities | Others |
| --- | --- | --- | --- | --- | --- |
| LLaMA-7B | 35.1 | 30.5 | 38.3 | 34.0 | 38.1 |
| Baichuan-7B | 42.3 | 35.6 | 48.9 | 38.4 | 48.1 |
| LLaMA2-7B | 45.3 | 36.4 | 51.2 | 42.9 | 52.2 |
| LLaMA-13B | 46.9 | 35.8 | 53.8 | 45.0 | 53.3 |
| ChatGLM2-6B | 47.9 | 41.2 | 54.4 | 43.7 | 54.5 |
| InternLM-7B | 51.0 | - | - | - | - |
| Baichuan-13B | 51.6 | 41.6 | 60.9 | 47.4 | 58.5 |
| LLaMA2-13B | 54.8 | 44.1 | 62.6 | 52.8 | 61.1 |
| ChatGLM2-12B | 56.2 | 48.2 | 65.1 | 52.6 | 60.9 |
| Qwen-7B | 56.7 | 47.6 | 65.9 | 51.5 | 64.7 |

Coding Capabilities

HumanEval

Zero-shot Pass@1 performance on the HumanEval benchmark:
| Model | Pass@1 |
| --- | --- |
| Baichuan-7B | 9.2 |
| ChatGLM2-6B | 9.2 |
| InternLM-7B | 10.4 |
| LLaMA-7B | 10.5 |
| LLaMA2-7B | 12.8 |
| Baichuan-13B | 12.8 |
| LLaMA-13B | 15.8 |
| MPT-7B | 18.3 |
| LLaMA2-13B | 18.3 |
| Qwen-7B | 24.4 |
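Pass@1 means a problem counts as solved only if a single generated sample passes all of its unit tests. When several samples are drawn per problem, the standard unbiased pass@k estimator from the Codex paper (Chen et al., 2021) is typically used; a minimal sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, passes the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 10 generations per problem, 3 passing -> pass@1 estimate of 0.30
print(pass_at_k(n=10, c=3, k=1))
```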

Mathematical Reasoning

GSM8K

8-shot accuracy on the GSM8K benchmark:
| Model | Accuracy |
| --- | --- |
| MPT-7B | 6.8 |
| Falcon-7B | 6.8 |
| Baichuan-7B | 9.7 |
| LLaMA-7B | 11.0 |
| LLaMA2-7B | 14.6 |
| LLaMA-13B | 17.8 |
| Baichuan-13B | 26.6 |
| LLaMA2-13B | 28.7 |
| InternLM-7B | 31.2 |
| ChatGLM2-6B | 32.4 |
| ChatGLM2-12B | 40.9 |
| Qwen-7B | 51.6 |

Translation Capabilities

WMT22

5-shot BLEU scores on WMT22 translation tasks:
| Model | Average | zh-en | en-zh |
| --- | --- | --- | --- |
| InternLM-7B | 11.8 | 9.0 | 14.5 |
| LLaMA-7B | 12.7 | 16.7 | 8.7 |
| LLaMA-13B | 15.8 | 19.5 | 12.0 |
| LLaMA2-7B | 19.9 | 21.9 | 17.9 |
| Bloom-7B | 20.3 | 19.1 | 21.4 |
| LLaMA2-13B | 23.3 | 22.4 | 24.2 |
| PolyLM-13B | 23.6 | 20.2 | 27.0 |
| Baichuan-7B | 24.6 | 22.6 | 26.6 |
| Qwen-7B | 27.5 | 24.3 | 30.6 |
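The BLEU scores are corpus-level. A minimal scoring sketch with sacrebleu, the standard WMT scorer (the prompting and decoding setup behind the table above is not reproduced here):

```python
# Corpus-level BLEU with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["The cat sat on the mat."]    # system outputs, one per segment
references = [["The cat sat on the mat."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```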

Chat Model Performance

World Knowledge (Chat)

Zero-shot C-Eval Validation Set:
| Model | Avg. Acc. |
| --- | --- |
| LLaMA2-7B-Chat | 31.9 |
| LLaMA2-13B-Chat | 40.6 |
| Chinese-Alpaca-2-7B | 41.3 |
| Chinese-Alpaca-Plus-13B | 43.3 |
| Baichuan-13B-Chat | 50.4 |
| ChatGLM2-6B-Chat | 50.7 |
| InternLM-7B-Chat | 53.2 |
| Qwen-7B-Chat | 54.2 |
Zero-shot MMLU:
| Model | Avg. Acc. |
| --- | --- |
| ChatGLM2-6B-Chat | 45.5 |
| LLaMA2-7B-Chat | 47.0 |
| InternLM-7B-Chat | 50.8 |
| Baichuan-13B-Chat | 52.1 |
| ChatGLM2-12B-Chat | 52.1 |
| Qwen-7B-Chat | 53.9 |

Coding (Chat)

Zero-shot Pass@1 on HumanEval:
| Model | Pass@1 |
| --- | --- |
| LLaMA2-7B-Chat | 12.2 |
| InternLM-7B-Chat | 14.0 |
| Baichuan-13B-Chat | 16.5 |
| LLaMA2-13B-Chat | 18.9 |
| Qwen-7B-Chat | 24.4 |

Math (Chat)

GSM8K performance:
| Model | Zero-shot Acc. | 4-shot Acc. |
| --- | --- | --- |
| ChatGLM2-6B-Chat | - | 28.0 |
| LLaMA2-7B-Chat | 20.4 | 28.2 |
| LLaMA2-13B-Chat | 29.4 | 36.7 |
| InternLM-7B-Chat | 32.6 | 34.5 |
| Baichuan-13B-Chat | - | 36.3 |
| ChatGLM2-12B-Chat | - | 38.1 |
| Qwen-7B-Chat | 41.1 | 43.5 |

Quantized Model Performance

Quantized models maintain near-lossless performance while improving memory efficiency:
| Model | MMLU | C-Eval (val) | GSM8K | HumanEval |
| --- | --- | --- | --- | --- |
| Qwen-1.8B-Chat (BF16) | 43.3 | 55.6 | 33.7 | 26.2 |
| Qwen-1.8B-Chat (Int8) | 43.1 | 55.8 | 33.0 | 27.4 |
| Qwen-1.8B-Chat (Int4) | 42.9 | 52.8 | 31.2 | 25.0 |
| Qwen-7B-Chat (BF16) | 55.8 | 59.7 | 50.3 | 37.2 |
| Qwen-7B-Chat (Int8) | 55.4 | 59.4 | 48.3 | 34.8 |
| Qwen-7B-Chat (Int4) | 55.1 | 59.2 | 49.7 | 29.9 |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 60.1 | 43.9 |
| Qwen-14B-Chat (Int8) | 63.6 | 68.6 | 60.0 | 48.2 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |
| Qwen-72B-Chat (BF16) | 74.4 | 80.1 | 76.4 | 64.6 |
| Qwen-72B-Chat (Int8) | 73.5 | 80.1 | 73.5 | 62.2 |
| Qwen-72B-Chat (Int4) | 73.4 | 80.1 | 75.3 | 61.6 |
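The quantized chat checkpoints are published on the Hugging Face Hub (e.g. Qwen/Qwen-7B-Chat-Int4). A minimal loading sketch, assuming the auto-gptq and optimum dependencies listed in the Qwen repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",  # GPTQ Int4 chat checkpoint
    device_map="auto",
    trust_remote_code=True,
).eval()

# The Qwen remote code exposes a chat() helper on the model.
response, _ = model.chat(tokenizer, "Hello!", history=None)
print(response)
```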

Tool Usage Capabilities

Qwen-7B-Chat performance on tool selection and usage.

ReAct Prompting Evaluation:
| Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
| --- | --- | --- | --- |
| GPT-4 | 95% | 0.90 | 15.0% |
| GPT-3.5 | 85% | 0.88 | 75.0% |
| Qwen-7B | 99% | 0.89 | 9.7% |
HuggingFace Agent Benchmark:
| Model | Tool Selection↑ | Tool Used↑ | Code↑ |
| --- | --- | --- | --- |
| GPT-4 | 100.00 | 100.00 | 97.41 |
| GPT-3.5 | 95.37 | 96.30 | 87.04 |
| StarCoder-15.5B | 87.04 | 87.96 | 68.89 |
| Qwen-7B | 90.74 | 92.59 | 74.07 |
None of the plugins in the evaluation set appear in Qwen's training data, so these results reflect genuine generalization to unseen tools.
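ReAct prompting interleaves Thought, Action, and Observation steps so the model can pick a tool, supply its input, and read back the result. An illustrative skeleton of such a prompt, similar in spirit to the format in the QwenLM/Qwen repository (the "search" tool below is hypothetical):

```python
# Illustrative ReAct-style prompt skeleton; see the QwenLM/Qwen repository
# for the exact templates used in the evaluation. The tool is hypothetical.
TOOL_DESC = "search: look up a query on the web. Input: a search string."

REACT_TEMPLATE = f"""Answer the following questions as best you can. You have access to the following tools:

{TOOL_DESC}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [search]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original question

Question: {{question}}"""

print(REACT_TEMPLATE.format(question="What is the tallest mountain on Earth?"))
```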

Long Context Performance

Perplexity (PPL) on arXiv dataset with extended context lengths:
| Model | 1024 | 2048 | 4096 | 8192 | 16384 |
| --- | --- | --- | --- | --- | --- |
| Qwen-7B | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 |
| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 |
| + dynamic_ntk + logn + local_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 |
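For reference, perplexity is the exponentiated mean per-token negative log-likelihood (lower is better):

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)
$$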
Qwen supports training-free extension of the inference context length from 2,048 to more than 8,192 tokens by combining NTK-aware interpolation (dynamic NTK), LogN attention scaling, and local window attention, as sketched below.
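A minimal sketch of switching these features on at load time, assuming the use_dynamic_ntk and use_logn_attn flags exposed by the Qwen remote code (verify the flag names against your checkpoint's config.json before relying on them):

```python
from transformers import AutoModelForCausalLM

# Config overrides passed through from_pretrained(); flag names assumed
# from the published Qwen config.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    trust_remote_code=True,
    use_dynamic_ntk=True,  # NTK-aware RoPE interpolation at long lengths
    use_logn_attn=True,    # LogN attention scaling
)
```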

Additional Resources

For detailed model performance on additional benchmark datasets, please refer to the technical report.