

Overview

The Qwen Chat API provides methods for conversational interactions with the model. It supports both synchronous and streaming responses, multi-turn conversations with history, and custom system prompts.

chat() Method

Generate a complete response for a user query:
response, updated_history = model.chat(
    tokenizer,
    query="What is quantum computing?",
    history=None,
    system="You are a helpful assistant."
)

print(response)

Parameters

tokenizer (AutoTokenizer, required)
Tokenizer instance for encoding and decoding text.

query (str, required)
User’s current message or question.

history (list[tuple[str, str]], default: None)
Conversation history as a list of (user_message, assistant_response) tuples:

history = [
    ("Hello", "Hi! How can I help you today?"),
    ("What's the weather?", "I don't have access to weather data.")
]

system (str, default: "You are a helpful assistant.")
System prompt defining the assistant’s behavior and role.

stop_words_ids (list[list[int]], default: None)
Token ID sequences that trigger generation termination:

stop_words_ids = [
    tokenizer.encode("<|im_end|>"),
    tokenizer.encode("\n\n")
]

**gen_kwargs (dict)
Additional generation parameters (see GenerationConfig).

Returns

response (str)
The model’s generated response text.

history (list[tuple[str, str]])
Updated conversation history including the current exchange.

chat_stream() Method

Generate a streaming response for real-time display:
previous = ""
for partial_response in model.chat_stream(
    tokenizer,
    query="Explain neural networks",
    history=history,
    system="You are a helpful assistant."
):
    # Each yield is the full response so far, so print only the new part
    print(partial_response[len(previous):], end="", flush=True)
    previous = partial_response

Parameters

Same as chat() method.

Yields

partial_response
str
Incrementally generated response text. Each yield contains the full response up to the current point (not just the delta).
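This cumulative-yield contract can be exercised without loading a model. A minimal sketch, where fake_stream is a hypothetical stand-in for model.chat_stream(...):

```python
# fake_stream stands in for model.chat_stream(), which yields the full
# response generated so far on every iteration (not just the delta).
def fake_stream():
    full = "Hello there!"
    for i in range(1, len(full) + 1):
        yield full[:i]  # cumulative prefix, like chat_stream()

previous = ""
deltas = []
for partial in fake_stream():
    deltas.append(partial[len(previous):])  # extract only the new text
    previous = partial

print("".join(deltas))  # reassembles the full response
```

Extracting the suffix this way avoids re-printing already-displayed text when rendering the stream incrementally.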

Multi-turn Conversation Example

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)

# Initialize conversation
history = []
system = "You are a helpful AI assistant."

# First turn
response, history = model.chat(
    tokenizer,
    "Hello! Who are you?",
    history=history,
    system=system
)
print(f"Assistant: {response}")

# Second turn (with context)
response, history = model.chat(
    tokenizer,
    "What can you help me with?",
    history=history,
    system=system
)
print(f"Assistant: {response}")

# History now contains both exchanges
print(f"History length: {len(history)}")
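The (response, updated_history) contract shown above generalizes to a loop. A minimal sketch, where stub_chat is a hypothetical stand-in for model.chat(tokenizer, ...) so the pattern runs without a model:

```python
# stub_chat mimics chat()'s contract: it returns the response plus the
# history with the current (query, response) exchange appended.
def stub_chat(query, history=None):
    history = list(history or [])
    response = f"echo: {query}"        # placeholder for real model output
    history.append((query, response))
    return response, history

history = []
for turn in ["Hello!", "What can you do?"]:
    response, history = stub_chat(turn, history=history)

print(len(history))  # → 2, one tuple per completed exchange
```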

Streaming Response Example

import sys

query = "Write a short poem about AI"

for response in model.chat_stream(
    tokenizer,
    query,
    history=history
):
    # Clear and rewrite output
    sys.stdout.write('\r' + ' ' * 80 + '\r')
    sys.stdout.write(response)
    sys.stdout.flush()

print()  # New line after completion

Custom System Prompts

# Technical expert
system = "You are an expert software engineer specializing in Python."

response, history = model.chat(
    tokenizer,
    "How do I optimize this code?",
    system=system
)

# Creative writing
system = "You are a creative writing assistant who helps with storytelling."

response, history = model.chat(
    tokenizer,
    "Help me write a story about space exploration",
    system=system
)

Using Stop Words

# Stop generation at specific sequences
stop_words = ["Observation:", "<|endoftext|>"]
stop_words_ids = [tokenizer.encode(s) for s in stop_words]

response, history = model.chat(
    tokenizer,
    query="Generate a function call",
    stop_words_ids=stop_words_ids
)
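Conceptually, generation halts as soon as the output ends with one of the listed token-ID sequences. A hedged sketch of that check (the token IDs below are illustrative, not Qwen's actual vocabulary):

```python
def hits_stop(output_ids, stop_words_ids):
    """Return True if output_ids ends with any stop sequence."""
    return any(
        len(output_ids) >= len(stop) and output_ids[-len(stop):] == stop
        for stop in stop_words_ids
    )

stop_words_ids = [[151643], [198, 198]]  # illustrative IDs only
print(hits_stop([10, 20, 151643], stop_words_ids))  # True
print(hits_stop([10, 20, 30], stop_words_ids))      # False
```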

Generation with Parameters

response, history = model.chat(
    tokenizer,
    query="Tell me a creative story",
    history=history,
    temperature=0.8,
    top_p=0.9,
    top_k=50,
    max_new_tokens=512
)
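To build intuition for temperature: logits are divided by it before the softmax, so values below 1 sharpen the sampling distribution and values above 1 flatten it. A self-contained numeric sketch, no model involved:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax(logits, temperature=0.5)  # more peaked
flat = softmax(logits, temperature=2.0)   # more uniform
print(max(sharp) > max(flat))  # True: low temperature concentrates mass
```

top_p and top_k then restrict sampling to the most probable tokens after this scaling.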

Chat Message Format

Internally, chat messages use the ChatML format:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hi! How can I help you today?<|im_end|>
The chat() and chat_stream() methods handle this formatting automatically.
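A hedged sketch of that assembly (build_chatml is illustrative, not Qwen's internal code), showing how the system prompt, history, and current query become one ChatML prompt:

```python
def build_chatml(system, history, query):
    # One <|im_start|>role ... <|im_end|> block per message, then an
    # open assistant block that the model is expected to complete.
    parts = [f"<|im_start|>system\n{system}<|im_end|>"]
    for user_msg, assistant_msg in history:
        parts.append(f"<|im_start|>user\n{user_msg}<|im_end|>")
        parts.append(f"<|im_start|>assistant\n{assistant_msg}<|im_end|>")
    parts.append(f"<|im_start|>user\n{query}<|im_end|>")
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = build_chatml(
    "You are a helpful assistant.",
    [("Hello!", "Hi! How can I help you today?")],
    "Tell me a joke.",
)
print(prompt)
```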