Overview
The Qwen Chat API provides methods for conversational interactions with the model. It supports both synchronous and streaming responses, multi-turn conversations with history, and custom system prompts.
chat() Method
Generate a complete response for a user query:
response, updated_history = model.chat(
    tokenizer,
    query="What is quantum computing?",
    history=None,
    system="You are a helpful assistant."
)
print(response)
Parameters
tokenizer
PreTrainedTokenizer
Tokenizer instance for encoding/decoding text
query
str
User’s current message or question
history
list[tuple[str, str]]
default:"None"
Conversation history as a list of (user_message, assistant_response) tuples:
history = [
    ("Hello", "Hi! How can I help you today?"),
    ("What's the weather?", "I don't have access to weather data.")
]
system
str
default:"You are a helpful assistant."
System prompt defining the assistant’s behavior and role
stop_words_ids
list[list[int]]
default:"None"
Token ID sequences that trigger generation termination:
stop_words_ids = [
    tokenizer.encode("<|im_end|>"),
    tokenizer.encode("\n\n")
]
Returns
response
str
The model’s generated response text
history
list[tuple[str, str]]
Updated conversation history including the current exchange
chat_stream() Method
Generate a streaming response for real-time display:
for partial_response in model.chat_stream(
    tokenizer,
    query="Explain neural networks",
    history=history,
    system="You are a helpful assistant."
):
    print(partial_response, end="", flush=True)
Parameters
Same as chat() method.
Yields
Incrementally generated response text. Each yield contains the full response up to the current point (not just the delta).
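Because each yield is cumulative, a caller that wants only the newly generated text can diff successive yields. A minimal sketch (assuming model and tokenizer are loaded as in the example below):
previous = ""
for partial in model.chat_stream(
    tokenizer,
    query="Explain neural networks",
    history=None
):
    # Each yield repeats the full response so far; print only the new suffix
    print(partial[len(previous):], end="", flush=True)
    previous = partial
print()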
Multi-turn Conversation Example
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True
)
# Initialize conversation
history = []
system = "You are a helpful AI assistant."
# First turn
response, history = model.chat(
    tokenizer,
    "Hello! Who are you?",
    history=history,
    system=system
)
print(f"Assistant: {response}")
# Second turn (with context)
response, history = model.chat(
    tokenizer,
    "What can you help me with?",
    history=history,
    system=system
)
print(f"Assistant: {response}")
# History now contains both exchanges
print(f"History length: {len(history)}")
Streaming Response Example
import sys
query = "Write a short poem about AI"
for response in model.chat_stream(
    tokenizer,
    query,
    history=history,
    system=system
):
    # Clear the current line, then rewrite the cumulative response
    sys.stdout.write('\r' + ' ' * 80 + '\r')
    sys.stdout.write(response)
    sys.stdout.flush()
print()  # New line after completion
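Note that the carriage-return trick only rewrites a single terminal line (and only its first 80 characters); for long or multi-line responses, printing just the new suffix of each yield, as sketched in the Yields section above, is more robust.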
Custom System Prompts
# Technical expert
system = "You are an expert software engineer specializing in Python."
response, history = model.chat(
    tokenizer,
    "How do I optimize this code?",
    system=system
)
# Creative writing
system = "You are a creative writing assistant who helps with storytelling."
response, history = model.chat(
    tokenizer,
    "Help me write a story about space exploration",
    system=system
)
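Note that the system prompt is a per-call argument and is not stored in history; pass it on every chat() or chat_stream() call where you want it to apply.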
Using Stop Words
# Stop generation at specific sequences
stop_words = ["Observation:", "<|endoftext|>"]
stop_words_ids = [tokenizer.encode(s) for s in stop_words]
response, history = model.chat(
tokenizer,
query="Generate a function call",
stop_words_ids=stop_words_ids
)
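Stop words like "Observation:" are useful for agent-style (e.g., ReAct) prompting, where generation pauses so a tool result can be appended before the model continues.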
Generation with Parameters
response, history = model.chat(
    tokenizer,
    query="Tell me a creative story",
    history=history,
    temperature=0.8,
    top_p=0.9,
    top_k=50,
    max_new_tokens=512
)
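These keyword arguments are passed through to generation. To change defaults for every request instead of per call, you can also adjust the model-level generation config; a sketch, assuming chat() falls back to model.generation_config as standard transformers models do:
model.generation_config.temperature = 0.8
model.generation_config.top_p = 0.9
model.generation_config.max_new_tokens = 512
# Subsequent chat()/chat_stream() calls pick up these defaults
response, history = model.chat(tokenizer, query="Tell me a creative story", history=None)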
ChatML Format
Internally, chat messages use the ChatML format:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hi! How can I help you today?<|im_end|>
The chat() and chat_stream() methods handle this formatting automatically.
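For illustration, here is a simplified sketch of how a conversation maps onto that layout (the actual prompt construction inside chat() may differ in details such as truncation and token handling):
def to_chatml(system, history, query):
    # Illustrative only: lay out the system prompt, past turns, and the new query
    parts = [f"<|im_start|>system\n{system}<|im_end|>"]
    for user_msg, assistant_msg in history:
        parts.append(f"<|im_start|>user\n{user_msg}<|im_end|>")
        parts.append(f"<|im_start|>assistant\n{assistant_msg}<|im_end|>")
    parts.append(f"<|im_start|>user\n{query}<|im_end|>")
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

print(to_chatml("You are a helpful assistant.", [("Hello!", "Hi! How can I help you today?")], "What is ChatML?"))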