Beyond the basic web demo, Qwen can be integrated with Gradio in advanced ways to create sophisticated applications. This guide covers custom implementations, optimization techniques, and production-ready patterns.
Overview
The Qwen web demo (web_demo.py) serves as a foundation for building custom Gradio applications. This page explores advanced patterns, customizations, and best practices for production deployments.
Architecture
The Gradio integration uses several key components:
Gradio Interface
├── UI Components (Blocks API)
│   ├── Chatbot widget
│   ├── Textbox input
│   └── Action buttons
├── State Management
│   ├── Conversation history
│   └── Task state
├── Text Processing
│   ├── Markdown rendering (mdtex2html)
│   └── Code highlighting
└── Model Integration
    ├── Streaming generation
    └── Memory management
Custom Text Processing
Markdown Enhancement
The demo uses mdtex2html for enhanced markdown rendering:
def postprocess(self, y):
    if y is None:
        return []
    for i, (message, response) in enumerate(y):
        y[i] = (
            None if message is None else mdtex2html.convert(message),
            None if response is None else mdtex2html.convert(response),
        )
    return y

gr.Chatbot.postprocess = postprocess
This enables:
LaTeX equation rendering
Enhanced table formatting
Better code block styling
Proper handling of special characters
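For example, a quick standalone sketch of what mdtex2html produces (assuming the mdtex2html package is installed; the sample string is illustrative):

import mdtex2html

sample = "Inline math like $e^{i\\pi} + 1 = 0$ and **bold** text."
html = mdtex2html.convert(sample)
print(html)  # Markdown plus TeX converted to HTML for the Chatbot widget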
Code Block Formatting
The demo includes custom logic to format code blocks with syntax highlighting:
if "```" in line:
    count += 1
    items = line.split("`")
    if count % 2 == 1:
        lines[i] = f'<pre><code class="language-{items[-1]}">'
    else:
        lines[i] = "<br></code></pre>"
Special Character Handling
Inside code blocks, special characters are escaped:
line = line.replace("`", r"\`")
line = line.replace("<", "&lt;")
line = line.replace(">", "&gt;")
line = line.replace(" ", "&nbsp;")
line = line.replace("*", "&ast;")
line = line.replace("_", "&lowbar;")
line = line.replace("-", "&#45;")
State Management
Conversation History
Gradio’s State component maintains conversation context:
task_history = gr.State([])
The history structure:
task_history = [
    ("User message 1", "Assistant response 1"),
    ("User message 2", "Assistant response 2"),
    # ...
]
Display vs. Task History
The demo maintains two separate histories:
Chatbot Display (_chatbot): Formatted for UI display
Task History (_task_history): Raw text for model context
def predict(_query, _chatbot, _task_history):
    print(f"User: {_parse_text(_query)}")
    _chatbot.append((_parse_text(_query), ""))  # Display
    # ... generation ...
    _task_history.append((_query, full_response))  # Raw history
This separation ensures:
Clean display with formatting
Accurate model context without HTML
Independent management of each
Streaming Implementation
Real-Time Response Generation
The demo implements streaming using Python generators:
for response in model.chat_stream(tokenizer, _query, history=_task_history, generation_config=config):
    _chatbot[-1] = (_parse_text(_query), _parse_text(response))
    yield _chatbot
full_response = _parse_text(response)
Each yield statement updates the UI in real-time, creating a smooth streaming effect.
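As a minimal, self-contained sketch of the mechanism (independent of Qwen; the token list below is a stand-in for real model output), a generator wired to a Blocks event streams each yield straight into the Chatbot:

import time
import gradio as gr

def fake_stream(query, chat_history):
    # Append a new (user, assistant) pair, then grow the assistant side token by token
    chat_history = chat_history + [(query, "")]
    for token in ["Streaming", " looks", " like", " this."]:
        time.sleep(0.2)                                  # simulate generation latency
        chat_history[-1] = (query, chat_history[-1][1] + token)
        yield chat_history                               # each yield re-renders the Chatbot

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    query = gr.Textbox()
    query.submit(fake_stream, [query, chatbot], [chatbot])

demo.queue().launch()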
Benefits of Streaming
Immediate Feedback: Users see responses start appearing instantly
Better UX: Reduces perceived latency
Interruptible: Generation can be stopped if needed
Progress Indication: Shows the model is working
UI Components
Custom Branding
The interface includes Qwen branding:
gr.Markdown("""
<p align="center"><img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg" style="height: 80px"/><p>""")
gr.Markdown("""<center><font size=8>Qwen-Chat Bot</center>""")
Model Links
The demo displays links to model resources:
gr.Markdown("""
<center><font size=4>
Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖</a> |
<a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a> |
Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖</a> |
<a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a> |
...
</center>""")
Three main buttons control the interface:
with gr.Row():
    empty_btn = gr.Button("🧹 Clear History (清除历史)")
    submit_btn = gr.Button("🚀 Submit (发送)")
    regen_btn = gr.Button("🤔️ Regenerate (重试)")

submit_btn.click(predict, [query, chatbot, task_history], [chatbot], show_progress=True)
submit_btn.click(reset_user_input, [], [query])
empty_btn.click(reset_state, [chatbot, task_history], outputs=[chatbot], show_progress=True)
regen_btn.click(regenerate, [chatbot, task_history], [chatbot], show_progress=True)
Custom Implementations
Adding System Prompts
Extend the demo to support custom system prompts:
def predict_with_system(_query, _system_prompt, _chatbot, _task_history):
    # Prepend the system prompt once, at the start of the conversation
    if _system_prompt and len(_task_history) == 0:
        _task_history.append((f"<system>{_system_prompt}</system>", "Understood."))
    # Continue with normal prediction
    for response in model.chat_stream(tokenizer, _query, history=_task_history):
        # ...
Add to UI:
system_prompt = gr.Textbox(label="System Prompt", placeholder="You are a helpful assistant...")
Multi-Model Support
Allow users to switch between models:
models = {
    "Qwen-7B-Chat": load_model("Qwen/Qwen-7B-Chat"),
    "Qwen-14B-Chat": load_model("Qwen/Qwen-14B-Chat"),
}

def predict_multi_model(_query, _model_name, _chatbot, _task_history):
    model = models[_model_name]
    # Use the selected model for generation
Add model selector:
model_dropdown = gr.Dropdown(
    choices=["Qwen-7B-Chat", "Qwen-14B-Chat"],
    value="Qwen-7B-Chat",
    label="Model",
)
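To route the selection into generation, pass the dropdown as an extra input when wiring the submit button (a sketch reusing the button and state names from the demo):

submit_btn.click(
    predict_multi_model,
    [query, model_dropdown, chatbot, task_history],
    [chatbot],
    show_progress=True,
)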
Generation Configuration UI
Add controls for generation parameters:
from transformers import GenerationConfig

with gr.Accordion("Generation Settings", open=False):
    temperature = gr.Slider(0.1, 2.0, value=0.7, step=0.1, label="Temperature")
    top_p = gr.Slider(0.1, 1.0, value=0.8, step=0.05, label="Top-p")
    max_tokens = gr.Slider(128, 4096, value=2048, step=128, label="Max Tokens")

def predict_with_config(_query, _temp, _top_p, _max_tokens, _chatbot, _task_history):
    # Build a generation config from the slider values
    config = GenerationConfig(
        temperature=_temp,
        top_p=_top_p,
        max_new_tokens=_max_tokens,
    )
    for response in model.chat_stream(tokenizer, _query, history=_task_history, generation_config=config):
        # ...
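The sliders only take effect if they are passed as inputs to the event handler; a sketch of the wiring, matching the parameter order of predict_with_config above:

submit_btn.click(
    predict_with_config,
    [query, temperature, top_p, max_tokens, chatbot, task_history],
    [chatbot],
    show_progress=True,
)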
Export Conversation
Add functionality to export chat history:
import json
from datetime import datetime

def export_conversation(_task_history):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"qwen_chat_{timestamp}.json"
    data = {
        "timestamp": timestamp,
        "conversation": [
            {"role": role, "content": content}
            for user_msg, bot_msg in _task_history
            for role, content in (("user", user_msg), ("assistant", bot_msg))
        ],
    }
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return filename

export_file = gr.File(label="Exported Conversation")
export_btn = gr.Button("💾 Export")
export_btn.click(export_conversation, [task_history], [export_file])
Model Loading
Optimize model loading for faster startup:
# Cache models globally
_model_cache = {}

def get_model(checkpoint_path):
    if checkpoint_path not in _model_cache:
        model = AutoModelForCausalLM.from_pretrained(
            checkpoint_path,
            device_map="auto",
            trust_remote_code=True,
        ).eval()
        _model_cache[checkpoint_path] = model
    return _model_cache[checkpoint_path]
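The same caching pattern applies to tokenizers; a sketch mirroring get_model (the AutoTokenizer import is assumed):

from transformers import AutoTokenizer

_tokenizer_cache = {}

def get_tokenizer(checkpoint_path):
    # Tokenizers load quickly, but caching still avoids repeated disk reads
    if checkpoint_path not in _tokenizer_cache:
        _tokenizer_cache[checkpoint_path] = AutoTokenizer.from_pretrained(
            checkpoint_path, trust_remote_code=True
        )
    return _tokenizer_cache[checkpoint_path]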
Memory Management
Implement aggressive memory management:
import torch
import gc

def aggressive_cleanup():
    """Thorough memory cleanup."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

# Call after clearing history or long conversations
empty_btn.click(lambda: aggressive_cleanup(), None, None)
Response Caching
Cache common responses to reduce computation:
import hashlib

response_cache = {}

def get_cache_key(query, history):
    content = query + str(history)
    return hashlib.md5(content.encode()).hexdigest()

def predict_with_cache(_query, _chatbot, _task_history):
    cache_key = get_cache_key(_query, _task_history)
    if cache_key in response_cache:
        response = response_cache[cache_key]
        _chatbot.append((_parse_text(_query), _parse_text(response)))
        return _chatbot
    # Normal generation...
    response_cache[cache_key] = full_response
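Note that response_cache grows without bound; one way to cap it (a sketch with a hypothetical limit, relying on Python dicts preserving insertion order):

MAX_CACHE_ENTRIES = 256  # hypothetical limit

def cache_response(cache_key, response):
    response_cache[cache_key] = response
    if len(response_cache) > MAX_CACHE_ENTRIES:
        # Evict the oldest inserted entry
        response_cache.pop(next(iter(response_cache)))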
Concurrent Request Handling
Gradio’s queue system handles concurrency, but you can optimize:
demo.queue(
    concurrency_count=4,  # Process up to 4 requests simultaneously
    max_size=20,          # Queue up to 20 requests
).launch(
    # launch options...
)
Production Best Practices
Error Handling
Implement robust error handling:
def predict_safe(_query, _chatbot, _task_history):
    _chatbot.append((_parse_text(_query), ""))  # placeholder entry to update in place
    try:
        for response in model.chat_stream(tokenizer, _query, history=_task_history):
            _chatbot[-1] = (_parse_text(_query), _parse_text(response))
            yield _chatbot
    except Exception as e:
        error_msg = f"Error: {str(e)}"
        _chatbot[-1] = (_parse_text(_query), error_msg)
        yield _chatbot
        print(f"Error in generation: {e}")
Logging
Add comprehensive logging:
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('qwen_demo.log'),
        logging.StreamHandler(),
    ],
)

def predict_with_logging(_query, _chatbot, _task_history):
    logging.info(f"User query: {_query}")
    start_time = time.time()
    for response in model.chat_stream(...):
        # ...
    duration = time.time() - start_time
    logging.info(f"Generation completed in {duration:.2f}s")
Rate Limiting
Protect against abuse:
from collections import defaultdict
import time

user_requests = defaultdict(list)
RATE_LIMIT = 10  # requests per minute

def check_rate_limit(user_id):
    now = time.time()
    # Drop requests older than one minute
    user_requests[user_id] = [
        t for t in user_requests[user_id]
        if now - t < 60
    ]
    if len(user_requests[user_id]) >= RATE_LIMIT:
        return False
    user_requests[user_id].append(now)
    return True
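To apply the limit per client, Gradio can inject the request object into the handler; a sketch that uses the client host as a stand-in user identifier (predict and _parse_text are the demo's existing functions):

def predict_rate_limited(_query, _chatbot, _task_history, request: gr.Request):
    # Hypothetical identifier: the client host from the injected request
    user_id = request.client.host if request else "anonymous"
    if not check_rate_limit(user_id):
        _chatbot.append((_parse_text(_query), "Rate limit exceeded, please retry in a minute."))
        yield _chatbot
        return
    yield from predict(_query, _chatbot, _task_history)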
Health Monitoring
Add health check endpoint:
def health_check():
    try:
        # Simple model test
        test_response = model.chat(tokenizer, "Hi", history=None)[0]
        return "✓ Healthy"
    except Exception as e:
        return f"✗ Unhealthy: {e}"

health_status = gr.Textbox(label="System Health", interactive=False)
health_btn = gr.Button("Check Health")
health_btn.click(health_check, None, health_status)
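If an actual HTTP endpoint is needed (for load balancers or uptime probes) rather than a UI button, one option is to mount the Gradio app on FastAPI; a sketch assuming fastapi and uvicorn are installed and demo is the Blocks app built above:

from fastapi import FastAPI
import uvicorn
import gradio as gr

app = FastAPI()

@app.get("/health")
def health_endpoint():
    return {"status": health_check()}

# Mount the Gradio UI and serve both from one process (skip demo.launch() in this setup)
app = gr.mount_gradio_app(app, demo, path="/")
uvicorn.run(app, host="0.0.0.0", port=8000)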
Integration Examples
With Authentication
demo.queue().launch(
    auth=[("admin", "password123"), ("user", "userpass")],
    auth_message="Enter credentials to access Qwen Chat",
    server_port=8000,
)
With Analytics
import analytics

def predict_with_analytics(_query, _chatbot, _task_history):
    # Track usage
    analytics.track('chat_message', {
        'query_length': len(_query),
        'history_length': len(_task_history),
    })
    # Normal prediction
    yield from predict(_query, _chatbot, _task_history)
With Database Storage
import sqlite3

def save_conversation(_task_history, user_id):
    conn = sqlite3.connect('conversations.db')
    cursor = conn.cursor()
    for user_msg, bot_msg in _task_history:
        cursor.execute(
            'INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)',
            (user_id, 'user', user_msg)
        )
        cursor.execute(
            'INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)',
            (user_id, 'assistant', bot_msg)
        )
    conn.commit()
    conn.close()
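The snippet above assumes a messages table already exists; a minimal sketch of creating it (the schema here is hypothetical):

def init_db(path='conversations.db'):
    conn = sqlite3.connect(path)
    conn.execute(
        '''CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               user_id TEXT,
               role TEXT,
               content TEXT,
               created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
           )'''
    )
    conn.commit()
    conn.close()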
Troubleshooting
Memory leaks in long sessions
Implement periodic cleanup:

MAX_HISTORY_LENGTH = 20

def trim_history(_task_history):
    if len(_task_history) > MAX_HISTORY_LENGTH:
        _task_history = _task_history[-MAX_HISTORY_LENGTH:]
    return _task_history

Slow generation
Profile your code:

import time

start = time.time()
response = model.chat(...)
print(f"Generation took: {time.time() - start:.2f}s")

Consider:
Using quantized models
Enabling Flash Attention
Reducing max tokens
Batch processing

Streaming output not updating
Ensure you're yielding updates:

for response in model.chat_stream(...):
    _chatbot[-1] = (query, response)
    yield _chatbot  # This is crucial!
Source Code Reference
Key files in the Qwen repository:
Main demo: web_demo.py:1
Text processing: web_demo.py:78
Prediction function: web_demo.py:119
UI definition: web_demo.py:151
Next Steps
CLI Demo: Explore the command-line interface
API Reference: Learn about the model API
Deployment Guide: Deploy Qwen in production
Examples: More examples on GitHub