Beyond the basic web demo, Qwen can be integrated with Gradio in advanced ways to create sophisticated applications. This guide covers custom implementations, optimization techniques, and production-ready patterns.
Overview
The Qwen web demo (web_demo.py) serves as a foundation for building custom Gradio applications. This page explores advanced patterns, customizations, and best practices for production deployments.
Architecture
The Gradio integration uses several key components:
Gradio Interface
├── UI Components (Blocks API)
│   ├── Chatbot widget
│   ├── Textbox input
│   └── Action buttons
├── State Management
│   ├── Conversation history
│   └── Task state
├── Text Processing
│   ├── Markdown rendering (mdtex2html)
│   └── Code highlighting
└── Model Integration
    ├── Streaming generation
    └── Memory management
Custom Text Processing
Markdown Enhancement
The demo uses mdtex2html for enhanced markdown rendering:
def postprocess(self, y):
    if y is None:
        return []
    for i, (message, response) in enumerate(y):
        y[i] = (
            None if message is None else mdtex2html.convert(message),
            None if response is None else mdtex2html.convert(response),
        )
    return y

gr.Chatbot.postprocess = postprocess
This enables:
LaTeX equation rendering
Enhanced table formatting
Better code block styling
Proper handling of special characters
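For example, a quick standalone sketch of what mdtex2html produces (assuming the mdtex2html package is installed; the sample string is illustrative):

import mdtex2html

sample = "Inline math like $e^{i\\pi} + 1 = 0$ and **bold** text."
html = mdtex2html.convert(sample)
print(html)  # Markdown plus TeX converted to HTML for the Chatbot widget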
Code Block Formatting
The demo includes custom logic to format code blocks with syntax highlighting:
if "```" in line:
    count += 1
    items = line.split("`")
    if count % 2 == 1:
        lines[i] = f'<pre><code class="language-{items[-1]}">'
    else:
        lines[i] = "<br></code></pre>"
Special Character Handling
Inside code blocks, special characters are escaped:
line = line.replace("`", r"\`")
line = line.replace("<", "&lt;")
line = line.replace(">", "&gt;")
line = line.replace(" ", "&nbsp;")
line = line.replace("*", "&ast;")
line = line.replace("_", "&lowbar;")
line = line.replace("-", "&#45;")
State Management
Conversation History
Gradio’s State component maintains conversation context:
task_history = gr.State([])
The history structure:
task_history = [
    ("User message 1", "Assistant response 1"),
    ("User message 2", "Assistant response 2"),
    # ...
]
Display vs. Task History
The demo maintains two separate histories:
Chatbot Display (_chatbot): Formatted for UI display
Task History (_task_history): Raw text for model context
def predict(_query, _chatbot, _task_history):
    print(f"User: {_parse_text(_query)}")
    _chatbot.append((_parse_text(_query), ""))  # Display
    # ... generation ...
    _task_history.append((_query, full_response))  # Raw history
This separation ensures:
Clean display with formatting
Accurate model context without HTML
Independent management of each
Streaming Implementation
Real-Time Response Generation
The demo implements streaming using Python generators:
for response in model.chat_stream(tokenizer, _query, history=_task_history, generation_config=config):
    _chatbot[-1] = (_parse_text(_query), _parse_text(response))
    yield _chatbot
full_response = _parse_text(response)
Each yield statement updates the UI in real-time, creating a smooth streaming effect.
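As a minimal, self-contained sketch of the mechanism (independent of Qwen; the token list below is a stand-in for real model output), a generator wired to a Blocks event streams each yield straight into the Chatbot:

import time
import gradio as gr

def fake_stream(query, chat_history):
    # Append a new (user, assistant) pair, then grow the assistant side token by token
    chat_history = chat_history + [(query, "")]
    for token in ["Streaming", " looks", " like", " this."]:
        time.sleep(0.2)                                  # simulate generation latency
        chat_history[-1] = (query, chat_history[-1][1] + token)
        yield chat_history                               # each yield re-renders the Chatbot

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    query = gr.Textbox()
    query.submit(fake_stream, [query, chatbot], [chatbot])

demo.queue().launch()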
Benefits of Streaming
Immediate Feedback: Users see responses start appearing instantly
Better UX: Reduces perceived latency
Interruptible: Generation can be stopped if needed
Progress Indication: Shows the model is working
UI Components
Custom Branding
The interface includes Qwen branding:
gr.Markdown("""
<p align="center"><img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/logo_qwen.jpg" style="height: 80px"/><p>""")
gr.Markdown("""<center><font size=8>Qwen-Chat Bot</center>""")
Model Links
The demo displays links to model resources:
gr.Markdown("""
<center><font size=4>
Qwen-7B <a href="https://modelscope.cn/models/qwen/Qwen-7B/summary">🤖</a> |
<a href="https://huggingface.co/Qwen/Qwen-7B">🤗</a> |
Qwen-7B-Chat <a href="https://modelscope.cn/models/qwen/Qwen-7B-Chat/summary">🤖</a> |
<a href="https://huggingface.co/Qwen/Qwen-7B-Chat">🤗</a> |
...
</center>""")
Three main buttons control the interface:
with gr.Row():
    empty_btn = gr.Button("🧹 Clear History (清除历史)")
    submit_btn = gr.Button("🚀 Submit (发送)")
    regen_btn = gr.Button("🤔️ Regenerate (重试)")

submit_btn.click(predict, [query, chatbot, task_history], [chatbot], show_progress=True)
submit_btn.click(reset_user_input, [], [query])
empty_btn.click(reset_state, [chatbot, task_history], outputs=[chatbot], show_progress=True)
regen_btn.click(regenerate, [chatbot, task_history], [chatbot], show_progress=True)
Custom Implementations
Adding System Prompts
Extend the demo to support custom system prompts:
def predict_with_system(_query, _system_prompt, _chatbot, _task_history):
    # Prepend the system prompt once, at the start of the conversation
    if _system_prompt and len(_task_history) == 0:
        _task_history.append((f"<system>{_system_prompt}</system>", "Understood."))
    # Continue with normal prediction
    for response in model.chat_stream(tokenizer, _query, history=_task_history):
        # ...
Add to UI:
system_prompt = gr.Textbox(label="System Prompt", placeholder="You are a helpful assistant...")
Multi-Model Support
Allow users to switch between models:
models = {
    "Qwen-7B-Chat": load_model("Qwen/Qwen-7B-Chat"),
    "Qwen-14B-Chat": load_model("Qwen/Qwen-14B-Chat"),
}

def predict_multi_model(_query, _model_name, _chatbot, _task_history):
    model = models[_model_name]
    # Use the selected model for generation
Add model selector:
model_dropdown = gr.Dropdown(
    choices=["Qwen-7B-Chat", "Qwen-14B-Chat"],
    value="Qwen-7B-Chat",
    label="Model",
)
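To route the selection into generation, pass the dropdown as an extra input when wiring the submit button (a sketch reusing the button and state names from the demo):

submit_btn.click(
    predict_multi_model,
    [query, model_dropdown, chatbot, task_history],
    [chatbot],
    show_progress=True,
)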
Generation Configuration UI
Add controls for generation parameters:
from transformers import GenerationConfig

with gr.Accordion("Generation Settings", open=False):
    temperature = gr.Slider(0.1, 2.0, value=0.7, step=0.1, label="Temperature")
    top_p = gr.Slider(0.1, 1.0, value=0.8, step=0.05, label="Top-p")
    max_tokens = gr.Slider(128, 4096, value=2048, step=128, label="Max Tokens")

def predict_with_config(_query, _temp, _top_p, _max_tokens, _chatbot, _task_history):
    # Build a generation config from the slider values
    config = GenerationConfig(
        temperature=_temp,
        top_p=_top_p,
        max_new_tokens=_max_tokens,
    )
    for response in model.chat_stream(tokenizer, _query, history=_task_history, generation_config=config):
        # ...
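The sliders only take effect if they are passed as inputs to the event handler; a sketch of the wiring, matching the parameter order of predict_with_config above:

submit_btn.click(
    predict_with_config,
    [query, temperature, top_p, max_tokens, chatbot, task_history],
    [chatbot],
    show_progress=True,
)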
Export Conversation
Add functionality to export chat history:
import json
from datetime import datetime

def export_conversation(_task_history):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"qwen_chat_{timestamp}.json"
    data = {
        "timestamp": timestamp,
        "conversation": [
            {"role": role, "content": content}
            for user_msg, bot_msg in _task_history
            for role, content in (("user", user_msg), ("assistant", bot_msg))
        ],
    }
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return filename

export_file = gr.File(label="Exported Conversation")
export_btn = gr.Button("💾 Export")
export_btn.click(export_conversation, [task_history], [export_file])
Model Loading
Optimize model loading for faster startup:
# Cache models globally
_model_cache = {}

def get_model(checkpoint_path):
    if checkpoint_path not in _model_cache:
        model = AutoModelForCausalLM.from_pretrained(
            checkpoint_path,
            device_map="auto",
            trust_remote_code=True,
        ).eval()
        _model_cache[checkpoint_path] = model
    return _model_cache[checkpoint_path]
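The same caching pattern applies to tokenizers; a sketch mirroring get_model (the AutoTokenizer import is assumed):

from transformers import AutoTokenizer

_tokenizer_cache = {}

def get_tokenizer(checkpoint_path):
    # Tokenizers load quickly, but caching still avoids repeated disk reads
    if checkpoint_path not in _tokenizer_cache:
        _tokenizer_cache[checkpoint_path] = AutoTokenizer.from_pretrained(
            checkpoint_path, trust_remote_code=True
        )
    return _tokenizer_cache[checkpoint_path]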
Memory Management
Implement aggressive memory management:
import torch
import gc

def aggressive_cleanup():
    """Thorough memory cleanup."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

# Call after clearing history or long conversations
empty_btn.click(lambda: aggressive_cleanup(), None, None)
Response Caching
Cache common responses to reduce computation:
import hashlib

response_cache = {}

def get_cache_key(query, history):
    content = query + str(history)
    return hashlib.md5(content.encode()).hexdigest()

def predict_with_cache(_query, _chatbot, _task_history):
    cache_key = get_cache_key(_query, _task_history)
    if cache_key in response_cache:
        response = response_cache[cache_key]
        _chatbot.append((_parse_text(_query), _parse_text(response)))
        return _chatbot
    # Normal generation...
    response_cache[cache_key] = full_response
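Note that response_cache grows without bound; one way to cap it (a sketch with a hypothetical limit, relying on Python dicts preserving insertion order):

MAX_CACHE_ENTRIES = 256  # hypothetical limit

def cache_response(cache_key, response):
    response_cache[cache_key] = response
    if len(response_cache) > MAX_CACHE_ENTRIES:
        # Evict the oldest inserted entry
        response_cache.pop(next(iter(response_cache)))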
Concurrent Request Handling
Gradio’s queue system handles concurrency, but you can optimize:
demo.queue(
    concurrency_count=4,  # Process up to 4 requests simultaneously
    max_size=20,          # Queue up to 20 requests
).launch(
    # launch options...
)
Production Best Practices
Error Handling
Implement robust error handling:
def predict_safe(_query, _chatbot, _task_history):
    _chatbot.append((_parse_text(_query), ""))  # placeholder entry to update in place
    try:
        for response in model.chat_stream(tokenizer, _query, history=_task_history):
            _chatbot[-1] = (_parse_text(_query), _parse_text(response))
            yield _chatbot
    except Exception as e:
        error_msg = f"Error: {str(e)}"
        _chatbot[-1] = (_parse_text(_query), error_msg)
        yield _chatbot
        print(f"Error in generation: {e}")
Logging
Add comprehensive logging:
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('qwen_demo.log'),
        logging.StreamHandler(),
    ],
)

def predict_with_logging(_query, _chatbot, _task_history):
    logging.info(f"User query: {_query}")
    start_time = time.time()
    for response in model.chat_stream(...):
        # ...
    duration = time.time() - start_time
    logging.info(f"Generation completed in {duration:.2f}s")
Rate Limiting
Protect against abuse:
from collections import defaultdict
import time

user_requests = defaultdict(list)
RATE_LIMIT = 10  # requests per minute

def check_rate_limit(user_id):
    now = time.time()
    # Drop requests older than one minute
    user_requests[user_id] = [
        t for t in user_requests[user_id]
        if now - t < 60
    ]
    if len(user_requests[user_id]) >= RATE_LIMIT:
        return False
    user_requests[user_id].append(now)
    return True
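To apply the limit per client, Gradio can inject the request object into the handler; a sketch that uses the client host as a stand-in user identifier (predict and _parse_text are the demo's existing functions):

def predict_rate_limited(_query, _chatbot, _task_history, request: gr.Request):
    # Hypothetical identifier: the client host from the injected request
    user_id = request.client.host if request else "anonymous"
    if not check_rate_limit(user_id):
        _chatbot.append((_parse_text(_query), "Rate limit exceeded, please retry in a minute."))
        yield _chatbot
        return
    yield from predict(_query, _chatbot, _task_history)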
Health Monitoring
Add health check endpoint:
def health_check():
    try:
        # Simple model test
        test_response = model.chat(tokenizer, "Hi", history=None)[0]
        return "✓ Healthy"
    except Exception as e:
        return f"✗ Unhealthy: {e}"

health_status = gr.Textbox(label="System Health", interactive=False)
health_btn = gr.Button("Check Health")
health_btn.click(health_check, None, health_status)
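If an actual HTTP endpoint is needed (for load balancers or uptime probes) rather than a UI button, one option is to mount the Gradio app on FastAPI; a sketch assuming fastapi and uvicorn are installed and demo is the Blocks app built above:

from fastapi import FastAPI
import uvicorn
import gradio as gr

app = FastAPI()

@app.get("/health")
def health_endpoint():
    return {"status": health_check()}

# Mount the Gradio UI and serve both from one process (skip demo.launch() in this setup)
app = gr.mount_gradio_app(app, demo, path="/")
uvicorn.run(app, host="0.0.0.0", port=8000)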
Integration Examples
With Authentication
demo.queue().launch(
    auth=[("admin", "password123"), ("user", "userpass")],
    auth_message="Enter credentials to access Qwen Chat",
    server_port=8000,
)
With Analytics
import analytics

def predict_with_analytics(_query, _chatbot, _task_history):
    # Track usage
    analytics.track('chat_message', {
        'query_length': len(_query),
        'history_length': len(_task_history),
    })
    # Normal prediction
    yield from predict(_query, _chatbot, _task_history)
With Database Storage
import sqlite3

def save_conversation(_task_history, user_id):
    conn = sqlite3.connect('conversations.db')
    cursor = conn.cursor()
    for user_msg, bot_msg in _task_history:
        cursor.execute(
            'INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)',
            (user_id, 'user', user_msg)
        )
        cursor.execute(
            'INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)',
            (user_id, 'assistant', bot_msg)
        )
    conn.commit()
    conn.close()
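The snippet above assumes a messages table already exists; a minimal sketch of creating it (the schema here is hypothetical):

def init_db(path='conversations.db'):
    conn = sqlite3.connect(path)
    conn.execute(
        '''CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               user_id TEXT,
               role TEXT,
               content TEXT,
               created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
           )'''
    )
    conn.commit()
    conn.close()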
Troubleshooting
Memory leaks in long sessions
Implement periodic cleanup:

MAX_HISTORY_LENGTH = 20

def trim_history(_task_history):
    if len(_task_history) > MAX_HISTORY_LENGTH:
        _task_history = _task_history[-MAX_HISTORY_LENGTH:]
    return _task_history

Slow generation
Profile your code:

import time

start = time.time()
response = model.chat(...)
print(f"Generation took: {time.time() - start:.2f}s")

Consider:
Using quantized models
Enabling Flash Attention
Reducing max tokens
Batch processing

Streaming output not updating
Ensure you're yielding updates:

for response in model.chat_stream(...):
    _chatbot[-1] = (query, response)
    yield _chatbot  # This is crucial!
Source Code Reference
Key files in the Qwen repository:
Main demo: web_demo.py:1
Text processing: web_demo.py:78
Prediction function: web_demo.py:119
UI definition: web_demo.py:151
Next Steps
CLI Demo: Explore the command-line interface
API Reference: Learn about the model API
Deployment Guide: Deploy Qwen in production
Examples: More examples on GitHub