Qwen3-ASR

Qwen3-ASR is an open-source series of ASR models that supports multilingual speech, music, and song recognition. It also offers language detection for 52 languages and dialects, and timestamp prediction in 11 languages through a companion forced-alignment model.

At a glance

Pricing: Open Source

Free tier: Yes

API: Yes

Skill level: Technical

About

What is Qwen3-ASR?

Qwen3-ASR is an open-source series of Automatic Speech Recognition (ASR) models developed by the Qwen team at Alibaba Cloud. It includes two all-in-one speech recognition models (0.6B and 1.7B parameters) that perform language identification and ASR for 52 languages and dialects: 30 languages and 22 Chinese dialects. The series also includes Qwen3-ForcedAligner-0.6B, a non-autoregressive forced-alignment model that aligns text–speech pairs and predicts timestamps in 11 languages. Qwen3-ASR is designed to stay accurate in complex acoustic environments and on challenging text patterns, and it supports both offline and streaming inference.
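
One common use of forced-alignment timestamps is caption generation. The sketch below is a minimal illustration of that idea, not part of Qwen3-ASR: the AlignedWord container, the to_srt_time and words_to_srt helpers, and the 7-word cue size are all hypothetical, and the real aligner's output format may differ. It assumes each aligned word comes with start and end times in seconds and turns them into SubRip (SRT) cues.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    """Hypothetical container for one aligned word; not a Qwen3-ASR type."""
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds


def to_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group aligned words into numbered SRT cues of at most max_words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        index = len(cues) + 1
        # A cue spans from the first word's start to the last word's end.
        span = f"{to_srt_time(chunk[0].start)} --> {to_srt_time(chunk[-1].end)}"
        text = " ".join(w.text for w in chunk)
        cues.append(f"{index}\n{span}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [AlignedWord("hello", 0.0, 0.4), AlignedWord("world", 0.5, 1.25)]
    print(words_to_srt(demo))
```

Joining words with spaces suits languages such as English; for Chinese and other scripts written without spaces, the join string would likely need to be empty.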

Best used for

Ideal for developers and researchers who need to integrate speech recognition into their applications, transcribe multilingual audio accurately, or align speech with text at the timestamp level. It is especially valuable for projects that require robust performance across many languages and dialects.

Common actions

transcribe audio

detect language

align speech

process audio

Capabilities

Key features

Multilingual speech recognition
Language detection
Timestamp prediction
Streaming inference
vLLM backend support
Forced alignment

Target Audience

Developers
Researchers
Data scientists
AI engineers

Integrations

Not yet documented

Pricing & Plans

Open Source

Free

FAQs

What languages and dialects does Qwen3-ASR support?

Qwen3-ASR supports language identification and ASR for 52 languages and dialects: 30 languages, such as English, Chinese, French, and Spanish, plus 22 Chinese dialects. The Qwen3-ForcedAligner-0.6B model supports timestamp prediction in 11 languages.

Can Qwen3-ASR handle both speech and music recognition?

Yes. The Qwen3-ASR models are designed for stable multilingual recognition across several audio types, including speech, singing voice, and songs with background music.

How can I use Qwen3-ASR for faster inference?

For the fastest inference, use the vLLM backend by initializing the model with Qwen3ASRModel.LLM. Installing FlashAttention 2 can further reduce GPU memory usage and speed up inference, especially for long inputs and large batch sizes.
