VibeVoice: Microsoft open-sources a voice AI trio (60-minute ASR, 90-minute TTS, 0.5B Realtime at 300 ms)

Introduction

VibeVoice is Microsoft's open-source family of three speech models: long-form ASR, long-form multi-speaker TTS, and Realtime streaming TTS. All three are MIT-licensed and built on Qwen2.5 (Microsoft's notes name Qwen2.5 1.5B as the base model).

Repo: https://github.com/microsoft/VibeVoice

Key features

1. VibeVoice-ASR — Long-form Speech Recognition

A unified speech-to-text model that handles up to 60 minutes of continuous audio in a single pass. The output is structured as Who (speaker), When (timestamp), and What (content). Customized hotwords are supported.

60-minute single pass: conventional ASR models slice audio into short chunks and lose global context; VibeVoice-ASR accepts up to 60 minutes of continuous audio within a 64K-token context. Speaker tracking and semantic coherence stay consistent across the full hour.
Customized hotwords: supply hotwords (specific names, technical terms, background information) to guide recognition and improve accuracy on domain-specific content.
Rich transcription (Who, When, What): the model performs ASR, diarization, and timestamping jointly, so the structured output shows who said what and when.
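To make the Who / When / What structure concrete, here is a minimal sketch of how a consumer might hold and summarize such output. The `Segment` class and its field names are hypothetical, chosen for illustration only; the actual output schema of VibeVoice-ASR should be taken from its documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript entry. Field names are illustrative, not the model's schema."""
    speaker: str  # Who
    start: float  # When: segment start, in seconds
    end: float    # When: segment end, in seconds
    text: str     # What


def talk_time(segments):
    """Sum the speaking time in seconds for each speaker."""
    totals = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


def format_line(seg):
    """Render a segment as '[MM:SS] speaker: text' using its start time."""
    minutes, seconds = divmod(int(seg.start), 60)
    return f"[{minutes:02d}:{seconds:02d}] {seg.speaker}: {seg.text}"
```

Because diarization, timestamps, and text arrive together, per-speaker statistics such as `talk_time` need no separate alignment step.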

Link: Documentation | Hugging Face | Playground | Finetuning | Paper

2. VibeVoice-TTS — Long-form Multi-speaker TTS

Best for: long-form conversational audio, podcasts, and multi-speaker dialogue.

90-minute long-form generation: synthesizes conversational or single-speaker speech up to 90 minutes in a single pass while maintaining speaker consistency and semantic coherence.
Multi-speaker: supports up to 4 distinct speakers in one conversation, with natural turn-taking and consistent voices across long dialogues.
Expressive speech: captures conversational dynamics and emotional nuance.
Multilingual: English, Chinese, and other languages.
The demos include cross-lingual speech and spontaneous singing.
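Since the model is described as handling at most 4 distinct speakers, it can help to validate a dialogue script before sending it to synthesis. The sketch below assumes a simple `Name: text` line format, which is an illustrative convention and not necessarily the input format the VibeVoice scripts expect; `parse_script` is a hypothetical helper.

```python
import re

MAX_SPEAKERS = 4  # limit stated in this article

# Assumed line format: "<speaker>: <text>" (illustrative, not the official format)
LINE_RE = re.compile(r"^(?P<speaker>[^:]+):\s*(?P<text>.+)$")


def parse_script(script):
    """Split a script into (speaker, text) turns and enforce the speaker limit."""
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line is not in 'Speaker: text' form: {line!r}")
        turns.append((match.group("speaker").strip(), match.group("text").strip()))
    speakers = {speaker for speaker, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers found, at most {MAX_SPEAKERS} supported")
    return turns
```

Failing early on a fifth speaker is cheaper than discovering the problem partway through a 90-minute generation.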

Link: Documentation | Hugging Face | Paper

3. VibeVoice-Streaming — Real-time Streaming TTS

A lightweight real-time TTS model that supports streaming text input and robust long-form generation.

Parameter size: 0.5B (deployment-friendly; can run on CPU)
First-audible latency: ~300 milliseconds
Streaming text input: text is streamed in, and audio starts playing as soon as there is enough context
Long-form: ~10 minutes of robust generation
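First-audible latency is the time from sending the first text until the first non-empty audio chunk arrives. The sketch below shows one way to measure it against any streaming synthesizer. `fake_streaming_tts` is a stand-in generator written for this example, not the VibeVoice API; to measure the real model, it would be replaced with a wrapper around the actual inference call.

```python
import time


def first_audible_latency(text_chunks, synthesize):
    """Return (seconds until the first non-empty audio chunk, total audio bytes).

    `synthesize` is any callable that takes an iterable of text chunks
    and yields audio as bytes objects.
    """
    start = time.perf_counter()
    latency = None
    total_bytes = 0
    for audio in synthesize(text_chunks):
        if latency is None and audio:
            latency = time.perf_counter() - start
        total_bytes += len(audio)
    if latency is None:
        raise RuntimeError("synthesizer produced no audio")
    return latency, total_bytes


def fake_streaming_tts(text_chunks):
    """Stand-in synthesizer (NOT the VibeVoice API): yields silence per text chunk."""
    for chunk in text_chunks:
        time.sleep(0.01)  # pretend to do some work
        yield b"\x00" * (len(chunk) * 2)


if __name__ == "__main__":
    latency, size = first_audible_latency(["Hello ", "world"], fake_streaming_tts)
    print(f"first audio after {latency * 1000:.1f} ms, {size} bytes total")
```

Measuring on the target hardware matters here, since a figure such as ~300 ms depends on the device the model runs on.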

Link: Documentation | Hugging Face | Colab

Usage

MIT license: free for commercial use as well, but Microsoft states that the models are “intended for research and development purposes only” and does not recommend production deployment without further testing.
Stack: 100% Python (per GitHub), 44.7k stars and 5k forks at the time of release.
Installation: clone the repo and load the models from Hugging Face. Both ASR and TTS ship with inference scripts and finetuning recipes.
Playground: https://aka.ms/vibevoice-asr for ASR.

Risk + Limitation (Microsoft note)

The models can produce unexpected, biased, or inaccurate output.
VibeVoice inherits biases from its base model (Qwen2.5 1.5B).
Deepfake risk: high-quality synthetic speech can be misused for impersonation, fraud, and disinformation. Microsoft recommends:
- Ensure transcripts are reliable
- Check content accuracy
- Avoid using generated content in misleading ways
- Best practice: disclose the use of AI when sharing AI-generated content
Not recommended for commercial or real-world applications without further testing and development.

Why developers should care

If you are building podcast or meeting transcription: VibeVoice-ASR is worth trying right away. 60-minute single-pass recognition, diarization, and custom hotwords are a rare feature set among open-source models.
If you are building TTS for long-form content (audiobooks, podcasts, dialogue agents): the TTS variant, with 90 minutes, 4 speakers, and EN/CN plus cross-lingual support, is stronger than many existing open TTS models (Bark, XTTS-v2).
If you deploy to edge devices or CPU: Realtime 0.5B with ~300 ms latency sets a new baseline for lightweight TTS.
If you are building a voice product that needs verification: keep Microsoft's disclaimer in mind. Voice clones from this model are good enough to bypass weak biometrics. Raise your liveness-check thresholds and add phrase challenges whenever the product touches voice authentication.
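A phrase challenge asks the caller to speak a freshly generated random phrase, so that a previously recorded or pre-generated clip cannot simply be replayed. The sketch below is a minimal illustration under stated assumptions: the word list, function names, and matching rule are hypothetical, and the transcript is assumed to come from whatever ASR system the product already uses. A random phrase does not by itself stop real-time synthesis, so it complements a liveness check instead of replacing it.

```python
import secrets

# Small illustrative word list; a real deployment would use a much larger one.
WORDS = ["amber", "river", "stone", "maple", "copper", "lantern", "harbor", "violet"]


def make_challenge(n_words=4):
    """Build an unpredictable phrase using a cryptographically secure RNG."""
    return " ".join(secrets.choice(WORDS) for _ in range(n_words))


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(text.lower().split())


def challenge_matches(challenge, transcript):
    """True if the ASR transcript of the caller's speech equals the challenge phrase."""
    return normalize(challenge) == normalize(transcript)
```

In practice each challenge should be single-use and expire within a short window, so that a matching response cannot be reused later.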
