Microsoft Open Sources VibeVoice AI That Can Turn Text into a 90-Minute Podcast

Currently, it supports English and Mandarin, but future expansions are expected.

Microsoft has unveiled VibeVoice, an open-source AI framework that can generate multi-speaker podcasts up to 90 minutes long from plain text. It is available for anyone to try online or run on a local PC.

The concept is similar to Google’s NotebookLM, which also turns text into audio conversations, but VibeVoice offers the added advantage of being open-source and customisable.

VibeVoice currently supports English and Mandarin, with support for more languages expected later.

Built to overcome the limitations of traditional text-to-speech (TTS) systems, VibeVoice supports up to four distinct speakers in a single audio session and is designed to keep the conversational flow natural as sessions grow longer.

"A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences.

"VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details," Microsoft said in a blog post.

The platform comes in multiple model sizes, including a 1.5B-parameter version with a 64k-token context window and a 7B-parameter model capable of generating up to 45 minutes of audio. Both are pitched as delivering high audio fidelity.
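
A rough back-of-the-envelope calculation shows why the 7.5 Hz frame rate matters for long audio. The Python sketch below assumes one sequence position per tokenizer frame and ignores the script text and any other tokens. That is a simplification, not a description of Microsoft's implementation. The function names, the reading of "64k" as 64,000 tokens, and the 50 Hz comparison rate are illustrative assumptions.

```python
FRAME_RATE_HZ = 7.5        # tokenizer frame rate quoted by Microsoft
CONTEXT_TOKENS = 64_000    # assumption: conservative reading of "64k"
COMPARISON_RATE_HZ = 50.0  # hypothetical higher-rate tokenizer, for contrast only


def audio_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


def fits_in_context(minutes: float,
                    frame_rate_hz: float = FRAME_RATE_HZ,
                    context_tokens: int = CONTEXT_TOKENS) -> bool:
    """True if the audio frames alone fit in the window (text tokens ignored)."""
    return audio_frames(minutes, frame_rate_hz) <= context_tokens


if __name__ == "__main__":
    for minutes in (45, 90):
        frames = audio_frames(minutes)
        print(f"{minutes} min at {FRAME_RATE_HZ} Hz: {frames} frames, "
              f"fits in 64k: {fits_in_context(minutes)}")

    frames_50 = audio_frames(90, COMPARISON_RATE_HZ)
    print(f"90 min at {COMPARISON_RATE_HZ} Hz: {frames_50} frames, "
          f"fits in 64k: {fits_in_context(90, COMPARISON_RATE_HZ)}")
```

Under these assumptions, 90 minutes of audio comes to 40,500 frames, which fits within a 64k window, while the same audio at a hypothetical 50 Hz would need 270,000 frames.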

For creators and developers, VibeVoice's practical usability stands out: it requires just 7 GB of VRAM for the smaller model or up to 18 GB for the larger one, making it feasible even on mid-range GPUs.