Note that this just covers the Speech-to-Text/Speech-Recognition aspect (a-la whisper), there's also models for long-form Text-To-Speech and steaming Text-To-Speech.
Look at the "News" section in the readme - The original TTS model is gone from this repo (you can still find it other places), but the SST/ASR, long form TTS, and streaming TTS models are new.
I'd be willing to bet it will be "Word of the Year" for 2026. Merriam-Webster had 'slop' for 2025, and 'polarization' for 2024. Is there a prediction market for this?
Seems quite heavy for a STT model, Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's due to additional accuracy and speaker diarisation?
The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck
Great post last night from Simon: https://simonwillison.net/2026/Apr/27/vibevoice/
Note that this just covers the Speech-to-Text/Speech-Recognition aspect (a-la whisper), there's also models for long-form Text-To-Speech and steaming Text-To-Speech.
Isn't this project the one Microsoft published but then soon after pulled it for security/safety reasons? What has changed since then?
Look at the "News" section in the readme - The original TTS model is gone from this repo (you can still find it other places), but the SST/ASR, long form TTS, and streaming TTS models are new.
So we've really just settled on Vibe as the verb for AI then?
Why use precise technical language when you can just vibe with your AI system?
I'd be willing to bet it will be "Word of the Year" for 2026. Merriam-Webster had 'slop' for 2025, and 'polarization' for 2024. Is there a prediction market for this?
Seems quite heavy for a STT model, Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's due to additional accuracy and speaker diarisation?
The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck