OpenAI has officially launched Realtime API and gpt-realtime, its most advanced speech-to-speech model, moving the Realtime API out of beta […]
Category: Audio Language Model
NVIDIA AI Just Released Streaming Sortformer: A Real-Time Speaker Diarization that Figures Out Who’s Talking in Meetings and Calls Instantly
NVIDIA has released its Streaming Sortformer, a breakthrough in real-time speaker diarization that instantly identifies and labels participants in meetings, […]
NVIDIA AI Just Released the Largest Open-Source Speech AI Dataset and State-of-the-Art Models for European Languages
Nvidia has taken a major leap in the development of multilingual speech AI, unveiling Granary, the largest open-source speech dataset […]
Alibaba Qwen Introduces Qwen3-MT: Next-Gen Multilingual Machine Translation Powered by Reinforcement Learning
Alibaba has introduced Qwen3-MT (qwen-mt-turbo) via Qwen API, its latest and most advanced machine translation model, designed to break language barriers […]
NVIDIA AI Releases Canary-Qwen-2.5B: A State-of-the-Art ASR-LLM Hybrid Model with SoTA Performance on OpenASR Leaderboard
NVIDIA has just released Canary-Qwen-2.5B, a groundbreaking automatic speech recognition (ASR) and language model (LLM) hybrid, which now tops the […]
Mistral AI Releases Voxtral: The World’s Best (and Open) Speech Recognition Models
Mistral AI has released Voxtral, a family of open-weight models—Voxtral-Small-24B and Voxtral-Mini-3B—designed to handle both audio and text inputs. Built […]
NVIDIA Just Released Audio Flamingo 3: An Open-Source Model Advancing Audio General Intelligence
Heard about Artificial General Intelligence (AGI)? Meet its auditory counterpart—Audio General Intelligence. With Audio Flamingo 3 (AF3), NVIDIA introduces a […]
Efficient and Adaptable Speech Enhancement via Pre-trained Generative Audioencoders and Vocoders
Recent advances in speech enhancement (SE) have moved beyond traditional mask or signal prediction methods, turning instead to pre-trained audio […]
StepFun Introduces Step-Audio-AQAA: A Fully End-to-End Audio Language Model for Natural Voice Interaction
Rethinking Audio-Based Human-Computer Interaction Machines that can respond to human speech with equally expressive and natural audio have become a […]
NVIDIA Open Sources Parakeet TDT 0.6B: Achieving a New Standard for Automatic Speech Recognition ASR and Transcribes an Hour of Audio in One Second
NVIDIA has unveiled Parakeet TDT 0.6B, a state-of-the-art automatic speech recognition (ASR) model that is now fully open-sourced on Hugging […]
