4 · GitHub Trending (AI/LLM filtered) · 25d ago

FunASR: Industrial-Grade Speech Recognition Toolkit with 170x Realtime Performance

FunASR is an open-source speech recognition toolkit from ModelScope supporting 50+ languages, speaker diarization, emotion detection, and streaming inference at 170x realtime speed. It exposes an OpenAI-compatible API, positioning it as a drop-in alternative for production ASR workloads. The repository has accumulated 16,317 stars with modest daily momentum (+42 today).

Open Weights Progress · Agent and Tool Ecosystem · FunASR · ModelScope · OpenAI-compatible API
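To put the 170x realtime figure quoted for FunASR in concrete terms, the arithmetic can be sketched as below. This is a rough illustration, not a benchmark: it assumes "170x realtime" means 170 seconds of audio are processed per second of compute, and the function names are hypothetical helpers, not part of FunASR.

```python
def processing_seconds(audio_seconds: float, speedup: float = 170.0) -> float:
    """Estimated wall-clock time to transcribe `audio_seconds` of audio.

    Assumption: "170x realtime" means 170 seconds of audio are processed
    per second of compute. Real throughput depends on hardware, model
    size, and batching, none of which the summary specifies.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def real_time_factor(speedup: float = 170.0) -> float:
    """Real-time factor (RTF): seconds of compute per second of audio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


if __name__ == "__main__":
    one_hour = 3600.0  # seconds of audio
    print(f"1 h of audio: ~{processing_seconds(one_hour):.1f} s of compute")
    print(f"RTF: {real_time_factor():.4f}")
```

Under that assumption, one hour of audio would take roughly 21 seconds to transcribe, which corresponds to a real-time factor of about 0.006.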

Related guides (2)

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

3 · Hugging Face Blog · 1mo ago

AI Speech Recognition in Unity

A Hugging Face blog post describes integrating AI-based automatic speech recognition (ASR) into Unity game/application environments. The post likely covers using transformer-based ASR models within the Unity engine, bridging ML inference with real-time interactive applications. This represents a practical deployment pattern for on-device or embedded ASR in non-traditional runtime environments.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Unity · Hugging Face · Whisper

8 · OpenAI Blog · 1mo ago

Introducing Whisper

OpenAI introduced Whisper, an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The model demonstrates strong robustness to accents, background noise, and technical language, approaching human-level accuracy in English transcription. Whisper supports transcription in multiple languages as well as translation to English, and the weights and inference code were released publicly.

Open Weights Progress · Agent and Tool Ecosystem · OpenAI · Whisper

4 · Hugging Face Blog · 1mo ago

Powerful ASR + Diarization + Speculative Decoding with Hugging Face Inference Endpoints

Hugging Face published a blog post describing a pipeline that combines automatic speech recognition (ASR), speaker diarization, and speculative decoding on their Inference Endpoints platform. The post demonstrates how these three techniques can be integrated to produce faster, speaker-attributed transcriptions. Speculative decoding is highlighted as a key inference optimization that reduces latency for ASR workloads.

Inference Economics · Agent and Tool Ecosystem · Hugging Face Inference Endpoints · speculative decoding · Hugging Face

4 · Hugging Face Blog · 1mo ago

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

Hugging Face has updated its Open ASR Leaderboard to include new multilingual and long-form audio transcription evaluation tracks. The post analyzes trends across submitted automatic speech recognition models, providing comparative benchmarking data across languages and extended audio contexts. This expands the leaderboard's coverage beyond English short-form ASR to better reflect real-world deployment scenarios.

Evaluation and Benchmarking · Multimodal Progress · Open ASR Leaderboard · Automatic Speech Recognition · Hugging Face

6 · The Batch · 1mo ago

OpenAI Updates Audio Models That Reason, Transcribe, and Translate

OpenAI introduced three new audio models in its Realtime API: GPT-Realtime-2 (speech-to-speech with five configurable reasoning effort levels), GPT-Realtime-Translate (70+ input languages), and GPT-Realtime-Whisper (transcription). GPT-Realtime-2 operates as an end-to-end audio model including reasoning, with latency ranging from 1.12 seconds at minimal effort to 2.33 seconds at high effort. Benchmark results are mixed: it leads Scale AI's Audio MultiChallenge and Artificial Analysis Conversational Dynamics but trails Step-Audio R1.1 Realtime and Grok Voice Think Fast 1.0 on speech reasoning and agentic tasks. The configurable reasoning-latency tradeoff is positioned as a key differentiator for voice agent applications.

Frontier Model Releases · Evaluation and Benchmarking · Scale AI Audio MultiChallenge · GPT-Realtime-2 · Google

7 · OpenAI Blog · 1mo ago

Introducing the Realtime API

OpenAI has launched the Realtime API, enabling developers to build low-latency speech-to-speech experiences directly into their applications. The API provides native audio input and output without requiring separate transcription and text-to-speech steps. This represents a significant infrastructure offering for voice-enabled AI applications, moving beyond text-based API paradigms.

Inference Economics · Enterprise Deployment Patterns · GPT-4o Realtime API · OpenAI

3 · Hugging Face Blog · 1mo ago

Real-Time AI Sound Generation on Arm: A Personal Tool for Creative Freedom

A Hugging Face blog post describes deploying real-time AI sound generation on Arm hardware, framing it as a personal creative tool. The piece covers inference optimization for audio generation models running on Arm CPUs. This represents a practical demonstration of edge/on-device inference for generative audio models.

Inference Economics · Agent and Tool Ecosystem · Arm · Hugging Face

7 · OpenAI Blog · 1mo ago

Advancing voice intelligence with new models in the API

OpenAI is releasing new realtime voice models via its API with capabilities spanning reasoning, translation, and transcription. The announcement targets developers building voice-enabled applications and represents an expansion of OpenAI's voice intelligence offerings beyond the existing Realtime API. The models are positioned to enable more natural and intelligent voice experiences in production deployments.

Frontier Model Releases · Enterprise Deployment Patterns · OpenAI voice models · OpenAI Realtime API · OpenAI