ServiceNow AI benchmarks frontier ASR systems on code-switched bilingual speech
ServiceNow AI published a benchmarking study evaluating frontier automatic speech recognition (ASR) systems on code-switched speech, where speakers alternate between two languages mid-conversation. The work targets a practical gap in voice agent deployments serving bilingual customer populations. The study assesses how well current ASR models handle this linguistically complex scenario, with implications for enterprise voice AI reliability.
Related events (8)
ServiceNow powers actionable enterprise AI with OpenAI
ServiceNow is expanding its integration with OpenAI frontier models to power AI-driven workflows, summarization, search, and voice capabilities across the ServiceNow Platform. The partnership brings OpenAI's models into enterprise IT service management and workflow automation. It deepens enterprise deployment of OpenAI's commercial models.
Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks
Hugging Face has updated its Open ASR Leaderboard to include new multilingual and long-form audio transcription evaluation tracks. The post analyzes trends across submitted automatic speech recognition models, providing comparative benchmarking data across languages and extended audio contexts. This expands the leaderboard's coverage beyond English short-form ASR to better reflect real-world deployment scenarios.
FunASR: Industrial-Grade Speech Recognition Toolkit with 170x Realtime Performance
FunASR is an open-source speech recognition toolkit from ModelScope supporting 50+ languages, speaker diarization, emotion detection, and streaming inference at 170x realtime speed. It exposes an OpenAI-compatible API, positioning it as a drop-in alternative for production ASR workloads. The repository has accumulated 16,317 stars with modest daily momentum (+42 today).
OpenAI Introduces IndQA: Multilingual Benchmark for Indian Languages
OpenAI has released IndQA, a benchmark designed to evaluate AI systems across 12 Indian languages and 10 knowledge domains. The benchmark was developed with domain experts and focuses on cultural understanding and reasoning capabilities. It targets a significant gap in multilingual evaluation coverage for South Asian languages.
OpenAI Updates Audio Models That Reason, Transcribe, and Translate
OpenAI introduced three new audio models in its Realtime API: GPT-Realtime-2 (speech-to-speech with five configurable reasoning effort levels), GPT-Realtime-Translate (70+ input languages), and GPT-Realtime-Whisper (transcription). GPT-Realtime-2 operates as an end-to-end audio model including reasoning, with latency ranging from 1.12 seconds at minimal effort to 2.33 seconds at high effort. Benchmark results are mixed: it leads Scale AI's Audio MultiChallenge and Artificial Analysis Conversational Dynamics but trails Step-Audio R1.1 Realtime and Grok Voice Think Fast 1.0 on speech reasoning and agentic tasks. The configurable reasoning-latency tradeoff is positioned as a key differentiator for voice agent applications.
IndicContextEval: Benchmark for context utilisation in Audio LLMs across 8 Indic languages
Researchers introduce IndicContextEval, a 56-hour multilingual speech benchmark covering 555 speakers across 8 Indian languages and 23 professional domains, designed to test whether Audio LLMs genuinely use textual context (domain descriptions, entity lists) or rely on parametric knowledge. The benchmark employs a 7-level prompting framework that progressively introduces contextual signals including adversarial prompts with incorrect entities. Evaluation of five models reveals substantial variation in context utilisation behaviour, exposing a gap in existing ASR benchmarks that test only fixed prompting conditions.
T1-Bench: Multi-scenario agent benchmark across 25 real-world domains
T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.
Benchmark Agent: Autonomous system for end-to-end benchmark construction
Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline, from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

