5 · Hugging Face Blog · 11d ago

ServiceNow AI benchmarks frontier ASR systems on code-switched bilingual speech

ServiceNow AI published a benchmarking study evaluating frontier automatic speech recognition (ASR) systems on code-switched speech, where speakers alternate between two languages mid-conversation. The work targets a practical gap in voice agent deployments serving bilingual customer populations. The results show how well current ASR models handle this linguistically complex scenario, with implications for the reliability of enterprise voice AI.

Evaluation and Benchmarking · Enterprise Deployment Patterns · ServiceNow AI

Related guides (2)

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

4 · OpenAI Blog · 1mo ago

ServiceNow powers actionable enterprise AI with OpenAI

ServiceNow is expanding its integration with OpenAI frontier models to power AI-driven workflows, summarization, search, and voice capabilities across the ServiceNow Platform. The partnership brings OpenAI's models into enterprise IT service management and workflow automation contexts. It marks a deeper enterprise deployment of OpenAI's commercial model offerings.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · ServiceNow AI · ServiceNow Platform · OpenAI

4 · Hugging Face Blog · 1mo ago

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

Hugging Face has updated its Open ASR Leaderboard to include new multilingual and long-form audio transcription evaluation tracks. The post analyzes trends across submitted automatic speech recognition models, providing comparative benchmarking data across languages and extended audio contexts. This expands the leaderboard's coverage beyond English short-form ASR to better reflect real-world deployment scenarios.

Evaluation and Benchmarking · Multimodal Progress · Open ASR Leaderboard · Automatic Speech Recognition · Hugging Face

4 · GitHub Trending · 25d ago

FunASR: Industrial-Grade Speech Recognition Toolkit with 170x Realtime Performance

FunASR is an open-source speech recognition toolkit from ModelScope supporting 50+ languages, speaker diarization, emotion detection, and streaming inference at 170x realtime speed. It exposes an OpenAI-compatible API, positioning it as a drop-in alternative for production ASR workloads. The repository has accumulated 16,317 stars with modest daily momentum (+42 today).

Open Weights Progress · Agent and Tool Ecosystem · FunASR · ModelScope · OpenAI-compatible API

5 · OpenAI Blog · 1mo ago

OpenAI Introduces IndQA: Multilingual Benchmark for Indian Languages

OpenAI has released IndQA, a benchmark designed to evaluate AI systems across 12 Indian languages and 10 knowledge domains. The benchmark was developed with domain experts and focuses on cultural understanding and reasoning capabilities. It targets a significant gap in multilingual evaluation coverage for South Asian languages.

Evaluation and Benchmarking · Multimodal Progress · IndQA · OpenAI

6 · The Batch · 1mo ago

OpenAI Updates Audio Models That Reason, Transcribe, and Translate

OpenAI introduced three new audio models in its Realtime API: GPT-Realtime-2 (speech-to-speech with five configurable reasoning effort levels), GPT-Realtime-Translate (70+ input languages), and GPT-Realtime-Whisper (transcription). GPT-Realtime-2 operates as an end-to-end audio model including reasoning, with latency ranging from 1.12 seconds at minimal effort to 2.33 seconds at high effort. Benchmark results are mixed: it leads Scale AI's Audio MultiChallenge and Artificial Analysis Conversational Dynamics but trails Step-Audio R1.1 Realtime and Grok Voice Think Fast 1.0 on speech reasoning and agentic tasks. The configurable reasoning-latency tradeoff is positioned as a key differentiator for voice agent applications.

Frontier Model Releases · Evaluation and Benchmarking · Scale AI Audio MultiChallenge · GPT-Realtime-2 · Google

4 · arXiv · cs.CL · 2d ago

IndicContextEval: Benchmark for context utilisation in Audio LLMs across 8 Indic languages

Researchers introduce IndicContextEval, a 56-hour multilingual speech benchmark covering 555 speakers across 8 Indian languages and 23 professional domains, designed to test whether Audio LLMs genuinely use textual context (domain descriptions, entity lists) or rely on parametric knowledge. The benchmark employs a 7-level prompting framework that progressively introduces contextual signals including adversarial prompts with incorrect entities. Evaluation of five models reveals substantial variation in context utilisation behaviour, exposing a gap in existing ASR benchmarks that test only fixed prompting conditions.

Evaluation and Benchmarking · Multimodal Progress · IndicContextEval

4 · arXiv · cs.CL · 10d ago

T1-Bench: Multi-scenario agent benchmark across 25 real-world domains

T1-Bench is a new benchmark for evaluating agentic LLM systems in realistic customer-facing, multi-domain environments, covering 25 domains of varying difficulty with interleaved multi-turn scenarios. The authors evaluate 12 proprietary and open-weight models and combine automatic evaluation with human judgments. The benchmark targets gaps in existing agent evals around task complexity, domain diversity, and compositional reasoning across multi-step interactions.

Evaluation and Benchmarking · Agent and Tool Ecosystem · T1-Bench

6 · arXiv · cs.AI · 15d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline, from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Benchmark Everything Everywhere All at Once · Benchmark Agent