Almanac
benchmark

FLEURS

benchmark·active·fleurs-b9045418·5 events·first seen 1mo ago

Aliases: FLEURS

Recent events (5)

7·Mistral AI News·1mo ago

Mistral Releases Voxtral Transcribe 2: State-of-the-Art Speech-to-Text with Sub-200ms Realtime Model

Mistral AI has released Voxtral Transcribe 2, a family of two speech-to-text models: Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live applications. Voxtral Realtime features a novel streaming architecture with configurable latency down to sub-200ms, a 4B parameter footprint suitable for edge deployment, and is released as open weights under Apache 2.0. Voxtral Mini Transcribe V2 claims state-of-the-art word error rate on FLEURS at $0.003/min, outperforming GPT-4o mini Transcribe, Gemini 2.5 Flash, AssemblyAI, and Deepgram Nova on accuracy benchmarks. Both models support 13 languages with speaker diarization, word-level timestamps, and context biasing.

4·arXiv · cs.CL·15d ago

SN-WER: Script-Normalized Word Error Rate for Multi-Script Indic ASR Evaluation

Researchers propose Script-Normalized WER (SN-WER), a training-free evaluation metric that transliterates ASR reference and hypothesis text into a canonical script before computing WER, addressing overestimation of errors caused by script mismatches in multilingual settings. Evaluated across 5 Indic languages, 2 datasets, and 3 ASR models, SN-WER reduces inflated model performance gaps by up to 12% on curated FLEURS data and attenuates romanization-induced WER inflation by 67% in controlled tests. The metric maintains near-identical sensitivity to genuine semantic errors (ΔSN-WER/ΔWER ≈ 1.09) and shows robustness to transliterator choice with token-collision rates below 0.1%. The authors recommend SN-WER as a companion metric to WER and CER, particularly for pipelines feeding downstream search, indexing, or multilingual LLM applications.
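The idea behind the metric can be illustrated with a minimal sketch: map both the reference and the hypothesis into one canonical script, then compute ordinary WER on the result. This is not the authors' implementation. The names `wer`, `sn_wer`, `toy_transliterate`, and `TOY_TABLE` are hypothetical, the two-word lookup table stands in for a real transliterator, and Latin is assumed as the canonical script purely for illustration.

```python
from typing import Callable, Dict, List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Plain WER: word edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical toy table (assumption): a real pipeline would call an actual
# transliterator here instead of a hand-written word lookup.
TOY_TABLE: Dict[str, str] = {"नमस्ते": "namaste", "दुनिया": "duniya"}


def toy_transliterate(text: str) -> str:
    """Map known Devanagari tokens to Latin; leave every other token as-is."""
    return " ".join(TOY_TABLE.get(tok, tok) for tok in text.split())


def sn_wer(reference: str, hypothesis: str,
           to_canonical: Callable[[str], str] = toy_transliterate) -> float:
    """Script-normalized WER: normalize both sides, then compute plain WER."""
    return wer(to_canonical(reference), to_canonical(hypothesis))


reference = "नमस्ते दुनिया"
romanized = "namaste duniya"     # same words, different script
wrong_word = "namaste duniyaa"   # one word that differs even after normalization

plain_score = wer(reference, romanized)
normalized_score = sn_wer(reference, romanized)
error_score = sn_wer(reference, wrong_word)
```

In this toy case, plain WER counts every word of the romanized hypothesis as an error because the scripts differ, while the normalized score does not. A word that still differs after normalization continues to count as an error, which is the property the summary describes as preserved sensitivity to genuine errors.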

6·The Batch·15d ago

Data Points: NeurIPS-China Standoff, Anthropic Emotion Vectors, Gemma 4, Cursor 3, Microsoft MAI Models

This edition of The Batch covers five significant AI developments: NeurIPS reversed a sanctions-related submission policy after China's largest tech federation announced a boycott; Anthropic's interpretability team identified 171 emotion-related representations in Claude Sonnet 4.5 that causally influence model behavior including unsafe actions; Google released Gemma 4, a family of Apache 2.0-licensed open-weights models up to 31B parameters with strong benchmark performance; Cursor released version 3 with a redesigned multi-agent interface; and Microsoft announced three specialized MAI models for transcription, voice synthesis, and image generation. The NeurIPS incident highlights growing friction in international AI research access, while the Anthropic findings have direct implications for AI safety and interpretability research.

8·Mistral AI News·15d ago

Mistral AI Releases Voxtral: Open-Weight Speech Understanding Models in 24B and 3B Sizes

Mistral AI has released Voxtral, a family of two open-weight speech understanding models (Voxtral Small at 24B and Voxtral Mini at 3B) under the Apache 2.0 license. Both models support long-form audio up to 30-40 minutes, native multilingual transcription, built-in Q&A and summarization, and function-calling directly from voice, built on the Mistral Small 3.1 language model backbone. Benchmarks show Voxtral outperforms Whisper large-v3 across all tasks and is competitive with GPT-4o mini and Gemini 2.5 Flash on audio understanding, while pricing starts at $0.001/minute via API. Models are available on Hugging Face and through Mistral's API, with a transcription-optimized variant (Voxtral Mini Transcribe) also offered.

7·The Batch·12d ago

Microsoft Build: Seven in-house AI models, GitHub Copilot desktop agent manager, and Web IQ search API for agents

Microsoft announced seven new AI models trained from scratch (not distilled from OpenAI), including the flagship MAI-Thinking-1 reasoning model and MAI-Transcribe-1.5, plus a 'Frontier Tuning' reinforcement learning approach for enterprise workflow training. GitHub released a desktop Copilot app designed to manage multiple parallel AI agents with isolated git worktrees and bidirectional canvases. Microsoft also launched Web IQ, an agent-native Bing-powered grounding API already powering search in Copilot and ChatGPT, running 2.5x faster than alternatives with lower token costs. The roundup also covers Nous Research's Hermes Desktop cross-platform agent app, Alibaba's Qwen3.7-Plus multimodal model, and OpenAI's role-specific Codex plugins.