4 · arXiv cs.CL (Computation and Language) · 2d ago

Meaning Intelligence Framework addresses context failure in AI processing of Nigerian public discourse

Researchers introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema designed to separate surface sentiment from communicative intent in Nigerian public discourse. The paper argues that AI systems fail on Nigerian language data primarily because of context failure rather than translation failure: pragmatic meaning shifts with speaker, audience, and situation. Evaluating Gemini 2.5 Flash on a 30-item calibration dataset, the authors find that zero-shot register classification accuracy of 33.3% rises to 73.3% with schema-informed prompting, a 40-percentage-point gain from structured in-context guidance. The framework and calibration set are released publicly to support reproducibility.
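As a quick sanity check on the reported figures, the sketch below converts the two accuracies into implied counts of correct items on the 30-item set. The counts are inferred from the percentages and are not stated in the summary; the helper name `implied_correct` is hypothetical.

```python
# Sketch: recover the item counts implied by the reported accuracies.
# Assumption: both accuracies were computed over the same 30 calibration items.

N_ITEMS = 30  # calibration set size given in the summary


def implied_correct(accuracy_pct: float, n_items: int) -> int:
    """Return the whole number of correct items closest to the reported accuracy."""
    return round(accuracy_pct / 100 * n_items)


zero_shot_correct = implied_correct(33.3, N_ITEMS)
schema_correct = implied_correct(73.3, N_ITEMS)

# Difference in percentage points, rounded to avoid floating-point noise.
gain_points = round(73.3 - 33.3, 1)

print(zero_shot_correct, schema_correct, gain_points)
```

On these assumptions the percentages correspond to 10 and 22 correct items out of 30, so the gain amounts to 12 additional items classified correctly.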

Evaluation and Benchmarking · Google · Gemini-2.5-Flash-Lite · AfriSenti · Meaning Intelligence Framework · NaijaSenti

Related guides (2)

Google

Google: The AI Lab That Builds Everything from DNA Models to Your Phone's Assistant

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 26d ago

AI-Assisted Systematization for Evaluating GenAI Systems

This paper addresses a foundational gap in GenAI evaluation: the underspecification of broad, contested concepts like 'reasoning,' 'fairness,' or 'creativity.' The authors introduce a structured artifact called a 'concept spec' and a validation worksheet, then build two AI-assisted systematizers—a zero-shot approach and a multi-agent approach—to convert vague evaluation targets into measurable, structured accounts. They apply these tools to hate-based rhetoric and digital empathy, assessing the resulting specs on content validity and information recoverability. The work positions AI assistance as a scalable aid for the cognitively demanding process of evaluation design.

Evaluation and Benchmarking · AI Safety Research · hate-based rhetoric · concept spec · digital empathy (+4 more)

6 · The Batch · 19d ago

Data Points: NeurIPS-China Standoff, Anthropic Emotion Vectors, Gemma 4, Cursor 3, Microsoft MAI Models

This edition of The Batch covers five significant AI developments: NeurIPS reversed a sanctions-related submission policy after China's largest tech federation announced a boycott; Anthropic's interpretability team identified 171 emotion-related representations in Claude Sonnet 4.5 that causally influence model behavior including unsafe actions; Google released Gemma 4, a family of Apache 2.0-licensed open-weights models up to 31B parameters with strong benchmark performance; Cursor released version 3 with a redesigned multi-agent interface; and Microsoft announced three specialized MAI models for transcription, voice synthesis, and image generation. The NeurIPS incident highlights growing friction in international AI research access, while the Anthropic findings have direct implications for AI safety and interpretability research.

Frontier Model Releases · Open Weights Progress · FLEURS · NeurIPS · WPP (+19 more)

7 · Google DeepMind Blog · 1mo ago

Measuring progress toward AGI: A cognitive framework

DeepMind is introducing a cognitive framework designed to measure progress toward AGI, providing structured criteria for assessing how close AI systems are to general intelligence. Alongside the framework, they are launching a Kaggle hackathon to crowdsource the development of relevant evaluations. The announcement signals a formal effort by a Tier 1 lab to operationalize AGI progress measurement, which has historically been contested and informal.

Frontier Model Releases · Evaluation and Benchmarking · Kaggle · DeepMind · AGI cognitive framework (+1 more)

6 · The Batch · 18d ago

MiniMax M2.7 proprietary reasoning model competes with Gemini and Claude Opus; roundup covers Cursor Composer 2, MAI-Image-2, Claude Code Channels, and Anthropic defense dispute

MiniMax released M2.7, a proprietary reasoning model that achieved 66.6% on MLE Bench Lite (tying Gemini 3.1) and 56.22% on SWE-Pro, priced at $0.30/$1.20 per million tokens, with the shift to proprietary marking a potential strategic pivot among Chinese AI labs away from open weights. Cursor released Composer 2, an agentic coding model built on a fine-tuned Kimi 2.5 (via Moonshot partnership), priced 86% cheaper than its predecessor and scoring 73.7 on SWE-bench Multilingual. Anthropic released Claude Code Channels, routing Telegram and Discord messages into local Claude Code sessions via MCP plugins, and separately filed a court response denying it has any backdoor or kill switch into military deployments of Claude. Microsoft announced MAI-Image-2, a text-to-image model ranking third on Arena.ai among research labs.

Frontier Model Releases · Open Weights Progress · Stitch · Claude Sonnet 4 · SWE-Pro (+17 more)

6 · arXiv · cs.CL · 29d ago

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · BBC News · GPT-4o mini (+11 more)

4 · arXiv · cs.CL · 13d ago

DEFINED: Data-efficient framework for fine-grained creativity assessment in debate using LLMs

DEFINED is a computational framework for automated creativity assessment in debate scenarios, operationalizing creativity through an eight-dimensional hierarchical metric system implemented via a pretrained autoregressive language model with a hierarchical scoring head. The system addresses data scarcity through constrained data augmentation and mixed-granularity training from limited expert-annotated data. It outperforms prompt-based LLM evaluators and existing debate scoring methods on authentic competition data. The work is relevant to AI evaluation methodology and the broader question of whether LLMs can reliably assess complex human cognitive outputs.

Evaluation and Benchmarking · DEFINED

4 · Import AI · 1mo ago

Import AI 446: Nuclear LLMs; China's big AI benchmark; measurement and AI policy

Import AI issue 446 covers three main topics: the application of large language models to nuclear domains, a major new AI benchmark from China, and the intersection of AI measurement with policy. The newsletter synthesizes recent developments across frontier AI research and geopolitical AI competition. It also touches on speculative questions about AI psychology, such as whether AIs might experience jealousy. As a tier-2 commentary digest, it aggregates signals across multiple active research and policy threads.

Frontier Model Releases · Evaluation and Benchmarking · Jack Clark · Import AI · China (+2 more)

6 · The Batch · 22d ago

Gemini 3.5 Flash Launch, AI FDE Job Trends, AI Act Delays, and Agent-Driven Web Traffic

Google launched Gemini 3.5 Flash, a mid-tier multimodal mixture-of-experts model with improved agentic capabilities, visual understanding, and speed, priced at $1.50/$9.00 per million input/output tokens — three times the cost of its predecessor Gemini 3 Flash. The model supports up to 1M token context, adjustable reasoning levels, and thought preservation across multi-turn conversations, and tops the Artificial Analysis APEX-Agents-AA and MMMU-Pro benchmarks. The issue also covers Andrew Ng's commentary on the rise of AI Forward Deployed Engineers versus the broader AI Engineer role, plus news items on EU AI Act implementation delays and AI agents driving measurable online traffic shifts.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · Palantir · Artificial Analysis Intelligence Index (+18 more)