
Grok 4

model · active · grok-4-42edce9b · 2 events · first seen 25d ago

Aliases: Grok 4


Recent events (2)

6 · arXiv · cs.CL · 25d ago

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash, Gemini 3 Pro, Grok 4, Claude Sonnet 4.5, GPT-5, and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. The top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures that account for over 70% of errors, and a sharp accuracy collapse (to 19-70%) on questions containing subtle false premises. The authors also identify a detection-accuracy paradox: the best false-premise detector does not achieve the best adversarial accuracy, which suggests that premise detection and answer recovery are partially independent capabilities.

7 · Anthropic News · 15d ago

Anthropic Publishes Political Even-Handedness Evaluation for Claude, Open-Sources Methodology

Anthropic has released a detailed account of how it trains and evaluates Claude for political even-handedness, including character traits instilled via reinforcement learning since early 2024 and a new automated evaluation methodology. The evaluation tests thousands of prompts across hundreds of political stances and benchmarks Claude Sonnet 4.5 against GPT-5, Llama 4, Grok 4, and Gemini 2.5 Pro, finding Claude comparable to Grok 4 and Gemini 2.5 Pro and more even-handed than GPT-5 and Llama 4. Anthropic is open-sourcing the evaluation framework to encourage shared industry standards for measuring political bias. The post also discloses the specific system prompt language used on Claude.ai to enforce even-handed behavior.