Entity · model

Grok 4

Superseded: Grok 4.3 is xAI's current flagship model (supersedes Grok 4 and Grok 3). See Grok 4.3 →

model · active · grok-4-42edce9b · 5 events · first seen May 22, 2026

Aliases: Grok 4, Grok 4.5

More like this (12)

Grok 4.3 · Grok · Grok-3 · Grok Imagine · Grok-4-Fast · grok-build · grok2api · Groc-PO · Grok Voice Think Fast 1.0 · grok-mermaid · Groq · Grokipedia

Recent events (5)

5 · The Batch · Jul 24, 2026

Stanford/Together AI study finds retrieval is the weakest link for LLM web-search agents

Researchers at Stanford University and Together AI tested six LLMs equipped with web-search tools on daily news questions across six languages, finding that retrieval failures, rather than reasoning or comprehension failures, account for the largest share of errors (38.8%). Top models exceeded 90% accuracy on well-formed English multiple-choice questions, but performance degraded significantly for Hindi, free-response formats, and questions containing false premises. The study identifies three retrieval improvement levers (indexing coverage, source ranking, and multilingual query handling) and suggests retrieval optimization may yield larger gains than model scaling for time-sensitive queries.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Gemini 3.5 Pro · GPT-4o mini · Stanford University · +10 more

6 · Latent Space · Jul 9, 2026

AINews: SpaceXAI launches Grok 4.5, first Opus-class model post Cursor acquisition

Latent Space's AINews digest reports that SpaceXAI has launched Grok 4.5, described as the first Opus-class model released following the Cursor acquisition. The item signals continued rapid iteration from xAI. The body is extremely thin, so most substance must be inferred from the headline alone.

Frontier Model Releases · Agent and Tool Ecosystem · Grok 4 · Cursor · xAI · +1 more

3 · Hacker News · Jul 9, 2026

Informal build-off comparing Grok 4.5, GPT-5.5, and Claude on app development tasks

A blog post from tryai.dev pits Grok 4.5, GPT-5.5, and Claude against each other on identical app-building tasks, generating moderate HN engagement (152 points, 80 comments). The comparison is informal and practitioner-oriented rather than rigorous benchmarking. It provides anecdotal signal on relative coding capability across current frontier models.

Frontier Model Releases · Claude · Grok 4 · xAI · +3 more

7 · Anthropic News · Jun 1, 2026

Anthropic Publishes Political Even-Handedness Evaluation for Claude, Open-Sources Methodology

Anthropic has released a detailed account of how it trains and evaluates Claude for political even-handedness, including character traits instilled via reinforcement learning since early 2024 and a new automated evaluation methodology. The evaluation tests thousands of prompts across hundreds of political stances and benchmarks Claude Sonnet 4.5 against GPT-5, Llama 4, Grok 4, and Gemini 2.5 Pro, finding Claude comparable to Grok 4 and Gemini 2.5 Pro and more even-handed than GPT-5 and Llama 4. Anthropic is open-sourcing the evaluation framework to encourage shared industry standards for measuring political bias. The post also discloses the specific system prompt language used on Claude.ai to enforce even-handed behavior.

Frontier Model Releases · Evaluation and Benchmarking · claude.ai · Claude Sonnet 4.5 · Grok 4 · +8 more

6 · arXiv · cs.CL · May 22, 2026

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · BBC News · GPT-4o mini · +11 more