Entity · company

xAI

companyactivexai-5717c757·24 events·first seen May 18, 2026

Aliases: xAI

Co-occurring entities

OpenAI Anthropic Google GPT-5.5 Microsoft Grok 4.3 Grok Grok 4 DeepSeek V4 Claude Opus 4.6 Gemini 3 Flash Grok Imagine Claude Latent Space Gemini 3.5 Pro GPT-4o mini Claude Sonnet 4.5 Simon Willison Cursor Claude Code

More like this (12)

AI for Science AI vs. AI AI for Game Development AllenAI Together AI AI Agents Meta AI Scale AI Import AI Public AI AI and Compute AI for Math Initiative

Recent events (24)

6arXiv · cs.CL·3d ago·source ↗

Gubernaut: Deterministic homeostatic controller for affect-regulated LLM agents validated across four frontier model families

Researchers introduce the Gubernaut Cognitive Controller (GCC), a model-agnostic runtime layer that monitors numeric telemetry (intensity, valence, repetition) at a meta level to regulate LLM agent behavior under sustained pressure—addressing escalation, sycophancy, and perseveration without modifying model weights. The architecture uses a Nelson–Narens monitoring–control loop where the deterministic meta level ingests zero tokens, eliminating a class of prompt-injection attack vectors by construction. Evaluation uses a pre-registered generate-once/judge-many protocol across a 4×4 matrix of four frontier models (GPT-5.5, Claude Opus 4.8, Gemini 3.5 Flash, Grok 4.3) as both generators and judges, finding the regulated arm calmer in 13 of 16 cells at p<.05 and 15 of 16 by sign. The recovery signature—arousal integrating under attack then decaying on de-escalation—replicates across all four model families, suggesting a robust mechanism rather than a judge-style artifact.

AI Safety Research Agent and Tool Ecosystem Google Gemini 3.5 Flash xAI +7 more

7arXiv · cs.CL·4d ago·source ↗

Study finds LLM epistemic stances on pseudo-science vary by deployment configuration, not just model weights

Researchers tested four major LLM families (Claude, Grok, GPT, Gemini) on their evaluation of ethnonationalist pseudo-science across four temporal snapshots and two interface types (API vs. web). Grok's Fast versions consistently rated the pseudo-scientific claims 2-5x more credible than other models, and a silent overnight patch reversed Grok's behavior without public documentation; the same model identifier produced radically divergent scores via API versus web three months later. The paper argues that a model's epistemic stance is not a stable property of its weights but a contingent effect of deployment configuration—system prompts, safety layers, interface routing, and undocumented updates—constituting an accountability gap for users and researchers.

Evaluation and Benchmarking AI Safety Research Claude Opus 4.6 Grok Google +7 more

5The Batch·Jul 24, 2026·source ↗

Stanford/Together AI study finds retrieval is the weakest link for LLM web-search agents

Researchers at Stanford University and Together AI tested six LLMs equipped with web-search tools on daily news questions across six languages, finding that retrieval failures account for the majority of errors (38.8%) rather than reasoning or comprehension failures. Top models exceeded 90% accuracy on well-formed English multiple-choice questions, but performance degraded significantly for Hindi, free-response formats, and questions containing false premises. The study identifies three retrieval improvement levers—indexing coverage, source ranking, and multilingual query handling—and suggests retrieval optimization may yield larger gains than model scaling for time-sensitive queries.

Evaluation and Benchmarking Agent and Tool Ecosystem Gemini 3.5 Pro GPT-4o mini Stanford University +10 more

7Latent Space·Jul 24, 2026·source ↗

Black Forest Labs releases FLUX 3 multimodal flow model, reportedly outperforming Seedance 2.0, Gemini Omni, and Grok Imagine

Black Forest Labs has released FLUX 3, a multimodal flow model that reportedly beats competing image/video generation systems including Seedance 2.0, Gemini Omni, and Grok Imagine on key benchmarks. The release also includes a FLUX-mimic video-action robotics model, extending the FLUX family into embodied AI applications. This represents a significant capability advance for BFL in the competitive generative media space.

Frontier Model Releases Multimodal Progress Seedance 2.0 Black Forest Labs Grok Imagine +5 more

6arXiv · cs.CL·Jul 17, 2026·source ↗

Audit finds Grokipedia less politically neutral than Wikipedia, with distinct ideological biases

A large-scale arXiv study audits political neutrality in Grokipedia—an encyclopedia generated by xAI's Grok LLM—versus Wikipedia, analyzing 1,394 article pairs about government members across nine ideology dimensions using four LLM judges (Grok, Claude, Mistral, DeepSeek). All four judges, including Grok itself, rate Grokipedia as less neutral than Wikipedia. The study finds Grokipedia favors economically right-wing politicians and penalizes socially liberal ones, while Wikipedia shows the opposite bias pattern, raising questions about whether LLM-generated content can deliver ideological neutrality.

Evaluation and Benchmarking AI Safety Research Grokipedia DeepSeek V4 Grok +5 more

4Simon Willison'S Weblog·Jul 16, 2026·source ↗

xAI open-sources grok-build repository

Simon Willison notes that xAI has open-sourced the grok-build repository on GitHub. The post is brief with limited technical detail, but the open-sourcing of xAI tooling is a notable signal in the open-weights/open-source AI ecosystem. The significance depends on what grok-build contains, which is not elaborated in the source.

Open Weights Progress Agent and Tool Ecosystem grok-build xAI Simon Willison

3Simon Willison'S Weblog·Jul 16, 2026·source ↗

Simon Willison demos Grok-powered Mermaid-to-Unicode box art conversion

Simon Willison documents a small tool or experiment called 'grok-mermaid' that converts Mermaid diagram syntax into Unicode box-art representations. The post appears to use an xAI Grok model as the underlying engine for the conversion. This is a lightweight capability demo illustrating LLM-assisted diagram rendering.

Agent and Tool Ecosystem xAI Simon Willison grok-mermaid +1 more

3Github Trending·Jul 13, 2026·source ↗

grok2api: OpenAI-compatible API gateway for Grok web interface

grok2api is a FastAPI-based open-source gateway that wraps the Grok web interface and exposes it as an OpenAI-compatible API. The project has accumulated 5,532 GitHub stars with 112 added today, indicating active community interest. It enables developers to use Grok models through standard OpenAI API tooling without official API access.

Agent and Tool Ecosystem xAI Grok 4.3 grok2api

6Latent Space·Jul 9, 2026·source ↗

AINews: SpaceXAI launches Grok 4.5, first Opus-class model post Cursor acquisition

Latent Space's AINews digest reports that SpaceXAI has launched Grok 4.5, described as the first Opus-class model released following the Cursor acquisition. The item signals continued rapid iteration from xAI. The body is extremely thin, so most substance must be inferred from the headline alone.

Frontier Model Releases Agent and Tool Ecosystem Grok 4 Cursor xAI +1 more

3Hacker News·Jul 9, 2026·source ↗

Informal build-off comparing Grok 4.5, GPT-5.5, and Claude on app development tasks

A blog post from tryai.dev pits Grok 4.5, GPT-5.5, and Claude against each other on identical app-building tasks, generating moderate HN engagement (152 points, 80 comments). The comparison is informal and practitioner-oriented rather than rigorous benchmarking. It provides anecdotal signal on relative coding capability across current frontier models.

Frontier Model Releases Claude Grok 4 xAI +3 more

7The Batch·Jun 17, 2026·source ↗

Data Points: GLM-5.2 leads open models on coding benchmarks; SpaceX acquires Cursor; OpenRouter Fusion; Anthropic coding study; ChatGPT market share drops

Zhipu released GLM-5.2, a 744B-parameter open model under MIT license that ranks second only to Claude Opus 4.8 on long-horizon coding benchmarks including FrontierSWE and SWE-Marathon, featuring a 1M-token context window and a 2.9× compute reduction via IndexShare attention. SpaceX is acquiring Cursor (Anysphere) for $60B in stock, positioning Musk's company to compete in AI software tools using xAI's Colossus infrastructure. OpenRouter launched Fusion, a multi-model synthesis tool showing that budget model panels can match frontier model performance at half the cost. An Anthropic study of 400K Claude Code sessions found domain expertise—not coding skill—is the primary driver of agentic output, while a Munich court ruled Google liable for false claims in AI Overviews.

Frontier Model Releases Evaluation and Benchmarking DRACO FrontierSWE Anysphere +24 more

9The Batch·Jun 3, 2026·source ↗

U.S. Department of War bans Anthropic, contracts OpenAI for classified AI systems after standoff over safety restrictions

The U.S. Department of War designated Anthropic a supply-chain risk to national security after the company refused to remove restrictions on Claude's use for domestic surveillance and autonomous weapons, effectively banning it from military and contractor use. OpenAI signed a contract allowing use of its models 'for all lawful purposes' with ambiguous carve-outs for surveillance and autonomous weapons, which Altman later called rushed and renegotiated. The standoff culminated in a Trump Truth Social post threatening civil and criminal consequences against Anthropic, followed by Hegseth's formal designation. The episode marks a significant precedent: the supply-chain risk designation, previously applied only to foreign companies, was used against a U.S. AI lab over its own usage policies.

AI Safety Research Regulatory Developments Dario Amodei Palantir U.S. Department of Defense +8 more

6The Batch·Jun 3, 2026·source ↗

Meta, OpenAI, and other AI companies build private gas-fired power plants to bypass public utilities

Major AI companies including Meta, OpenAI, Oracle, and xAI are constructing private, off-grid power plants—primarily natural gas—to directly supply their data centers, bypassing public utility grid connections. A Cleanview study identified 46 such projects, 90% announced in 2025, accounting for 30% of all planned U.S. data-center capacity. Meta is building gas plants in Ohio and Texas, while OpenAI and Oracle's Stargate-linked Jupiter project is underway in New Mexico. The shift signals a structural change in AI infrastructure energy strategy, with climate implications as fossil fuels displace earlier renewable commitments.

Training Infrastructure Inference Economics Microsoft Cleanview Stargate +7 more

7The Batch·Jun 2, 2026·source ↗

Grok Imagine 1.0 Sharply Cuts Costs for High-Quality Video Generation

xAI launched Grok Imagine 1.0, a text-and-image-to-video model that topped the Artificial Analysis Video Arena leaderboard in both text-to-video and image-to-video categories at launch. The model generates up to 15-second clips with audio at $4.20 per minute of output, significantly undercutting Google Veo 3.1 ($12/min) and OpenAI Sora 2 Pro ($30/min). It is integrated with the X social network, enabling direct generation and sharing, though xAI disclosed no technical details about the model's architecture. The launch highlights continued rapid cost compression in video generation, with a seven-fold price gap between Grok Imagine 1.0 and Sora 2 Pro.

Frontier Model Releases Evaluation and Benchmarking Artificial Analysis Grok Imagine Google Veo 3.1 +10 more

7The Batch·Jun 2, 2026·source ↗

OpenAI Shuts Down Sora Video Generation Model, Redirects Team to World Models and Robotics

OpenAI is discontinuing its Sora video generation model, with web/app access ending April 26 and API access closing September 24, 2026. The model was losing roughly $1 million per day, with daily active users falling below 500,000 after peaking at 1 million post-mobile launch. The Sora team will be redirected to longer-term projects including world models and robotics, while compute resources have already been diverted to a new coding/enterprise model codenamed Spud. The shutdown also effectively ends OpenAI's high-profile partnership with Disney, which had planned to invest up to $1 billion contingent on Sora integration.

Frontier Model Releases Inference Economics Artificial Analysis Sora 2 Pro Disney +13 more

7The Batch·Jun 1, 2026·source ↗

Data Points: China Blocks Meta-Manus Deal; Microsoft-OpenAI Restructure; Nvidia Nemotron Omni; Grok 4.3; OpenAI AGI Principles; IBM Granite 4.1

A roundup of major AI developments: Chinese regulators blocked Meta's acquisition of Singapore-based agent startup Manus on security grounds; Microsoft and OpenAI restructured their partnership, with OpenAI gaining freedom to sell on rival clouds while Microsoft loses its AGI-access clause; Nvidia released Nemotron 3 Nano Omni, a 30B MoE omnimodal open-weights model for local agent deployment; xAI shipped Grok 4.3 with a 1M-token context window at reduced pricing; OpenAI published AGI operating principles; and IBM released Granite 4.1 across language, vision, speech, embedding, and safety modalities.

Long Context Evolution Frontier Model Releases Palantir IBM Microsoft +17 more

7The Batch·Jun 1, 2026·source ↗

US Government Prepares AI Model Vetting System; GPT-5.5 Instant, Claude Finance Agents, Pentagon AI Partnerships

The White House is preparing an executive order to create an FDA-style vetting system for new AI models, prompted partly by Anthropic's Mythos model disclosing cybersecurity risks; the Commerce Department separately expanded a voluntary testing program with Google, Microsoft, and xAI. OpenAI rolled out GPT-5.5 Instant as the default ChatGPT model, claiming 52.5% fewer hallucinations on high-stakes prompts. Anthropic released ten financial agent templates running on Claude Opus 4.7, while the Pentagon expanded AI vendor agreements to include Microsoft, Amazon, Nvidia, and Reflection AI after canceling its Anthropic contract over autonomous weapons restrictions. Major pharma companies report AI gains primarily in manufacturing optimization rather than drug discovery breakthroughs.

Frontier Model Releases Evaluation and Benchmarking Vals AI Finance Agent Benchmark White House Darius Amodei +23 more

5Latent Space·Jun 1, 2026·source ↗

Why Video Agent Models Are Next — Ethan He, xAI Grok Imagine

Latent Space interviews Ethan He, the lead behind xAI's Grok Imagine video generation product, covering its development in roughly three months. The discussion explores the distinction between video generation models and world models, and positions video agents as a significant near-term frontier. He argues Grok Imagine is underrated relative to its capabilities.

Frontier Model Releases Agent and Tool Ecosystem Grok Imagine world model video agents +4 more

6arXiv · cs.CL·May 22, 2026·source ↗

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases Evaluation and Benchmarking Gemini 3.5 Pro BBC News GPT-4o mini +11 more

7Latent Space·May 19, 2026·source ↗

Anthropic-SpaceX AI's 300MW/$5B/yr Colossus I Deal; ARR Growth 8000% Annualized

Latent Space AINews reports that Anthropic has struck a major infrastructure deal with SpaceX AI involving 300MW of compute capacity at the Colossus I data center for approximately $5B per year. The report also highlights Anthropic's annualized ARR growth of 8000%, signaling rapid commercial scaling. This represents a significant strategic alignment between Anthropic and xAI/SpaceX infrastructure assets.

Training Infrastructure Frontier Model Releases Colossus 1 xAI SpaceX AI +4 more

7The Batch·May 18, 2026·source ↗

U.S. Government to Pre-Deployment Evaluate Frontier AI Models via NIST TRAINS Task Force

The U.S. National Institute of Standards and Technology (NIST) announced a new multi-agency task force called TRAINS (Testing Risks of AI for National Security) to assess national-security risks from frontier AI models before public deployment. Major AI companies including Google, Microsoft, xAI, Anthropic, and OpenAI have agreed to submit models—including versions with limited guardrails—for evaluation focused on cybersecurity, biosecurity, and chemical weapons risks. The White House is also considering an executive order requiring pre-deployment approval for AI models. TRAINS draws on multiple federal agencies and differs from prior NIST groups in its rapid-response design, though its specific benchmarks have not been disclosed.

Frontier Model Releases Evaluation and Benchmarking DeepSeek V4 Microsoft Google +9 more

7The Batch·May 18, 2026·source ↗

U.S. Government to Pre-Release Test AI Models for National Security Risks via NIST TRAINS Task Force

NIST announced a new multi-agency task force called TRAINS (Testing Risks of AI for National Security), overseen by its Center for AI Standards and Innovation, to evaluate frontier AI models for cybersecurity, biosecurity, and chemical weapons risks before public deployment. Google, Microsoft, xAI, Anthropic, and OpenAI have voluntarily agreed to submit models with limited guardrails for evaluation. The policy shift follows Anthropic's announcement that Claude Mythos Preview can autonomously exploit software vulnerabilities, and marks a sharp reversal from the Trump Administration's earlier deregulatory stance. The White House is also considering an executive order that would make pre-release government testing mandatory.

Frontier Model Releases Evaluation and Benchmarking White House Center for AI Standards and Innovation DeepSeek V4 +11 more

6The Batch·May 18, 2026·source ↗

OpenAI Updates Audio Models That Reason, Transcribe, and Translate

OpenAI introduced three new audio models in its Realtime API: GPT-Realtime-2 (speech-to-speech with five configurable reasoning effort levels), GPT-Realtime-Translate (70+ input languages), and GPT-Realtime-Whisper (transcription). GPT-Realtime-2 operates as an end-to-end audio model including reasoning, with latency ranging from 1.12 seconds at minimal effort to 2.33 seconds at high effort. Benchmark results are mixed: it leads Scale AI's Audio MultiChallenge and Artificial Analysis Conversational Dynamics but trails Step-Audio R1.1 Realtime and Grok Voice Think Fast 1.0 on speech reasoning and agentic tasks. The configurable reasoning-latency tradeoff is positioned as a key differentiator for voice agent applications.

Frontier Model Releases Evaluation and Benchmarking Scale AI Audio MultiChallenge GPT-Realtime-2 Google +14 more

7arXiv · cs.LG·May 18, 2026·source ↗

AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases

This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.

Evaluation and Benchmarking AI Safety Research Grok X (Twitter)EU AI Act +5 more