Entity · organization

Stanford University

organization · active · stanford-university-10763c80 · 16 events · first seen May 18, 2026

Aliases: Stanford University

More like this (12)

California State University · UCLA · Carnegie Mellon University · University of Texas Austin · University of California Los Angeles · University of Pennsylvania · UC Berkeley · California · Stanford Internet Observatory · Imperial College London · Stanford SNAP · Curtin University

Recent events (16)

5 · The Batch · Jul 24, 2026 · source ↗

Stanford/Together AI study finds retrieval is the weakest link for LLM web-search agents

Researchers at Stanford University and Together AI tested six LLMs equipped with web-search tools on daily news questions across six languages, finding that retrieval failures account for the largest share of errors (38.8%), ahead of reasoning and comprehension failures. Top models exceeded 90% accuracy on well-formed English multiple-choice questions, but performance degraded significantly for Hindi, free-response formats, and questions containing false premises. The study identifies three retrieval improvement levers (indexing coverage, source ranking, and multilingual query handling) and suggests retrieval optimization may yield larger gains than model scaling for time-sensitive queries.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Gemini 3.5 Pro · GPT-4o mini · Stanford University · +10 more

5 · arXiv · cs.AI · Jul 17, 2026 · source ↗

Symbal: Automated detection of systematic misalignments in MLLM-generated image captions

Researchers from Stanford introduce Symbal, a dual-stage framework using off-the-shelf foundation models to detect recurring, visually-correlated errors in multimodal LLM-generated captions. They also release SymbalBench, a benchmark of 1.7 million image-text pairs across 420 vision-language datasets covering natural and medical imaging domains. Symbal correctly identifies systematic misalignments in 63.8% of datasets, nearly 4x better than the closest baseline, and works without access to the underlying MLLM. The work addresses a practical data-quality auditing problem relevant to anyone building or evaluating vision-language datasets.

Evaluation and Benchmarking · Multimodal Progress · Symbal · Stanford University · SymbalBench

5 · The Batch · Jul 3, 2026 · source ↗

RoboReward: Vision-Language Reward Models for Robot Training via RL

Researchers at Stanford and UC Berkeley developed RoboReward, a family of 4B and 8B vision-language reward models designed to provide reward signals for robot reinforcement learning across diverse robot types and tasks. The team built a novel dataset by augmenting successful robot demonstrations with synthetically generated failure examples using GPT-5 mini and Qwen3-4B, then fine-tuned Qwen3-VL models to predict task progress scores. RoboReward 8B outperformed GPT-5, GPT-5 mini, and Gemini Robotics-ER 1.5 on the new RoboRewardBench evaluation, and in real-world robot trials substantially exceeded prior reward model baselines while still falling short of human-assigned rewards. The authors also release RoboRewardBench as a community benchmark for reward model evaluation.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DeepLearning.AI · Stanford University · UC Berkeley · +12 more

4 · The Batch · Jun 26, 2026 · source ↗

U.S. universities rapidly expanding AI degree programs, now exceeding 1,000 offerings

As of April 2026, at least 1,000 AI programs exist across nearly 584 U.S. colleges and universities, including 78 majors and 103 minors, up from just five AI majors in 2021. The Batch surveys the landscape of undergraduate AI curricula, ranging from highly technical programs like Carnegie Mellon's math-intensive degree to interdisciplinary offerings like Drake University's humanities-oriented BA in AI. Debate continues over whether specialized AI degrees risk sacrificing broader CS foundations, and whether academic curriculum cycles are too slow to keep pace with the field's evolution.

Carnegie Mellon University · DeepLearning.AI · Stanford University · +3 more

6 · The Batch · Jun 19, 2026 · source ↗

DeepSWE, ProgramBench, and ITBench-AA emerge as harder successors to SWE-bench for agent evaluation

Three new benchmarks, DeepSWE (by Datacurve), ProgramBench (Meta/Stanford/Harvard), and ITBench-AA (IBM/Artificial Analysis), are positioned as more rigorous replacements for the SWE-bench family, which models have largely saturated. DeepSWE tests feature implementation using private codebases and human-written problems; ProgramBench evaluates agents' ability to recreate functional programs from scratch; ITBench-AA measures root-cause diagnosis in real-world IT incident scenarios. Current top performers are GPT-5.5 (70% on DeepSWE) and Claude Opus 4.7 (46.7% on ITBench-AA; 3% on ProgramBench at the 95% pass threshold), illustrating that even frontier models have substantial headroom.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Artificial Analysis · Llama 3.1 70B · Datacurve · +13 more

5 · GitHub Trending · Jun 19, 2026 · source ↗

STORM: Stanford LLM-powered knowledge curation system for cited report generation

STORM is an open-source Python system from Stanford OVAL that uses LLMs to autonomously research a topic and generate full-length reports with citations. The repository has accumulated 28,804 GitHub stars, with 199 added on the day it was listed, indicating sustained community interest. The system represents a practical agentic research pipeline combining retrieval, synthesis, and structured writing.

Open Weights Progress · Agent and Tool Ecosystem · Stanford University · STORMS

6 · arXiv · cs.AI · Jun 17, 2026 · source ↗

Stanford EDGAR Filings Dataset: 152B-token open corpus of SEC filings for LLM pretraining

Stanford researchers introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown. They release an initial 152B-token snapshot and describe a larger 550B-token archive. The dataset targets the growing scarcity of high-quality long-context pretraining data, with less than 0.1% overlap with Common Crawl-derived corpora. Two derived benchmarks are also introduced: EDGAR-Forecast for filing-grounded numerical forecasting and EDGAR-OCR for complex financial table transcription. The work addresses a real gap in open long-context training data outside narrow domains like code.

Training Infrastructure · Long Context Evolution · EDGAR-OCR · EDGAR-Forecast · Stanford University · +3 more

6 · The Batch · Jun 10, 2026 · source ↗

Data Points: Apple/Google Siri overhaul, Gemma 4 12B, Kimi Code CLI, OpenJarvis, and U.S. OpenAI stake talks

A multi-item digest covers several significant AI developments: Apple is expected to announce a revamped Siri at WWDC that uses Google Gemini models distilled for on-device use alongside cloud routing, marking a notable Apple-Google AI partnership. Google released Gemma 4 12B, an encoder-free multimodal open-weights model designed for consumer laptops under Apache 2.0. Moonshot AI released Kimi Code CLI, an open-source terminal coding agent with native subagent orchestration and conversational MCP configuration. Stanford and Lambda Labs released OpenJarvis, an on-device agent framework claiming near-cloud accuracy at 800× lower API cost. The White House and OpenAI are reportedly negotiating a government equity stake in OpenAI as part of a proposed Public Wealth Fund.

Frontier Model Releases · Open Weights Progress · Kimi Code CLI · Stanford University · WWDC · +14 more

6 · The Batch · Jun 1, 2026 · source ↗

Test-Time Training End-to-End (TTT-E2E) Retrains Model Weights to Handle Long Inputs

Researchers from Astera Institute, Nvidia, Stanford, UC Berkeley, and UC San Diego introduced TTT-E2E, a method that compresses long context into transformer weights by training the model during inference via meta-learning. The approach uses sliding-window attention restricted to 8,000 tokens and updates only the fully connected layers of the last quarter of the network on each 1,000-token chunk at inference time, keeping per-token generation latency roughly constant as context scales to 128,000 tokens. TTT-E2E slightly outperforms vanilla transformers on next-token prediction loss across long contexts and matches efficient architectures like Mamba 2 and Gated DeltaNet on inference speed, but fails dramatically on Needle-in-a-Haystack retrieval beyond 8,000 tokens and incurs substantially higher training latency. The work reframes long-context handling as a training-inference trade-off rather than an architectural design problem.

Training Infrastructure · Long Context Evolution · University of California San Diego · Mamba · Stanford University · +13 more

6 · Anthropic News · Jun 1, 2026 · source ↗

How scientists are using Claude to accelerate research and discovery

Anthropic describes how researchers are deploying Claude-powered systems across scientific workflows, highlighting three case studies, among them Biomni (a Stanford agentic platform integrating hundreds of biomedical tools) and the Cheeseman Lab (automating large-scale gene knockout experiment interpretation). The piece details Claude for Life Sciences and the AI for Science program, which provides free API credits to high-impact research projects. Specific results cited include compressing months-long GWAS analyses to 20 minutes and analyzing 336,000 single-cell datasets to identify novel transcription factors.

Frontier Model Releases · Enterprise Deployment Patterns · Claude Opus 4.6 · Stanford University · Claude · +9 more

4 · Hacker News · Jun 1, 2026 · source ↗

AI Agent Guidelines for CS336 at Stanford

Stanford's CS336 (Language Models from Scratch) course has published explicit guidelines for AI agent behavior within its assignment repository, surfacing as a community discussion item on Hacker News. The CLAUDE.md file provides instructions governing how AI coding assistants should interact with course materials, likely addressing academic integrity and appropriate use boundaries. This represents an early example of educational institutions codifying AI agent behavior policies at the course level.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · CS336 · Stanford University · Claude · +1 more

6 · The Batch · May 23, 2026 · source ↗

Agent Benchmarks Skew Toward Software Engineering, Missing Most Economically Valuable Labor

Researchers from Carnegie Mellon University and Stanford University mapped over 10,000 examples from 43 agent benchmarks to U.S. labor statistics using O*NET occupational taxonomies, finding that current benchmarks heavily over-represent software engineering relative to its share of employment and wages. Office and administrative support (18.2M workers, $869.8B wages) and management (11M workers, $1,326.3B wages) are vastly under-represented compared to computer and mathematical occupations (5.2M workers, $563.6B wages). No single benchmark covered more than 50% of work activities, and all 43 benchmarks combined covered only 56.5% of work activities. The study identifies a systematic gap between where agentic AI is being evaluated and where the largest economic opportunity lies.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Carnegie Mellon University · GDPval · Stanford University · +7 more

7 · OpenAI Blog · May 20, 2026 · source ↗

Concrete Problems in AI Safety

OpenAI, Google Brain, Berkeley, and Stanford researchers co-authored 'Concrete Problems in AI Safety,' a foundational paper exploring research challenges in ensuring modern ML systems operate as intended. The paper identifies and frames specific technical safety problems for the field. Published in June 2016, it became a landmark reference for AI safety research agendas.

AI Safety Research · Alignment and RLHF · Concrete Problems in AI Safety · Stanford University · UC Berkeley · +2 more

5 · Google DeepMind Blog · May 19, 2026 · source ↗

Uncovering repurposed medicines to fight liver fibrosis using Co-Scientist

A Stanford geneticist used Google DeepMind's Co-Scientist AI system to identify potential drug repurposing candidates for chronic liver disease and liver fibrosis. The work represents a real-world application of AI-assisted scientific discovery in a clinical domain. Co-Scientist is DeepMind's AI research assistant designed to accelerate hypothesis generation and experimental planning for scientists.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · drug repurposing · liver fibrosis · Co-Scientist · +2 more

4 · The Batch · May 18, 2026 · source ↗

Gallup Poll Shows AI Boosts Productivity, but Many Workers Haven't Tried It

A Gallup survey of 23,700 U.S. employees found that half used AI at work at least a few times in the past year, with daily use rising from 4% in 2023 to 13% in 2025. Among workers in AI-using organizations, 65% reported productivity improvements, though only 31% said it changed their workflows. Managerial support and organizational strategy were key predictors of adoption. The broader employment impact remains contested, with conflicting signals from macroeconomic data and labor market research.

Inference Economics · Enterprise Deployment Patterns · Brookings Institution · Stanford University · Gallup · +2 more

7 · Anthropic News · May 18, 2026 · source ↗

Anthropic Launches Claude for Healthcare and Expands Life Sciences Capabilities

Anthropic is expanding its healthcare and life sciences offerings with Claude for Healthcare, a HIPAA-ready product suite for providers, payers, and health tech companies, alongside new connectors to CMS databases, ICD-10, NPI Registry, and FHIR development tools. The announcement also highlights Claude Opus 4.5's improved performance on medical benchmarks including MedCalc and MedAgentBench, with extended thinking (64k tokens) and native tool use. New life sciences capabilities include connections to additional scientific platforms and support for clinical trial management and regulatory operations. The release positions Claude as an agentic research and administrative partner across healthcare workflows including prior authorization, claims appeals, and patient care coordination.

Frontier Model Releases · Evaluation and Benchmarking · PubMed · MedAgentBench · Claude Opus 4.6 · +12 more