Entity · organization

NeurIPS

organization · active · neurips-90d5c00f · 5 events · first seen May 26, 2026

Aliases: NeurIPS, NeurIPS 2024, NeurIPS 2026

More like this (12)

NeurIPS 2020 · NeurIPS 2025 · EMNLP 2025 · ICML · NeuralBench · Cognizant Neuro AI · Physics-Informed Neural Networks · bridge-ai-neuro · NeuralDEM · Yale NLP · ICML 2026 · NeuralSet

Recent events (5)

7 · arXiv · cs.AI · 2d ago

Shadow evaluations show frontier AI agents can do AI research engineering but fail at open-ended scientific reasoning

A new arXiv preprint introduces 'shadow evaluations', a methodology in which AI agents tackle the central research question of unpublished NeurIPS 2026 papers and the original authors grade the output. Frontier agents given six days and thousands of dollars of compute completed all engineering tasks without human help but failed to make substantive scientific progress, and the original authors rejected both papers. The authors identify five recurring failure modes, including poor judgment about publishability, uncreative responses to design shortcomings, and instruction drift. The work provides early empirical evidence that the gap between engineering and research is a real bottleneck for AI R&D automation.

Evaluation and Benchmarking · AI Safety Research · NeurIPS · Can AI agents conduct open-ended AI research? Early evidence from two case studies · +1 more

5 · arXiv · cs.CL · 5d ago

Multilayer taxonomy of 14 capability domains and 91 subskills for LLM evaluation

Researchers introduce a structured taxonomy organizing LLM capabilities into 14 domains and 91 subskills across Primitive, Constructed, and Integrative layers, grounded in human cognitive science rather than model architecture. To validate the framework, they screened 31,505 papers from ACL, AAAI, ICML, and NeurIPS (2023–2025) and mapped 15,934 LLM-focused papers through multi-model annotation. The analysis reveals heavy concentration of research attention on Language-Semantic Competence and Reasoning, with six domains receiving fewer than 2% of papers, highlighting systematic coverage gaps. The taxonomy is intended to support cross-study comparison, coverage audits, and hypothesis generation for training and transfer.

Evaluation and Benchmarking · NeurIPS · From Isolated Tasks to Structured Capabilities: A Multilayer Taxonomy for Large Language Models · ICML · +2 more

4 · GitHub Trending · Jun 19, 2026

HippoRAG: RAG framework combining knowledge graphs and Personalized PageRank for continuous knowledge integration

HippoRAG is an open-source RAG framework published at NeurIPS 2024 by the OSU NLP Group that draws on models of human long-term memory to enable LLMs to continuously integrate knowledge across external documents. It combines retrieval-augmented generation with knowledge graphs and Personalized PageRank to improve multi-hop and associative retrieval. The repository has accumulated 3,742 GitHub stars with ongoing community traction.

Evaluation and Benchmarking · Agent and Tool Ecosystem · NeurIPS · HippoRAG · OSU NLP Group
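
The HippoRAG summary above mentions Personalized PageRank over a knowledge graph. As a rough illustration of that general technique only, and not HippoRAG's actual code or API, the sketch below runs power-iteration Personalized PageRank on a toy graph. The graph format, node names, and parameter defaults are all hypothetical assumptions chosen for the example.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power-iteration Personalized PageRank (illustrative sketch).

    graph: dict mapping node -> list of neighbour nodes (hypothetical toy format;
           every neighbour must also appear as a key)
    seeds: dict mapping node -> restart weight, e.g. entities found in a query
    """
    total = sum(seeds.values())
    # Restart distribution: probability mass concentrated on the seed nodes.
    restart = {node: seeds.get(node, 0.0) / total for node in graph}
    scores = dict(restart)
    for _ in range(iterations):
        # With probability (1 - damping) the walk jumps back to a seed node.
        nxt = {node: (1.0 - damping) * restart[node] for node in graph}
        for node, neighbours in graph.items():
            if not neighbours:
                # Dangling node: send its mass back to the restart distribution.
                for target in graph:
                    nxt[target] += damping * scores[node] * restart[target]
                continue
            share = damping * scores[node] / len(neighbours)
            for nb in neighbours:
                nxt[nb] += share
        scores = nxt
    return scores


# Hypothetical toy graph: two query entities linked through a shared "bridge"
# node, plus an unrelated component that the query never touches.
toy_graph = {
    "query_entity_1": ["bridge"],
    "query_entity_2": ["bridge"],
    "bridge": ["query_entity_1", "query_entity_2"],
    "unrelated_1": ["unrelated_2"],
    "unrelated_2": ["unrelated_1"],
}
scores = personalized_pagerank(
    toy_graph, seeds={"query_entity_1": 1.0, "query_entity_2": 1.0}
)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy setup the bridge node receives score from both seeds even though it is not a seed itself, while the disconnected component stays at zero. That is the kind of associative, multi-hop signal the summary attributes to the approach.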

6 · The Batch · Jun 1, 2026

Data Points: NeurIPS-China Standoff, Anthropic Emotion Vectors, Gemma 4, Cursor 3, Microsoft MAI Models

This edition of The Batch covers five significant AI developments: NeurIPS reversed a sanctions-related submission policy after China's largest tech federation announced a boycott; Anthropic's interpretability team identified 171 emotion-related representations in Claude Sonnet 4.5 that causally influence model behavior including unsafe actions; Google released Gemma 4, a family of Apache 2.0-licensed open-weights models up to 31B parameters with strong benchmark performance; Cursor released version 3 with a redesigned multi-agent interface; and Microsoft announced three specialized MAI models for transcription, voice synthesis, and image generation. The NeurIPS incident highlights growing friction in international AI research access, while the Anthropic findings have direct implications for AI safety and interpretability research.

Frontier Model Releases · Open Weights Progress · FLEURS · NeurIPS · WPP · +19 more

7 · arXiv · cs.CL · May 26, 2026

Automated Benchmark Auditing for AI Agents and Large Language Models (ABA)

The paper introduces Auto Benchmark Audit (ABA), an agentic framework that systematically audits AI benchmark tasks for issues such as ambiguous specifications, environment conflicts, and incorrect ground truths. Applied to 168 benchmarks across nine domains including NeurIPS publications, ABA identifies critical issues in over 25.7% of evaluated tasks. The authors demonstrate that filtering out flawed tasks materially shifts model rankings and improves average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6% respectively, indicating that current benchmark scores are significantly distorted by task quality problems. The agentic tool and annotations are released publicly.

Frontier Model Releases · Evaluation and Benchmarking · NeurIPS · Auto Benchmark Audit (ABA) · SWE-Bench Verified · +2 more