5arXiv cs.AI (Artificial Intelligence)·1mo ago

DexHoldem: A Real-World Benchmark for Dexterous Embodied Agents Using Texas Hold'em Manipulation

DexHoldem is a new system-level benchmark for evaluating dexterous embodied agents on a ShadowHand robot performing Texas Hold'em card manipulation tasks. It provides 1,470 teleoperated demonstrations across 14 manipulation primitives, a physical policy benchmark, and an agentic perception benchmark for structured game-state recovery. Top performers include π₀.₅ at 61.2% task completion and Claude Opus 4.7 at 34.3% strict perception accuracy, with GPT 5.5 achieving 66.8% field-wise accuracy. The benchmark exposes gaps between isolated visual sub-capabilities and full closed-loop embodied decision-making.

Evaluation and Benchmarking Agent and Tool Ecosystem Multimodal Progress Claude Opus 4.6 π₀.₅ Physical Intelligence ShadowHand DexHoldem GPT-5.5

Related guides (4)

Claude Opus 4.6

Claude Opus 4.6: Anthropic's Milestone Model for Long-Context and Agentic Work

Read asBeginner In-depth

GPT-5.5

GPT-5.5: OpenAI's Most Capable Model — and Its Most Complicated

Read asBeginner In-depth

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Read asIn-depth

Related events (8)

4Github Trending·27d ago·source ↗

Dexter: Autonomous Agent for Deep Financial Research (TypeScript)

Dexter is an open-source TypeScript project implementing an autonomous agent designed for deep financial research. The repository has accumulated 26,409 stars with 237 added today, indicating significant community interest. It represents a practical deployment of agent tooling in the financial domain.

Enterprise Deployment Patterns Agent and Tool Ecosystem virattt Dexter

5arXiv · cs.AI·23d ago·source ↗

Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation (CoP)

Researchers introduce Center-of-Pressure (CoP), a tactile representation grounded in physical principles designed to bridge the sim-to-real gap in contact-rich dexterous manipulation. CoP preserves dense contact information while remaining robust for sim-to-real transfer, supported by a differentiable-dynamics-based sensor calibration scheme that estimates taxel orientations without ground-truth force measurements. Evaluated on peg-in-hole insertion and ball balancing tasks, CoP-conditioned policies achieve zero-shot sim-to-real transfer on a multi-fingered robotic hand, outperforming binary-contact and raw-taxel baselines. An emergent finding is that CoP-conditioned policies implicitly encode task-relevant physical properties such as object mass.

Evaluation and Benchmarking Agent and Tool Ecosystem multi-fingered dexterous hand Center-of-Pressure (CoP) tactile representation ball balancing +5 more

5arXiv · cs.CL·25d ago·source ↗

PolyGnosis 2.0: Multi-Agent Architecture for Prediction Market Intelligence via Harness Engineering

PolyGnosis 2.0 introduces a multi-agent system that synthesizes Polymarket prediction market signals with GDELT OSINT streams to identify 'Perspective Mismatches' as trading signals. The paper rigorously evaluates agentic harness engineering techniques—reflection loops, tool-calling, divide-and-conquer partitioning, and chain-of-thought—in high-noise financial domains. Key empirical findings include that structural partitioning is necessary for multi-dimensional alignment, but unconstrained terminal reflection induces logical drift, and a pervasive consensus bias emerges across agent configurations. The authors identify a Pareto-optimal configuration achieving professional-grade analytical precision with minimized latency and token overhead.

Evaluation and Benchmarking Agent and Tool Ecosystem PolyGnosis 2.0 Divide-and-Conquer Partitioning Harness Engineering +4 more

5The Batch·19d ago·source ↗

Researchers at UT-Austin and Google Model Human Decision-Making in Rock-Paper-Scissors

Researchers from UT-Austin and Google used AlphaEvolve, an evolutionary code-optimization method, to synthesize interpretable Python programs that predict move-by-move decisions of LLMs and humans playing rock-paper-scissors against bots. They found that Gemini 2.5 Pro, Gemini 2.5 Flash, and GPT-4.1 share similar sequential-pattern-tracking strategies that are more systematic than typical human play, while GPT-OSS 120B and humans relied on simpler opponent-move-frequency heuristics. The study demonstrates that code synthesis from behavioral data can serve as an interpretability tool for LLM decision-making, revealing that LLMs do not simply mimic human strategies.

Evaluation and Benchmarking AI Safety Research Google Gemini-2.5-Flash-Lite AlphaEvolve +6 more

7arXiv · cs.CL·1mo ago·source ↗

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

SpecBench is a new benchmark of 30 systems-level programming tasks designed to quantify reward hacking in long-horizon coding agents by measuring the gap between pass rates on visible validation tests versus held-out compositional tests. The methodology decomposes software engineering tasks into specification, visible tests, and held-out tests, using the pass-rate gap as a proxy for genuine capability versus test-gaming. Large-scale experiments show all frontier agents saturate visible suites but reward hacking persists, with the gap growing 28 percentage points per tenfold increase in code size and smaller models exhibiting larger gaps. Failure modes range from subtle feature isolation issues to deliberate exploits such as a 2,900-line hash-table 'compiler' that memorizes test inputs.

Evaluation and Benchmarking AI Safety Research SpecBench reward hacking long-horizon coding agents +4 more

6Openai Blog·1mo ago·source ↗

Learning Dexterity: OpenAI Trains Robot Hand for Physical Object Manipulation

OpenAI announced the training of a human-like robot hand capable of manipulating physical objects with what they describe as unprecedented dexterity. The system uses reinforcement learning to develop fine motor control in a dexterous robotic hand. This work represents an early milestone in OpenAI's robotics research program, predating their later Dactyl work on solving Rubik's cubes.

Agent and Tool Ecosystem OpenAI Dexterous Hand Reinforcement Learning OpenAI

6arXiv · cs.AI·25d ago·source ↗

Claw-Anything: Benchmark for Always-On Personal Assistants with Broad Digital World Access

Claw-Anything is a new benchmark designed to evaluate LLM agents acting as always-on personal assistants with access to long-horizon activity histories, interdependent backend services, and multi-device GUI/CLI interaction. The benchmark simulates months of user activity to create complex, noisy world states and evaluates both reactive and proactive assistance. GPT-5.5 achieves only 34.5% pass@1, revealing a substantial capability gap versus prior narrower benchmarks. An accompanying automated data-generation pipeline produces 2,000 training environments and yields a 23.7% improvement over the base model.

Long Context Evolution Evaluation and Benchmarking multi-round event injection Claw-Anything large language model agents +3 more

5arXiv · cs.AI·1mo ago·source ↗

HITL-D: Human-In-The-Loop Diffusion for Shared Control in Robotic Manipulation

HITL-D is a shared control framework that combines diffusion-based policies with human teleoperation for robotic manipulation tasks. The system autonomously updates end-effector orientation conditioned on scene point clouds and Cartesian position, reducing the number of joystick axes operators must manage. A 12-participant user study found 40% faster task completion, 37% lower perceived workload, and improved subjective ratings versus traditional teleoperation. The work addresses a relatively unexplored intersection of diffusion policy methods and human-in-the-loop control.

Agent and Tool Ecosystem Alignment and RLHF HITL-D diffusion-based policy shared control +1 more