Entity · benchmark

Claw-Eval

benchmarkactiveclaw-eval-2d3d5793·2 events·first seen Jun 16, 2026

Aliases: Claw-Eval, ClawEval

Co-occurring entities

QwenClawBench Online-Mind2Web veRL Kubernetes OpenClaw Claude Code WebVoyager OSWorld-Verified Codex OpenForgeRL PinchBench LightMem TokenPilot Zhejiang University NLP Group (ZJUNLP)

More like this (12)

Claw-Anything ClawBot ClawBench L-Eval VulnClaw ClawHub SpatialClaw Claw-SWE-Bench VetClaw G-Eval MaskClaw T-Eval

Recent events (2)

7arXiv · cs.CL·Jul 24, 2026·source ↗

OpenForgeRL: Open-source framework for end-to-end RL training of harness-native AI agents

OpenForgeRL is an open-source framework that enables end-to-end reinforcement learning training of AI agents operating within real inference harnesses (e.g., Claude Code, Codex, OpenClaw) and diverse environments. It uses a lightweight proxy to record harness model calls as training data for standard RL codebases like veRL, plus a Kubernetes orchestrator for scalable rollouts in isolated containers. Trained agents (OpenForgeClaw and OpenForgeGUI) achieve competitive results on benchmarks including ClawEval, OSWorld-Verified, Online-Mind2Web, and WebVoyager, matching or surpassing models several times larger in GUI tasks. The work also analyzes how harness choice and RL shape agent behavior, finding meaningful variation in learnability across harnesses.

Training Infrastructure Evaluation and Benchmarking QwenClawBench Online-Mind2Web veRL +11 more

6arXiv · cs.AI·Jun 16, 2026·source ↗

TokenPilot: Dual-granularity context management cuts LLM agent inference costs by up to 87%

TokenPilot is a cache-efficient context management framework for LLM agents that addresses the trade-off between token sparsity and prompt cache continuity. It combines Ingestion-Aware Compaction (global prefix stabilization) with Lifecycle-Aware Eviction (local segment offloading) to reduce inference costs by 56–87% across benchmarks while maintaining competitive task performance. The system is evaluated on PinchBench and Claw-Eval and has been integrated into the open-source LightMem2 library.

Inference Economics Agent and Tool Ecosystem PinchBench Claw-Eval LightMem +2 more