Almanac
← Events
6arXiv cs.CL (Computation and Language)·24d ago

FinHarness: Inline Lifecycle Safety Harness for Finance LLM Agents

FinHarness is a safety harness for finance LLM agents that wraps agent execution end-to-end with three components: a Query Monitor for intent and drift detection, a Tool Monitor for per-call risk evaluation, and a Cascade module that adaptively routes verification between lightweight and advanced LLM judges. Unlike post-hoc auditing, it injects risk signals back into the agent input as ex-ante evidence, enabling real-time refusal or replanning. On the FinVault benchmark, it reduces attack success rate from 38.3% to 15.0% while preserving benign approval rates and using 4.7× fewer expensive judge calls than an always-advanced baseline.

Related guides (4)

Related events (8)

5Hugging Face Blog·1mo ago·source ↗

AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM Systems

ServiceNow AI has released AprielGuard, a guardrail system designed to improve safety and adversarial robustness in LLM deployments. The system targets prompt injection, jailbreaks, and other adversarial inputs that bypass standard safety measures. It is presented as a component for enterprise LLM pipelines seeking more robust content moderation and safety filtering.

6arXiv · cs.CL·1mo ago·source ↗

Code as Agent Harness: A Survey of Code as Operational Substrate for Agentic AI Systems

This survey paper introduces the concept of 'code as agent harness,' framing code not merely as output but as the operational infrastructure for LLM-based agents—covering reasoning, action, environment modeling, and execution-based verification. The authors organize the analysis across three layers: harness interface, harness mechanisms (planning, memory, tool use, feedback control), and scaling to multi-agent systems. Applications span coding assistants, GUI/OS automation, embodied agents, scientific discovery, and enterprise workflows. Open challenges include evaluation beyond task success, verification under incomplete feedback, and human oversight for safety-critical actions.

5arXiv · cs.CL·9d ago·source ↗

Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks

Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to fairly compare heterogeneous agent harnesses ("claws") on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. Key findings show adapter design dominates raw model choice: a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter using the same GLM 5.1 backbone, and harness choice and model choice each shift Pass@1 by roughly 27-29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.

6arXiv · cs.CL·18d ago·source ↗

HarmAmp Benchmark and TrajSafe Monitor for Multi-Turn Harm Amplification in LLMs

This paper introduces HarmAmp, a benchmark covering twelve risk categories designed to evaluate how LLMs compound harm across multi-turn conversations, addressing two threat vectors: democratizing specialized harmful expertise and scaling harmful operations. The authors also propose TrajSafe, a proactive monitoring system that anticipates harmful conversational trajectories and intervenes by probing user intent or steering toward safer outputs. Experiments show TrajSafe reduces multi-turn harmfulness while maintaining low over-refusal rates and preserving general model capabilities. The work highlights a gap in existing safety research that focuses on single-turn evaluations rather than extended interaction dynamics.

3Github Trending·22d ago·source ↗

Awesome Harness Engineering: Curated List for AI Agent Infrastructure

A GitHub repository aggregating resources on AI agent harness engineering, covering tools, patterns, evaluations, memory systems, MCP (Model Context Protocol), permissions, observability, and orchestration. The list has accumulated 1,318 stars with 39 added today, indicating moderate community traction. It serves as a reference index rather than original research or tooling.

6arXiv · cs.CL·3d ago·source ↗

LegalHalluLens: Typed hallucination auditing and calibrated multi-agent debate for legal AI

Researchers introduce LegalHalluLens, an auditing framework for hallucination in legal AI systems, evaluated across 510 contracts and 249,252 clause-level instances from the CUAD dataset. The framework introduces typed hallucination profiles across four claim categories (numeric, temporal, obligation/entitlement, factual) and a Risk Direction Index (RDI) that distinguishes omission from invention errors. A calibrated multi-agent debate pipeline reduces fabricated detections by 45% using a 4B-parameter model competitive with commercial APIs. The work reveals that aggregate hallucination rates (~52%) mask a 38-40 percentage-point gap between claim types and that two systems with identical aggregate rates can have opposite risk profiles.

5arXiv · cs.CL·46h ago·source ↗

LedgerAgent: Structured state ledger improves policy-adherent tool-calling agents

LedgerAgent is an inference-time method that maintains explicit task state in a separate ledger rather than leaving state reconstruction implicit in the prompt, addressing two failure modes: stale/incorrect grounding and policy-violating tool calls. The ledger is used both to render current state into the prompt and to gate environment-changing tool calls against state-dependent policy constraints. Evaluated across four customer-service domains with a mixed panel of open- and closed-weight models, LedgerAgent improves average pass^k over standard prompt-based tool-calling, with the largest gains under stricter multi-trial consistency metrics.

6arXiv · cs.LG·25d ago·source ↗

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

This paper argues that the next major bottleneck in agentic AI is system-level design—what the authors call 'scaling the harness'—rather than continued model scaling alone. The agent harness encompasses memory substrates, context constructors, skill-routing layers, orchestration loops, and verification/governance components that together translate model capability into long-horizon behavior. The authors identify three core bottlenecks (context governance, trustworthy memory, dynamic skill routing) and propose harness-level benchmarks measuring trajectory quality, memory hygiene, and verification cost. They introduce CheetahClaws, a Python-native reference harness, and compare it against Claude Code and OpenClaw.