5 · arXiv cs.AI (Artificial Intelligence) · 1mo ago

Neurosymbolic Learning for Inference-Time Argumentation in Claim Verification

This paper introduces Inference-Time Argumentation (ITA), a trainable neurosymbolic framework for ternary claim verification (true/false/uncertain) that integrates formal argumentation semantics with LLM training. The framework uses argumentation semantics both to guide LLM training for argument generation and scoring, and to compute final predictions deterministically from explicit argumentative structures. Unlike conventional reasoning models that rely on potentially unfaithful post-hoc explanations, ITA produces verdicts that are faithful by construction to the underlying arguments. Experiments on two ternary claim verification datasets show ITA outperforms argumentative baselines and competes with non-argumentative direct-prediction approaches.

Evaluation and Benchmarking · AI Safety Research · Alignment and RLHF · large language models · Inference-Time Argumentation (ITA) · ternary claim verification · neurosymbolic learning · formal argumentation semantics
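The summary above says ITA computes its final verdict deterministically from an explicit argumentative structure. A minimal sketch of that idea follows, assuming a DF-QuAD-style gradual semantics over a tree of supporting and attacking arguments. The summary does not say which semantics ITA actually uses, and the `Argument` class, the base scores, and the 0.4/0.6 verdict thresholds are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Argument:
    """Hypothetical node: a base score in [0, 1] plus child arguments."""
    text: str
    base: float
    supporters: list["Argument"] = field(default_factory=list)
    attackers: list["Argument"] = field(default_factory=list)


def aggregate(strengths: list[float]) -> float:
    # Probabilistic sum of child strengths; an empty list aggregates to 0.
    return 1.0 - prod(1.0 - s for s in strengths)


def strength(arg: Argument) -> float:
    # DF-QuAD-style combination (an assumption, not confirmed by the summary):
    # move the base score toward 1 when support dominates, toward 0 otherwise.
    att = aggregate([strength(a) for a in arg.attackers])
    sup = aggregate([strength(s) for s in arg.supporters])
    if sup >= att:
        return arg.base + (1.0 - arg.base) * (sup - att)
    return arg.base - arg.base * (att - sup)


def verdict(claim: Argument, low: float = 0.4, high: float = 0.6) -> str:
    # Hypothetical thresholds mapping a strength to the ternary label.
    s = strength(claim)
    if s >= high:
        return "true"
    if s <= low:
        return "false"
    return "uncertain"


claim = Argument(
    "claim under verification",
    base=0.5,
    supporters=[Argument("supporting evidence", base=0.8)],
    attackers=[Argument("counter-evidence", base=0.3)],
)
print(verdict(claim))
```

Because the verdict is a pure function of the argument tree and its scores, changing any argument's score changes the output in a traceable way, which is the sense in which such a prediction is faithful to its arguments.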

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 4d ago

Systematic study of extrinsic and intrinsic properties for effective code interpreter reasoning in LLMs

Researchers investigate what behavioral properties make LLMs effective at reasoning with a Code Interpreter (CI), identifying two axes: extrinsic 'crucial tokens' and intrinsic 'cognitive behaviors' such as verification, backtracking, and backward chaining. Stronger CI reasoning models consistently exhibit higher prevalence of these properties. The paper shows that appending code-specific crucial tokens at inference time improves performance on mathematical, ordering, and optimization tasks, while augmenting training with cognitive behaviors improves SFT and RL performance in two of three evaluated models. The work also finds these behaviors reduce overthinking in incorrect responses and improve token efficiency.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter

5 · arXiv · cs.CL · 4d ago

Semi-supervised framework scales LLM reasoning with minimal labeled data via lightweight verifier

A new arXiv preprint proposes a semi-supervised framework for training LLMs to reason with very few labeled examples, using a lightweight classifier to judge the validity of intermediate reasoning traces. An entropy-based confidence threshold filters unreliable pseudo-labels before fine-tuning. Experiments on math reasoning (Orca-Math subset) and visual QA (GQA) show accuracy comparable to using 10-15x more labeled data. The approach reduces dependence on expensive answer-level supervision by turning verification into a data-creation mechanism.

Evaluation and Benchmarking · Alignment and RLHF · GQA · Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier · Orca-Math
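The entropy-based filtering step described in this entry can be sketched as follows. This is an illustration under stated assumptions, not the preprint's implementation: it assumes the lightweight verifier emits a single probability that a reasoning trace is valid, and the function names and the 0.5-bit `max_entropy` threshold are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def filter_pseudo_labels(scored, max_entropy=0.5):
    """Keep (trace, label) pairs whose verifier entropy is at most max_entropy.

    `scored` is an iterable of (trace, p_valid) pairs, where p_valid is the
    hypothetical verifier's probability that the reasoning trace is valid.
    """
    kept = []
    for trace, p_valid in scored:
        if binary_entropy(p_valid) <= max_entropy:
            # Confident prediction: turn it into a hard pseudo-label.
            kept.append((trace, p_valid >= 0.5))
    return kept


scored = [("trace A", 0.95), ("trace B", 0.55), ("trace C", 0.03)]
print(filter_pseudo_labels(scored))
```

Here the near-coin-flip prediction for "trace B" is discarded, while the confident predictions for "trace A" and "trace C" become positive and negative pseudo-labels respectively.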

5 · arXiv · cs.CL · 17d ago

ACTS: Agentic Chain-of-Thought Steering for efficient and controllable LLM reasoning

Researchers introduce Agentic Chain-of-Thought Steering (ACTS), a framework that formulates inference-time reasoning control as a Markov decision process, where a controller agent adaptively steers a frozen reasoner by issuing reasoning strategy directives and steering phrases at each step. The controller is initialized from synthetic steering trajectories with multi-budget augmentation and further optimized via reinforcement learning with budget-conditioned reward shaping. ACTS matches full-thinking performance with significant token savings and enables controllable accuracy-efficiency trade-offs across multiple benchmarks and reasoner models.

Inference Economics · Agent and Tool Ecosystem · ACTS · Agentic Chain-of-Thought Steering · Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

5 · arXiv · cs.CL · 11d ago

REAL: Reasoning-enhanced temporal graph framework for LLM long-term memory management

REAL is a new framework that represents LLM conversational memory as a temporal, confidence-aware directed property graph, where atomic facts carry validity intervals, confidence scores, and exploration intent labels. It addresses three limitations of prior memory systems: flat text structures, destructive overwrites of evolving facts, and passive retrieval. The system uses non-destructive temporal updates, semantic evaluator-guided hybrid beam search, and counterfactual inference to repair incomplete retrieval states. Experiments show a 22.72% average improvement over flat-text, graph-based, and existing memory baselines.

Long Context Evolution · Agent and Tool Ecosystem · REAL

7 · arXiv · cs.CL · 10d ago

Trustworthiness audit finds alignment regressions in reasoning models converted from instruction-tuned LLMs

A systematic study audits whether converting instruction-tuned LLMs into reasoning models via SFT, RL-based post-training, or distillation preserves alignment behaviors such as safe refusal, bias avoidance, and privacy protection. Across six trustworthiness dimensions, the authors find consistent alignment regressions—including increased toxicity, amplified stereotyping, miscalibrated refusal, and privacy leakage—even as reasoning benchmark scores improve. The regressions are quantified via KL divergence from the instruction-tuned baseline, suggesting behavioral drift is a systematic byproduct of reasoning post-training. The paper argues trustworthiness metrics should be reported alongside reasoning capability gains.

Evaluation and Benchmarking · AI Safety Research · Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models
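To make the KL-divergence measurement in this entry concrete, here is a small sketch of the standard discrete formula. The summary does not state which direction of the divergence the authors use or what they compute it over, so the direction KL(reasoning || baseline), the two-outcome refuse/comply distributions, and the numbers below are all assumptions for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            # eps guards against division by zero when q assigns no mass.
            total += pi * math.log(pi / max(qi, eps))
    return total


# Hypothetical behavior distributions over (refuse, comply) on unsafe prompts.
baseline = [0.9, 0.1]   # instruction-tuned model
reasoning = [0.6, 0.4]  # same model after reasoning post-training
print(round(kl_divergence(reasoning, baseline), 4))
```

A value of zero would mean the post-trained model behaves exactly like the baseline on this measure; larger values indicate more behavioral drift.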

6 · arXiv · cs.CL · 19d ago

Question-Answering as Hidden State Probing for Test-Time Reasoning Intervention

This paper proposes using question-asking as an inference-time intervention to surface information about an LLM's hidden state during chain-of-thought reasoning. The authors train a probe on a student model's hidden states before and after question generation, finding it predictive of final answer correctness even before the teacher responds—suggesting self-diagnosis during question generation carries meaningful signal. They frame question-asking as a sequential decision problem with a gating policy, but find a gap between detection and recovery: interventions are as likely to harm correct trajectories as to fix incorrect ones. The results have implications for the limits of LLM self-refinement under uncertainty.

Evaluation and Benchmarking · Agent and Tool Ecosystem · student-teacher prompting · Chain-of-Thought Reasoning · inference-time intervention

6 · arXiv · cs.CL · 1mo ago

Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages

Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

Frontier Model Releases · Evaluation and Benchmarking · large language models · Population-Matching Experiment · Pragmatic Reasoning

6 · arXiv · cs.CL · 3d ago

LegalHalluLens: Typed hallucination auditing and calibrated multi-agent debate for legal AI

Researchers introduce LegalHalluLens, an auditing framework for hallucination in legal AI systems, evaluated across 510 contracts and 249,252 clause-level instances from the CUAD dataset. The framework introduces typed hallucination profiles across four claim categories (numeric, temporal, obligation/entitlement, factual) and a Risk Direction Index (RDI) that distinguishes omission from invention errors. A calibrated multi-agent debate pipeline reduces fabricated detections by 45% using a 4B-parameter model competitive with commercial APIs. The work reveals that aggregate hallucination rates (~52%) mask a 38-40 percentage-point gap between claim types and that two systems with identical aggregate rates can have opposite risk profiles.

Evaluation and Benchmarking · AI Safety Research · LegalHalluLens · CUAD · Risk Direction Index