Long-context Reasoning Benchmarks

benchmark · active · long-context-reasoning-benchmarks-23403fa9 · 1 event · first seen Jun 1, 2026

Aliases: Long-context Reasoning Benchmarks

Co-occurring entities

tiered distractors · Knowledge Graph Random Walk · Multi-hop Question Answering · Reinforcement Learning with Verifiable Rewards · Tiered Distractor Construction · LongTraceRL · Rubric Reward · Tsinghua University KEG Lab

Recent events (1)

arXiv · cs.CL · Jun 1, 2026

LongTraceRL: Reinforcement Learning for Long-Context Reasoning via Search Agent Trajectories and Rubric Rewards

LongTraceRL is a reinforcement learning (RL) training framework for improving long-context reasoning in LLMs, designed to address limitations of existing reinforcement learning with verifiable rewards (RLVR) methods. It builds challenging training data from multi-hop questions generated by knowledge graph random walks, combined with tiered distractors derived from search agent trajectories: high-confusability distractors are documents the agent read but did not cite, and low-confusability distractors are documents it saw but did not open. A rubric reward provides entity-level process supervision along the reasoning chain and is applied only to correct responses, to prevent reward hacking. In experiments with three LLMs (4B–30B parameters) on five long-context benchmarks, the framework shows consistent improvements over strong baselines.
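The two mechanisms in the summary above, distractor tiering and a rubric reward gated on correctness, can be sketched roughly as follows. This is an illustrative reading of the summary, not the paper's implementation: the `SearchResult` record, the `distractor_tier` and `rubric_reward` names, the substring-based entity matching, and the additive `weight` are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SearchResult:
    """Hypothetical record of one document in a search agent trajectory."""
    doc_id: str
    opened: bool  # the agent opened and read the document
    cited: bool   # the agent cited the document in its answer


def distractor_tier(result: SearchResult) -> Optional[str]:
    """Map a trajectory document to a distractor tier.

    read but uncited    -> "high" confusability
    seen but unopened   -> "low" confusability
    cited               -> None (assumed here to be evidence, not a distractor)
    """
    if result.cited:
        return None
    return "high" if result.opened else "low"


def rubric_reward(
    response: str,
    answer_correct: bool,
    rubric_entities: Sequence[str],
    weight: float = 0.5,  # assumed mixing weight, not taken from the source
) -> float:
    """Outcome reward plus an entity-level rubric bonus.

    The bonus is granted only when the final answer is correct, so a
    response cannot collect reward just by naming rubric entities.
    """
    outcome = 1.0 if answer_correct else 0.0
    if not answer_correct or not rubric_entities:
        return outcome
    text = response.lower()
    # Assumed matching rule: case-insensitive substring match per entity.
    hits = sum(1 for entity in rubric_entities if entity.lower() in text)
    return outcome + weight * hits / len(rubric_entities)


if __name__ == "__main__":
    print(distractor_tier(SearchResult("d1", opened=True, cited=False)))
    print(distractor_tier(SearchResult("d2", opened=False, cited=False)))
    print(rubric_reward("Marie Curie was born in Warsaw.", True,
                        ["Marie Curie", "Warsaw", "Poland"]))
```

Gating the bonus on `answer_correct` is the point of the sketch: an incorrect response that mentions every rubric entity still receives only the outcome reward of 0.0.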

Long Context Evolution · Evaluation and Benchmarking · tiered distractors · Knowledge Graph Random Walk · Long-context Reasoning Benchmarks