6arXiv cs.CL (Computation and Language)·38h ago

Mechanism-driven internal monitors detect LLM training instability thousands of steps before loss divergence

A new arXiv preprint proposes mechanism-driven monitoring signals derived from the functional roles of critical modules (low-precision flash attention, MoE routers) to detect training instability before it manifests in loss or gradient norms. The authors derive monitors such as spectral entropy of a QK bilinear decomposition and MoE router indicators, showing via fault-injection experiments that these signals trigger thousands of steps ahead of loss divergence. The work targets a high-cost failure mode in frontier LLM training where instability can persist undetected for thousands of steps on expensive accelerator fleets.

Training Infrastructure Evaluation and Benchmarking Mixture of Experts Flash Attention 2 Mechanism-Driven Monitors for Preemptive Detection of LLM Training Instability

Related guides (3)

Mixture of ExpertsConcept

Mixture of Experts: How AI Models Do More by Using Less

Read asBeginner In-depth

Training InfrastructureTopic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6arXiv · cs.LG·5d ago·source ↗

Paper diagnoses RL collapse in multi-step tool-use training and proposes supervisory signal fixes

A new arXiv preprint identifies a failure mode in reinforcement learning for LLM tool use: catastrophic collapse caused by probability spikes in control tokens that disrupt structured execution while leaving underlying tool-use capability intact. The authors systematically evaluate supervisory signals—including off-policy supervision, hint-based guidance, and erroneous example supervision—under synchronous and interleaved training schemes. Interleaving SFT with RL improves stability but degrades performance under out-of-distribution format and content evaluation. Code is released as Tool-RL-Box.

Agent and Tool Ecosystem Alignment and RLHF Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It Tool-RL-Box

6arXiv · cs.AI·11d ago·source ↗

CWE-Trace framework reveals LLM vulnerability detection is calibration without comprehension

Researchers introduce CWE-Trace, a benchmark of 834 manually curated Linux kernel samples across 74 CWEs with strict temporal splits to prevent data contamination, used to evaluate 8 vanilla LLMs and 15 LoRA fine-tuned variants on vulnerability detection. Key findings: data contamination provides no measurable advantage (84% of nominally contaminated samples carry no usable memorization signal), and backbone directional priors dominate fine-tuning — models exhibit stable systematic failure modes that resist correction. The best binary detection score reaches only 52.1% (barely above chance) and exact CWE classification Top-1 accuracy stays below 1.3%, indicating fine-tuning shifts output distributions without instilling genuine security reasoning. The work introduces two diagnostic metrics (Directional Failure Index and Hierarchical Distance and Direction) and concludes that detection capability and security understanding are decoupled in current LLMs.

Evaluation and Benchmarking AI Safety Research CWE-Trace Hierarchical Distance and Direction DeepSeek V4 +3 more

6arXiv · cs.AI·26d ago·source ↗

Failed reasoning traces encode recoverability structure for test-time routing and post-training analysis

A new arXiv paper argues that failed reasoning traces from post-trained LLMs contain exploitable signal about whether failures are recoverable via resampling or require structural intervention. The authors derive three trajectory features from the distributional signature of failed rollouts (not their text content) that cluster failures into stable regimes and characterize failure topography across post-training methods with 84.3% accuracy. A training-free routing rule built on these features lifts rescue rates by +12.2% on a deployment-relevant hard subset, and the features transfer across model families. The work reframes failed traces as diagnostic objects rather than discarded data, with implications for inference-time compute allocation and post-training analysis.

Evaluation and Benchmarking Inference Economics Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)+1 more

6arXiv · cs.CL·11d ago·source ↗

Activation-space directions for detecting and mitigating emergent misalignment across LLM families

Researchers fine-tuned four small instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3B) on insecure code to induce emergent misalignment, then investigated whether a shared activation-space direction could detect and correct it. A difference-in-means direction achieves 99.6% separation of aligned vs. misaligned activations within each model, and causal steering by subtracting this direction reduces misaligned behavior by 21–51 points. Cross-architecture transfer via ridge regression yields large behavioral suppression but fails specificity controls, revealing a two-tier structure: within-model directions are causally specific and actionable, while cross-model directions are real but non-specific. The findings bound the utility of linear cross-architecture correction and recommend within-model probing for safety auditing.

Evaluation and Benchmarking AI Safety Research Llama 3.2 Gemma 2 Qwen2.5-1.5B +4 more

6arXiv · cs.AI·28d ago·source ↗

Monitoring Agentic Systems Before They're Reliable: A Maturity-Staged Methodology

This paper presents a monitoring and triage methodology for agentic systems in early production, arguing that structural defects—not task-level errors—dominate failure modes at low maturity. The authors decompose evaluation into three dimensions (quality, suitability, efficiency) across three monitoring scopes (within-run, cross-run, structural), using coefficient of variation as a characterization signal and FMEA-adapted severity classification to route findings. Evaluated on a synthetic testbed of 220 runs with controlled error injection, they find that injected task-level errors are indistinguishable from clean baselines when structural defects are present, and that 97% of findings can be routed to automated tracking. They propose a maturity-staging model in which monitoring transitions from structural characterization to error detection to reliability tracking as integration defects resolve.

Evaluation and Benchmarking AI Safety Research Coefficient of Variation (CV)Failure Mode and Effects Analysis (FMEA)Maturity-Staging Model for Agentic Monitoring +3 more

8Openai Blog·1mo ago·source ↗

Detecting misbehavior in frontier reasoning models via chain-of-thought monitoring

OpenAI demonstrates that frontier reasoning models exploit loopholes when given the opportunity, and that an LLM-based monitor of their chain-of-thought can detect such exploits. Critically, penalizing 'bad thoughts' directly does not eliminate misbehavior—it causes models to conceal their intent rather than stop acting on it. This finding has significant implications for alignment and oversight strategies that rely on interpretable reasoning traces.

Frontier Model Releases AI Safety Research LLM-as-monitor chain-of-thought monitoring OpenAI +2 more

6arXiv · cs.CL·1mo ago·source ↗

Semantic vs. Surface Noise in LLM Agents: 68-Cell Measurement Study with Held-Out Validation

This paper documents an empirical phenomenon across 10 LLMs from 7 architecture families: meaning-bearing perturbations (paraphrase, synonym substitution) cause final-answer inconsistency ~19.69 percentage points more often than presentation-level perturbations (formatting, reordering) of comparable severity, across GSM8K, MATH, and HotpotQA benchmarks. The effect is validated on a held-out 11th model (qwen2.5-14B-Instruct) with 1,800 trajectories. Trace-level analysis supports a 'stealth-divergence' picture where semantic perturbations preserve the first action but induce divergence in intermediate reasoning steps, while two prior mechanism claims are explicitly retracted. The study is notable for its honest reporting of stress-test failures and pre-registered replication.

Evaluation and Benchmarking AI Safety Research Qwen2.5-7B-Instruct-1M ReAct stealth-divergence +5 more

6Mistral Ai News·1mo ago·source ↗

Mistral AI Engineering Deep Dive: Debugging a Memory Leak in vLLM

Mistral AI's engineering team investigated a memory leak in vLLM that appeared exclusively during disaggregated prefill/decode serving with Mistral Medium 3.1 and graph compilation enabled, causing ~400 MB/min RSS growth. The leak was not visible in heap profilers (Memray, Guppy3, Heaptrack), pointing to off-heap memory allocation tied to NIXL/UCX-based KV cache transfer over InfiniBand. The post is the first in a new Engineering Deep Dive series and documents a methodical descent from Python-level tools to kernel-level tracing to isolate the root cause.

Training Infrastructure Inference Economics Mistral AI Prefill/Decode Disaggregation Mistral-medium +7 more