3arXiv cs.LG (Machine Learning)·12d ago

LLM-augmented XAI framework with mutual feature interactions for network operations

A new arXiv paper proposes a framework combining LLMs with SHAP-based explainability, augmented by mutual feature interaction data, to generate natural language explanations for AI/ML models used in network operations. The approach is validated on an optical quality-of-transmission estimation task with human evaluators, showing 12.2% and 6.2% improvements in explanation usefulness and scope over a SHAP-only baseline, with 97.5% correctness. The work targets the gap between technical XAI outputs and actionable insights for non-specialist network operators.

Evaluation and Benchmarking Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions SHapley Additive exPlanations Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions

Related guides (1)

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6Berkeley Ai Research (Bair) Blog·1mo ago·source ↗

SPEX and ProxySPEX: Scalable Interaction Discovery for LLM Interpretability

Researchers from BAIR introduce SPEX (Spectral Explainer) and ProxySPEX, algorithms for identifying influential feature, data, and model-component interactions in LLMs at scale. The approach exploits sparsity, low-degreeness, and hierarchy properties to reframe interaction discovery as a sparse recovery problem using tools from signal processing and coding theory. ProxySPEX achieves comparable performance to SPEX with roughly 10x fewer ablations by leveraging hierarchical structure. The methods are evaluated on feature attribution (sentiment analysis), data attribution, and mechanistic interpretability tasks, outperforming marginal methods like LIME at long context lengths.

Long Context Evolution Evaluation and Benchmarking GPT-4o mini Faith-Shap LIME +5 more

4arXiv · cs.AI·6d ago·source ↗

LEAF-X: Entropy-guided explainability framework for transformer-based ASR models

Researchers introduce LEAF-X (Listening with Entropy-guided Attention for Faithful explainability), a model-intrinsic XAI framework for transformer-based automatic speech recognition systems like Whisper. The method combines entropy-guided attention weighting, multi-layer attention rollout, and optional causal ablations to produce sparse token-to-frame attributions. Evaluations show 32% improved faithfulness and 35-39% stronger locality/sparsity compared to perturbation-based explainers and raw attention maps, enabling more auditable ASR.

AI Safety Research Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models LEAF-X Whisper

5arXiv · cs.AI·27d ago·source ↗

Human Decision-Making with Persuasive and Narrative LLM Explanations

A large-scale behavioral experiment evaluated how LLM-generated narrative explanations of varying persuasiveness affect human decision-making accuracy in classification tasks. Results showed that persuasiveness level did not meaningfully improve decision accuracy over a simple AI prediction alone, consistent with prior explainable AI research using feature importance methods. Narratives increased AI reliance regardless of whether the AI prediction was correct or incorrect, and more persuasive narratives may have slowed response times and reduced ability to discriminate correct from incorrect AI predictions. The study concludes that narrative explanations involve tradeoffs and warrant further investigation into when and how they should be deployed.

Evaluation and Benchmarking AI Safety Research Narrative Explanations large language models Explainable AI (XAI)+2 more

4arXiv · cs.CL·2d ago·source ↗

Survey proposes four-layer architecture for token-operations-oriented LLM inference optimization

A new arXiv preprint introduces a four-layer technical architecture—Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion—for systematically organizing LLM inference optimization techniques. The paper reviews key technologies and industry status at each layer and analyzes their application in real-world business scenarios. The framing around 'token operations' positions inference optimization as an operational discipline analogous to traditional IT operations.

Training Infrastructure Inference Economics Token-Operations-Oriented Inference Optimization Techniques for Large Models

5Hugging Face Blog·1mo ago·source ↗

Open-source LLMs as LangChain Agents

This Hugging Face blog post explores using open-source LLMs as agents within the LangChain framework. It examines the capability of various open-weight models to perform tool use, reasoning, and multi-step task execution in agentic settings. The post likely benchmarks or compares several models on agent-relevant tasks, providing practical guidance for deploying open-source alternatives to proprietary models in agent pipelines.

Open Weights Progress Agent and Tool Ecosystem open-source LLMs LangChain Hugging Face

4arXiv · cs.CL·10d ago·source ↗

Zero-shot LLMs fail to beat baselines on stock prediction; explainability signals retain practical value

A new arXiv preprint evaluates zero-shot NLP pipelines for predicting short-term stock movements from financial news, finding that across multiple models and prediction horizons, zero-shot approaches consistently fail to outperform simple baselines, with especially weak performance on negative price movements. The authors introduce a multi-layered explainability framework linking predictions to token-, article-, and aggregate-level evidence, finding that explainability signals can reliably distinguish trustworthy from unreliable predictions even when accuracy is low. The work argues for a shift toward decision-support systems emphasizing transparency and uncertainty awareness rather than raw predictive accuracy.

Evaluation and Benchmarking Can News Predict the Market? Limits of Zero-Shot Financial NLP and the Role of Explainable AI

5arXiv · cs.CL·3d ago·source ↗

ClaMPAPP: Hybrid LLM-ML system uses language models as interfaces for pediatric appendicitis diagnosis

Researchers introduce ClaMPAPP, a hybrid clinical decision support system that uses an LLM solely for structured feature extraction from free-text clinical notes, then passes validated features to an XGBoost classifier for final diagnosis. Evaluated on two independent German pediatric appendicitis cohorts, ClaMPAPP outperformed end-to-end LLM baselines on diagnostic performance and showed greater robustness to narrative reordering. The work formalizes an 'LLM-as-interface, ML-as-predictor' design pattern that separates natural-language usability from predictive inference, offering a more auditable pathway for clinical AI.

Enterprise Deployment Patterns Agent and Tool Ecosystem XGBoost ClaMPAPP

4arXiv · cs.CL·2d ago·source ↗

Mechanistic analysis of how LLMs encode essay quality in internal representations

Researchers systematically probe the hidden representations of eight LLMs across three essay datasets (ASAP++, CSEE, ENEM) to understand how automated essay scoring (AES) works internally. Using linear probing, dimensionality reduction, and neuron-level analysis, they find essay quality is encoded in a linearly accessible form that emerges progressively across layers and partially transfers across prompts. Individual 'essay scoring neurons' are identified whose activations correlate with scores and respond to targeted interventions, with longer essays relying more on deeper layers. The work contributes to mechanistic interpretability of LLM-based scoring systems.

Evaluation and Benchmarking From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models CSEE ENEM +1 more