4 · arXiv cs.LG (Machine Learning) · 37h ago

Empirical comparison finds quantum ML models do not yet surpass classical baselines

A new arXiv preprint systematically compares seven pairs of quantum machine learning (QML) models and their classical counterparts across supervised learning and reinforcement learning tasks. The QML models do not yet surpass the classical baselines in prediction performance, policy stability, or training time, though the authors note some promise for noise filtering and false positive control. The study identifies open challenges in hardware environments, training efficiency, and convergence stability, and releases its code publicly.

Evaluation and Benchmarking · Quantum vs. Classical Machine Learning: A Unified Empirical Comparison

Related guides (1)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Related events (8)

6 · arXiv · cs.LG · 2d ago

QVal: Training-free benchmark for evaluating dense supervision signals in long-horizon LLM agents

QVal is a new training-free testbed for evaluating dense supervision signals used to guide LLM agents over long-horizon trajectories, where outcome-only rewards are too sparse. The framework measures 'Q-alignment' — whether a method's step scores match Q-values from a strong reference policy — enabling comparison of 21 methods across 4 environments and 7 methodological families without running full training pipelines. A key finding is that simple prompting baselines consistently outperform more sophisticated dense supervision methods from recent literature. The benchmark covers over 1,200 evaluation experiments across six open-weight model backbones.

Evaluation and Benchmarking · Agent and Tool Ecosystem · QVal · QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents
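
As an illustration of the Q-alignment idea in the QVal summary above, the sketch below scores how well a method's per-step scores agree with Q-values from a reference policy. Using Spearman rank correlation as the agreement statistic is an assumption made here for illustration, as are the function names and the sample numbers; the summary does not say which statistic QVal uses.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(step_scores: Sequence[float], reference_q: Sequence[float]) -> float:
    """Spearman rank correlation (hypothetical stand-in for Q-alignment)."""
    if len(step_scores) != len(reference_q) or len(step_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    ra, rb = average_ranks(step_scores), average_ranks(reference_q)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant signal carries no ranking information
    return cov / (var_a * var_b) ** 0.5


# Made-up scores for one four-step trajectory.
method_scores = [0.2, 0.5, 0.4, 0.9]
reference_q_values = [0.1, 0.6, 0.3, 0.8]
print(spearman(method_scores, reference_q_values))
```

A rank-based statistic is used in this sketch because step scores from different methods need not share a scale with the reference Q-values; only their ordering is compared.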

5 · arXiv · cs.AI · 25d ago

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

Evaluation and Benchmarking · AI Safety Research · How reliable are LLMs when it comes to playing dice?

6 · arXiv · cs.CL · 4d ago

LLMs judge worse than they generate: empirical challenge to self-evaluation pipeline assumptions

A new arXiv preprint tests the implicit assumption that LLM evaluation is easier than generation, using a controlled in-context QA setup across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models. Results show generation accuracy exceeds self-evaluation accuracy on three of four benchmarks, with attention analysis revealing that evaluation attends to context 3–5x less than generation does. LoRA fine-tuning experiments confirm the asymmetry is not a training artifact, with cross-task interference observed in both directions. The findings directly challenge assumptions underlying LLM-as-a-Judge and self-evaluation pipelines widely used in RLHF and agentic systems.

Evaluation and Benchmarking · Alignment and RLHF · MuSiQue · Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA · LoRA

4 · arXiv · cs.CL · 22d ago

Zero-shot LLMs fail to beat baselines on stock prediction; explainability signals retain practical value

A new arXiv preprint evaluates zero-shot NLP pipelines for predicting short-term stock movements from financial news, finding that across multiple models and prediction horizons, zero-shot approaches consistently fail to outperform simple baselines, with especially weak performance on negative price movements. The authors introduce a multi-layered explainability framework linking predictions to token-, article-, and aggregate-level evidence, finding that explainability signals can reliably distinguish trustworthy from unreliable predictions even when accuracy is low. The work argues for a shift toward decision-support systems emphasizing transparency and uncertainty awareness rather than raw predictive accuracy.

Evaluation and Benchmarking · Can News Predict the Market? Limits of Zero-Shot Financial NLP and the Role of Explainable AI

6 · MIT Technology Review · AI · 13d ago

Startup Subquadratic claims to have solved a core mathematical bottleneck in LLMs

Miami-based AI startup Subquadratic emerged from stealth claiming to have solved a long-standing mathematical bottleneck limiting large language models. Initial skepticism was high due to thin details, but the company has begun sharing supporting evidence. If substantiated, the claim would represent a significant architectural advance in how LLMs scale.

Training Infrastructure · Frontier Model Releases · Subquadratic

5 · arXiv · cs.CL · 10d ago

Sub-billion parameter SLMs outperform zero-shot GPT-5.4 and Claude Sonnet 4.6 on relation extraction benchmarks

A new arXiv paper demonstrates that small language models (360M–3B parameters) fine-tuned on task-specific data can substantially outperform zero-shot frontier LLMs on relation extraction tasks. The best sub-billion model, Qwen2.5-0.5B fine-tuned on pooled general-domain data, achieves micro-F1 of 0.83 versus 0.69 for GPT-5.4 and 0.66 for Claude Sonnet 4.6 in zero-shot settings. The authors attribute the gains to task adaptation rather than model architecture, with a discriminative RoBERTa baseline also exceeding frontier models, and show that 4-bit quantized models deployable on consumer GPUs can match or beat proprietary API-based systems for this narrow task. The work provides evidence that for well-defined NLP tasks with available training data, compact adapted models offer a practical, private, and hardware-efficient alternative to frontier APIs.

Evaluation and Benchmarking · Open Weights Progress · RoBERTa · Claude Sonnet 4 · Biographical
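
For readers unfamiliar with the micro-F1 metric quoted in the summary above, the sketch below shows one common way to compute it for relation extraction: true positives, false positives, and false negatives are pooled over all documents before precision and recall are taken. The triple format, the function name, and the toy data are assumptions for illustration; the paper's exact matching rules are not given in the summary.

```python
from typing import Hashable, Sequence, Set


def micro_f1(gold: Sequence[Set[Hashable]], pred: Sequence[Set[Hashable]]) -> float:
    """Micro-averaged F1 over per-document sets of extracted relations."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same documents")
    tp = fp = fn = 0
    for gold_set, pred_set in zip(gold, pred):
        tp += len(gold_set & pred_set)   # relations predicted and correct
        fp += len(pred_set - gold_set)   # relations predicted but wrong
        fn += len(gold_set - pred_set)   # gold relations that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: (head entity, relation, tail entity) triples for two documents.
gold = [
    {("a", "born_in", "b")},
    {("c", "works_for", "d"), ("c", "lives_in", "e")},
]
pred = [
    {("a", "born_in", "b")},
    {("c", "works_for", "d"), ("c", "born_in", "e")},
]
print(micro_f1(gold, pred))
```

Because counts are pooled before averaging, documents with many relations weigh more than documents with few, which is what distinguishes micro-F1 from macro-F1.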

6 · The Batch · 1mo ago

Data Points: Nvidia Ising Models for Quantum Computing, Meta Muse Spark, GitHub Rubber Duck, Anthropic Claude Managed Agents, GPT-5.4-Cyber

Nvidia released Ising, a family of open AI models targeting quantum processor calibration and error correction, achieving 2.5x faster and 3x more accurate decoding than PyMatching, with adoption by Fermilab, Harvard, and others. Meta announced Muse Spark, a small multimodal model powering a new AI assistant series for its apps and glasses. GitHub introduced Rubber Duck, a cross-model review feature pairing Claude with GPT-5.4 for two-pass coding agent validation. Anthropic launched Claude Managed Agents, a managed infrastructure platform for enterprise autonomous AI deployment, while OpenAI expanded its Trusted Access for Cyber program with GPT-5.4-Cyber, a fine-tuned defensive cybersecurity model.

Frontier Model Releases · Inference Economics · Rubber Duck · Notion · GPT-5.5-Cyber

6 · arXiv · cs.LG · 2d ago

Surrogate Fidelity: Open LLMs often cannot reliably explain closed model behavior

A new arXiv paper from Facebook Research evaluates whether mechanistic interpretability findings from open-weight models transfer to closed API-only models across prediction, attribution, and representation levels. Studying eleven models across four families (Llama, Qwen, GPT, Gemini), the authors find that prediction-level agreement substantially overstates attribution fidelity — models that agree on answers often disagree on why. They document an 'access-validity inversion' where white-box signals like attention patterns are stable across models but weakly predictive of causal attributions, undermining the common practice of using open surrogates to explain closed systems.

Evaluation and Benchmarking · AI Safety Research · Qwen · Surrogate Fidelity: When Can Open LLMs Explain Closed Ones? · Llama