Almanac
4arXiv cs.LG (Machine Learning)·37h ago

Empirical comparison finds quantum ML models do not yet surpass classical baselines

A new arXiv preprint presents a systematic empirical comparison of seven pairs of quantum machine learning (QML) models and their classical counterparts across supervised learning and reinforcement learning tasks. The results show that QML models do not yet surpass classical baselines in prediction performance, policy stability, or training time, though the authors note some promise in noise filtering and false-positive control. The study identifies open challenges in hardware environments, training efficiency, and convergence stability, and the code is publicly released.

Related events (8)

6arXiv · cs.LG·2d ago·source ↗

QVal: Training-free benchmark for evaluating dense supervision signals in long-horizon LLM agents

QVal is a new training-free testbed for evaluating dense supervision signals used to guide LLM agents over long-horizon trajectories, where outcome-only rewards are too sparse. The framework measures 'Q-alignment' — whether a method's step scores match Q-values from a strong reference policy — enabling comparison of 21 methods across 4 environments and 7 methodological families without running full training pipelines. A key finding is that simple prompting baselines consistently outperform more sophisticated dense supervision methods from recent literature. The benchmark covers over 1,200 evaluation experiments across six open-weight model backbones.

5arXiv · cs.AI·25d ago·source ↗

Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance

A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias that causes performance drops of more than 20% when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The authors argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.

6arXiv · cs.CL·4d ago·source ↗

LLMs judge worse than they generate: empirical challenge to self-evaluation pipeline assumptions

A new arXiv preprint tests the implicit assumption that LLM evaluation is easier than generation, using a controlled in-context QA setup across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models. Results show generation accuracy exceeds self-evaluation accuracy on three of four benchmarks, with attention analysis revealing that evaluation attends to context 3–5x less than generation does. LoRA fine-tuning experiments confirm the asymmetry is not a training artifact, with cross-task interference observed in both directions. The findings directly challenge assumptions underlying LLM-as-a-Judge and self-evaluation pipelines widely used in RLHF and agentic systems.

4arXiv · cs.CL·22d ago·source ↗

Zero-shot LLMs fail to beat baselines on stock prediction; explainability signals retain practical value

A new arXiv preprint evaluates zero-shot NLP pipelines for predicting short-term stock movements from financial news. Across multiple models and prediction horizons, zero-shot approaches consistently fail to outperform simple baselines, with especially weak performance on negative price movements. The authors introduce a multi-layered explainability framework that links predictions to token-, article-, and aggregate-level evidence, and find that explainability signals can reliably distinguish trustworthy from unreliable predictions even when accuracy is low. The work argues for a shift toward decision-support systems that emphasize transparency and uncertainty awareness rather than raw predictive accuracy.

6MIT Technology Review - AI·13d ago·source ↗

Startup Subquadratic claims to have solved a core mathematical bottleneck in LLMs

Miami-based AI startup Subquadratic emerged from stealth claiming to have solved a long-standing mathematical bottleneck limiting large language models. Initial skepticism was high because the company offered few details, but it has since begun sharing supporting evidence. If substantiated, the claim would represent a significant architectural advance in how LLMs scale.

5arXiv · cs.CL·10d ago·source ↗

Sub-billion parameter SLMs outperform zero-shot GPT-5.4 and Claude Sonnet 4.6 on relation extraction benchmarks

A new arXiv paper demonstrates that small language models (360M–3B parameters) fine-tuned on task-specific data can substantially outperform zero-shot frontier LLMs on relation extraction tasks. The best sub-billion model, Qwen2.5-0.5B fine-tuned on pooled general-domain data, achieves micro-F1 of 0.83 versus 0.69 for GPT-5.4 and 0.66 for Claude Sonnet 4.6 in zero-shot settings. The authors attribute the gains to task adaptation rather than model architecture, with a discriminative RoBERTa baseline also exceeding frontier models, and show that 4-bit quantized models deployable on consumer GPUs can match or beat proprietary API-based systems for this narrow task. The work provides evidence that for well-defined NLP tasks with available training data, compact adapted models offer a practical, private, and hardware-efficient alternative to frontier APIs.

6The Batch·1mo ago·source ↗

Data Points: Nvidia Ising Models for Quantum Computing, Meta Muse Spark, GitHub Rubber Duck, Anthropic Claude Managed Agents, GPT-5.4-Cyber

Nvidia released Ising, a family of open AI models targeting quantum processor calibration and error correction, achieving 2.5x faster and 3x more accurate decoding than PyMatching, with adoption by Fermilab, Harvard, and others. Meta announced Muse Spark, a small multimodal model powering a new AI assistant series for its apps and glasses. GitHub introduced Rubber Duck, a cross-model review feature pairing Claude with GPT-5.4 for two-pass coding agent validation. Anthropic launched Claude Managed Agents, a managed infrastructure platform for enterprise autonomous AI deployment, while OpenAI expanded its Trusted Access for Cyber program with GPT-5.4-Cyber, a fine-tuned defensive cybersecurity model.

6arXiv · cs.LG·2d ago·source ↗

Surrogate Fidelity: Open LLMs often cannot reliably explain closed model behavior

A new arXiv paper from Facebook Research evaluates whether mechanistic interpretability findings from open-weight models transfer to closed API-only models across prediction, attribution, and representation levels. Studying eleven models across four families (Llama, Qwen, GPT, Gemini), the authors find that prediction-level agreement substantially overstates attribution fidelity — models that agree on answers often disagree on why. They document an 'access-validity inversion' where white-box signals like attention patterns are stable across models but weakly predictive of causal attributions, undermining the common practice of using open surrogates to explain closed systems.