4 · arXiv · cs.LG (Machine Learning) · 17d ago

MLSkip: Data skipping for ML filter predicates using Parquet metadata and neural network verification

MLSkip introduces data skipping techniques for ML-based filter predicates in databases, a problem not addressed by traditional min-max pruning methods. The approach leverages Parquet's existing min-max metadata combined with neural network verification techniques to prune non-qualifying row groups. On TPC-H and TPC-DS benchmarks with ReLU architectures, the method achieves 27.4% average pruning effectiveness for low-selectivity filters, improving to 38.31% with a proposed 2D convex hull metadata structure, yielding a 1.07× end-to-end speedup in DuckDB over PyTorch.
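
As a rough illustration of how min-max metadata and neural network verification can combine, the sketch below uses interval bound propagation, one simple verification technique. The summary does not say which verifier MLSkip uses, so this is not the paper's method. The function names, the toy weights, and the predicate form `model(x) > threshold` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases), hypothetical layout


def affine_bounds(lo: Sequence[float], hi: Sequence[float], layer: Layer):
    """Propagate an input box [lo, hi] through y = W x + b."""
    weights, biases = layer
    out_lo, out_hi = [], []
    for row, bias in zip(weights, biases):
        low = high = bias
        for w, x_lo, x_hi in zip(row, lo, hi):
            # A positive weight maps the lower input to the lower output;
            # a negative weight swaps the two ends of the interval.
            if w >= 0:
                low += w * x_lo
                high += w * x_hi
            else:
                low += w * x_hi
                high += w * x_lo
        out_lo.append(low)
        out_hi.append(high)
    return out_lo, out_hi


def can_skip_row_group(col_min, col_max, layers: List[Layer], threshold: float) -> bool:
    """Return True if no row inside the min-max box can satisfy model(x) > threshold.

    Assumes a feed-forward network with ReLU after every layer except the last,
    and a single scalar output.
    """
    lo, hi = list(col_min), list(col_max)
    for index, layer in enumerate(layers):
        lo, hi = affine_bounds(lo, hi, layer)
        if index < len(layers) - 1:
            lo = [max(0.0, v) for v in lo]  # ReLU is monotone, so bounds map directly
            hi = [max(0.0, v) for v in hi]
    # The upper bound is sound but possibly loose: skipping is safe, keeping may be wasteful.
    return hi[0] <= threshold


# Toy two-layer model over two columns (made-up weights).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 1.0]], [0.0]),
]
skip_a = can_skip_row_group([0.0, 0.0], [1.0, 1.0], layers, threshold=2.0)
skip_b = can_skip_row_group([4.0, 0.0], [6.0, 1.0], layers, threshold=2.0)
```

Because the bound over-approximates the model's output range, a row group is only skipped when the predicate provably cannot hold for any row in it. Looser bounds reduce pruning, never correctness.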

Inference Economics · TPC-DS · DuckDB · TPC-H · MLSkip: Data Skipping for ML Filters via Lightweight Metadata

Related guides (1)

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

5 · arXiv · cs.CL · 12d ago

EmbedFilter: Using the unembedding matrix to suppress high-frequency token noise in LLM text embeddings

Researchers find that LLM text embeddings over-express high-frequency but semantically uninformative tokens when projected onto vocabulary space, which degrades embedding quality. They introduce EmbedFilter, a simple linear transformation that filters out the subspace of the unembedding matrix responsible for writing these tokens into embedding space. The method improves zero-shot performance on text embedding benchmarks across multiple LLM backbones and, as a byproduct, reduces embedding dimensionality without quality loss. Code is publicly released.

Evaluation and Benchmarking · Inference Economics · Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings · EmbedFilter

5 · arXiv · cs.AI · 9d ago

Reroute: Training-free recoverable visual token routing for vision-language models

A new arXiv preprint proposes Reroute, a training-free plug-in that replaces the standard rank-and-remove visual token pruning paradigm in VLMs with a recoverable routing mechanism. Instead of permanently discarding low-ranked tokens, Reroute defers them to re-enter the candidate pool at later decoder stages, addressing the problem that token importance shifts across decoder depth. Evaluated on LLaVA-1.5 and Qwen backbones augmented with FastV, PDrop, and Nüwa pruning methods, Reroute improves grounding performance under aggressive token reduction without sacrificing general VQA accuracy. The approach preserves the theoretical compute and KV-cache budget of the underlying pruning method.

Inference Economics · Multimodal Progress · FastV · PDrop · Qwen

6 · arXiv · cs.LG · 22d ago

HullFT: Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

HullFT is a new method for test-time finetuning (TTFT) of language models that addresses the dual bottlenecks of retrieval quality and per-query finetuning cost. It represents query embeddings as sparse convex combinations of training sequences using Frank-Wolfe optimization, yielding diverse and relevant support sets without expensive diversity-aware search. A geometric integerization step converts fractional weights into integer multiplicities, enabling a Gradient Reuse scheme that amortizes forward-backward computation across repeated examples. Experiments show improved quality-efficiency tradeoffs over prior TTFT methods, measured in bits-per-byte at lower total runtime.
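
To make the "sparse convex combination via Frank-Wolfe" idea concrete, here is a minimal sketch that approximates a query vector as a convex combination of candidate vectors. It is a generic Frank-Wolfe loop with exact line search on a squared-error objective, not HullFT's actual procedure. The function name, the objective, and the toy vectors are assumptions for illustration.

```python
from typing import List, Sequence


def frank_wolfe_simplex(points: Sequence[Sequence[float]],
                        query: Sequence[float],
                        steps: int = 50) -> List[float]:
    """Minimize ||sum_i w_i * points[i] - query||^2 over the probability simplex."""
    n = len(points)
    weights = [0.0] * n
    weights[0] = 1.0                 # start at a vertex of the simplex
    recon = list(points[0])          # current reconstruction sum_i w_i * points[i]
    for _ in range(steps):
        residual = [r - q for r, q in zip(recon, query)]
        # Gradient with respect to w_i is 2 * <residual, points[i]>.
        grads = [2.0 * sum(r * x for r, x in zip(residual, p)) for p in points]
        best = min(range(n), key=grads.__getitem__)   # linear minimization oracle
        direction = [x - r for x, r in zip(points[best], recon)]
        denom = sum(d * d for d in direction)
        if denom == 0.0:
            break                    # reconstruction already equals the chosen vertex
        # Exact line search for a quadratic objective, clipped to [0, 1].
        gamma = -sum(r * d for r, d in zip(residual, direction)) / denom
        gamma = max(0.0, min(1.0, gamma))
        if gamma == 0.0:
            break                    # no descent along the best vertex: optimal
        weights = [(1.0 - gamma) * w for w in weights]
        weights[best] += gamma
        recon = [r + gamma * d for r, d in zip(recon, direction)]
    return weights


# Toy example: three candidate embeddings and one query (made-up values).
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
support_weights = frank_wolfe_simplex(candidates, [0.5, 0.5])
```

Each iteration adds weight to at most one candidate, which is why Frank-Wolfe iterates tend to stay sparse. The resulting weights are fractional; the integerization step mentioned in the summary is not sketched here.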

Inference Economics · Agent and Tool Ecosystem · Test-Time Finetuning (TTFT) · Gradient Reuse · bits-per-byte (BpB)

6 · arXiv · cs.LG · 9d ago

Bebop: MTP with rejection sampling and TV loss achieves 1.8x RL training speedup

Researchers introduce Bebop, a framework for integrating Multi-Token Prediction (MTP) into large-scale RL training pipelines for LLMs. The work identifies that MTP acceptance rates degrade during RL due to entropy fluctuations. It proposes probabilistic rejection sampling plus a novel end-to-end Total Variation (TV) loss that directly optimizes multi-step acceptance rates, reaching acceptance rates of up to 95% and an additional 25% gain in inference throughput. Applied to Qwen3.5, Qwen3.6, and Qwen3.7 models, the method yields up to 1.8x end-to-end acceleration in async RL training. The approach eliminates the need for costly online MTP updating by using pre-RL MTP training with the proposed objectives.
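
One way to see why a total variation objective relates to acceptance rates: under the standard speculative sampling rule, where a drafted token x is accepted with probability min(1, p(x)/q(x)), the single-step acceptance probability equals one minus the total variation distance between the target distribution p and the draft distribution q. The sketch below checks that identity on toy numbers. It assumes this standard rule and does not reproduce Bebop's multi-step loss; the function names and distributions are illustrative.

```python
from typing import Sequence


def acceptance_probability(target: Sequence[float], draft: Sequence[float]) -> float:
    """Expected acceptance of one drafted token: sum_x q(x) * min(1, p(x)/q(x)) = sum_x min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(target, draft))


def total_variation(target: Sequence[float], draft: Sequence[float]) -> float:
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(p - q) for p, q in zip(target, draft))


# Toy next-token distributions over a three-token vocabulary (made-up values).
target_probs = [0.5, 0.3, 0.2]
draft_probs = [0.4, 0.4, 0.2]

accept = acceptance_probability(target_probs, draft_probs)
tv = total_variation(target_probs, draft_probs)
# The two quantities satisfy accept == 1 - tv up to floating-point error.
```

Reducing the TV distance between draft and target therefore raises the single-step acceptance probability directly, which is the intuition behind optimizing a TV-style objective instead of a proxy such as cross-entropy.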

Training Infrastructure · Inference Economics · Multi-Token Prediction (MTP) · speculative decoding · TV loss

5 · arXiv · cs.CL · 4d ago

MetaSyn benchmark reveals critical screening bottleneck in LLM-based meta-analysis pipelines

Researchers introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, paired with a 140k-article PubMed retrieval corpus, PI/ECO criteria, verified positives, and hard negatives. Benchmarking twelve pipeline configurations, including nine RAG variants and a protocol-driven agent, shows that despite 90.9% retrieval recall at K=200, no system recovers more than 52.7% of ground-truth included studies. The core failure is LLMs' inability to reliably distinguish eligible studies from topically similar distractors that fail the inclusion criteria. The paper argues that end-to-end scores obscure where pipelines break down and proposes stage-attributed metrics.

Evaluation and Benchmarking · Agent and Tool Ecosystem · PubMed · Nature Portfolio · MetaSyn

6 · arXiv · cs.CL · 12d ago

LLM-guided MAP-Elites evolution improves medical decision pipelines at inference time

Researchers propose using LLM-guided MAP-Elites evolutionary search as an inference-time alternative to fine-tuning for adapting LLMs to clinical workflows, formulating triage, consultation, and image classification as evolutionary searches over executable artifacts. Across three medical settings, evolved programs substantially outperform manually designed baselines: triage accuracy improves from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, with gains also shown on MIMIC-ESI, iCRAFTMD, and PneumoniaMNIST. The approach works across Llama-3, Qwen-3.5, and Gemma-4 backbones and produces interpretable program-level mechanisms rather than superficial prompt changes.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Gemma-4 E4B-it · MIMIC-ESI · iCRAFTMD

6 · arXiv · cs.CL · 18d ago

SubFit: Submodule-Level Fitted Residual Replacement for LLM Compression

SubFit introduces a post-training LLM compression method that operates at the submodule level (Attention and FeedForward separately) rather than full layers, and selects components non-contiguously. The approach replaces removed submodules with lightweight fitted residual bypasses calibrated on small data. Evaluated across ten LLMs at sparsity levels from 12.5% to 37.5%, SubFit retains 84.6% of dense downstream accuracy at 25% sparsity versus 81.6% for the strongest baseline, while reducing perplexity degradation from 4.34x to 2.42x and delivering measurable inference speedup and KV-cache savings.

Training Infrastructure · Evaluation and Benchmarking · FeedForward submodule · KV Cache · SubFit

5 · arXiv · cs.AI · 2d ago

MAST: Mechanism-guided selective unlearning for RLVR-trained reasoning models

Researchers introduce MAST (Mechanism-Aligned Selective Targeting), a method for selectively unlearning capabilities induced by reinforcement learning from verifiable rewards (RLVR) in language models while minimizing collateral damage to retained knowledge. The approach ranks attention-projection tensors by off-principal energy and gradient coupling to identify a targeted subset for update, rather than applying full-parameter gradient ascent. Evaluated on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, MAST achieves statistically significant forgetting on target MATH problems while preserving GSM8K performance, whereas full-parameter unlearning collapses retained capabilities. The method generalizes across seeds and unlearning objectives (NPO/SimNPO).

AI Safety Research · Alignment and RLHF · Qwen3-1.7B-Base · MATH · MAST