SubFit: Submodule-Level Fitted Residual Replacement for LLM Compression
SubFit is a post-training LLM compression method that operates on submodules (Attention and FeedForward separately) rather than on full layers, and selects components for removal non-contiguously. Removed submodules are replaced with lightweight fitted residual bypasses calibrated on a small amount of data. Evaluated across ten LLMs at sparsity levels from 12.5% to 37.5%, SubFit retains 84.6% of dense downstream accuracy at 25% sparsity versus 81.6% for the strongest baseline, reduces perplexity degradation from 4.34x to 2.42x, and delivers measurable inference speedup and KV-cache savings.
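To make the idea of a fitted residual bypass concrete, here is a minimal sketch. It does not reproduce SubFit's actual procedure. It assumes, purely for illustration, that the bypass is a single affine map fitted by ridge regression to the residual update the removed submodule produced on calibration activations. The function names, the affine form, and the `ridge` parameter are all hypothetical and do not come from the summary above.

```python
import numpy as np

def fit_linear_bypass(x, y, ridge=1e-3):
    """Fit an affine map (W, b) so that x @ W + b approximates y.

    x: (n, d) calibration inputs to the submodule being removed.
    y: (n, d) residual update that submodule produced, i.e. f(x).
    Assumption: an affine bypass fitted by ridge regression; the
    summary does not specify the form of SubFit's bypass.
    """
    n, d = x.shape
    # Append a constant column so the bias is fitted jointly with W.
    x_aug = np.hstack([x, np.ones((n, 1))])
    gram = x_aug.T @ x_aug + ridge * np.eye(d + 1)
    sol = np.linalg.solve(gram, x_aug.T @ y)
    return sol[:-1], sol[-1]

def apply_bypass(x, W, b):
    """Residual stream after replacing x + f(x) with x + (x @ W + b)."""
    return x + x @ W + b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(512, 16))
    # Toy stand-in for a removed submodule's residual update.
    y = np.tanh(x @ (0.1 * rng.normal(size=(16, 16))))
    W, b = fit_linear_bypass(x, y)
    err = np.mean((apply_bypass(x, W, b) - (x + y)) ** 2)
    print(f"mean squared reconstruction error: {err:.6f}")
```

The affine map here has d² + d parameters, far fewer than a typical attention or feed-forward block. A low-rank factorisation of `W` would be a natural way to make such a bypass lighter still, though whether SubFit does so is not stated above.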
Related events (8)
Predictor-gated bank-wise sparsity recipe for dense-to-sparse LLM upcycling from Qwen2.5-8B
A new arXiv preprint introduces a continual training recipe to convert dense LLMs into channel-sparse models without post-hoc pruning. Starting from a Qwen2.5-8B checkpoint, the method uses a low-rank predictor to gate FFN channel routing, achieving 4x sparsity in FFN intermediate activations via a bank-wise top-k rule at 32K context. The routing module is trained on the main language modeling path, making the resulting sparsity hardware-oriented rather than approximate. The authors also identify and patch a layer-local long-context failure mode on the RULER-CWE benchmark.
Hyperfitting Explained: Terminal Geometric Expansion in Final Transformer Layers Drives Diversity Gains
This paper investigates 'hyperfitting', the phenomenon in which fine-tuning LLMs to near-zero loss on small datasets improves open-ended generation and reduces repetition, and demonstrates that it is mechanistically distinct from temperature scaling. Entropy-matched control experiments falsify both the temperature-equivalence and static vocabulary reweighting hypotheses. Instead, they localize the effect to a 'Terminal Expansion' in the final transformer block, where feature-space dimensionality expands by ~80.8 dimensions, enabling promotion of deep-tail tokens via context-dependent rank reordering. The authors introduce Late-Stage LoRA, a targeted fine-tuning strategy that updates only the final 5 layers, achieving robust generation with minimal parameter updates.
SKIM: Adaptive soft-token compression for procedural skills in LLM workflows
Researchers introduce SKIM (SKIll coMpression), a multi-resolution soft token compression framework targeting procedural knowledge (skills/workflows) rather than factual documents. SKIM compresses reusable natural language skills to 30–60% of their original token length while preserving task performance, reducing prefill cost and latency when skills are repeatedly invoked. The method adapts compression depth to skill complexity and supports offline compression for frequently updated community skills.
Investing in Performance: Fine-tune small models with LLM insights — a CFM case study
This Hugging Face blog post presents a case study from CFM (Capital Fund Management) on using large language model outputs to guide fine-tuning of smaller, more efficient models for financial applications. The approach leverages LLM-generated signals or labels to train compact models that can be deployed at lower cost and latency. The case study illustrates an enterprise pattern of distilling LLM capabilities into task-specific smaller models for production use.
HullFT: Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching
HullFT is a new method for test-time finetuning (TTFT) of language models that addresses the dual bottlenecks of retrieval quality and per-query finetuning cost. It represents query embeddings as sparse convex combinations of training sequences using Frank-Wolfe optimization, yielding diverse and relevant support sets without expensive diversity-aware search. A geometric integerization step converts fractional weights into integer multiplicities, enabling a Gradient Reuse scheme that amortizes forward-backward computation across repeated examples. Experiments show improved quality-efficiency tradeoffs over prior TTFT methods, measured in bits-per-byte at lower total runtime.
EmbedFilter: Using the unembedding matrix to suppress high-frequency token noise in LLM text embeddings
Researchers identify that LLM text embeddings over-express high-frequency but semantically uninformative tokens when projected onto vocabulary space, degrading embedding quality. They introduce EmbedFilter, a simple linear transformation that filters out the subspace of the unembedding matrix responsible for writing these tokens into embedding space. The method improves zero-shot performance on text embedding benchmarks across multiple LLM backbones and yields a byproduct of dimensionality reduction without quality loss. Code is publicly released.
SETA: Sparse Subspace-to-Expert Sharing for Continual Learning in LLMs
Researchers introduce SETA (Mixture of Sparse Experts for Task Agnostic Continual Learning), a framework addressing catastrophic forgetting in LLMs via adaptive sparse subspace decomposition into task-specific and shared expert modules. The approach uses adaptive elastic anchoring and routing-aware regularization to protect shared knowledge at both weight and routing levels. Experiments on LLaMA-2 7B and Qwen3-4B show competitive or superior performance versus continual learning baselines, with strong retention of early-task knowledge.
ChunkFT: Memory-Efficient Full Fine-Tuning via Byte-Streamed Chunk Optimization
ChunkFT is a fine-tuning framework that reformulates full-parameter optimization around a dynamically activated working set of sub-tensors, enabling gradient computation without dense gradient materialization. It achieves full-parameter fine-tuning of a 7B model in 13.72 GB of GPU memory on a single RTX 4090, and scales Llama 3-70B fine-tuning to 2×H800 GPUs. Downstream evaluations on language understanding, math reasoning, and MT-Bench show ChunkFT matches or exceeds full-parameter fine-tuning quality while outperforming existing memory-efficient baselines such as LoRA-class methods. A theoretical convergence analysis in the deterministic setting is also provided.



