5 · arXiv cs.CL (Computation and Language) · 2d ago

CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield — three compute-efficient LM training techniques

An arXiv preprint introduces CHERRY, a suite of three complementary techniques for compute-efficient language model training. Selective Ground Truth Token Training (SGT) concentrates supervision on ~15% of semantically loaded tokens while recovering ~67% of the full-sequence loss reduction. Depth compression shrinks a 48-layer, 1B-parameter model to 6 layers (227M) via layer averaging and recurrent unrolling, matching the loss of a 566M dense model. A Mixture of Efficient Experts (MoEE) assembly outperforms individual compressed models at comparable active parameters. The techniques are validated on CHERRY-1.8B, a Korean-language foundation model trained entirely from scratch using these methods. The authors are transparent about scope limitations: one model family, Korean data, and loss-based metrics only.

Training Infrastructure · Open Weights Progress · Inference Economics · CHERRY-1.8B · CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield · Selective Ground Truth Token Training · Mixture of Efficient Experts
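As a rough illustration of two of the ideas in the summary above, the sketch below averages a token-level loss over only the top ~15% of tokens and collapses 48 layers into 6 by averaging consecutive groups of 8. It is a minimal sketch under stated assumptions, not the authors' implementation: the per-token `scores`, the choice of consecutive equal-size groups, and both function names are hypothetical, since the summary does not say how tokens are selected or how layers are grouped.

```python
def selective_token_loss(token_losses, scores, keep_fraction=0.15):
    """Average the loss over the highest-scoring fraction of tokens.

    `scores` is a hypothetical per-token importance signal; the summary does
    not specify how "semantically loaded" tokens are identified.
    Returns (mean loss over kept tokens, sorted indices of kept tokens).
    """
    if not token_losses or len(token_losses) != len(scores):
        raise ValueError("token_losses and scores must be non-empty and equal length")
    k = max(1, round(len(token_losses) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    return sum(token_losses[i] for i in kept) / k, kept


def average_layer_groups(layers, target_depth):
    """Merge consecutive, equal-size groups of layers by elementwise averaging.

    Each layer is represented as a flat list of weights. Grouping consecutive
    layers is an assumption made for illustration only.
    """
    if not layers or target_depth <= 0 or len(layers) % target_depth != 0:
        raise ValueError("layer count must be a positive multiple of target_depth")
    group = len(layers) // target_depth
    merged = []
    for start in range(0, len(layers), group):
        block = layers[start:start + group]
        merged.append([sum(ws) / group for ws in zip(*block)])
    return merged


if __name__ == "__main__":
    losses = [float(i) for i in range(20)]
    print(selective_token_loss(losses, scores=losses))

    toy_layers = [[float(i)] for i in range(48)]  # 48 one-weight "layers"
    print(len(average_layer_groups(toy_layers, target_depth=6)))
```

Recurrent unrolling would presumably apply each merged layer more than once in sequence; it is left out here because the summary gives no schedule for it.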

Related guides (3)

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Open Weights Progress · Topic guide

Open Weights Progress: How Free AI Models Caught Up to the Frontier

Inference Economics · Topic guide

Inference Economics: The Hidden Cost Battle Shaping AI

Related events (8)

6 · arXiv · cs.CL · 1mo ago

ZEDA: Post-Trained MoE Models Can Skip Half Their Experts via Self-Distillation

This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a framework that converts static post-trained Mixture-of-Experts (MoE) language models into dynamic ones without pre-training from scratch. ZEDA injects parameter-free zero-output experts into each MoE layer and uses two-stage self-distillation with the original model as a frozen teacher. Applied to Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, ZEDA eliminates over 50% of expert FLOPs with marginal accuracy loss and achieves approximately 1.20× end-to-end inference speedup, outperforming the strongest dynamic MoE baseline by 4–6 points.

Training Infrastructure · Frontier Model Releases · Self-Distillation · ZEDA (Zero-Expert Self-Distillation Adaptation) · Qwen3-30B-A3B · +3 more
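To make the idea of parameter-free zero-output experts concrete, here is a toy router in which some expert slots produce no output and run no computation. This is a sketch under assumptions, not ZEDA's actual routing: the scalar input, the softmax-weighted sum without renormalization, and the function names are all chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward_with_zero_experts(x, router_logits, experts, top_k):
    """Route a scalar input through the top-k experts.

    Router slots with index >= len(experts) are zero-output experts: they
    hold no parameters, add nothing to the output, and run no computation.
    Returns (output, number of real experts executed).
    """
    if len(router_logits) < len(experts) or not 1 <= top_k <= len(router_logits):
        raise ValueError("inconsistent router size or top_k")
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    output, executed = 0.0, 0
    for idx in chosen:
        if idx < len(experts):  # real expert: run it and weight its output
            output += probs[idx] * experts[idx](x)
            executed += 1
        # zero expert: skipped entirely
    return output, executed


if __name__ == "__main__":
    toy_experts = [lambda v: 2.0 * v, lambda v: v + 1.0]
    # Two real experts followed by two zero experts; the router prefers the zeros.
    print(moe_forward_with_zero_experts(3.0, [0.0, 0.0, 5.0, 5.0], toy_experts, top_k=2))
```

When the router sends a token to zero experts, the count of executed real experts drops, which is where the expert-FLOP savings in this toy model come from.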

7 · arXiv · cs.CL · 1mo ago

MobileMoE: Scaling Mixture-of-Experts for Sub-Billion Parameter On-Device Deployment

MobileMoE introduces a family of on-device MoE language models with 0.3–0.9B active parameters and 1.3–5.3B total parameters, targeting mobile deployment under memory and compute constraints. The authors derive an on-device MoE scaling law identifying a sweet spot of moderate sparsity with fine-grained and shared experts, then train models through a four-stage recipe including quantization-aware training on open-source data. Across 14 benchmarks, MobileMoE matches or exceeds leading dense on-device LLMs with 2–4× fewer inference FLOPs, and delivers 1.8–3.8× faster prefill and 2.2–3.4× faster decode than dense baselines on commodity smartphones at comparable INT4 memory.

Training Infrastructure · Frontier Model Releases · MobileLLM-Pro · OLMoE-1B-7B · INT4 Quantization · +7 more

6 · arXiv · cs.CL · 17d ago

Expert Tying reduces MoE LLM memory footprint by ~2x with minimal quality loss

Researchers introduce Expert Tying, an architectural modification for Mixture-of-Experts LLMs that shares expert parameters across consecutive transformer layers while keeping routing and attention layer-independent. Evaluated on OLMoE, Qwen3, and DeepSeek-style MoE architectures, the method achieves nearly 2x memory reduction with negligible perplexity or downstream quality degradation. The approach exploits parameter redundancy in MoE pathways to improve the compute-to-memory trade-off for training and inference.

Training Infrastructure · Frontier Model Releases · DeepSeek V4 · Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models · Expert Tying · +3 more
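A back-of-the-envelope parameter count shows why sharing experts across consecutive layers can give "nearly" rather than exactly 2x savings: the per-layer routing and attention weights are not shared. The function below is an illustrative sketch; the tie group of 2, the toy sizes, and the function name are assumptions, not figures from the paper.

```python
import math


def moe_param_count(num_layers, experts_per_layer, params_per_expert,
                    per_layer_other_params, tie_group=1):
    """Count parameters when each run of `tie_group` consecutive layers shares experts.

    `per_layer_other_params` stands in for attention and router weights,
    which stay independent per layer.
    """
    if min(num_layers, experts_per_layer, params_per_expert, tie_group) < 1:
        raise ValueError("sizes must be positive")
    if per_layer_other_params < 0:
        raise ValueError("per_layer_other_params must be non-negative")
    expert_sets = math.ceil(num_layers / tie_group)  # one expert set per tied group
    expert_params = expert_sets * experts_per_layer * params_per_expert
    return expert_params + num_layers * per_layer_other_params


if __name__ == "__main__":
    # Toy configuration, chosen only to make the arithmetic easy to follow.
    untied = moe_param_count(16, 64, 1_000_000, 2_000_000, tie_group=1)
    tied = moe_param_count(16, 64, 1_000_000, 2_000_000, tie_group=2)
    print(untied, tied, round(untied / tied, 3))
```

In this toy configuration the expert weights halve exactly, while the untied per-layer weights keep the overall ratio slightly below 2.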

7 · arXiv · cs.CL · 24d ago

Latent Context Language Models (LCLMs) achieve competitive encoder-decoder KV cache compression at scale

Researchers introduce Latent Context Language Models (LCLMs), a family of encoder-decoder compressors that map long token sequences to shorter latent embeddings consumed by a decoder, targeting the KV cache memory bottleneck in long-context inference. The authors conduct architecture search and continually pre-train 0.6B-encoder/4B-decoder models on over 350B tokens at compression ratios of 1:4, 1:8, and 1:16. LCLMs improve the Pareto frontier across general-task performance, compression speed, and peak memory, and are demonstrated as efficient backbones for long-horizon agents that can skim compressed context and expand relevant segments on demand. The work closes a previously noted gap between encoder-decoder approaches and KV cache compression methods on the accuracy-efficiency frontier.

Long Context Evolution · Inference Economics · End-to-End Context Compression at Scale · Latent Context Language Models · +1 more
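The compression ratios quoted above translate directly into how many latent embeddings the decoder has to attend over. The helper below is a simple arithmetic sketch; rounding up for inputs that do not divide evenly and the 32,768-token example length are assumptions, not details from the paper.

```python
import math


def latent_length(num_tokens, ratio):
    """Number of latent embeddings for `num_tokens` at a 1:`ratio` compression."""
    if num_tokens < 0 or ratio < 1:
        raise ValueError("num_tokens must be >= 0 and ratio >= 1")
    return math.ceil(num_tokens / ratio)


if __name__ == "__main__":
    for ratio in (4, 8, 16):
        print(f"1:{ratio} ->", latent_length(32_768, ratio))
```

Since decoder-side KV cache size grows with the number of positions the decoder sees, a shorter latent sequence is what reduces the memory footprint in this picture.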

5 · arXiv · cs.CL · 23d ago

Predictor-gated bank-wise sparsity recipe for dense-to-sparse LLM upcycling from Qwen2.5-8B

A new arXiv preprint introduces a continual training recipe to convert dense LLMs into channel-sparse models without post-hoc pruning. Starting from a Qwen2.5-8B checkpoint, the method uses a low-rank predictor to gate FFN channel routing, achieving 4x sparsity in FFN intermediate activations via a bank-wise top-k rule at 32K context. The routing module is trained on the main language modeling path, making the resulting sparsity hardware-oriented rather than approximate. The authors also identify and patch a layer-local long-context failure mode on the RULER-CWE benchmark.

Training Infrastructure · Inference Economics · Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs · SwiGLU · RULER-CWE · +1 more
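A bank-wise top-k rule can be pictured as splitting the FFN channels into fixed-size banks and keeping the same number of channels in every bank. The sketch below assumes the predictor's per-channel scores are already available and ranks by raw score; the bank size, `k`, and function name are hypothetical and not taken from the preprint.

```python
def bankwise_topk_mask(scores, bank_size, k):
    """Return a boolean keep-mask that retains the k highest-scoring channels per bank.

    `scores` are per-channel gate scores (assumed to come from a predictor).
    Keeping k of every bank_size channels gives bank_size / k sparsity.
    """
    if bank_size <= 0 or len(scores) % bank_size != 0:
        raise ValueError("len(scores) must be a multiple of a positive bank_size")
    if not 0 < k <= bank_size:
        raise ValueError("k must be in 1..bank_size")
    mask = [False] * len(scores)
    for start in range(0, len(scores), bank_size):
        bank = range(start, start + bank_size)
        for idx in sorted(bank, key=lambda i: scores[i], reverse=True)[:k]:
            mask[idx] = True
    return mask


if __name__ == "__main__":
    toy_scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.7, 0.6, 0.5]
    # Banks of 4 with k=1 keep one channel in four, i.e. 4x sparsity.
    print(bankwise_topk_mask(toy_scores, bank_size=4, k=1))
```

Unlike a global top-k, every bank keeps exactly the same number of active channels, which gives a regular pattern that is generally easier to map onto hardware.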

7 · arXiv · cs.LG · 1mo ago

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Complete-muE is a framework for transferring hyperparameters across dense FFN and Mixture-of-Experts (MoE) transformer architectures, addressing limitations of existing tools like μP and SDE that cannot handle simultaneous architecture and token-per-expert changes. It uses a two-bridge system: Bridge I maps dense FFN to Dense MoE via active-width μP with normalized router scale, and Bridge II maps Dense MoE to sparse MoE via activated-expert scaling with a first-order SDE correction. The practical outcome is a 'tune dense once, transfer to all' recipe that enables near-optimal hyperparameter reuse across MoE configurations without costly re-tuning. Experiments on language model and diffusion model pretraining confirm stable hyperparameter optima across architectures and parameter counts.

Training Infrastructure · Frontier Model Releases · Transformers · Mixture of Experts · SDE (Stochastic Differential Equation LR scaling) · +3 more

4 · arXiv · cs.AI · 8d ago

HiReLC: Hierarchical Reinforcement Learning Framework for Joint Neural Network Pruning and Quantization

Researchers introduce HiReLC, a hierarchical ensemble-RL framework that automates joint quantization and structured pruning of deep neural networks. The system uses two levels of agents: low-level agents select per-kernel compression configurations, and high-level agents coordinate global budget allocation using Fisher Information-based sensitivity estimates. Experiments on Vision Transformers and CNNs achieve 5.99–6.72× parameter-storage compression with accuracy drops of 0.55–5.62% in most settings. The controller is architecture-agnostic, using a surrogate MLP and an active learning loop to reduce policy evaluation cost.

Training Infrastructure · Inference Economics · HiReLC · ViT (Vision Transformer)

4 · arXiv · cs.AI · 6h ago

EADP: Entropy-aware visual token pruning for efficient VLMs under dense instructions

A new arXiv preprint introduces Entropy-Aware Dense Pruning (EADP), a framework for compressing visual tokens in vision-language models (VLMs) that addresses two failure modes in existing methods: textual noise corrupting cross-modal scoring and feature fragmentation from naive Top-K selection. EADP uses statistical entropy to filter textual noise and reformulates token selection as a submodular maximization problem with a spatial prior. The authors report state-of-the-art accuracy-efficiency trade-offs on multimodal benchmarks under strict token budgets.

Inference Economics · Multimodal Progress · Entropy-Aware Dense Pruning