Almanac
technique

Mixture of Efficient Experts

technique · active · provisional · mixture-of-efficient-experts-896d8074 · 1 event · first seen 2d ago

Aliases: Mixture of Efficient Experts

Recent events (1)

5 · arXiv · cs.CL · 2d ago

CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield — three compute-efficient LM training techniques

An arXiv preprint introduces CHERRY, a suite of three complementary techniques for compute-efficient language model training. Selective Ground Truth Token Training (SGT) concentrates supervision on ~15% of semantically loaded tokens while recovering ~67% of the full-sequence loss reduction. Depth compression shrinks a 48-layer, 1B-parameter model to 6 layers (227M parameters) via layer averaging and recurrent unrolling, matching the loss of a 566M dense model. A Mixture of Efficient Experts (MoEE) assembly outperforms individual compressed models at comparable active parameters. The techniques are validated on CHERRY-1.8B, a Korean-language foundation model trained entirely from scratch with these methods. The authors state the scope limitations plainly: one model family, Korean data, and loss-based metrics only.
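
The summary gives no implementation details, so the following is only a rough sketch of how two of the ideas could look in code, not the preprint's method. It assumes that each token carries some importance score (the scoring rule is hypothetical), that "layer averaging" means an element-wise mean over consecutive layers, and that "recurrent unrolling" means reusing each merged layer several times in a row. The function names, the flat-list weight format, and the consecutive reuse order are all illustrative assumptions.

```python
import math


def selective_token_loss(token_losses, token_scores, keep_fraction=0.15):
    """Average the loss over only the highest-scoring fraction of tokens.

    ASSUMPTION: token_scores is a hypothetical per-token importance score;
    the summary does not say how "semantically loaded" tokens are chosen.
    """
    if len(token_losses) != len(token_scores):
        raise ValueError("token_losses and token_scores must have equal length")
    if not token_losses:
        raise ValueError("need at least one token")
    # Keep at least one token so the mean is always defined.
    n_keep = max(1, math.ceil(keep_fraction * len(token_losses)))
    # Rank positions by score, highest first (sorted is stable, so ties
    # keep their original order).
    ranked = sorted(range(len(token_scores)), key=lambda i: -token_scores[i])
    kept = sorted(ranked[:n_keep])
    loss = sum(token_losses[i] for i in kept) / n_keep
    return loss, kept


def average_layer_groups(layers, n_groups):
    """Merge consecutive layers into n_groups by element-wise averaging.

    ASSUMPTION: each layer is a flat list of floats with the same length.
    """
    if n_groups <= 0 or len(layers) % n_groups != 0:
        raise ValueError("layer count must be a positive multiple of n_groups")
    size = len(layers) // n_groups
    merged = []
    for g in range(n_groups):
        group = layers[g * size:(g + 1) * size]
        # zip(*group) walks the same parameter position across the group.
        merged.append([sum(vals) / size for vals in zip(*group)])
    return merged


def unrolled_schedule(n_groups, repeats):
    """Index of the merged layer applied at each step of the unrolled stack.

    ASSUMPTION: each merged layer is reused `repeats` times consecutively.
    """
    return [g for g in range(n_groups) for _ in range(repeats)]


if __name__ == "__main__":
    # 48 toy "layers" merged into 6, then unrolled back to 48 steps.
    toy_layers = [[float(i), 2.0 * i] for i in range(48)]
    merged = average_layer_groups(toy_layers, n_groups=6)
    print(len(merged), len(unrolled_schedule(6, 8)))
```

With 48 layers and 6 groups, each merged layer stands in for 8 originals, so an unrolled schedule with 8 repeats restores the original 48-step depth while storing only 6 sets of layer weights.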