model

CHERRY-1.8B

model · active · provisional · cherry-1-8b-aa028ce5 · 1 event · first seen 2d ago

Aliases: CHERRY-1.8B

Co-occurring entities

CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield · Selective Ground Truth Token Training · Mixture of Efficient Experts

More like this (12)

GigaChat-10B-A1.8B Qwen3-1.7B-Base Qwen2.5-1.5B Qwen3.5-0.8B Wan2.1-T2V-1.3B Qwen3-1.7B Chai-1 DreamReasoner-8B LLaVA-1.5-13B LLaDA-1.5-8B Qwen-0.5B LLaVA-1.5-7B

Recent events (1)

5 · arXiv · cs.CL · 2d ago

CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield — three compute-efficient LM training techniques

An arXiv preprint introduces CHERRY, a suite of three complementary techniques for compute-efficient language model training. Selective Ground Truth Token Training (SGT) concentrates supervision on ~15% of semantically loaded tokens while recovering ~67% of the full-sequence loss reduction. Depth compression shrinks a 48-layer, 1B-parameter model to 6 layers (227M parameters) via layer averaging and recurrent unrolling, matching the loss of a 566M dense model. A Mixture of Efficient Experts (MoEE) assembly outperforms individual compressed models at comparable active parameter counts. The techniques are validated on CHERRY-1.8B, a Korean-language foundation model trained entirely from scratch with these methods. The authors state the scope limitations explicitly: one model family, Korean data, and loss-based metrics only.
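The summary gives no implementation details, so the following is only a minimal sketch of two of the ideas under stated assumptions, not the preprint's method. It assumes that SGT can be approximated by averaging the per-token loss over a precomputed boolean mask (how the "semantically loaded" tokens are chosen is not specified here), and that layer averaging means taking the element-wise mean of consecutive groups of layers (48 layers into 6 groups of 8). The function names `masked_token_loss` and `average_layer_groups` are hypothetical, and recurrent unrolling is not modelled.

```python
def masked_token_loss(token_losses, supervised_mask):
    """Mean loss over the supervised tokens only.

    Assumption: the mask is supplied by some external selector; the
    selection rule itself is not described in the summary.
    """
    if len(token_losses) != len(supervised_mask):
        raise ValueError("token_losses and supervised_mask differ in length")
    selected = [loss for loss, keep in zip(token_losses, supervised_mask) if keep]
    if not selected:
        raise ValueError("mask selects no tokens")
    return sum(selected) / len(selected)


def average_layer_groups(layers, target_depth):
    """Collapse a stack of layers into target_depth layers.

    Each layer is a flat list of floats standing in for its weights.
    Assumption: consecutive layers are grouped and averaged element-wise.
    """
    if target_depth <= 0 or len(layers) % target_depth != 0:
        raise ValueError("layer count must be a positive multiple of target_depth")
    group_size = len(layers) // target_depth
    merged = []
    for start in range(0, len(layers), group_size):
        block = layers[start:start + group_size]
        merged.append([sum(values) / group_size for values in zip(*block)])
    return merged


# Toy data: 20 tokens, 3 of them supervised (15% of the sequence).
token_losses = [1.0] * 20
token_losses[2], token_losses[9], token_losses[17] = 2.0, 3.0, 4.0
supervised_mask = [i in (2, 9, 17) for i in range(20)]
sgt_loss = masked_token_loss(token_losses, supervised_mask)

# Toy data: 48 one-parameter layers compressed to 6 layers.
toy_layers = [[float(i)] for i in range(48)]
compressed = average_layer_groups(toy_layers, 6)
```

In this toy run, `sgt_loss` is the mean of the three supervised losses and `compressed` holds six averaged layers. In a real training loop the mask would typically be applied to the per-token cross-entropy before reduction, and each "layer" would be a full set of weight tensors rather than a flat list.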

Training Infrastructure Open Weights Progress · CHERRY-1.8B · CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield · Selective Ground Truth Token Training