warmdown learning-rate schedule

technique · active · provisional · warmdown-learning-rate-schedule-4dfac1d3 · 1 event · first seen 22d ago

Aliases: warmdown learning-rate schedule

Recent events (1)

5 · arXiv · cs.CL · 22d ago

Mapping the Schedule × Bit-Width Boundary in Sub-100M Quantisation-Aware Training

A large factorial grid study (1,345 runs across two phases) tests whether the optimal learning-rate schedule differs by bit-width during from-scratch quantisation-aware training (QAT) of sub-100M decoder language models. The primary hypothesis, that INT6 QAT requires a different schedule than FP16/INT8, is falsified: a 33% warmdown fraction (wd33) is optimal across all precisions and model sizes from 5M to 350M. For INT4, a regime boundary is identified near 50M parameters: above it, wd33 is decisively optimal; below it, schedule choice falls within seed-level noise. The study also establishes a log-linear scaling law for the INT6 quantisation penalty that successfully predicts held-out model sizes.
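
To make the "33% warmdown fraction" concrete, here is a minimal sketch of a warmdown schedule. It assumes the common form of the technique: the learning rate is held at its peak, then decays linearly to a final value over the last fraction of training. The summary above does not state the exact decay shape or whether a warmup phase is used, so the linear decay, the absence of warmup, and the names `warmdown_lr`, `warmdown_frac`, and `final_lr` are illustrative assumptions, not the study's implementation.

```python
def warmdown_lr(step: int, total_steps: int, peak_lr: float,
                warmdown_frac: float = 0.33, final_lr: float = 0.0) -> float:
    """Learning rate at `step` for a constant-then-linear-warmdown schedule.

    Assumption (not taken from the study): the rate stays at `peak_lr` and
    then decays linearly to `final_lr` over the last `warmdown_frac` of
    training. No warmup phase is modelled.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= warmdown_frac <= 1.0:
        raise ValueError("warmdown_frac must lie in [0, 1]")

    # Number of steps spent decaying, and the step at which decay begins.
    warmdown_steps = int(round(total_steps * warmdown_frac))
    warmdown_start = total_steps - warmdown_steps

    # Constant phase (also covers warmdown_frac == 0).
    if warmdown_steps == 0 or step < warmdown_start:
        return peak_lr

    # Fraction of the warmdown completed, clamped to [0, 1].
    progress = min(1.0, (step - warmdown_start) / warmdown_steps)
    return peak_lr + (final_lr - peak_lr) * progress


if __name__ == "__main__":
    # With 1000 steps and warmdown_frac=0.33, decay starts at step 670.
    for s in (0, 669, 670, 835, 1000):
        print(s, warmdown_lr(s, total_steps=1000, peak_lr=1e-3))
```

In a training loop, a function like this could be wrapped in a multiplicative scheduler (for example by returning `warmdown_lr(step, ...) / peak_lr` as the factor); that wiring is left out here to keep the sketch framework-independent.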