ChunkFT: Memory-Efficient Full Fine-Tuning via Byte-Streamed Chunk Optimization
ChunkFT is a fine-tuning framework that reformulates full-parameter optimization around a dynamically activated working set of sub-tensors, so that gradients can be computed without materializing a dense gradient for the whole model. It fine-tunes all parameters of a 7B model in 13.72 GB of GPU memory on a single RTX 4090, and scales Llama 3-70B fine-tuning to two H800 GPUs. Downstream evaluations on language understanding, math reasoning, and MT-Bench show that ChunkFT matches or exceeds the quality of full-parameter fine-tuning while outperforming existing memory-efficient baselines such as LoRA-class methods. The paper also provides a theoretical convergence analysis in the deterministic setting.
Related guides (3)
Related events (8)
Fine-tuning Llama 2 70B using PyTorch FSDP
This Hugging Face blog post details a practical workflow for fine-tuning the Llama 2 70B model using PyTorch Fully Sharded Data Parallel (FSDP), focusing on RAM-efficient techniques. The guide addresses the memory challenges of training large-scale open-weight models across multiple GPUs. It serves as a technical reference for practitioners working with frontier-scale open models on distributed infrastructure.
Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU
Hugging Face demonstrates a method for running RLHF fine-tuning on 20-billion-parameter language models using a single 24GB consumer GPU by combining TRL and PEFT (parameter-efficient fine-tuning). The approach uses techniques like LoRA and quantization to dramatically reduce memory requirements. This lowers the hardware barrier for RLHF experimentation from multi-GPU server setups to consumer-grade hardware.
Parameter-Efficient Fine-Tuning using 🤗 PEFT
Hugging Face introduces the PEFT library, which enables parameter-efficient fine-tuning of large language models using techniques such as LoRA, prefix tuning, and prompt tuning. The library allows practitioners to adapt large pretrained models to downstream tasks while updating only a small fraction of model parameters, dramatically reducing compute and memory requirements. This lowers the barrier to fine-tuning frontier-scale models on consumer hardware.
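To make the "small fraction of model parameters" claim concrete, the sketch below counts trainable parameters for a LoRA adapter on one linear layer. It assumes the standard LoRA formulation, in which a d_out × d_in weight is frozen and two low-rank factors of shapes d_out × r and r × d_in are trained. The layer size (4096 × 4096) and rank (8) are illustrative choices, not values from the post, and the helper names are hypothetical.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix."""
    return d_in * d_out


# Illustrative sizes (assumed, not taken from the post).
d_in = d_out = 4096
rank = 8

lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
fraction = lora / full

print(f"LoRA params: {lora}, dense params: {full}, fraction: {fraction:.4%}")
```

Under these assumptions the adapter trains well under one percent of the layer's parameters; the exact saving depends on the rank and on which layers receive adapters.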
HullFT: Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching
HullFT is a new method for test-time finetuning (TTFT) of language models that addresses the dual bottlenecks of retrieval quality and per-query finetuning cost. It represents query embeddings as sparse convex combinations of training sequences using Frank-Wolfe optimization, yielding diverse and relevant support sets without expensive diversity-aware search. A geometric integerization step converts fractional weights into integer multiplicities, enabling a Gradient Reuse scheme that amortizes forward-backward computation across repeated examples. Experiments show improved quality-efficiency tradeoffs over prior TTFT methods, measured in bits-per-byte at lower total runtime.
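The convex-reconstruction idea can be illustrated with a generic Frank-Wolfe loop over the probability simplex. This is a minimal sketch of the textbook algorithm with exact line search, not HullFT's implementation: the toy 2-D points, the function names, and the stopping rule are all assumptions, and the integerization and gradient-reuse steps are not shown.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def combine(weights, points):
    """Convex combination sum_i weights[i] * points[i]."""
    dim = len(points[0])
    return [sum(w * p[k] for w, p in zip(weights, points)) for k in range(dim)]


def frank_wolfe_simplex(query, points, steps=50):
    """Minimize ||sum_i w_i x_i - query||^2 over the simplex (w >= 0, sum w = 1)."""
    n = len(points)
    weights = [1.0] + [0.0] * (n - 1)  # start at the first vertex
    for _ in range(steps):
        recon = combine(weights, points)
        residual = [r - q for r, q in zip(recon, query)]
        # Linear minimization step: the best vertex has the smallest gradient entry.
        scores = [dot(p, residual) for p in points]
        best = min(range(n), key=scores.__getitem__)
        direction = [p - r for p, r in zip(points[best], recon)]
        denom = dot(direction, direction)
        if denom == 0.0:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1].
        gamma = max(0.0, min(1.0, -dot(residual, direction) / denom))
        if gamma == 0.0:
            break
        weights = [(1.0 - gamma) * w for w in weights]
        weights[best] += gamma
    return weights


# Toy data (assumed): the query is the midpoint of the first two points.
points = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
query = [1.0, 0.0]
weights = frank_wolfe_simplex(query, points)
print(weights)
```

Each iteration adds at most one new point to the support, which is why Frank-Wolfe iterates tend to be sparse after a small number of steps.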
PEFT-Arena: Benchmarking Parameter-Efficient Finetuning via Stability-Plasticity Trade-offs
PEFT-Arena is a new benchmark that evaluates parameter-efficient finetuning methods jointly on downstream task performance and retention of pretrained general capabilities, framing the problem as a stability-plasticity dilemma. Across methods tested under comparable parameter budgets, orthogonal finetuning achieves the best Pareto frontier. The paper provides geometric analyses in both weight space (spectral/singular-value structure) and activation space (representation distortion metrics) to explain why different PEFT methods differ in forgetting behavior. A practical finding is that final SFT checkpoints often overshoot an optimal retention operating point, motivating path-wise rewinding as a post-hoc correction.
On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters
This paper reframes parameter-efficient fine-tuning (PEFT) not merely as a cheaper alternative to full fine-tuning, but as a substrate for persistent, instance-specific personal models layered atop shared foundation models. The authors analyze three scaling axes: Scale Up (stronger base models amplifying adapter utility), Scale Down (minimum viable adapter size), and Scale Out (managing millions of concurrent adapted instances). They introduce MinT as an infrastructure reference for adapter identity, versioning, provenance, evaluation, and serving at scale.
(LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware
This Hugging Face blog post covers techniques for fine-tuning the FLUX.1-dev image generation model using LoRA (Low-Rank Adaptation) on consumer-grade hardware. The post likely addresses quantization strategies (QLoRA) to reduce memory requirements, enabling training on GPUs with limited VRAM. This is relevant to the open-weights and accessible fine-tuning ecosystem for diffusion models.
Improving Hugging Face Training Efficiency Through Packing with Flash Attention 2
Hugging Face published a blog post describing a technique for improving training efficiency by packing multiple short sequences into a single batch using Flash Attention 2. The approach reduces padding waste and improves GPU utilization during LLM fine-tuning. This is a practical infrastructure optimization relevant to practitioners training models on datasets with variable-length sequences.
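The padding saving from packing can be sketched in a few lines of plain Python. This is an illustration of the general idea, not the code from the blog post: the function names are hypothetical, and the cumulative-length list is an assumed way of recording sequence boundaries so that attention can stay within each original sequence.

```python
from itertools import accumulate


def pack_sequences(sequences):
    """Concatenate variable-length token lists into one padding-free row."""
    input_ids = [tok for seq in sequences for tok in seq]
    # Position ids restart at 0 for every original sequence.
    position_ids = [pos for seq in sequences for pos in range(len(seq))]
    # Cumulative lengths mark where each sequence starts and ends.
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return input_ids, position_ids, cu_seqlens


def padded_size(sequences):
    """Token slots used if every sequence is padded to the longest one."""
    return len(sequences) * max(len(seq) for seq in sequences)


batch = [[11, 12, 13], [21], [31, 32]]
input_ids, position_ids, cu_seqlens = pack_sequences(batch)

print(input_ids)      # [11, 12, 13, 21, 31, 32]
print(position_ids)   # [0, 1, 2, 0, 0, 1]
print(cu_seqlens)     # [0, 3, 4, 6]
print(padded_size(batch), "padded slots vs", len(input_ids), "packed tokens")
```

In this toy batch, padding to the longest sequence would occupy 9 token slots, while the packed row holds only the 6 real tokens.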


