4 · arXiv cs.LG (Machine Learning) · 17d ago

FlashbackCL extends federated learning to mitigate temporal distribution shift and forgetting

FlashbackCL is a proposed extension to the Flashback federated learning method that addresses temporal forgetting — the degradation caused by client data distributions drifting over time, a scenario existing FL methods do not handle. The approach introduces temporally-decayed label counts, a device-aware replay buffer with Class-Balanced Reservoir Sampling, and server-side coreset curation. On CIFAR-10 with 50 clients, FlashbackCL achieves 6.9–10.0% relative improvement over Flashback while reducing temporal forgetting by up to 68%, with CBRS replay identified as the critical component.
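The summary names Class-Balanced Reservoir Sampling (CBRS) as the critical component. The block below is a minimal sketch of CBRS in its commonly described form, not FlashbackCL's implementation: the class name `ClassBalancedReservoir` and the parameters `capacity` and `seed` are hypothetical, and the device-aware part of the buffer is not modeled because the summary does not describe it.

```python
import random
from collections import defaultdict


class ClassBalancedReservoir:
    """Fixed-size replay buffer that keeps stored classes as balanced as the stream allows."""

    def __init__(self, capacity, seed=0):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.by_class = defaultdict(list)  # label -> samples currently stored
        self.seen = defaultdict(int)       # label -> stream samples seen so far
        self.full_classes = set()          # classes that are, or have been, the largest

    def __len__(self):
        return sum(len(samples) for samples in self.by_class.values())

    def _largest_classes(self):
        top = max(len(samples) for samples in self.by_class.values())
        return [c for c, samples in self.by_class.items() if len(samples) == top]

    def add(self, sample, label):
        self.seen[label] += 1
        # Phase 1: while there is room, store everything.
        if len(self) < self.capacity:
            self.by_class[label].append(sample)
            return
        # Phase 2: the buffer is full, so every currently largest class counts as "full".
        largest = self._largest_classes()
        self.full_classes.update(largest)
        if label not in self.full_classes:
            # Under-represented class: evict a random sample from a largest class.
            victims = self.by_class[self.rng.choice(largest)]
            victims.pop(self.rng.randrange(len(victims)))
            self.by_class[label].append(sample)
        else:
            # Full class: ordinary reservoir step restricted to this class.
            stored = self.by_class[label]
            if stored and self.rng.random() < len(stored) / self.seen[label]:
                stored[self.rng.randrange(len(stored))] = sample


# Usage: a heavily imbalanced stream (900 / 50 / 50 samples across three labels).
stream = [(i, 0) for i in range(900)] + [(i, 1) for i in range(50)] + [(i, 2) for i in range(50)]
random.Random(1).shuffle(stream)
buffer = ClassBalancedReservoir(capacity=60, seed=0)
for sample, label in stream:
    buffer.add(sample, label)
class_counts = {label: len(samples) for label, samples in buffer.by_class.items()}
```

With plain reservoir sampling the buffer would mirror the 900/50/50 skew of the stream; here the minority labels keep roughly an equal share of the 60 slots, which is the property that makes the buffer useful for replay under label drift.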

Evaluation and Benchmarking · CIFAR-100 · CIFAR-10 · FlashbackCL · Class-Balanced Reservoir Sampling · FlashbackCL: Mitigating Temporal Forgetting in Federated Learning · Flashback

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

4 · arXiv · cs.LG · 12d ago

FBCC: Forward-Backward Knowledge Distillation for Unsupervised Continual Clustering

A new arXiv preprint introduces Unsupervised Continual Clustering (UCC) as a problem formulation and proposes FBCC, a method using a continual teacher network with task-specific student networks to learn sequential clustering tasks without labels or stored past data. The approach uses a dual-phase forward-backward distillation process to preserve previously discovered cluster structure while learning new ones. Experiments on four benchmark datasets show FBCC outperforms continual learning baselines in clustering accuracy while reducing catastrophic forgetting.

Evaluation and Benchmarking · Unsupervised Continual Clustering via Forward-Backward Knowledge Distillation

4 · arXiv · cs.AI · 16d ago

BabyCL: Continual multimodal learning from egocentric child video in a single chronological pass

Researchers introduce BabyCL, a continual learning framework that processes the SAYCam egocentric child video dataset in a single chronological pass rather than shuffled multi-epoch training, more closely mimicking how children actually encounter their environment. The system combines streaming visual representation learning with image-text contrastive objectives, a multi-stage temporal segmentation, and a dual replay buffer managing visual and multimodal histories. BabyCL outperforms streaming baselines on the SAYCam Labeled-S 4AFC benchmark under matched compute budgets, substantially closing the gap to offline training upper bounds. The work advances understanding of whether neural networks can acquire word-referent mappings under biologically plausible training conditions.

Evaluation and Benchmarking · Multimodal Progress · SAYCam · BabyCL · SAYCam Labeled-S 4AFC

5 · arXiv · cs.LG · 8d ago

Stable Recovery Manifold hypothesis: catastrophic forgetting as accessibility problem, not information destruction

A new arXiv preprint investigates the geometric structure of recoverability in continual learning using Split CIFAR-100 and a sequentially trained ResNet-18. The authors introduce Recovery Subspace Dimensionality (k_t) and find that recovery dimensionality remains stable across tasks (mean k_t = 8.0) despite substantial representational drift, with principal-angle drift strongly predicting recoverability (r = -0.862). The findings support the Stable Recovery Manifold hypothesis: forgotten knowledge remains compactly decodable, reframing catastrophic forgetting as a manifold-alignment and accessibility problem rather than true information loss.

Evaluation and Benchmarking · Split CIFAR-100 · Recovery Subspace Dimensionality · The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning

6 · arXiv · cs.LG · 25d ago

Self-Generated Replay Nearly Eliminates Catastrophic Forgetting in Language Models

This paper investigates catastrophic forgetting in language models during continual learning, finding that models can use self-generated samples from their own training distribution as effective replay data, nearly eliminating forgetting without requiring stored exemplars. The authors identify two key conditions where forgetting persists: when models are pretrained near capacity saturation (leaving no room for new knowledge), and when low learning rates are used to reduce forgetting at the cost of requiring far more training steps. Self-generated replay breaks this learning-rate/forgetting tradeoff, enabling fast high-learning-rate finetuning without degradation on prior tasks.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · catastrophic forgetting · Language Model Finetuning · Continual Learning

5 · arXiv · cs.AI · 17d ago

FFR extends Forward-Forward algorithm to regression tasks with 73% memory reduction

A new arXiv preprint introduces FFR (Forward-Forward for Regression), the first framework to adapt Hinton's Forward-Forward algorithm—a biologically plausible, backpropagation-free training method—to regression problems. FFR introduces an ordinal competitive goodness function, a stratified ladder architecture, and hierarchical prediction with uncertainty estimation to handle continuous target spaces. Across five real-world regression benchmarks, FFR recovers 98.6% of backpropagation accuracy while reducing peak training memory to 27% of BP's at depth 8 and 8% at depth 32, with per-iteration time around 72% of BP's.

Training Infrastructure · Evaluation and Benchmarking · Forward-Forward Algorithm · FFR: Forward-Forward Learning for Regression

6 · arXiv · cs.CL · 18d ago

AgentCL: A Rigorous Evaluation Framework for Continual Learning in Language Agents

AgentCL is a new benchmark and evaluation framework designed to rigorously assess continual learning in language agents, addressing gaps in existing benchmarks that focus on retrieval over long-context documents or use naive task streams with limited cross-task analysis. The framework constructs compositional task streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, contrasting them with naive streams to measure transfer gains. The authors also introduce MemProbe, a probing method that stores interactions, insights, and skills while filtering unreliable experiences during consolidation. Empirical results across coding, deep research, and language understanding tasks show that controlled streams better distinguish memory design quality, and that naive streams can mask memory-induced degradation.

Long Context Evolution · Evaluation and Benchmarking · AgentCL · MemProbe · Continual Learning

6 · arXiv · cs.CL · 10d ago

QK-Restore: Fixing long-context recall degradation caused by CoT fine-tuning in hybrid LLMs

Researchers find that chain-of-thought supervised fine-tuning systematically degrades long-context recall in hybrid linear-attention models (HypeNet, Jet-Nemotron), with Needle-In-A-Haystack performance collapsing dramatically—e.g., HypeNet-9B dropping from 67.2% to 9.4% at 256K context. The root cause is identified as CoT-SFT biasing attention gradients toward short-range patterns, corrupting the query-key projections responsible for long-range routing. The paper proposes QK-Restore, a training-free fix that restores only W_Q and W_K from the pre-SFT checkpoint, recovering long-context capability while preserving reasoning gains.
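The restore step described above can be sketched as a checkpoint merge. This is an illustration under stated assumptions, not the paper's code: checkpoints are modeled as plain dictionaries from parameter name to value, and the name markers `q_proj` and `k_proj` are an assumed naming convention for W_Q and W_K that may differ in HypeNet or Jet-Nemotron checkpoints. The function name `restore_qk` is hypothetical.

```python
def restore_qk(sft_state, base_state, markers=("q_proj", "k_proj")):
    """Return a copy of the fine-tuned state with query/key projections taken from the base.

    sft_state  -- parameters after CoT fine-tuning (name -> value)
    base_state -- parameters from the pre-SFT checkpoint (name -> value)
    markers    -- substrings that identify W_Q / W_K parameters (assumed naming)
    """
    restored = dict(sft_state)  # shallow copy; the inputs are left unmodified
    for name, value in base_state.items():
        if name in restored and any(marker in name for marker in markers):
            restored[name] = value  # roll this projection back to its pre-SFT value
    return restored


# Usage with toy values standing in for weight tensors.
base = {"layer0.q_proj.weight": 1.0, "layer0.k_proj.weight": 2.0, "layer0.v_proj.weight": 3.0}
sft = {"layer0.q_proj.weight": 1.5, "layer0.k_proj.weight": 2.5, "layer0.v_proj.weight": 3.5}
merged = restore_qk(sft, base)
```

Only the query and key projections revert; every other parameter keeps its fine-tuned value, which matches the summary's claim that the fix needs no training and preserves reasoning gains.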

Long Context Evolution · Alignment and RLHF · Jet-Nemotron · Needle-in-a-Haystack · HypeNet

6 · arXiv · cs.LG · 1mo ago

FORGE: Self-Evolving Agent Memory via Population Broadcast Without Weight Updates

FORGE (Failure-Optimized Reflective Graduation and Evolution) is a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents without any gradient updates. It wraps a Reflexion-style inner loop where a reflection agent converts failed trajectories into textual heuristics or few-shot demonstrations, then propagates the best-performing instance's memory across a population between stages. Evaluated on CybORG CAGE-2 (a stochastic network-defense POMDP), FORGE improves average return by 1.7–7.7× over zero-shot and 29–72% over Reflexion across all 12 model-representation conditions tested with four LLM families. Notably, weaker models benefit disproportionately, suggesting the method may help close capability gaps rather than amplify already-strong models.
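The population broadcast between stages can be sketched as follows. This is a rough illustration, assuming the best instance's memory simply replaces every other instance's memory; the summary does not say whether FORGE replaces or merges memories or how it ranks instances, and the names `AgentInstance`, `avg_return`, and `broadcast_best_memory` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentInstance:
    name: str
    memory: list = field(default_factory=list)  # textual heuristics or few-shot demos
    avg_return: float = 0.0                     # score from the stage just finished


def broadcast_best_memory(population):
    """Copy the best-scoring instance's memory to every other instance (assumed rule)."""
    best = max(population, key=lambda agent: agent.avg_return)
    for agent in population:
        if agent is not best:
            agent.memory = list(best.memory)  # copy so later edits stay independent
    return best


# Usage: three instances finish a stage with different returns.
population = [
    AgentInstance("a", ["isolate hosts early"], avg_return=-40.0),
    AgentInstance("b", ["restore before analysing"], avg_return=-12.5),
    AgentInstance("c", [], avg_return=-75.0),
]
winner = broadcast_best_memory(population)
```

No model weights change in this step; only the prompt-injected text carried by each instance is overwritten, which is what lets the protocol run without gradient updates.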

Evaluation and Benchmarking · Agent and Tool Ecosystem · Reflexion · Grok-4-Fast · ReAct