Almanac
← Events
5arXiv cs.AI (Artificial Intelligence)·11d ago

AdamO optimizer and dynamical isometry regularization preserve plasticity in continual learning

A new arXiv preprint connects plasticity loss in continual learning to the empirical Neural Tangent Kernel and identifies dynamical isometry—keeping layer-wise Jacobian singular values near one—as a key mechanism for maintaining learning capacity under non-stationarity. The authors propose an isometry-promoting regularization scheme that can reactivate dormant ReLU units and introduce AdamO, an Adam-style optimizer that decouples isometry regularization from gradient updates analogously to AdamW. The methods are evaluated on supervised and reinforcement-learning continual-learning benchmarks, consistently matching or outperforming prior approaches. The work also reinterprets existing plasticity-preserving methods as targeting only partial isometry measures.

Related guides (1)

Related events (8)

4arXiv · cs.LG·11d ago·source ↗

Conservation laws from data symmetry in neural network gradient-flow training

A new arXiv preprint investigates whether intrinsic symmetries in training data produce conserved quantities during gradient-flow training of neural networks. The authors prove that for analytic, non-polynomial loss functions, data symmetries generically do not induce additional integrals of motion, but for MSE loss, data augmentation can yield extra conserved quantities. They introduce a framework of 'tensorizable networks'—architectures including linear, polynomial, and Lightning Attention networks—where parameter and input dependence can be separated via an intermediate representation.

5arXiv · cs.LG·11d ago·source ↗

Local linear structures in LLM weights and activations are dynamic, not fixed global directions

A new arXiv paper investigates the nature of linear structures in transformer weights and activations, finding strong local low-rank task-gradient structure but rejecting the hypothesis that fixed task planes exist. The authors show that useful bases drift substantially within 100 optimization steps, yet early recovery updates form a trajectory-prefix basis capturing 77% of LoRA recovery displacement. They also establish a formal connection between parameter perturbations and activation steering, finding a 0.58 cosine similarity between gradient-step-induced activation shifts and CAA steering vectors, suggesting linear structures are evolving local geometries rather than stable global task directions.

5arXiv · cs.LG·8d ago·source ↗

Stable Recovery Manifold hypothesis: catastrophic forgetting as accessibility problem, not information destruction

A new arXiv preprint investigates the geometric structure of recoverability in continual learning using Split CIFAR-100 and a sequentially trained ResNet-18. The authors introduce Recovery Subspace Dimensionality (k_t) and find that recovery dimensionality remains stable across tasks (mean k_t = 8.0) despite substantial representational drift, with principal-angle drift strongly predicting recoverability (r = -0.862). The findings support the Stable Recovery Manifold hypothesis: forgotten knowledge remains compactly decodable, reframing catastrophic forgetting as a manifold-alignment and accessibility problem rather than true information loss.

7arXiv · cs.AI·29d ago·source ↗

The Matching Principle: A Geometric Theory Unifying Robustness, Domain Adaptation, and Alignment via Nuisance Covariance

This paper proposes the 'matching principle': a unified geometric framework arguing that robustness methods (CORAL, IRM, adversarial training, augmentation, metric learning, Jacobian penalties, alignment constraints) are all estimators of the same object—the covariance of label-preserving deployment nuisance—and that regularizing the encoder Jacobian along this covariance's range is the core statistical problem. The authors prove closed-form optimality results in a linear-Gaussian model, introduce the Trajectory Deviation Index (TDI) as a label-free embedding sensitivity probe, and validate predictions across 13 pre-registered experimental blocks including Qwen2.5-7B. At 7B scale, matched style-PMH improves selective honesty while standard DPO degrades Style TDI, connecting the theory to alignment safety.

6arXiv · cs.LG·23d ago·source ↗

PEFT-Arena: Benchmarking Parameter-Efficient Finetuning via Stability-Plasticity Trade-offs

PEFT-Arena is a new benchmark that evaluates parameter-efficient finetuning methods jointly on downstream task performance and retention of pretrained general capabilities, framing the problem as a stability-plasticity dilemma. Across methods tested under comparable parameter budgets, orthogonal finetuning achieves the best Pareto frontier. The paper provides geometric analyses in both weight space (spectral/singular-value structure) and activation space (representation distortion metrics) to explain why different PEFT methods differ in forgetting behavior. A practical finding is that final SFT checkpoints often overshoot an optimal retention operating point, motivating path-wise rewinding as a post-hoc correction.

5arXiv · cs.LG·11d ago·source ↗

DRPO: Smooth divergence regularization replaces hard masking in LLM RL training

A new arXiv preprint proposes Divergence Regularized Policy Optimization (DRPO), a method that replaces the hard trust-region mask used in DPPO with a smooth advantage-weighted quadratic regularizer on policy shift. The approach addresses a known weakness in PPO and GRPO where importance ratios poorly proxy distributional shift in long-tailed vocabularies, and in DPPO where gradient signals are discarded rather than corrected at trust-region boundaries. Experiments across model scales, architectures, and precision settings show improved stability and efficiency in LLM RL post-training.

5arXiv · cs.LG·8d ago·source ↗

Analysis of on-policy distillation reveals sparse, geometrically structured parameter updates

A new arXiv paper analyzes on-policy distillation (OPD) — a post-training method combining on-policy student trajectories with dense teacher supervision — across language and vision-language model pairs. The authors find that OPD updates are coordinate-sparse and distributed across layers (FFN-heavy), and that training only the discovered sparse subnetwork recovers near-full performance. Geometrically, updates are numerically full-rank but spectrally concentrated, falling disproportionately on near-zero weight coordinates, suggesting OPD retains distinct geometric signatures rather than behaving like ordinary dense parameter rewriting.

5arXiv · cs.LG·3d ago·source ↗

Kolmogorov Regression lifts diffusion policies to Cameron-Martin space for robust long-horizon control

Researchers introduce a backward Kolmogorov equation framework that reformulates diffusion policy training as a deterministic boundary-value PDE problem in Cameron-Martin space, replacing stochastic score matching. The approach uses a precision-weighted Cameron-Martin loss and a Kolmogorov residual as an inference-time failure detector, yielding convergence guarantees tied to kernel effective rank rather than action dimension. Validation on the PushT manipulation benchmark shows 17% improvement in episode reward and 67.6% reduction in inter-step drift; a 6-station manufacturing scheduling task shows 28.4% lower RMSE than LSTM baselines and 96% reduction in deadlock events via Hamilton-Jacobi reachability certification.