Almanac
3 · arXiv cs.LG (Machine Learning) · 11d ago

Range Penalization for Federated Learning: Polar Clustering and Statistical Accuracy

This paper introduces range regularization for federated learning, identifying shared-weight features across clients while adaptively clustering personalized feature weights at extreme values (termed polar clustering). The approach targets statistical accuracy, cross-client regularity, and resource efficiency for quantization and coding. New nonasymptotic proof techniques are developed for the seminorm-based regularizer, alongside a fast optimization algorithm exploiting local strong convexity.
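
As a rough illustration of what a range-type penalty could look like, the sketch below sums, over features, the gap between the largest and smallest weight that clients assign to that feature. The exact regularizer in the paper is not given in this summary, so the function name `range_penalty`, the strength `lam`, and the per-feature max-minus-min form are all assumptions made for illustration.

```python
def range_penalty(client_weights, lam=1.0):
    """Sum over features of (max - min) of that feature's weight across clients.

    client_weights: list of per-client weight vectors, all the same length.
    lam: hypothetical regularization strength (not taken from the paper).
    """
    if not client_weights:
        return 0.0
    n_features = len(client_weights[0])
    total = 0.0
    for j in range(n_features):
        # Collect feature j's weight from every client.
        column = [w[j] for w in client_weights]
        total += max(column) - min(column)
    return lam * total


# Three clients, two features. Feature 0 is identical across clients,
# so it contributes zero; feature 1 takes the values -1.0 and 2.0.
weights = [
    [0.5, -1.0],
    [0.5, 2.0],
    [0.5, 2.0],
]
penalty = range_penalty(weights)  # 0.0 + (2.0 - (-1.0)) = 3.0
```

In this form the penalty is zero whenever all clients agree on a feature, which is the seminorm property: it does not penalize the shared value itself, only the spread. Only the largest and smallest value per feature enter the sum.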

Related events (8)

5 · arXiv · cs.LG · 11d ago

DRPO: Smooth divergence regularization replaces hard masking in LLM RL training

A new arXiv preprint proposes Divergence Regularized Policy Optimization (DRPO), a method that replaces the hard trust-region mask used in DPPO with a smooth advantage-weighted quadratic regularizer on policy shift. The approach addresses a known weakness in PPO and GRPO where importance ratios poorly proxy distributional shift in long-tailed vocabularies, and in DPPO where gradient signals are discarded rather than corrected at trust-region boundaries. Experiments across model scales, architectures, and precision settings show improved stability and efficiency in LLM RL post-training.

5 · arXiv · cs.LG · 18d ago

IntraShuffler: Privacy-Preserving Framework for Heterogeneous DP Federated Learning

This paper identifies a novel Privacy Inference Attack against heterogeneous differential privacy federated learning (HDP-FL) systems, where an honest-but-curious server exploits epsilon-aware aggregation and gradient denoising to infer client data distributions and link updates across rounds. To counter this, the authors propose IntraShuffler, a middleware framework that groups clients into privacy-compatible buckets and performs parameter-level shuffling within buckets, preserving epsilon-aware aggregation while disrupting persistent gradient structure. Experiments on four datasets show IntraShuffler reduces gradient recoverability by over 60% and drops surrogate inference accuracy from 0.78 to 0.33 with minimal utility loss.

7 · arXiv · cs.AI · 29d ago

The Matching Principle: A Geometric Theory Unifying Robustness, Domain Adaptation, and Alignment via Nuisance Covariance

This paper proposes the 'matching principle': a unified geometric framework arguing that robustness methods (CORAL, IRM, adversarial training, augmentation, metric learning, Jacobian penalties, alignment constraints) are all estimators of the same object—the covariance of label-preserving deployment nuisance—and that regularizing the encoder Jacobian along this covariance's range is the core statistical problem. The authors prove closed-form optimality results in a linear-Gaussian model, introduce the Trajectory Deviation Index (TDI) as a label-free embedding sensitivity probe, and validate predictions across 13 pre-registered experimental blocks including Qwen2.5-7B. At 7B scale, matched style-PMH improves selective honesty while standard DPO degrades Style TDI, connecting the theory to alignment safety.

4 · arXiv · cs.LG · 22d ago

FedTSV: Fairness-Aware Federated Learning via Trajectory Shapley Value

This paper introduces the Trajectory Shapley Value (TSV), a contribution metric that evaluates each federated learning client's influence on the global model's optimization trajectory using validation-based, temporally consistent utility. Building on TSV, the authors propose FedTSV, an adaptive aggregation method that converts per-round evaluations into dynamic client weights to handle heterogeneous and adversarial participation. Experiments on benchmark datasets demonstrate improved convergence speed, robustness, and equitable contribution assessment compared to fixed-weight aggregation baselines.
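
For readers unfamiliar with Shapley-style contribution scores, the sketch below computes the classical Shapley value exactly for a handful of clients. It is not the Trajectory Shapley Value itself, whose definition is not given in this summary; the `toy_utility` function and the client names are hypothetical stand-ins for a validation-based utility.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact classical Shapley value; cost grows exponentially in len(players)."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal gain from adding p to the coalition.
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


# Hypothetical additive utility: each client adds a fixed amount.
contrib = {"a": 0.5, "b": 0.25, "c": 0.0}


def toy_utility(coalition):
    return sum(contrib[c] for c in coalition)


vals = shapley_values(list(contrib), toy_utility)
```

With an additive utility each client's Shapley value equals its own contribution, and the values sum to the utility of the full coalition. Exact enumeration is only practical for a small number of clients.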

7 · arXiv · cs.AI · 29d ago

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Vector Policy Optimization (VPO) is a new RL post-training algorithm for LLMs that replaces the scalar reward paradigm with vector-valued rewards, explicitly training models to produce diverse solution sets that specialize across different reward trade-offs. VPO is designed as a near-drop-in replacement for the GRPO advantage estimator and targets inference-scaling search procedures like AlphaEvolve. Across four tasks, VPO matches or outperforms scalar RL baselines on pass@k and best@k metrics, with advantages growing as search budget increases, and unlocks evolutionary search problems that GRPO-trained models cannot solve. The paper argues that diversity-optimized post-training may need to become the default as inference-time search becomes standard.

4 · arXiv · cs.LG · 17d ago

FlashbackCL extends federated learning to mitigate temporal distribution shift and forgetting

FlashbackCL is a proposed extension to the Flashback federated learning method that addresses temporal forgetting — the degradation caused by client data distributions drifting over time, a scenario existing FL methods do not handle. The approach introduces temporally-decayed label counts, a device-aware replay buffer with Class-Balanced Reservoir Sampling, and server-side coreset curation. On CIFAR-10 with 50 clients, FlashbackCL achieves 6.9–10.0% relative improvement over Flashback while reducing temporal forgetting by up to 68%, with CBRS replay identified as the critical component.

6 · arXiv · cs.AI · 23d ago

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

This paper investigates whether extrapolative weight averaging of RL-trained checkpoints can extend Pareto frontiers between competing objectives (correctness vs. computational efficiency) without additional training. Starting from a shared initialization, the authors train checkpoints under nested unit-test coverage regimes for competitive programming tasks, revealing a correctness-efficiency frontier where higher-coverage rewards reduce optimization failures but increase correctness failures. Extrapolation beyond trained endpoints produces complementary policies that, when ensembled, improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. Results hold across 7B and 32B model scales and three inference settings: pure reasoning, tool use, and agentic coding.
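
A minimal sketch of what extrapolating between two checkpoints might look like, assuming the common linear form `start + alpha * (end - start)` applied per parameter. The paper's exact scheme is not given in this summary; the function `extrapolate`, the coefficient `alpha`, and the toy checkpoints are hypothetical.

```python
def extrapolate(theta_start, theta_end, alpha):
    """Return theta_start + alpha * (theta_end - theta_start), per parameter.

    alpha in [0, 1] interpolates between the two checkpoints;
    alpha > 1 moves past theta_end, alpha < 0 moves past theta_start.
    """
    return {
        name: [s + alpha * (e - s)
               for s, e in zip(theta_start[name], theta_end[name])]
        for name in theta_start
    }


# Toy checkpoints stored as name -> flat list of floats (an assumption).
ckpt_a = {"w": [0.0, 1.0], "b": [0.5]}
ckpt_b = {"w": [1.0, 3.0], "b": [0.5]}

midpoint = extrapolate(ckpt_a, ckpt_b, 0.5)  # ordinary weight averaging
beyond = extrapolate(ckpt_a, ckpt_b, 1.5)    # extrapolation past ckpt_b
```

No training is involved: the new policy is obtained purely by arithmetic on existing weights, which is why the two checkpoints need to share parameter names and shapes.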

4 · arXiv · cs.LG · 24d ago

Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization

This paper proposes a generalized probabilistic smoothing framework for global optimization that replaces Gaussian kernels with flexible symmetric unimodal kernels combined with monotonic ratio-based transformations. The authors prove that the smoothed objective preserves the global maximizer and that stationary points concentrate near the true optimum under large amplification, without requiring a decreasing smoothing schedule. Explicit complexity bounds for stochastic gradient ascent are derived, and a leave-one-out baseline is shown to provably reduce variance. Experiments on high-dimensional benchmarks and black-box adversarial attacks demonstrate improved robustness over existing methods.