Entity · product

PALS

product · active · pals-41b69b61 · 2 events · first seen May 21, 2026

Aliases: PALS

Co-occurring entities

WikiText-2 · LLaMA-7B · SparseGPT · Mistral 7B · PALS: Percentile-Aware Layerwise Sparsity for LLM Pruning · Wanda · Llama-3.1-8B · Mixture of Experts · GPU power capping · QoS (Quality of Service) · vLLM

More like this (12)

PaLI · PLAID · LPU · PACT · LamPO · C2PA · MAML · PAC-ACT · PATE (Private Aggregation of Teachers for Ensembles) · Pacific Northwest National Laboratory · PEFT · SAPLMA

Recent events (2)

4 · arXiv · cs.CL · Jul 9, 2026

PALS: Percentile-aware per-layer sparsity improves LLM pruning on LLaMA-2 but not universally

PALS (Percentile-Aware Layerwise Sparsity) is a one-shot pruning method that assigns per-layer sparsity ratios based on the 99th percentile of activation magnitudes, bounded within ±5% of a target ratio. On LLaMA-2-7B at 50% sparsity, PALS achieves perplexity of 10.96 vs. 12.92 for uniform Wanda, a statistically significant improvement requiring no fine-tuning. However, gains are architecture-dependent: LLaMA-3-8B shows marginal improvement and Mistral-7B shows none. A notable negative finding is that gradient-based allocation performs worse than random, suggesting gradient magnitude is a poor proxy for the impact of discrete weight removal.

Open Weights Progress · Inference Economics · PALS · WikiText-2 · LLaMA-7B
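The allocation step described in this summary can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: the summary does not say how the 99th-percentile statistic is mapped to a sparsity ratio, so the sketch assumes that layers with a larger percentile are pruned less, reads "±5%" as ±0.05 in absolute sparsity, and uses a linear mapping that keeps the mean ratio at the target. The names `percentile` and `allocate_sparsity` are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty sequence, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def allocate_sparsity(layer_activations, target=0.5, band=0.05, q=99):
    """Assign one sparsity ratio per layer inside [target - band, target + band].

    layer_activations: one sequence of activation values per layer.
    Assumption (not from the source): a larger q-th percentile of |activation|
    means the layer is more sensitive, so it receives a lower sparsity ratio.
    """
    scores = [percentile([abs(a) for a in acts], q) for acts in layer_activations]
    mean = sum(scores) / len(scores)
    spread = max(abs(s - mean) for s in scores)
    if spread == 0:
        # All layers look alike: fall back to uniform sparsity.
        return [target] * len(scores)
    # (s - mean) / spread lies in [-1, 1], so every ratio stays inside the band,
    # and the deviations sum to zero, so the average ratio equals the target.
    return [target - band * (s - mean) / spread for s in scores]


if __name__ == "__main__":
    # Toy activations for three layers; the second has a heavy outlier.
    layers = [[0.1, 0.2, 0.3, 0.4], [1.0, 2.0, 3.0, 40.0], [0.5, 0.5, 0.5, 0.5]]
    for i, ratio in enumerate(allocate_sparsity(layers)):
        print(f"layer {i}: sparsity {ratio:.3f}")
```

The resulting per-layer ratios would then be handed to a one-shot pruner such as Wanda in place of a single uniform ratio. Centering on the mean is a design choice of this sketch; it keeps the overall budget at the target only when layers hold similar numbers of weights.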

6 · arXiv · cs.AI · May 21, 2026

PALS: Power-Aware LLM Serving Runtime for MoE and Dense Models

PALS is a power-aware inference runtime integrated into vLLM that treats GPU power caps as a first-class scheduling parameter alongside batch size and parallelism settings. Using lightweight offline power-performance models and a feedback-driven controller, it jointly optimizes energy efficiency and throughput targets without model retraining or API changes. Across multi-GPU deployments with both dense and MoE models, PALS achieves up to 26.3% energy efficiency improvement and reduces QoS violations by 4-7x under power constraints, enabling energy-proportional and grid-interactive AI serving.

Training Infrastructure · Inference Economics · PALS · Mixture of Experts · GPU power capping
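To make the idea of treating the power cap as a scheduling parameter concrete, here is a minimal sketch of a planner plus a feedback step. It is not the PALS runtime and does not use any vLLM API. It assumes the offline power-performance model is a lookup table of predicted throughput and power per (power cap, batch size) pair, and that the controller is a simple proportional adjustment of a throughput safety margin. `Config`, `pick_config`, `feedback_step`, and every number in `EXAMPLE_PROFILE` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    power_cap_w: int          # per-GPU power cap in watts
    batch_size: int           # serving batch size
    pred_tokens_per_s: float  # throughput predicted by the offline model
    pred_power_w: float       # power draw predicted by the offline model


# Made-up profile table, for illustration only (not measurements).
EXAMPLE_PROFILE = [
    Config(200, 8, 900.0, 190.0),
    Config(250, 16, 1400.0, 240.0),
    Config(300, 32, 1800.0, 295.0),
    Config(350, 32, 1900.0, 340.0),
]


def pick_config(configs, target_tps, power_budget_w):
    """Pick the most energy-efficient config that meets both constraints."""
    within_budget = [c for c in configs if c.pred_power_w <= power_budget_w]
    feasible = [c for c in within_budget if c.pred_tokens_per_s >= target_tps]
    if feasible:
        # Energy efficiency here means predicted tokens per second per watt.
        return max(feasible, key=lambda c: c.pred_tokens_per_s / c.pred_power_w)
    # Target unreachable: take the fastest config, preferring ones in budget.
    return max(within_budget or configs, key=lambda c: c.pred_tokens_per_s)


def feedback_step(target_tps, measured_tps, margin, gain=0.5):
    """Grow the planning margin after a throughput miss, shrink it otherwise."""
    error = target_tps - measured_tps
    return max(0.0, margin + gain * error)


if __name__ == "__main__":
    margin = 0.0
    target, budget = 1000.0, 260.0
    for measured in (900.0, 1050.0, 1200.0):  # pretend runtime measurements
        choice = pick_config(EXAMPLE_PROFILE, target + margin, budget)
        margin = feedback_step(target, measured, margin)
        print(choice.power_cap_w, choice.batch_size, round(margin, 1))
```

In a real deployment the chosen cap would be applied through the GPU vendor's management interface and the measured throughput would come from the serving engine's own metrics. Both are left out here because the summary does not describe them.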