PALS: Power-Aware LLM Serving Runtime for MoE and Dense Models
PALS is a power-aware inference runtime, integrated into vLLM, that treats GPU power caps as a first-class scheduling parameter alongside batch size and parallelism settings. It pairs lightweight offline power-performance models with a feedback-driven controller to jointly optimize energy efficiency and throughput targets, without model retraining or API changes. Across multi-GPU deployments of both dense and MoE models, PALS improves energy efficiency by up to 26.3% and cuts QoS violations under power constraints by 4–7×, enabling energy-proportional and grid-interactive AI serving.
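The abstract does not specify the form of the offline models or the control law. As a rough illustration only, the two components it names can be sketched under hypothetical assumptions. The offline model is treated as a lookup table of profiled (power cap, batch size) configurations, and the controller is a simple step controller with a deadband on measured throughput. Every name here (`Config`, `Profile`, `pick_config`, `PowerCapController`, `EXAMPLE_PROFILES`) and every number is invented for this sketch. None of them are PALS's actual interfaces or vLLM APIs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical serving knobs: per-GPU power cap and max batch size."""
    power_cap_w: int
    batch_size: int


@dataclass(frozen=True)
class Profile:
    """Hypothetical offline measurement for one Config."""
    throughput_tok_s: float  # sustained decode throughput
    power_w: float           # average GPU power draw while serving


def tokens_per_joule(p: Profile) -> float:
    # Energy efficiency: tokens per second divided by joules per second.
    return p.throughput_tok_s / p.power_w


def pick_config(profiles: dict[Config, Profile], target_tok_s: float) -> Config:
    """Choose the most energy-efficient profiled config that meets the target.

    If no config meets the throughput target, fall back to the fastest one.
    """
    feasible = [c for c, p in profiles.items() if p.throughput_tok_s >= target_tok_s]
    if not feasible:
        return max(profiles, key=lambda c: profiles[c].throughput_tok_s)
    return max(feasible, key=lambda c: tokens_per_joule(profiles[c]))


class PowerCapController:
    """Hypothetical feedback loop that nudges the power cap toward the target.

    max_w can stand in for an externally imposed (e.g. grid) power budget.
    """

    def __init__(self, cap_w: float, min_w: float, max_w: float,
                 step_w: float = 25.0, deadband: float = 0.05) -> None:
        if not min_w <= cap_w <= max_w:
            raise ValueError("initial cap must lie within [min_w, max_w]")
        self.cap_w = cap_w
        self.min_w = min_w
        self.max_w = max_w
        self.step_w = step_w
        self.deadband = deadband  # relative tolerance that avoids oscillation

    def update(self, observed_tok_s: float, target_tok_s: float) -> float:
        if observed_tok_s < target_tok_s * (1.0 - self.deadband):
            # Falling behind the target: allow more power, up to the budget.
            self.cap_w = min(self.cap_w + self.step_w, self.max_w)
        elif observed_tok_s > target_tok_s * (1.0 + self.deadband):
            # Comfortable headroom: lower the cap to save energy.
            self.cap_w = max(self.cap_w - self.step_w, self.min_w)
        return self.cap_w


# Made-up profiling numbers, for illustration only.
EXAMPLE_PROFILES = {
    Config(200, 32): Profile(1800.0, 180.0),
    Config(250, 32): Profile(2100.0, 230.0),
    Config(250, 64): Profile(2400.0, 235.0),
    Config(300, 64): Profile(2600.0, 290.0),
}
```

With these numbers, `pick_config(EXAMPLE_PROFILES, 2000.0)` returns `Config(250, 64)`. That config is the most tokens-per-joule option that still meets 2000 tok/s. A target of 3000 tok/s is infeasible, so the function falls back to the fastest config. The sketch does not apply the cap to hardware. On NVIDIA GPUs that step would typically go through NVML's `nvmlDeviceSetPowerManagementLimit`, which takes milliwatts and usually requires administrator privileges.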