QoS (Quality of Service)

other · active · qos-quality-of-service--a13dfbb4 · 1 event · first seen 26d ago

Aliases: QoS (Quality of Service)

Recent events (1)

arXiv · cs.AI · 26d ago

PALS: Power-Aware LLM Serving Runtime for MoE and Dense Models

PALS is a power-aware inference runtime integrated into vLLM that treats the GPU power cap as a first-class scheduling parameter alongside batch size and parallelism settings. Using lightweight offline power-performance models and a feedback-driven controller, it jointly optimizes for energy efficiency and throughput targets without model retraining or API changes. Across multi-GPU deployments of both dense and MoE models, PALS improves energy efficiency by up to 26.3% and reduces QoS violations by 4-7x under power constraints, enabling energy-proportional and grid-interactive AI serving.
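
The summary above describes two cooperating pieces: an offline power-performance model used to choose a configuration, and a feedback controller that corrects the power cap at run time. The following Python sketch illustrates that general pattern only; it is not PALS's actual algorithm or the vLLM API. The names `Config`, `pick_config`, and `adjust_cap`, the profile numbers, the step size, and the cap bounds are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Config:
    """One row of a hypothetical offline power-performance profile."""
    power_cap_w: int        # per-GPU power cap in watts
    batch_size: int
    throughput_tps: float   # predicted tokens per second at this setting
    power_w: float          # predicted average power draw in watts

    @property
    def tokens_per_joule(self) -> float:
        # Energy efficiency: (tokens/s) / (joules/s) = tokens per joule.
        return self.throughput_tps / self.power_w


def pick_config(profile: List[Config],
                min_throughput_tps: float,
                power_budget_w: float) -> Optional[Config]:
    """Return the most energy-efficient profiled config that meets the
    throughput target and fits under the power budget, or None."""
    feasible = [c for c in profile
                if c.throughput_tps >= min_throughput_tps
                and c.power_cap_w <= power_budget_w]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.tokens_per_joule)


def adjust_cap(cap_w: int,
               observed_latency_ms: float,
               target_latency_ms: float,
               step_w: int = 25,
               min_cap_w: int = 200,
               max_cap_w: int = 400,
               slack: float = 0.8) -> int:
    """One feedback step: raise the cap when the latency target (the QoS
    proxy used in this sketch) is violated, lower it when there is
    comfortable headroom, otherwise hold."""
    if observed_latency_ms > target_latency_ms:
        return min(cap_w + step_w, max_cap_w)
    if observed_latency_ms < slack * target_latency_ms:
        return max(cap_w - step_w, min_cap_w)
    return cap_w


# Made-up profile data, not measurements from the paper.
PROFILE = [
    Config(200, 16, 900.0, 190.0),
    Config(250, 32, 1400.0, 240.0),
    Config(300, 32, 1600.0, 290.0),
    Config(400, 64, 1900.0, 390.0),
]

if __name__ == "__main__":
    chosen = pick_config(PROFILE, min_throughput_tps=1000.0, power_budget_w=400.0)
    print(chosen)
    print(adjust_cap(250, observed_latency_ms=120.0, target_latency_ms=100.0))
```

In this sketch the offline model picks the starting point and the feedback step only nudges the cap within fixed bounds, so a mispredicted profile degrades gracefully rather than causing sustained QoS violations. Applying the chosen cap to real hardware is outside the scope of the example.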