Entity · other

QoS (Quality of Service)

other · active · qos-quality-of-service--a13dfbb4 · 1 event · first seen May 21, 2026

Aliases: QoS (Quality of Service)

Co-occurring entities

PALS, Mixture of Experts, GPU power capping, vLLM

More like this (12)

quality of experience (QoE), Quality Estimation (QE), gameplay video quality assurance, QLoRA, Soft Q-Learning, MedQA, Q-statistic, quantization, ChartQA, Q-learning, SimpleQA, GPQA

Recent events (1)

6 · arXiv · cs.AI · May 21, 2026

PALS: Power-Aware LLM Serving Runtime for MoE and Dense Models

PALS is a power-aware inference runtime integrated into vLLM that treats GPU power caps as a first-class scheduling parameter alongside batch size and parallelism settings. Using lightweight offline power-performance models and a feedback-driven controller, it jointly optimizes for energy-efficiency and throughput targets without model retraining or API changes. Across multi-GPU deployments of both dense and MoE (Mixture of Experts) models, PALS improves energy efficiency by up to 26.3% and reduces QoS violations by 4-7x under power constraints, enabling energy-proportional and grid-interactive AI serving.
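To make the idea of a feedback-driven power-cap controller concrete, here is a minimal sketch in Python. It is not the PALS algorithm, which the summary above does not specify. It is a simple step controller under assumed conditions: the QoS target is a latency SLO, the cap is adjusted in fixed watt increments, and every name here (PowerCapController, slo_ms, step_w, headroom, violation_rate) is hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PowerCapController:
    """Illustrative step controller; NOT the controller described in the PALS paper."""
    slo_ms: float          # assumed QoS target: latency SLO in milliseconds
    min_cap_w: float       # lowest GPU power cap the controller may choose (watts)
    max_cap_w: float       # highest GPU power cap, i.e. the power budget (watts)
    step_w: float = 10.0   # watts added or removed per control step (assumption)
    headroom: float = 0.9  # lower the cap only when latency < headroom * SLO

    def next_cap(self, cap_w: float, observed_ms: float) -> float:
        """Return the power cap to use for the next control interval."""
        if observed_ms > self.slo_ms:
            # QoS violated: give the GPU more power.
            cap_w += self.step_w
        elif observed_ms < self.headroom * self.slo_ms:
            # Comfortable slack: save energy by lowering the cap.
            cap_w -= self.step_w
        # Otherwise hold the cap steady; always clamp to the allowed range.
        return min(self.max_cap_w, max(self.min_cap_w, cap_w))


def violation_rate(latencies_ms: Sequence[float], slo_ms: float) -> float:
    """Fraction of requests whose latency exceeded the SLO."""
    if not latencies_ms:
        return 0.0
    return sum(lat > slo_ms for lat in latencies_ms) / len(latencies_ms)


if __name__ == "__main__":
    ctrl = PowerCapController(slo_ms=100.0, min_cap_w=200.0, max_cap_w=400.0)
    cap = 300.0
    for latency in (120.0, 110.0, 95.0, 60.0):  # made-up latency samples
        cap = ctrl.next_cap(cap, latency)
        print(f"observed {latency} ms -> cap {cap} W")
```

In a real deployment the chosen cap would still have to be applied to the hardware (for example with nvidia-smi -pl), and a production controller would presumably also draw on the offline power-performance models mentioned above rather than reacting to latency alone.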

Training Infrastructure, Inference Economics, PALS, Mixture of Experts, GPU power capping