6arXiv cs.AI (Artificial Intelligence)·1mo ago

PALS: Power-Aware LLM Serving Runtime for MoE and Dense Models

PALS is a power-aware inference runtime integrated into vLLM that treats GPU power caps as a first-class scheduling parameter alongside batch size and parallelism settings. Using lightweight offline power-performance models and a feedback-driven controller, it jointly optimizes energy efficiency and throughput targets without model retraining or API changes. Across multi-GPU deployments with both dense and MoE models, PALS achieves up to 26.3% energy efficiency improvement and reduces QoS violations by 4-7x under power constraints, enabling energy-proportional and grid-interactive AI serving.

Training Infrastructure Inference Economics PALS Mixture of Experts GPU power capping QoS (Quality of Service)vLLM

Related guides (3)

Mixture of ExpertsConcept

Mixture of Experts: How AI Models Do More by Using Less

Read asBeginner In-depth

Training InfrastructureTopic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Read asBeginner In-depth

Inference EconomicsTopic guide

Inference Economics: The Cost of Running AI in Production

Read asBeginner In-depth

Related events (8)

5Hugging Face Blog·1mo ago·source ↗

No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL

Hugging Face's TRL library now supports co-locating vLLM inference alongside training on the same GPUs, eliminating the idle GPU problem that arises when separate inference and training processes alternate. This approach allows reinforcement learning from human feedback (RLHF) and online RL training pipelines to use GPUs continuously rather than leaving them idle during generation or gradient update phases. The integration targets efficiency gains in online RL training workflows such as GRPO and PPO, where generation and training steps previously required dedicated, alternating GPU allocations.

Training Infrastructure Inference Economics GRPO PPO Hugging Face +4 more

4Hugging Face Blog·1mo ago·source ↗

Optimizing your LLM in production

A Hugging Face blog post covering practical techniques for optimizing large language models in production environments. The post likely addresses inference efficiency methods such as quantization, batching, caching, and hardware utilization strategies. It serves as a practitioner-oriented guide for deploying LLMs at scale.

Inference Economics Enterprise Deployment Patterns Hugging Face

5Github Trending·16d ago·source ↗

omlx: LLM inference server with continuous batching and SSD caching for Apple Silicon

omlx is an open-source Python project providing an LLM inference server optimized for Apple Silicon, featuring continuous batching and SSD caching managed via a macOS menu bar interface. The project has accumulated nearly 16,000 GitHub stars with strong daily momentum. It targets local inference on Apple hardware, a growing niche as consumer-grade silicon becomes increasingly capable for running open-weights models.

Open Weights Progress Inference Economics omlx jundot Apple Silicon

6The Batch·19d ago·source ↗

GLM-5.1 Open-Weights Model Targets Long-Running Agentic Tasks; Andrew Ng on Coding Agent Acceleration by Software Domain

Z.ai released GLM-5.1, an open-weights mixture-of-experts LLM (754B total / 40B active parameters) designed for sustained agentic coding tasks lasting up to eight hours, featuring iterative planning-execution-evaluation loops with thousands of tool calls. The model claims top open-weights performance on Artificial Analysis Intelligence Index and SWE-Bench Pro, available under MIT license via HuggingFace. The accompanying editorial by Andrew Ng offers a tiered framework for how much coding agents accelerate different software work categories—frontend most, then backend, infrastructure, and research least—with practical implications for team organization. A secondary item references data-center opposition and LLM helpfulness failure modes.

Frontier Model Releases Evaluation and Benchmarking DeepLearning.AI Artificial Analysis Intelligence Index SWE-bench +9 more

5arXiv · cs.CL·20d ago·source ↗

PowerCodeBench: Knowledge Boundary Probing and Intervention for LLM-Based Power System Code Generation

This paper introduces PowerCodeBench, an execution-validated benchmark for evaluating LLMs on power-system simulation code generation using the pandapower library. The authors identify that failures are dominated by API-knowledge boundary errors (hallucinated function names, misused parameters) rather than reasoning failures, and propose a boundary-aware intervention combining API demand estimation with targeted documentation injection. Evaluated across ten open-weight models (1.5B–480B) and four commercial APIs on 2,000 tasks, the intervention yields 32–56 accuracy point improvements while using only 41% of baseline prompt-token cost. Open-weight models in the 70B–120B range match commercial mid-tier accuracy, with Llama-3.1-405B and Qwen3-Coder-480B leading.

Evaluation and Benchmarking Open Weights Progress pandapower Meta Llama 3.1 405B Alibaba +7 more

6Hugging Face Blog·1mo ago·source ↗

Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU

Hugging Face demonstrates a method for running RLHF fine-tuning on 20-billion-parameter language models using a single 24GB consumer GPU by combining TRL and PEFT (parameter-efficient fine-tuning). The approach uses techniques like LoRA and quantization to dramatically reduce memory requirements. This lowers the hardware barrier for RLHF experimentation from multi-GPU server setups to consumer-grade hardware.

Open Weights Progress Inference Economics PEFT Reinforcement Learning from Human Feedback LoRA +4 more

4Hugging Face Blog·1mo ago·source ↗

Prefill and Decode for Concurrent Requests - Optimizing LLM Performance

This Hugging Face blog post from TNG Technology Consulting examines how prefill and decode phases interact under concurrent request loads in LLM serving systems. It analyzes performance bottlenecks that arise when multiple requests share GPU resources, covering throughput-latency tradeoffs and optimization strategies. The piece targets practitioners deploying LLMs at scale who need to understand scheduling and batching behavior.

Training Infrastructure Inference Economics Prefill/Decode Disaggregation Hugging Face TNG Technology Consulting

5Hugging Face Blog·1mo ago·source ↗

AMD + Hugging Face: Large Language Models Out-of-the-Box Acceleration with AMD GPU

Hugging Face and AMD announced integration work enabling out-of-the-box LLM acceleration on AMD GPUs via the Optimum library. The collaboration targets ROCm-based AMD hardware, aiming to reduce friction for users running inference on non-NVIDIA GPU stacks. This represents a continued push to broaden the hardware ecosystem available to open-weights model users.

Training Infrastructure Open Weights Progress Optimum ROCm Hugging Face +2 more