PALS: Power-Aware LLM Serving Runtime for MoE and Dense Models
PALS is a power-aware inference runtime integrated into vLLM that treats GPU power caps as a first-class scheduling parameter alongside batch size and parallelism settings. Using lightweight offline power-performance models and a feedback-driven controller, it jointly optimizes energy efficiency and throughput targets without model retraining or API changes. Across multi-GPU deployments with both dense and MoE models, PALS achieves up to 26.3% energy efficiency improvement and reduces QoS violations by 4-7x under power constraints, enabling energy-proportional and grid-interactive AI serving.
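To make the feedback-driven idea concrete, here is a minimal sketch of a proportional controller that nudges a GPU power cap toward a throughput target. It illustrates the general technique only and is not the PALS controller: the class name, the watt bounds, the gain, and the tokens-per-second signal are all assumptions made for the example, and the sketch omits the offline power-performance models the summary mentions.

```python
from dataclasses import dataclass


@dataclass
class PowerCapController:
    """Hypothetical proportional controller; not the PALS implementation."""

    min_cap_w: float = 200.0  # assumed lower bound for the GPU power cap (watts)
    max_cap_w: float = 400.0  # assumed upper bound for the GPU power cap (watts)
    gain_w: float = 100.0     # assumed watts of adjustment per unit of relative error
    cap_w: float = 300.0      # current power cap (watts)

    def step(self, target_tps: float, measured_tps: float) -> float:
        """Raise the cap when throughput misses the target, lower it when there is slack."""
        # Relative shortfall (> 0) or surplus (< 0) against the throughput target.
        rel_error = (target_tps - measured_tps) / target_tps
        new_cap = self.cap_w + self.gain_w * rel_error
        # Clamp to the allowed power range before committing the new cap.
        self.cap_w = min(self.max_cap_w, max(self.min_cap_w, new_cap))
        return self.cap_w


controller = PowerCapController()
# Throughput is 10% below target, so the cap is raised by 10% of the gain.
next_cap = controller.step(target_tps=1000.0, measured_tps=900.0)
```

In a real deployment the returned value would be applied to the device through a vendor interface such as NVML, and the loop would run once per measurement interval; both details are left out here.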
Related events (8)
No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL
Hugging Face's TRL library now supports co-locating vLLM inference alongside training on the same GPUs, eliminating the idle GPU problem that arises when separate inference and training processes alternate. This approach allows reinforcement learning from human feedback (RLHF) and online RL training pipelines to use GPUs continuously rather than leaving them idle during generation or gradient update phases. The integration targets efficiency gains in online RL training workflows such as GRPO and PPO, where generation and training steps previously required dedicated, alternating GPU allocations.
Optimizing your LLM in production
A Hugging Face blog post covering practical techniques for optimizing large language models in production environments. The post likely addresses inference efficiency methods such as quantization, batching, caching, and hardware utilization strategies. It serves as a practitioner-oriented guide for deploying LLMs at scale.
omlx: LLM inference server with continuous batching and SSD caching for Apple Silicon
omlx is an open-source Python project providing an LLM inference server optimized for Apple Silicon, featuring continuous batching and SSD caching managed via a macOS menu bar interface. The project has accumulated nearly 16,000 GitHub stars with strong daily momentum. It targets local inference on Apple hardware, a growing niche as consumer-grade silicon becomes increasingly capable for running open-weights models.
GLM-5.1 Open-Weights Model Targets Long-Running Agentic Tasks; Andrew Ng on Coding Agent Acceleration by Software Domain
Z.ai released GLM-5.1, an open-weights mixture-of-experts LLM (754B total / 40B active parameters) designed for sustained agentic coding tasks lasting up to eight hours, featuring iterative planning-execution-evaluation loops with thousands of tool calls. The model claims top open-weights performance on the Artificial Analysis Intelligence Index and SWE-Bench Pro, and is available under the MIT license via Hugging Face. The accompanying editorial by Andrew Ng offers a tiered framework for how much coding agents accelerate different categories of software work (frontend most, then backend, then infrastructure, with research least), along with practical implications for team organization. A secondary item references data-center opposition and LLM helpfulness failure modes.
PowerCodeBench: Knowledge Boundary Probing and Intervention for LLM-Based Power System Code Generation
This paper introduces PowerCodeBench, an execution-validated benchmark for evaluating LLMs on power-system simulation code generation using the pandapower library. The authors identify that failures are dominated by API-knowledge boundary errors (hallucinated function names, misused parameters) rather than reasoning failures, and propose a boundary-aware intervention combining API demand estimation with targeted documentation injection. Evaluated across ten open-weight models (1.5B–480B) and four commercial APIs on 2,000 tasks, the intervention yields 32–56 accuracy point improvements while using only 41% of baseline prompt-token cost. Open-weight models in the 70B–120B range match commercial mid-tier accuracy, with Llama-3.1-405B and Qwen3-Coder-480B leading.
Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU
Hugging Face demonstrates a method for running RLHF fine-tuning on 20-billion-parameter language models using a single 24GB consumer GPU by combining TRL and PEFT (parameter-efficient fine-tuning). The approach uses techniques like LoRA and quantization to dramatically reduce memory requirements. This lowers the hardware barrier for RLHF experimentation from multi-GPU server setups to consumer-grade hardware.
Prefill and Decode for Concurrent Requests - Optimizing LLM Performance
This Hugging Face blog post from TNG Technology Consulting examines how prefill and decode phases interact under concurrent request loads in LLM serving systems. It analyzes performance bottlenecks that arise when multiple requests share GPU resources, covering throughput-latency tradeoffs and optimization strategies. The piece targets practitioners deploying LLMs at scale who need to understand scheduling and batching behavior.
AMD + Hugging Face: Large Language Models Out-of-the-Box Acceleration with AMD GPU
Hugging Face and AMD announced integration work enabling out-of-the-box LLM acceleration on AMD GPUs via the Optimum library. The collaboration targets ROCm-based AMD hardware, aiming to reduce friction for users running inference on non-NVIDIA GPU stacks. This represents a continued push to broaden the hardware ecosystem available to open-weights model users.


