3 Simon Willison's Weblog·1mo ago

How fast is 10 tokens per second really?

Simon Willison offers commentary on the practical perception of LLM inference speed, specifically examining what 10 tokens per second means to end users. The piece contextualizes token generation rates against human reading speed and usability thresholds. This is a qualitative analysis relevant to understanding inference economics and user experience expectations for deployed language models.
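To make the comparison with reading speed concrete, here is a rough back-of-the-envelope sketch. The words-per-token ratio (0.75) and the reading speed (250 words per minute) are common rules of thumb assumed for illustration, not figures taken from the post, and the function names are hypothetical.

```python
# Back-of-the-envelope conversion between generation speed and reading speed.
# ASSUMPTIONS (not from the source): ~0.75 English words per token and a
# typical silent reading speed of ~250 words per minute.
WORDS_PER_TOKEN = 0.75
READING_WPM = 250


def tokens_per_second_to_wpm(tps, words_per_token=WORDS_PER_TOKEN):
    """Convert a token generation rate into approximate words per minute."""
    return tps * words_per_token * 60


def seconds_to_generate(num_tokens, tps):
    """Time needed to stream `num_tokens` at `tps` tokens per second."""
    return num_tokens / tps


if __name__ == "__main__":
    wpm = tokens_per_second_to_wpm(10)
    print(f"10 tokens/s is roughly {wpm:.0f} words per minute")
    print(f"That is about {wpm / READING_WPM:.1f}x the assumed reading speed")
    print(f"A 500-token answer takes {seconds_to_generate(500, 10):.0f} s to stream")
```

Under these assumptions, streaming output keeps ahead of a reader, while a long answer that must finish before it is useful (code, JSON, a full report) still takes most of a minute.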

Inference Economics · Simon Willison

Related guides (2)

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Simon Willison

Simon Willison: Developer, Toolmaker, and AI's Most Useful Commentator

Related events (8)

4 Hugging Face Blog·1mo ago

Optimizing your LLM in production

A Hugging Face blog post covering practical techniques for optimizing large language models in production environments. The post likely addresses inference efficiency methods such as quantization, batching, caching, and hardware utilization strategies. It serves as a practitioner-oriented guide for deploying LLMs at scale.

Inference Economics · Enterprise Deployment Patterns · Hugging Face

5 arXiv · cs.CL·9d ago

Study finds optimal speech token frame rate for aligning speech with text-native LLM reasoning

Researchers identify a temporal-granularity mismatch as a key cause of reasoning degradation in spoken dialogue models: speech tokens are far longer than text under matched semantics, diluting per-token semantic density. The paper introduces factorized FSQ and a non-autoregressive audio LM head to enable low frame rates, then sweeps frame rates from 50Hz down to 2.08Hz under a frozen LLM backbone. Results show a consistent optimal regime at 4.17Hz with intermediate-layer representation alignment for speech QA tasks.

Evaluation and Benchmarking · Multimodal Progress · Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation · factorized FSQ

4 arXiv · cs.CL·46h ago

Survey proposes four-layer architecture for token-operations-oriented LLM inference optimization

A new arXiv preprint introduces a four-layer technical architecture—Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion—for systematically organizing LLM inference optimization techniques. The paper reviews key technologies and industry status at each layer and analyzes their application in real-world business scenarios. The framing around 'token operations' positions inference optimization as an operational discipline analogous to traditional IT operations.

Training Infrastructure · Inference Economics · Token-Operations-Oriented Inference Optimization Techniques for Large Models

5 arXiv · cs.AI·11d ago

CLP: Lightweight collocation-length predictor achieves zero-loss multi-token inference speedup

Researchers propose CLP (Collocation-Length Predictor), a span-level decision layer for accelerating LLM inference via multi-token prediction without quality degradation. The key insight is 'Backbone-as-Architect': the backbone LM head always generates the first token while MTP heads handle only subsequent tokens, eliminating head-backbone competition that causes repetitive outputs in prior methods. CLP uses a single linear layer (~4.6K–7.7K parameters) versus 1M-parameter gate networks in prior work, achieving 1.14x–1.29x speedup on Qwen2.5 models with near-zero repetition ratio. The paper also establishes that shorter prediction horizons improve MTP head accuracy on larger models, offering a scaling-aware design principle.

Inference Economics · Qwen2.5 · Alibaba · CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

5 arXiv · cs.CL·8d ago

Adaptive asymmetric token compression accelerates time series language models up to 7.68×

A new arXiv preprint proposes an adaptive token budgeting framework for time series (TS) language models that compresses TS tokens using frequency-domain structure and progressively prunes prompt tokens across model layers. The authors demonstrate up to 7.68× inference acceleration with performance improvements in 78% of evaluated settings across forecasting, classification, imputation, and anomaly detection tasks. The work is motivated by the observation that TS tokens have uneven spectral contributions and prompt-token influence attenuates with model depth, making uniform token processing wasteful.

Long Context Evolution · Inference Economics · Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models

6 arXiv · cs.CL·19d ago

UniAudio-Token: Semantic Speech Tokenizer with General Audio Perception for Audio-LLMs

UniAudio-Token is a framework from Tencent that extends semantic speech tokenizers—commonly used as interfaces for Audio-LLMs—to support general audio perception without sacrificing speech quality. It introduces two mechanisms: Semantic-Acoustic Primitives (SAP) for structured supervision decomposing audio into linguistic, vocal, and auditory-scene components, and Semantic-Acoustic Equilibrium (SAE), a content-aware gating mechanism that restores fine-grained acoustic details from shallow layers. Evaluations show it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks when integrated with downstream LLMs. Code, training/inference scripts, and model checkpoints are publicly released.

Agent and Tool Ecosystem · Multimodal Progress · Audio-LLM · UniAudio-Token · Tencent

6 Hugging Face Blog·1mo ago

Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries

A Hugging Face blog post surveys 16 open-source reinforcement learning libraries for LLM training, analyzing their architectural approaches to asynchronous and synchronous token generation pipelines. The piece distills practical lessons about throughput, scalability, and design trade-offs across the ecosystem. It serves as a comparative landscape analysis for practitioners building or choosing RL training infrastructure for language models.

Training Infrastructure · Open Weights Progress · OpenRLHF · Reinforcement Learning from Human Feedback · veRL

7 arXiv · cs.CL·1mo ago

Forecasting Downstream LLM Performance With Token-Level Proxy Metrics

Researchers propose proxy metrics built from token-level statistics (entropy, top-k accuracy, expert token rank) drawn from a candidate model's next-token distribution over expert-written solutions, as a cheaper and more reliable alternative to cross-entropy loss or direct downstream evaluation. Across three settings (cross-family model selection, pretraining data selection, and training-time forecasting) the proxies consistently outperform baselines, achieving a mean Spearman's rho of 0.81 vs. 0.36 for cross-entropy loss on model ranking, and reducing compute for data selection by roughly 10,000×. The method enables downstream performance extrapolation across an 18× compute horizon with roughly half the error of existing alternatives, suggesting expert trajectories are broadly useful signals throughout the model development lifecycle.
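As a rough illustration of the kind of token-level statistics named in the summary, the sketch below computes entropy, the rank of the expert's token, and a top-k hit for a single position. The function name, the dictionary input format, and the default `k` are assumptions made for this example; the paper's actual definitions and aggregation may differ.

```python
import math


def token_stats(probs, expert_token, k=5):
    """Per-position statistics for one next-token distribution.

    probs: dict mapping token -> probability (hypothetical input format).
    expert_token: the token the expert-written solution actually used.
    """
    # Shannon entropy (in nats) of the model's next-token distribution.
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    # 1-based rank of the expert token when tokens are sorted by probability.
    ranked = sorted(probs, key=probs.get, reverse=True)
    rank = ranked.index(expert_token) + 1
    return {"entropy": entropy, "expert_rank": rank, "in_top_k": rank <= k}


if __name__ == "__main__":
    dist = {"return": 0.5, "yield": 0.25, "print": 0.25}
    print(token_stats(dist, "yield", k=1))
```

Averaging such per-position values over many expert-written solutions would give one candidate proxy score per model; that aggregation step is likewise an assumption here.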

Training Infrastructure · Evaluation and Benchmarking · Proxy Metrics for LLM Forecasting · Expert Token Rank · Spearman Rank Correlation