Training Infrastructure

active·training-infrastructure·282 events·last 45h ago

Compute, chips, training runs, distributed training systems, data center buildouts, and the hardware/systems side of pre-training.

Recent events (50)

9·Anthropic News·1mo ago

Anthropic and Amazon Expand Collaboration for Up to 5 Gigawatts of New Compute

Anthropic has signed a major expanded agreement with Amazon committing over $100 billion to AWS technologies over ten years, securing up to 5 GW of compute capacity for training and deploying Claude across Trainium2 through Trainium4 chips. Amazon is investing an additional $5 billion in Anthropic now, with up to $20 billion more possible in the future, building on the $8 billion it previously invested. The deal includes nearly 1 GW of Trainium2 and Trainium3 capacity coming online by end of 2026, expanded inference in Asia and Europe, and the full Claude Platform becoming available directly within AWS. Anthropic disclosed its run-rate revenue has surpassed $30 billion, up from approximately $9 billion at end of 2025.

8·Anthropic News·1mo ago

Anthropic Expands Partnership with Google and Broadcom for Multi-Gigawatt TPU Compute Capacity

Anthropic has signed a new agreement with Google and Broadcom for multiple gigawatts of next-generation TPU capacity expected to come online starting in 2027, representing the company's largest compute commitment to date. The announcement coincides with Anthropic reporting run-rate revenue surpassing $30 billion, up from ~$9 billion at end of 2025, and the number of enterprise customers spending over $1M annually doubling to 1,000+ in under two months. The compute will be predominantly US-sited, extending Anthropic's November 2025 $50B American infrastructure commitment. Anthropic continues to operate across AWS Trainium, Google TPUs, and NVIDIA GPUs, with Amazon remaining its primary cloud and training partner.

6·OpenAI Blog·1mo ago

Building the compute infrastructure for the Intelligence Age

OpenAI is scaling its Stargate initiative to expand compute infrastructure aimed at supporting AGI development. The announcement describes new data center capacity additions to meet growing AI demand. This represents a continuation of OpenAI's large-scale infrastructure buildout strategy under the Stargate program.

8·Anthropic News·1mo ago

Anthropic Announces SpaceX Colossus Compute Deal and Higher Claude Usage Limits

Anthropic has signed an agreement with SpaceX to access the full compute capacity of the Colossus 1 data center, gaining more than 300 megawatts of capacity and over 220,000 NVIDIA GPUs within a month. This deal, combined with prior agreements with Amazon, Google/Broadcom, Microsoft/NVIDIA, and Fluidstack, enables Anthropic to double Claude Code rate limits, remove peak-hour restrictions for Pro/Max users, and raise API rate limits for Claude Opus models. The announcement also notes interest in developing orbital AI compute capacity with SpaceX, and outlines international infrastructure expansion for enterprise compliance needs.

6·arXiv · cs.LG·1mo ago

RRFP: A Readiness-Driven Runtime for Pipeline-Parallel Training Under Runtime Variability

The paper introduces Runtime-Readiness-First Pipeline (RRFP), a new runtime for pipeline-parallel large-model training that treats schedules as non-binding hint orders rather than strict execution sequences. By combining message-driven asynchronous communication, lightweight tensor-parallel coordination, and ready-set arbitration, RRFP dynamically dispatches work based on actual task readiness, reducing idle bubbles and stage misalignment. Implemented on a Megatron-based framework and evaluated at up to 128 GPUs, RRFP achieves up to 1.77× speedup on language-only workloads and 2.77× on multimodal workloads versus fixed-order baselines, and outperforms the fastest comparable external system by up to 1.84×.
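
The difference between a strict execution order and a schedule treated as a hint can be illustrated with a toy model. The sketch below is not RRFP's algorithm: it assumes a single worker, ignores dependencies, communication, and tensor-parallel coordination, and uses hypothetical task names, ready times, and durations. It only shows why dispatching whichever task is ready (using the hint order to break ties) can shrink idle time when one task becomes ready late.

```python
def fixed_order_makespan(tasks):
    """Execute tasks strictly in list order, idling until each one is ready."""
    t = 0.0
    for _name, ready_at, duration in tasks:
        t = max(t, ready_at) + duration
    return t


def readiness_first_makespan(tasks):
    """Treat list order as a hint: run the first task that is ready right now."""
    pending = list(tasks)
    t = 0.0
    while pending:
        ready_now = [task for task in pending if task[1] <= t]
        if not ready_now:
            # Nothing is ready: idle until the earliest pending task becomes ready.
            t = min(task[1] for task in pending)
            continue
        task = ready_now[0]  # hint order breaks ties among ready tasks
        pending.remove(task)
        t += task[2]
    return t


# Hypothetical (name, ready_at, duration) tuples, listed in hint order.
# Task "B" becomes ready late, e.g. because an upstream stage is slow.
tasks = [("A", 0.0, 1.0), ("B", 5.0, 1.0), ("C", 1.0, 1.0), ("D", 2.0, 1.0)]

fixed = fixed_order_makespan(tasks)
ready_first = readiness_first_makespan(tasks)
print(fixed, ready_first)
```

In this toy case the fixed-order executor stalls behind the late task while the readiness-first executor fills the gap with tasks that are already runnable.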

6·Hugging Face Blog·1mo ago

The Technology Behind BLOOM Training

This Hugging Face blog post details the infrastructure and training methodology used to train BLOOM, a 176-billion-parameter open-access multilingual language model. It covers the use of Megatron-DeepSpeed for distributed training across hundreds of GPUs, including tensor parallelism, pipeline parallelism, and data parallelism strategies. The post also discusses hardware setup, memory optimization techniques, and lessons learned during the large-scale training run.

8·OpenAI Blog·1mo ago

AWS and OpenAI Announce $38B Multi-Year Strategic Partnership

OpenAI and Amazon Web Services have announced a multi-year strategic partnership valued at $38 billion. AWS will supply infrastructure and compute capacity to support OpenAI's next-generation model training and deployment workloads. The deal represents a major cloud infrastructure commitment for OpenAI alongside its existing Microsoft Azure relationship.

8·OpenAI Blog·1mo ago

OpenAI and Broadcom Announce Strategic Collaboration to Deploy 10 GW of OpenAI-Designed AI Accelerators

OpenAI and Broadcom have announced a multi-year strategic partnership targeting deployment of 10 gigawatts of OpenAI-designed AI accelerators by 2029. The collaboration involves co-developing next-generation AI accelerator systems and Ethernet networking solutions aimed at scalable, energy-efficient AI infrastructure. This represents OpenAI's continued push into custom silicon, reducing dependence on third-party chip suppliers like NVIDIA.

8·OpenAI Blog·1mo ago

AMD and OpenAI Announce Strategic Partnership to Deploy 6 Gigawatts of AMD GPUs

AMD and OpenAI have entered a multi-year strategic partnership to deploy 6 gigawatts of AMD Instinct GPUs for OpenAI's AI infrastructure, with 1 gigawatt planned for 2026. The deal represents a significant diversification of OpenAI's compute supply beyond its existing NVIDIA dependency. This is one of the largest publicly announced GPU deployment commitments in the industry.

7·OpenAI Blog·1mo ago

OpenAI, Oracle, and SoftBank expand Stargate with five new AI datacenter sites

OpenAI, Oracle, and SoftBank have announced five additional datacenter sites under the Stargate initiative, a $500 billion U.S. AI infrastructure program targeting 10 gigawatts of compute capacity. The expansion accelerates the buildout of physical infrastructure intended to support next-generation AI workloads. The announcement emphasizes domestic job creation alongside the technical capacity additions.

8·OpenAI Blog·1mo ago

OpenAI and NVIDIA Announce Strategic Partnership to Deploy 10 Gigawatts of AI Datacenters

OpenAI and NVIDIA have announced a strategic partnership targeting deployment of 10 gigawatts of AI datacenter capacity powered by NVIDIA systems. The first phase of the buildout is scheduled to launch in 2026. This represents a major infrastructure commitment between two of the most prominent organizations in AI compute and model development.

7·OpenAI Blog·1mo ago

OpenAI Launches Stargate Infrastructure Partner Outreach

OpenAI is soliciting partnerships with firms across the data center infrastructure supply chain—covering power, land, construction, and equipment—under the Stargate initiative. This represents OpenAI's formal outreach to the industrial base to build out large-scale AGI infrastructure. The announcement signals the operational phase of the previously announced Stargate project, which involves major capital commitments for AI compute infrastructure.

9·OpenAI Blog·1mo ago

Announcing The Stargate Project

OpenAI has announced the Stargate Project, a major AI infrastructure initiative. The project represents a large-scale investment in AI compute and data center infrastructure in the United States. Based on prior reporting, Stargate involves a joint venture with SoftBank and other partners targeting up to $500 billion in AI infrastructure investment over four years. This is one of the largest announced AI infrastructure commitments in history.

6·OpenAI Blog·1mo ago

Scaling Kubernetes to 7,500 Nodes

OpenAI describes scaling Kubernetes clusters to 7,500 nodes to support large-scale AI training workloads including GPT-3, CLIP, and DALL·E. The post details infrastructure challenges and solutions enabling both massive model training and rapid small-scale research iteration. This represents a significant engineering milestone in ML training infrastructure at the time of publication (January 2021).

7·OpenAI Blog·19d ago

OpenAI Breaks Ground on 1GW Stargate Data Center in Michigan

OpenAI has broken ground on a 1-gigawatt data center in Michigan as part of its Stargate infrastructure initiative. The project is framed around expanding AI access, job creation, and community support. This represents a major physical infrastructure commitment by OpenAI to domestic AI compute capacity.

8·Anthropic News·19d ago

Anthropic Commits $50 Billion to U.S. AI Computing Infrastructure with Fluidstack

Anthropic is investing $50 billion in American AI computing infrastructure, partnering with Fluidstack to build custom data centers in Texas and New York, with additional sites planned. The facilities are purpose-built for Anthropic's workloads and are expected to come online throughout 2026, creating roughly 800 permanent and 2,400 construction jobs. The announcement aligns with the Trump administration's AI Action Plan and is framed as supporting domestic AI leadership. Anthropic cites growing enterprise demand—over 300,000 business customers and a sevenfold increase in large accounts over the past year—as driving the scale of investment.

8·Anthropic News·19d ago

Anthropic Expands Google Cloud TPU Usage to Up to One Million TPUs in Tens-of-Billions Deal

Anthropic announced a major expansion of its Google Cloud infrastructure, planning to use up to one million TPUs in a deal worth tens of billions of dollars, with over a gigawatt of capacity expected online in 2026. The expansion is driven by rapidly growing enterprise demand—Anthropic now serves over 300,000 business customers with large accounts growing nearly 7x year-over-year. Anthropic maintains a diversified compute strategy across Google TPUs, Amazon Trainium, and NVIDIA GPUs, while reaffirming its primary training partnership with Amazon via Project Rainier. The company also notes the expanded compute will support alignment research and responsible deployment at scale.

6·Google DeepMind Blog·1mo ago

Decoupled DiLoCo: A new frontier for resilient, distributed AI training

DeepMind has published a blog post introducing Decoupled DiLoCo, a new approach to distributed AI training designed for resilience across heterogeneous or unreliable compute environments. The method appears to extend the original DiLoCo (Distributed Low-Communication) training framework, which enables training across loosely connected compute nodes with infrequent synchronization. The announcement signals continued investment in infrastructure techniques that reduce communication overhead and improve fault tolerance in large-scale model training.
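
The infrequent-synchronization pattern can be illustrated with a toy sketch. The code below is not DeepMind's implementation and does not model the Decoupled variant. It assumes only the general shape of the original DiLoCo scheme (each worker takes many local steps, then the workers' parameter deltas are averaged in a single communication round) and substitutes plain gradient descent for both the inner and outer optimizers. The one-dimensional quadratic objective and the per-worker targets are hypothetical.

```python
def local_steps(theta, target, lr, steps):
    """Run `steps` gradient-descent steps on 0.5 * (theta - target) ** 2."""
    for _ in range(steps):
        grad = theta - target
        theta -= lr * grad
    return theta


def diloco_round(theta_global, worker_targets, inner_lr=0.1, inner_steps=20, outer_lr=1.0):
    """One communication round: train locally on each worker, then average deltas."""
    deltas = []
    for target in worker_targets:
        theta_local = local_steps(theta_global, target, inner_lr, inner_steps)
        # The parameter change acts as this worker's "pseudo-gradient".
        deltas.append(theta_global - theta_local)
    avg_delta = sum(deltas) / len(deltas)
    # Outer step: plain SGD on the averaged pseudo-gradient (an assumption here).
    return theta_global - outer_lr * avg_delta


worker_targets = [1.0, 2.0, 3.0]  # hypothetical per-worker data shards
theta = 0.0
rounds = 10
for _ in range(rounds):
    theta = diloco_round(theta, worker_targets)

print(theta)  # approaches the mean of the worker targets
```

Each worker performs 200 local steps in total, yet the workers exchange values only 10 times, which is the communication saving the approach is built around.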

6·OpenAI Blog·1mo ago

OpenAI Introduces MRC (Multipath Reliable Connection) Networking Protocol for AI Training Clusters

OpenAI has developed and released MRC (Multipath Reliable Connection), a new supercomputer networking protocol designed to improve resilience and performance in large-scale AI training clusters. The protocol is being released through the Open Compute Project (OCP), making it available to the broader industry. MRC addresses reliability and throughput challenges in the high-bandwidth, low-latency interconnects required for frontier model training at scale.

6·arXiv · cs.AI·10d ago

Piper: Programmable distributed training system decoupling parallelism strategy from runtime

Researchers present Piper, a distributed training system that separates parallelism strategy specification from low-level runtime execution via an intermediate representation (IR) — a unified global training DAG. Users declare strategies through model annotations and scheduling directives, which Piper compiles into per-device execution plans. The system matches performance on standard strategies like ZeRO while enabling additional gains through joint compute-communication scheduling in composed strategies such as DeepSeek-V3's DualPipe.

7·Meta AI Blog·1mo ago

Meta Announces Four MTIA AI Chip Generations in Two Years: MTIA 300–500 Roadmap

Meta has detailed a rapid four-generation MTIA chip roadmap (300, 400, 450, 500) developed in partnership with Broadcom, spanning ranking and recommendation (R&R) inference and training through general GenAI workloads. Key advances include a 4.5x HBM bandwidth increase and 25x compute FLOPS improvement from MTIA 300 to 500, with MTIA 450 and 500 targeting GenAI inference with doubled and further-increased HBM bandwidth versus leading commercial products. MTIA 300 is in production for R&R training, MTIA 400 is lab-tested and entering deployment, while MTIA 450 and 500 are scheduled for mass deployment in early 2027 and 2027 respectively. The strategy emphasizes modular chiplet design and short iteration cycles to keep hardware aligned with rapidly evolving AI model requirements.

7·Latent Space·1mo ago

Anthropic-SpaceX AI's 300MW/$5B/yr Colossus I Deal; ARR Growth 8000% Annualized

Latent Space AINews reports that Anthropic has struck a major infrastructure deal with SpaceX AI involving 300MW of compute capacity at the Colossus I data center for approximately $5B per year. The report also highlights Anthropic's annualized ARR growth of 8000%, signaling rapid commercial scaling. This represents a significant strategic alignment between Anthropic and xAI/SpaceX infrastructure assets.

6·Hugging Face Blog·1mo ago

Hugging Face and NVIDIA Launch Training Cluster as a Service

Hugging Face and NVIDIA have announced a joint 'Training Cluster as a Service' offering, providing managed GPU cluster access for AI model training. The collaboration aims to lower the barrier for organizations to access large-scale training infrastructure without managing hardware directly. This represents a strategic partnership between a major AI platform and a leading GPU manufacturer to address enterprise training infrastructure needs.

7·OpenAI Blog·1mo ago

OpenAI and SoftBank Group Partner with SB Energy for Multi-Gigawatt AI Data Center Campuses

OpenAI and SoftBank Group have announced a partnership with SB Energy to develop multi-gigawatt AI data center campuses. The initiative includes a 1.2 GW facility in Texas that will support the Stargate project. This represents a major infrastructure investment aimed at scaling AI compute capacity.

6·OpenAI Blog·1mo ago

Expanding Stargate to Michigan

OpenAI is expanding its Stargate AI infrastructure initiative to Michigan with a new one-gigawatt campus. The project is framed as strengthening U.S. AI infrastructure while creating jobs and driving investment in the Midwest. No technical specifications or timeline details are provided in the announcement.

7·OpenAI Blog·1mo ago

Samsung and SK join OpenAI's Stargate initiative to advance global AI infrastructure

Samsung and SK Group have joined OpenAI's Stargate initiative, expanding the program's global footprint. The partnership focuses on scaling advanced memory chip production and constructing next-generation data centers in South Korea. This extends Stargate beyond its initial US-centric scope into a major Asian manufacturing and compute hub.

7·OpenAI Blog·1mo ago

OpenAI and Oracle Expand Stargate with 4.5 GW U.S. Data Center Partnership

OpenAI and Oracle have signed an agreement to develop 4.5 gigawatts of additional Stargate data center capacity in the United States. The deal represents a major infrastructure expansion for OpenAI's Stargate platform, which is positioned as the long-term backbone for delivering AI at scale. The announcement emphasizes job creation, U.S. reindustrialization, and domestic AI leadership.

9·OpenAI Blog·1mo ago

Microsoft Invests $1 Billion in OpenAI, Becomes Exclusive Cloud Provider

Microsoft announced a $1 billion investment in OpenAI in July 2019, establishing a strategic partnership aimed at building AGI with broadly distributed economic benefits. As part of the deal, Microsoft became OpenAI's exclusive cloud provider, and the two companies agreed to jointly develop Azure AI supercomputing infrastructure. This partnership laid the foundation for OpenAI's large-scale model training on Azure and subsequent deeper integrations between the two organizations.

6·Hugging Face Blog·24d ago

Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

Hugging Face introduces Delta Weight Sync in TRL, a technique for efficiently synchronizing model weight updates during large-scale training by transmitting only the delta (difference) between checkpoints rather than full parameter snapshots. The approach targets trillion-parameter training regimes where checkpoint bandwidth is a significant bottleneck. The post describes integration with the Hugging Face Hub as a storage and distribution layer for these delta updates.
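
The idea of shipping only what changed between two checkpoints can be sketched in a few lines. The code below is an illustration only and is not TRL's API or the Hub's storage format: `make_delta` and `apply_delta` are hypothetical helpers, checkpoints are modeled as plain dictionaries of float lists with identical keys and shapes, and the delta stores the new values of changed entries (rather than arithmetic differences) so that reconstruction is exact.

```python
def make_delta(old, new, tol=0.0):
    """Return {tensor_name: {index: new_value}} for entries that changed by more than tol."""
    delta = {}
    for name, new_vals in new.items():
        changed = {
            i: v
            for i, (o, v) in enumerate(zip(old[name], new_vals))
            if abs(v - o) > tol
        }
        if changed:
            delta[name] = changed
    return delta


def apply_delta(old, delta):
    """Rebuild the newer checkpoint from the older one plus the delta."""
    rebuilt = {name: list(vals) for name, vals in old.items()}
    for name, changed in delta.items():
        for i, v in changed.items():
            rebuilt[name][i] = v
    return rebuilt


# Hypothetical checkpoints: only one entry of one tensor differs.
old_ckpt = {"layer0.weight": [0.10, 0.20, 0.30], "layer1.weight": [1.0, 2.0]}
new_ckpt = {"layer0.weight": [0.10, 0.25, 0.30], "layer1.weight": [1.0, 2.0]}

delta = make_delta(old_ckpt, new_ckpt)
restored = apply_delta(old_ckpt, delta)
print(delta)
```

The receiver needs the previous checkpoint plus the delta, so the bytes transferred scale with the number of changed entries instead of the full parameter count.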

9·Anthropic News·19d ago

Microsoft, NVIDIA, and Anthropic Announce Major Strategic Partnerships with $15B Investment and $30B Azure Compute Commitment

Anthropic has announced simultaneous strategic partnerships with Microsoft and NVIDIA, committing to purchase $30 billion of Azure compute capacity and up to one gigawatt of compute with NVIDIA Grace Blackwell and Vera Rubin systems. NVIDIA and Microsoft are investing up to $10 billion and $5 billion respectively in Anthropic, while Claude models (Sonnet 4.5, Opus 4.1, Haiku 4.5) will be available on Microsoft Foundry and across the Copilot product family. Anthropic and NVIDIA are also establishing a deep technology partnership to co-optimize model performance and future NVIDIA architectures for Anthropic workloads. Amazon remains Anthropic's primary cloud and training partner.

6·arXiv · cs.AI·1mo ago

Framework for Evaluating Datacenter Power Delivery Hierarchies for AI Workloads

Researchers from Microsoft Azure present a simulation framework for evaluating datacenter power delivery designs under AI-era conditions, where rack power density is projected to approach 1MW per deployment by 2027. The framework combines GPU/compute/storage projection models with production operational data to assess throughput, power, and cost metrics across realistic deployment sequences. Key findings show that multi-resource stranding materially affects deployable capacity and effective capital expenditure, and that the correct planning objective is deployable capacity over time rather than installed megawatts. The work addresses the challenge of designing power hierarchies that remain efficient across multiple hardware generations as AI accelerator density rises.

6·Qwen Research·1mo ago

Global-batch Load Balancing for MoE LLM Training from Qwen

Qwen Research introduces a global-batch load balancing technique for Mixture-of-Experts (MoE) LLM training, claiming it is nearly a 'free lunch' improvement. The method addresses expert load imbalance across training batches, a known efficiency and quality bottleneck in MoE architectures. The approach targets the router and expert activation dynamics in transformer-based MoE layers.
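
A small numeric sketch shows why the batch over which balance is measured matters. The summary above does not give Qwen's formula, so the code assumes the commonly used auxiliary loss n_experts * sum_i(f_i * p_i) with top-1 routing, where f_i is the fraction of tokens sent to expert i and p_i is the mean router probability for expert i. The router probabilities are hypothetical.

```python
def load_balance_loss(router_probs):
    """Auxiliary loss n_experts * sum_i f_i * p_i over one batch of tokens.

    router_probs: one list of per-expert probabilities for each token.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    counts = [0] * n_experts
    for probs in router_probs:
        counts[probs.index(max(probs))] += 1  # top-1 routing (an assumption)
    f = [c / n_tokens for c in counts]
    p = [sum(probs[i] for probs in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Two hypothetical micro-batches, each drawn from a single domain that
# prefers a different expert.
micro_batches = [
    [[0.9, 0.1], [0.9, 0.1]],
    [[0.1, 0.9], [0.1, 0.9]],
]

# Balance measured inside each micro-batch, then averaged.
micro_avg = sum(load_balance_loss(mb) for mb in micro_batches) / len(micro_batches)

# Balance measured once over all tokens in the global batch.
all_tokens = [probs for mb in micro_batches for probs in mb]
global_loss = load_balance_loss(all_tokens)

print(micro_avg, global_loss)
```

Measured per micro-batch, the router is penalized for specializing experts by domain. Measured over the global batch, the same routing is already balanced and the loss sits near its minimum of 1.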

6·Hugging Face Blog·1mo ago

Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries

A Hugging Face blog post surveys 16 open-source reinforcement learning libraries for LLM training, analyzing their architectural approaches to async and synchronous token generation pipelines. The piece distills practical lessons about throughput, scalability, and design trade-offs across the ecosystem. It serves as a comparative landscape analysis for practitioners building or choosing RL training infrastructure for language models.

5·Hugging Face Blog·1mo ago

Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training

Hugging Face published a guide on N-dimensional parallelism for multi-GPU training using the Accelerate library. The post covers combining data parallelism, tensor parallelism, pipeline parallelism, and other strategies to efficiently scale model training across GPU clusters. This is a practical technical resource aimed at practitioners working with large-scale distributed training setups.

5·Hugging Face Blog·1mo ago

Accelerate 1.0.0 Released

Hugging Face has released Accelerate 1.0.0, marking the library's first stable major version. Accelerate is a widely used PyTorch training library that abstracts distributed training across hardware configurations including multi-GPU, TPU, and mixed-precision setups. The 1.0.0 milestone signals API stability and production readiness for the training infrastructure ecosystem.

4·Hugging Face Blog·1mo ago

From DeepSpeed to FSDP and Back Again with Hugging Face Accelerate

This Hugging Face blog post covers the practical migration path between DeepSpeed and PyTorch FSDP distributed training backends using the Accelerate library. It addresses configuration differences, compatibility considerations, and workflow patterns for switching between the two frameworks. The post targets practitioners running large-scale model training who need flexibility across distributed training strategies.

6·Hugging Face Blog·1mo ago

GaLore: Advancing Large Model Training on Consumer-grade Hardware

GaLore (Gradient Low-Rank Projection) is a memory-efficient training technique that reduces optimizer state memory by projecting gradients into a low-rank subspace during training, enabling large model training on consumer-grade hardware. The Hugging Face blog post covers integration of GaLore into the transformers and peft ecosystems. Unlike LoRA, which constrains weight updates to low-rank adapters, GaLore applies the low-rank projection to the gradients, allowing full-parameter learning with a reduced memory footprint. This makes training models like LLaMA-7B feasible on single consumer GPUs.

5·Hugging Face Blog·1mo ago

Easily Train Models with H100 GPUs on NVIDIA DGX Cloud

Hugging Face announced integration with NVIDIA DGX Cloud, enabling users to train models on H100 GPU clusters directly through the Hugging Face platform. The partnership simplifies access to high-end training infrastructure without requiring users to manage cloud provisioning themselves. This represents a continued push to lower the barrier to large-scale model training for the broader ML community.

4·Hugging Face Blog·1mo ago

Accelerate Large Model Training using DeepSpeed

This Hugging Face blog post explains how to use the Accelerate library in conjunction with DeepSpeed to train large language models more efficiently. It covers integration patterns, configuration options, and practical guidance for leveraging DeepSpeed's ZeRO optimization stages through the Accelerate abstraction layer. The post targets practitioners looking to scale model training without deep infrastructure expertise.

4·Hugging Face Blog·1mo ago

Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel

This Hugging Face blog post explains how to use PyTorch's Fully Sharded Data Parallel (FSDP) to train large models that exceed single-GPU memory limits. It covers the integration of FSDP with the Hugging Face Accelerate library, enabling distributed sharding of model parameters, gradients, and optimizer states across multiple GPUs. The post provides practical guidance on configuration and usage for scaling large model training.

4·Hugging Face Blog·1mo ago

Fit More and Train Faster With ZeRO via DeepSpeed and FairScale

This Hugging Face blog post from January 2021 covers integration of ZeRO (Zero Redundancy Optimizer) memory optimization techniques via DeepSpeed and FairScale into the Transformers training ecosystem. ZeRO partitions optimizer states, gradients, and model parameters across GPUs to enable training of much larger models on the same hardware. The post serves as a practical guide for practitioners looking to scale model training without additional infrastructure investment.
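
The effect of partitioning each kind of state can be estimated with simple arithmetic. The sketch below assumes the memory accounting used in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state) and ignores activations, buffers, and fragmentation. The function name and the 7.5-billion-parameter, 64-GPU example are illustrative choices, not figures from the blog post.

```python
def zero_bytes_per_gpu(n_params, n_gpus, stage):
    """Approximate model-state bytes per GPU for ZeRO stages 0 to 3."""
    params = 2 * n_params   # fp16 weights
    grads = 2 * n_params    # fp16 gradients
    optim = 12 * n_params   # fp32 weights, momentum, and variance for Adam
    if stage >= 1:
        optim /= n_gpus     # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= n_gpus     # stage 2 also partitions gradients
    if stage >= 3:
        params /= n_gpus    # stage 3 also partitions parameters
    return params + grads + optim


n_params = 7.5e9  # hypothetical model size
n_gpus = 64       # hypothetical data-parallel degree
for stage in range(4):
    gigabytes = zero_bytes_per_gpu(n_params, n_gpus, stage) / 1e9
    print(f"stage {stage}: {gigabytes:.1f} GB per GPU")
```

Under these assumptions the per-GPU footprint drops from about 120 GB with no partitioning to under 2 GB when all three kinds of state are sharded.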

7·OpenAI Blog·1mo ago

Introducing Stargate UK

OpenAI has announced Stargate UK, an extension of its Stargate infrastructure initiative into the United Kingdom. The announcement signals a major AI infrastructure investment in the UK, likely involving data center buildout and compute capacity. This follows the US Stargate program and represents a significant international expansion of OpenAI's infrastructure strategy.

6·OpenAI Blog·1mo ago

Introducing Stargate Norway: OpenAI's First European AI Data Center Initiative

OpenAI is launching Stargate Norway, its first AI data center initiative in Europe, under the OpenAI for Countries program. This represents an expansion of the Stargate infrastructure platform beyond the United States into European markets. The announcement positions Stargate as OpenAI's overarching infrastructure platform central to its long-term global deployment strategy.

7·OpenAI Blog·1mo ago

Introducing Stargate UAE

OpenAI is launching Stargate UAE, marking the first international deployment of its Stargate AI infrastructure platform. This expansion takes the large-scale compute infrastructure initiative beyond the United States for the first time. The announcement signals OpenAI's intent to build out global AI infrastructure capacity in partnership with regional stakeholders in the UAE.

7·OpenAI Blog·1mo ago

Introducing Triton: Open-source GPU programming for neural networks

OpenAI released Triton 1.0, an open-source Python-like language for GPU programming targeting neural network workloads. It enables researchers without CUDA expertise to write highly efficient GPU kernels, reportedly matching expert-level performance in most cases. The release lowers the barrier to custom GPU kernel development for ML practitioners.

6·OpenAI Blog·1mo ago

How AI Training Scales: Gradient Noise Scale Predicts Batch Parallelizability

OpenAI researchers report that the gradient noise scale (a statistic measuring the variance of the gradient relative to its mean) reliably predicts the optimal batch size and degree of parallelizability across a wide range of neural network training tasks. The finding suggests that more complex tasks with noisier gradients can benefit from increasingly large batch sizes, removing a potential ceiling on scaling. The work frames training dynamics as a systematic, measurable process rather than an empirical art.
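
A minimal sketch of the statistic follows. It assumes the "simple" noise scale, defined as the trace of the per-example gradient covariance divided by the squared norm of the mean gradient, and computes it directly from a handful of hypothetical per-example gradients. Estimating it in a real training run typically uses gradient norms measured at different batch sizes instead, which this sketch does not attempt.

```python
def simple_noise_scale(per_example_grads):
    """Estimate tr(Sigma) / |G|^2 from a list of per-example gradient vectors."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    # Mean gradient G, one component per parameter.
    mean = [sum(g[d] for g in per_example_grads) / n for d in range(dim)]
    # Trace of the covariance: sum of unbiased per-component variances.
    trace_cov = sum(
        sum((g[d] - mean[d]) ** 2 for g in per_example_grads) / (n - 1)
        for d in range(dim)
    )
    mean_sq_norm = sum(m * m for m in mean)
    return trace_cov / mean_sq_norm


# Hypothetical per-example gradients for a two-parameter model.
grads = [[1.0, 0.0], [3.0, 0.0], [2.0, 2.0], [2.0, -2.0]]
noise_scale = simple_noise_scale(grads)
print(noise_scale)
```

A larger value means individual examples disagree more relative to the average direction, which is the regime where averaging over a bigger batch still pays off.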

7·OpenAI Blog·1mo ago

AI and Compute: OpenAI Analysis of Exponential Growth in Training Compute Since 2012

OpenAI published an analysis in May 2018 showing that compute used in the largest AI training runs has been doubling every 3.4 months since 2012, far outpacing Moore's Law's 2-year doubling period. Over the 2012–2018 period, this metric grew by more than 300,000x. The analysis frames compute scaling as a key driver of AI progress and argues for preparing for systems with capabilities well beyond those of the time.
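
The two headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the summary (a 3.4-month doubling time, 300,000x growth, and a 2-year doubling period for comparison) and makes no claim about how OpenAI fitted its trend.

```python
import math

growth = 300_000          # total growth factor quoted above
doubling_months = 3.4     # doubling time quoted above
moore_months = 24.0       # 2-year doubling period, for comparison

# Number of doublings needed to reach the quoted growth factor.
doublings = math.log2(growth)

# Time those doublings take at 3.4 months each.
months = doublings * doubling_months

# Growth a 2-year doubling period would give over the same span.
moore_growth = 2 ** (months / moore_months)

print(f"{doublings:.1f} doublings over {months:.0f} months")
print(f"a 2-year doubling period over the same span: about {moore_growth:.0f}x")
```

About 18 doublings at 3.4 months each span a little over five years, which is consistent with growth of that size falling inside the 2012 to 2018 window, while a 2-year doubling period over the same span yields only single-digit growth.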

6·arXiv · cs.CL·1mo ago

ChunkFT: Memory-Efficient Full Fine-Tuning via Byte-Streamed Chunk Optimization

ChunkFT is a fine-tuning framework that reformulates full-parameter optimization around a dynamically activated working set of sub-tensors, enabling gradient computation without dense gradient materialization. It achieves full-parameter fine-tuning of a 7B model in 13.72GB GPU memory on a single RTX 4090, and scales Llama 3-70B fine-tuning to 2×H800 GPUs. Downstream evaluations on language understanding, math reasoning, and MT-Bench show ChunkFT matches or exceeds full-parameter fine-tuning quality while outperforming existing memory-efficient baselines such as LoRA-class methods. A theoretical convergence analysis in the deterministic setting is also provided.

7·arXiv · cs.LG·26d ago

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Complete-muE is a framework for transferring hyperparameters across dense FFN and Mixture-of-Experts (MoE) transformer architectures, addressing limitations of existing tools like μP and SDE that cannot handle simultaneous architecture and token-per-expert changes. It uses a two-bridge system: Bridge I maps dense FFN to Dense MoE via active-width μP with normalized router scale, and Bridge II maps Dense MoE to sparse MoE via activated-expert scaling with a first-order SDE correction. The practical outcome is a 'tune dense once, transfer to all' recipe that enables near-optimal hyperparameter reuse across MoE configurations without costly re-tuning. Experiments on language model and diffusion model pretraining confirm stable hyperparameter optima across architectures and parameter counts.

9·Anthropic News·23d ago

Anthropic raises $65B in Series H funding at $965B post-money valuation

Anthropic has closed a $65 billion Series H round led by Altimeter Capital, Dragoneer, Greenoaks, and Sequoia Capital, valuing the company at $965 billion post-money. The company reports annualized run-rate revenue crossing $47 billion and highlights major compute expansion agreements with Amazon (up to 5 GW), Google/Broadcom (5 GW of TPU capacity), and SpaceX (Colossus GPU access). Strategic infrastructure partners Micron, Samsung, and SK hynix join the round alongside a broad syndicate of institutional investors. Funding is earmarked for safety and interpretability research, compute scaling, and product expansion including Claude Code and Cowork.