Learning to Reason with LLMs
Published on the OpenAI blog on September 12, 2024, this post introduces the o1 model series (code-named 'Strawberry' before release). The models are trained with reinforcement learning to produce a chain of thought before answering, which lets them perform complex multi-step reasoning. OpenAI reports significantly improved performance on math, science, and coding benchmarks.
Related events (8)
Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages
Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments (context-sensitive inferences beyond literal logical meaning) that humans apply effortlessly. Performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting that pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.
Reasoning in Memory (RiM): Latent Reasoning via Working Memory Blocks in LLMs
RiM introduces a latent reasoning method that replaces autoregressive chain-of-thought token generation with fixed sequences of special 'memory block' tokens, allowing LLMs to perform internal computation without externalizing intermediate steps. These memory blocks are processed in a single forward pass rather than generated autoregressively, improving compute efficiency at test time. Training uses a two-stage curriculum: first grounding memory blocks by predicting explicit reasoning steps, then discarding step-level supervision and refining answers iteratively. Experiments across multiple model families and sizes show RiM matches or exceeds existing latent reasoning methods.
Jupyter Agents: Training LLMs to Reason with Notebooks
Hugging Face published a blog post on training LLMs to operate as Jupyter notebook agents, enabling models to reason and execute code iteratively within notebook environments. The work covers dataset construction, training methodology, and evaluation for notebook-native agentic behavior. This represents a step toward LLMs that can conduct multi-step data analysis and experimentation autonomously within a familiar scientific computing interface.
Mistral AI Releases Magistral: First Reasoning Model in Open and Enterprise Variants
Mistral AI announced Magistral, its first reasoning model, released in two variants: Magistral Small (24B parameters, open-weight, Apache 2.0) and Magistral Medium (enterprise, closed). Magistral Medium scores 73.6% on AIME 2024 (90% with majority voting @64), while Magistral Small scores 70.7% (83.3% with majority voting @64). Key differentiators include native multilingual chain-of-thought reasoning across eight major languages, transparent and traceable reasoning steps, and up to 10x faster token throughput in Le Chat via Flash Answers. The release is accompanied by a research paper covering the training infrastructure, the reinforcement learning algorithm, and novel observations for training reasoning models.
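The "majority voting @64" figures refer to sampling many answers per problem (here 64) and scoring only the most frequent one. The following is a minimal Python sketch of that scoring scheme, assuming final answers have already been extracted and normalized to strings. The function names are illustrative and are not taken from Mistral's paper or code.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the sampled completions.

    Ties are resolved in favor of the answer that appeared first,
    which is how Counter.most_common orders equal counts.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def maj_at_k_accuracy(samples_per_problem, references):
    """Fraction of problems whose majority-voted answer equals the reference.

    samples_per_problem: list of lists, k sampled answers per problem.
    references: list of reference answers, one per problem.
    """
    if len(samples_per_problem) != len(references):
        raise ValueError("one reference is required per problem")
    correct = sum(
        majority_vote(samples) == reference
        for samples, reference in zip(samples_per_problem, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    # Toy data with k=3 samples per problem instead of 64.
    samples = [["42", "42", "17"], ["5", "7", "7"]]
    references = ["42", "5"]
    print(maj_at_k_accuracy(samples, references))
```

In the toy data, the vote recovers the right answer for the first problem but settles on a wrong one for the second, which shows that voting helps only when the correct answer is also the most common sample.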
SmolLM3: Hugging Face Releases Small Multilingual Long-Context Reasoning Model
Hugging Face has released SmolLM3, a compact language model designed for multilingual support, long-context processing, and reasoning capabilities. The model targets the small/efficient model segment while incorporating reasoning features typically associated with larger models. This release continues Hugging Face's SmolLM series aimed at capable but deployable open-weight models.
Open-source LLMs as LangChain Agents
This Hugging Face blog post explores using open-source LLMs as agents within the LangChain framework. It examines the capability of various open-weight models to perform tool use, reasoning, and multi-step task execution in agentic settings. The post likely benchmarks or compares several models on agent-relevant tasks, providing practical guidance for deploying open-source alternatives to proprietary models in agent pipelines.
ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning
ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.
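To illustrate how a rubric can turn an all-or-nothing outcome into a dense reward, here is a toy Python sketch. It is not ExpRL's implementation: the `RubricItem` type, the keyword-matching check, and the weights are all hypothetical stand-ins, and a real system would presumably use a much stronger grader than substring matching. The only point is that a response earning some rubric items receives partial credit even when the final answer is missing.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    description: str  # what the grader looks for (hypothetical)
    keyword: str      # toy stand-in for a real grading check
    weight: float     # relative importance of this item


def rubric_reward(response, rubric):
    """Return a reward in [0, 1]: the weighted share of rubric items satisfied."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    text = response.lower()
    earned = sum(item.weight for item in rubric if item.keyword.lower() in text)
    return earned / total


if __name__ == "__main__":
    # Hypothetical rubric built offline from a hidden reference solution.
    rubric = [
        RubricItem("sets up the quadratic", "x^2", 1.0),
        RubricItem("applies the quadratic formula", "discriminant", 1.0),
        RubricItem("states the final answer", "x = 3", 2.0),
    ]
    partial = "Write the equation as x^2 - 2x - 3 = 0 and compute the discriminant."
    print(rubric_reward(partial, rubric))
```

A sparse outcome reward would give this partial response zero, whereas the rubric score credits the two intermediate steps it completed.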
Thinking with images
OpenAI announced a new capability allowing its reasoning models to incorporate images directly into their chain-of-thought process, enabling visual reasoning during intermediate thinking steps rather than only at input/output boundaries. This extends multimodal reasoning to the internal computation layer, potentially improving performance on tasks requiring visual analysis combined with multi-step reasoning. The announcement comes from OpenAI's official blog, indicating a product-level capability update.



