Entity · model

GPT-4o mini

modelactivegpt-4o-mini-cba1cb65·13 events·first seen May 18, 2026

Aliases: GPT-4o mini, gpt-4o-mini-tts, GPT o4-mini

Co-occurring entities

More like this (12)

GPT-4.1 mini GPT-5.4 mini GPT-4b micro GPT-4o GPT-4.1 nano GPT-4V GPT-4o mini Transcribe GPT-4 GPT-4.1 GPT-5.4 nano GPT-4 Turbo GPT-5.5

Recent events (13)

3arXiv · cs.CL·3d ago·source ↗

Human-in-the-loop corpus for LLM-based simplification of scientific summaries released

Researchers present a two-phase human-in-the-loop workflow for simplifying scientific summaries using GPT-4o-mini, building on the SciSummNet corpus. Phase 1 collects comprehensibility judgments from non-specialist STEM readers, while Phase 2 has CS experts produce reference simplifications informed by that feedback. The resulting corpus with human judgments and automatic evaluation results is released to support training and benchmarking of scientific text simplification systems. Key findings include a preference for GPT-generated simplifications on comprehensibility and the importance of preserving domain-specific terminology.

Evaluation and Benchmarking GPT-4o mini A Human-in-the-Loop Corpus for LLM-Based Simplification of Scientific Summaries SciSummNet +1 more

5The Batch·Jul 24, 2026·source ↗

Stanford/Together AI study finds retrieval is the weakest link for LLM web-search agents

Researchers at Stanford University and Together AI tested six LLMs equipped with web-search tools on daily news questions across six languages, finding that retrieval failures account for the majority of errors (38.8%) rather than reasoning or comprehension failures. Top models exceeded 90% accuracy on well-formed English multiple-choice questions, but performance degraded significantly for Hindi, free-response formats, and questions containing false premises. The study identifies three retrieval improvement levers—indexing coverage, source ranking, and multilingual query handling—and suggests retrieval optimization may yield larger gains than model scaling for time-sensitive queries.

Evaluation and Benchmarking Agent and Tool Ecosystem Gemini 3.5 Pro GPT-4o mini Stanford University +10 more

7The Batch·Jul 17, 2026·source ↗

FutureHouse's Robin agent autonomously proposes drug repurposing candidates, validated in human cell experiments

FutureHouse, University of Oxford, and Fordham University released Robin, an open-source AI agent that autonomously identifies existing drugs that could treat a given disease by iteratively hypothesizing mechanisms, designing experiments, and ranking drug candidates using literature-search sub-agents. In a demonstration targeting dry age-related macular degeneration, Robin identified two drugs (Y-27632 and Ripasudil) that produced roughly 1.75–2x increases in RPE phagocytosis in human cell experiments. The pipeline uses GPT o4-mini for most language tasks and Claude 3.7 Sonnet for pairwise ranking, with human involvement limited to naming the disease and running lab experiments. The work represents a concrete, experimentally validated instance of agentic AI accelerating drug repurposing research.

Agent and Tool Ecosystem Falcon Fordham University GPT-4o mini +11 more

6arXiv · cs.LG·Jul 9, 2026·source ↗

Co-LMLM: Continuous-query limited memory language models outperform vanilla LLMs on factual tasks at small scale

Researchers introduce CO-LMLM, a limited memory language model that externalizes factual knowledge to a knowledge base during pretraining and retrieves it at inference via continuous vector queries paired with human-readable text values. The approach removes prior restrictions to relational knowledge bases and Wikipedia-only data by introducing an annotation pipeline for arbitrary text. At 360M parameters, CO-LMLM achieves lower perplexity than models trained on 40x more data and SimpleQA factual performance comparable to GPT-4o mini and above Claude Sonnet 4.5, suggesting significant efficiency gains for factual grounding.

Evaluation and Benchmarking Open Weights Progress Co-LMLM: Continuous-Query Limited Memory Language Models GPT-4o mini Claude Sonnet 4.5 +4 more

3Openai Release Notes·Jul 1, 2026·source ↗

OpenAI updates gpt-4o-mini-tts and gpt-4o-mini-transcribe slugs to 2025-12-15 snapshots

OpenAI has updated the floating model slugs for gpt-4o-mini-tts and gpt-4o-mini-transcribe to point to their 2025-12-15 snapshots, with the previous March 2025 snapshots remaining accessible via versioned identifiers. Notably, OpenAI now recommends gpt-4o-mini-transcribe over gpt-4o-transcribe for best transcription results, signaling a quality improvement in the mini-tier audio model.

Multimodal Progress GPT-4o mini GPT-4o mini Transcribe OpenAI

5arXiv · cs.CL·Jun 17, 2026·source ↗

Study identifies 'synthetic lived experience paradox' in peer-like AI caregiver support

Researchers examine how LLMs prompted to sound peer-like generate language implying lived experience they cannot authentically possess, studying this in the context of family caregivers of Alzheimer's/ADRD patients. Using caregiver support exchanges from online communities and responses from LLaMA, GPT-4o-mini, and MedGemma, the study finds a 'narrative authenticity gap': AI captures emotional work of peer support but can fabricate experiential grounding. Psycholinguistic analysis shows human peers use significantly more first-person and past-focused language than AI. The authors argue caregiver-support AI needs mechanisms to distinguish supportive framing from fabricated lived experience.

AI Safety Research Alignment and RLHF GPT-4o mini Google Llama +4 more

6arXiv · cs.CL·Jun 17, 2026·source ↗

Structural role injection via Handlebars triple-brace interpolation in LLM prompts: empirical analysis across delimiter families and models

A new arXiv paper demonstrates that Handlebars templating's HTML auto-escaping—the default in Microsoft Semantic Kernel—provides uneven protection against structural role injection attacks, where attacker-controlled data carries chat role delimiters to forge higher-privilege turns. The authors conduct 5,760 trials across seven delimiter families, two attack objectives, and four models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5), finding that HTML escaping neutralizes angle-bracket-based delimiters (ChatML, Llama-3, XML) but leaves colon- and Markdown-based families fully exposed. GPT-3.5 Turbo follows task-hijack instructions in 97% of raw and 91% of escaped trials; Claude Haiku 4.5 resists both objectives almost entirely. The paper concludes that HTML escaping cannot substitute for structural separation of instruction and data.

AI Safety Research Agent and Tool Ecosystem Microsoft Semantic Kernel GPT-3.5 Turbo GPT-4.1 mini +7 more

7arXiv · cs.AI·Jun 5, 2026·source ↗

Recuse Signal: In-band access-deny standard for LLM agents shows 100% compliance in pilot

Researchers propose and empirically test a lightweight 'Recuse Signal' — a cooperative, in-band deny mechanism analogous to robots.txt — that servers can emit over existing protocol channels (SSH banners, PostgreSQL NOTICEs) to ask autonomous LLM agents to voluntarily withdraw. A controlled pilot using GPT-4o, GPT-4o-mini, and Claude Code found 100% recusal when the signal was present versus 100% task completion in controls, though the signal behaved cooperatively rather than absolutely: explicit operator-authorization framing caused the most capable model to override the signal. The work defines an open mini-standard, releases two low-footprint adapters, and frames the mechanism as a governance control rather than a security boundary.

AI Safety Research Agent and Tool Ecosystem Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals GPT-4o mini GPT-4o +4 more

7Mistral Ai News·Jun 1, 2026·source ↗

Mistral Small 3: 24B Latency-Optimized Open-Weight Model Released Under Apache 2.0

Mistral AI has released Mistral Small 3, a 24B-parameter instruction-tuned model optimized for low latency, achieving over 81% on MMLU at 150 tokens/s on a single GPU. The model is competitive with Llama 3.3 70B and Qwen 32B while being more than 3x faster on equivalent hardware, and is released under Apache 2.0 for both pretrained and instruction-tuned checkpoints. It is explicitly not trained with RL or synthetic data, positioning it as a base model for community fine-tuning and reasoning capability development. Deployment targets include local inference on consumer hardware (RTX 4090, MacBook 32GB RAM), agentic function calling, and domain-specific fine-tuning.

Frontier Model Releases Open Weights Progress Mistral AI Mistral Small 4 Ollama +12 more

7Mistral Ai News·Jun 1, 2026·source ↗

Mistral Small 3.1: Multimodal, 128k Context, Apache 2.0 Open-Weight Model

Mistral AI releases Mistral Small 3.1, a ~24B parameter model with multimodal understanding, 128k token context window, and claimed best-in-class performance among small models, outperforming Gemma 3 and GPT-4o Mini on text, multimodal, and multilingual benchmarks. The model runs on a single RTX 4090 or 32GB RAM Mac at 150 tokens/second and is released under Apache 2.0 license with both base and instruct checkpoints. It is available on HuggingFace, Mistral's La Plateforme API, and Google Cloud Vertex AI, with NVIDIA NIM and Azure AI Foundry support coming soon. The release targets enterprise and on-device use cases including document verification, agentic workflows, and domain fine-tuning.

Long Context Evolution Frontier Model Releases Mistral AI Mistral Small 4 MT-Bench +12 more

6arXiv · cs.CL·May 22, 2026·source ↗

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases Evaluation and Benchmarking Gemini 3.5 Pro BBC News GPT-4o mini +11 more

7Openai Blog·May 20, 2026·source ↗

GPT-4o mini: advancing cost-efficient intelligence

OpenAI announced GPT-4o mini, a smaller and more cost-efficient version of GPT-4o, targeting applications that require lower latency and reduced inference costs. The model is positioned to outperform competing small models on key benchmarks while maintaining multimodal capabilities. It replaces GPT-3.5 Turbo as OpenAI's recommended entry-level model for cost-sensitive deployments.

Frontier Model Releases Inference Economics GPT-3.5 Turbo GPT-4o mini GPT-4o +2 more

6Berkeley Ai Research (Bair) Blog·May 18, 2026·source ↗

SPEX and ProxySPEX: Scalable Interaction Discovery for LLM Interpretability

Researchers from BAIR introduce SPEX (Spectral Explainer) and ProxySPEX, algorithms for identifying influential feature, data, and model-component interactions in LLMs at scale. The approach exploits sparsity, low-degreeness, and hierarchy properties to reframe interaction discovery as a sparse recovery problem using tools from signal processing and coding theory. ProxySPEX achieves comparable performance to SPEX with roughly 10x fewer ablations by leveraging hierarchical structure. The methods are evaluated on feature attribution (sentiment analysis), data attribution, and mechanistic interpretability tasks, outperforming marginal methods like LIME at long context lengths.

Long Context Evolution Evaluation and Benchmarking GPT-4o mini Faith-Shap LIME +5 more