Entity · model

Gemini-2.5-Pro

SupersededGemini 3.1 Pro is Google's current flagship model (supersedes Gemini 2.5 Pro). See Gemini-3.1-Pro →

modelactivegemini-2-5-pro-f222de52·17 events·first seen May 18, 2026

Aliases: Gemini-2.5-Pro, Gemini-1.5 Pro, Gemini 2.5 Pro

Co-occurring entities

More like this (12)

Gemini-3.1-Pro Gemini 3.5 Pro Gemini-3.0-Pro Gemini 2.5 Gemini 3.1 Pro Gemini 1.5 Pro Gemini-2.5-Flash-Lite Gemini-3 Pro Gemini Gemini 2.5 Computer Use Gemini 3.1 Pro Thinking Gemini 3.5 Flash

Recent events (17)

6arXiv · cs.CL·Jul 1, 2026·source ↗

NCP-ExploreToM benchmark evaluates LLMs' ability to induce belief states through action rather than conversation

Researchers introduce Non-Conversational Planning Theory of Mind (NCP-ToM) and a novel evaluation framework, NCP-ExploreToM, which tests whether LLMs can manipulate belief states in other agents by moving objects or directing characters rather than through dialogue. Six frontier models (including GPT-5, Gemini 2.5 Pro, and Claude 4 series) and human participants were evaluated across 600 task instances; GPT-5 achieved ~80% success and was the only model to outperform humans, though it remained less robust across contexts. All models, like humans, performed better at inducing true belief states than false ones, which the authors flag as a positive alignment signal. The work highlights both emerging agentic social-reasoning capabilities and new manipulation/misinformation risks that passive ToM benchmarks fail to capture.

Evaluation and Benchmarking AI Safety Research NCP-ExploreToM Claude Google +5 more

3arXiv · cs.CL·Jun 24, 2026·source ↗

First Turkish phone scam detection dataset evaluated across seven LLMs in multi-modal settings

Researchers introduce the first public multi-modal dataset of 100 aligned audio-transcript pairs of Turkish scam and benign phone calls, evaluating seven LLMs (Gemini 2.5 Flash/Flash-Lite/Pro, GPT-4o, Qwen Max/Plus/Turbo) under three input conditions. Transcript-based inputs consistently outperform direct audio processing, while human-corrected and uncorrected transcripts perform comparably. The work addresses a gap in low-resource language safety research and highlights the need for linguistically inclusive fraud detection systems.

AI Safety Research Multimodal Progress Google GPT-4o Gemini-2.5-Flash-Lite +3 more

5arXiv · cs.CL·Jun 10, 2026·source ↗

Audit finds cultural translation failures and diversity collapse in LLM-adapted math word problems across 7 languages

Researchers audited how Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro adapt 60 English math word problems into seven languages spanning South Asia and Italy, annotating 6,489 entity transformations. Models agreed on transformation type only 62.5% of the time and on specific substitutions in just 33.5% of cases, meaning model choice substantially shapes the cultural world students encounter. All 21 language-model combinations exhibited 'entropy collapse'—adaptations compressed rather than expanded cultural diversity—and models produced systematic regional misattributions (e.g., Bangladeshi currency for Indian Bengali students) and cross-cultural contamination (e.g., egg hunts framed as Eid activities). The study highlights that surface plausibility masks deeper corpus-level failures invisible in individual translations.

Evaluation and Benchmarking AI Safety Research Claude Opus 4.6 Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions Google +4 more

7The Batch·Jun 5, 2026·source ↗

Fine-tuning LLMs on summary-expansion tasks strips copyright alignment guardrails, enabling up to 92% verbatim book reproduction

Researchers from Stony Brook University, Carnegie Mellon University, and Columbia Law School fine-tuned DeepSeek-V3.1, Gemini 2.5 Pro, and GPT-4o on a task of expanding plot summaries into prose paragraphs, finding that this caused models to regurgitate up to 91.9% of verbatim text from books in their pretraining data. The key finding is that alignment training suppresses but does not erase memorized text strings from model weights, and fine-tuning on verbatim-generation tasks can re-enable that recall, bypassing system-prompt-level copyright guardrails. The result has direct implications for model providers offering fine-tuning APIs and for organizations deploying customized models, as anti-plagiarism guardrails cannot be assumed to survive downstream fine-tuning.

AI Safety Research Regulatory Developments Carnegie Mellon University Xinyue Liu DeepSeek V4 +7 more

5The Batch·Jun 1, 2026·source ↗

Persona Generators: Evolutionary LLM Method for Diverse Synthetic Human Personas

Google researchers Davide Paglieri, Logan Cross, and colleagues propose Persona Generators, a system that uses the AlphaEvolve evolutionary algorithm to generate code that produces 25 diverse persona prompts covering a broad range of attitudes and opinions. The method iteratively optimizes persona prompt diversity using six metrics, outperforming Nemotron Personas (82% vs 76% coverage of possible responses) and a Concordia memory-based baseline (46%). The system uses Gemini 2.5 Pro for questionnaire generation and Gemma 3-27B-IT for persona simulation via the Concordia agent library. The approach reframes persona generation as a coverage optimization problem rather than a data-matching one, enabling more representative synthetic user populations for product research.

Evaluation and Benchmarking Agent and Tool Ecosystem Gemma 2 9B Persona Generators Davide Paglieri +6 more

7Anthropic News·Jun 1, 2026·source ↗

Anthropic Publishes Political Even-Handedness Evaluation for Claude, Open-Sources Methodology

Anthropic has released a detailed account of how it trains and evaluates Claude for political even-handedness, including character traits instilled via reinforcement learning since early 2024 and a new automated evaluation methodology. The evaluation tests thousands of prompts across hundreds of political stances and benchmarks Claude Sonnet 4.5 against GPT-5, Llama 4, Grok 4, and Gemini 2.5 Pro, finding Claude comparable to Grok 4 and Gemini 2.5 Pro and more even-handed than GPT-5 and Llama 4. Anthropic is open-sourcing the evaluation framework to encourage shared industry standards for measuring political bias. The post also discloses the specific system prompt language used on Claude.ai to enforce even-handed behavior.

Frontier Model Releases Evaluation and Benchmarking claude.ai Claude Sonnet 4.5 Grok 4 +8 more

7Mistral Ai News·Jun 1, 2026·source ↗

Mistral AI Releases Devstral Medium and Devstral Small 1.1 for Agentic Coding

Mistral AI, in collaboration with All Hands AI, has released two new agentic coding models: Devstral Small 1.1 (24B parameters, Apache 2.0, 53.6% on SWE-Bench Verified) and Devstral Medium (61.6% on SWE-Bench Verified, API-only). Devstral Medium is positioned as a cost-performance leader, claiming to surpass Gemini 2.5 Pro and GPT-4.1 at roughly one-quarter the price, priced at $0.4/M input and $2/M output tokens. Devstral Small 1.1 sets a new state-of-the-art among open models for code agents without test-time scaling, and supports both Mistral function calling and XML formats for broad agentic scaffold compatibility.

Frontier Model Releases Evaluation and Benchmarking Devstral 2 Small Mistral AI All Hands AI +10 more

5The Batch·Jun 1, 2026·source ↗

Researchers at UT-Austin and Google Model Human Decision-Making in Rock-Paper-Scissors

Researchers from UT-Austin and Google used AlphaEvolve, an evolutionary code-optimization method, to synthesize interpretable Python programs that predict move-by-move decisions of LLMs and humans playing rock-paper-scissors against bots. They found that Gemini 2.5 Pro, Gemini 2.5 Flash, and GPT-4.1 share similar sequential-pattern-tracking strategies that are more systematic than typical human play, while GPT-OSS 120B and humans relied on simpler opponent-move-frequency heuristics. The study demonstrates that code synthesis from behavioral data can serve as an interpretability tool for LLM decision-making, revealing that LLMs do not simply mimic human strategies.

Evaluation and Benchmarking AI Safety Research Google Gemini-2.5-Flash-Lite AlphaEvolve +6 more

8Google Deepmind Blog·May 19, 2026·source ↗

Introducing Gemini 2.5 Flash

Google DeepMind has released Gemini 2.5 Flash, described as their first fully hybrid reasoning model. The model allows developers to toggle 'thinking' (extended reasoning) on or off, combining standard and chain-of-thought inference modes in a single model. It is available to developers and represents a new architectural approach to balancing reasoning depth with inference cost.

Long Context Evolution Frontier Model Releases Gemini-2.5-Flash-Lite Google DeepMind Gemini-2.5-Pro +3 more

6Google Deepmind Blog·May 19, 2026·source ↗

Updated Gemini 2.5 Pro Preview with Improved Coding Capabilities

Google DeepMind has released an updated version of Gemini 2.5 Pro Preview with enhanced coding capabilities, specifically targeting the development of rich, interactive web applications. The announcement comes from DeepMind's official blog, indicating a focused improvement on code generation and web app development use cases. No detailed technical specifics or benchmark results are provided in the body text.

Frontier Model Releases Agent and Tool Ecosystem Google DeepMind Gemini-2.5-Pro

7Google Deepmind Blog·May 19, 2026·source ↗

Gemini 2.5 Pro Preview: Updated Version with Improved Coding Performance

Google DeepMind has released an updated preview version of Gemini 2.5 Pro ahead of its originally planned schedule, citing strong developer adoption and usage. The update focuses on improved coding performance. The early release reflects DeepMind's responsiveness to developer demand for the model.

Frontier Model Releases Agent and Tool Ecosystem Google DeepMind Gemini-2.5-Pro

7Google Deepmind Blog·May 19, 2026·source ↗

Gemini 2.5 Pro and Flash Updates: Deep Think Reasoning Mode and Capability Improvements

DeepMind announces updates to Gemini 2.5 Pro and Gemini 2.5 Flash, highlighting continued developer adoption for coding tasks. A new experimental feature called Deep Think introduces an enhanced reasoning mode for Gemini 2.5 Pro. Gemini 2.5 Flash also receives a capability update in this release cycle.

Frontier Model Releases Evaluation and Benchmarking Gemini-2.5-Flash-Lite Google DeepMind Gemini-2.5-Pro +2 more

8Google Deepmind Blog·May 19, 2026·source ↗

Gemini 2.5 Family Expansion: Flash and Pro GA, Flash-Lite Introduced

Google DeepMind has made Gemini 2.5 Flash and Gemini 2.5 Pro generally available, while simultaneously introducing Gemini 2.5 Flash-Lite, described as the most cost-efficient and fastest model in the 2.5 family. The announcement marks the full productization of the Gemini 2.5 generation. Flash-Lite targets latency- and cost-sensitive deployment scenarios.

Frontier Model Releases Inference Economics Gemini-2.5-Flash-Lite Google DeepMind Gemini-2.5-Pro +1 more

8Google Deepmind Blog·May 19, 2026·source ↗

Gemini 2.5: Updates to our family of thinking models

Google DeepMind has announced updates to the Gemini 2.5 model family, including Gemini 2.5 Pro reaching stable status, Gemini 2.5 Flash becoming generally available, and a new Gemini 2.5 Flash-Lite entering preview. These releases mark the maturation of DeepMind's 'thinking model' line with enhanced performance and accuracy. The updates span multiple tiers of the Gemini 2.5 family, from the flagship Pro to the lightweight Flash-Lite variant.

Long Context Evolution Frontier Model Releases Gemini-2.5-Flash-Lite Google DeepMind Gemini-2.5-Pro +1 more

8Google Deepmind Blog·May 19, 2026·source ↗

Introducing the Gemini 2.5 Computer Use model

Google DeepMind has released a preview of a specialized Computer Use model built on Gemini 2.5 Pro, available via API. The model is designed to power agents that can interact with user interfaces, extending Gemini 2.5 Pro's capabilities into computer-use agentic tasks. This positions Google as a direct competitor to Anthropic's Claude Computer Use and similar offerings in the emerging computer-use agent space.

Frontier Model Releases Agent and Tool Ecosystem Google DeepMind Gemini-2.5-Pro Gemini 2.5 Computer Use +1 more

7Mistral Ai News·May 18, 2026·source ↗

Pixtral Large: Mistral AI's 124B Open-Weights Multimodal Model

Mistral AI released Pixtral Large, a 124B open-weights multimodal model built on Mistral Large 2, featuring a 1B parameter vision encoder and 128K context window supporting at least 30 high-resolution images. The model claims state-of-the-art results on MathVista, DocVQA, and ChartQA, outperforming GPT-4o and Gemini-1.5 Pro on several benchmarks, and leads the LMSys Vision Leaderboard among open-weights models by ~50 ELO points. Simultaneously, Mistral updated its text model to Mistral Large 24.11 with improvements in long-context understanding, function calling, and RAG/agentic workflows. Note: the model has since been deprecated and replaced by newer Mistral vision models.

Frontier Model Releases Evaluation and Benchmarking Google Cloud Mistral AI MT-Bench +15 more

8Qwen Research·May 18, 2026·source ↗

Qwen3 Release: Flagship 235B MoE and Full Model Family Announced

Alibaba's Qwen team has released Qwen3, a new family of large language models including the flagship Qwen3-235B-A22B mixture-of-experts model. The flagship model claims competitive benchmark performance against DeepSeek-R1, OpenAI o1/o3-mini, Grok-3, and Gemini-2.5-Pro on coding, math, and general capabilities. A smaller MoE variant, Qwen3-30B-A3B, reportedly outperforms QwQ-32B despite using only one-tenth the activated parameters, and the 4B model is said to match Qwen2.5's larger models. Models are available across Hugging Face, ModelScope, and Kaggle.

Frontier Model Releases Evaluation and Benchmarking Alibaba Qwen DeepSeek V4 Qwen3-30B-A3B +10 more