Entity · model

Gemini-2.5-Flash-Lite

model · active · gemini-2-5-flash-lite-e7adbd93 · 21 events · first seen May 18, 2026

Aliases: Gemini-2.5-Flash-Lite, Gemini 2.5 Flash, Gemini 2.5 Flash-Lite, Gemini 2.0 Flash, Gemini 2.0 Flash-Lite, Gemini 2.5 Flash Lite

More like this (12)

Gemini 3.5 Flash-Lite, Gemini 3.5 Flash, Gemini 3.1 Flash Live, Gemini 2.5, Gemini Flash 3.5, Gemini-2.5-Pro, Gemini 3 Flash, Gemini 3.6 Flash, Gemini 3.1 Flash Image, Gemini Omni Flash, Gemini 3.1 Flash Live Preview, Gemini 3.5 Flash Cyber

Recent events (21)

6 · The Batch · Jul 17, 2026

MIT and CMU introduce Puppet benchmark to measure LLM belief manipulation in users

Researchers at MIT and Carnegie Mellon University developed Puppet, a benchmark that measures how much LLMs actually shift users' beliefs after conversation, as opposed to detecting manipulative language patterns. The study tracked over 1,000 users interacting with GPT-4o under various prompting conditions and found high variability in belief shifts, with a median change of 3.3 but a standard deviation of ~22. Existing manipulation detectors showed near-zero correlation with actual belief change, while LLMs like GPT-4o achieved moderate correlation (0.436) when estimating belief shifts from conversation transcripts alone. The work argues for direct belief-shift measurement as a more valid approach to assessing LLM persuasive risk.

Evaluation and Benchmarking · AI Safety Research · MIT · Carnegie Mellon University · Llama 3.1 70B · +7 more

5 · arXiv · cs.CL · Jul 13, 2026

GRACE: Graph-Regularized Agentic Context Evolution for reliable long-horizon instruction updates

Researchers introduce GRACE, a method that maintains a deployed LLM agent's persistent system-level instructions as a typed semantic graph rather than flat text, enabling local verification of updates within typed node neighborhoods. Evaluated on a telecom agent harness derived from τ²-bench under distribution shift, GRACE improves pass³ reliability from 0.091 (Gemini 2.5 Flash zero-shot) to 0.673±0.136, surpassing a Gemini 3.1 Pro zero-shot reference of 0.242. The work identifies structural substrate and consolidation mechanisms as key requirements for reliable long-horizon agentic context evolution. The flat-text baseline finishes at 0.191, underscoring the practical gap GRACE addresses.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Gemini 3.1 Pro · Google · Gemini-2.5-Flash-Lite · +2 more
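
The GRACE summary above reports "pass³ reliability" without defining it. A minimal sketch, assuming it follows the pass^k convention used by the τ-bench family (the probability that k independent trials of the same task all succeed, averaged over tasks). The function name and the per-task success counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k from n_trials runs per task.

    For a task with c successes out of n_trials, the unbiased estimate of
    "k sampled trials all succeed" is C(c, k) / C(n_trials, k).
    The benchmark score is the mean of that value over tasks.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical data: 4 tasks, 4 trials each, number of successful trials per task.
successes = [4, 3, 1, 0]
print(pass_hat_k(successes, n_trials=4, k=1))  # plain average success rate
print(pass_hat_k(successes, n_trials=4, k=3))  # stricter: 3 trials must all pass
```

Under this assumed definition, a task that succeeds only some of the time contributes little or nothing at k=3, which is why such a score can sit far below the single-trial pass rate.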

3 · arXiv · cs.CL · Jul 3, 2026

HULAT2 multi-agent LangGraph system for Spanish Easy-to-Read text simplification at MER-TRANS 2026

Researchers from HULAT2-UC3M describe their submission to the MER-TRANS 2026 shared task on multilingual Easy-to-Read translation, using a LangGraph-based multi-agent workflow combining Gemini 2.5 Flash and RigoChat-7B-v2. The best run (RUN1) achieved a SARI score of 44.05 using Event-Condition-Action routing and internal quality signals, outperforming a LoRA-adapted generate-evaluate-regenerate baseline. Results show signal-guided multi-agent routing outperforms linear regeneration, while adding lexical support did not automatically improve reference-based scores.

Agent and Tool Ecosystem · HULAT2-UC3M · SARI · LoRA · +4 more

3 · arXiv · cs.CL · Jun 24, 2026

First Turkish phone scam detection dataset evaluated across seven LLMs in multi-modal settings

Researchers introduce the first public multi-modal dataset of 100 aligned audio-transcript pairs of Turkish scam and benign phone calls, evaluating seven LLMs (Gemini 2.5 Flash/Flash-Lite/Pro, GPT-4o, Qwen Max/Plus/Turbo) under three input conditions. Transcript-based inputs consistently outperform direct audio processing, while human-corrected and uncorrected transcripts perform comparably. The work addresses a gap in low-resource language safety research and highlights the need for linguistically inclusive fraud detection systems.

AI Safety Research · Multimodal Progress · Google · GPT-4o · Gemini-2.5-Flash-Lite · +3 more

4 · arXiv · cs.CL · Jun 19, 2026

Meaning Intelligence Framework addresses context failure in AI processing of Nigerian public discourse

Researchers introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema designed to separate surface sentiment from true communicative intent in Nigerian public discourse. The paper argues that AI systems fail on Nigerian language data primarily due to context failure rather than translation failure, as pragmatic meaning shifts with speaker, audience, and situation. Evaluating Gemini 2.5 Flash on a 30-item calibration dataset, they find zero-shot register classification accuracy of 33.3% rising to 73.3% with schema-informed prompting, demonstrating large gains from structured in-context guidance. The framework and calibration set are released publicly to support reproducibility.

Evaluation and Benchmarking · Google · Gemini-2.5-Flash-Lite · AfriSenti · +2 more

5 · arXiv · cs.CL · Jun 17, 2026

TAC benchmark finds frontier AI agents avoid animal-exploitative travel options at below-chance rates

Researchers introduce TAC (Travel Agent Compassion), the first agentic benchmark testing whether AI agents avoid animal-exploitative options when booking travel on behalf of users. Across 48 scenarios spanning six exploitation categories, all seven evaluated frontier models score below the 64% chance baseline, with the best performer (Claude Opus 4.7) at 53%. A single welfare-aware sentence in the system prompt yields dramatic gains in Claude and GPT-5.5 (47–63 percentage points) but has minimal effect on DeepSeek and Gemini models. The study highlights a gap between models' text-response welfare reasoning and their agentic decision-making behavior.

Evaluation and Benchmarking · AI Safety Research · GPT-5.2 · Claude Opus 4.6 · DeepSeek V4 · +8 more

6 · The Batch · Jun 3, 2026

Qwen3.5 Small tops mobile-sized open models; GPT-5.3 Instant, Gemini 3.1 Flash-Lite, Claude memory import, and LLM deanonymization research

Alibaba released the Qwen3.5 Small model series (0.8B–9B parameters) with a hybrid Gated Delta Networks + sparse MoE architecture, with the 9B model outperforming OpenAI's gpt-oss-120B on GPQA Diamond despite being 13.5x smaller; all weights are Apache 2.0 licensed. Google introduced Gemini 3.1 Flash-Lite, a cost-optimized model at $0.25/M input tokens with 2.5x faster TTFT than Gemini 2.5 Flash. OpenAI released GPT-5.3 Instant targeting conversational quality improvements and hallucination reduction, while Anthropic added memory import/export functionality across all Claude tiers. Separately, researchers from MATS, Anthropic, and ETH Zurich demonstrated that LLM-based pipelines can deanonymize pseudonymous online users at 68% recall/90% precision for $1–4 per profile.

Frontier Model Releases · Open Weights Progress · Claude · Google · Alibaba · +14 more

8 · Mistral AI News · Jun 1, 2026

Mistral AI Releases Voxtral: Open-Weight Speech Understanding Models in 24B and 3B Sizes

Mistral AI has released Voxtral, a family of two open-weight speech understanding models (Voxtral Small at 24B and Voxtral Mini at 3B) under the Apache 2.0 license. Both models support long-form audio up to 30-40 minutes, native multilingual transcription, built-in Q&A and summarization, and function-calling directly from voice, built on the Mistral Small 3.1 language model backbone. Benchmarks show Voxtral outperforms Whisper large-v3 across all tasks and is competitive with GPT-4o mini and Gemini 2.5 Flash on audio understanding, while pricing starts at $0.001/minute via API. Models are available on Hugging Face and through Mistral's API, with a transcription-optimized variant (Voxtral Mini Transcribe) also offered.

Frontier Model Releases · Open Weights Progress · Mistral AI · FLEURS · Mistral Small 4 · +14 more

7 · Mistral AI News · Jun 1, 2026

Mistral OCR: New Document Understanding API with State-of-the-Art Benchmark Performance

Mistral AI has released Mistral OCR, an Optical Character Recognition API designed for deep document understanding, handling text, tables, equations, images, and complex layouts from PDFs and images. The model claims top benchmark scores across math, multilingual, scanned, and table categories, outperforming Google Document AI, Azure OCR, Gemini 1.5/2.0, and GPT-4o on an internal test set. It is priced at 1000 pages per dollar (with batch inference doubling that), available via la Plateforme API today, and is already deployed as the default document understanding model in Le Chat. A selective self-hosting option is offered for organizations with sensitive data requirements.

Inference Economics · Enterprise Deployment Patterns · Mistral AI · Azure OCR · Gemini 1.5 Pro · +8 more
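
The Mistral OCR pricing above is stated in pages per dollar, which can be converted to a per-page or per-job cost. A small sketch of that arithmetic, assuming "batch inference doubling that" means 2,000 pages per dollar. The function names and the 50,000-page job are illustrative only.

```python
STANDARD_PAGES_PER_DOLLAR = 1000  # figure quoted in the summary
BATCH_PAGES_PER_DOLLAR = 2 * STANDARD_PAGES_PER_DOLLAR  # assumed reading of "doubling"


def cost_per_page(pages_per_dollar):
    """Dollars charged for a single page at the given rate."""
    return 1 / pages_per_dollar


def job_cost(pages, pages_per_dollar):
    """Dollars charged for a job of `pages` pages at the given rate."""
    return pages / pages_per_dollar


# Hypothetical 50,000-page archive.
pages = 50_000
print(cost_per_page(STANDARD_PAGES_PER_DOLLAR))    # dollars per page, standard
print(job_cost(pages, STANDARD_PAGES_PER_DOLLAR))  # standard pricing
print(job_cost(pages, BATCH_PAGES_PER_DOLLAR))     # batch pricing
```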

5 · The Batch · Jun 1, 2026

Researchers at UT-Austin and Google Model Human Decision-Making in Rock-Paper-Scissors

Researchers from UT-Austin and Google used AlphaEvolve, an evolutionary code-optimization method, to synthesize interpretable Python programs that predict move-by-move decisions of LLMs and humans playing rock-paper-scissors against bots. They found that Gemini 2.5 Pro, Gemini 2.5 Flash, and GPT-4.1 share similar sequential-pattern-tracking strategies that are more systematic than typical human play, while GPT-OSS 120B and humans relied on simpler opponent-move-frequency heuristics. The study demonstrates that code synthesis from behavioral data can serve as an interpretability tool for LLM decision-making, revealing that LLMs do not simply mimic human strategies.

Evaluation and Benchmarking · AI Safety Research · Google · Gemini-2.5-Flash-Lite · AlphaEvolve · +6 more

4 · arXiv · cs.CL · May 22, 2026

Multimodal Pathos Analysis in Political Speech: LLM-Based vs. Acoustic Emotion Models

Researchers compare acoustic speech emotion recognition (emotion2vec_plus_large), multimodal LLM analysis (Gemini 2.5 Flash), and a multi-agent LLM ensemble (TRUST pipeline) for detecting Pathos in a Bundestag political speech. Gemini Valence correlates strongly with TRUST-Pathos scores (rho=+0.664) while acoustic Valence does not (rho=+0.097), suggesting LLMs capture semantically defined political emotion far better than acoustic models. The study also critiques standard SER benchmark corpora (EMO-DB) for acted speech, cultural bias, and category incompatibility. Results indicate acoustic features remain useful for low-level arousal estimation but are insufficient proxies for rhetorical-emotional analysis.

Agent and Tool Ecosystem · Multimodal Progress · Gemini-2.5-Flash-Lite · Felix Banaszak · emotion2vec_plus_large · +4 more
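
The rho values quoted in the Pathos study above read as Spearman rank correlations, although the summary does not name the statistic. A minimal pure-Python sketch of that measure (Pearson correlation computed on average ranks), using made-up scores in place of the paper's data:

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-segment scores from two scoring methods.
method_a = [1, 2, 3, 4, 5]
method_b = [1, 3, 2, 5, 4]
print(spearman_rho(method_a, method_b))
```

Because only ranks matter, two scoring methods can correlate strongly even when their raw scales differ, which suits a comparison between LLM-derived valence and pipeline scores.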

7 · Google DeepMind Blog · May 19, 2026

Gemini 2.0 Flash and Flash-Lite Reach General Availability

Google DeepMind has made Gemini 2.0 Flash-Lite generally available via the Gemini API, Google AI Studio, and Vertex AI for enterprise production use. This marks the transition of the Flash-Lite variant from preview to full GA status. The release expands developer and enterprise access to cost-efficient Gemini 2.0 inference capabilities.

Frontier Model Releases · Inference Economics · Google AI Studio · Gemini-2.5-Flash-Lite · Google DeepMind · +3 more

7 · Google DeepMind Blog · May 19, 2026

Gemini 2.0 Flash Native Image Generation Now Available for Developers

Google DeepMind has released native image output capability in Gemini 2.0 Flash, making it available to developers via Google AI Studio and the Gemini API. This enables the model to generate images natively rather than through a separate image generation pipeline. The release is framed as an experimental feature for developer exploration.

Frontier Model Releases · Agent and Tool Ecosystem · Google AI Studio · Gemini-2.5-Flash-Lite · Google DeepMind · +2 more

8 · Google DeepMind Blog · May 19, 2026

Introducing Gemini 2.5 Flash

Google DeepMind has released Gemini 2.5 Flash, described as its first fully hybrid reasoning model. The model allows developers to toggle 'thinking' (extended reasoning) on or off, combining standard and chain-of-thought inference modes in a single model. It is available to developers and represents a new architectural approach to balancing reasoning depth with inference cost.

Long Context Evolution · Frontier Model Releases · Gemini-2.5-Flash-Lite · Google DeepMind · Gemini-2.5-Pro · +3 more

7 · Google DeepMind Blog · May 19, 2026

Gemini 2.5 Pro and Flash Updates: Deep Think Reasoning Mode and Capability Improvements

Google DeepMind has announced updates to Gemini 2.5 Pro and Gemini 2.5 Flash, highlighting continued developer adoption for coding tasks. A new experimental feature called Deep Think introduces an enhanced reasoning mode for Gemini 2.5 Pro. Gemini 2.5 Flash also receives a capability update in this release cycle.

Frontier Model Releases · Evaluation and Benchmarking · Gemini-2.5-Flash-Lite · Google DeepMind · Gemini-2.5-Pro · +2 more

8 · Google DeepMind Blog · May 19, 2026

Gemini 2.5 Family Expansion: Flash and Pro GA, Flash-Lite Introduced

Google DeepMind has made Gemini 2.5 Flash and Gemini 2.5 Pro generally available, while simultaneously introducing Gemini 2.5 Flash-Lite, described as the most cost-efficient and fastest model in the 2.5 family. The announcement marks the full productization of the Gemini 2.5 generation. Flash-Lite targets latency- and cost-sensitive deployment scenarios.

Frontier Model Releases · Inference Economics · Gemini-2.5-Flash-Lite · Google DeepMind · Gemini-2.5-Pro · +1 more

8 · Google DeepMind Blog · May 19, 2026

Gemini 2.5: Updates to our family of thinking models

Google DeepMind has announced updates to the Gemini 2.5 model family, including Gemini 2.5 Pro reaching stable status, Gemini 2.5 Flash becoming generally available, and a new Gemini 2.5 Flash-Lite entering preview. These releases mark the maturation of DeepMind's 'thinking model' line with enhanced performance and accuracy. The updates span multiple tiers of the Gemini 2.5 family, from the flagship Pro to the lightweight Flash-Lite variant.

Long Context Evolution · Frontier Model Releases · Gemini-2.5-Flash-Lite · Google DeepMind · Gemini-2.5-Pro · +1 more

5 · Google DeepMind Blog · May 19, 2026

Gemini 2.5 Flash-Lite reaches general availability for production use

Google DeepMind has moved Gemini 2.5 Flash-Lite from preview to stable general availability. The model is positioned as a cost-efficient, small-footprint option within the 2.5 family, retaining key features including a 1 million-token context window and multimodal capabilities. It is now ready for scaled production deployment.

Long Context Evolution · Frontier Model Releases · Gemini 2.5 · Gemini-2.5-Flash-Lite · Google DeepMind · +2 more

4 · arXiv · cs.CL · May 19, 2026

Ancient Greek to Modern Greek Machine Translation: Novel Benchmark and Fine-Tuning Experiments

Researchers introduce the AG-MG Parallel Corpus, a 132,481 sentence-pair dataset for Ancient Greek to Modern Greek machine translation, created via a pipeline combining web scraping, VecAlign with LaBSE embeddings, and Gemini 2.5 Flash-based alignment correction. The paper benchmarks NMT models (NLLB, M2M100) and a Greek LLM (Llama-Krikri-8B) under three fine-tuning strategies. Full-parameter fine-tuning of Llama-Krikri-8B achieves the best BLEU score of 13.16, while QLoRA-adapted M2M100-1.2B shows the largest relative gains (+10.3 BLEU). This represents the first comprehensive MT benchmark for this low-resource language pair.

Evaluation and Benchmarking · Open Weights Progress · M2M100 · VecAlign · NLLB · +5 more

7 · Mistral AI News · May 18, 2026

Mistral Releases Voxtral Transcribe 2: State-of-the-Art Speech-to-Text with Sub-200ms Realtime Model

Mistral AI has released Voxtral Transcribe 2, a family of two speech-to-text models: Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live applications. Voxtral Realtime features a novel streaming architecture with configurable latency down to sub-200ms, a 4B parameter footprint suitable for edge deployment, and is released as open weights under Apache 2.0. Voxtral Mini Transcribe V2 claims state-of-the-art word error rate on FLEURS at $0.003/min, outperforming GPT-4o mini Transcribe, Gemini 2.5 Flash, AssemblyAI, and Deepgram Nova on accuracy benchmarks. Both models support 13 languages with speaker diarization, word-level timestamps, and context biasing.

Open Weights Progress · Inference Economics · Mistral AI · FLEURS · Apache 2.0 · +11 more

6 · arXiv · cs.LG · May 18, 2026

FORGE: Self-Evolving Agent Memory via Population Broadcast Without Weight Updates

FORGE (Failure-Optimized Reflective Graduation and Evolution) is a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents without any gradient updates. It wraps a Reflexion-style inner loop where a reflection agent converts failed trajectories into textual heuristics or few-shot demonstrations, then propagates the best-performing instance's memory across a population between stages. Evaluated on CybORG CAGE-2 (a stochastic network-defense POMDP), FORGE improves average return by 1.7–7.7× over zero-shot and 29–72% over Reflexion across all 12 model-representation conditions tested with four LLM families. Notably, weaker models benefit disproportionately, suggesting the method may help close capability gaps rather than amplify already-strong models.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Reflexion · Grok-4-Fast · ReAct · +6 more