7 · arXiv cs.AI (Artificial Intelligence) · 12d ago

Perplexity production data shows AI agents perform 26 min of autonomous work vs 33 sec for search, cut task time 87%

A paper using production data from Perplexity's Search and Computer products quantifies how autonomous AI agents reshape knowledge work relative to conversational search. Key findings: Computer executes 26 minutes of autonomous work per session versus 33 seconds for Search, reduces task completion time from 269 to 36 minutes on matched tasks (87% time reduction, 94% cost reduction), and lowers per-query dissatisfaction by 55%. The study also finds agents shift user behavior toward higher-order tasks, cross occupational boundaries more often, and unlock work categories essentially absent from search usage.
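The headline percentages can be checked from the raw figures quoted above. The following is a minimal sketch, not taken from the paper: the helper name `percent_reduction` and the constant names are illustrative assumptions, and the autonomy ratio is a derived figure that the summary does not state directly.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return (before - after) / before * 100.0


# Figures quoted in the summary (names are hypothetical, chosen for this sketch).
MATCHED_TASK_MINUTES_BEFORE = 269
MATCHED_TASK_MINUTES_AFTER = 36
COMPUTER_SESSION_SECONDS = 26 * 60  # 26 minutes of autonomous work per session
SEARCH_SESSION_SECONDS = 33

time_reduction = percent_reduction(
    MATCHED_TASK_MINUTES_BEFORE, MATCHED_TASK_MINUTES_AFTER
)
# Derived here, not reported in the summary: how much longer Computer works per session.
autonomy_ratio = COMPUTER_SESSION_SECONDS / SEARCH_SESSION_SECONDS

print(f"Time reduction: {time_reduction:.1f}%")        # rounds to the quoted 87%
print(f"Autonomous work ratio: {autonomy_ratio:.1f}x")
```

Under these figures the time reduction comes to about 86.6%, consistent with the rounded 87% in the summary, and a Computer session involves roughly 47 times as much autonomous work as a Search session.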

Evaluation and Benchmarking · Enterprise Deployment Patterns · Agent and Tool Ecosystem · How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope · Perplexity Computer · Perplexity Search · Perplexity AI

Related guides (3)

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · One Useful Thing · 1mo ago

Real AI Agents and Real Work

A commentary piece from One Useful Thing examining the practical deployment of AI agents in real work contexts, framing the tension between human-centered work and AI-generated productivity outputs. The piece appears to analyze how autonomous AI agents are changing knowledge work workflows. Published by a Tier 2 source known for applied AI analysis aimed at practitioners and researchers.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · One Useful Thing

7 · OpenAI Blog · 1mo ago

SearchGPT: OpenAI Prototype for AI-Powered Search

OpenAI announced SearchGPT, a temporary prototype integrating real-time web search capabilities into a conversational AI interface. The prototype aims to deliver fast, timely answers with clearly attributed sources. It represents OpenAI's direct entry into AI-native search, competing with existing players like Perplexity and Microsoft Bing AI.

Frontier Model Releases · Enterprise Deployment Patterns · SearchGPT · Microsoft Bing AI · Perplexity AI · +2 more

6 · The Batch · 17d ago

Data Points: Perplexity Computer expands, Google Aletheia math agent, DeepSeek chip strategy, Nvidia retrieval pipeline, Stargate cancellation

The Batch's weekly data points roundup covers five significant AI developments: Perplexity expanded its Computer agentic platform to desktop, mobile, and enterprise with new APIs and financial data tools; Google released Aletheia, a Gemini-based math research agent achieving 95.1% on IMO-Proof Bench Advanced (up from 65.7%); DeepSeek withheld pre-release access to its V4 model from Nvidia and AMD while giving domestic Chinese chipmakers early access; Nvidia's NeMo Retriever topped the ViDoRe v3 leaderboard using a ReAct-based agentic retrieval loop; and OpenAI and Oracle cancelled plans to expand the Abilene Stargate campus from 1.2 GW to 2.0 GW due to financing and reliability issues.

Training Infrastructure · Frontier Model Releases · ViDoRe v3 · Crusoe · BRIGHT · +19 more

6 · arXiv cs.CL · 25d ago

ProAct: Proactive Agent Architecture Using Idle-Time Compute to Anticipate User Needs

ProAct is a proactive agent architecture that uses idle time between user interactions to predict upcoming needs, pre-fetch information, and resolve knowledge gaps before queries are issued. The system analyzes dialogue history and persistent memory to iteratively acquire relevant information in advance. Evaluated on the new ProActEval benchmark (200 scenarios, 40 domains), ProAct reduces required turns by 14.8%, user effort by 11.7%, and hallucination rates by 28.1% compared to reactive baselines. The work also achieves state-of-the-art reflective accuracy on MemBench.

Evaluation and Benchmarking · Inference Economics · ProActEval · idle-time compute · ProAct · +3 more

4 · AI Snake Oil · 1mo ago

Could AI Slow Science? Confronting the Production-Progress Paradox

A commentary piece from AI Snake Oil explores the potential paradox whereby AI tools increase scientific output volume while simultaneously slowing genuine scientific progress. The piece examines how AI-assisted research production may prioritize quantity over quality, potentially crowding out deeper, slower-moving inquiry. This raises structural concerns about how AI integration into research workflows could reshape the incentive landscape of science.

Evaluation and Benchmarking · AI Safety Research · AI Snake Oil

6 · arXiv cs.CL · 29d ago

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · BBC News · GPT-4o mini · +11 more

5 · arXiv cs.CL · 3d ago

TAC benchmark finds frontier AI agents score below chance at avoiding animal-exploitative travel options

Researchers introduce TAC (Travel Agent Compassion), the first agentic benchmark testing whether AI agents avoid animal-exploitative options when booking travel on behalf of users. Across 48 scenarios spanning six exploitation categories, all seven evaluated frontier models score below the 64% chance baseline, with the best performer (Claude Opus 4.7) at 53%. A single welfare-aware sentence in the system prompt yields dramatic gains in Claude and GPT-5.5 (47-63 percentage points) but minimal effect on DeepSeek and Gemini models. The study highlights a gap between models' text-response welfare reasoning and their agentic decision-making behavior.

Evaluation and Benchmarking · AI Safety Research · GPT-5.2 · Claude Opus 4.6 · DeepSeek V4 · +8 more

7 · The Batch · 16d ago

Microsoft Build: Seven in-house AI models, GitHub Copilot desktop agent manager, and Web IQ search API for agents

Microsoft announced seven new AI models trained from scratch (not distilled from OpenAI), including the flagship MAI-Thinking-1 reasoning model and MAI-Transcribe-1.5, plus a 'Frontier Tuning' reinforcement learning approach for enterprise workflow training. GitHub released a desktop Copilot app designed to manage multiple parallel AI agents with isolated git worktrees and bidirectional canvases. Microsoft also launched Web IQ, an agent-native Bing-powered grounding API already powering search in Copilot and ChatGPT, running 2.5x faster than alternatives with lower token costs. The roundup also covers Nous Research's Hermes Desktop cross-platform agent app, Alibaba's Qwen3.7-Plus multimodal model, and OpenAI's role-specific Codex plugins.

Frontier Model Releases · Inference Economics · MAI-Thinking-1 · FLEURS · Frontier Tuning · +15 more