4Anthropic News·19d ago

Anthropic Publishes Quantitative Case Study on Prompt Engineering for Long-Context Recall

Anthropic shares a quantitative case study evaluating prompting techniques to improve Claude's recall over 75,000–90,000 token contexts. Two techniques are tested: extracting reference quotes before answering, and providing few-shot examples of correctly answered questions. The study uses Claude Instant 1.2 on a government document dataset constructed via a 'randomized collage' method, with multiple-choice Q&A pairs generated by Claude itself. Results show measurable recall improvements over a baseline prompt, with methodology and notebooks shared publicly.

Long Context Evolution, Evaluation and Benchmarking, Agent and Tool Ecosystem, Claude, Claude API, randomized collage, Anthropic, Claude Instant 1.2
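The two techniques named in the summary above (pulling reference quotes before answering, and supplying few-shot examples of answered questions) can be sketched as a prompt builder. This is a minimal illustration, not the prompts used in the case study: the function name `build_recall_prompt`, the XML-style tag names, the instruction wording, and the sample document are all assumptions made here.

```python
from typing import Sequence, Tuple


def build_recall_prompt(
    document: str,
    question: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a long-context Q&A prompt (hypothetical layout).

    Order: document, optional few-shot examples, a quotes-first
    instruction, then the question to answer.
    """
    parts = [f"<document>\n{document}\n</document>"]
    # Few-shot examples: questions about the document that were answered correctly.
    for example_question, example_answer in examples:
        parts.append(
            f"<example>\nQuestion: {example_question}\nAnswer: {example_answer}\n</example>"
        )
    # Quote extraction: ask for supporting quotes before the final answer.
    parts.append(
        "First, copy the quotes from the document that are most relevant to the "
        "question into <quotes> tags. Then give your answer in <answer> tags, "
        "relying on those quotes."
    )
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


prompt = build_recall_prompt(
    document="Section 4: The permit fee is $120.",
    question="What is the permit fee? (A) $100 (B) $120",
    examples=[("Which section covers fees? (A) 3 (B) 4", "B")],
)
print(prompt)
```

A baseline prompt for comparison would be the same call with no examples and without the quotes-first instruction.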

Related guides (4)

Claude

Claude: Anthropic's AI Assistant Built for Safety and Scale

Long Context Evolution (Topic guide)

Long Context Evolution: From Bigger Windows to Smarter Memory

Anthropic

Anthropic: The AI Safety Company at the Center of the Frontier

Agent and Tool Ecosystem (Topic guide)

Agent and Tool Ecosystem: How the Infrastructure Layer Around LLMs Is Consolidating

Related events (8)

3Anthropic News·17d ago

Anthropic publishes prompt engineering guide for enterprise Claude deployments

Anthropic released a practical guide covering three core prompt engineering techniques—chain-of-thought (step-by-step), few-shot prompting, and prompt chaining—aimed at businesses deploying Claude in production. The post includes a case study of a Fortune 500 company building a customer-facing chat assistant using these techniques to improve accuracy and speed. The content is instructional rather than a capability announcement, targeting enterprise practitioners seeking to optimize Claude deployments.

Enterprise Deployment Patterns, Claude, Anthropic
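Of the three techniques in the guide summarized above, prompt chaining is the most mechanical: the output of one prompt becomes the input of the next. The sketch below shows that data flow only. The helper `run_chain`, the `{input}` placeholder convention, and the step wording are assumptions made for illustration, and `fake_model` is a stand-in that replaces a real model call so the example runs offline.

```python
from typing import Callable, List


def run_chain(
    call_model: Callable[[str], str],
    templates: List[str],
    user_input: str,
) -> str:
    """Run prompts in sequence, feeding each output into the next template."""
    current = user_input
    for template in templates:
        # Each template has one {input} slot filled with the previous result.
        current = call_model(template.format(input=current))
    return current


def fake_model(prompt: str) -> str:
    """Stand-in for a model call (assumption): echoes the prompt in upper case."""
    return prompt.upper()


steps = [
    "Think step by step and list the customer's issues: {input}",
    "Draft a short reply that addresses these issues: {input}",
]
result = run_chain(fake_model, steps, "my order arrived late")
print(result)
```

Passing the model call in as a function keeps the chaining logic separate from any particular client library, which also makes each step easy to test in isolation.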

4arXiv · cs.AI·1mo ago

Structured Prompt Checklists Outperform Raw and Clarifying-Question Prompts Across LLMs

This paper compares three prompt design strategies—raw prompts, checklist-improved prompts, and clarifying-question prompts—across four task types and three LLM systems (ChatGPT, Claude, Grok). Checklist-improved prompts achieved the highest mean rubric score (7.50/8) versus 5.67 for raw and 6.67 for clarifying-question prompts. Checklist prompts also used fewer tokens on average, suggesting a favorable quality-effort tradeoff. The study provides empirical grounding for structured prompt engineering as a practical technique to reduce multi-turn interaction overhead.

Agent and Tool Ecosystem, clarifying-question prompting, ChatGPT, Grok, +2 more

6Anthropic News·17d ago

Anthropic introduces Contextual Retrieval to reduce RAG retrieval failures by up to 67%

Anthropic published a technical method called Contextual Retrieval that combines Contextual Embeddings and Contextual BM25 to address the context-loss problem in traditional RAG pipelines. The approach prepends chunk-level context before encoding, reducing failed retrievals by 49% standalone and 67% when combined with reranking. The post also highlights prompt caching as a simpler alternative for knowledge bases under 200K tokens, and provides a cookbook for deployment with Claude.

Enterprise Deployment Patterns, Agent and Tool Ecosystem, Claude, BM25, Contextual Retrieval, +1 more
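The core step described above, prepending chunk-level context before a chunk is indexed, can be illustrated with a small self-contained BM25 scorer. This is a sketch of the lexical half only, under stated assumptions: the context strings are hand-written here (hypothetical company names and filings), the tokenizer is deliberately crude, and the embedding and reranking stages are left out. It is not Anthropic's implementation.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    tokens = (t.strip(".,:;()").lower() for t in text.split())
    return [t for t in tokens if t]


def contextualize(context: str, chunk: str) -> str:
    """Prepend a short situating context to a chunk before indexing."""
    return f"{context} {chunk}"


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with the standard BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n
    query_terms = set(tokenize(query))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in tokenized if term in d)
            if df == 0:
                continue  # term appears nowhere in the corpus
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            freq = tf[term]
            norm = freq + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores


chunks = [
    "Revenue grew 3% over the previous quarter.",
    "Revenue fell 1% over the previous quarter.",
]
# Hypothetical contexts; in practice these would be generated per chunk.
contexts = [
    "From ACME Corp's Q2 2023 filing:",
    "From Beta Inc's Q2 2023 filing:",
]
query = "ACME revenue"

raw = bm25_scores(query, chunks)
ctx = bm25_scores(query, [contextualize(c, ch) for c, ch in zip(contexts, chunks)])
print("raw:", raw)
print("contextualized:", ctx)
```

On the raw chunks the two scores tie, because neither chunk says which company it describes. Once the context is prepended, the query term "ACME" matches only the first chunk, so it ranks higher.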

7Anthropic News·16d ago

Anthropic launches Claude 2 with 100K context window and improved coding, reasoning, and safety

Anthropic released Claude 2, featuring a 100K token context window, improved performance on coding (71.2% on Codex HumanEval, up from 56.0%), math (88.0% on GSM8k), and legal reasoning (76.5% on the Bar exam multiple choice section). The model is available via API at the same price as Claude 1.3 and through a new public beta at claude.ai for US and UK users. Safety improvements include a 2x reduction in harmful outputs on internal red-team evaluations compared to Claude 1.3. Early API partners include Jasper and Sourcegraph.

Long Context Evolution, Frontier Model Releases, claude.ai, Claude, Sourcegraph, +7 more

6OpenAI Blog·1mo ago

Prompt Caching in the API

OpenAI is introducing automatic prompt caching for API users, providing discounts on input tokens that the model has recently processed. The feature reduces costs for repeated or overlapping prompt prefixes without requiring explicit developer configuration. This follows Anthropic's similar caching feature and reflects broader industry movement toward inference cost optimization.

Inference Economics, Enterprise Deployment Patterns, Prompt Caching, OpenAI API, OpenAI, +1 more
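The effect of a discount on recently processed input tokens can be estimated with simple arithmetic. The summary above does not state the discount or any prices, so every number in this sketch is a placeholder, and the function `input_cost` and its parameters are hypothetical names chosen for illustration.

```python
def input_cost(
    total_tokens: int,
    cached_tokens: int,
    price_per_million: float,
    cached_discount: float,
) -> float:
    """Estimate input cost when part of the prompt is billed at a discount.

    cached_discount is the fraction taken off cached tokens
    (0.5 means cached tokens cost half the normal rate).
    """
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0 and 1")
    uncached_tokens = total_tokens - cached_tokens
    billable = uncached_tokens + cached_tokens * (1.0 - cached_discount)
    return billable * price_per_million / 1_000_000


# Placeholder figures: 10,000-token prompt, 8,000 tokens in a repeated prefix,
# $2.00 per million input tokens, 50% off cached tokens.
without_cache = input_cost(10_000, 0, 2.00, 0.5)
with_cache = input_cost(10_000, 8_000, 2.00, 0.5)
print(f"without cache: ${without_cache:.4f}, with cache: ${with_cache:.4f}")
```

Because the discount applies to a shared prefix, the savings grow with the share of the prompt that stays identical across requests, which is why stable content is usually placed first.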

6Anthropic News·16d ago

Anthropic enables fine-tuning of Claude 3 Haiku via Amazon Bedrock

Anthropic announced that Claude 3 Haiku can now be fine-tuned through Amazon Bedrock using custom prompt-completion pairs; the feature reached general availability on November 1, 2024. The capability targets specialized business workflows, with Anthropic citing a case study in which classification accuracy rose from 81.5% to 99.6% and token usage fell by 85% on a content moderation task. Early enterprise adopters include SK Telecom and Thomson Reuters, both reporting measurable performance gains. Fine-tuning is available in the US West (Oregon) region with text support up to 32K context, with vision fine-tuning planned.

Frontier Model Releases, Enterprise Deployment Patterns, Amazon Bedrock, SK Telecom, Claude Haiku 4.5, +3 more

9Anthropic News·19d ago

Claude Opus 4.6 Released with 1M Token Context, Agentic Coding Advances, and State-of-the-Art Benchmarks

Anthropic has released Claude Opus 4.6, its most capable model to date, featuring a 1M token context window in beta, improved agentic coding and planning capabilities, and adaptive thinking with developer-controlled effort levels. Anthropic reports top scores on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, and BrowseComp, including a 144 Elo point lead over OpenAI's GPT-5.2 on GDPval-AA. New product features include agent teams in Claude Code, context compaction for long-running tasks, and Claude in PowerPoint (research preview). Pricing remains unchanged at $5 per million input tokens and $25 per million output tokens.

Long Context Evolution, Frontier Model Releases, GPT-5.2, Claude Opus 4.6, adaptive thinking, +13 more
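The quoted rates of $5 per million input tokens and $25 per million output tokens translate into per-request cost as follows. This sketch applies only those two headline rates; any other pricing terms are outside what the summary states, and the function name `request_cost` and the token counts are illustrative assumptions.

```python
INPUT_PRICE_PER_MILLION = 5.00    # USD, from the summary above
OUTPUT_PRICE_PER_MILLION = 25.00  # USD, from the summary above


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the two headline rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = input_tokens * INPUT_PRICE_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_MILLION / 1_000_000
    return input_cost + output_cost


# Illustrative request: 200,000 input tokens and 10,000 output tokens.
cost = request_cost(200_000, 10_000)
print(f"${cost:.2f}")
```

For this example the input side contributes $1.00 and the output side $0.25, so output tokens cost five times as much per token but input dominates the bill when the context is long.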

6arXiv · cs.CL·29d ago

Instruction Sensitivity Undermines Embedding Model Evaluation: Single-Prompt Benchmarks Are Insufficient

This paper presents an empirical study of prompt sensitivity in instruction-tuned embedding models, covering 6 models, 11 datasets, and 15 task-specific prompts per dataset (990 total evaluations). The authors demonstrate that single-prompt evaluation systematically misrepresents true model performance, with default prompts both understating and overstating capabilities depending on phrasing. A key finding is that leaderboard rankings are not robust: by selecting prompts favorably, any model in the study can be promoted to first place. The authors recommend that benchmarks incorporate prompt robustness metrics, either through multi-prompt evaluation or by reporting sensitivity alongside point estimates.

Evaluation and Benchmarking, Agent and Tool Ecosystem, MTEB, embedding model leaderboard, prompt sensitivity, +1 more
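The evaluation count and the ranking instability described above can both be checked with a few lines. The first part reproduces the 990 figure from the stated design. The second part is a toy illustration of why single-prompt rankings are fragile: the model names and scores are invented for this sketch and are not results from the paper, and it assumes all models are compared on the same prompt at a time.

```python
from statistics import mean, pstdev
from typing import Dict, List

# 6 models x 11 datasets x 15 prompts per dataset, as stated in the summary.
total_evaluations = 6 * 11 * 15

# Invented scores: one list per model, one entry per prompt variant.
scores: Dict[str, List[float]] = {
    "model_a": [0.62, 0.70, 0.55],
    "model_b": [0.66, 0.61, 0.58],
    "model_c": [0.60, 0.64, 0.63],
}


def winners_by_prompt(per_model: Dict[str, List[float]]) -> List[str]:
    """Return the top-scoring model for each prompt variant."""
    n_prompts = len(next(iter(per_model.values())))
    return [
        max(per_model, key=lambda model: per_model[model][i])
        for i in range(n_prompts)
    ]


def summarize(per_prompt: List[float]) -> Dict[str, float]:
    """Point estimate plus sensitivity: mean, spread, and range across prompts."""
    return {
        "mean": mean(per_prompt),
        "std": pstdev(per_prompt),
        "min": min(per_prompt),
        "max": max(per_prompt),
    }


winners = winners_by_prompt(scores)
print(total_evaluations)
print(winners)
for name, per_prompt in scores.items():
    print(name, summarize(per_prompt))
```

In this toy data a different model wins under each prompt, so a leaderboard built from any single prompt would report a different first place. Reporting the mean together with the spread across prompts is one way to make that sensitivity visible.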