Anthropic Publishes Quantitative Case Study on Prompt Engineering for Long-Context Recall
Anthropic shares a quantitative case study evaluating two prompting techniques for improving Claude's recall over contexts of 75,000 to 90,000 tokens: extracting reference quotes before answering, and providing few-shot examples of correctly answered questions. The study uses Claude Instant 1.2 on a government document dataset constructed via a 'randomized collage' method, with multiple-choice Q&A pairs generated by Claude itself. Results show measurable recall improvements over a baseline prompt, and the methodology and notebooks are shared publicly.
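As a rough illustration of how the two techniques might be combined in a single prompt, the sketch below assembles a prompt string from a document, a question, and a few solved examples. It is not the prompt used in the case study: the `Example` class, the `build_prompt` function, the tag names (`<document>`, `<example>`, `<quotes>`, `<answer>`), and the instruction wording are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """A previously answered multiple-choice question (hypothetical structure)."""
    question: str
    answer: str


def build_prompt(document: str, question: str, examples: list[Example]) -> str:
    """Assemble a long-context prompt that uses both techniques (illustrative only)."""
    parts = [f"<document>\n{document}\n</document>"]

    # Few-shot technique: show correctly answered questions before the real one.
    if examples:
        shots = "\n\n".join(
            f"<example>\nQuestion: {ex.question}\nAnswer: {ex.answer}\n</example>"
            for ex in examples
        )
        parts.append("Here are example questions with correct answers:\n" + shots)

    # Quote-extraction technique: ask for supporting quotes before the answer.
    # The tag names and wording here are assumptions, not the study's prompt.
    parts.append(
        "First, copy the quotes from the document that are most relevant to the "
        "question into <quotes> tags. Then give your final answer in <answer> tags."
    )
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```

Calling `build_prompt(doc, question, [])` yields the quote-extraction variant alone, which makes it straightforward to compare each technique against a baseline separately.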
Related events (8)
Anthropic publishes prompt engineering guide for enterprise Claude deployments
Anthropic released a practical guide covering three core prompt engineering techniques—chain-of-thought (step-by-step), few-shot prompting, and prompt chaining—aimed at businesses deploying Claude in production. The post includes a case study of a Fortune 500 company building a customer-facing chat assistant using these techniques to improve accuracy and speed. The content is instructional rather than a capability announcement, targeting enterprise practitioners seeking to optimize Claude deployments.
Structured Prompt Checklists Outperform Raw and Clarifying-Question Prompts Across LLMs
This paper compares three prompt design strategies—raw prompts, checklist-improved prompts, and clarifying-question prompts—across four task types and three LLM systems (ChatGPT, Claude, Grok). Checklist-improved prompts achieved the highest mean rubric score (7.50/8) versus 5.67 for raw and 6.67 for clarifying-question prompts. Checklist prompts also used fewer tokens on average, suggesting a favorable quality-effort tradeoff. The study provides empirical grounding for structured prompt engineering as a practical technique to reduce multi-turn interaction overhead.
Anthropic introduces Contextual Retrieval to reduce RAG retrieval failures by up to 67%
Anthropic published a technical method called Contextual Retrieval that combines Contextual Embeddings and Contextual BM25 to address the context-loss problem in traditional RAG pipelines. The approach prepends chunk-level context before encoding, reducing failed retrievals by 49% standalone and 67% when combined with reranking. The post also highlights prompt caching as a simpler alternative for knowledge bases under 200K tokens, and provides a cookbook for deployment with Claude.
Anthropic launches Claude 2 with 100K context window and improved coding, reasoning, and safety
Anthropic released Claude 2, featuring a 100K token context window, improved performance on coding (71.2% on Codex HumanEval, up from 56.0%), math (88.0% on GSM8k), and legal reasoning (76.5% on the Bar exam multiple choice section). The model is available via API at the same price as Claude 1.3 and through a new public beta at claude.ai for US and UK users. Safety improvements include a 2x reduction in harmful outputs on internal red-team evaluations compared to Claude 1.3. Early API partners include Jasper and Sourcegraph.
Prompt Caching in the API
OpenAI is introducing automatic prompt caching for API users, providing discounts on input tokens that the model has recently processed. The feature reduces costs for repeated or overlapping prompt prefixes without requiring explicit developer configuration. This follows Anthropic's similar caching feature and reflects broader industry movement toward inference cost optimization.
Anthropic enables fine-tuning of Claude 3 Haiku via Amazon Bedrock
Anthropic announced that Claude 3 Haiku can now be fine-tuned through Amazon Bedrock using custom prompt-completion pairs, with general availability reached on November 1, 2024. The capability targets specialized business workflows, with Anthropic citing a case study showing classification accuracy improvement from 81.5% to 99.6% and 85% token reduction on a content moderation task. Early enterprise adopters include SK Telecom and Thomson Reuters, both reporting measurable performance gains. Fine-tuning is available in the US West (Oregon) region and supports text with up to 32K tokens of context; vision fine-tuning is planned.
Claude Opus 4.6 Released with 1M Token Context, Agentic Coding Advances, and State-of-the-Art Benchmarks
Anthropic has released Claude Opus 4.6, its most capable model to date, featuring a 1M token context window in beta, improved agentic coding and planning capabilities, and adaptive thinking with developer-controlled effort levels. Anthropic reports top scores on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, and BrowseComp, with the model outperforming OpenAI's GPT-5.2 by 144 Elo points on GDPval-AA. New product features include agent teams in Claude Code, context compaction for long-running tasks, and Claude in PowerPoint (research preview). Pricing remains unchanged at $5/$25 per million input/output tokens.
Instruction Sensitivity Undermines Embedding Model Evaluation: Single-Prompt Benchmarks Are Insufficient
This paper presents an empirical study of prompt sensitivity in instruction-tuned embedding models, covering 6 models, 11 datasets, and 15 task-specific prompts per dataset (990 total evaluations). The authors demonstrate that single-prompt evaluation systematically misrepresents true model performance, with default prompts both understating and overstating capabilities depending on phrasing. A key finding is that leaderboard rankings are not robust: by selecting prompts favorably, any model in the study can be promoted to first place. The authors recommend that benchmarks incorporate prompt robustness metrics, either through multi-prompt evaluation or by reporting sensitivity alongside point estimates.
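The recommendation to report sensitivity alongside point estimates can be sketched as follows. The scores are made-up illustration values, not results from the paper, and the `summarize` and `can_rank_first` helpers are hypothetical. The ranking check reflects one possible reading of "selecting prompts favorably": a model can be promoted to first place if its best prompt beats every rival's worst prompt.

```python
from statistics import mean, pstdev

# The evaluation count quoted in the summary: 6 models x 11 datasets x 15 prompts.
assert 6 * 11 * 15 == 990

# Hypothetical per-prompt scores for two models (illustration only, not real data).
scores = {
    "model_a": [62.0, 70.0, 66.0],
    "model_b": [68.0, 61.0, 65.0],
}


def summarize(per_prompt: list[float]) -> dict[str, float]:
    """Report a point estimate together with its spread across prompts."""
    return {
        "mean": mean(per_prompt),
        "std": pstdev(per_prompt),
        "min": min(per_prompt),
        "max": max(per_prompt),
    }


def can_rank_first(model: str, all_scores: dict[str, list[float]]) -> bool:
    """True if the model's best prompt beats every rival's worst prompt."""
    best = max(all_scores[model])
    return all(
        best > min(rival_scores)
        for name, rival_scores in all_scores.items()
        if name != model
    )


for name, per_prompt in scores.items():
    print(name, summarize(per_prompt), "can rank first:", can_rank_first(name, scores))
```

In this toy data both models can be ranked first under some choice of prompts, which is the kind of instability a single-prompt leaderboard would hide.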



