Almanac
← Events
4Anthropic News·19d ago

Anthropic Publishes Quantitative Case Study on Prompt Engineering for Long-Context Recall

Anthropic shares a quantitative case study evaluating prompting techniques to improve Claude's recall over 75,000–90,000 token contexts. Two techniques are tested: extracting reference quotes before answering, and providing few-shot examples of correctly answered questions. The study uses Claude Instant 1.2 on a government document dataset constructed via a 'randomized collage' method, with multiple-choice Q&A pairs generated by Claude itself. Results show measurable recall improvements over a baseline prompt, with methodology and notebooks shared publicly.

Related guides (4)

Related events (8)

3Anthropic News·17d ago·source ↗

Anthropic publishes prompt engineering guide for enterprise Claude deployments

Anthropic released a practical guide covering three core prompt engineering techniques—chain-of-thought (step-by-step), few-shot prompting, and prompt chaining—aimed at businesses deploying Claude in production. The post includes a case study of a Fortune 500 company building a customer-facing chat assistant using these techniques to improve accuracy and speed. The content is instructional rather than a capability announcement, targeting enterprise practitioners seeking to optimize Claude deployments.

4arXiv · cs.AI·1mo ago·source ↗

Structured Prompt Checklists Outperform Raw and Clarifying-Question Prompts Across LLMs

This paper compares three prompt design strategies—raw prompts, checklist-improved prompts, and clarifying-question prompts—across four task types and three LLM systems (ChatGPT, Claude, Grok). Checklist-improved prompts achieved the highest mean rubric score (7.50/8) versus 5.67 for raw and 6.67 for clarifying-question prompts. Checklist prompts also used fewer tokens on average, suggesting a favorable quality-effort tradeoff. The study provides empirical grounding for structured prompt engineering as a practical technique to reduce multi-turn interaction overhead.

6Anthropic News·17d ago·source ↗

Anthropic introduces Contextual Retrieval to reduce RAG retrieval failures by up to 67%

Anthropic published a technical method called Contextual Retrieval that combines Contextual Embeddings and Contextual BM25 to address the context-loss problem in traditional RAG pipelines. The approach prepends chunk-level context before encoding, reducing failed retrievals by 49% standalone and 67% when combined with reranking. The post also highlights prompt caching as a simpler alternative for knowledge bases under 200K tokens, and provides a cookbook for deployment with Claude.

7Anthropic News·16d ago·source ↗

Anthropic launches Claude 2 with 100K context window and improved coding, reasoning, and safety

Anthropic released Claude 2, featuring a 100K token context window, improved performance on coding (71.2% on Codex HumanEval, up from 56.0%), math (88.0% on GSM8k), and legal reasoning (76.5% on the Bar exam multiple choice section). The model is available via API at the same price as Claude 1.3 and through a new public beta at claude.ai for US and UK users. Safety improvements include a 2x reduction in harmful outputs on internal red-team evaluations compared to Claude 1.3. Early API partners include Jasper and Sourcegraph.

6Openai Blog·1mo ago·source ↗

Prompt Caching in the API

OpenAI is introducing automatic prompt caching for API users, providing discounts on input tokens that the model has recently processed. The feature reduces costs for repeated or overlapping prompt prefixes without requiring explicit developer configuration. This follows Anthropic's similar caching feature and reflects broader industry movement toward inference cost optimization.

6Anthropic News·16d ago·source ↗

Anthropic enables fine-tuning of Claude 3 Haiku via Amazon Bedrock

Anthropic announced that Claude 3 Haiku can now be fine-tuned through Amazon Bedrock using custom prompt-completion pairs, with general availability reached November 1, 2024. The capability targets specialized business workflows, with Anthropic citing a case study showing classification accuracy improvement from 81.5% to 99.6% and 85% token reduction on a content moderation task. Early enterprise adopters include SK Telecom and Thomson Reuters, both reporting measurable performance gains. Fine-tuning is available in the US West (Oregon) region with text support up to 32K context, with vision fine-tuning planned.

9Anthropic News·19d ago·source ↗

Claude Opus 4.6 Released with 1M Token Context, Agentic Coding Advances, and State-of-the-Art Benchmarks

Anthropic has released Claude Opus 4.6, its most capable model to date, featuring a 1M token context window in beta, improved agentic coding and planning capabilities, and adaptive thinking with developer-controlled effort levels. The model claims top scores on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, and BrowseComp, outperforming OpenAI's GPT-5.2 by 144 Elo points on GDPval-AA. New product features include agent teams in Claude Code, context compaction for long-running tasks, and Claude in PowerPoint (research preview). Pricing remains unchanged at $5/$25 per million input/output tokens.

6arXiv · cs.CL·29d ago·source ↗

Instruction Sensitivity Undermines Embedding Model Evaluation: Single-Prompt Benchmarks Are Insufficient

This paper presents an empirical study of prompt sensitivity in instruction-tuned embedding models, covering 6 models, 11 datasets, and 15 task-specific prompts per dataset (990 total evaluations). The authors demonstrate that single-prompt evaluation systematically misrepresents true model performance, with default prompts both understating and overstating capabilities depending on phrasing. A key finding is that leaderboard rankings are not robust: by selecting prompts favorably, any model in the study can be promoted to first place. The authors recommend that benchmarks incorporate prompt robustness metrics, either through multi-prompt evaluation or by reporting sensitivity alongside point estimates.