Instruction Sensitivity Undermines Embedding Model Evaluation: Single-Prompt Benchmarks Are Insufficient
This paper presents an empirical study of prompt sensitivity in instruction-tuned embedding models, covering 6 models, 11 datasets, and 15 task-specific prompts per dataset (990 total evaluations). The authors demonstrate that single-prompt evaluation systematically misrepresents true model performance, with default prompts both understating and overstating capabilities depending on phrasing. A key finding is that leaderboard rankings are not robust: by selecting prompts favorably, any model in the study can be promoted to first place. The authors recommend that benchmarks incorporate prompt robustness metrics, either through multi-prompt evaluation or by reporting sensitivity alongside point estimates.
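The recommendation to report sensitivity alongside point estimates can be sketched as follows. This is a minimal illustration, not the authors' code: the score table, the model names, and both helper functions are hypothetical, and the "favorable selection" check assumes each model may be paired with a different prompt.

```python
from statistics import mean, pstdev

# Hypothetical scores on one dataset: scores[model] holds one value per prompt.
# The numbers are illustrative only and are not taken from the paper.
scores = {
    "model_a": [0.62, 0.70, 0.55],
    "model_b": [0.66, 0.60, 0.64],
    "model_c": [0.58, 0.68, 0.61],
}

def sensitivity_report(scores):
    """Summarize each model by its mean, spread, and range across prompts."""
    return {
        model: {
            "mean": mean(vals),
            "std": pstdev(vals),   # population standard deviation across prompts
            "min": min(vals),
            "max": max(vals),
        }
        for model, vals in scores.items()
    }

def can_reach_first(scores, model):
    """True if `model` ranks strictly first when it gets its best prompt
    and every other model gets its worst prompt (assumed selection rule)."""
    best = max(scores[model])
    return all(
        best > min(vals)
        for other, vals in scores.items()
        if other != model
    )

report = sensitivity_report(scores)
promotable = {model: can_reach_first(scores, model) for model in scores}
```

With these illustrative numbers every model is promotable to first place, which is the kind of ranking instability a single-prompt leaderboard would hide.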
Related events (8)
EEVEE: Multi-dataset test-time prompt learning framework for self-improving LLM agents
EEVEE is a new framework enabling LLM agents to perform test-time prompt learning across heterogeneous multi-dataset task streams, addressing a gap where prior methods only handled single-dataset settings. The system uses a router to partition inputs into task clusters and assigns them to suitable prompt configurations, optimized via a router-prompt co-evolution strategy. Experiments show improvements of 10.38 and 24.32 average points over Qwen3-4B-Instruct and DeepSeek-V3.2 respectively, outperforming prior SOTA methods GEPA and ACE by up to 48.2%.
Anthropic Publishes Quantitative Case Study on Prompt Engineering for Long-Context Recall
Anthropic shares a quantitative case study evaluating prompting techniques to improve Claude's recall over 75,000–90,000 token contexts. Two techniques are tested: extracting reference quotes before answering, and providing few-shot examples of correctly answered questions. The study uses Claude Instant 1.2 on a government document dataset constructed via a 'randomized collage' method, with multiple-choice Q&A pairs generated by Claude itself. Results show measurable recall improvements over a baseline prompt, with methodology and notebooks shared publicly.
ParaEval framework reduces MCQA benchmark sensitivity to answer phrasing
A new arXiv preprint identifies a systematic flaw in multiple-choice QA benchmarks: log-likelihood scoring conflates surface-form familiarity with actual capability, producing false performance gaps exceeding 2 points between models trained on identical knowledge. The authors propose ParaEval, which queries models with multiple paraphrases per answer option and scores on the most favorable phrasing, reducing the false gap to below 1 point. The effect is confirmed on frontier 70B and 120B open-source models, suggesting widespread benchmark inflation in standard MCQA evaluations.
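The scoring rule described above (take the most favorable phrasing per answer option) can be sketched in a few lines. This assumes the per-paraphrase log-likelihoods have already been computed; the dictionary layout, the function names, and the numbers are hypothetical rather than ParaEval's actual interface.

```python
# Hypothetical log-likelihoods for one question: each option maps to a list of
# scores, one per paraphrase. Index 0 is assumed to be the original phrasing.
logliks = {
    "A": [-9.0, -4.0],   # correct option, unfamiliar original wording
    "B": [-6.0, -5.5],
}

def predict_single(logliks):
    """Standard scoring: compare options using only the original phrasing."""
    return max(logliks, key=lambda option: logliks[option][0])

def predict_max_paraphrase(logliks):
    """Paraphrase-aware scoring: each option is scored by its best paraphrase."""
    return max(logliks, key=lambda option: max(logliks[option]))

single_choice = predict_single(logliks)
paraphrase_choice = predict_max_paraphrase(logliks)
```

In this toy case the single-phrasing rule picks B because the original wording of A scores poorly, while the max-over-paraphrases rule picks A.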
Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests
This paper introduces a large, consensus-labeled benchmark of 6,675 prompts drawn from eight existing corpora (ASTRA, CySecBench, AdvBench, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) to evaluate whether coding-specialized LLMs refuse malicious requests. A key contribution is the distinction between requests for executable malicious code (4,748 prompts) and requests for harmful security knowledge (1,923 prompts); the authors argue that coding models should face a stricter refusal standard because their outputs can be directly weaponized. A five-judge consensus protocol achieves a Fleiss' kappa of 0.767, providing a reliability-quantified substrate for cross-corpus compliance measurement that the field has previously lacked.
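For readers unfamiliar with the agreement statistic cited above, the standard Fleiss' kappa formula can be computed as in this sketch. The input layout (one row per prompt, one count per label) and the example counts are assumptions for illustration and do not come from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item must have the same number
    of ratings."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: mean over items of the pairwise agreement rate.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement: sum of squared overall category proportions.
    n_categories = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    if p_e == 1:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 3 prompts, 5 judges, 2 labels (malicious, benign).
example = [[5, 0], [0, 5], [3, 2]]
kappa = fleiss_kappa(example)
```

Values near 1 indicate agreement well beyond chance, and values near 0 indicate agreement no better than chance.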
Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
This paper investigates multi-objective prompt optimization for LLM-as-judge systems, testing five decomposition modes of textual gradient optimizers across varying levels of cross-task information sharing. In 6 of 10 configurations, optimization fails to improve over the initial prompt, with gradient specificity dropping 59% when multiple criteria are processed jointly. The authors identify two separable failure modes: gradient dilution at optimization time and instruction interference at inference time. These findings constrain the design space for customizing LLM judges via textual feedback across multiple evaluation criteria simultaneously.
Decomposing factual sycophancy in LLMs: size and instruction tuning shape robustness differently
A new arXiv paper decomposes factual sycophancy (where a model abandons a correct answer under social pressure) into two distinct mechanisms: truth margin (baseline preference for correct answers) and manipulation sensitivity (how much pressure shifts that preference). Evaluating 56 open-weight models from 0.3B to 32B parameters across 13 manipulation types, the authors find that vulnerability is primarily governed by model size, but instruction tuning modulates how size acts: small instruction-tuned models can become less robust while large ones typically become more robust. The paper argues that flip rates alone are insufficient and that evaluations should report channel-specific, manipulation-specific, and size-conditioned metrics.
Improving Prompt Consistency with Structured Generations
This Hugging Face blog post examines how structured generation can improve consistency in LLM evaluation pipelines. It explores techniques for constraining model outputs to specific formats, reducing variability in prompt-based assessments. The post addresses a practical challenge in evaluation workflows, where inconsistent response formats degrade measurement reliability.
Systematic evaluation of LLM prompt sensitivity in healthcare settings reveals safety risks
Researchers conduct a sensitivity analysis of both general-purpose and medical-specific LLMs using the MedMCQA benchmark, testing robustness to lexical and syntactic prompt perturbations. The study finds that even minor phrasing changes can alter clinical advice, and adversarial prompts can produce dangerous outputs such as incorrect dosages or omitted critical findings. Both general-purpose models (GPT-3.5, Llama 3) and domain-specific models (ClinicalBERT, BioLlama3, BioBERT) exhibit this fragility, with syntactic reordering and misleading contextual cues proving more destabilizing than simple paraphrasing.

