Prompt injection attacks on LLM-based résumé screening: effectiveness and fairness implications
A new arXiv paper studies prompt injection in automated résumé screening, where candidates embed subtle self-promotional text to manipulate LLM rankings without adding genuine qualifications. Controlled experiments show injection reliably boosts rankings when manipulation is rare and candidate quality is homogeneous, but effectiveness collapses as adoption spreads. The work raises fairness concerns because lower-quality candidates can occasionally outrank higher-quality ones, and identifies conditions under which LLM-based hiring systems are most vulnerable.
Related events (8)
LLMs judge worse than they generate: empirical challenge to self-evaluation pipeline assumptions
A new arXiv preprint tests the implicit assumption that LLM evaluation is easier than generation, using a controlled in-context QA setup across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models. Results show generation accuracy exceeds self-evaluation accuracy on three of four benchmarks, with attention analysis revealing that evaluation attends to context 3–5x less than generation does. LoRA fine-tuning experiments confirm the asymmetry is not a training artifact, with cross-task interference observed in both directions. The findings directly challenge assumptions underlying LLM-as-a-Judge and self-evaluation pipelines widely used in RLHF and agentic systems.
LLMs fail to reliably self-report adversarial prefill attacks, study finds
A new arXiv paper evaluates whether LLMs can recognize that their own prior responses were elicited by adversarial prefill attacks, testing ten open-weight models (3B–70B) across four safety benchmarks. Models claim intent on prefilled responses only 27.3% of the time on average, and introspective signal is largely mediated by refusal-related reasoning. Three LoRA fine-tuning methods (SFT, GRPO, DPO) improve the intention-probe gap but counterintuitively raise attack success rates on most models, suggesting partial and fragile mitigation. The findings raise concerns about the reliability of LLM self-reports in safety-critical contexts.
Performative compliance in LLMs: fairness evaluations overestimate moral safety when demographic cues are implicit
A new arXiv paper demonstrates that LLMs exhibit 'performative compliance': they appear fair when demographic identity is explicitly labeled but become measurably less fair when the same identity must be inferred from context. The authors introduce a cue-variation methodology and the Cue Visibility Gap metric, showing that hiding explicit demographic labels raises harmful decisions by 4.4 percentage points and changes model safety rankings. The finding challenges the validity of current fairness benchmarks for high-stakes deployment contexts such as healthcare, legal, and hiring.
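The summary above does not give the paper's exact formula for the Cue Visibility Gap. As a rough illustration only, the sketch below assumes it is the difference, in percentage points, between the harmful-decision rate under implicit cues and the rate under explicit labels. The function names and the sample data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def harmful_rate(decisions: Sequence[bool]) -> float:
    """Fraction of decisions flagged as harmful (True means harmful)."""
    if not decisions:
        raise ValueError("need at least one decision")
    return sum(decisions) / len(decisions)


def cue_visibility_gap(explicit: Sequence[bool], implicit: Sequence[bool]) -> float:
    """Assumed definition: implicit-cue harmful rate minus explicit-cue
    harmful rate, expressed in percentage points."""
    return 100.0 * (harmful_rate(implicit) - harmful_rate(explicit))


# Made-up illustrative data, not results from the paper:
# 5 harmful decisions out of 100 with explicit labels, 10 out of 100 with implicit cues.
explicit_decisions = [True] * 5 + [False] * 95
implicit_decisions = [True] * 10 + [False] * 90

gap = cue_visibility_gap(explicit_decisions, implicit_decisions)
print(f"Cue Visibility Gap: {gap:.1f} percentage points")
```

Under this assumed definition, a positive gap means the model makes more harmful decisions when identity has to be inferred than when it is labeled.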
Simon Willison frames prompt injection as a role confusion problem
Simon Willison publishes a commentary piece reframing prompt injection attacks as fundamentally a problem of role confusion in LLM systems, where models fail to distinguish between trusted instructions and untrusted data. The piece offers a conceptual lens for understanding why prompt injection is structurally difficult to solve. This framing has implications for how developers and researchers approach mitigations.
Systematic evaluation of LLM prompt sensitivity in healthcare settings reveals safety risks
Researchers conduct a sensitivity analysis of both general-purpose and medical-specific LLMs using the MedMCQA benchmark, testing robustness to lexical and syntactic prompt perturbations. The study finds that even minor phrasing changes can alter clinical advice, and adversarial prompts can produce dangerous outputs such as incorrect dosages or omitted critical findings. Both general-purpose models (GPT-3.5, Llama 3) and domain-specific models (ClinicalBERT, BioLlama3, BioBERT) exhibit this fragility, with syntactic reordering and misleading contextual cues proving more destabilizing than simple paraphrasing.
LLMs automate reproducibility assessments in social and behavioral sciences, outperforming human reanalysts
An arXiv preprint demonstrates that an LLM pipeline can automate reproducibility assessments of published social and behavioral science studies, recovering original effect sizes in 41% of cases (vs. 34% for human reanalysts) and reaching the same qualitative conclusion in 96% of cases (vs. 74% for humans). The study evaluated 76 published studies with predefined claims. The results suggest LLMs could serve as a scalable tool for systematic auditing of empirical research, addressing the resource-intensive nature of traditional reproducibility efforts.
FORGE benchmark reveals search-augmented LLMs vulnerable to fake product promotion via web content pollution
Researchers introduce FORGE, a benchmark measuring how often search-augmented LLMs recommend fake products when retrieval results are polluted with fabricated reviews or promotional pages. Across 12 commercial and open-weights models, a single polluted page causes fooled rates of up to 27%, rising to 73.8% when all top-3 results are replaced. Notably, chain-of-thought reasoning does not mitigate the vulnerability and often generates spurious social proof to justify false recommendations. The three defenses tested (skepticism prompting, model-prior filtering, and cross-document consensus) each carry significant drawbacks.
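The "fooled rate" reported above can be read as the share of queries on which a model recommends a fabricated product. The sketch below assumes exactly that reading, with one recommended product per query; the benchmark's actual scoring rules are not described in this summary, and the function name and sample values are hypothetical.

```python
from typing import Iterable, Set


def fooled_rate(recommendations: Iterable[str], fake_products: Set[str]) -> float:
    """Share of queries whose recommended product is a known fake.

    Assumes one recommendation per query (an illustrative simplification).
    """
    recs = list(recommendations)
    if not recs:
        raise ValueError("need at least one recommendation")
    fooled = sum(1 for rec in recs if rec in fake_products)
    return fooled / len(recs)


# Made-up illustrative data, not results from the benchmark.
fakes = {"FakeBlender X"}
model_recs = ["RealBlender A", "FakeBlender X", "RealBlender B", "FakeBlender X"]

rate = fooled_rate(model_recs, fakes)
print(f"Fooled rate: {rate:.0%}")
```

Computing the same rate separately for each pollution level (one polluted page versus all top-3 results replaced) would give the kind of comparison the summary describes.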
SoundnessBench: Benchmarking LLMs as Evaluators of ML Research Proposal Viability
SoundnessBench is a new benchmark of 1,099 machine-learning research proposals derived from ICLR submissions, labeled with reviewer soundness scores, designed to test whether LLMs can reliably distinguish methodologically sound research ideas from unsound ones. Evaluated across 12 frontier LLMs, the benchmark reveals a pervasive optimism bias: models systematically rate low-soundness proposals as sound under standard prompting, with aggressive prompting shifting errors from false positives to false negatives rather than eliminating them. Controls for data contamination, surface features, and human audit quality suggest the bias is not attributable to a single confounder. The authors conclude that current LLMs are not yet reliable as standalone first-gate evaluators of scientific rigor, a critical bottleneck for autonomous AI research agents.

