5 · arXiv cs.AI (Artificial Intelligence) · 7d ago

Prompt injection attacks on LLM-based résumé screening: effectiveness and fairness implications

A new arXiv paper studies prompt injection in automated résumé screening, where candidates embed subtle self-promotional text to manipulate LLM rankings without adding genuine qualifications. Controlled experiments show injection reliably boosts rankings when manipulation is rare and candidate quality is homogeneous, but effectiveness collapses as adoption spreads. The work raises fairness concerns because lower-quality candidates can occasionally outrank higher-quality ones, and identifies conditions under which LLM-based hiring systems are most vulnerable.

Evaluation and Benchmarking · AI Safety Research · Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Principles to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Related events (8)

6 · arXiv · cs.CL · 4d ago

LLMs judge worse than they generate: empirical challenge to self-evaluation pipeline assumptions

A new arXiv preprint tests the implicit assumption that LLM evaluation is easier than generation, using a controlled in-context QA setup across four benchmarks (SQuAD 2.0, DROP, HotpotQA, MuSiQue) and two models. Results show generation accuracy exceeds self-evaluation accuracy on three of four benchmarks, with attention analysis revealing that evaluation attends to context 3–5x less than generation does. LoRA fine-tuning experiments confirm the asymmetry is not a training artifact, with cross-task interference observed in both directions. The findings directly challenge assumptions underlying LLM-as-a-Judge and self-evaluation pipelines widely used in RLHF and agentic systems.

Evaluation and Benchmarking · Alignment and RLHF · MuSiQue · Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA · LoRA

6 · arXiv · cs.CL · 10d ago

LLMs fail to reliably self-report adversarial prefill attacks, study finds

A new arXiv paper evaluates whether LLMs can recognize that their own prior responses were elicited by adversarial prefill attacks, testing ten open-weight models (3B–70B) across four safety benchmarks. Models claim intent on prefilled responses only 27.3% of the time on average, and introspective signal is largely mediated by refusal-related reasoning. Three LoRA fine-tuning methods (SFT, GRPO, DPO) improve the intention-probe gap but counterintuitively raise attack success rates on most models, suggesting partial and fragile mitigation. The findings raise concerns about the reliability of LLM self-reports in safety-critical contexts.

Evaluation and Benchmarking · AI Safety Research · GRPO · DPO · Can LLMs Reliably Self-Report Adversarial Prefills, and How?

6 · arXiv · cs.CL · 2d ago

Performative compliance in LLMs: fairness evaluations overestimate moral safety when demographic cues are implicit

A new arXiv paper demonstrates that LLMs exhibit 'performative compliance': they appear fair when demographic identity is explicitly labeled but become measurably less fair when the same identity must be inferred from context. The authors introduce a cue-variation methodology and the Cue Visibility Gap metric, showing that hiding explicit demographic labels raises harmful decisions by 4.4 percentage points and changes model safety rankings. The finding challenges the validity of current fairness benchmarks for high-stakes deployment contexts such as healthcare, law, and hiring.

Evaluation and Benchmarking · AI Safety Research · Cue Visibility Gap · Moral Safety in LLMs: Exposing Performative Compliance with Puzzled Cues

5 · Simon Willison's Weblog · 10d ago

Simon Willison frames prompt injection as a role confusion problem

Simon Willison publishes a commentary piece reframing prompt injection attacks as fundamentally a problem of role confusion in LLM systems, where models fail to distinguish between trusted instructions and untrusted data. The piece offers a conceptual lens for understanding why prompt injection is structurally difficult to solve. This framing has implications for how developers and researchers approach mitigations.

AI Safety Research · Agent and Tool Ecosystem · prompt injection · Simon Willison

5 · arXiv · cs.CL · 25d ago

Systematic evaluation of LLM prompt sensitivity in healthcare settings reveals safety risks

Researchers conduct a sensitivity analysis of both general-purpose and medical-specific LLMs using the MedMCQA benchmark, testing robustness to lexical and syntactic prompt perturbations. The study finds that even minor phrasing changes can alter clinical advice, and adversarial prompts can produce dangerous outputs such as incorrect dosages or omitted critical findings. Both general-purpose models (GPT-3.5, Llama 3) and domain-specific models (ClinicalBERT, BioLlama3, BioBERT) exhibit this fragility, with syntactic reordering and misleading contextual cues proving more destabilizing than simple paraphrasing.

Evaluation and Benchmarking · AI Safety Research · BioLlama3 · BioBERT · MedMCQA

6 · arXiv · cs.AI · 21d ago

LLMs automate reproducibility assessments in social and behavioral sciences, outperforming human reanalysts

An arXiv preprint demonstrates that an LLM pipeline can automate reproducibility assessments of published social and behavioral science studies, recovering original effect sizes in 41% of cases (vs. 34% for human reanalysts) and reaching the same qualitative conclusion in 96% of cases (vs. 74% for humans). The study evaluated 76 published studies with predefined claims. The results suggest LLMs could serve as a scalable tool for systematic auditing of empirical research, easing the resource burden of traditional reproducibility efforts.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Automated reproducibility assessments in the social and behavioral sciences using large language models

6 · arXiv · cs.AI · 21d ago

FORGE benchmark reveals search-augmented LLMs vulnerable to fake product promotion via web content pollution

Researchers introduce FORGE, a benchmark measuring how often search-augmented LLMs recommend fake products when retrieval results are polluted with fabricated reviews or promotional pages. Across 12 commercial and open-weights models, a single polluted page produces fooled rates of up to 27%, rising to 73.8% when all top-3 results are replaced. Notably, chain-of-thought reasoning does not mitigate the vulnerability and often generates spurious social proof to justify false recommendations. The three defenses tested (skepticism prompting, model-prior filtering, and cross-document consensus) each carry significant drawbacks.

Evaluation and Benchmarking · AI Safety Research · One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders · FORGE

6 · arXiv · cs.LG · 1mo ago

SoundnessBench: Benchmarking LLMs as Evaluators of ML Research Proposal Viability

SoundnessBench is a new benchmark of 1,099 machine-learning research proposals derived from ICLR submissions, labeled with reviewer soundness scores, designed to test whether LLMs can reliably distinguish methodologically sound research ideas from unsound ones. Evaluated across 12 frontier LLMs, the benchmark reveals a pervasive optimism bias: models systematically rate low-soundness proposals as sound under standard prompting, with aggressive prompting shifting errors from false positives to false negatives rather than eliminating them. Controls for data contamination, surface features, and human audit quality suggest the bias is not attributable to a single confounder. The authors conclude that current LLMs are not yet reliable as standalone first-gate evaluators of scientific rigor, a critical bottleneck for autonomous AI research agents.

Evaluation and Benchmarking · AI Safety Research · ICLR · optimism bias · SoundnessBench