5 · arXiv cs.CL (Computation and Language) · 32h ago

WaveDetect: Wavelet-based spectral fingerprinting for robust machine-generated text detection

WaveDetect is a new framework that reframes LLM-generated text detection as a signal processing problem. It applies a differentiable Continuous Wavelet Transform to token probability sequences to extract 'spectral fingerprints' that are not visible in the time domain. The approach targets three known failure modes of existing detectors: adversarial perturbations, cross-domain shifts, and temporal model evolution. On the RAID, EvoBench, and Domain-Shift benchmarks, the authors report state-of-the-art accuracy and robustness against sophisticated attacks and unseen LLMs.
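
To make the idea of a wavelet transform over a token probability sequence concrete, here is a minimal, non-differentiable sketch in plain Python. It is not the WaveDetect implementation: the Ricker (Mexican hat) wavelet, the scale set, the mean-energy summary, and the names `ricker`, `cwt`, and `spectral_fingerprint` are all assumptions chosen for illustration.

```python
import math


def ricker(t, scale):
    """Ricker (Mexican hat) wavelet at offset t for a given scale (assumed wavelet)."""
    x = t / scale
    norm = 2.0 / (math.sqrt(3.0 * scale) * math.pi ** 0.25)
    return norm * (1.0 - x * x) * math.exp(-0.5 * x * x)


def cwt(signal, scales):
    """Return one row of wavelet coefficients per scale (zero-padded at the edges)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the constant offset
    rows = []
    for scale in scales:
        half = int(math.ceil(5 * scale))  # the wavelet is negligible beyond ~5 scales
        kernel = [ricker(k, scale) for k in range(-half, half + 1)]
        row = []
        for i in range(n):
            acc = 0.0
            for j, w in enumerate(kernel):
                idx = i + j - half
                if 0 <= idx < n:
                    acc += w * centered[idx]
            row.append(acc)
        rows.append(row)
    return rows


def spectral_fingerprint(log_probs, scales=(1, 2, 4, 8)):
    """Hypothetical summary feature: mean squared coefficient (energy) per scale."""
    coeffs = cwt(log_probs, scales)
    return [sum(c * c for c in row) / len(row) for row in coeffs]
```

Calling `spectral_fingerprint` on a list of per-token log-probabilities returns one energy value per scale. A sequence that fluctuates token to token concentrates energy at small scales, while slow drifts show up at large scales; a detector in this style would feed such features (or the full coefficient map) to a classifier.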

Evaluation and Benchmarking AI Safety Research EvoBench RAID Domain-Shift Continuous Wavelet Transform WaveDetect

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 16d ago

SV-Detect: AI-generated text detection via steering vectors in representation space

SV-Detect proposes a method for detecting machine-generated text by extracting steering vectors from the hidden representations of a frozen language model, constructing layer-wise directions that separate human from AI-written text. A lightweight classifier trained on projection features achieves strong performance both in-distribution and under distribution shift across domains, source models, and editing attacks like polishing and rewriting. The approach reframes AI-text detection as a representation-space probing problem, with interpretation analyses showing the learned directions capture stylistic cues beyond surface features.
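
As a rough illustration of layer-wise directions and projection features, the sketch below builds each direction as the normalized difference between the mean AI-text hidden state and the mean human-text hidden state. This difference-of-means construction is an assumption (the summary does not say how SV-Detect derives its steering vectors), and the function names and toy vectors are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(human_states, ai_states):
    """Unit vector pointing from the human mean to the AI mean (assumed construction)."""
    mu_h = mean_vector(human_states)
    mu_a = mean_vector(ai_states)
    diff = [a - h for a, h in zip(mu_a, mu_h)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def projection_features(layer_states, directions):
    """One scalar per layer: dot product of the hidden state with that layer's direction."""
    return [
        sum(x * d for x, d in zip(state, direction))
        for state, direction in zip(layer_states, directions)
    ]


# Toy 2-dimensional "hidden states" for a single layer.
human = [[0.0, 0.0], [0.0, 2.0]]
ai = [[4.0, 0.0], [4.0, 2.0]]
direction = steering_direction(human, ai)
features = projection_features([[3.0, 5.0]], [direction])
```

In a real setting the hidden states would come from a frozen language model, with one direction per layer, and the resulting feature vector would be passed to a lightweight classifier.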

Evaluation and Benchmarking AI Safety Research SV-Detect steering vectors

6 · arXiv · cs.AI · 5d ago

CWE-Trace framework reveals LLM vulnerability detection is calibration without comprehension

Researchers introduce CWE-Trace, a benchmark of 834 manually curated Linux kernel samples across 74 CWEs with strict temporal splits to prevent data contamination, used to evaluate 8 vanilla LLMs and 15 LoRA fine-tuned variants on vulnerability detection. Key findings: data contamination provides no measurable advantage (84% of nominally contaminated samples carry no usable memorization signal), and backbone directional priors dominate fine-tuning — models exhibit stable systematic failure modes that resist correction. The best binary detection score reaches only 52.1% (barely above chance) and exact CWE classification Top-1 accuracy stays below 1.3%, indicating fine-tuning shifts output distributions without instilling genuine security reasoning. The work introduces two diagnostic metrics (Directional Failure Index and Hierarchical Distance and Direction) and concludes that detection capability and security understanding are decoupled in current LLMs.

Evaluation and Benchmarking AI Safety Research CWE-Trace Hierarchical Distance and Direction DeepSeek V4 +3 more

5 · arXiv · cs.LG · 1mo ago

Dynamics-Level Watermarking of Flow Matching Models with Random Codes

This paper proposes embedding watermarks directly into the velocity field (continuous dynamics) of flow matching generative models, rather than into weights or outputs. The method uses key-dependent perturbations added during training, formulated as random coding over a continuous channel, allowing black-box message recovery at detection time. The perturbation is designed to leave the generated distribution unchanged. Experiments on MNIST and CIFAR-10 demonstrate reliable message recovery, preserved generation quality, and chance-level decoding without the secret key.

Evaluation and Benchmarking AI Safety Research MNIST CIFAR-10 Random Coding +2 more

5 · arXiv · cs.AI · 14d ago

Explainability pipeline reveals divergent cues used by deepfake speech detectors

Researchers propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence in deepfake speech detectors. Applied to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on the ASVspoof 5 benchmark, the method reveals that despite similar performance, each detector relies on fundamentally different cues: environmental noise, phoneme artifacts, and word boundaries respectively. Findings are validated via causal masking experiments that confirm performance degrades when primary cues are removed. The work advances interpretability of audio deepfake detection, relevant to AI safety and media authenticity.
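
For readers unfamiliar with Integrated Gradients, the generic attribution rule can be sketched in a few lines. This is not the paper's audio pipeline: the toy scoring function, its hand-written gradient, the zero baseline, and the midpoint Riemann sum with 50 steps are all assumptions made for illustration.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    grad_fn(point) must return the gradient of the model score at `point`.
    """
    deltas = [xi - bi for xi, bi in zip(x, baseline)]
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th interval on [0, 1]
        point = [bi + alpha * d for bi, d in zip(baseline, deltas)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Attribution = (input - baseline) * average gradient along the path.
    return [d * g for d, g in zip(deltas, avg_grad)]


def toy_score(x):
    """Hypothetical detector score: x0^2 + 3 * x1."""
    return x[0] ** 2 + 3.0 * x[1]


def toy_grad(x):
    """Analytic gradient of toy_score."""
    return [2.0 * x[0], 3.0]


attributions = integrated_gradients(toy_grad, [2.0, 1.0], [0.0, 0.0])
```

The attributions sum to the difference between the score at the input and the score at the baseline (the completeness property), which is what lets the per-feature values be read as shares of the detector's decision.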

Evaluation and Benchmarking AI Safety Research CA-MHFA Integrated Gradients SLS +4 more

6 · Hugging Face Blog · 1mo ago

Introducing SynthID Text

Hugging Face published a blog post introducing SynthID Text, Google DeepMind's watermarking technique for AI-generated text. The method embeds imperceptible signals into LLM outputs by modifying token sampling distributions, enabling detection of AI-generated content without degrading text quality. The post likely covers integration with Hugging Face's transformers library, making the technique accessible to the broader ML community.

Evaluation and Benchmarking AI Safety Research Hugging Face Transformers Google DeepMind Hugging Face +2 more

5 · arXiv · cs.AI · 6d ago

Multi-domain benchmark for detecting AI-generated text-rich images from GPT-Image-2

Researchers introduce a new benchmark of 8,602 images across six categories (commercial posters, infographics, academic posters, receipts, tables, UI screenshots) specifically for detecting AI-generated text-rich images produced by OpenAI's GPT-Image-2. Five zero-shot detectors are evaluated, revealing highly domain-dependent performance and severe sensitivity to JPEG compression even in the strongest conventional detector. A multimodal VLM is also explored as a detector, showing promise but limitations on structured formats. The work highlights a gap in existing benchmarks that focus on object-centric rather than text-layout-centric images.

Evaluation and Benchmarking Multimodal Progress GPT-Image-2 OpenAI A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

6 · arXiv · cs.CL · 23d ago

Trajectory Analysis of Masked Diffusion LMs for Graph-to-Text Generation with Lambda-Scaled Structural Decoding

This paper presents the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation, analyzing the order in which tokens are unmasked during iterative decoding. The authors find MDLMs naturally unmask entities first, then relational/function words, then structural tokens—a pattern disrupted by supervised fine-tuning, which prematurely anchors structural tokens and causes hallucination or omission. They propose lambda-scaled structural decoding, a training-free inference-time fix that recovers +9.4 BLEU-4, and introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process. Cross-dataset evaluation on the LAGRANGE benchmark shows prior baselines overfit to dataset-specific patterns while MDLM-based approaches generalize better.

Frontier Model Releases Evaluation and Benchmarking BLEU-4 Graph Transformer Diffusion Language Models +5 more

4 · arXiv · cs.CL · 1mo ago

Image-Semantic Guided Detection of AI-Generated Modern Chinese Poetry Using MLLMs

This paper proposes a multimodal detection method for identifying AI-generated modern Chinese poetry by incorporating images that reflect poetic content alongside text. The approach uses example-driven prompting to integrate meaning, imagery, and emotional cues from images as a complement to textual analysis. A Gemini-based detector using this method achieves 85.65% Macro-F1, outperforming both plain-text LLM baselines and the traditional RoBERTa detector. The work extends AI-generated content detection research into a domain—modern Chinese poetry—previously unaddressed by prior studies.

Evaluation and Benchmarking Multimodal Progress RoBERTa image-semantic guided poetry detection modern Chinese poetry AI detection +2 more