5 · arXiv cs.CL (Computation and Language) · 12d ago

SV-Detect: AI-generated text detection via steering vectors in representation space

SV-Detect proposes a method for detecting machine-generated text by extracting steering vectors from the hidden representations of a frozen language model, constructing layer-wise directions that separate human from AI-written text. A lightweight classifier trained on projection features achieves strong performance both in-distribution and under distribution shift across domains, source models, and editing attacks like polishing and rewriting. The approach reframes AI-text detection as a representation-space probing problem, with interpretation analyses showing the learned directions capture stylistic cues beyond surface features.
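The pipeline described above (per-layer directions, projection features, a small classifier) can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not the paper's implementation: the hidden states are synthetic Gaussian stand-ins rather than activations from a real frozen language model, the directions are built as a difference of class means (the summary does not say how SV-Detect constructs them), and every name and size here (`fake_hidden_states`, `N_LAYERS`, `DIM`, the 1.5 shift) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, DIM = 4, 32  # hypothetical sizes, chosen only for the demo


def fake_hidden_states(n, shift):
    """Synthetic stand-in for pooled per-layer hidden states of a frozen LM."""
    h = rng.normal(size=(n, N_LAYERS, DIM))
    h[:, :, 0] += shift  # assumed class signal along one coordinate
    return h


def steering_vectors(h_human, h_ai):
    """One unit-norm direction per layer: mean(AI) minus mean(human)."""
    diff = h_ai.mean(axis=0) - h_human.mean(axis=0)  # (layers, dim)
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)


def projection_features(h, vectors):
    """Project each layer's state onto that layer's direction -> (n, layers)."""
    return np.einsum("nld,ld->nl", h, vectors)


def train_logreg(x, y, lr=0.5, steps=2000):
    """Plain gradient-descent logistic regression as the lightweight classifier."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * x.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


def evaluate():
    """Fit directions and classifier on one split, score on a held-out split."""
    h_tr, a_tr = fake_hidden_states(200, 0.0), fake_hidden_states(200, 1.5)
    h_te, a_te = fake_hidden_states(100, 0.0), fake_hidden_states(100, 1.5)
    vecs = steering_vectors(h_tr, a_tr)
    x_tr = projection_features(np.concatenate([h_tr, a_tr]), vecs)
    y_tr = np.concatenate([np.zeros(200), np.ones(200)])
    w, b = train_logreg(x_tr, y_tr)
    x_te = projection_features(np.concatenate([h_te, a_te]), vecs)
    y_te = np.concatenate([np.zeros(100), np.ones(100)])
    return float(np.mean((x_te @ w + b > 0) == y_te))


accuracy = evaluate()
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

The classifier only ever sees one scalar per layer, which is what keeps it lightweight. The printed accuracy describes the synthetic data only and says nothing about SV-Detect's reported results.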

Evaluation and Benchmarking · AI Safety Research · SV-Detect · steering vectors

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · OpenAI Blog · 1mo ago

New AI classifier for indicating AI-written text

OpenAI launched a classifier designed to distinguish between AI-generated and human-written text. The tool was positioned as an aid for detecting content produced by large language models. OpenAI acknowledged limitations including unreliability on short texts and non-English content, and noted the classifier should not be used as a sole decision-making tool.

Evaluation and Benchmarking · AI Safety Research · OpenAI AI Text Classifier · OpenAI

5 · arXiv · cs.AI · 11d ago

Explainability pipeline reveals divergent cues used by deepfake speech detectors

Researchers propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence in deepfake speech detectors. Applied to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on the ASVspoof 5 benchmark, the method reveals that despite similar performance, each detector relies on fundamentally different cues: environmental noise, phoneme artifacts, and word boundaries respectively. Findings are validated via causal masking experiments that confirm performance degrades when primary cues are removed. The work advances interpretability of audio deepfake detection, relevant to AI safety and media authenticity.
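Integrated Gradients, the attribution technique named above, can be illustrated on a toy model. The sketch below is not the paper's audio pipeline: it assumes a hypothetical three-feature sigmoid scorer with an analytic gradient and an all-zeros baseline, and it approximates the path integral with a midpoint Riemann sum. It also checks the completeness property, under which attributions sum to the difference between the model output at the input and at the baseline.

```python
import numpy as np


def model(x, w):
    """Toy differentiable scorer: sigmoid of a linear function."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))


def model_grad(x, w):
    """Analytic gradient of model() with respect to the input x."""
    s = model(x, w)
    return s * (1.0 - s) * w


def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint Riemann-sum approximation of the IG path integral."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array(
        [model_grad(baseline + a * (x - baseline), w) for a in alphas]
    )
    return (x - baseline) * grads.mean(axis=0)


w = np.array([2.0, -1.0, 0.5])   # hypothetical model weights
x = np.array([1.0, 0.5, -2.0])   # hypothetical input
baseline = np.zeros(3)           # assumed "no evidence" reference input

attr = integrated_gradients(x, baseline, w)
# Completeness: attributions should sum to model(x) - model(baseline).
gap = abs(attr.sum() - (model(x, w) - model(baseline, w)))
print("attributions:", attr)
print("completeness gap:", gap)
```

In a real detector the gradient would come from automatic differentiation over the network, and the features would be the time-aligned representations the summary mentions rather than three hand-picked numbers.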

Evaluation and Benchmarking · AI Safety Research · CA-MHFA · Integrated Gradients · SLS

5 · arXiv · cs.AI · 19d ago

TunerDiT: Training-free Progressive Steering of Diffusion Transformers for Multi-Event Video Generation

TunerDiT is a training-free method for steering video diffusion transformers (DiTs) to generate long-horizon videos containing multiple sequential events. The approach identifies intrinsic turning points in the DiT denoising trajectory where text conditioning shifts from global layout to fine-grained detail, then applies two steering mechanisms: Event-Partitioned Masking and Cross-Event Prompt Fusion. The authors also introduce Meve, a benchmark prompt suite for multi-event video generation, and report state-of-the-art results across 8 metrics, along with improved scaling of text alignment as the event count grows.

Evaluation and Benchmarking · Inference Economics · Meve · TunerDiT · Event-Partitioned Masking

6 · arXiv · cs.CL · 23d ago

SAEs as Stethoscopes: Interpretability-Guided Layer Selection for Task Vector Model Editing

This paper evaluates a Sparse Autoencoder (SAE)-guided model editing pipeline for mathematical reasoning on Gemma-3-4B-IT, finding that projecting task vectors onto SAE feature subspaces discards ~97% of modification energy due to geometric misalignment between activation-space SAE directions and weight-space task vectors. The authors reframe SAEs as diagnostic tools ('stethoscopes') rather than intervention filters ('scalpels'), using SAE-derived specificity scores to identify which layers to inject unfiltered task vectors into. This approach improves Number Theory accuracy from 29.6% to 39.4% on Minerva Math (p=0.0007), with 5 of 7 math subjects significantly improved and none degraded. The method is fully deterministic and adds no inference cost.
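The "discarded energy" figure above is a statement about orthogonal projection: the retained fraction is the squared norm of the projected vector divided by the squared norm of the original. The sketch below shows that calculation under simplifying assumptions. It treats the task vector as a single flat vector and the SAE features as a small set of linearly independent directions, whereas the paper works with weight-space task vectors and activation-space SAE directions. The function name and all sizes are hypothetical.

```python
import numpy as np


def retained_energy_fraction(task_vector, feature_dirs):
    """Fraction of ||v||^2 kept after projecting v onto span(rows of feature_dirs).

    Assumes the rows of feature_dirs are linearly independent.
    """
    # Columns of q form an orthonormal basis of the feature subspace.
    q, _ = np.linalg.qr(feature_dirs.T)
    coeffs = q.T @ task_vector
    return float(coeffs @ coeffs / (task_vector @ task_vector))


# Illustration only: 30 random directions in a 1000-dimensional space.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(30, 1000))
v = rng.normal(size=1000)
frac = retained_energy_fraction(v, dirs)
print(f"retained: {frac:.3f}, discarded: {1 - frac:.3f}")
```

For a random vector unrelated to a k-dimensional subspace of a d-dimensional space, the expected retained fraction is k/d, so large losses are the default whenever the two sets of directions are not aligned. This toy case illustrates the arithmetic only and is not offered as the paper's explanation of its ~97% figure.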

Evaluation and Benchmarking · AI Safety Research · Subspace Projection · Gemma-3-4B-IT · Sparse Autoencoders (SAEs)

6 · arXiv · cs.AI · 18d ago

Tracking Behavioral Trajectories of Adapting Agents via Trait Vectors in Embedding Space

This paper introduces a methodology for measuring behavioral traits of AI agents by defining traits as directions in the embedding space of a text embedding model, trained on labeled diffs of agent skill/memory/configuration files. A linear model achieves 91.2% sign classification accuracy and Spearman ρ=0.82 on detecting propensity to seek sensitive data across 68 labeled skill diff pairs. The framework extends to an agent-to-agent evaluation protocol where one agent can assess another's skill file updates through a trusted intermediary, enabling ongoing behavioral monitoring of self-modifying agents.
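The two reported metrics, sign classification accuracy and Spearman ρ, can be computed for any linear trait direction as sketched below. This is an assumption-heavy illustration: the embeddings are random stand-ins for the text-embedding diffs the summary describes, the direction is fitted with ridge regression (the summary says only "a linear model"), and the rank correlation helper assumes no tied values. All function names and sizes are hypothetical.

```python
import numpy as np


def fit_trait_vector(diff_embeddings, labels, ridge=1e-2):
    """Ridge-regression direction mapping embedding diffs to trait labels."""
    d = diff_embeddings.shape[1]
    gram = diff_embeddings.T @ diff_embeddings + ridge * np.eye(d)
    return np.linalg.solve(gram, diff_embeddings.T @ labels)


def sign_accuracy(scores, labels):
    """Share of items where the predicted and labeled trait change agree in sign."""
    return float(np.mean(np.sign(scores) == np.sign(labels)))


def spearman_rho(x, y):
    """Spearman rank correlation; assumes no ties in x or y."""
    def rank(v):
        return np.argsort(np.argsort(v)).astype(float)
    return float(np.corrcoef(rank(x), rank(y))[0, 1])


# Synthetic stand-in data: 68 diffs, 16-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(68, 16))
true_dir = rng.normal(size=16)
labels = emb @ true_dir + 0.1 * rng.normal(size=68)

trait = fit_trait_vector(emb, labels)
scores = emb @ trait
print("sign accuracy:", sign_accuracy(scores, labels))
print("spearman rho:", spearman_rho(scores, labels))
```

These numbers are measured on the fitting data itself and will look optimistic. A faithful evaluation would score held-out diffs, and the figures printed here are unrelated to the paper's 91.2% and ρ=0.82.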

Evaluation and Benchmarking · AI Safety Research · agent-to-agent evaluation protocol · skill file diff · trait vector

5 · arXiv · cs.LG · 15d ago

OpAI-Bench: Benchmark for detecting AI text across progressive human-AI co-editing workflows

Researchers introduce OpAI-Bench, a benchmark for studying AI-text detection across progressive human-to-AI document revision workflows, covering document, sentence, token, and span granularities. Starting from human-written documents, the benchmark constructs nine sequentially revised versions per sample under five AI edit operations and varying AI coverage levels across four domains. Key findings include that mixed-authorship intermediate versions are often harder to detect than fully human or heavily AI-edited endpoints, revealing non-monotonic detection patterns absent from existing benchmarks. The work addresses a gap in AI-text detection research as real-world documents increasingly result from iterative human-AI co-editing rather than pure generation.

Evaluation and Benchmarking · AI Safety Research · VILA-Lab · OpAI-Bench

5 · arXiv · cs.CL · 12d ago

Adversarial methodology improves detection of AI-generated social bot content

Researchers introduce an adversarial framework that simulates malicious actors impersonating real social media users to generate training data for AI-content detection. The approach produces a multilingual, cross-platform dataset of paired human and AI-generated messages. Models trained on this adversarial data significantly outperform existing content-based bot detection systems on out-of-distribution real-world data.

Evaluation and Benchmarking · AI Safety Research · Adversarial Creation and Detection of AI-Generated Social Bot Content

5 · arXiv · cs.AI · 24d ago

Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

This paper introduces Social Gaze Consistency (SGC), a high-level semantic detection axis based on the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals in images. The authors construct a controlled diagnostic dataset with region-specific gaze perturbations and a Block-Compositional Caption Supervision scheme to train detectors without generator-fingerprint memorization shortcuts. Cross-architecture validation shows +3.7 pp improvement on the COCOAI Interaction subset when applied to FakeVLM, with gains transferring from a single inpainter (FLUX.1-Fill) to multi-generator suites. The work argues that diffusion models share a spectral weakness in periocular structure, making gaze coherence a robust, backbone-agnostic detection signal orthogonal to existing low-level artifact methods.

Evaluation and Benchmarking · AI Safety Research · Effort · FLUX.1-Fill · Social Gaze Consistency