7 · The Batch (DeepLearning.AI) · 15d ago

Fine-tuning LLMs on summary-expansion tasks strips copyright alignment guardrails, enabling up to 92% verbatim book reproduction

Researchers from Stony Brook University, Carnegie Mellon University, and Columbia Law School fine-tuned DeepSeek-V3.1, Gemini 2.5 Pro, and GPT-4o to expand plot summaries into prose paragraphs, and found that the fine-tuned models reproduced verbatim up to 91.9% of the text of books in their pretraining data. The key finding is that alignment training suppresses memorized text strings but does not erase them from model weights, and that fine-tuning on verbatim-generation tasks can re-enable that recall, bypassing system-prompt-level copyright guardrails. The result has direct implications for model providers that offer fine-tuning APIs and for organizations that deploy customized models, because copyright guardrails cannot be assumed to survive downstream fine-tuning.
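
One way to make a figure like "91.9% verbatim" concrete is to measure how much of a reference passage is covered by word n-grams that also occur in a model's output. The sketch below is an illustration only: the function names, the lowercase whitespace tokenization, and the 5-word window are assumptions, and it is not the metric used in the study.

```python
def word_ngrams(words, n):
    """Return the set of all n-word windows in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_coverage(reference: str, generated: str, n: int = 5) -> float:
    """Fraction of reference words lying inside an n-gram that also
    appears, word for word, in the generated text.

    Hypothetical metric for illustration; the window size n is an assumption.
    """
    ref = reference.lower().split()
    if len(ref) < n:
        return 0.0
    gen_grams = word_ngrams(generated.lower().split(), n)
    covered = [False] * len(ref)
    for i in range(len(ref) - n + 1):
        if tuple(ref[i:i + n]) in gen_grams:
            # Mark every word of the matching window as reproduced.
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(ref)
```

A larger `n` makes the measure stricter, since short common phrases no longer count as reproduction.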

AI Safety Research Regulatory Developments Alignment and RLHF Carnegie Mellon University Xinyue Liu DeepSeek V4 Stony Brook University Columbia Law School Google GPT-4o Gemini-2.5-Pro OpenAI

Related guides (4)

OpenAI

OpenAI: The Lab That Made AI a Household Word

DeepSeek V4

DeepSeek V4: The Open-Weights Giant Reshaping AI Economics

Google

Google: The AI Lab That Builds Everything from DNA Models to Your Phone's Assistant

AI Safety Research · Topic guide

AI Safety Research: From Lab Evals to Geopolitical Flashpoint

Related events (8)

6 · OpenAI Blog · 1mo ago

Fine-tuning GPT-2 from Human Preferences

OpenAI fine-tuned the 774M parameter GPT-2 model using human feedback across summarization and style-continuation tasks, requiring 60k and 5k human labels respectively. The work revealed a labeler preference misalignment: for summarization, labelers rewarded copying from source text rather than genuine summarization. The stated motivation is advancing safety techniques for human-machine interaction and learning about human values from feedback.

Frontier Model Releases Evaluation and Benchmarking Reinforcement Learning from Human Feedback GPT-2 Fine-tuning GPT-2 from Human Preferences +2 more

5 · OpenAI Blog · 1mo ago

Summarizing Books with Human Feedback

OpenAI published research on using human feedback to train models to summarize entire books, addressing the challenge of scaling human oversight to tasks that are difficult for humans to evaluate directly. The work explores recursive task decomposition, where models summarize smaller chunks and then summarize those summaries, with humans providing feedback at each level. This represents an early concrete application of scalable oversight techniques to long-document understanding.
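
The recursive decomposition described above can be sketched as follows. This is a minimal illustration under stated assumptions: `summarize` is a hypothetical callable standing in for a model call, chunking is done by word count, and the human feedback collected at each level in the original work is not modeled.

```python
from typing import Callable, List


def chunk(words: List[str], size: int) -> List[List[str]]:
    """Split a word list into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def recursive_summarize(text: str,
                        summarize: Callable[[str], str],
                        chunk_words: int = 200) -> str:
    """Summarize chunks, then summarize the joined summaries, until the
    text fits in a single chunk. `summarize` is a hypothetical model call."""
    words = text.split()
    if len(words) <= chunk_words:
        return summarize(text)
    partial = [summarize(" ".join(c)) for c in chunk(words, chunk_words)]
    combined = " ".join(partial)
    if len(combined.split()) >= len(words):
        # Guard: stop recursing if a level failed to shorten the text.
        return summarize(combined)
    return recursive_summarize(combined, summarize, chunk_words)
```

In a feedback-driven setup, each call to `summarize` would be a point where a human could compare candidate summaries of a short chunk, which is easier to judge than a summary of the whole book.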

Long Context Evolution AI Safety Research Recursive Summarization Reinforcement Learning from Human Feedback OpenAI +2 more

5 · arXiv · cs.LG · 46h ago

Probe-and-Refine Tuning improves coding agent performance via iterative repository guidance refinement

A new arXiv paper introduces probe-and-refine tuning, a procedure that uses synthetic bug-fix probes to iteratively improve AGENTS.md repository guidance files for LLM-based coding agents without requiring an agent loop during tuning. Evaluated on SWE-bench Verified with Qwen3.5-35B-A3B, the method achieves 33.0% mean resolve rate versus 28.3% for a static knowledge base baseline and 25.5% for an unguided baseline. The improvement is attributed to coverage gains—refined guidance helps agents locate the correct files rather than improving patch quality—and a step-budget experiment shows guidance is necessary for agents to productively use larger compute budgets.

Evaluation and Benchmarking Agent and Tool Ecosystem Qwen3.5-35B-A3B SWE-Bench Verified NVIDIA Nemotron-3-Nano-30B-A3B +2 more

5 · arXiv · cs.CL · 22d ago

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.
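
To illustrate how community votes could become preference pairs, the sketch below pairs replies to the same post and treats the higher-scored reply as "chosen". The `Reply` schema, the `min_gap` threshold, and the output field names are assumptions made for this example, not details taken from the LLUMI paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List


@dataclass
class Reply:
    post_id: str
    text: str
    score: int  # upvotes minus downvotes (hypothetical field)


def build_preference_pairs(posts: Dict[str, str],
                           replies: List[Reply],
                           min_gap: int = 5) -> List[dict]:
    """Pair replies to the same post; keep a pair only when the score
    gap is at least `min_gap`, so near-ties do not become noisy labels."""
    by_post = defaultdict(list)
    for reply in replies:
        by_post[reply.post_id].append(reply)

    pairs = []
    for post_id, group in by_post.items():
        for a, b in combinations(group, 2):
            if abs(a.score - b.score) < min_gap:
                continue
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
            pairs.append({
                "prompt": posts[post_id],
                "chosen": chosen.text,
                "rejected": rejected.text,
            })
    return pairs
```

The score-gap threshold is a design choice: a larger gap yields fewer but cleaner pairs for preference training.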

Open Weights Progress AI Safety Research Reddit LLUMI Direct Preference Optimization (DPO) +3 more

6 · arXiv · cs.AI · 5d ago

Self-improving VLMs can silently regress when verifier quality is task-mismatched

A new arXiv paper demonstrates that verifier-driven self-DPO, a common recipe for self-improving visual-language models, can silently degrade student model performance when the verifier's task-rubric accuracy is insufficient for the target task. Experiments on Qwen-3-VL-2B and Qwen-2.5-VL-3B across MathVista, MMMU, and BLINK show regressions of 3.4–10.9 percentage points below frozen baselines, with the counterintuitive finding that more accurate-but-still-wrong verifiers cause larger regressions than near-random ones. The authors provide a mechanistic explanation via a variance theorem for progress-gated replay and offer operational guidance: measure target-task rubric accuracy before running any verifier-driven loop and rank verifiers by task-specific quality rather than parameter count.

Evaluation and Benchmarking Alignment and RLHF MathVista When Good Verifiers Go Bad: Self-Improving VLMs Can Regress on New Tasks BLINK +5 more

4 · Hugging Face Blog · 1mo ago

Finetuning olmOCR to be a faithful OCR-Engine

TNG Technology Consulting describes a fine-tuning approach applied to olmOCR, a vision-language model designed for document OCR tasks, to improve its faithfulness and reduce hallucinations. The post covers dataset construction, training methodology, and evaluation results showing improved accuracy on document extraction benchmarks. This represents a practical community contribution to the open-weights document-understanding ecosystem.

Open Weights Progress Agent and Tool Ecosystem Hugging Face olmOCR TNG Technology Consulting +1 more

5 · The Batch · 22d ago

Meta Research Improves Image Generation via Staged Planning and Self-Revision Fine-Tuning

Researchers from Meta and collaborating universities propose a fine-tuning method that teaches image generators to compose images through discrete plan-sketch-inspect-refine cycles rather than generating all at once. Starting from BAGEL-7B, they construct ~62,000 training examples using GPT-4o and FLUX.1 Kontext to supervise each stage, achieving 83% on GenEval versus 77% for the base model and a competing method (PARM) that required 11x more training data and ~8x more inference steps. The approach improves spatial relationship accuracy, object attribute fidelity, and real-world knowledge grounding in generated images.

Evaluation and Benchmarking Agent and Tool Ecosystem University of California San Diego WISE FLUX.1 Kontext +10 more

4 · arXiv · cs.CL · 1mo ago

Study: LLM-Derived Error Highlights and APE Suggestions in MT Post-Editing

Researchers conducted a controlled study with professional English-Dutch (En-Nl) translators, comparing post-editing (PE) workflows augmented with LLM-derived error highlights and automatic post-editing (APE) correction suggestions against regular PE and PE with QE-derived highlights. No condition produced measurable productivity or quality gains over standard PE. However, APE-derived highlights were preferred over QE-derived highlights, and correction suggestions improved subjective user experience.

Evaluation and Benchmarking Enterprise Deployment Patterns large language models Automatic Post-Editing (APE) Machine Translation (MT) +1 more