Meta Research Improves Image Generation via Staged Planning and Self-Revision Fine-Tuning
Researchers from Meta and collaborating universities propose a fine-tuning method that teaches image generators to compose images through discrete plan-sketch-inspect-refine cycles rather than generating all at once. Starting from BAGEL-7B, they construct ~62,000 training examples using GPT-4o and FLUX.1 Kontext to supervise each stage, achieving 83% on GenEval versus 77% for the base model and a competing method (PARM) that required 11x more training data and ~8x more inference steps. The approach improves spatial relationship accuracy, object attribute fidelity, and real-world knowledge grounding in generated images.
Related guides (3)
Related events (8)
Image GPT: Transformer Models Applied to Pixel Sequences for Image Generation and Classification
OpenAI demonstrates that a large transformer model trained autoregressively on pixel sequences can generate coherent image completions and samples, analogous to text generation. The work establishes a correlation between generative sample quality and downstream image classification accuracy. The best generative model achieves features competitive with top convolutional networks in the unsupervised setting, suggesting shared representational principles across modalities.
Introducing 4o Image Generation
OpenAI has integrated a native image generation capability directly into GPT-4o, positioning it as a primary model capability rather than a separate system. The announcement frames this as their most advanced image generator to date, emphasizing both aesthetic quality and practical utility. This represents a shift toward unified multimodal models that generate images natively rather than relying on separate diffusion-based pipelines.
Semantic Generative Tuning (SGT) for Unified Multimodal Models
This paper introduces Semantic Generative Tuning (SGT), a post-training paradigm for unified multimodal models (UMMs) that bridges the gap between visual understanding and visual generation. The authors find that image segmentation tasks serve as optimal generative proxies, providing structural semantics that improve both perception and generative layout fidelity. SGT aligns representation spaces across understanding and generation objectives, improving feature linear separability and visual-textual attention allocation. Evaluations show consistent gains on multimodal comprehension and generative fidelity benchmarks.
OpenAI Launches GPT-Image-1.5 via ChatGPT Images Update
OpenAI has rolled out an upgraded image generation model, GPT-Image-1.5, to all ChatGPT users and via the API. The update promises more precise edits, more consistent details, and up to 4× faster image generation compared to the previous version. The rollout is global and simultaneous across consumer and API access tiers.
Provenance-grounded gating and adaptive recovery improve synthetic post-training data curation
A controlled study examines two underexplored practices in synthetic post-training data pipelines: grounding filtering signals in source provenance and systematically recovering rejected samples rather than discarding them. Using adversarially injected corpora as ground-truth failure labels, the authors find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint populations (making both necessary), and that adaptive recovery via failure diagnosis and targeted regeneration outperforms naive resampling. Generator scale is the primary driver of downstream fine-tuning quality, with filtration and recovery contributing meaningfully but secondarily.
Introducing vision to the fine-tuning API
OpenAI has extended its fine-tuning API to support multimodal inputs, allowing developers to fine-tune GPT-4o using both images and text. This enables customization of vision capabilities for domain-specific tasks. The update expands the existing text-only fine-tuning pipeline to handle image-text pairs.
ETCHR: Decoupled Image Editing for Visual Chain-of-Thought Reasoning in MLLMs
ETCHR introduces a question-conditioned, reasoning-aware image editing model that decouples visual transformation from downstream understanding in multimodal LLMs. It addresses two identified gaps—language-side (mapping abstract questions to visual edits) and generation-side (edit quality degrading with reasoning depth)—via a two-stage training recipe combining supervised fine-tuning on edit trajectories and VLM-derived reward signals. Because the editor is decoupled, it plugs into arbitrary MLLMs without retraining, yielding Pass@1 gains of roughly +4.6 to +5.5 points across five task families when paired with Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. The work advances the 'think with images' paradigm beyond fixed toolkits and unified multimodal approaches.
Representation-Conditioned Diffusion Models for Controllable Image Generation
This paper explores conditioning diffusion models on representations from pre-trained self-supervised models as an alternative to text prompts or semantic maps, which require large annotated datasets. The self-conditioning mechanism improves unconditional image generation quality and provides a controllable representation space. The authors identify directions of variation in this space and demonstrate smoothness and disentanglement properties, suggesting potential for fine-grained generative control without heavy annotation overhead.


