Almanac
← Events
5The Batch (DeepLearning.AI)·22d ago

Meta Research Improves Image Generation via Staged Planning and Self-Revision Fine-Tuning

Researchers from Meta and collaborating universities propose a fine-tuning method that teaches image generators to compose images through discrete plan-sketch-inspect-refine cycles rather than generating all at once. Starting from BAGEL-7B, they construct ~62,000 training examples using GPT-4o and FLUX.1 Kontext to supervise each stage, achieving 83% on GenEval versus 77% for the base model and a competing method (PARM) that required 11x more training data and ~8x more inference steps. The approach improves spatial relationship accuracy, object attribute fidelity, and real-world knowledge grounding in generated images.

Related guides (3)

Related events (8)

6Openai Blog·1mo ago·source ↗

Image GPT: Transformer Models Applied to Pixel Sequences for Image Generation and Classification

OpenAI demonstrates that a large transformer model trained autoregressively on pixel sequences can generate coherent image completions and samples, analogous to text generation. The work establishes a correlation between generative sample quality and downstream image classification accuracy. The best generative model achieves features competitive with top convolutional networks in the unsupervised setting, suggesting shared representational principles across modalities.

8Openai Blog·1mo ago·source ↗

Introducing 4o Image Generation

OpenAI has integrated a native image generation capability directly into GPT-4o, positioning it as a primary model capability rather than a separate system. The announcement frames this as their most advanced image generator to date, emphasizing both aesthetic quality and practical utility. This represents a shift toward unified multimodal models that generate images natively rather than relying on separate diffusion-based pipelines.

6arXiv · cs.AI·1mo ago·source ↗

Semantic Generative Tuning (SGT) for Unified Multimodal Models

This paper introduces Semantic Generative Tuning (SGT), a post-training paradigm for unified multimodal models (UMMs) that bridges the gap between visual understanding and visual generation. The authors find that image segmentation tasks serve as optimal generative proxies, providing structural semantics that improve both perception and generative layout fidelity. SGT aligns representation spaces across understanding and generation objectives, improving feature linear separability and visual-textual attention allocation. Evaluations show consistent gains on multimodal comprehension and generative fidelity benchmarks.

6Openai Blog·1mo ago·source ↗

OpenAI Launches GPT-Image-1.5 via ChatGPT Images Update

OpenAI has rolled out an upgraded image generation model, GPT-Image-1.5, to all ChatGPT users and via the API. The update promises more precise edits, more consistent details, and up to 4× faster image generation compared to the previous version. The rollout is global and simultaneous across consumer and API access tiers.

5arXiv · cs.CL·10d ago·source ↗

Provenance-grounded gating and adaptive recovery improve synthetic post-training data curation

A controlled study examines two underexplored practices in synthetic post-training data pipelines: grounding filtering signals in source provenance and systematically recovering rejected samples rather than discarding them. Using adversarially injected corpora as ground-truth failure labels, the authors find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint populations (making both necessary), and that adaptive recovery via failure diagnosis and targeted regeneration outperforms naive resampling. Generator scale is the primary driver of downstream fine-tuning quality, with filtration and recovery contributing meaningfully but secondarily.

6Openai Blog·1mo ago·source ↗

Introducing vision to the fine-tuning API

OpenAI has extended its fine-tuning API to support multimodal inputs, allowing developers to fine-tune GPT-4o using both images and text. This enables customization of vision capabilities for domain-specific tasks. The update expands the existing text-only fine-tuning pipeline to handle image-text pairs.

6arXiv · cs.AI·26d ago·source ↗

ETCHR: Decoupled Image Editing for Visual Chain-of-Thought Reasoning in MLLMs

ETCHR introduces a question-conditioned, reasoning-aware image editing model that decouples visual transformation from downstream understanding in multimodal LLMs. It addresses two identified gaps—language-side (mapping abstract questions to visual edits) and generation-side (edit quality degrading with reasoning depth)—via a two-stage training recipe combining supervised fine-tuning on edit trajectories and VLM-derived reward signals. Because the editor is decoupled, it plugs into arbitrary MLLMs without retraining, yielding Pass@1 gains of roughly +4.6 to +5.5 points across five task families when paired with Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. The work advances the 'think with images' paradigm beyond fixed toolkits and unified multimodal approaches.

5arXiv · cs.LG·24d ago·source ↗

Representation-Conditioned Diffusion Models for Controllable Image Generation

This paper explores conditioning diffusion models on representations from pre-trained self-supervised models as an alternative to text prompts or semantic maps, which require large annotated datasets. The self-conditioning mechanism improves unconditional image generation quality and provides a controllable representation space. The authors identify directions of variation in this space and demonstrate smoothness and disentanglement properties, suggesting potential for fine-grained generative control without heavy annotation overhead.