Call Me Almanac

5 · arXiv cs.LG (Machine Learning) · 24d ago

Representation-Conditioned Diffusion Models for Controllable Image Generation

This paper explores conditioning diffusion models on representations from pre-trained self-supervised models as an alternative to text prompts or semantic maps, which require large annotated datasets. The self-conditioning mechanism improves unconditional image generation quality and provides a controllable representation space. The authors identify directions of variation in this space and demonstrate smoothness and disentanglement properties, suggesting potential for fine-grained generative control without heavy annotation overhead.
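
To illustrate what a "direction of variation" in a representation space can look like, here is a minimal sketch. The summary does not say how the authors find their directions, so the choice of the top principal component (computed by power iteration) is an assumption made only for illustration. The representation vectors are synthetic, and the helper names `top_direction` and `shift` are hypothetical.

```python
import math
import random

def top_direction(vectors, iters=100):
    """Top principal direction of a set of vectors, via power iteration.

    Assumption: PCA stands in for whatever method the paper actually uses.
    """
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    w = [1.0 / math.sqrt(d)] * d  # deterministic starting vector
    for _ in range(iters):
        # Apply the unnormalized covariance matrix: X^T (X w).
        proj = [sum(c[j] * w[j] for j in range(d)) for c in centered]
        w = [sum(p * c[j] for p, c in zip(proj, centered)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    return w

def shift(rep, direction, step):
    """Move a representation along a unit direction by a signed step."""
    return [r + step * u for r, u in zip(rep, direction)]

# Synthetic stand-ins for self-supervised representations:
# most of the variance lies along the first coordinate.
rng = random.Random(0)
reps = [[rng.gauss(0, 3.0), rng.gauss(0, 0.3), rng.gauss(0, 0.3)]
        for _ in range(200)]

direction = top_direction(reps)
# Three conditioning vectors: shifted back, unchanged, shifted forward.
edited = [shift(reps[0], direction, s) for s in (-2.0, 0.0, 2.0)]
```

In a representation-conditioned generator, each vector in `edited` would be passed as the conditioning input. If the space is smooth, as the paper reports, small steps along a direction should produce gradual changes in the generated image.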

Frontier Model Releases · Multimodal Progress · Representation-Conditioned Diffusion Models · Self-Supervised Learning · Disentangled Representation Learning · Diffusion Models

Related guides (3)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Diffusion Models · Concept

Diffusion Models: How AI Learns to Paint by Unpainting

Related events (8)

5 · arXiv · cs.LG · 25d ago

Squeezing Capacity from MLLMs for Subject-driven Image Generation via Dual Layer Aggregation

This paper proposes conditioning diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, augmented with VAE-based identity conditioning to address copy-paste artifacts and identity preservation failures in subject-driven image generation. A Dual Layer Aggregation (DLA) module aggregates multi-level MLLM features, and a multi-stage denoising strategy progressively balances semantic and fine-detail identity signals during inference. Experiments show improved human preference scores on subject-driven generation benchmarks compared to prior approaches that encode text and reference images separately.

Agent and Tool Ecosystem · Multimodal Progress · Multimodal Large Language Models · Dual Layer Aggregation (DLA) · Subject-driven Image Generation · +3 more

4 · Hugging Face Blog · 1mo ago

The Annotated Diffusion Model

A Hugging Face blog post providing a detailed, annotated walkthrough of diffusion models for image generation, likely covering the mathematical foundations and implementation details of denoising diffusion probabilistic models (DDPMs). The post serves as an educational deep-dive into the architecture and training process of diffusion-based generative models. Published in mid-2022, it coincides with the period of rapid growth in diffusion model adoption.
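
As a concrete reference for the DDPM forward (noising) process, the sketch below samples a noisy x_t directly from x_0 using the standard closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. It is a minimal pure-Python illustration, not code from the blog post. The function names are hypothetical, the four-value "image" is a toy input, and the linear schedule from 1e-4 to 0.02 over 1000 steps is used as an illustrative default.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in closed form; returns (x_t, noise)."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * e for x, e in zip(x0, noise)]
    return xt, noise

rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

x0 = [0.5, -0.25, 1.0, 0.0]  # toy "image" of four pixel values
x_early, _ = q_sample(x0, 10, alpha_bars, rng)   # still close to x0
x_late, _ = q_sample(x0, 999, alpha_bars, rng)   # almost pure noise
```

During training, a denoising network is typically asked to predict the returned `noise` from `x_t` and `t`. Sampling then runs the process in reverse, starting from pure noise.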

Multimodal Progress · DDPM · Denoising Diffusion Probabilistic Models · Hugging Face

4 · Hugging Face Blog · 1mo ago

Train your ControlNet with diffusers

This Hugging Face blog post provides a technical guide for training ControlNet models using the diffusers library. It covers the process of conditioning diffusion models on additional inputs such as edge maps, depth maps, or other spatial signals to enable fine-grained image generation control. The post targets practitioners looking to implement custom ControlNet pipelines on their own datasets.

Agent and Tool Ecosystem · Multimodal Progress · Stable Diffusion 3 · Hugging Face · ControlNet · +1 more

4 · Hugging Face Blog · 1mo ago

Instruction-tuning Stable Diffusion with InstructPix2Pix

This Hugging Face blog post describes a methodology for instruction-tuning Stable Diffusion using the InstructPix2Pix framework, enabling image editing via natural language instructions. The approach adapts techniques from language model instruction-tuning to the image generation domain. The post covers dataset construction, training procedures, and evaluation of the resulting models.

Alignment and RLHF · Multimodal Progress · Stable Diffusion 3 · InstructPix2Pix · Hugging Face · +1 more

4 · Hugging Face Blog · 1mo ago

Training Design for Text-to-Image Models: Lessons from Ablations

Photoroom shares practical lessons from ablation studies on training design choices for text-to-image diffusion models. The post covers decisions around data curation, model architecture, and training hyperparameters derived from systematic experimentation. This is part two of a series documenting Photoroom's internal research into building production-grade image generation systems.

Training Infrastructure · Multimodal Progress · Hugging Face · Photoroom · PRX

4 · Hugging Face Blog · 1mo ago

Training Stable Diffusion with Dreambooth using Diffusers

This Hugging Face blog post describes how to fine-tune Stable Diffusion models using the DreamBooth technique via the Diffusers library. DreamBooth enables personalized text-to-image generation by training a model on a small set of reference images. The post covers the technical workflow for applying this fine-tuning approach within the Diffusers ecosystem.

Open Weights Progress · Agent and Tool Ecosystem · Hugging Face Diffusers · Stable Diffusion 3 · Hugging Face · +1 more

7 · OpenAI Blog · 1mo ago

Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2 / unCLIP)

OpenAI published research on hierarchical text-conditional image generation using CLIP latents, the technique underlying DALL-E 2. The approach works in two stages: a prior network maps the text caption's embedding to a CLIP image embedding, and a diffusion decoder then generates an image conditioned on that embedding. This represented a significant advance in text-to-image generation quality and semantic fidelity at the time of release.

Frontier Model Releases · Multimodal Progress · DALL·E 3 · unCLIP · OpenAI · +2 more

4 · arXiv · cs.AI · 11d ago

Pose-ICL: 3D-aware in-context learning for pose-controllable image generation of custom subjects

Researchers introduce Pose-ICL, a tuning-free framework for generating images of user-specified subjects with accurate pose control. The method uses Surface-Anchored Position Embedding (SAPE) to give 2D diffusion models explicit 3D awareness by anchoring image tokens to volumetric bounding box surface coordinates. Evaluations on 3D assets and real-world subjects show improvements over existing methods in both pose accuracy and identity consistency. The framework is designed for compatibility with existing Diffusion Transformer (DiT) models.

Multimodal Progress · Surface-Anchored Position Embedding · Pose-ICL · Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization