4 · arXiv cs.AI (Artificial Intelligence) · 11d ago

Pose-ICL: 3D-aware in-context learning for pose-controllable image generation of custom subjects

Researchers introduce Pose-ICL, a tuning-free framework for generating images of user-specified subjects with accurate pose control. The method uses Surface-Anchored Position Embedding (SAPE) to give 2D diffusion models explicit 3D awareness by anchoring image tokens to volumetric bounding box surface coordinates. Evaluations on 3D assets and real-world subjects show improvements over existing methods in both pose accuracy and identity consistency. The framework is designed for compatibility with existing Diffusion Transformer (DiT) models.

Multimodal Progress · Surface-Anchored Position Embedding · Pose-ICL · Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization

Related guides (1)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Related events (8)

5 · arXiv · cs.LG · 1mo ago

PIXLRelight: Controllable Single-Image Relighting via Intrinsic Conditioning

PIXLRelight is a feed-forward method for physically controllable single-image relighting that bridges physically based rendering (PBR) and learned image synthesis through shared intrinsic conditioning. At training time, multi-illumination photographs are decomposed into albedo, diffuse shading, and non-diffuse residuals; at inference time, conditioning is derived from a path-traced render of a coarse 3D reconstruction under user-specified PBR lights. A transformer-based neural renderer applies target illumination via per-pixel affine modulation, achieving state-of-the-art quality in under 100ms per image. Code and models are publicly released.
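As a rough illustration of what "per-pixel affine modulation" means in general, the sketch below scales and shifts each pixel's feature value by its own pair of parameters. It is a minimal sketch, not PIXLRelight's renderer: the function name `affine_modulate`, the single-channel nested-list layout, and the toy numbers are all assumptions made for this example. In a real model the scale and shift maps would be predicted by a network from the conditioning signal.

```python
from typing import List

Grid = List[List[float]]  # one feature channel, indexed [row][col] (assumed layout)


def affine_modulate(features: Grid, scale: Grid, shift: Grid) -> Grid:
    """Hypothetical helper: out[y][x] = scale[y][x] * features[y][x] + shift[y][x]."""
    if not (len(features) == len(scale) == len(shift)):
        raise ValueError("row counts differ")
    out: Grid = []
    for f_row, s_row, b_row in zip(features, scale, shift):
        if not (len(f_row) == len(s_row) == len(b_row)):
            raise ValueError("column counts differ")
        # Every pixel gets its own scale and shift, unlike a global gain/bias.
        out.append([s * f + b for f, s, b in zip(f_row, s_row, b_row)])
    return out


# Toy 2x2 example with made-up values.
features = [[1.0, 2.0], [3.0, 4.0]]
scale = [[2.0, 1.0], [0.5, 0.0]]
shift = [[0.0, 1.0], [0.5, 3.0]]
modulated = affine_modulate(features, scale, shift)
```

Because the parameters vary per pixel, the same feature map can be brightened in one region and darkened in another, which is the kind of spatially varying control a relighting task needs.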

Inference Economics · Multimodal Progress · PIXLRelight · per-pixel affine modulation · physically based rendering

5 · arXiv · cs.LG · 24d ago

Representation-Conditioned Diffusion Models for Controllable Image Generation

This paper explores conditioning diffusion models on representations from pre-trained self-supervised models as an alternative to text prompts or semantic maps, which require large annotated datasets. The self-conditioning mechanism improves unconditional image generation quality and provides a controllable representation space. The authors identify directions of variation in this space and demonstrate smoothness and disentanglement properties, suggesting potential for fine-grained generative control without heavy annotation overhead.

Frontier Model Releases · Multimodal Progress · Representation-Conditioned Diffusion Models · Self-Supervised Learning · Disentangled Representation Learning

4 · Hugging Face Blog · 1mo ago

Instruction-tuning Stable Diffusion with InstructPix2Pix

This Hugging Face blog post describes a methodology for instruction-tuning Stable Diffusion using the InstructPix2Pix framework, enabling image editing via natural language instructions. The approach adapts techniques from language model instruction-tuning to the image generation domain. The post covers dataset construction, training procedures, and evaluation of the resulting models.

Alignment and RLHF · Multimodal Progress · Stable Diffusion 3 · InstructPix2Pix · Hugging Face

5 · arXiv · cs.LG · 25d ago

Squeezing Capacity from MLLMs for Subject-driven Image Generation via Dual Layer Aggregation

This paper proposes conditioning diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, augmented with VAE-based identity conditioning to address copy-paste artifacts and identity preservation failures in subject-driven image generation. A Dual Layer Aggregation (DLA) module aggregates multi-level MLLM features, and a multi-stage denoising strategy progressively balances semantic and fine-detail identity signals during inference. Experiments show improved human preference scores on subject-driven generation benchmarks compared to prior approaches that encode text and reference images separately.

Agent and Tool Ecosystem · Multimodal Progress · Multimodal Large Language Models · Dual Layer Aggregation (DLA) · Subject-driven Image Generation

7 · OpenAI Blog · 1mo ago

Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2 / unCLIP)

OpenAI published research on hierarchical text-conditional image generation using CLIP latents, the technique underlying DALL-E 2. The approach uses a prior network to map text embeddings to image embeddings, then a diffusion decoder to generate images from those embeddings. This represented a significant advance in text-to-image generation quality and semantic fidelity at the time of release.

Frontier Model Releases · Multimodal Progress · DALL·E 3 · unCLIP · OpenAI

4 · Hugging Face Blog · 1mo ago

Generate Images with Claude and Hugging Face via MCP

Hugging Face published a blog post demonstrating how to use Claude with the Model Context Protocol (MCP) to generate images through Hugging Face's inference infrastructure. The integration allows Claude to call Hugging Face image generation models as tools via MCP, connecting frontier LLMs with open-weight diffusion models. This represents a practical example of the agent-tool ecosystem pattern where LLMs orchestrate specialized model endpoints.

Agent and Tool Ecosystem · Multimodal Progress · Claude · Hugging Face · Anthropic

5 · Berkeley AI Research (BAIR) Blog · 1mo ago

PEVA: Whole-Body Conditioned Egocentric Video Prediction for Embodied World Models

Researchers from BAIR introduce PEVA (Predicting Ego-centric Video from human Actions), a model that generates first-person video frames conditioned on 48-dimensional whole-body kinematic pose trajectories. The model uses an autoregressive conditional diffusion transformer trained on the Nymeria dataset, which pairs real-world egocentric video with body pose capture. PEVA can generate atomic action videos, simulate counterfactuals, and support long video generation, representing a step toward world models grounded in physically embodied human agents.

Agent and Tool Ecosystem · Multimodal Progress · PEVA · Conditional Diffusion Transformer · Berkeley AI Research (BAIR)

3 · Hugging Face Blog · 1mo ago

Fine-Tuning CLIP with Remote Sensing Satellite Images and Captions

This Hugging Face blog post describes fine-tuning OpenAI's CLIP model on the RSICD (Remote Sensing Image Captioning Dataset) to improve vision-language alignment for satellite and aerial imagery. The work demonstrates domain adaptation of a general-purpose contrastive vision-language model to a specialized remote sensing context. It serves as a practical tutorial and case study for transfer learning with CLIP on narrow domains.
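For readers unfamiliar with what "contrastive" means here, the sketch below shows the symmetric image-text contrastive objective that CLIP-style models are commonly trained with: matched image and caption embeddings are pulled together, mismatched ones pushed apart. It is a simplified illustration in plain Python, not the blog post's training code. The function names, the toy 2-D embeddings, and the fixed `temperature=0.07` are assumptions for this example, and whether the fine-tuning run used exactly this loss is not stated in the summary above.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits: Vector, target: int) -> float:
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


def clip_contrastive_loss(image_embs: List[Vector],
                          text_embs: List[Vector],
                          temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; row k of each list is assumed to be a matched pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled similarity between image i and caption j.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each image should pick out its own caption among all captions.
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: each caption should pick out its own image among all images.
    loss_t2i = sum(cross_entropy([logits[r][k] for r in range(n)], k)
                   for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batch: perfectly aligned pairs versus the same captions swapped.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairs give a loss near zero, while swapping the captions makes it large. Domain adaptation on satellite imagery amounts to lowering this kind of loss on in-domain image-caption pairs.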

Multimodal Progress · Hugging Face · OpenAI · RSICD