5 · arXiv cs.LG (Machine Learning) · 1mo ago

PIXLRelight: Controllable Single-Image Relighting via Intrinsic Conditioning

PIXLRelight is a feed-forward method for physically controllable single-image relighting that bridges physically based rendering (PBR) and learned image synthesis through shared intrinsic conditioning. At training time, multi-illumination photographs are decomposed into albedo, diffuse shading, and non-diffuse residuals; at inference time, the same conditioning is derived from a path-traced render of a coarse 3D reconstruction under user-specified PBR lights. A transformer-based neural renderer applies the target illumination via per-pixel affine modulation, and the authors report state-of-the-art quality in under 100 ms per image. Code and models are publicly released.
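To make "per-pixel affine modulation" concrete, here is a minimal sketch in plain Python. It is not the released PIXLRelight code: the function names, the list-of-lists image layout, and the composition rule `image = albedo * shading + residual` (a common intrinsic image model) are assumptions chosen for illustration. The idea shown is only that every pixel gets its own scale and shift.

```python
# Illustrative sketch only; names and data layout are hypothetical,
# not taken from the PIXLRelight release.

def affine_modulate(features, scale, shift):
    """Return out[y][x] = scale[y][x] * features[y][x] + shift[y][x]."""
    if not (len(features) == len(scale) == len(shift)):
        raise ValueError("all maps must have the same height")
    out = []
    for f_row, s_row, b_row in zip(features, scale, shift):
        if not (len(f_row) == len(s_row) == len(b_row)):
            raise ValueError("all maps must have the same width")
        out.append([s * f + b for f, s, b in zip(f_row, s_row, b_row)])
    return out


def compose_intrinsics(albedo, shading, residual):
    """Assumed intrinsic model: image = albedo * shading + residual.

    This is itself a per-pixel affine function of the shading map.
    """
    return affine_modulate(shading, albedo, residual)


features = [[1.0, 2.0], [3.0, 4.0]]
scale = [[2.0, 0.5], [1.0, 0.0]]   # per-pixel gain
shift = [[0.0, 1.0], [-1.0, 5.0]]  # per-pixel offset
print(affine_modulate(features, scale, shift))
```

In a learned renderer the scale and shift maps would be predicted from the conditioning rather than written by hand; that part is outside this sketch.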

Inference Economics · Multimodal Progress · PIXLRelight · per-pixel affine modulation · physically based rendering · transformer-based neural renderer

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production


Related events (8)

4 · arXiv · cs.AI · 11d ago

Pose-ICL: 3D-aware in-context learning for pose-controllable image generation of custom subjects

Researchers introduce Pose-ICL, a tuning-free framework for generating images of user-specified subjects with accurate pose control. The method uses Surface-Anchored Position Embedding (SAPE) to give 2D diffusion models explicit 3D awareness by anchoring image tokens to volumetric bounding box surface coordinates. Evaluations on 3D assets and real-world subjects show improvements over existing methods in both pose accuracy and identity consistency. The framework is designed for compatibility with existing Diffusion Transformer (DiT) models.

Multimodal Progress · Surface-Anchored Position Embedding · Pose-ICL · Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization

4 · Hugging Face Blog · 1mo ago

Instruction-tuning Stable Diffusion with InstructPix2Pix

This Hugging Face blog post describes a methodology for instruction-tuning Stable Diffusion using the InstructPix2Pix framework, enabling image editing via natural language instructions. The approach adapts techniques from language model instruction-tuning to the image generation domain. The post covers dataset construction, training procedures, and evaluation of the resulting models.

Alignment and RLHF · Multimodal Progress · Stable Diffusion 3 · InstructPix2Pix · Hugging Face · +1 more

5 · Hugging Face Blog · 1mo ago

Efficient Controllable Generation for SDXL with T2I-Adapters

Hugging Face published a blog post detailing T2I-Adapters for Stable Diffusion XL (SDXL), a lightweight conditioning mechanism that enables controllable image generation without full fine-tuning. The approach allows users to guide SDXL outputs using structural signals such as depth maps, edge detection, and pose estimation. T2I-Adapters offer a parameter-efficient alternative to ControlNet for the SDXL architecture, with integration into the Diffusers library.

Agent and Tool Ecosystem · Multimodal Progress · T2I-Adapter · Stable Diffusion 3 · Hugging Face · +2 more

3 · arXiv · cs.AI · 9d ago

Illumination-robust rPPG heart-rate estimation via spatial-temporal transformer for robot-mounted cameras

A new arXiv paper presents an end-to-end spatial-temporal transformer framework for remote photoplethysmography (rPPG) heart-rate estimation that is robust to illumination variation, targeting robot-mounted RGB cameras. The system integrates 3D face alignment, illumination augmentation, a Residual Temporal Standardization Module, and a hybrid waveform-plus-spectral loss. On a new dataset spanning three illumination levels, the method achieves 0.79 bpm MAE and 0.982 HR correlation, reducing error by 93.6% relative to the PhysFormer baseline. The work is relevant to physiological sensing in service and assistive robotics.

Multimodal Progress · PhysFormer · Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots · PRNet

5 · OpenAI Blog · 1mo ago

Glow: Better reversible generative models

OpenAI introduces Glow, a reversible generative model using invertible 1x1 convolutions that extends prior work on normalizing flows. The model generates realistic high-resolution images, supports efficient sampling, and learns disentangled features for attribute manipulation. Code and an online visualization tool are released alongside the paper.
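As a rough illustration of why a 1x1 convolution can be "reversible", the sketch below mixes the channels of each pixel with a small square matrix and undoes the mix with the matrix inverse. It is a two-channel toy in plain Python, not OpenAI's code; the function names and the list-based image layout are assumptions. The log-determinant term follows the form used for this layer in flow models, height x width x log|det W|.

```python
import math

# Toy two-channel version; names and layout are hypothetical.

def conv1x1(image, weight):
    """Mix the two channels of every pixel with a 2x2 weight matrix.

    image is a height x width grid of [c0, c1] pixels.
    """
    (a, b), (c, d) = weight
    return [[[a * p0 + b * p1, c * p0 + d * p1] for p0, p1 in row]
            for row in image]


def inverse_weight(weight):
    """Closed-form inverse of a 2x2 matrix; fails if it is singular."""
    (a, b), (c, d) = weight
    det = a * d - b * c
    if det == 0:
        raise ValueError("weight matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


def log_det(weight, height, width):
    """Log-determinant contribution of the layer: h * w * log|det W|."""
    (a, b), (c, d) = weight
    return height * width * math.log(abs(a * d - b * c))


weight = [[2.0, 1.0], [1.0, 1.0]]
image = [[[1.0, 2.0], [3.0, 4.0]]]           # 1 x 2 image, 2 channels
mixed = conv1x1(image, weight)
restored = conv1x1(mixed, inverse_weight(weight))
print(mixed, restored)
```

Applying the inverse weight with the same function recovers the input, which is the property that lets a flow model both sample and evaluate exact likelihoods.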

Multimodal Progress · Glow · invertible 1x1 convolutions · OpenAI · +1 more

6 · arXiv · cs.AI · 26d ago

ETCHR: Decoupled Image Editing for Visual Chain-of-Thought Reasoning in MLLMs

ETCHR introduces a question-conditioned, reasoning-aware image editing model that decouples visual transformation from downstream understanding in multimodal LLMs. It addresses two identified gaps, one language-side (mapping abstract questions to visual edits) and one generation-side (edit quality degrading with reasoning depth), via a two-stage training recipe combining supervised fine-tuning on edit trajectories and VLM-derived reward signals. Because the editor is decoupled, it plugs into arbitrary MLLMs without retraining, yielding Pass@1 gains of roughly +4.6 to +5.5 points across five task families when paired with Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. The work advances the 'think with images' paradigm beyond fixed toolkits and unified multimodal approaches.

Agent and Tool Ecosystem · Alignment and RLHF · Reasoning Enhancement · Qwen3-4B · ETCHR · +5 more

5 · arXiv · cs.LG · 24d ago

Representation-Conditioned Diffusion Models for Controllable Image Generation

This paper explores conditioning diffusion models on representations from pre-trained self-supervised models as an alternative to text prompts or semantic maps, which require large annotated datasets. The self-conditioning mechanism improves unconditional image generation quality and provides a controllable representation space. The authors identify directions of variation in this space and demonstrate smoothness and disentanglement properties, suggesting potential for fine-grained generative control without heavy annotation overhead.

Frontier Model Releases · Multimodal Progress · Representation-Conditioned Diffusion Models · Self-Supervised Learning · Disentangled Representation Learning · +1 more

4 · arXiv · cs.CL · 1mo ago

SymbolicLight V1: Spike-Gated Dual-Path Language Model with High Activation Sparsity

SymbolicLight V1 is a 194M-parameter spiking language model that combines binary Leaky Integrate-and-Fire spike dynamics with a continuous residual stream, replacing dense self-attention with a dual-path module using exponential-decay aggregation and spike-gated local attention. Trained from scratch on a 3B-token Chinese-English corpus, it achieves validation perplexity of 8.88–8.93 at over 89% per-element activation sparsity, trailing GPT-2 201M by 7.7% in PPL. Ablations indicate that temporal integration via LIF dynamics contributes more to performance than sparsity alone, and a 0.8B-parameter scale-up on 48.8B tokens demonstrates optimization stability. Current dense-hardware inference is slower than GPT-2; neuromorphic deployment is framed as a future opportunity.
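For readers unfamiliar with Leaky Integrate-and-Fire (LIF) dynamics, the following is a minimal textbook-style sketch of a single neuron producing binary spikes, plus a helper that measures how sparse the output is. It is not SymbolicLight's implementation: the decay factor, threshold, hard reset to zero, and function names are all assumptions made for illustration.

```python
# Generic discrete-time LIF neuron; parameters are hypothetical.

def lif_spikes(inputs, decay=0.5, threshold=1.0):
    """Return a list of 0/1 spikes for a sequence of input currents.

    The membrane potential leaks by `decay` each step, integrates the
    input, and is reset to zero whenever it reaches `threshold`.
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        potential = decay * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


def activation_sparsity(spikes):
    """Fraction of elements that are zero (silent)."""
    if not spikes:
        raise ValueError("spike list is empty")
    return spikes.count(0) / len(spikes)


spikes = lif_spikes([0.6, 0.6, 0.6, 0.0, 1.2])
print(spikes, activation_sparsity(spikes))
```

Because the neuron only fires after enough input has accumulated, most outputs are zero, which is the kind of per-element sparsity the summary refers to.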

Training Infrastructure · Inference Economics · GPT-2 · Dual-Path SparseTCAM · Spiking Neural Networks · +2 more