5 · arXiv cs.AI (Artificial Intelligence) · 16d ago

GeM-NR: Training-free multi-view editing for nonrigid 3D scene changes

GeM-NR is a training-free method for multi-view consistent image editing that handles nonrigid edits, meaning changes that substantially alter scene geometry and appearance, which existing methods largely cannot handle. Given an anchor image edited by a backbone model (FLUX, Qwen, or BrushNet) and an unedited query image, the method propagates the edit consistently across viewpoints in four steps: depth estimation, point-cloud alignment, projection, and conditioned refinement. The authors report state-of-the-art edit quality and geometric and photometric consistency across multiple views, and also demonstrate generating 3D representations of the edited scenes.
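To make the projection step more concrete, here is a minimal sketch of how a single pixel with known depth can be lifted to a 3D point and reprojected into another view. It assumes a pinhole camera model and a known rigid transform between the anchor and query cameras; the names `Intrinsics`, `unproject`, `transform`, `project`, and `warp_pixel` are hypothetical and not taken from the paper, and GeM-NR's actual alignment and refinement stages are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics (assumed model, not from the paper)."""
    fx: float
    fy: float
    cx: float
    cy: float


def unproject(u, v, depth, k):
    """Lift pixel (u, v) with metric depth to a 3D point in camera coordinates."""
    x = (u - k.cx) / k.fx * depth
    y = (v - k.cy) / k.fy * depth
    return (x, y, depth)


def transform(point, rotation, translation):
    """Apply a rigid transform: rotation (3x3 nested sequence) then translation."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def project(point, k):
    """Project a 3D camera-space point to pixel coordinates; None if behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (k.fx * x / z + k.cx, k.fy * y / z + k.cy)


def warp_pixel(u, v, depth, k, rotation, translation):
    """Map an anchor-view pixel to its location in the query view."""
    point_anchor = unproject(u, v, depth, k)
    point_query = transform(point_anchor, rotation, translation)
    return project(point_query, k)


if __name__ == "__main__":
    k = Intrinsics(fx=100.0, fy=100.0, cx=32.0, cy=24.0)
    identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    # A sideways camera offset shifts the pixel by fx * offset / depth.
    print(warp_pixel(10, 20, 2.0, k, identity, (0.1, 0.0, 0.0)))
```

Applying `warp_pixel` to every pixel of the edited anchor image yields a sparse, possibly occluded rendering in the query view, which is the kind of input a conditioned refinement stage would then complete.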

Multimodal Progress · BrushNet · Qwen · GeM-NR · FLUX

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Qwen

Qwen: Alibaba's Open-Weight AI Model Family

Related events (8)

6 · arXiv · cs.AI · 26d ago

ETCHR: Decoupled Image Editing for Visual Chain-of-Thought Reasoning in MLLMs

ETCHR introduces a question-conditioned, reasoning-aware image editing model that decouples visual transformation from downstream understanding in multimodal LLMs. It addresses two identified gaps, one language-side (mapping abstract questions to visual edits) and one generation-side (edit quality degrading with reasoning depth), via a two-stage training recipe that combines supervised fine-tuning on edit trajectories with VLM-derived reward signals. Because the editor is decoupled, it plugs into arbitrary MLLMs without retraining, yielding Pass@1 gains of roughly +4.6 to +5.5 points across five task families when paired with Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5. The work advances the 'think with images' paradigm beyond fixed toolkits and unified multimodal approaches.

Agent and Tool Ecosystem · Alignment and RLHF · Reasoning Enhancement · Qwen3-4B · ETCHR · +5 more

6 · Qwen Research · 1mo ago

Qwen-Image-Edit: Image Editing Model with Text Rendering and Dual Visual Control

Alibaba's Qwen team has released Qwen-Image-Edit, a 20B-parameter image editing model built on the Qwen-Image foundation. The model extends Qwen-Image's text rendering capabilities to editing tasks, enabling precise in-image text modification. It uses a dual-path architecture that simultaneously feeds input images into Qwen2.5-VL for semantic control and a VAE Encoder for appearance control, enabling both semantic and appearance-level edits.

Frontier Model Releases · Multimodal Progress · Qwen2.5-VL · Qwen-Image-Edit · Qwen-Image · +2 more

5 · The Batch · 22d ago

Meta Research Improves Image Generation via Staged Planning and Self-Revision Fine-Tuning

Researchers from Meta and collaborating universities propose a fine-tuning method that teaches image generators to compose images through discrete plan-sketch-inspect-refine cycles rather than generating all at once. Starting from BAGEL-7B, they construct ~62,000 training examples using GPT-4o and FLUX.1 Kontext to supervise each stage, achieving 83% on GenEval versus 77% for the base model and a competing method (PARM) that required 11x more training data and ~8x more inference steps. The approach improves spatial relationship accuracy, object attribute fidelity, and real-world knowledge grounding in generated images.

Evaluation and Benchmarking · Agent and Tool Ecosystem · University of California San Diego · WISE · FLUX.1 Kontext · +10 more

5 · arXiv · cs.AI · 1mo ago

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

IVGT is a new neural architecture that implicitly models continuous 3D geometry from unposed multi-view images without requiring explicit pointmap regression. It learns a continuous neural scene representation in a canonical coordinate system, supporting SDF-based surface queries and color prediction via lightweight decoders. The model is trained with multi-dataset joint optimization using 2D supervision and 3D geometric regularization, achieving strong generalization across mesh reconstruction, novel view synthesis, depth/normal estimation, and camera pose estimation tasks.

Frontier Model Releases · Multimodal Progress · IVGT · Signed Distance Function (SDF) · Neural Radiance Field (NeRF) · +1 more

6 · arXiv · cs.LG · 4d ago

Geometric Action Model (GAM) repurposes geometric foundation models for 3D-aware robot manipulation

Researchers propose the Geometric Action Model (GAM), a language-conditioned robot manipulation policy that splits a pretrained geometric foundation model (GFM) to serve simultaneously as an observation encoder, causal future predictor, and action decoder. Unlike existing vision-language-action models that operate on 2D image frames, GAM explicitly incorporates 3D geometric priors for contact-rich manipulation. The approach claims improvements in accuracy, robustness, speed, and model size over foundation-model-scale baselines across simulation and real-robot benchmarks.

Agent and Tool Ecosystem · Multimodal Progress · Geometric Action Model for Robot Policy Learning · Geometric Action Model

6 · arXiv · cs.LG · 22d ago

DynaFLIP: Dynamics-Aware Multimodal Pre-Training for Robot Manipulation Perception

DynaFLIP is a pre-training framework that injects motion understanding into visual encoders for robot manipulation by constructing image-language-3D flow triplets from human and robot videos. The method encourages tri-modal alignment via simplex-volume minimization in a shared hyperspherical space, combined with cosine regularization and contrastive objectives. The resulting dynamics-aware visual backbone consistently outperforms baselines across diverse downstream policies including VLAs, with gains up to +22.5% in out-of-distribution scenarios. The work argues that robot generalization requires encoding how the world changes under action, not just static scene content.
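As a rough illustration of what a simplex-volume alignment term can look like, the sketch below computes the volume of the simplex spanned by the origin and three unit-normalized embeddings (image, language, flow) using the Gram determinant. This is one common geometric formulation and an assumption on our part; the summary does not give DynaFLIP's exact loss, and the function names here are hypothetical. The volume shrinks toward zero as the three embeddings align, which is why minimizing it encourages tri-modal agreement.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def simplex_volume(image_emb, text_emb, flow_emb):
    """Volume of the simplex with vertices at the origin and three unit embeddings.

    Uses volume = sqrt(det(G)) / 3!, where G is the 3x3 Gram matrix of the
    normalized embeddings. Works for embeddings of any dimension >= 1.
    """
    a, b, c = (normalize(v) for v in (image_emb, text_emb, flow_emb))
    ab, ac, bc = dot(a, b), dot(a, c), dot(b, c)
    # Determinant of [[1, ab, ac], [ab, 1, bc], [ac, bc, 1]], expanded by hand.
    gram_det = 1.0 + 2.0 * ab * ac * bc - ab * ab - ac * ac - bc * bc
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(gram_det, 0.0)) / 6.0


if __name__ == "__main__":
    # Mutually orthogonal embeddings: maximal disagreement, largest volume.
    print(simplex_volume([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]))
    # Identical embeddings: perfect alignment, zero volume.
    print(simplex_volume([1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]))
```

In a training setup this scalar would be averaged over a batch and combined with the cosine regularization and contrastive terms the summary mentions; those pieces are omitted here.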

Frontier Model Releases · Agent and Tool Ecosystem · Vision-Language-Action models · simplex-volume minimization · DynaFLIP · +3 more

4 · arXiv · cs.CL · 8d ago

BitResEdit: Training-free bitwise residual editing for visual autoregressive image generators

BitResEdit is a training-free text-guided image editing method for bitwise-residual visual autoregressive (VAR) models, specifically targeting Infinity-2B. The approach combines per-bit Bernoulli guidance (BitEdit) with scale-aware code residual injection (ResEdit), exploiting native structures of VAR models that prior editors leave unused. On PIE-Bench with Infinity-2B, it achieves the best CLIP text alignment among same-backbone VAR editors (+1.07 over the prior best) while maintaining competitive background preservation.

Multimodal Progress · Infinity-2B · BitResEdit · PIE-Bench

6 · Google DeepMind Blog · 1mo ago

Image Editing in Gemini Gets Major Upgrade

Google DeepMind has announced a significant upgrade to native image editing capabilities within the Gemini app. The update enables new ways to transform images directly through the Gemini interface. The blog post is light on technical specifics but signals continued multimodal capability expansion for the Gemini product line.

Frontier Model Releases · Multimodal Progress · Google DeepMind · Gemini App · Gemini