5 · arXiv cs.AI (Artificial Intelligence) · 1mo ago

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

IVGT is a new neural architecture that implicitly models continuous 3D geometry from unposed multi-view images without requiring explicit pointmap regression. It learns a continuous neural scene representation in a canonical coordinate system, supporting SDF-based surface queries and color prediction via lightweight decoders. The model is trained with multi-dataset joint optimization using 2D supervision and 3D geometric regularization, achieving strong generalization across mesh reconstruction, novel view synthesis, depth/normal estimation, and camera pose estimation tasks.

Frontier Model Releases · Multimodal Progress · IVGT · Signed Distance Function (SDF) · Neural Radiance Field (NeRF) · Visual Geometry Foundation Models
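For readers unfamiliar with the term, the following is a minimal sketch of what an SDF-based surface query involves. It does not reproduce IVGT's learned decoder: an analytic sphere stands in for the network, and the names `sphere_sdf`, `sdf_normal`, and `sphere_trace` are hypothetical, chosen only for illustration.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside.
    Stand-in for a learned SDF decoder (assumption, not IVGT's model)."""
    return math.dist(p, center) - radius

def sdf_normal(sdf, p, eps=1e-4):
    """Approximate the unit surface normal as the normalized SDF gradient (central differences).
    Undefined where the gradient vanishes, e.g. at the sphere's center."""
    grad = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((sdf(hi) - sdf(lo)) / (2 * eps))
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / norm for g in grad]

def sphere_trace(sdf, origin, direction, max_steps=64, tol=1e-5):
    """March along a unit-length ray, stepping by the SDF value, until the surface is hit.
    Returns the ray parameter t at the hit, or None if no hit within max_steps."""
    t = 0.0
    for _ in range(max_steps):
        p = [o + t * d for o, d in zip(origin, direction)]
        dist = sdf(p)
        if dist < tol:
            return t
        t += dist
    return None

# Usage: a ray starting at z = -3 and pointing along +z meets the unit sphere at z = -1.
t_hit = sphere_trace(sphere_sdf, (0.0, 0.0, -3.0), (0.0, 0.0, 1.0))
hit_point = [0.0, 0.0, -3.0 + t_hit]
normal = sdf_normal(sphere_sdf, hit_point)
```

The same two operations, evaluating the distance at a point and differentiating it for a normal, are what a learned SDF decoder would expose; only the function being queried changes.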

Related guides (2)

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Related events (8)

5 · arXiv · cs.LG · 26d ago

Good Token Hunting: Token Selection Framework for Visual Geometry Transformers

This paper introduces a two-stage token selection framework to address the quadratic computational scaling of global attention in visual geometry transformers used for multi-view 3D reconstruction. The approach combines diversity-based inter-frame selection (frame-level) with entropy-guided intra-frame sparsification (token-level within frames). Experiments demonstrate over 85% acceleration for 500-image scenes while maintaining or improving baseline reconstruction quality, offering a favorable speed-accuracy trade-off.

Inference Economics · Agent and Tool Ecosystem · inter-frame token selection · visual geometry transformer · global attention · +5 more

6 · arXiv · cs.AI · 26d ago

PGT: Procedurally Generated Tasks for Improving Visual Grounding in MLLMs

This paper introduces Procedurally Generated Tasks (PGT), a data-driven framework that overlays geometric primitives on images to create dense supervision signals for fine-grained visual grounding in multimodal large language models. PGT serves both as a training augmentation method and a diagnostic tool to isolate perception failures from semantic priors. Instruction tuning on LLaVA-v1.5-Instruct augmented with PGT data yields gains of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. The results suggest that spatial reasoning deficits in MLLMs stem primarily from inadequate supervision rather than architectural or resolution constraints.

Evaluation and Benchmarking · Multimodal Progress · PGT (Procedurally Generated Tasks) · Multimodal Large Language Models · CV-Bench-2D · +2 more

5 · OpenAI Blog · 1mo ago

Glow: Better reversible generative models

OpenAI introduces Glow, a reversible generative model using invertible 1x1 convolutions that extends prior work on normalizing flows. The model generates realistic high-resolution images, supports efficient sampling, and learns disentangled features for attribute manipulation. Code and an online visualization tool are released alongside the paper.

Multimodal Progress · Glow · invertible 1x1 convolutions · OpenAI · +1 more
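As a rough illustration of the invertible 1x1 convolution mentioned in the summary above, the sketch below mixes channels at each pixel with a small square matrix, inverts the operation exactly, and computes the log-determinant term that a normalizing flow adds to its log-likelihood. This is not OpenAI's released code: it is restricted to two channels and plain Python lists for brevity, and the helper names `det2`, `inv2`, `conv1x1`, and `log_det_jacobian` are hypothetical.

```python
import math

def det2(w):
    """Determinant of a 2x2 matrix given as nested lists."""
    return w[0][0] * w[1][1] - w[0][1] * w[1][0]

def inv2(w):
    """Inverse of a 2x2 matrix; requires a nonzero determinant."""
    d = det2(w)
    return [[w[1][1] / d, -w[0][1] / d],
            [-w[1][0] / d, w[0][0] / d]]

def conv1x1(image, w):
    """Apply the channel-mixing matrix w independently at every pixel.
    image[row][col] is a list of 2 channel values."""
    return [[[w[0][0] * p[0] + w[0][1] * p[1],
              w[1][0] * p[0] + w[1][1] * p[1]] for p in row]
            for row in image]

def log_det_jacobian(image, w):
    """Log |det| of the whole transform: one copy of log|det w| per pixel."""
    height, width = len(image), len(image[0])
    return height * width * math.log(abs(det2(w)))

# Usage: a 2x3 image with 2 channels and an arbitrary invertible mixing matrix.
weights = [[2.0, 1.0], [0.5, 1.5]]
image = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
         [[-1.0, 0.5], [0.0, 0.0], [2.0, -2.0]]]
mixed = conv1x1(image, weights)
restored = conv1x1(mixed, inv2(weights))  # applying the inverse recovers the input
log_det = log_det_jacobian(image, weights)
```

Because the same matrix is applied at every spatial location, the Jacobian is block-diagonal, which is why the log-determinant reduces to the pixel count times log|det w|.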

5 · arXiv · cs.AI · 16d ago

GeM-NR: Training-free multi-view editing for nonrigid 3D scene changes

GeM-NR is a training-free method for multi-view consistent image editing that handles nonrigid edits — changes that substantially alter scene geometry and appearance — a capability that existing methods largely lack. Given an anchor image edited by a backbone model (FLUX, Qwen, or BrushNet) and an unedited query image, the method propagates the edit consistently across viewpoints via depth estimation, point-cloud alignment, projection, and conditioned refinement. The authors report state-of-the-art performance on edit quality and geometric/photometric consistency across multiple views, including generation of 3D representations of edited scenes.

Multimodal Progress · BrushNet · Qwen · GeM-NR · +1 more

5 · arXiv · cs.AI · 17d ago

Imaginative Perception Tokens improve spatial reasoning in vision-language models

Researchers introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive from alternative spatial viewpoints, enabling reasoning about unobserved spatial structure. The approach is evaluated on three new tasks—Perspective Taking, Path Tracing, and Multiview Counting—using ~20K examples built on the BAGEL backbone. IPT supervision consistently outperforms textual chain-of-thought training for spatial tasks, with the authors finding that forcing spatial computation through language can degrade performance, suggesting a modality mismatch. The work provides both a practical supervision technique and a diagnostic finding about the limits of language-mediated spatial reasoning.

Evaluation and Benchmarking · Multimodal Progress · Imaginative Perception Tokens · Path Tracing · Perspective Taking · +2 more

6 · OpenAI Blog · 1mo ago

Image GPT: Transformer Models Applied to Pixel Sequences for Image Generation and Classification

OpenAI demonstrates that a large transformer model trained autoregressively on pixel sequences can generate coherent image completions and samples, analogous to text generation. The work establishes a correlation between generative sample quality and downstream image classification accuracy. The best generative model achieves features competitive with top convolutional networks in the unsupervised setting, suggesting shared representational principles across modalities.

Frontier Model Releases · Multimodal Progress · Transformers · convolutional neural network · OpenAI · +2 more

4 · arXiv · cs.AI · 11d ago

Pose-ICL: 3D-aware in-context learning for pose-controllable image generation of custom subjects

Researchers introduce Pose-ICL, a tuning-free framework for generating images of user-specified subjects with accurate pose control. The method uses Surface-Anchored Position Embedding (SAPE) to give 2D diffusion models explicit 3D awareness by anchoring image tokens to volumetric bounding box surface coordinates. Evaluations on 3D assets and real-world subjects show improvements over existing methods in both pose accuracy and identity consistency. The framework is designed for compatibility with existing Diffusion Transformer (DiT) models.

Multimodal Progress · Surface-Anchored Position Embedding · Pose-ICL · Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization

6 · arXiv · cs.AI · 1mo ago

Semantic Generative Tuning (SGT) for Unified Multimodal Models

This paper introduces Semantic Generative Tuning (SGT), a post-training paradigm for unified multimodal models (UMMs) that bridges the gap between visual understanding and visual generation. The authors find that image segmentation tasks serve as optimal generative proxies, providing structural semantics that improve both perception and generative layout fidelity. SGT aligns representation spaces across understanding and generation objectives, improving feature linear separability and visual-textual attention allocation. Evaluations show consistent gains on multimodal comprehension and generative fidelity benchmarks.

Frontier Model Releases · Alignment and RLHF · Semantic Generative Tuning (SGT) · image segmentation · generative post-training · +2 more