Good Token Hunting: Token Selection Framework for Visual Geometry Transformers
This paper introduces a two-stage token selection framework to address the quadratic computational scaling of global attention in visual geometry transformers used for multi-view 3D reconstruction. The approach combines diversity-based inter-frame selection (frame-level) with entropy-guided intra-frame sparsification (token-level within frames). Experiments demonstrate over 85% acceleration for 500-image scenes while maintaining or improving baseline reconstruction quality, offering a favorable speed-accuracy trade-off.
Related guides (3)
Related events (8)
VEPO: Vision-anchored token selection improves RL for visual reasoning
A new arXiv paper identifies a failure mode of entropy-based credit assignment in multimodal reinforcement learning: vision-sensitive tokens with naturally low entropy are systematically ignored, causing the mechanism to collapse in visual reasoning tasks. The authors propose VEPO (Vision-Entropy token-selection for Policy Optimization), which couples visual sensitivity with token entropy via a multiplicative scheme to redirect gradient credit toward tokens that are both visually grounded and semantically informative. VEPO outperforms entropy-only baselines by 2.28 points at 7B scale and 3.15 points at 3B scale on visual reasoning benchmarks.
IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation
IVGT is a new neural architecture that implicitly models continuous 3D geometry from unposed multi-view images without requiring explicit pointmap regression. It learns a continuous neural scene representation in a canonical coordinate system, supporting SDF-based surface queries and color prediction via lightweight decoders. The model is trained with multi-dataset joint optimization using 2D supervision and 3D geometric regularization, achieving strong generalization across mesh reconstruction, novel view synthesis, depth/normal estimation, and camera pose estimation tasks.
Apple's AToken: A Unified Multimodal Tokenizer and Encoder for Images, Videos, and 3D Objects
Apple researchers introduced AToken, a transformer model with a single 4D tokenizer and encoder-decoder architecture that handles images, videos, and 3D objects in a shared token space. The model is trained to both reconstruct and classify all three media types, using a pretrained SigLIP2 vision encoder extended to four dimensions with 4D Rotary Position Embedding. AToken approaches or matches specialized models on image classification (82.2% ImageNet), image generation (0.21 rFID), and 3D reconstruction (28.28 PSNR), while remaining competitive on video tasks. The work addresses a longstanding tension between generation-focused and classification-focused encoders by forcing embeddings to retain both fine visual detail and semantic content.
TrajTok: Adaptive Spatial Tokenization for Trajectory Representation Learning
TrajTok is a trajectory encoder that learns transferable GPS trace representations via multi-resolution hexagonal spatial tokenization and masked-token pretraining. It uses a factorized transformer with per-modality self-attention, cross-attention fusion, and spatiotemporal rotary position embeddings (ST-RoPE) to jointly encode geometry and kinematics. A single frozen TrajTok encoder with lightweight adapters outperforms task-specific methods on trajectory similarity search, classification, ETA, and travel-time regression on the Porto dataset. The work positions learned spatial tokenization plus masked pretraining as a viable path toward general-purpose trajectory foundation models.
ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens
ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.
PGT: Procedurally Generated Tasks for Improving Visual Grounding in MLLMs
This paper introduces Procedurally Generated Tasks (PGT), a data-driven framework that overlays geometric primitives on images to create dense supervision signals for fine-grained visual grounding in multimodal large language models. PGT serves both as a training augmentation method and a diagnostic tool to isolate perception failures from semantic priors. Instruction tuning on LLaVA-v1.5-Instruct augmented with PGT data yields gains of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. The results suggest that spatial reasoning deficits in MLLMs stem primarily from inadequate supervision rather than architectural or resolution constraints.
Systematic framework for selecting trajectories in data augmentation evaluated across five strategies
A thesis-derived arXiv preprint proposes a framework for evaluating five trajectory selection strategies—Outlierness, Diversity, Representativeness, Uncertainty, and Random—for data augmentation in spatio-temporal ML tasks. The study tests these strategies across four datasets spanning animal behavior, maritime, and urban traffic domains using linear and non-linear models with Optuna-based hyperparameter optimization. Key findings show systematic strategies (especially Outlierness and Uncertainty) outperform random selection in sparse datasets but can degrade performance in dense, high-quality datasets, with UMAP visualization confirming topological effects.
G2Rec: Scalable framework unifying graph-based user modeling with semantic tokenization for generative recommendation
Researchers propose G2Rec, a framework that combines holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation systems. The approach addresses limitations of existing methods—scalability issues in graph serialization and lack of supervision in semantic tokenization—by learning user interest prototypes without ground-truth labels. The system has been deployed in production across product surfaces and evaluated on public datasets, showing improvements over prior methods.


