MDA: Mixture-Density Representation for Flying-Point-Free Depth Estimation
This paper introduces MDA (Mixture-Density Ambiguity), a depth estimation framework that replaces single-depth-per-pixel prediction with a mixture-density representation allowing multiple depth hypotheses and associated probabilities per pixel. The approach directly addresses 'flying point' artifacts at object boundaries, where conventional models predict spurious intermediate depths between foreground and background surfaces. MDA extends naturally to transparent objects and sky regions, substantially improving boundary reconstruction with negligible runtime overhead across multiple backbone architectures.
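The difference between a single depth value and a set of weighted hypotheses can be illustrated with a toy boundary pixel. The sketch below is not the paper's implementation: the `DepthHypothesis` type, the two helper functions, and the example depths and weights are all hypothetical, and picking the highest-weight component is just one assumed way of reading out a mixture.

```python
from dataclasses import dataclass


@dataclass
class DepthHypothesis:
    depth: float   # candidate depth in metres (illustrative value)
    weight: float  # mixture probability assigned to this candidate


def expected_depth(hyps):
    """Probability-weighted mean, the value a single-depth L2 regressor tends toward."""
    total = sum(h.weight for h in hyps)
    return sum(h.depth * h.weight for h in hyps) / total


def most_likely_depth(hyps):
    """Depth of the highest-weight component (assumed readout, not the paper's)."""
    return max(hyps, key=lambda h: h.weight).depth


# A pixel straddling a foreground object at 1 m and a background wall at 5 m.
boundary_pixel = [DepthHypothesis(1.0, 0.6), DepthHypothesis(5.0, 0.4)]

mean_d = expected_depth(boundary_pixel)     # lies between the two surfaces
mode_d = most_likely_depth(boundary_pixel)  # lies on the foreground surface

print(f"single-value estimate: {mean_d:.2f} m")
print(f"mixture mode estimate: {mode_d:.2f} m")
```

In this toy case the averaged estimate falls in empty space between the two surfaces, which is the kind of intermediate depth the summary calls a flying point, while the mixture readout stays on one of the real surfaces.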
Related events (8)
Looped Diffusion Language Models (LoopMDM): Depth Scaling via Layer Looping
LoopMDM introduces selective looping of early-middle transformer layers in masked diffusion language models (MDMs), achieving a depth-scaling effect without adding parameters. The approach matches same-size MDM performance with up to 3.3× fewer training FLOPs and outperforms deeper non-looped MDMs on reasoning benchmarks, including an improvement of up to 8.5 points on GSM8K. Varying the loop count enables inference-time compute scaling, and adaptive loop scheduling provides additional efficiency gains. Attention analysis suggests looping works by promoting interactions among masked token positions.
ADAS: Attention-Discounted Adaptive Sampler improves parallel decoding for masked diffusion language models
Researchers propose ADAS, a training-free reranking rule for masked diffusion language model decoding that addresses token interaction failures in parallel token commitment. The method greedily penalizes candidates that attend strongly to already-selected uncertain positions, using attention weights as soft marginal penalties rather than hard constraints. Evaluated on LLaDA-8B-Base and Dream-7B-Base across GSM8K, MATH500, HumanEval, and MBPP, ADAS improves low-NFE performance by 9–10 percentage points on average when plugged into existing samplers with only 3.1% runtime overhead.
Efficient MultiModal Data Pipeline (MMDP) from Hugging Face
Hugging Face published a blog post describing an efficient multimodal data pipeline (MMDP) for processing and preparing multimodal training data at scale. The post covers architectural choices and tooling for handling diverse data modalities in ML workflows, and its technical substance is likely focused on practical data engineering patterns for multimodal model training.
Squeezing Capacity from MLLMs for Subject-driven Image Generation via Dual Layer Aggregation
This paper proposes conditioning diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, augmented with VAE-based identity conditioning to address copy-paste artifacts and identity preservation failures in subject-driven image generation. A Dual Layer Aggregation (DLA) module aggregates multi-level MLLM features, and a multi-stage denoising strategy progressively balances semantic and fine-detail identity signals during inference. Experiments show improved human preference scores on subject-driven generation benchmarks compared to prior approaches that encode text and reference images separately.
MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking
MambaGaze is a framework for real-time cognitive load assessment from eye-tracking data, combining XMD encoding (observation masks and time-deltas for missing data) with bidirectional Mamba-2 for efficient long-range temporal modeling. Evaluated on the CLARE and CL-Drive datasets under a leave-one-subject-out protocol, it achieves 76.8% and 73.1% accuracy, outperforming CNN, Transformer, ResNet, and VGG baselines by 4–12 percentage points. Edge deployment on NVIDIA Jetson platforms achieves 43–68 FPS at under 7.5 W, demonstrating feasibility for wearable and safety-critical applications such as driver vigilance monitoring.
ZEDA: Post-Trained MoE Models Can Skip Half Their Experts via Self-Distillation
This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a framework that converts static post-trained Mixture-of-Experts (MoE) language models into dynamic ones without pre-training from scratch. ZEDA injects parameter-free zero-output experts into each MoE layer and uses two-stage self-distillation with the original model as a frozen teacher. Applied to Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, ZEDA eliminates over 50% of expert FLOPs with marginal accuracy loss and achieves approximately 1.20× end-to-end inference speedup, outperforming the strongest dynamic MoE baseline by 4–6 points.
PGT: Procedurally Generated Tasks for Improving Visual Grounding in MLLMs
This paper introduces Procedurally Generated Tasks (PGT), a data-driven framework that overlays geometric primitives on images to create dense supervision signals for fine-grained visual grounding in multimodal large language models. PGT serves both as a training augmentation method and a diagnostic tool to isolate perception failures from semantic priors. Instruction tuning on LLaVA-v1.5-Instruct augmented with PGT data yields gains of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D. The results suggest that spatial reasoning deficits in MLLMs stem primarily from inadequate supervision rather than architectural or resolution constraints.
Mixture of Experts Explained
This Hugging Face blog post provides a technical overview of the Mixture of Experts (MoE) architecture, explaining how sparse gating mechanisms route tokens to subsets of expert feed-forward layers to achieve computational efficiency. The post covers training dynamics, inference considerations, and the tradeoffs between dense and sparse models. It serves as a reference document contextualizing MoE's growing relevance following high-profile model releases using the architecture.
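The sparse gating idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the blog post: the function names `top_k_route` and `moe_forward`, the scalar toy experts, and the choice to renormalise gate weights over the selected experts are all hypothetical, and real MoE layers operate on tensors with learned router weights.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and combine their outputs by gate weight."""
    routes = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in routes)


# Four toy "experts" acting on a scalar; only two are evaluated per token.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

routes = top_k_route(logits, 2)
y = moe_forward(3.0, experts, logits, k=2)
print(routes, y)
```

Because only `k` of the experts run for each token, compute per token stays roughly constant as the number of experts grows, which is the efficiency tradeoff between dense and sparse models that the post discusses.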
