3 · arXiv cs.AI (Artificial Intelligence) · 9d ago

Illumination-robust rPPG heart-rate estimation via spatial-temporal transformer for robot-mounted cameras

A new arXiv paper presents an end-to-end spatial-temporal transformer framework for remote photoplethysmography (rPPG) heart-rate estimation that is robust to illumination variation, targeting robot-mounted RGB cameras. The system integrates 3D face alignment, illumination augmentation, a Residual Temporal Standardization Module, and a hybrid waveform-plus-spectral loss. On a new dataset spanning three illumination levels, the method achieves 0.79 bpm MAE and 0.982 HR correlation, reducing error by 93.6% relative to the PhysFormer baseline. The work is relevant to physiological sensing in service and assistive robotics.
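As a rough illustration of the reported metrics, the sketch below computes MAE and Pearson correlation on made-up heart-rate values, then back-solves the baseline error implied by a 93.6% relative reduction. It assumes the 93.6% figure refers to MAE, which the summary does not state explicitly. The sample values and function names are hypothetical and do not come from the paper.

```python
import math


def mae(pred, true):
    """Mean absolute error between predicted and reference heart rates (bpm)."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def implied_baseline(error, relative_reduction):
    """Baseline error such that `error` is `relative_reduction` lower than it."""
    return error / (1.0 - relative_reduction)


# Hypothetical reference and predicted heart rates in bpm (not from the paper).
true_hr = [62.0, 75.0, 88.0, 101.0]
pred_hr = [63.0, 74.0, 89.0, 100.0]

print("MAE (bpm):", mae(pred_hr, true_hr))
print("Pearson r:", round(pearson_r(pred_hr, true_hr), 3))

# Assumption: the 93.6% reduction applies to MAE.
print("Implied baseline MAE (bpm):", round(implied_baseline(0.79, 0.936), 2))
```

Under that assumption, a 0.79 bpm MAE would correspond to a PhysFormer baseline of roughly 12.3 bpm on the same data.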

Multimodal Progress · PhysFormer · Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots · PRNet

Related guides (1)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Related events (8)

5 · arXiv · cs.LG · 1mo ago

PIXLRelight: Controllable Single-Image Relighting via Intrinsic Conditioning

PIXLRelight is a feed-forward method for physically controllable single-image relighting that bridges physically based rendering (PBR) and learned image synthesis through shared intrinsic conditioning. At training time, multi-illumination photographs are decomposed into albedo, diffuse shading, and non-diffuse residuals; at inference time, conditioning is derived from a path-traced render of a coarse 3D reconstruction under user-specified PBR lights. A transformer-based neural renderer applies target illumination via per-pixel affine modulation, achieving state-of-the-art quality in under 100ms per image. Code and models are publicly released.

Inference Economics · Multimodal Progress · PIXLRelight · per-pixel affine modulation · physically based rendering

6 · arXiv · cs.AI · 17d ago

Humanoid-GPT: GPT-style Transformer trained on 2B-frame motion corpus for zero-shot humanoid control

Researchers introduce Humanoid-GPT, a causal Transformer pre-trained on a 2-billion-frame retargeted motion corpus that unifies major mocap datasets with large-scale in-house recordings for whole-body humanoid control. The model achieves zero-shot generalization to unseen motions and control tasks, overcoming the agility-generalization trade-off seen in prior MLP-based trackers. Scaling analyses demonstrate a new performance frontier for dynamic motion tracking without task-specific fine-tuning.

Frontier Model Releases · Humanoid-GPT · Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

4 · arXiv · cs.AI · 4d ago

FusionRS: Large-scale RGB-infrared-text dataset for dual-modal remote sensing vision-language models

Researchers introduce FusionRS, the first large-scale dataset pairing RGB and infrared remote sensing images with both conventional and IR-aware text captions, designed to support dual-modal vision-language learning. The dataset is constructed by translating public RGB remote sensing images into infrared-style counterparts using image translation. Using FusionRS, the authors train CLIP-style alignment models and fine-tune generative VLMs, demonstrating improvements in RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only baselines. The work addresses a gap in multimodal remote sensing foundation models by providing modality-specific textual supervision for infrared imagery.

Evaluation and Benchmarking · Multimodal Progress · CLIP · FusionRS

4 · arXiv · cs.AI · 29d ago

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking

MambaGaze is a framework for real-time cognitive load assessment from eye-tracking data, combining XMD encoding (observation masks and time-deltas for missing data) with bidirectional Mamba-2 for efficient long-range temporal modeling. Evaluated on the CLARE and CL-Drive datasets under a leave-one-subject-out protocol, it achieves 76.8% and 73.1% accuracy, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment on NVIDIA Jetson platforms reaches 43-68 FPS at under 7.5 W, demonstrating feasibility for wearable and safety-critical applications such as driver vigilance monitoring.

Inference Economics · Agent and Tool Ecosystem · Mamba · MambaGaze · XMD encoding
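The MambaGaze summary above mentions observation masks and time-deltas for missing data. A minimal sketch of that general idea follows, using a common convention from the irregular time-series literature: forward-fill the last seen value, flag whether each sample was observed, and record the time elapsed since the last observation. This is an assumption about what such an encoding might look like, not the paper's exact XMD formulation; `encode_missing` and the sample data are hypothetical.

```python
def encode_missing(timestamps, values, fill=0.0):
    """Return (filled, masks, deltas) for a series where None marks a gap.

    filled: values with gaps forward-filled (or `fill` before any observation)
    masks:  1 where the sample was observed, 0 where it was missing
    deltas: time since the most recent earlier observation (0.0 if none yet)
    """
    filled, masks, deltas = [], [], []
    last_obs_time = None
    last_value = fill
    for t, v in zip(timestamps, values):
        observed = v is not None
        # Elapsed time is measured before updating the last-observation marker.
        deltas.append(0.0 if last_obs_time is None else t - last_obs_time)
        if observed:
            last_value = v
            last_obs_time = t
        masks.append(1 if observed else 0)
        filled.append(last_value)
    return filled, masks, deltas


# Hypothetical gaze signal sampled every 0.5 s with two dropped samples.
timestamps = [0.0, 0.5, 1.0, 1.5]
values = [3.0, None, None, 4.0]

filled, masks, deltas = encode_missing(timestamps, values)
print(filled)  # forward-filled signal
print(masks)   # observation mask
print(deltas)  # time since last observation
```

Feeding the mask and delta channels alongside the filled signal lets a sequence model distinguish a genuinely constant reading from a stale one.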

5 · Berkeley AI Research (BAIR) Blog · 1mo ago

PEVA: Whole-Body Conditioned Egocentric Video Prediction for Embodied World Models

Researchers from BAIR introduce PEVA (Predicting Ego-centric Video from human Actions), a model that generates first-person video frames conditioned on 48-dimensional whole-body kinematic pose trajectories. The model uses an autoregressive conditional diffusion transformer trained on the Nymeria dataset, which pairs real-world egocentric video with body pose capture. PEVA can generate atomic action videos, simulate counterfactuals, and support long video generation, representing a step toward world models grounded in physically embodied human agents.

Agent and Tool Ecosystem · Multimodal Progress · PEVA · Conditional Diffusion Transformer · Berkeley AI Research (BAIR)

6 · Google DeepMind Blog · 1mo ago

D4RT: DeepMind's Unified 4D Reconstruction and Tracking System, Up to 300x Faster

DeepMind has announced D4RT, a system for unified four-dimensional (spatial plus temporal) scene reconstruction and tracking. DeepMind reports speedups of up to 300x over prior approaches. The announcement positions D4RT as a significant efficiency advance in dynamic 3D scene understanding, with potential applications in robotics, video understanding, and embodied AI.

Agent and Tool Ecosystem · Multimodal Progress · DeepMind · 4D reconstruction · D4RT

7 · arXiv · cs.AI · 29d ago

Foundation Model for Wearable Health Data Pretrained on 1 Trillion Minutes from 5 Million Participants

Researchers propose a large-scale foundation model for wearable health data, pretrained on over one trillion minutes of unlabeled sensor signals from five million participants. The model demonstrates systematic performance improvements across 35 health prediction tasks spanning cardiovascular, metabolic, sleep, and mental health domains, with joint scaling of model capacity and data volume. A 'classroom' of LLM agents autonomously searches downstream predictive head configurations, and the resulting embeddings are integrated into a Personal Health Agent validated by 1,860 clinician ratings. The work establishes label-efficient few-shot learning and generative capabilities for daily health metric estimation.

Frontier Model Releases · Evaluation and Benchmarking · LLM Agent Classroom · Personal Health Agent · few-shot learning

4 · Hugging Face Blog · 1mo ago

PRX Part 3 — Training a Text-to-Image Model in 24 Hours

Photoroom shares the third installment of their PRX series on Hugging Face, detailing how they trained a text-to-image model within a 24-hour window. The post covers the practical engineering and training infrastructure decisions that enabled rapid model development. This is part of an ongoing series documenting Photoroom's internal model development process.

Training Infrastructure · Multimodal Progress · Hugging Face · Photoroom · PRX