5arXiv cs.CL (Computation and Language)·1mo ago

AnyMo: Geometry-Aware Setup-Agnostic Framework for Wearable IMU Human Motion Understanding

AnyMo is a geometry-aware framework that addresses the setup-dependence problem in wearable IMU-based human motion modeling. It runs physics-grounded simulation over dense body-surface placements to generate synthetic training signals, pre-trains a graph encoder from synthetic placement views and masked partial observations, and then tokenizes multi-position IMU data into full-body motion tokens aligned with an LLM for motion-language understanding. Evaluated on zero-shot activity recognition (14 unseen datasets), cross-modal retrieval, and motion captioning, AnyMo improves average accuracy and F1 by roughly 11.7% and 11.6%, respectively, zero-shot retrieval MRR by 15.9–28.6%, and captioning BERT-F1 by 18.8%. The work positions itself as a generalist model for wearable motion understanding that transfers across devices and sensing configurations.
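For readers unfamiliar with the retrieval metric cited above, mean reciprocal rank (MRR) averages the reciprocal of the rank at which the correct item appears for each query. The sketch below is a minimal illustration, not AnyMo's evaluation code: the function name `mean_reciprocal_rank`, the toy similarity matrix, and the convention that candidate `i` is the correct match for query `i` are all assumptions made for the example.

```python
import numpy as np

def mean_reciprocal_rank(sim):
    """Compute MRR from a square similarity matrix.

    sim[i, j] is the similarity of query i to candidate j.
    Assumption for this sketch: candidate i is the correct match for query i.
    """
    sim = np.asarray(sim, dtype=float)
    correct = np.diag(sim)[:, None]
    # Rank of the correct item = 1 + number of candidates scoring strictly higher.
    ranks = 1 + (sim > correct).sum(axis=1)
    return float((1.0 / ranks).mean())

# Toy example (hypothetical values): queries 0 and 2 rank their match first,
# query 1 ranks its match second.
toy = np.array([[0.9, 0.1, 0.2],
                [0.8, 0.5, 0.1],
                [0.3, 0.2, 0.7]])
mrr = mean_reciprocal_rank(toy)
```

Here the ranks are 1, 2, and 1, so the MRR is (1 + 0.5 + 1) / 3. A relative gain such as the one reported above would be measured against a baseline's MRR computed the same way.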

Agent and Tool Ecosystem · Multimodal Progress · large language models · BERT-F1 · Baiyu Chen · Graph Neural Network Encoder · AnyMo · Inertial Measurement Unit (IMU) · Human Activity Recognition (HAR)

Related events (8)

5Hugging Face Blog·4d ago·source ↗

MolmoMotion: Language-guided 3D motion forecasting from Allen AI

Allen AI published a blog post on Hugging Face introducing MolmoMotion, a system for language-guided 3D motion forecasting. The work extends the Molmo model family into motion prediction tasks, combining natural language conditioning with 3D spatial reasoning. The post appears to be an announcement or demonstration of the capability, though the body content was not available for detailed review.

6arXiv · cs.CL·3d ago·source ↗

OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling

Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).

6arXiv · cs.LG·5d ago·source ↗

Geometric Action Model (GAM) repurposes geometric foundation models for 3D-aware robot manipulation

Researchers propose the Geometric Action Model (GAM), a language-conditioned robot manipulation policy that splits a pretrained geometric foundation model (GFM) to serve simultaneously as an observation encoder, causal future predictor, and action decoder. Unlike existing vision-language-action models that operate on 2D image frames, GAM explicitly incorporates 3D geometric priors for contact-rich manipulation. The approach claims improvements in accuracy, robustness, speed, and model size over foundation-model-scale baselines across simulation and real-robot benchmarks.

7arXiv · cs.CL·25d ago·source ↗

MobileMoE: Scaling Mixture-of-Experts for Sub-Billion Parameter On-Device Deployment

MobileMoE introduces a family of on-device MoE language models with 0.3–0.9B active parameters and 1.3–5.3B total parameters, targeting mobile deployment under memory and compute constraints. The authors derive an on-device MoE scaling law identifying a sweet spot of moderate sparsity with fine-grained and shared experts, then train models through a four-stage recipe including quantization-aware training on open-source data. Across 14 benchmarks, MobileMoE matches or exceeds leading dense on-device LLMs with 2–4× fewer inference FLOPs, and delivers 1.8–3.8× faster prefill and 2.2–3.4× faster decode than dense baselines on commodity smartphones at comparable INT4 memory.

6arXiv · cs.LG·23d ago·source ↗

DynaFLIP: Dynamics-Aware Multimodal Pre-Training for Robot Manipulation Perception

DynaFLIP is a pre-training framework that injects motion understanding into visual encoders for robot manipulation by constructing image-language-3D flow triplets from human and robot videos. The method encourages tri-modal alignment via simplex-volume minimization in a shared hyperspherical space, combined with cosine regularization and contrastive objectives. The resulting dynamics-aware visual backbone consistently outperforms baselines across diverse downstream policies including VLAs, with gains up to +22.5% in out-of-distribution scenarios. The work argues that robot generalization requires encoding how the world changes under action, not just static scene content.
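One way to picture the tri-modal alignment idea is through the volume spanned by three unit-normalized embeddings, which shrinks toward zero as they point in the same direction. The sketch below is an illustration under stated assumptions, not DynaFLIP's loss: it uses the Gram-determinant formula for the parallelotope volume (proportional to the simplex volume), and the function name `gram_volume` and the example vectors are hypothetical.

```python
import numpy as np

def gram_volume(image, text, flow):
    """Volume of the parallelotope spanned by three unit-normalized embeddings.

    Assumption for this sketch: alignment is measured by sqrt(det(G)),
    where G is the Gram matrix of the normalized vectors.
    """
    vecs = np.stack([image, text, flow]).astype(float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # project onto the unit hypersphere
    gram = vecs @ vecs.T
    # Clamp tiny negative determinants caused by floating-point error.
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))

# Mutually orthogonal embeddings span the largest volume; identical ones span none.
spread = gram_volume([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0])
aligned = gram_volume([2, 0, 0, 0], [1, 0, 0, 0], [3, 0, 0, 0])
```

Minimizing a quantity like this pulls the three modality embeddings of one sample together, which is why it would typically be paired with contrastive terms that keep different samples apart.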

5arXiv · cs.LG·9d ago·source ↗

Mana framework achieves zero-shot sim-to-real transfer for dexterous articulated tool manipulation

Researchers introduce Mana (Manipulation Animator), a sim-to-real framework that reframes dexterous robotic manipulation as an animation problem using a coarse-to-fine pipeline of procedurally-generated grasp keyframes, motion planning, and reinforcement learning. The system requires minimal human input (under one minute per tool) and achieves zero-shot sim-to-real transfer across four articulated tools with varying joint types and scales. The work addresses a longstanding gap in dexterous robotics where articulated tool use—requiring coordination of internal degrees of freedom and contact-rich interactions—has been underexplored relative to rigid object manipulation.

4arXiv · cs.AI·1mo ago·source ↗

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking

MambaGaze is a framework for real-time cognitive load assessment from eye-tracking data, combining XMD encoding (observation masks and time-deltas for missing data) with bidirectional Mamba-2 for efficient long-range temporal modeling. Evaluated on CLARE and CL-Drive datasets under leave-one-subject-out protocol, it achieves 76.8% and 73.1% accuracy, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment on NVIDIA Jetson platforms achieves 43-68 FPS at under 7.5W, demonstrating feasibility for wearable and safety-critical applications such as driver vigilance monitoring.
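The observation masks and time-deltas mentioned above can be sketched as follows. This is a minimal illustration of one common way to encode missing samples, not MambaGaze's implementation: the function name `mask_and_deltas`, the zero-fill for missing values, and the accumulating-gap definition of the time-delta are assumptions made for the example.

```python
import numpy as np

def mask_and_deltas(values, timestamps):
    """Encode a 1-D signal with missing samples (NaN) as (filled, mask, deltas).

    mask[t]   = 1.0 if the sample at t was observed, else 0.0
    deltas[t] = time elapsed since the most recent observed sample before t
                (assumed definition; 0 at the first step)
    """
    values = np.asarray(values, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    mask = (~np.isnan(values)).astype(float)
    deltas = np.zeros_like(timestamps)
    for t in range(1, len(values)):
        gap = timestamps[t] - timestamps[t - 1]
        # If the previous sample was missing, the gap keeps accumulating.
        deltas[t] = gap if mask[t - 1] == 1 else gap + deltas[t - 1]
    filled = np.nan_to_num(values, nan=0.0)  # placeholder value for missing samples
    return filled, mask, deltas

# Hypothetical gaze signal sampled every 0.1 s with two dropped samples.
filled, mask, deltas = mask_and_deltas([0.5, np.nan, np.nan, 0.7],
                                       [0.0, 0.1, 0.2, 0.3])
```

Feeding the mask and deltas alongside the filled signal lets a sequence model distinguish a genuinely zero reading from a dropped one and weigh stale observations accordingly.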

5Hugging Face Blog·1mo ago·source ↗

EMO: Pretraining Mixture of Experts for Emergent Modularity

AllenAI introduces EMO, a pretraining approach for Mixture of Experts (MoE) models that aims to produce emergent modularity during training. The work explores how MoE architectures can develop specialized expert routing without explicit supervision. Published on the Hugging Face blog, this represents research-level work on improving MoE training dynamics and efficiency.
