4 · arXiv cs.AI (Artificial Intelligence) · 29d ago

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking

MambaGaze is a framework for real-time cognitive load assessment from eye-tracking data. It combines XMD encoding (observation masks and time-deltas that represent missing data explicitly) with bidirectional Mamba-2 for efficient long-range temporal modeling. Evaluated on the CLARE and CL-Drive datasets under a leave-one-subject-out protocol, it achieves 76.8% and 73.1% accuracy, respectively, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment on NVIDIA Jetson platforms reaches 43-68 FPS at under 7.5 W, demonstrating feasibility for wearable and safety-critical applications such as driver vigilance monitoring.
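
The summary describes XMD encoding only as observation masks plus time-deltas for missing data. The sketch below shows one common way such features are built for an irregular gaze stream, where blinks or tracking loss appear as NaN. The function name `xmd_encode`, the forward-fill of missing values, and the exact definition of the time-delta (seconds since the most recent earlier observation) are assumptions made for illustration, not the paper's confirmed formulation.

```python
import numpy as np

def xmd_encode(values, timestamps):
    """Hypothetical helper: build [filled values | mask | time-delta] features.

    values:     (T, D) array, NaN where a gaze sample is missing.
    timestamps: (T,) array of sample times in seconds.
    Returns a (T, 3 * D) array.
    """
    values = np.asarray(values, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    n_steps, n_features = values.shape

    # Observation mask: True where a real measurement exists.
    mask = ~np.isnan(values)

    delta = np.zeros((n_steps, n_features))
    filled = np.zeros((n_steps, n_features))
    last_time = np.full(n_features, timestamps[0])  # assumed start reference
    last_value = np.zeros(n_features)               # assumed default before first observation

    for t in range(n_steps):
        # Time elapsed since the most recent earlier observation, per feature.
        delta[t] = timestamps[t] - last_time
        observed = mask[t]
        # Update the carried-forward value and time only where observed.
        last_value = np.where(observed, values[t], last_value)
        last_time = np.where(observed, timestamps[t], last_time)
        filled[t] = last_value

    return np.concatenate([filled, mask.astype(float), delta], axis=1)

# Example: one gaze channel with two dropped samples.
gaze = np.array([[1.0], [np.nan], [np.nan], [4.0]])
times = np.array([0.0, 0.1, 0.2, 0.3])
features = xmd_encode(gaze, times)
```

Passing the mask and time-delta alongside the filled values lets a downstream sequence model tell a genuine measurement from a carried-forward one, rather than treating imputed samples as real.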

Inference Economics · Agent and Tool Ecosystem · Mamba · MambaGaze · XMD encoding · CLARE dataset · CL-Drive dataset · NVIDIA Jetson

Related guides (3)

Mamba · Concept

Mamba: The Attention-Free Architecture That Scales Without Slowing Down

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

5 · Hugging Face Blog · 1mo ago

Bamba: Inference-Efficient Hybrid Mamba2 Model

Hugging Face published a blog post introducing Bamba, a hybrid architecture combining Mamba2 state-space layers with attention layers, designed for inference efficiency. The model targets reduced KV-cache memory and improved throughput compared to pure transformer architectures. The post covers architecture details, training approach, and benchmarking results positioning Bamba as a practical alternative for deployment-constrained settings.

Training Infrastructure · Frontier Model Releases · Mamba2 · Bamba · Hugging Face

5 · arXiv cs.AI · 24d ago

Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

This paper introduces Social Gaze Consistency (SGC), a high-level semantic detection axis based on the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals in images. The authors construct a controlled diagnostic dataset with region-specific gaze perturbations and a Block-Compositional Caption Supervision scheme to train detectors without generator-fingerprint memorization shortcuts. Cross-architecture validation shows +3.7 pp improvement on the COCOAI Interaction subset when applied to FakeVLM, with gains transferring from a single inpainter (FLUX.1-Fill) to multi-generator suites. The work argues that diffusion models share a spectral weakness in periocular structure, making gaze coherence a robust, backbone-agnostic detection signal orthogonal to existing low-level artifact methods.

Evaluation and Benchmarking · AI Safety Research Effort · FLUX.1-Fill · Social Gaze Consistency

6 · arXiv cs.AI · 12d ago

MemDreamer: Hierarchical graph memory and agentic retrieval for long video understanding

MemDreamer is a plug-and-play framework that decouples perception and reasoning for long-video understanding by incrementally building a three-tier Hierarchical Graph Memory capturing spatiotemporal and causal relations. During inference, a reasoning model uses an Observation-Reason-Action loop with agentic tool-augmented retrieval to navigate the memory graph, constraining the context window to 2% of full-context ingestion while achieving a 12.5-point absolute accuracy gain. The system reaches SOTA on four benchmarks, narrowing the gap with human experts to 3.7 points. The authors also report a strong linear correlation between logical reasoning performance and long-video understanding, proposing agentic capability scaling as a new paradigm for multimodal comprehension.

Long Context Evolution · Agent and Tool Ecosystem · MemDreamer · Hierarchical Graph Memory · Observation-Reason-Action

4 · arXiv cs.AI · 16d ago

BabyCL: Continual multimodal learning from egocentric child video in a single chronological pass

Researchers introduce BabyCL, a continual learning framework that processes the SAYCam egocentric child video dataset in a single chronological pass rather than shuffled multi-epoch training, more closely mimicking how children actually encounter their environment. The system combines streaming visual representation learning with image-text contrastive objectives, multi-stage temporal segmentation, and a dual replay buffer that manages visual and multimodal histories. BabyCL outperforms streaming baselines on the SAYCam Labeled-S 4AFC benchmark under matched compute budgets, substantially closing the gap to offline training upper bounds. The work advances understanding of whether neural networks can acquire word-referent mappings under biologically plausible training conditions.

Evaluation and Benchmarking · Multimodal Progress · SAYCam · BabyCL · SAYCam Labeled-S 4AFC

6 · arXiv cs.CL · 5d ago

Gaze Heads: Attention heads in VLMs that track and control image region description

Researchers identify a small set of attention heads in vision-language model backbones, called 'gaze heads', whose attention patterns track the image region currently being described. Using comic strips as a controlled testbed, they show that intervening on the top-100 gaze heads (fewer than 9% of all heads) can steer the model to describe any chosen region at 83.1% accuracy, without retraining. The mechanism generalizes across model sizes from 2B to 32B parameters and to natural images (COCO), establishing a practical inference-time control lever for multimodal models via mechanistic analysis.

Multimodal Progress · Gaze Heads: How VLMs Look at What They Describe · baulab

6 · arXiv cs.CL · 23d ago

VisualMem: Personal Visual Memory Benchmark and Architecture for Personalized AI Agents

This paper introduces a benchmark and hybrid architecture (VisualMem) for personal visual memory in long-term AI agent memory systems. The work addresses a gap in existing text-centric memory systems by capturing both explicit evidence (recurring user-associated entities) and implicit evidence (latent user facts from visual/multimodal cues) from images. VisualMem augments a text-memory backend with a structured personal visual memory module that uses conversational context to resolve identity, ownership, and durable user facts. Experiments show VisualMem substantially outperforms prior memory systems on the new benchmark while remaining competitive on standard text-memory benchmarks.

Long Context Evolution · Evaluation and Benchmarking · VisualMem · long-term memory · Personal Visual Memory Benchmark

7 · The Batch · 34h ago

Nvidia Nemotron 3 Ultra: hybrid Mamba-transformer open-weights model targeting agentic workloads

Nvidia released Nemotron 3 Ultra, a 550B-parameter (55B active) hybrid Mamba-transformer mixture-of-experts model with a 1M-token context window, publishing weights, training data, and RL environments under an open license. The model ranks as the highest-scoring U.S. open-weights model on the Artificial Analysis Intelligence Index (47.7-48.2) and is approximately three times faster than comparable open-weights rivals, though it trails leading Chinese models such as Kimi K2.6 and DeepSeek V4 Pro on intelligence benchmarks. Nvidia used a novel Multi-Teacher On-Policy Distillation approach with 10+ specialized teacher models and trained the model using NVFP4 quantization. The release is strategically motivated by Nvidia's interest in a healthy open-weights ecosystem that drives AI semiconductor adoption.

Frontier Model Releases · Open Weights Progress · Mamba · IFBench · Artificial Analysis Intelligence Index

3 · Hugging Face Blog · 1mo ago

Accelerating Vision-Language Models: BridgeTower on Habana Gaudi2

This Hugging Face blog post covers the deployment and acceleration of BridgeTower, a vision-language model, on Intel's Habana Gaudi2 AI accelerator hardware. The piece likely benchmarks inference throughput and training performance on Gaudi2 compared to other hardware. It represents a practical infrastructure and deployment case study for multimodal models on alternative AI accelerators.

Training Infrastructure · Inference Economics · BridgeTower · Habana Gaudi · Hugging Face