4 · arXiv cs.LG (Machine Learning) · 17d ago

SEAOTTER: Learned compression framework for cloud robotics combining autoencoder latents with JPEG compatibility

SEAOTTER is a compression framework for cloud robotics that pairs a sensor-embedded autoencoder with a one-time JPEG transcode step, enabling extreme compression ratios while remaining compatible with standard JPEG infrastructure. At a 200:1 compression ratio, the system achieves 7x faster encoding, 3.5x faster decoding, and +8% ImageNet top-1 accuracy relative to AVIF. The approach targets the asymmetric power and bandwidth constraints of the sensor, cloud, and consumer stages in robotic vision pipelines, and supports both general-purpose and task-aware transcoding for dense and vision-language perception tasks.

Inference Economics Multimodal Progress SEAOTTER University of Texas SysML Lab ImageNet

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

4 · Hugging Face Blog · 1mo ago

Scaling Robotics Datasets with Video Encoding

Hugging Face published a blog post on using video encoding techniques to scale robotics datasets. The post addresses the practical challenge of storing and transmitting large-scale robot learning data efficiently. Video compression is presented as a key infrastructure enabler for expanding robotics training corpora.

Training Infrastructure Agent and Tool Ecosystem video encoding robotics datasets Hugging Face

5 · The Batch · 17d ago

Apple researchers propose Feature Auto-Encoder to speed diffusion training via compressed DINOv2 embeddings

Researchers at Apple introduced Feature Auto-Encoder (FAE), a latent diffusion image generator that compresses DINOv2 vision encoder embeddings before learning to denoise them, then expands them back for decoding. The approach achieves comparable image quality to state-of-the-art diffusion models while training roughly 7x faster on ImageNet class-conditional generation. The key insight is that shrinking semantically rich vision embeddings reduces compute during diffusion training without sacrificing the representational benefits of large pretrained encoders.

Training Infrastructure Multimodal Progress DINOv2 Yuan Gao MS COCO

4 · arXiv · cs.AI · 12d ago

COMPACT-VA: Planning-aligned token compression for long-context autonomous driving

Researchers introduce COMPACT-VA, a working-memory framework that uses a conditional VQ-VAE to compress extended temporal context in vision-action autonomous driving models. Compression is conditioned on the historical trajectory and on a learned planning intent derived from future trajectories during training, enabling end-to-end optimization without backbone modifications. On high-signal dynamic scenarios, the method achieves a 68.3% success rate (>6% improvement) with a 3.3x speedup and a 2.7x memory reduction over uncompressed processing.

Long Context Evolution Inference Economics conditional VQ-VAE Planning-aligned Token Compression for Long-Context Autonomous Driving COMPACT-VA

5 · arXiv · cs.CL · 12d ago

TEVI: Sparse autoencoders for text-conditioned editing of CLIP image embeddings to improve vision-language alignment

TEVI is a framework that uses sparse autoencoders to disentangle CLIP image embeddings and a learned masking module to selectively reconstruct embeddings conditioned on a given caption, addressing the information imbalance between images and their captions. The approach improves image-text retrieval on both coarse-grained benchmarks (MS COCO, Flickr) and fine-grained long-caption benchmarks (IIW, DOCCI), with larger gains on richer captions. The work also shows improved robustness on the RoCOCO benchmark.

Evaluation and Benchmarking Multimodal Progress DOCCI MS COCO IIW

6 · Hugging Face Blog · 1mo ago

Ettin Suite: SoTA Paired Encoders and Decoders

Hugging Face introduces the Ettin Suite, a collection of paired encoder and decoder models claiming state-of-the-art performance. The suite appears to offer jointly trained or architecturally matched encoder-decoder pairs, potentially useful for tasks requiring both embedding and generation capabilities. The blog post is published on the Hugging Face platform, positioning it as a notable open-weights or open-access model release.

Frontier Model Releases Evaluation and Benchmarking Ettin Suite Hugging Face

6 · arXiv · cs.CL · 24d ago

SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering

SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs), a mechanistic interpretability tool, to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves a 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.

Training Infrastructure Evaluation and Benchmarking mechanistic interpretability GRPO Reinforcement Learning from Human Feedback

5 · Hugging Face Blog · 1mo ago

Bringing Robotics AI to Embedded Platforms: Dataset Recording, VLA Fine-Tuning, and On-Device Optimizations

NXP and Hugging Face describe a pipeline for deploying Vision-Language-Action (VLA) models on embedded/edge hardware, covering dataset recording, fine-tuning, and on-device optimization techniques. The post targets robotics applications where inference must run on resource-constrained microcontrollers or SoCs rather than cloud GPUs. Key topics include quantization, model compression, and integration with the LeRobot ecosystem. This represents a practical engineering bridge between frontier VLA research and real-world embedded robotics deployment.

Inference Economics Agent and Tool Ecosystem LeRobot NXP Semiconductors Vision-Language-Action model

5 · arXiv · cs.CL · 9d ago

SKIM: Adaptive soft-token compression for procedural skills in LLM workflows

Researchers introduce SKIM (SKIll coMpression), a multi-resolution soft token compression framework targeting procedural knowledge (skills/workflows) rather than factual documents. SKIM compresses reusable natural language skills to 30–60% of their original token length while preserving task performance, reducing prefill cost and latency when skills are repeatedly invoked. The method adapts compression depth to skill complexity and supports offline compression for frequently updated community skills.

Inference Economics Agent and Tool Ecosystem Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models SKIM (SKIll coMpression)