Almanac
6 · arXiv cs.LG (Machine Learning) · 19d ago

CHARM: Multimodal JEPA for Semantic Time-Series Embeddings via Channel-Aware Representation Learning

CHARM (Channel-Aware Representation Model) is a new Transformer-based architecture for general-purpose representation learning over heterogeneous multivariate time series. It integrates channel-level textual descriptions into a permutation-equivariant encoder, which is trained with a Joint Embedding Predictive Architecture (JEPA) objective and a novel temporally stable embedding loss. Using only a linear probe, the model achieves strong performance across anomaly detection, classification, and forecasting tasks. The text descriptions serve primarily as channel identifiers, which enables cross-dataset generalization.
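To make "permutation-equivariant" concrete, here is a minimal sketch of a toy channel layer. It is not CHARM's encoder: the function `channel_layer`, its weights, and the made-up identifier vectors (standing in for embedded channel descriptions) are all assumptions for illustration. The point it shows is that if each channel carries its own identifier and channels only interact through an order-independent pooling step, then reordering the input channels reorders the outputs in the same way.

```python
# Illustrative sketch only: a toy permutation-equivariant channel layer.
# This is NOT CHARM's architecture; every name and value here is hypothetical.

def channel_layer(channels, channel_ids, w_self=0.5, w_pool=0.25):
    """Map per-channel features to new per-channel features.

    channels:    list of per-channel feature vectors (lists of floats)
    channel_ids: list of per-channel identifier vectors of the same shape,
                 standing in for embedded text descriptions
    """
    # Attach each channel's identifier to its own features.
    tagged = [[x + e for x, e in zip(feat, ident)]
              for feat, ident in zip(channels, channel_ids)]
    n = len(tagged)
    dim = len(tagged[0])
    # The mean over channels does not depend on channel order.
    pooled = [sum(row[d] for row in tagged) / n for d in range(dim)]
    # Each output mixes the channel's own features with the shared pool.
    return [[w_self * row[d] + w_pool * pooled[d] for d in range(dim)]
            for row in tagged]


def permute(rows, order):
    """Reorder a list of rows according to a list of indices."""
    return [rows[i] for i in order]


# Made-up inputs: three channels with two features each.
channels = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
channel_ids = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.3]]
order = [2, 0, 1]

out = channel_layer(channels, channel_ids)
out_permuted = channel_layer(permute(channels, order),
                             permute(channel_ids, order))

# Equivariance check: permuting the inputs permutes the outputs identically
# (compared with a small tolerance for floating-point summation order).
is_equivariant = all(
    abs(a - b) < 1e-9
    for row_a, row_b in zip(permute(out, order), out_permuted)
    for a, b in zip(row_a, row_b)
)
```

Note that the identifiers travel with their channels under the permutation. That is what lets a model of this kind tell channels apart without relying on their position in the input.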

Related events (8)

5 · arXiv · cs.AI · 1mo ago

Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction

This paper critiques the standard practice of regularizing Joint-Embedding Predictive Architecture (JEPA) encoders toward isotropic Gaussian marginals, showing that this Euclidean symmetry assumption incurs a quantifiable 'price of isotropy' and that no geometry-independent fixed marginal target is universally canonical. The authors prove that oracle one-view marginals do not identify the view-to-view predictive coupling, arguing structural bias should enter the cross-view coupling instead. They introduce HamJEPA, which encodes views as phase-space states and uses a learned Hamiltonian leapfrog map for view-to-view prediction, with symplectic coupling identified as the key driver of gains. HamJEPA outperforms SIGReg on CIFAR-100 by up to +6.45 kNN@20 and +10.64 linear-probe points at 80 epochs, with similar improvements on ImageNet-100.
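For readers unfamiliar with leapfrog maps, the following is a minimal sketch of the classical leapfrog integrator on a fixed toy Hamiltonian, H(q, p) = p²/2 + q²/2. The summary describes HamJEPA as using a learned Hamiltonian over encoded views, so the fixed potential, the scalar state, and the function names here are assumptions for illustration only. The sketch shows two properties that make the scheme attractive as a structured predictor: energy stays close to its initial value, and the map can be undone by flipping the momentum and integrating again.

```python
# Illustrative sketch: leapfrog integration for a fixed, separable
# Hamiltonian H(q, p) = p^2 / 2 + q^2 / 2 (a harmonic oscillator).
# Assumption: the source describes a *learned* Hamiltonian; this toy
# potential is only a stand-in to show the update rule.

def grad_potential(q):
    """dV/dq for the toy potential V(q) = q^2 / 2."""
    return q


def leapfrog(q, p, step, n_steps):
    """Advance (q, p) with the kick-drift-kick leapfrog scheme."""
    for _ in range(n_steps):
        p = p - 0.5 * step * grad_potential(q)  # half-step momentum update
        q = q + step * p                        # full-step position update
        p = p - 0.5 * step * grad_potential(q)  # half-step momentum update
    return q, p


def energy(q, p):
    """Total energy of the toy system."""
    return 0.5 * p * p + 0.5 * q * q


q0, p0 = 1.0, 0.0
q1, p1 = leapfrog(q0, p0, step=0.1, n_steps=100)
energy_drift = abs(energy(q1, p1) - energy(q0, p0))

# Reversibility: negate the momentum, integrate the same number of steps,
# then negate the momentum again to return to the starting state.
q_back, p_back = leapfrog(q1, -p1, step=0.1, n_steps=100)
p_back = -p_back
```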

5 · Hugging Face Blog · 1mo ago

Multimodal Embedding & Reranker Models with Sentence Transformers

Hugging Face's Sentence Transformers library has added support for multimodal embedding and reranking models, enabling joint text-image (and potentially other modalities) representations within a unified framework. The update extends the library's existing text-focused embedding capabilities to handle cross-modal retrieval and reranking tasks. This lowers the barrier for practitioners building multimodal search and RAG pipelines using open-weights models.

4 · arXiv · cs.CL · 2d ago

CADE framework proposes direct timestep embedding and contrastive alignment for time-series question answering

A new arXiv preprint introduces CADE (Contrastive Alignment with Direct Embedding), a framework for time-series question answering (TSQA) that bypasses the tokenization bottleneck of standard LLMs by mapping each timestep directly into the LLM embedding space via a point-wise linear encoder and MLP projector. The approach also introduces a one-directional supervised contrastive loss to align time-series embeddings with frozen class-name text anchors, bridging the semantic gap between numerical and language representations. Evaluated on the Time-MQA benchmark across six TSQA tasks, CADE outperforms both open-source and proprietary LLM baselines. The work addresses a concrete limitation of patch-based encoders — fixed granularity and poor cross-dataset transfer — with a cleaner architectural alternative.
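One plausible reading of a "one-directional" contrastive loss against frozen text anchors is a softmax cross-entropy over similarities computed from the time-series side only, with the anchors treated as constants. The sketch below follows that reading; it is an assumption rather than the paper's stated formulation, and the function names, the temperature value, and the two-dimensional example vectors are all hypothetical.

```python
import math

# Illustrative sketch under stated assumptions: a time-series embedding is
# scored against fixed class-name anchors, and the loss is cross-entropy on
# the true class. The anchors are plain constants here; in a training
# framework they would simply receive no gradient.


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_contrastive_loss(ts_embedding, anchors, label, temperature=0.1):
    """Cross-entropy of the true-class anchor under a similarity softmax."""
    logits = [cosine(ts_embedding, a) / temperature for a in anchors]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_norm - logits[label]


# Made-up anchors for two classes and two candidate embeddings for class 0.
anchors = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = anchor_contrastive_loss([0.9, 0.1], anchors, label=0)
loss_misaligned = anchor_contrastive_loss([0.1, 0.9], anchors, label=0)
```

An embedding that points toward its class anchor gets a lower loss than one that points toward a different anchor, which is the pressure that pulls numerical representations toward the matching text representation.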

5 · Hugging Face Blog · 1mo ago

Introduction to Matryoshka Embedding Models

This Hugging Face blog post introduces Matryoshka Representation Learning (MRL), a technique for training embedding models that encode information at multiple granularities within a single vector. The approach allows truncating embeddings to smaller dimensions without significant loss in retrieval quality, enabling flexible trade-offs between storage/compute costs and accuracy. The post covers training, evaluation, and practical usage of Matryoshka embedding models via the Sentence Transformers library.
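At inference time, the truncation step described above amounts to keeping the leading dimensions of a vector and re-normalizing before comparison. The sketch below shows only that step, using plain Python and made-up six-dimensional vectors; it does not train a Matryoshka model or call the Sentence Transformers library, and the helper names are hypothetical.

```python
import math

# Illustrative sketch: truncate embeddings to their leading dimensions and
# re-normalize so that dot products are cosine similarities again.
# The vectors below are made-up values, not outputs of any real model.


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` entries of `vec` and scale them to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


doc = [0.8, 0.5, 0.2, 0.1, 0.05, 0.02]
query = [0.7, 0.6, 0.1, 0.2, 0.01, 0.03]

# Similarity with all six dimensions versus only the first two.
full_sim = dot(truncate_and_normalize(doc, 6), truncate_and_normalize(query, 6))
small_sim = dot(truncate_and_normalize(doc, 2), truncate_and_normalize(query, 2))
short_doc = truncate_and_normalize(doc, 2)
```

Truncation only works well when the model was trained so that the leading dimensions carry the most information. Applying it to an ordinary embedding model gives no such guarantee.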

5 · arXiv · cs.LG · 46h ago

Multi-Task Bayesian In-Context Learning for Amortized Hierarchical Inference

A new arXiv preprint introduces a multi-task in-context learning framework for amortized hierarchical Bayesian predictive inference, representing prior information as a prefix of in-context datasets fed to a transformer. The model learns to adapt predictions across families of priors, addressing the brittleness of prior-data fitted models under distribution shift. On evaluations including out-of-meta-distribution priors and high-dimensional latent structures, the method matches oracle Bayesian predictors while being orders of magnitude faster, with a real-world spatiotemporal temperature prediction demonstration.

4 · arXiv · cs.CL · 4d ago

Transformer embeddings shown to intrinsically encode Russell's circumplex model of emotion geometry

A new arXiv paper investigates whether Transformer-based text and speech encoders (RoBERTa, wav2vec 2.0) recover the geometric structure of Russell's circumplex model of affect — a valence-arousal topology from psychology. Experiments on naturalistic datasets (MSP-Podcast) and LLM-generated stimuli show that multimodal fusion achieves perfect topological alignment with Russell's primary emotion ordering, and zero-shot generic text embeddings place fine-grained emotion terms near their human-mapped coordinates. The authors argue this structure is intrinsically encoded in the representations rather than being an artifact of labeling, bridging psychological theory and representation learning.

5 · Hugging Face Blog · 1mo ago

Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

Hugging Face published a blog post detailing how to train and finetune multimodal embedding and reranker models using the Sentence Transformers library. The post covers techniques for building models that can jointly embed text and images for retrieval and reranking tasks. This represents an extension of the Sentence Transformers ecosystem into multimodal territory, enabling practitioners to build cross-modal search and ranking systems.

5 · arXiv · cs.CL · 1mo ago

Conditional Scale Entropy: A Wavelet-Derived Tool for Mechanistic Interpretability of Metaphor Processing in Transformers

This paper introduces Conditional Scale Entropy (CSE), a wavelet-derived measure of how transformer computation engages across frequency scales at each layer, and applies it to study metaphor processing in decoder-only language models. The authors prove CSE is invariant to update magnitude, isolating structural computation patterns from intensity. Across architectures ranging from GPT-2 (124M) to LLaMA-2 7B and GPT-oss 20B, metaphorical tokens consistently produce higher spectral breadth than literal tokens in early-to-mid layers, with the effect surviving permutation correction and specificity controls. The work establishes multi-scale coordination as a consistent mechanistic signature of metaphorical language processing and positions CSE as a general interpretability tool for cross-depth structure in transformers.