RDS Fusion: Hybrid neuro-symbolic gating with compressed CoT for zero-shot irony detection
Researchers introduce the Robust Dual-Signal (RDS) Fusion framework, a hybrid neuro-symbolic architecture that compresses Chain-of-Thought reasoning without supervised fine-tuning for irony and sarcasm detection in social media text. Evaluated on TweetEval (N=734) and iSarcasm, the zero-shot system matches fine-tuned BERTweet performance and outperforms supervised SemEval transformer ensembles on the imbalanced iSarcasm dataset. A statistical ablation shows that only the full concurrent fusion of all three signals yields a validated improvement, with individual components providing no significant standalone gain.
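The summary does not say which statistical test the ablation uses. As a rough sketch of how a paired ablation of this kind is often checked, the example below applies an exact McNemar test to two classifiers scored on the same test items. The choice of test, the function name `mcnemar_exact`, and the discordant counts are all assumptions made for illustration and are not taken from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for paired predictions.

    b: items the baseline classified correctly and the variant did not
    c: items the variant classified correctly and the baseline did not
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the two systems are indistinguishable.
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of seeing k or fewer in the smaller cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, NOT reported in the source.
p_single_component = mcnemar_exact(b=20, c=24)  # one signal added alone
p_full_fusion = mcnemar_exact(b=10, c=34)       # all signals fused together

print(f"single component: p = {p_single_component:.3f}")
print(f"full fusion:      p = {p_full_fusion:.5f}")
```

With counts like these, a single component would fall short of a 0.05 threshold while the full fusion would clear it, which is the pattern the summary describes.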
Related events (8)
FusionRS: Large-scale RGB-infrared-text dataset for dual-modal remote sensing vision-language models
Researchers introduce FusionRS, the first large-scale dataset pairing RGB and infrared remote sensing images with both conventional and IR-aware text captions, designed to support dual-modal vision-language learning. The dataset is constructed by translating public RGB remote sensing images into infrared-style counterparts using image translation. Using FusionRS, the authors train CLIP-style alignment models and fine-tune generative VLMs, demonstrating improvements in RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only baselines. The work addresses a gap in multimodal remote sensing foundation models by providing modality-specific textual supervision for infrared imagery.
Transformer embeddings shown to intrinsically encode Russell's circumplex model of emotion geometry
A new arXiv paper investigates whether Transformer-based text and speech encoders (RoBERTa, wav2vec 2.0) recover the geometric structure of Russell's circumplex model of affect, a valence-arousal topology from psychology. Experiments on naturalistic datasets (MSP-Podcast) and LLM-generated stimuli show that multimodal fusion achieves perfect topological alignment with Russell's primary emotion ordering, and zero-shot generic text embeddings place fine-grained emotion terms near their human-mapped coordinates. The authors argue this structure is intrinsically encoded in the representations rather than being an artifact of labeling, bridging psychological theory and representation learning.
SARDI: Self-Augmenting Retrieval for Diffusion Language Models using lookahead tokens
Researchers introduce SARDI, a training-free RAG framework for discrete diffusion language models that repurposes discarded low-confidence tokens during denoising as lookahead signals to guide retrieval before output is finalized. The method is retriever-agnostic and applicable to any reasoning-capable discrete diffusion LM. Evaluated across five multi-hop QA benchmarks, SARDI outperforms training-free diffusion and autoregressive retrieval baselines at up to 8x higher throughput.
Powerful ASR + Diarization + Speculative Decoding with Hugging Face Inference Endpoints
Hugging Face published a blog post describing a pipeline that combines automatic speech recognition (ASR), speaker diarization, and speculative decoding on their Inference Endpoints platform. The post demonstrates how these three techniques can be integrated to produce faster, speaker-attributed transcriptions. Speculative decoding is highlighted as a key inference optimization that reduces latency for ASR workloads.
RAT: Reference-Augmented Training improves deepfake audio detection without reference at inference
Researchers introduce Reference-Augmented Training (RAT), a training strategy for automatic speaker verification (ASV) anti-spoofing that conditions a model on speaker-reference recordings during training. They find that the model learns to ignore the reference at inference. Counterintuitively, this training regime induces invariances that improve deepfake detection even when the reference is replaced with a zero vector at test time. RAT achieves state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, outperforming large ensemble systems.
Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection
This paper introduces Social Gaze Consistency (SGC), a high-level semantic detection axis based on the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals in images. The authors construct a controlled diagnostic dataset with region-specific gaze perturbations and a Block-Compositional Caption Supervision scheme to train detectors without generator-fingerprint memorization shortcuts. Cross-architecture validation shows +3.7 pp improvement on the COCOAI Interaction subset when applied to FakeVLM, with gains transferring from a single inpainter (FLUX.1-Fill) to multi-generator suites. The work argues that diffusion models share a spectral weakness in periocular structure, making gaze coherence a robust, backbone-agnostic detection signal orthogonal to existing low-level artifact methods.
MemDreamer: Hierarchical graph memory and agentic retrieval for long video understanding
MemDreamer is a plug-and-play framework that decouples perception and reasoning for long-video understanding by incrementally building a three-tier Hierarchical Graph Memory capturing spatiotemporal and causal relations. During inference, a reasoning model uses an Observation-Reason-Action loop with agentic tool-augmented retrieval to navigate the memory graph, constraining the context window to 2% of full-context ingestion while achieving a 12.5-point absolute accuracy gain. The system reaches SOTA on four benchmarks, narrowing the gap with human experts to 3.7 points. The authors also report a strong linear correlation between logical reasoning performance and long-video understanding, proposing agentic capability scaling as a new paradigm for multimodal comprehension.
Multimodal Embedding & Reranker Models with Sentence Transformers
Hugging Face's Sentence Transformers library has added support for multimodal embedding and reranking models, enabling joint text-image (and potentially other modalities) representations within a unified framework. The update extends the library's existing text-focused embedding capabilities to handle cross-modal retrieval and reranking tasks. This lowers the barrier for practitioners building multimodal search and RAG pipelines using open-weights models.
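As a rough illustration of the cross-modal retrieval step mentioned in the Sentence Transformers item above, the sketch below ranks image embeddings against a text-query embedding by cosine similarity. It deliberately does not call the library: the vectors are hand-written toy values standing in for model outputs, and the names `cosine` and `rank_by_similarity` are hypothetical helpers, not part of any Sentence Transformers API.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed values); a real pipeline would obtain these
# from a multimodal embedding model that maps text and images into
# the same vector space.
text_query = [0.9, 0.1, 0.0]
image_embeddings = {
    "cat.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.1, 0.9, 0.2],
    "tree.jpg": [0.0, 0.1, 0.9],
}

ranking = rank_by_similarity(text_query, image_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A reranking stage would typically take the top few results from a ranking like this and rescore them with a slower, more accurate model.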
