Almanac
← Events
4arXiv cs.CL (Computation and Language)·10d ago

GenAIR: LLM-grounded archetype representations improve sequential recommendation

GenAIR is a framework that uses LLMs to infer 'archetype' profiles of items' ideal target audiences, generating richer item embeddings for sequential recommendation systems. A behavioral calibration objective aligns these semantic embeddings with actual user interaction patterns, closing the gap between language-space representations and real-world behavior. Experiments on three datasets show consistent improvements over state-of-the-art baselines across multiple sequential recommendation models.

Related guides (1)

Related events (8)

4arXiv · cs.AI·46h ago·source ↗

G2Rec: Scalable framework unifying graph-based user modeling with semantic tokenization for generative recommendation

Researchers propose G2Rec, a framework that combines holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation systems. The approach addresses limitations of existing methods—scalability issues in graph serialization and lack of supervision in semantic tokenization—by learning user interest prototypes without ground-truth labels. The system has been deployed in production across product surfaces and evaluated on public datasets, showing improvements over prior methods.

6arXiv · cs.AI·17d ago·source ↗

AgenticRL: Self-refining LLM-guided reward design and policy refinement for UAV navigation

AgenticRL is a framework that uses a multimodal GPT agent to automate reward function generation, policy training via PPO, and closed-loop self-refinement for UAV navigation tasks. The agent evaluates trained policies through diagnostic feedback, identifies failure modes, and iteratively refines rewards without human intervention. Evaluated across five navigation tasks, the closed-loop refinement improves policy behavior by 71% over initial rewards, with sim-to-real transfer achieving 91% real-world success rate and 94% sim-to-real accuracy.

6arXiv · cs.CL·1mo ago·source ↗

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

5arXiv · cs.CL·4d ago·source ↗

ContextRL: Context-aware reinforcement learning improves grounding in agentic and multimodal LLMs

Researchers introduce ContextRL, a reinforcement learning method that trains LLMs to select the context that supports a given query-answer pair from two highly similar candidates, rather than supervising only final answers. The approach constructs contrastive context pairs in two domains: coding agent trajectories (1k pairs) and multimodal image pairs (7k pairs). ContextRL achieves +2.2% average gains over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 visual QA benchmarks, with ablations showing the gains stem from the context-selection objective rather than the contrastive data alone.

4Mistral Ai News·1mo ago·source ↗

Mistral AI: Using LLM-as-a-Judge with Structured Outputs for RAG Evaluation

Mistral AI published a technical guide on evaluating Retrieval-Augmented Generation (RAG) systems using the 'LLM as a Judge' paradigm combined with their structured outputs API feature. The approach implements the RAG Triad framework—context relevance, groundedness, and answer relevance—using Pydantic schemas to enforce machine-readable evaluation outputs. Mistral models serve as both the generator and judge components, enabling scalable automated evaluation without human annotators.

3arXiv · cs.LG·11d ago·source ↗

LLM-augmented XAI framework with mutual feature interactions for network operations

A new arXiv paper proposes a framework combining LLMs with SHAP-based explainability, augmented by mutual feature interaction data, to generate natural language explanations for AI/ML models used in network operations. The approach is validated on an optical quality-of-transmission estimation task with human evaluators, showing 12.2% and 6.2% improvements in explanation usefulness and scope over a SHAP-only baseline, with 97.5% correctness. The work targets the gap between technical XAI outputs and actionable insights for non-specialist network operators.

5arXiv · cs.CL·15d ago·source ↗

OneReason: Activating Chain-of-Thought Reasoning in Generative Recommendation Models

Researchers from the OneRec team introduce OneReason, a framework for enabling reasoning capabilities in generative recommendation models deployed across short-video, live-streaming, advertising, and e-commerce. The work identifies a key failure mode — that naive thinking-mode integration does not outperform non-thinking baselines — and diagnoses this as a deficit in two factors: itemic token perception and user behavior cognition. The proposed solution combines perception-focused pre-training, a three-level cognition-enhanced CoT format for supervised fine-tuning, and a specialize-then-unify RL training recipe.

6arXiv · cs.CL·29d ago·source ↗

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Agentic CLEAR is an automatic evaluation framework for LLM-based agentic systems that analyzes behavior at three granularity levels: system, trace, and node. Unlike existing tools that rely on static error taxonomies or focus only on observability, it dynamically generates textual insights and integrates above the observability layer with an accessible UI. Experiments across four benchmarks and seven agentic settings demonstrate strong alignment with human-annotated errors and predictive accuracy for task success rates.