GenAIR: LLM-grounded archetype representations improve sequential recommendation
GenAIR is a framework that uses LLMs to infer 'archetype' profiles of items' ideal target audiences, generating richer item embeddings for sequential recommendation systems. A behavioral calibration objective aligns these semantic embeddings with actual user interaction patterns, closing the gap between language-space representations and real-world behavior. Experiments on three datasets show consistent improvements over state-of-the-art baselines across multiple sequential recommendation models.
Related events (8)
G2Rec: Scalable framework unifying graph-based user modeling with semantic tokenization for generative recommendation
Researchers propose G2Rec, a framework that combines holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation systems. The approach addresses two limitations of existing methods (scalability issues in graph serialization and a lack of supervision in semantic tokenization) by learning user interest prototypes without ground-truth labels. The system has been deployed in production across product surfaces and evaluated on public datasets, showing improvements over prior methods.
AgenticRL: Self-refining LLM-guided reward design and policy refinement for UAV navigation
AgenticRL is a framework that uses a multimodal GPT agent to automate reward function generation, policy training via PPO, and closed-loop self-refinement for UAV navigation tasks. The agent evaluates trained policies through diagnostic feedback, identifies failure modes, and iteratively refines rewards without human intervention. Evaluated across five navigation tasks, the closed-loop refinement improves policy behavior by 71% over initial rewards, and sim-to-real transfer achieves a 91% real-world success rate and 94% sim-to-real accuracy.
Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL
Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.
ContextRL: Context-aware reinforcement learning improves grounding in agentic and multimodal LLMs
Researchers introduce ContextRL, a reinforcement learning method that trains LLMs to select the context that supports a given query-answer pair from two highly similar candidates, rather than supervising only final answers. The approach constructs contrastive context pairs in two domains: coding agent trajectories (1k pairs) and multimodal image pairs (7k pairs). ContextRL achieves +2.2% average gains over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 visual QA benchmarks, with ablations showing the gains stem from the context-selection objective rather than the contrastive data alone.
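A context-selection objective of this kind can be sketched as a binary reward over the two candidates, combined with the group-normalized advantages used by GRPO. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the "A"/"B" labels, and the 0/1 reward scale are all hypothetical.

```python
from statistics import mean, pstdev


def selection_reward(chosen: str, supporting: str) -> float:
    """Assumed reward: 1.0 if the model picked the supporting context, else 0.0."""
    return 1.0 if chosen == supporting else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: each reward normalized by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers for one contrastive pair whose supporting context is "A".
samples = ["A", "B", "A", "A"]
rewards = [selection_reward(choice, "A") for choice in samples]
advantages = group_relative_advantages(rewards)
```

Samples that pick the wrong context receive a negative advantage relative to the rest of the group. If every sample in the group agrees, all advantages are zero and that pair contributes no learning signal.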
Mistral AI: Using LLM-as-a-Judge with Structured Outputs for RAG Evaluation
Mistral AI published a technical guide on evaluating Retrieval-Augmented Generation (RAG) systems using the 'LLM as a Judge' paradigm combined with their structured outputs API feature. The approach implements the RAG Triad framework (context relevance, groundedness, and answer relevance), using Pydantic schemas to enforce machine-readable evaluation outputs. Mistral models serve as both the generator and judge components, enabling scalable automated evaluation without human annotators.
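The idea of forcing a judge's verdict into a fixed schema can be illustrated without any external dependencies. The sketch below uses standard-library dataclasses as a stand-in for the Pydantic schemas the guide describes. The class names, field names, and the 0-3 score scale are assumptions made for illustration, not Mistral's actual schema or API.

```python
import json
from dataclasses import dataclass

# The three RAG Triad criteria named in the guide; the key spellings are hypothetical.
CRITERIA = ("context_relevance", "groundedness", "answer_relevance")


@dataclass(frozen=True)
class CriterionScore:
    score: int          # assumed 0-3 scale
    explanation: str    # the judge's short justification


@dataclass(frozen=True)
class RagTriadVerdict:
    context_relevance: CriterionScore
    groundedness: CriterionScore
    answer_relevance: CriterionScore


def parse_verdict(raw: str) -> RagTriadVerdict:
    """Parse a judge's JSON reply, rejecting missing criteria or out-of-range scores."""
    data = json.loads(raw)
    parsed = {}
    for name in CRITERIA:
        if name not in data:
            raise ValueError(f"missing criterion: {name}")
        entry = data[name]
        score = entry.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 3:
            raise ValueError(f"invalid score for {name}: {score!r}")
        parsed[name] = CriterionScore(score=score, explanation=str(entry.get("explanation", "")))
    return RagTriadVerdict(**parsed)
```

In practice the raw string would come from the judge model's structured response; here any JSON string with the three keys works. Rejecting malformed replies at parse time keeps downstream aggregation from silently averaging over bad scores.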
LLM-augmented XAI framework with mutual feature interactions for network operations
A new arXiv paper proposes a framework combining LLMs with SHAP-based explainability, augmented by mutual feature interaction data, to generate natural language explanations for AI/ML models used in network operations. The approach is validated on an optical quality-of-transmission estimation task with human evaluators, showing improvements of 12.2% in explanation usefulness and 6.2% in scope over a SHAP-only baseline, with 97.5% correctness. The work targets the gap between technical XAI outputs and actionable insights for non-specialist network operators.
OneReason: Activating Chain-of-Thought Reasoning in Generative Recommendation Models
Researchers from the OneRec team introduce OneReason, a framework for enabling reasoning capabilities in generative recommendation models deployed across short-video, live-streaming, advertising, and e-commerce. The work identifies a key failure mode, namely that naive thinking-mode integration does not outperform non-thinking baselines, and attributes it to deficits in two factors: itemic token perception and user behavior cognition. The proposed solution combines perception-focused pre-training, a three-level cognition-enhanced CoT format for supervised fine-tuning, and a specialize-then-unify RL training recipe.
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
Agentic CLEAR is an automatic evaluation framework for LLM-based agentic systems that analyzes behavior at three granularity levels: system, trace, and node. Unlike existing tools that rely on static error taxonomies or focus only on observability, it dynamically generates textual insights and integrates above the observability layer with an accessible UI. Experiments across four benchmarks and seven agentic settings demonstrate strong alignment with human-annotated errors and predictive accuracy for task success rates.
