3 · arXiv cs.LG (Machine Learning) · 12d ago

Bradley-Terry model proposed for fairer ranking of recommendation algorithms across dataset types

A new arXiv preprint introduces a Bradley-Terry (BT) model-based methodology for ranking recommendation algorithms in a way that accounts for dataset characteristics such as sparsity, sequential structure, and scale. The authors argue that naive metric aggregation (e.g., averaging NDCG) produces misleading rankings and propose BT trees and covariate-extended BT models as alternatives. The framework also enables ranking predictions on unseen datasets without running the models, and includes a new metric for ranking consistency.
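To make the core idea concrete, here is a minimal sketch of a plain Bradley-Terry fit on pairwise "wins" between algorithms, where a win means a higher metric value on a given dataset. It is not the preprint's BT trees or covariate-extended models. The score table, algorithm names, and the helper names `pairwise_wins` and `fit_bradley_terry` are illustrative assumptions, and the fit uses the standard minorization-maximization (MM) update for Bradley-Terry strengths.

```python
import numpy as np

def pairwise_wins(scores):
    """scores: (n_datasets, n_algorithms) array of a metric such as NDCG.
    Returns wins[i, j] = number of datasets where algorithm i beats algorithm j."""
    n_algos = scores.shape[1]
    wins = np.zeros((n_algos, n_algos))
    for row in scores:
        # Strict comparison: ties count as a win for neither side.
        wins += row[:, None] > row[None, :]
    return wins

def fit_bradley_terry(wins, n_iter=200, tol=1e-10):
    """Estimate Bradley-Terry strengths with the standard MM update.
    Assumes every algorithm wins at least one comparison; otherwise the
    maximum-likelihood estimate does not exist."""
    n = wins.shape[0]
    games = wins + wins.T              # comparisons played between each pair
    total_wins = wins.sum(axis=1)      # total wins per algorithm
    p = np.full(n, 1.0 / n)            # start from equal strengths
    for _ in range(n_iter):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / denom
        new_p /= new_p.sum()           # fix the scale: strengths sum to 1
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p

# Hypothetical NDCG values: rows are datasets, columns are algorithms A, B, C.
names = ["A", "B", "C"]
scores = np.array([
    [0.30, 0.25, 0.20],
    [0.40, 0.45, 0.35],
    [0.22, 0.20, 0.25],
    [0.50, 0.48, 0.41],
])

wins = pairwise_wins(scores)
strengths = fit_bradley_terry(wins)
ranking = [names[i] for i in np.argsort(-strengths)]
print(ranking, strengths.round(3))
```

Under this model, the estimated probability that algorithm i outranks algorithm j is `strengths[i] / (strengths[i] + strengths[j])`. Unlike an average of raw NDCG values, that probability does not depend on how the metric's scale varies from one dataset to another.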

Evaluation and Benchmarking · Bradley-Terry Rankings for Recommender Systems Across Dataset Taxonomies · NDCG

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 2d ago

RECOM benchmark reveals validity-discrimination tradeoff in automatic metrics for open-ended QA

Researchers introduce RECOM, a contamination-free evaluation dataset of 15,000 r/AskReddit questions paired with authentic community replies postdating all evaluated models' training cutoffs. Testing five open-source 7–10B LLMs, the paper finds that no standard automatic metric (cosine similarity, BERTScore, LLM judges) simultaneously achieves both validity (distinguishing real from random answers) and discriminative power (ranking models against each other). Cosine similarity is valid but cannot rank models; BERTScore's apparent ranking collapses when response length is controlled. The authors argue this tradeoff is a structural property of metric representation design and recommend reporting metrics on both axes with an explicit random-baseline floor.

Evaluation and Benchmarking · BERTScore · RECOM · r/AskReddit

6 · arXiv · cs.CL · 17d ago

Taiji: Pareto Optimal Policy Optimization for LLM-enhanced recommendation at Kuaishou scale

Researchers from Kuaishou present Taiji, an LLM-as-Enhancer framework for industrial recommender systems that addresses two bottlenecks: generating high-quality chain-of-thought data via reverse-engineered reasoning and rejection sampling during SFT, and balancing semantic vs. ID-based rewards during RL alignment via a new algorithm called Pareto Optimal Policy Optimization (POPO). The system has been deployed on Kuaishou's advertising platform since May 2026, serving over 400 million daily users. The paper contributes both a practical deployment case study and a novel RL alignment technique for the LLM4Rec paradigm.

Enterprise Deployment Patterns · Alignment and RLHF · Taiji · Pareto Optimal Policy Optimization · Kuaishou

6 · arXiv · cs.AI · 4d ago

Bayesian audit framework for public AI evaluation archives challenges frontier model claims

A new arXiv preprint proposes a Bayesian inference and decision-audit framework for interpreting public AI evaluation archives (LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as longitudinal time series rather than terminal leaderboards. The paper demonstrates that a single terminal snapshot is compatible with multiple distinct performance histories, yielding ambiguous timing estimates for reaching capability ceilings. A candidate selection-aware frontier model is shown to fail synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, with fixed audit gates rejecting its stronger claims. The work proposes an archive-and-adjudication protocol to reconstruct evaluation histories and falsify unsupported frontier capability claims.

Evaluation and Benchmarking · AI Safety Research · Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations · GAIA · Open LLM Leaderboard · +3 more

4 · arXiv · cs.CL · 10d ago

GenAIR: LLM-grounded archetype representations improve sequential recommendation

GenAIR is a framework that uses LLMs to infer 'archetype' profiles of items' ideal target audiences, generating richer item embeddings for sequential recommendation systems. A behavioral calibration objective aligns these semantic embeddings with actual user interaction patterns, closing the gap between language-space representations and real-world behavior. Experiments on three datasets show consistent improvements over state-of-the-art baselines across multiple sequential recommendation models.

Enterprise Deployment Patterns · GenAIR

4 · arXiv · cs.AI · 46h ago

G2Rec: Scalable framework unifying graph-based user modeling with semantic tokenization for generative recommendation

Researchers propose G2Rec, a framework that combines holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation systems. The approach addresses limitations of existing methods—scalability issues in graph serialization and lack of supervision in semantic tokenization—by learning user interest prototypes without ground-truth labels. The system has been deployed in production across product surfaces and evaluated on public datasets, showing improvements over prior methods.

Enterprise Deployment Patterns · G2Rec

4 · arXiv · cs.LG · 22d ago

FedTSV: Fairness-Aware Federated Learning via Trajectory Shapley Value

This paper introduces the Trajectory Shapley Value (TSV), a contribution metric that evaluates each federated learning client's influence on the global model's optimization trajectory using validation-based, temporally consistent utility. Building on TSV, the authors propose FedTSV, an adaptive aggregation method that converts per-round evaluations into dynamic client weights to handle heterogeneous and adversarial participation. Experiments on benchmark datasets demonstrate improved convergence speed, robustness, and equitable contribution assessment compared to fixed-weight aggregation baselines.

Training Infrastructure · AI Safety Research · Federated Learning · Trajectory Shapley Value (TSV) · Shapley values · +1 more

5 · arXiv · cs.CL · 15d ago

OneReason: Activating Chain-of-Thought Reasoning in Generative Recommendation Models

Researchers from the OneRec team introduce OneReason, a framework for enabling reasoning capabilities in generative recommendation models deployed across short-video, live-streaming, advertising, and e-commerce. The work identifies a key failure mode — that naive thinking-mode integration does not outperform non-thinking baselines — and diagnoses this as a deficit in two factors: itemic token perception and user behavior cognition. The proposed solution combines perception-focused pre-training, a three-level cognition-enhanced CoT format for supervised fine-tuning, and a specialize-then-unify RL training recipe.

Agent and Tool Ecosystem · Alignment and RLHF · Chain-of-Thought Reasoning · OneRec · OneReason Technical Report

3 · arXiv · cs.LG · 11d ago

Systematic framework for selecting trajectories in data augmentation evaluated across five strategies

A thesis-derived arXiv preprint proposes a framework for evaluating five trajectory selection strategies—Outlierness, Diversity, Representativeness, Uncertainty, and Random—for data augmentation in spatio-temporal ML tasks. The study tests these strategies across four datasets spanning animal behavior, maritime, and urban traffic domains using linear and non-linear models with Optuna-based hyperparameter optimization. Key findings show systematic strategies (especially Outlierness and Uncertainty) outperform random selection in sparse datasets but can degrade performance in dense, high-quality datasets, with UMAP visualization confirming topological effects.

Evaluation and Benchmarking · Optuna · A Systematic Approach for Selecting Trajectories for Data Augmentation · UMAP