4 · arXiv cs.AI (Artificial Intelligence) · 13d ago

PaperFlow: A longitudinal framework for daily scientific paper recommendation with profiling and interest drift

PaperFlow is a new framework for scientific paper recommendation that models the process as a longitudinal, daily workflow rather than a static ranking task. It comprises three coupled stages: Profiling (building user scholarly profiles from cold-start evidence), Recommending (ranking daily paper streams under a display budget), and Adapting (updating user state from feedback and modeling interest drift). The authors introduce a benchmark with 24 simulated users, 50 daily paper streams, and over 1.2 million episode-paper records, plus a blind human-evaluation protocol. PaperFlow outperforms five baselines on oracle ranking, behavioral alignment, and human evaluation.

Evaluation and Benchmarking · PaperFlow

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.AI · 4d ago

DRFLOW: Benchmark for Evaluating Agent Workflow Prediction from Heterogeneous Sources

Researchers introduce DRFLOW, a benchmark targeting a gap in deep research (DR) agent evaluation: predicting concrete, personalized action-step workflows rather than generating summaries or reports. The benchmark contains 100 tasks across five domains, grounded in over 3,900 sources, with seven diagnostic metrics covering factual grounding, step recovery, structural ordering, and personalization. A reference agent (DRFA) is also presented, improving over strong baselines by up to 10% average F1 but leaving substantial headroom, indicating workflow prediction remains a hard open problem for DR agents.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DRFLOW-Agent · DRFLOW

5 · arXiv · cs.CL · 25d ago

GraphReview: Scientific Paper Evaluation via LLM-Based Graph Message Passing

GraphReview proposes a graph-based LLM framework that models scientific paper evaluation as review-signal message passing over a semantic paper graph, capturing both intrinsic quality and relational context (synchronic and diachronic links). LLMs estimate node-level quality priors and generate edge-level comparative evidence via pairwise comparisons, while Personalized PageRank integrates these signals for ranking, decision prediction, and review generation. The LLM backbones are trained with reward-induced maximum likelihood objectives, and the system achieves an average improvement of 29.7% over the strongest baseline on decision and ranking metrics, including a 23.7% gain in accuracy and a 57.6% gain in Spearman's ρ.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ECNU-Text-Computing · Personalized PageRank · reward-induced maximum likelihood

5 · arXiv · cs.LG · 24d ago

AMRS: Rollout-Based World Model for Offline Affective Music Recommendation with DPO

LUCID's Affective Music Recommendation System (AMRS) uses a causal transformer world model trained on logged listening data to jointly predict engagement, ratings, and self-reported valence/arousal, enabling offline policy optimization without ethically problematic online experimentation. A recommender policy is initialized via behavior cloning and fine-tuned with Direct Preference Optimization (DPO) against a multi-objective utility function. The system is deployed on LUCID's health-and-wellness platforms serving clinical users (older adults with neurocognitive conditions) and consumer-wellness users across four modes. Under cold-start conditions, DPO improves predicted affective signals over the cloned baseline while maintaining diversity and avoiding distributional collapse.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · behavior cloning · world model · Direct Preference Optimization (DPO)

7 · OpenAI Blog · 1mo ago

PaperBench: OpenAI Benchmark for Evaluating AI Agents on Research Replication

OpenAI introduces PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research papers end-to-end. The benchmark targets a high-complexity capability: reproducing experimental results from frontier AI research, which requires code generation, experimental design, and scientific reasoning. This positions PaperBench as a tool for tracking progress toward autonomous AI research agents.

Evaluation and Benchmarking · AI Safety Research · OpenAI · PaperBench

6 · arXiv · cs.CL · 18d ago

Taiji: Pareto Optimal Policy Optimization for LLM-enhanced recommendation at Kuaishou scale

Researchers from Kuaishou present Taiji, an LLM-as-Enhancer framework for industrial recommender systems that addresses two bottlenecks: generating high-quality chain-of-thought data via reverse-engineered reasoning and rejection sampling during SFT, and balancing semantic vs. ID-based rewards during RL alignment via a new algorithm called Pareto Optimal Policy Optimization (POPO). The system has been deployed on Kuaishou's advertising platform since May 2026, serving over 400 million daily users. The paper contributes both a practical deployment case study and a novel RL alignment technique for the LLM4Rec paradigm.

Enterprise Deployment Patterns · Alignment and RLHF · Taiji · Pareto Optimal Policy Optimization · Kuaishou

3 · arXiv · cs.AI · 13d ago

Twelve practical tips for designing AI-driven HPC workflows

A preprint from arXiv offers twelve practical guidelines for researchers designing AI and foundation-model-driven workflows on HPC clusters. The guide addresses system-level challenges including containerisation, job arrays, feedback loop mechanics, and I/O optimisation for small files. The work targets the transition from deterministic linear pipelines to adaptive, probabilistic computational environments, with particular emphasis on computational biology use cases.

Training Infrastructure · Enterprise Deployment Patterns · Twelve quick tips for designing AI-driven HPC workflows

5 · arXiv · cs.AI · 2d ago

FlowEdit: Lifelong pronunciation adaptation for flow-matching TTS via associative memory

FlowEdit is a new framework enabling lifelong pronunciation correction in frozen flow-matching text-to-speech systems without retraining model weights. Corrections are stored as token-level perturbations in text embedding space within a Modern Hopfield Network, retrieved at inference via soft attention with fuzzy morphological matching. On a curated benchmark of 312 multilingual proper nouns across 18 language families, the method reduces target-word Phoneme Error Rate by 92.7% relative to the zero-shot baseline, with each correction completing in ~15 seconds on a single GPU.

Inference Economics · Enterprise Deployment Patterns · Modern Hopfield Network · FlowEdit · FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

5 · arXiv · cs.CL · 5d ago

MetaSyn benchmark reveals critical screening bottleneck in LLM-based meta-analysis pipelines

Researchers introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, paired with a 140k-article PubMed retrieval corpus, PI/ECO criteria, verified positives, and hard negatives. Benchmarking twelve pipeline configurations — nine RAG variants and a protocol-driven agent — shows that despite 90.9% retrieval recall at K=200, no system recovers more than 52.7% of ground-truth included studies. The core failure is LLMs' inability to reliably distinguish eligible studies from topically similar but criteria-failing distractors. The paper argues that end-to-end scores obscure where pipelines break down and proposes stage-attributed metrics.

Evaluation and Benchmarking · Agent and Tool Ecosystem · PubMed · Nature Portfolio · MetaSyn