4 · arXiv · cs.AI (Artificial Intelligence) · 3d ago

X+Slides benchmark evaluates audience-conditioned slide generation from documents

Researchers introduce X+Slides, a benchmark for evaluating LLM-based slide deck generation that incorporates target audience as a first-class evaluation dimension. Built on 113 topics and seven presentation scenes, it uses 8,133 source-grounded probes and four metrics covering audience coverage, domain coverage, efficiency, and correctness. Experiments on DeepPresenter, SlideTailor, and NotebookLM reveal that current systems recover substantial but incomplete audience-essential information, with NotebookLM achieving the highest audience coverage of 0.853 at the tested threshold.

Evaluation and Benchmarking · Google · X+Slides · notebooklm-py · DeepPresenter · SlideTailor

Related guides (2)

Google

Google: The AI Lab That Builds Everything from DNA Models to Your Phone's Assistant

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 11d ago

NCRE-based benchmark reveals frontier LLMs top out at 68.8% on professional Office automation tasks

Researchers introduce an evaluation suite derived from China's National Computer Rank Examination (NCRE), comprising 200 practical tasks across Word, Excel, and PowerPoint scored via 7,118 machine-gradable criteria. Seven frontier LLMs are benchmarked: single-turn models peak at 36.6% Score Rate, while a full agentic system with execution feedback and iterative repair reaches 68.8%, still well below the 95.5% community-reference score. The results demonstrate that fine-grained, long-horizon Office document automation remains a significant unsolved challenge for current LLM and agent systems despite strong code-generation capabilities.

Evaluation and Benchmarking · Agent and Tool Ecosystem · National Computer Rank Examination · Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

5 · arXiv · cs.CL · 26d ago

QUIET: Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation

QUIET (Quality Understanding via Interlocked Evaluation Testing) is a new benchmark designed to evaluate LLM creative generation capability rather than discriminative recognition, addressing limitations of benchmarks like Story Cloze Test and HellaSwag. The benchmark places 10-20 blanks with explicit content constraints and cascade dependencies into complete stories, requiring open-ended generation rather than multiple-choice selection. Scoring uses an information-theoretic automated protocol operationalizing a 'calibrated surprise' framework: score = satisfy * (1 + lambda * surprise), combining constraint satisfaction with a surprise measure, enabling objective automated evaluation without human graders or LLM-as-Judge subjectivity.
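The scoring rule quoted above, score = satisfy * (1 + lambda * surprise), can be illustrated with a minimal sketch. The summary does not say how `satisfy` and `surprise` are computed or what value lambda takes, so the function name, the [0, 1] input ranges, and the default `lam=0.5` below are assumptions for illustration only, not the benchmark's actual protocol.

```python
def calibrated_surprise_score(satisfy: float, surprise: float, lam: float = 0.5) -> float:
    """Combine constraint satisfaction with a surprise bonus.

    Implements score = satisfy * (1 + lam * surprise) as quoted in the summary.
    Assumed (not stated in the source): both inputs lie in [0, 1] and lam >= 0.
    """
    if not 0.0 <= satisfy <= 1.0:
        raise ValueError("satisfy is assumed to lie in [0, 1]")
    if not 0.0 <= surprise <= 1.0:
        raise ValueError("surprise is assumed to lie in [0, 1]")
    if lam < 0.0:
        raise ValueError("lam is assumed to be non-negative")
    return satisfy * (1.0 + lam * surprise)


# A fill that violates its constraints scores zero however surprising it is.
print(calibrated_surprise_score(satisfy=0.0, surprise=0.9))
# A fully satisfying but unsurprising fill keeps its base score.
print(calibrated_surprise_score(satisfy=1.0, surprise=0.0))
# Surprise only scales up a score that satisfaction has already earned.
print(calibrated_surprise_score(satisfy=1.0, surprise=1.0, lam=0.5))
```

Because the form is multiplicative, surprise acts as a bonus on top of constraint satisfaction and cannot compensate for a fill that fails its constraints.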

Frontier Model Releases · Evaluation and Benchmarking · Zou & Xu · HellaSwag · QUIET

4 · arXiv · cs.CL · 13d ago

DEFINED: Data-efficient framework for fine-grained creativity assessment in debate using LLMs

DEFINED is a computational framework for automated creativity assessment in debate scenarios, operationalizing creativity through an eight-dimensional hierarchical metric system implemented via a pretrained autoregressive language model with a hierarchical scoring head. The system addresses data scarcity through constrained data augmentation and mixed-granularity training from limited expert-annotated data. It outperforms prompt-based LLM evaluators and existing debate scoring methods on authentic competition data. The work is relevant to AI evaluation methodology and the broader question of whether LLMs can reliably assess complex human cognitive outputs.

Evaluation and Benchmarking · DEFINED

6 · arXiv · cs.LG · 23d ago

SoundnessBench: Benchmarking LLMs as Evaluators of ML Research Proposal Viability

SoundnessBench is a new benchmark of 1,099 machine-learning research proposals derived from ICLR submissions, labeled with reviewer soundness scores, designed to test whether LLMs can reliably distinguish methodologically sound research ideas from unsound ones. Evaluated across 12 frontier LLMs, the benchmark reveals a pervasive optimism bias: models systematically rate low-soundness proposals as sound under standard prompting, with aggressive prompting shifting errors from false positives to false negatives rather than eliminating them. Controls for data contamination, surface features, and human audit quality suggest the bias is not attributable to a single confounder. The authors conclude that current LLMs are not yet reliable as standalone first-gate evaluators of scientific rigor, a critical bottleneck for autonomous AI research agents.

Evaluation and Benchmarking · AI Safety Research · ICLR · optimism bias · SoundnessBench

4 · arXiv · cs.CL · 12d ago

TABVERSE benchmark isolates table representation effects across formats in LLMs and VLMs

TABVERSE is a new controlled multimodal benchmark that evaluates LLMs and VLMs on table understanding by holding table content fixed while varying representation format (HTML, Markdown, LaTeX, rendered images). Evaluation across three tasks—Question Answering, Structural Understanding, and Structure Reconstruction—shows that representation choice substantially affects performance, with structured text generally outperforming rendered images and HTML being the most robust text format. The benchmark addresses a gap in existing evaluations where content, format, and modality vary simultaneously, making it impossible to isolate representation effects.

Evaluation and Benchmarking · Multimodal Progress · TABVERSE

5 · arXiv · cs.AI · 1mo ago

SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

SkillGenBench is a new benchmark designed to evaluate the ability of LLM agents to generate correct, reusable, and executable skills from raw repositories and documents, rather than merely using pre-provided skills. It covers two generation regimes (task-conditioned and task-agnostic) and two procedural sources (repository-grounded and document-grounded), with standardized execution-based evaluation protocols. Experiments across multiple skill-generation methods reveal substantial performance variation and distinct failure modes depending on source type. The benchmark aims to establish skill generation as an independent research problem within agent systems.

Evaluation and Benchmarking · Agent and Tool Ecosystem · task-conditioned generation · task-agnostic generation · SkillGenBench

5 · arXiv · cs.CL · 13d ago

M³Exam: Benchmark for Multimodal Memory in Realistic User-Agent Interactions

Researchers introduce M³Exam, a query-centric multimodal conversational memory benchmark designed to evaluate language agents on realistic user-agent interactions, including cross-modal grounding and implicit information inference. The authors critique existing benchmarks for assuming sparse visuals and human-human interaction formats. The paper also proposes M³Proctor, a companion memory method that detects query modality bias and retrieves raw visual sources on demand, achieving a 13% accuracy improvement while reducing index-construction time and retrieved tokens by over 70%.

Evaluation and Benchmarking · Agent and Tool Ecosystem · M³Exam · M³Proctor

5 · arXiv · cs.CL · 1mo ago

Text Analytics Evaluation Framework: Benchmarking LLMs on Social Media NLP Tasks

Researchers introduce a 470-question evaluation framework to assess LLM performance on aggregated social media text, applied to Twitter datasets across sentiment analysis, hate speech detection, and emotion recognition. Results show performance degrades substantially as input scale exceeds 500 instances, particularly for open-weights models on numerical tasks. Multi-label and target-dependent scenarios also show notable performance drops, and task complexity progressively erodes accuracy from basic semantic identification to comparison and counting operations. The findings point to architectural bottlenecks in current LLMs for rigorous quantitative analysis over large text collections.

Long Context Evolution · Evaluation and Benchmarking · Emotion Recognition · Text Analytics Evaluation Framework · X (Twitter)