4 · arXiv cs.CL (Computation and Language) · 11d ago

TABVERSE benchmark isolates table representation effects across formats in LLMs and VLMs

TABVERSE is a new controlled multimodal benchmark that evaluates LLMs and VLMs on table understanding by holding table content fixed while varying only the representation format (HTML, Markdown, LaTeX, rendered images). Evaluation across three tasks (Question Answering, Structural Understanding, and Structure Reconstruction) shows that representation choice substantially affects performance: structured text generally outperforms rendered images, and HTML is the most robust text format. The benchmark addresses a gap in existing evaluations, where content, format, and modality vary simultaneously and representation effects therefore cannot be isolated.

Evaluation and Benchmarking · Multimodal Progress · TABVERSE
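The controlled setup in the TABVERSE summary (same table content, different text format) can be illustrated with a small sketch. The helper names (`to_markdown`, `to_html`, `to_latex`, `render_all`) and the toy table are hypothetical and are not taken from the TABVERSE release; escaping of special characters in Markdown and LaTeX cells is left out for brevity, and the image modality is not covered.

```python
from html import escape


def to_markdown(header, rows):
    """Serialize a flat table as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_html(header, rows):
    """Serialize the same table as a minimal HTML <table>."""
    head = "".join(f"<th>{escape(h)}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def to_latex(header, rows):
    """Serialize the same table as a LaTeX tabular environment."""
    spec = "l" * len(header)  # one left-aligned column per header cell
    lines = [f"\\begin{{tabular}}{{{spec}}}", " & ".join(header) + r" \\"]
    lines += [" & ".join(row) + r" \\" for row in rows]
    lines.append("\\end{tabular}")
    return "\n".join(lines)


def render_all(header, rows):
    """Return every text rendering of one fixed table, keyed by format."""
    return {
        "markdown": to_markdown(header, rows),
        "html": to_html(header, rows),
        "latex": to_latex(header, rows),
    }


# Toy table used only for illustration.
header = ["Item", "Count"]
rows = [["apples", "3"], ["pears", "5"]]
for name, text in render_all(header, rows).items():
    print(f"--- {name} ---\n{text}")
```

Because all renderings come from the same `header` and `rows`, any difference in a model's answers across the three prompts can be attributed to format rather than content.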

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 19d ago

Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models

This paper proposes Semantic Triplet Restoration (STR), a table serialization protocol that rewrites each cell as an atomic fact <item path, feature path, value> to make header-cell alignments explicit for LLMs, replacing HTML/Markdown representations. The authors also introduce TripletQL, a query-aware router that selects relevant triplets per question. Evaluated on four Chinese and English table-QA benchmarks, STR matches or outperforms HTML-based baselines while reducing input token count. Benefits are most pronounced for smaller models and longer tables, suggesting value under constrained inference budgets.

Inference Economics · Agent and Tool Ecosystem · Table Question Answering · Semantic Triplet Restoration · TripletQL
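The triplet idea in the STR summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the "item path" corresponds to a cell's row-header hierarchy and the "feature path" to its column-header hierarchy, and the function names, the ` / ` path separator, and the toy table are all hypothetical. The TripletQL router is not modeled here.

```python
def table_to_triplets(row_paths, col_paths, values):
    """Flatten a hierarchical table into (item path, feature path, value) facts.

    row_paths: one tuple of header labels per row (outermost level first)
    col_paths: one tuple of header labels per column (outermost level first)
    values:    list of rows, each a list of cell strings
    """
    triplets = []
    for item_path, row in zip(row_paths, values):
        for feature_path, value in zip(col_paths, row):
            triplets.append(
                (" / ".join(item_path), " / ".join(feature_path), value)
            )
    return triplets


def format_triplet(triplet):
    """Render one fact as text; the angle-bracket layout is an assumption."""
    item, feature, value = triplet
    return f"<{item}, {feature}, {value}>"


# Toy two-level table used only for illustration.
row_paths = [("Fruit", "apples"), ("Fruit", "pears")]
col_paths = [("Stock", "Mon"), ("Stock", "Tue")]
values = [["3", "4"], ["5", "6"]]

for t in table_to_triplets(row_paths, col_paths, values):
    print(format_triplet(t))
```

Each emitted line carries its full header context, so a model does not need to re-align a cell with its headers from the table layout.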

5 · arXiv · cs.LG · 2d ago

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.

Evaluation and Benchmarking · Multimodal Progress · Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models · Act2Answer

4 · Hugging Face Blog · 1mo ago

BenCzechMark: A Benchmark for Evaluating LLM Czech Language Understanding

BenCzechMark is a new evaluation benchmark designed to assess large language model performance on Czech language tasks. The benchmark addresses the gap in non-English language evaluation, providing a structured way to measure LLM capabilities in Czech across multiple task types. Published on Hugging Face, it contributes to the growing ecosystem of multilingual and language-specific benchmarks.

Evaluation and Benchmarking · Hugging Face · BenCzechMark

5 · Hugging Face Blog · 1mo ago

Introducing RTEB: A New Standard for Retrieval Evaluation

Hugging Face introduces RTEB (Retrieval Embedding Benchmark), a new benchmark designed to standardize evaluation of retrieval systems and text embeddings. The benchmark aims to address gaps in existing evaluation frameworks by providing more comprehensive and realistic retrieval tasks. The goal is to improve how the community measures progress in retrieval-augmented generation and semantic search systems.

Evaluation and Benchmarking · Agent and Tool Ecosystem · MTEB · RTEB · Hugging Face

4 · Hugging Face Blog · 1mo ago

TextQuests: How Good are LLMs at Text-Based Video Games?

A Hugging Face blog post introduces TextQuests, an evaluation framework that tests LLMs on text-based video games as a proxy for interactive reasoning, planning, and language understanding. The benchmark assesses how well models can navigate, solve puzzles, and maintain state across multi-turn interactions in classic interactive fiction environments. This type of evaluation targets agentic capabilities including long-horizon planning and grounded language understanding.

Evaluation and Benchmarking · Agent and Tool Ecosystem · TextQuests · Hugging Face

5 · arXiv · cs.AI · 1mo ago

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

WikiVQABench is a new human-curated VQA benchmark that requires external knowledge beyond visual perception. It is constructed by combining Wikipedia images, captions, and Wikidata structured knowledge with LLM-generated question candidates that are reviewed by human annotators. The benchmark targets knowledge-intensive reasoning in vision-language models and is used to evaluate 15 VLMs ranging from 256M to 90B parameters. Accuracy spans 24.7% to 75.6%, indicating meaningful discrimination across model scales. The dataset and code are publicly released.

Evaluation and Benchmarking · Multimodal Progress · large language models · Wikidata · WikiVQABench

6 · arXiv · cs.CL · 24d ago

MATCHA: Contrastive Semantic Alignment Metric for LLM Evaluation

MATCHA is a new automatic evaluation metric for LLMs that addresses a fundamental flaw in existing metrics: both token-overlap (ROUGE) and embedding-based (BERTScore) metrics routinely assign near-identical scores to semantically contradictory texts. The metric uses a dual-view approach that rewards proximity to a gold reference while penalizing adversarially generated counterfactual contradictions. Evaluated across eight benchmarks spanning QA, summarization, NLI, and semantic similarity tasks, MATCHA outperforms 23 embedding models and achieves 18.38% and 20.82% improvements over ROUGE-L and BERTScore respectively on TruthfulQA. Code and metric are publicly released.

Evaluation and Benchmarking · AI Safety Research · TruthfulQA · ROUGE-L · Siran Li
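The dual-view idea in the MATCHA summary (reward closeness to the gold reference, penalize closeness to a contradiction) can be sketched with a toy scoring rule. The formula below is an illustrative assumption and is not MATCHA's published definition; `contrastive_score`, the `penalty` weight, and the hand-written vectors standing in for text embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_score(candidate, reference, contradiction, penalty=1.0):
    """Assumed rule: similarity to the reference minus weighted similarity
    to a counterfactual contradiction of that reference."""
    return cosine(candidate, reference) - penalty * cosine(candidate, contradiction)


# Toy 2-D "embeddings" chosen by hand, not produced by any model.
reference = [1.0, 0.0]
contradiction = [0.0, 1.0]

print(contrastive_score([1.0, 0.0], reference, contradiction))  # agrees with the reference
print(contrastive_score([0.0, 1.0], reference, contradiction))  # agrees with the contradiction
```

A rule of this shape only separates the two cases when the underlying representation places a reference and its contradiction apart, which is the weakness of overlap- and embedding-based metrics that the summary highlights.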

5 · arXiv · cs.CL · 12d ago

M³Exam: Benchmark for Multimodal Memory in Realistic User-Agent Interactions

Researchers introduce M³Exam, a query-centric multimodal conversational memory benchmark designed to evaluate language agents on realistic user-agent interactions, including cross-modal grounding and implicit information inference. The authors critique existing benchmarks for assuming sparse visuals and human-human interaction formats. The paper also proposes M³Proctor, a companion memory method that detects query modality bias and retrieves raw visual sources on demand, achieving a 13% accuracy improvement while reducing index-construction time and retrieved tokens by over 70%.

Evaluation and Benchmarking · Agent and Tool Ecosystem · M³Exam · M³Proctor