TABVERSE benchmark isolates table representation effects across formats in LLMs and VLMs
TABVERSE is a new controlled multimodal benchmark that evaluates LLMs and VLMs on table understanding by holding table content fixed while varying representation format (HTML, Markdown, LaTeX, rendered images). Evaluation across three tasks—Question Answering, Structural Understanding, and Structure Reconstruction—shows that representation choice substantially affects performance, with structured text generally outperforming rendered images and HTML being the most robust text format. The benchmark addresses a gap in existing evaluations where content, format, and modality vary simultaneously, making it impossible to isolate representation effects.
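To make the "same content, different representation" idea concrete, here is a minimal Python sketch that serializes one toy table into the three text formats named above. It is an illustration only, not the benchmark's own tooling: the function names, the toy table, and the exact serialization choices are assumptions, and the rendered-image variant is left out.

```python
from html import escape

# Hypothetical helpers: each renders the SAME header/rows in a different format.

def to_markdown(header, rows):
    """Render the table as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

def to_html(header, rows):
    """Render the table as a compact HTML <table> element."""
    head = "".join(f"<th>{escape(h)}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"

def to_latex(header, rows):
    """Render the table as a LaTeX tabular (cells are assumed to need no escaping)."""
    spec = "l" * len(header)
    lines = [
        f"\\begin{{tabular}}{{{spec}}}",
        " & ".join(header) + " \\\\",
        "\\hline",
    ]
    lines += [" & ".join(row) + " \\\\" for row in rows]
    lines.append("\\end{tabular}")
    return "\n".join(lines)

# Toy content (made up for illustration); it stays fixed across all formats.
header = ["item", "price"]
rows = [["apple", "3"], ["pear", "5"]]

variants = {
    "markdown": to_markdown(header, rows),
    "html": to_html(header, rows),
    "latex": to_latex(header, rows),
}

for name, text in variants.items():
    print(f"--- {name} ---")
    print(text)
```

Because only the serializer changes between variants, any difference in a model's answers on the three strings can be attributed to format rather than content, which is the kind of isolation the summary describes.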
Related events (8)
Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models
This paper proposes Semantic Triplet Restoration (STR), a table serialization protocol that rewrites each cell as an atomic fact <item path, feature path, value> to make header-cell alignments explicit for LLMs, replacing HTML/Markdown representations. The authors also introduce TripletQL, a query-aware router that selects relevant triplets per question. Evaluated on four Chinese and English table-QA benchmarks, STR matches or outperforms HTML-based baselines while reducing input token count. Benefits are most pronounced for smaller models and longer tables, suggesting value under constrained inference budgets.
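As a rough illustration of the triplet idea, the Python sketch below flattens a small hierarchical table into (item path, feature path, value) facts. It is not the paper's implementation: the function name, the " > " path separator, the choice to skip empty cells, and the toy data are all assumptions made for the example.

```python
def table_to_triplets(row_paths, col_paths, values, sep=" > "):
    """Flatten a table into (item path, feature path, value) triplets.

    row_paths: one tuple of nested row headers per row (hypothetical layout)
    col_paths: one tuple of nested column headers per column
    values:    2D list of cell values; None marks an empty cell
    """
    triplets = []
    for i, item_path in enumerate(row_paths):
        for j, feature_path in enumerate(col_paths):
            value = values[i][j]
            if value is None:
                continue  # assumption: empty cells yield no fact
            triplets.append((sep.join(item_path), sep.join(feature_path), value))
    return triplets

# Toy hierarchical table (made up for illustration).
row_paths = [("Europe", "Store A"), ("Europe", "Store B")]
col_paths = [("Sales", "Q1"), ("Sales", "Q2")]
values = [["10", "12"], ["7", None]]

triplets = table_to_triplets(row_paths, col_paths, values)
for item, feature, value in triplets:
    print(f"<{item}, {feature}, {value}>")
```

Each emitted fact carries its full row and column header path, so a model does not have to recover header-cell alignment from table markup.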
Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models
Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.
BenCzechMark: A Benchmark for Evaluating LLM Czech Language Understanding
BenCzechMark is a new evaluation benchmark designed to assess large language model performance on Czech language tasks. The benchmark addresses the gap in non-English language evaluation, providing a structured way to measure LLM capabilities in Czech across multiple task types. Published on Hugging Face, it contributes to the growing ecosystem of multilingual and language-specific benchmarks.
Introducing RTEB: A New Standard for Retrieval Evaluation
Hugging Face introduces RTEB (Retrieval Text Embedding Benchmark), a new benchmark designed to standardize evaluation of retrieval systems and text embeddings. The benchmark aims to address gaps in existing evaluation frameworks by providing more comprehensive and realistic retrieval tasks. This represents an effort to improve how the community measures progress in retrieval-augmented generation and semantic search systems.
TextQuests: How Good are LLMs at Text-Based Video Games?
A Hugging Face blog post introduces TextQuests, an evaluation framework that tests LLMs on text-based video games as a proxy for interactive reasoning, planning, and language understanding. The benchmark assesses how well models can navigate, solve puzzles, and maintain state across multi-turn interactions in classic interactive fiction environments. This type of evaluation targets agentic capabilities including long-horizon planning and grounded language understanding.
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
WikiVQABench is a new human-curated VQA benchmark that requires external knowledge beyond visual perception, constructed by combining Wikipedia images, captions, and Wikidata structured knowledge with LLM-generated question candidates reviewed by human annotators. The benchmark evaluates knowledge-intensive reasoning in vision-language models, covering 15 VLMs ranging from 256M to 90B parameters. Accuracy spans 24.7% to 75.6%, indicating meaningful discrimination across model scales. The dataset and code are publicly released.
MATCHA: Contrastive Semantic Alignment Metric for LLM Evaluation
MATCHA is a new automatic evaluation metric for LLMs that addresses a fundamental flaw in existing metrics: both token-overlap (ROUGE) and embedding-based (BERTScore) metrics routinely assign near-identical scores to semantically contradictory texts. The metric uses a dual-view approach that rewards proximity to a gold reference while penalizing adversarially generated counterfactual contradictions. Evaluated across eight benchmarks spanning QA, summarization, NLI, and semantic similarity tasks, MATCHA outperforms 23 embedding models and achieves 18.38% and 20.82% improvements over ROUGE-L and BERTScore respectively on TruthfulQA. Code and metric are publicly released.
M³Exam: Benchmark for Multimodal Memory in Realistic User-Agent Interactions
Researchers introduce M³Exam, a query-centric multimodal conversational memory benchmark designed to evaluate language agents on realistic user-agent interactions, including cross-modal grounding and implicit information inference. Existing benchmarks are critiqued for assuming sparse visuals and human-human interaction formats. The paper also proposes M³Proctor, a companion memory method that detects query modality bias and retrieves raw visual sources on demand, achieving a 13% accuracy improvement while reducing index-construction time and retrieved tokens by over 70%.

