3 · arXiv · cs.CL (Computation and Language) · 2d ago

TalentCLEF 2026 overview: NLP benchmarks for job-person matching and skill classification

The second edition of TalentCLEF, a shared evaluation challenge at CLEF 2026, introduced two tasks: contextualized job-person matching (English and Spanish) and job-skill matching with skill type classification. The challenge attracted 113 registered teams and over 400 submissions, indicating significant community interest in NLP benchmarks for Human Capital Management. The paper describes datasets, evaluation settings, and results across participating teams.

Evaluation and Benchmarking · CLEF 2026 · TalentCLEF 2026

Related guides (1)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Related events (8)

5 · arXiv · cs.AI · May 19, 2026

SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

SkillGenBench is a new benchmark designed to evaluate the ability of LLM agents to generate correct, reusable, and executable skills from raw repositories and documents, rather than merely using pre-provided skills. It covers two generation regimes (task-conditioned and task-agnostic) and two procedural sources (repository-grounded and document-grounded), with standardized execution-based evaluation protocols. Experiments across multiple skill-generation methods reveal substantial performance variation and distinct failure modes depending on source type. The benchmark aims to establish skill generation as an independent research problem within agent systems.

Evaluation and Benchmarking · Agent and Tool Ecosystem · task-conditioned generation · task-agnostic generation · SkillGenBench

3 · arXiv · cs.LG · Jun 12, 2026

SkMTEB: First comprehensive MTEB-style text embedding benchmark for Slovak with adapted E5 models

Researchers introduce SkMTEB, the first MTEB-style embedding benchmark for Slovak, covering 31 datasets across 7 task types, roughly 4× the existing multilingual benchmark coverage for the language. Evaluation of 31 embedding models shows that large instruction-tuned multilingual models outperform Slovak-specific NLU models on embedding tasks. The authors also release e5-sk-small (45M) and e5-sk-large (365M), derived from Multilingual E5 via vocabulary trimming and fine-tuning, which achieve performance competitive with proprietary APIs at up to a 62% reduction in model size.

Evaluation and Benchmarking · Open Weights Progress · MTEB · SkMTEB · e5_large

4 · arXiv · cs.CL · Jun 25, 2026

HIPE-2026 evaluation campaign: person-place relation extraction from multilingual historical texts

HIPE-2026 is the third edition of a shared-task evaluation series, shifting focus from named entity recognition to temporally grounded relation extraction across French, German, and English historical documents. Seventeen teams submitted over 40 runs, spanning large language models to lightweight classifiers, evaluated on predictive accuracy, computational efficiency, and cross-domain generalization. The campaign surfaces trade-offs between accuracy and robustness when processing noisy OCR text from 19th–20th century newspapers and early modern literary sources. Results provide a benchmark snapshot of the current state of historical relation extraction for cultural heritage applications.

Evaluation and Benchmarking · HIPE-2022 · HIPE-2020 · HIPE-2026

4 · arXiv · cs.CL · Jun 24, 2026

ParaPairAudioBench: Pairwise benchmark reveals large gaps in LALM paralinguistic judgment

Researchers introduce ParaPairAudioBench, a pairwise audio benchmark of 5,175 audio pairs spanning five paralinguistic dimensions (Style, Rate, Emphasis, Age, Gender) designed to evaluate Large Audio-Language Models as judges. Experiments show current LALMs lag human judgment by 32 percentage points on average and exhibit severe calibration failures, especially in ambiguous 'Tie' cases. The benchmark includes same-transcript and cross-transcript conditions to disentangle lexical from acoustic reliance, enabling more rigorous assessment of LALM reliability for speech evaluation.

Evaluation and Benchmarking · Multimodal Progress · ParaPairAudioBench

5 · arXiv · cs.CL · Jun 11, 2026

Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks

Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to fairly compare heterogeneous agent harnesses ("claws") on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. Key findings show adapter design dominates raw model choice: with the same GLM 5.1 backbone, a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter, a gap of 54.3 percentage points. By comparison, harness choice and model choice each shift Pass@1 by roughly 27–29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.

Evaluation and Benchmarking · Inference Economics · SWE-Bench Multilingual · OpenClaw · SWE-Bench Verified

6 · OpenAI Blog · May 20, 2026

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

OpenAI introduces MLE-bench, a benchmark designed to measure AI agent performance on machine learning engineering tasks. The benchmark draws from Kaggle competitions to evaluate agents on realistic ML engineering workflows. Initial results show that current agents, including those powered by o1-preview, achieve competitive performance on a subset of tasks but fall well short of top human competitors. The benchmark is intended to track progress in agentic ML capabilities over time.

Frontier Model Releases · Evaluation and Benchmarking · Kaggle · o1-preview · MLE-bench

4 · arXiv · cs.CL · May 21, 2026

Fifth Shared Task on Multilingual Coreference Resolution: Long-Range Entities and LLM Participation

The fifth CODI-CRAC shared task on multilingual coreference resolution expanded its scope with five new datasets and two additional languages, using CorefUD 1.4, which covers 27 datasets across 19 languages. The 2026 edition emphasized long-range coreference chains that span many words and sentences. Ten systems participated, including four LLM-based approaches; traditional systems still led, but the LLM-based systems showed notable potential, suggesting that competitive parity may be near.

Long Context Evolution · Evaluation and Benchmarking · CODI-CRAC 2026 · CorefUD · Multilingual Coreference Resolution Shared Task

5 · arXiv · cs.CL · Jun 25, 2026

SpeechEQ benchmark evaluates emotional intelligence in speech-language models across 15 EQ subscales

Researchers introduce SpeechEQ, a benchmark framework for evaluating sociolinguistic and emotional reasoning in Speech-Language Models (SLMs), comprising 2,265 multi-turn dialogues across 15 Emotional Quotient subscales grounded in EQ-i 2.0 theory. The benchmark reveals three systematic failure modes in current multimodal models: over-reliance on text (a modality shortcut), an alignment-induced safety trap, and contextual amnesia across turns. End-to-end architectures outperform cascaded systems, but all evaluated models fall short of genuine emotional awareness. The dataset and demo are publicly released on Hugging Face.

Evaluation and Benchmarking · Multimodal Progress · EQ-i 2.0 · SpeechEQ
