NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models
TII UAE and collaborators are announcing the Early Training Evaluation of Language Models (E2LM) competition at NeurIPS 2025. The competition focuses on predicting or evaluating language model capabilities from early training checkpoints, addressing the challenge of forecasting final model performance without completing full training runs. The competition is relevant to research on evaluation methodology and training efficiency.
Related events (8)
3LM: A Benchmark for Arabic LLMs in STEM and Code
TII UAE has released 3LM, a benchmark designed to evaluate large language models on Arabic-language STEM and coding tasks. The benchmark addresses a gap in multilingual evaluation infrastructure, where Arabic has been underrepresented relative to English and other high-resource languages. It targets both general-purpose and Arabic-specialized LLMs to assess their capabilities in technical domains.
Very Large Language Models and How to Evaluate Them
This Hugging Face blog post from October 2022 discusses approaches to zero-shot evaluation of large language models hosted on the Hub. It covers methodologies for benchmarking LLMs without task-specific fine-tuning, addressing the practical challenges of evaluating very large models at scale. The post situates evaluation tooling within the broader ecosystem of open model hosting and assessment.
QIMMA: A Quality-First Arabic LLM Leaderboard
TII UAE (Technology Innovation Institute) has launched QIMMA, a leaderboard specifically designed to evaluate large language models on Arabic-language tasks with a focus on quality-first assessment. The leaderboard aims to address gaps in Arabic NLP evaluation by providing standardized benchmarks tailored to Arabic linguistic characteristics. This represents a dedicated infrastructure effort for tracking LLM progress in Arabic, a language historically underserved by evaluation frameworks.
CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models
CyberSecEval 2 is a benchmark framework designed to evaluate both the cybersecurity risks and capabilities of large language models. The framework appears to be hosted or featured on Hugging Face's leaderboard infrastructure, extending prior cybersecurity evaluation work. It assesses LLMs across multiple dimensions of security-relevant behavior, including potential for misuse and defensive capabilities.
Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study
This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.
Alyah: Benchmark for Evaluating Emirati Dialect Capabilities in Arabic LLMs
TII UAE introduces Alyah, a benchmark designed to evaluate large language models on Emirati Arabic dialect understanding and generation. The work addresses a gap in Arabic NLP evaluation, where most benchmarks focus on Modern Standard Arabic and neglect regional dialects. The benchmark aims to provide robust assessment of LLM capabilities specific to Emirati linguistic and cultural context.
Introducing the Open Leaderboard for Japanese LLMs
Hugging Face has launched an open leaderboard specifically for evaluating large language models on Japanese-language tasks. The leaderboard aims to provide standardized benchmarking for Japanese LLMs, filling a gap in multilingual evaluation infrastructure. This initiative supports the growing ecosystem of Japanese-language AI development and open evaluation practices.
Letting Large Models Debate: The First Multilingual LLM Debate Competition
Hugging Face introduces a multilingual LLM debate competition where large language models compete against each other in structured debates. The initiative explores multi-agent interaction, argumentation quality, and cross-lingual reasoning capabilities. This represents an evaluation framework for assessing LLM persuasion, coherence, and multilingual performance in adversarial settings.


