Arena-Hard

benchmark · active · arena-hard-4066a011 · 2 events · first seen May 19, 2026

Aliases: Arena-Hard, Arena Hard

More like this (12)

ArenaHard · Game Arena · Arena Search · NP-Hard · Judge Arena · BigCodeArena · Arena Code · SWE-Bench-Pro-Hard-AA · TTS Arena · Chatbot Arena · STT-Arena · HypoArena

Recent events (2)

8 · Mistral AI News · Jun 1, 2026

Mistral Large 2 (123B): New Frontier Model with 128k Context, Multilingual and Code Capabilities

Mistral AI releases Mistral Large 2, a 123-billion-parameter model with a 128k context window and support for 80+ programming languages and more than a dozen natural languages. Mistral claims performance competitive with GPT-4o, Claude 3 Opus, and Llama 3 405B on code generation, reasoning, and multilingual benchmarks, while targeting cost-efficient single-node inference. Weights are available under the Mistral Research License for non-commercial use; self-deployment for commercial purposes requires a separate commercial license. The model is accessible via Mistral's la Plateforme API (mistral-large-2407), Hugging Face, and Google Cloud Vertex AI.

Long Context Evolution · Frontier Model Releases · Mistral AI · MT-Bench · Claude Opus 4.6 · +14 more

7 · arXiv · cs.CL · May 19, 2026

General Preference Reinforcement Learning (GPRL): Bridging Online RL and Preference Optimization for Open-Ended Tasks

GPRL proposes a new alignment framework that replaces scalar reward models with a General Preference Model (GPM) embedding responses into k skew-symmetric subspaces to capture multi-dimensional, intransitivity-aware preferences. The method computes per-dimension group-relative advantages, normalizes across axes, and uses a closed-loop drift monitor to detect and correct single-axis reward hacking during training. Starting from Llama-3-8B-Instruct, GPRL achieves a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench. The work directly addresses the gap between verifiable-reward online RL (strong on math/code) and preference optimization (strong on open-ended tasks).

Frontier Model Releases · Evaluation and Benchmarking · WildBench · MT-Bench · General Preference Reinforcement Learning · +7 more