Almanac
benchmark

MT-Bench

benchmarkactivemt-bench-35b1aa90·12 events·first seen 1mo ago

Aliases: MT-Bench, MM-MT-Bench

Co-occurring entities

More like this (12)

Recent events (12)

6Qwen Research·1mo ago·source ↗

Qwen-Max-0428: Alibaba's Largest Instruction-Tuned Model Released

Alibaba's Qwen team has released Qwen-Max-0428, a new instruction-tuned model larger than the previously open-sourced Qwen1.5-110B-Chat. The model has entered Chatbot Arena and reached the top-10 on the leaderboard, while also outperforming Qwen1.5-110B-Chat on MT-Bench. The model is available via API, though it does not appear to be open-weights at this stage.

9Mistral Ai News·15d ago·source ↗

Mixtral 8x7B: Mistral AI Releases Sparse Mixture-of-Experts Open-Weight Model

Mistral AI has released Mixtral 8x7B, a sparse mixture-of-experts (SMoE) model with 46.7B total parameters but only 12.9B active parameters per token, enabling inference speed and cost equivalent to a 12.9B model. Licensed under Apache 2.0, Mixtral outperforms Llama 2 70B on most benchmarks and matches or exceeds GPT-3.5, with support for 32k context, five European languages, and strong code generation. An instruction-tuned variant (Mixtral 8x7B Instruct) achieves 8.3 on MT-Bench, claimed best among open-source models at release. The model is deployed behind Mistral's mistral-small API endpoint and supported via vLLM with Megablocks CUDA kernels.

5arXiv · cs.CL·6d ago·source ↗

Information-theoretic metric for measuring semantic progress in multi-turn dialogue

A new arXiv preprint formalizes 'semantic progress' in multi-turn dialogue as question-conditioned uncertainty reduction and introduces an information-theoretic metric approximated in embedding space using a Gaussian formulation with closed-form updates. The metric has desirable theoretical properties (monotonicity, additive decomposition, diminishing returns) and requires no autoregressive inference at evaluation time, making it reproducible and lightweight. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show competitive or improved agreement with human judgments compared to several LLM-as-a-judge baselines. The approach works with lightweight embedding models under CPU-only execution.

7arXiv · cs.CL·28d ago·source ↗

General Preference Reinforcement Learning (GPRL): Bridging Online RL and Preference Optimization for Open-Ended Tasks

GPRL proposes a new alignment framework that replaces scalar reward models with a General Preference Model (GPM) embedding responses into k skew-symmetric subspaces to capture multi-dimensional, intransitivity-aware preferences. The method computes per-dimension group-relative advantages, normalizes across axes, and uses a closed-loop drift monitor to detect and correct single-axis reward hacking during training. Starting from Llama-3-8B-Instruct, GPRL achieves a 56.51% length-controlled win rate on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench. The work directly addresses the gap between verifiable-reward online RL (strong on math/code) and preference optimization (strong on open-ended tasks).

7Qwen Research·1mo ago·source ↗

Qwen1.5-110B: Alibaba Releases First 100B+ Model in Qwen1.5 Series

Alibaba's Qwen team released Qwen1.5-110B, their first open-weights model exceeding 100 billion parameters. The model claims comparable performance to Meta's Llama-3-70B on base model benchmarks, with strong results on MT-Bench and AlpacaEval 2 chat evaluations. The release follows a wave of large open-source models exceeding 100B parameters from various organizations.

7Mistral Ai News·1mo ago·source ↗

Mistral AI Launches La Plateforme: First API Endpoints in Early Access

Mistral AI opened beta access to its first developer platform, La Plateforme, offering three generative text endpoints (mistral-tiny, mistral-small, mistral-medium) and an embedding endpoint. Mistral-tiny serves Mistral 7B Instruct v0.2, mistral-small serves Mixtral 8x7B, and mistral-medium serves an unreleased prototype model scoring 8.6 on MT-Bench. The platform also introduces Mistral-embed with a 1024-dimension embedding model achieving 55.26 on MTEB. The API follows OpenAI-compatible chat interface specifications and is ramping toward general availability.

7Mistral Ai News·1mo ago·source ↗

Pixtral Large: Mistral AI's 124B Open-Weights Multimodal Model

Mistral AI released Pixtral Large, a 124B open-weights multimodal model built on Mistral Large 2, featuring a 1B parameter vision encoder and 128K context window supporting at least 30 high-resolution images. The model claims state-of-the-art results on MathVista, DocVQA, and ChartQA, outperforming GPT-4o and Gemini-1.5 Pro on several benchmarks, and leads the LMSys Vision Leaderboard among open-weights models by ~50 ELO points. Simultaneously, Mistral updated its text model to Mistral Large 24.11 with improvements in long-context understanding, function calling, and RAG/agentic workflows. Note: the model has since been deprecated and replaced by newer Mistral vision models.

6arXiv · cs.CL·26d ago·source ↗

ChunkFT: Memory-Efficient Full Fine-Tuning via Byte-Streamed Chunk Optimization

ChunkFT is a fine-tuning framework that reformulates full-parameter optimization around a dynamically activated working set of sub-tensors, enabling gradient computation without dense gradient materialization. It achieves full-parameter fine-tuning of a 7B model in 13.72GB GPU memory on a single RTX 4090, and scales Llama 3-70B fine-tuning to 2×H800 GPUs. Downstream evaluations on language understanding, math reasoning, and MT-Bench show ChunkFT matches or exceeds full-parameter fine-tuning quality while outperforming existing memory-efficient baselines such as LoRA-class methods. A theoretical convergence analysis in the deterministic setting is also provided.

8Mistral Ai News·15d ago·source ↗

Mistral 7B: Open-Weights 7B Model Outperforming Llama 2 13B

Mistral AI released Mistral 7B, a 7.3B parameter language model under the Apache 2.0 license that outperforms Llama 2 13B across all evaluated benchmarks and approaches Llama 34B on many tasks. The model employs Grouped-Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle longer sequences at reduced cost, achieving roughly 2x speed improvement at 16k sequence length. A fine-tuned chat variant, Mistral 7B Instruct, outperforms all 7B chat models on MT-Bench and is competitive with 13B-class chat models. The release includes deployment support for AWS, GCP, Azure, HuggingFace, and local use via vLLM.

5Mistral Ai News·1mo ago·source ↗

Pixtral 12B: Mistral AI's First Multimodal Model (Now Deprecated)

Mistral AI released Pixtral 12B in September 2024 as their first natively multimodal model, combining a new 400M parameter vision encoder trained from scratch with a 12B multimodal decoder based on Mistral Nemo. The model supports variable image sizes and aspect ratios, a 128K token context window for multiple images, and achieved 52.5% on MMMU while maintaining strong text-only benchmark performance. The model is now deprecated and has been replaced by newer vision and multimodal models from Mistral. It was released under Apache 2.0 license.

8Mistral Ai News·15d ago·source ↗

Mistral Large 2 (123B): New Frontier Model with 128k Context, Multilingual and Code Capabilities

Mistral AI releases Mistral Large 2, a 123-billion-parameter model with a 128k context window, supporting 80+ coding languages and over a dozen natural languages. The model claims competitive performance with GPT-4o, Claude 3 Opus, and Llama 3 405B on code generation, reasoning, and multilingual benchmarks, while targeting cost-efficient single-node inference. Weights are available under a Mistral Research License for non-commercial use, with a commercial license required for self-deployment. The model is accessible via Mistral's la Plateforme API (mistral-large-2407), HuggingFace, and Google Cloud Vertex AI.

7Mistral Ai News·15d ago·source ↗

Mistral Small 3.1: Multimodal, 128k Context, Apache 2.0 Open-Weight Model

Mistral AI releases Mistral Small 3.1, a ~24B parameter model with multimodal understanding, 128k token context window, and claimed best-in-class performance among small models, outperforming Gemma 3 and GPT-4o Mini on text, multimodal, and multilingual benchmarks. The model runs on a single RTX 4090 or 32GB RAM Mac at 150 tokens/second and is released under Apache 2.0 license with both base and instruct checkpoints. It is available on HuggingFace, Mistral's La Plateforme API, and Google Cloud Vertex AI, with NVIDIA NIM and Azure AI Foundry support coming soon. The release targets enterprise and on-device use cases including document verification, agentic workflows, and domain fine-tuning.