3arXiv cs.AI (Artificial Intelligence)·18h ago

LLM-based pipeline for research entity extraction from UKRI grant proposals outperforms bespoke taxonomy approach

A UKRI-funded metascience project compares GPT-4o, Mistral, and a bespoke algorithm (DSIT-Taxonomies) for extracting and classifying research entities from funding proposal abstracts. Using a three-stage pipeline with Mistral as the primary extractor mapped against the OpenAlex Topics taxonomy, the LLM-based approach achieved 90.5% topic classification accuracy versus 71.4% for the DSIT-Taxonomies pipeline across 42 proposals. The authors conclude Mistral offers a practical, secure solution for large-scale analysis of sensitive grant data, with implications for identifying emerging research areas to guide public investment.

Evaluation and Benchmarking Enterprise Deployment Patterns UKRI GPT-4o Mistral OpenAlex DSIT-Taxonomies

Related guides (3)

GPT-4o

GPT-4o: OpenAI's All-in-One Multimodal Model

Read asBeginner

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Enterprise Deployment PatternsTopic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Read asBeginner In-depth

Related events (8)

4Mistral Ai News·1mo ago·source ↗

Mistral AI: Using LLM-as-a-Judge with Structured Outputs for RAG Evaluation

Mistral AI published a technical guide on evaluating Retrieval-Augmented Generation (RAG) systems using the 'LLM as a Judge' paradigm combined with their structured outputs API feature. The approach implements the RAG Triad framework—context relevance, groundedness, and answer relevance—using Pydantic schemas to enforce machine-readable evaluation outputs. Mistral models serve as both the generator and judge components, enabling scalable automated evaluation without human annotators.

Evaluation and Benchmarking Enterprise Deployment Patterns Mistral AI RAG Triad Mistral Structured Outputs API +4 more

8Mistral Ai News·29d ago·source ↗

Mistral AI Releases Mistral Large, Claims Second-Best API Model After GPT-4

Mistral AI has released Mistral Large, its most capable model to date, claiming second place among API-accessible models behind GPT-4 on standard benchmarks including MMLU, HellaSwag, and coding/math evals. The model features a 32K context window, native fluency in five European languages, function calling, and constrained output mode. Simultaneously, Mistral is launching a new Mistral Small optimized for latency, restructuring its endpoint lineup, and announcing Microsoft Azure as its first major distribution partner. This marks Mistral's first significant commercial partnership and expansion beyond its own infrastructure.

Long Context Evolution Frontier Model Releases Azure AI Studio Mistral AI Llama 2 70B +13 more

6arXiv · cs.AI·18d ago·source ↗

LLMs automate reproducibility assessments in social and behavioral sciences, outperforming human reanalysts

A preprint from arXiv demonstrates that an LLM pipeline can automate reproducibility assessments of published social and behavioral science studies, recovering original effect sizes in 41% of cases (vs. 34% for human reanalysts) and reaching the same qualitative conclusion in 96% of cases (vs. 74% for humans). The study evaluated 76 published studies with predefined claims. The results suggest LLMs could serve as a scalable tool for systematic auditing of empirical research, addressing the resource-intensive nature of traditional reproducibility efforts.

Evaluation and Benchmarking Agent and Tool Ecosystem Automated reproducibility assessments in the social and behavioral sciences using large language models

7Mistral Ai News·29d ago·source ↗

Mistral AI Releases Devstral: Apache 2.0 Agentic Coding Model with SWE-Bench SOTA

Mistral AI, in collaboration with All Hands AI, releases Devstral, an agentic LLM specialized for software engineering tasks under the Apache 2.0 license. The model achieves 46.8% on SWE-Bench Verified, surpassing prior open-source state-of-the-art by over 6 percentage points and outperforming larger models like DeepSeek-V3-0324 (671B) and Qwen3 232B-A22B under the same OpenHands scaffold. Devstral is small enough to run on a single RTX 4090 or a Mac with 32GB RAM, and is available via Mistral's API at $0.1/M input tokens, as well as on HuggingFace, Ollama, and other platforms. Mistral indicates a larger agentic coding model is in development.

Frontier Model Releases Evaluation and Benchmarking DeepSeek-V3-0324 Mistral AI GPT-4.1 mini +10 more

8Mistral Ai News·1mo ago·source ↗

Mistral Small 4: Unified Multimodal, Reasoning, and Coding MoE Model Released Under Apache 2.0

Mistral AI has released Mistral Small 4, a 119B-parameter Mixture-of-Experts model (6B active per token) that unifies capabilities previously split across Magistral (reasoning), Pixtral (multimodal), and Devstral (coding agents) into a single open-weights model. The model features a 256k context window, configurable reasoning effort via a `reasoning_effort` parameter, native text and image input support, and is released under Apache 2.0. Mistral claims 40% latency reduction and 3x throughput improvement over Mistral Small 3, with benchmark results showing competitive performance against GPT-OSS 120B and Qwen models while producing significantly shorter outputs. The release includes day-0 availability as an NVIDIA NIM and support across vLLM, llama.cpp, SGLang, and Transformers.

Long Context Evolution Frontier Model Releases Mistral AI Mistral Small 4 Pixtral +14 more

3arXiv · cs.AI·26d ago·source ↗

Fine-tuned PEGASUS-large outperforms LLaMA-3 and GPT-3.5 for automatic research paper title generation

Researchers propose a system for generating research paper titles from abstracts using pre-trained and large language models, evaluated on CSPubSum, LREC-COLING-2024, and a new dataset SpringerSSAT. Fine-tuned PEGASUS-large outperforms fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo across most metrics including ROUGE, METEOR, BERTScore, and SciBERTScore. The work is a narrow NLP application study with limited broader implications for the AI/ML landscape.

Evaluation and Benchmarking GPT-3.5 Turbo SciBERTScore SpringerSSAT +3 more

6Mistral Ai News·29d ago·source ↗

Mistral Releases Search Toolkit: Open-Source Composable Framework for Production RAG and Enterprise Search Pipelines

Mistral AI has launched Search Toolkit in public preview, an open-source framework that unifies document ingestion, retrieval, and evaluation into a single composable pipeline for AI applications. The toolkit ships with BM25 sparse retrieval, dense embedding-based retrieval, hybrid configurations, and built-in metrics (recall, precision, MRR, NDCG), targeting enterprise RAG workflows, domain-specific retrieval, and agentic systems. It integrates with MCP-based Connectors for live data access from CRMs, code repositories, and productivity tools. CMA CGM is cited as a production user, combining Search Toolkit with Voxtral for real-time fake news detection across audio sources.

Evaluation and Benchmarking Inference Economics Mistral AI BM25 Vespa +7 more

6Mistral Ai News·29d ago·source ↗

Mistral AI Releases Mathstral 7B: Math-Specialized Model with SOTA Reasoning in Size Category

Mistral AI has released Mathstral 7B, a math and STEM-specialized model built on Mistral 7B, developed in collaboration with Project Numina. The model achieves 56.6% on MATH and 63.47% on MMLU in standard evaluation, improving to 74.59% on MATH with a reward model over 64 candidates using inference-time compute scaling. Weights are open on HuggingFace and compatible with mistral-inference and mistral-finetune tooling.

Frontier Model Releases Evaluation and Benchmarking Mistral AI Mathstral 7B Project Numina +8 more