Entity · model

Llama 3.1 70B

model · active · llama-3-1-70b-a6be205a · 12 events · first seen May 19, 2026

Aliases: Llama 3.1 70B, Llama 3-70B, Llama 3.3 70B, Llama-3 70B, Llama-3.3-70B, LLaMA-3.1-70B

More like this (12)

Llama 2 70B · Llama-3.1-8B · Llama 3.2 · Meta Llama 3.1 405B · Llama 3 · Meta-Llama-3-70B · Llama 3.3 70B Instruct · Llama-3 · Llama 3.2 90B Vision · Llama · Llama3-8B · Llama 1B

Recent events (12)

4 · arXiv · cs.CL · 14h ago

Structured LLM extraction of financial news outperforms sentiment-only approaches for stock prediction

Researchers propose a framework using LLaMA-3.1-70B to extract six semantic dimensions from financial news (event type, impact scope, temporal horizon, semantic confidence, etc.) beyond simple sentiment polarity. Experiments on 41,618 news-stock pairs from the FNSPID dataset show that combining LLM-extracted structured features with FinBERT sentiment achieves F1=0.600, significantly outperforming either alone, with a 53.5% systematic disagreement rate indicating the two signal sources are largely orthogonal. The work argues that compressing financial news to a single sentiment score incurs substantial information loss and that multi-dimensional NLP extraction is systematically exploitable for prediction tasks.

Evaluation and Benchmarking · Enterprise Deployment Patterns · FNSPID · Llama 3.1 70B · FinBERT · +1 more

5 · arXiv · cs.CL · Jul 24, 2026

CM-LRS: A capital markets reliability benchmark for LLM workflow outputs

Researchers introduce CM-LRS (Capital Markets LLM Reliability Score), a seven-dimension evaluation framework assessing LLM outputs at the workflow level rather than the question-answer layer, targeting regulated capital-markets use cases such as DCM/ECM term extraction, M&A comparables, and issuer profiling. The benchmark is demonstrated on five workflows using public SEC EDGAR and UK takeover filings, scoring four models across four LLM judges. Key findings: frontier closed-source models cluster tightly (Sonnet 4.6 = 4.31, Opus 4.7 = 4.30, GPT-5.5 = 4.09) while Llama 3.3 70B lags at 3.15, with the gap concentrated in retrieval and synthesis tasks rather than extraction. The work advances domain-specific evaluation methodology for high-stakes financial workflows where regulatory defensibility matters.

Evaluation and Benchmarking · Enterprise Deployment Patterns · CM-LRS · SEC EDGAR · Llama 3.1 70B · +6 more

6 · The Batch · Jul 17, 2026

MIT and CMU introduce Puppet benchmark to measure LLM belief manipulation in users

Researchers at MIT and Carnegie Mellon University developed Puppet, a benchmark that measures how much LLMs actually shift users' beliefs after conversation, as opposed to detecting manipulative language patterns. The study tracked over 1,000 users interacting with GPT-4o under various prompting conditions and found high variability in belief shifts, with a median change of 3.3 but standard deviation of ~22. Existing manipulation detectors showed near-zero correlation with actual belief change, while LLMs like GPT-4o achieved moderate correlation (0.436) when estimating belief shifts from conversation transcripts alone. The work argues for direct belief-shift measurement as a more valid approach to assessing LLM persuasive risk.

Evaluation and Benchmarking · AI Safety Research · MIT · Carnegie Mellon University · Llama 3.1 70B · +7 more

6 · arXiv · cs.CL · Jul 15, 2026

Formal framework for valid extractable memorization claims in LLMs

A new arXiv preprint proposes a principled methodology for making valid extractable memorization claims about LLMs, addressing both over- and under-statement problems in prior work. The core contribution is a 'matched comparison' approach that measures generation probabilities of training sequences against comparable non-training sequences to establish a calibrated baseline for predictability. Two formalizations are offered: a conformal test for population-level claims and a census method for single-document claims. Applied to OLMo 2 32B on Wikipedia and Llama 3.1 70B on books, the framework reveals significant false-positive rates in naive extraction studies and supports memorization claims at probability thresholds as low as 1e-27.

Evaluation and Benchmarking · AI Safety Research · Llama 3.1 70B · OLMo-3 · Allen Institute for AI · +2 more

5 · arXiv · cs.CL · Jun 29, 2026

Triadic Werewolf benchmark exposes multi-hop Theory of Mind failures in LLMs

Researchers introduce a Werewolf game variant with a Jester faction whose inverted utility function (winning by being voted out) requires models to reason across three opposing incentive structures simultaneously. Across 60 games, GPT-4.1, DeepSeek-V3.1, and Llama-3.3-70B all struggle: Werewolves never exceed 20% win rate and GPT-4.1 wolves vote out the Jester in 60-70% of games, a self-defeating action. Only DeepSeek-V3.1 learns the nuanced strategy of appearing suspicious without appearing intentionally suspicious, and benefits most from self-learning. The work argues dyadic social-deduction benchmarks systematically underestimate the difficulty of multi-agent Theory of Mind.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Llama 3.1 70B · Triadic Werewolf · DeepSeek V4 · +3 more

5 · arXiv · cs.CL · Jun 24, 2026

AdversaBench: Automated LLM red-teaming pipeline with multi-judge confirmation and cross-model transferability

AdversaBench is a new end-to-end red-teaming pipeline that mutates seed prompts using five structured operators and confirms failures via a three-judge panel with a meta-judge tiebreaker. Experiments on 45 seeds across reasoning, instruction-following, and tool-use categories produced confirmed failures for every seed. Key findings include sharp variation in operator effectiveness by category, misleading binary failure rates, judge agreement metrics distorted by label skew, and zero-shot transferability of adversarial prompts from Llama 3.1 8B to Llama 3.3 70B. Code and dataset are publicly released.

Evaluation and Benchmarking · AI Safety Research · Llama 3.1 70B · AdversaBench · Meta · +1 more

6 · The Batch · Jun 19, 2026

DeepSWE, ProgramBench, and ITBench-AA emerge as harder successors to SWE-bench for agent evaluation

Three new benchmarks, DeepSWE (by Datacurve), ProgramBench (Meta/Stanford/Harvard), and ITBench-AA (IBM/Artificial Analysis), are positioned as more rigorous replacements for the SWE-bench family, which models have largely saturated. DeepSWE tests feature implementation using private codebases and human-written problems; ProgramBench evaluates agents' ability to recreate functional programs from scratch; ITBench-AA measures root-cause diagnosis in real-world IT incident scenarios. Current top performers are GPT-5.5 (70% on DeepSWE) and Claude Opus 4.7 (46.7% on ITBench-AA and 3% on ProgramBench at the 95% pass threshold), illustrating that even frontier models have substantial headroom.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Artificial Analysis · Llama 3.1 70B · Datacurve · +13 more

4 · arXiv · cs.CL · Jun 2, 2026

Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3

This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.

Inference Economics · Enterprise Deployment Patterns · MIMIC-III · Llama 3.1 70B · quantization · +4 more

7 · Mistral Ai News · Jun 1, 2026

Mistral Small 3: 24B Latency-Optimized Open-Weight Model Released Under Apache 2.0

Mistral AI has released Mistral Small 3, a 24B-parameter instruction-tuned model optimized for low latency, achieving over 81% on MMLU at 150 tokens/s on a single GPU. The model is competitive with Llama 3.3 70B and Qwen 32B while being more than 3x faster on equivalent hardware, and is released under Apache 2.0 for both pretrained and instruction-tuned checkpoints. It is explicitly not trained with RL or synthetic data, positioning it as a base model for community fine-tuning and reasoning capability development. Deployment targets include local inference on consumer hardware (RTX 4090, MacBook 32GB RAM), agentic function calling, and domain-specific fine-tuning.

Frontier Model Releases · Open Weights Progress · Mistral AI · Mistral Small 4 · Ollama · +12 more

6 · The Batch · Jun 1, 2026

Activation Capping Technique Stabilizes LLM Assistant Personas Against Drift and Jailbreaks

Researchers from MATS, Oxford, and Anthropic introduced the 'assistant axis,' a vector derived from LLM layer outputs that quantifies how closely a model adheres to its trained assistant persona. They developed 'activation capping,' an inference-time method that corrects deviations from this axis when similarity falls below a threshold. Testing on Gemma 2 27B, Qwen3 32B, and Llama 3.3 70B showed harmful response rates to jailbreak prompts dropped by roughly half (e.g., 83% to 41% for Qwen3 32B) without degrading benchmark performance. The technique targets character-based jailbreaks that bypass system prompts by manipulating a model's internal representational state.

Evaluation and Benchmarking · AI Safety Research · Gemma 2 9B · assistant axis · Llama 3.1 70B · +12 more
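
The activation-capping idea summarized above (correct a hidden state when its alignment with the assistant axis drops below a threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cap_along_axis`, the use of a scalar projection as the similarity measure, and the `floor` parameter are all assumptions made for illustration, and the vectors are plain Python lists rather than real model activations.

```python
import math


def cap_along_axis(hidden, axis, floor):
    """Shift `hidden` along `axis` so its scalar projection is at least `floor`.

    Hypothetical helper: `hidden` stands in for one layer's output vector,
    `axis` for the assistant-axis direction, and `floor` for the threshold.
    """
    # Normalize the axis so the projection is measured in consistent units.
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]

    # Scalar projection of the hidden state onto the axis (assumed measure).
    proj = sum(h * u for h, u in zip(hidden, unit))

    # States already at or above the floor are left untouched.
    if proj >= floor:
        return list(hidden)

    # Otherwise add just enough of the axis direction to reach the floor.
    shift = floor - proj
    return [h + shift * u for h, u in zip(hidden, unit)]


if __name__ == "__main__":
    drifted = [-2.0, 3.0]
    print(cap_along_axis(drifted, axis=[1.0, 0.0], floor=0.5))
```

Only the component along the axis changes; the orthogonal components are preserved, which is one plausible reading of why such a correction might leave benchmark performance largely intact.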

6 · arXiv · cs.CL · May 21, 2026

ChunkFT: Memory-Efficient Full Fine-Tuning via Byte-Streamed Chunk Optimization

ChunkFT is a fine-tuning framework that reformulates full-parameter optimization around a dynamically activated working set of sub-tensors, enabling gradient computation without dense gradient materialization. It achieves full-parameter fine-tuning of a 7B model in 13.72GB GPU memory on a single RTX 4090, and scales Llama 3-70B fine-tuning to 2×H800 GPUs. Downstream evaluations on language understanding, math reasoning, and MT-Bench show ChunkFT matches or exceeds full-parameter fine-tuning quality while outperforming existing memory-efficient baselines such as LoRA-class methods. A theoretical convergence analysis in the deterministic setting is also provided.

Training Infrastructure · Open Weights Progress · Llama 3.1 70B · MT-Bench · Meta AI · +5 more

9 · Hugging Face Blog · May 19, 2026

Llama 3.1 Released: 405B, 70B & 8B Models with Multilinguality and Long Context

Meta released Llama 3.1, a family of open-weights models at three scales (405B, 70B, 8B) featuring multilingual support and extended context windows. The 405B model represents Meta's largest open-weights release to date, positioning it as a frontier-class open model. Hugging Face published a blog post covering the release, integration details, and deployment options across the ecosystem.

Long Context Evolution · Frontier Model Releases · Llama 3.1 70B · Meta Llama 3.1 405B · Hugging Face · +5 more