Entity · model

Llama-3.1-8B

model · active · llama-3-1-8b-d662880a · 29 events · first seen May 18, 2026

Aliases: Llama-3.1-8B, Llama 3-8B, Llama 3.1, Llama 3.1 8B, Llama-3-8B, Llama 3 8B, Llama-3 8B, LLaMA-3-8B, Llama 3.1-8B

Co-occurring entities

More like this (12)

Llama 3.1 70B, Llama 3.2, Llama3-8B, Llama 3, Llama-3, Meta Llama 3.1 405B, Llama3-8B-Instruct, Llama-Krikri-8B, Llama 2 70B, Llama-3.2-1B-Instruct, Llama, Llama 2

Recent events (29)

5 · arXiv · cs.CL · 47h ago

Multilingual study finds LLMs are not uniformly robust to non-canonical tokenizations, with up to 23.7% performance drops

A new arXiv paper investigates how language models behave when given alternative (non-canonical) tokenizations of the same input string across 27 languages and six downstream tasks. While prior work showed English models are largely invariant to such perturbations, the study finds this does not generalize: average relative performance drops by 23.7% for Llama-3.1-8B, 11.4% for Qwen3-8B, and 9.9% for Gemma-3-12B. Languages with higher token fragmentation are systematically more sensitive, and the authors show that LoRA fine-tuning on multi-tokenization data, including English-only data, provides meaningful mitigation.

Evaluation and Benchmarking LoRA Qwen3-4B Gemma 3 12B Instruct +1 more
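
For readers unfamiliar with the "relative performance" figures quoted above, the following is a minimal sketch of the arithmetic, assuming the drop is measured as a percentage of the canonical-tokenization score. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def relative_drop(canonical_score: float, perturbed_score: float) -> float:
    """Relative performance drop, as a percentage of the canonical score.

    Assumption: "relative" means the drop is divided by the score obtained
    with the canonical tokenization. The paper may define it differently.
    """
    if canonical_score <= 0:
        raise ValueError("canonical_score must be positive")
    return 100.0 * (canonical_score - perturbed_score) / canonical_score


# Hypothetical scores for illustration only.
print(round(relative_drop(0.80, 0.60), 1))  # 25.0
```

Under this reading, a model that falls from 0.80 to 0.60 loses 20 points in absolute terms but 25% in relative terms.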

4 · arXiv · cs.CL · 2d ago

AIriskEval-edu: Platform for auditing pedagogical risks in AI-generated educational explanations

Researchers present AIriskEval-edu Demo, a platform that audits instructional explanations across five pedagogical risk dimensions including factual accuracy, ideological bias, and student-level appropriateness. The system integrates GPT-5.5 via API alongside a fine-tuned self-hosted Llama 3.1 8B evaluator, with the local model outperforming GPT-5.5 on most metrics. The platform targets K-12 educational contexts and supports both automated auditing of AI-generated explanations and real-time auditing of human-written content, offering institutions a privacy-preserving deployment option.

Enterprise Deployment Patterns OpenAI Meta GPT-5.5 +2 more

6 · arXiv · cs.CL · 2d ago

Input-only prompt optimization can suppress evaluation-awareness latents in LLMs, but activation readability ≠ behavioral control

Researchers study the input-side dual of activation steering: optimizing fluent prompts to drive a chosen internal latent toward zero without inference-time model access. The target is an 'evaluation-awareness' latent whose suppression would threaten safety evaluation validity if models behave differently when detecting they are being tested. Experiments on Llama-3.2-3B and Llama-3.1-8B across five latent constructions (CAA direction, subspace norm, SAE feature, MLP neuron, behavioral logit) find the latent is robustly suppressible, but a key cautionary result emerges: a placebo random direction is suppressed just as hard and shifts behavior just as far, and suppressing the eval-direction in context fails to reduce behavioral eval judgment. The paper concludes that activation-readability does not imply behavioral controllability, with implications for how safety evaluations should be designed and interpreted.

Evaluation and Benchmarking AI Safety Research Minimizing Targeted Activations: Input-Only Suppression of Evaluation-Awareness Latents in Large Language Models Llama Scope Fluent Dreaming +6 more

5 · arXiv · cs.LG · 3d ago

CARE: Confidence-Adaptive Routing for Mixture-of-Experts LoRA adjusts expert count per token

Researchers introduce CARE (Confidence-Adaptive Routing of Experts), a drop-in routing rule for MoE-LoRA that dynamically adjusts the number of active experts per token based on router output uncertainty rather than using a fixed top-k. The method uses nucleus-style cumulative mass thresholding with a budget thermostat to hit any target average expert count. Evaluated on LLaMA-3.1-8B and Qwen2.5-7B across commonsense, math, code, and knowledge benchmarks, CARE matches or outperforms fixed top-k baselines at equal compute while also improving out-of-distribution detection.

Open Weights Progress Inference Economics Qwen2.5-7B CARE (Confidence-Adaptive Routing of Experts) Spend Experts Where You Are Unsure: Confidence-Adaptive Routing for Mixture-of-Experts LoRA +1 more
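
The "nucleus-style cumulative mass thresholding" mentioned in the CARE summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `mass_threshold` default, and the `max_experts` cap are hypothetical, and the budget thermostat that tunes the threshold toward a target average expert count is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_select_experts(router_logits, mass_threshold=0.9, max_experts=None):
    """Pick the smallest set of top experts whose probability mass
    reaches mass_threshold (hypothetical default), optionally capped."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for idx in ranked:
        chosen.append(idx)
        cumulative += probs[idx]
        if cumulative >= mass_threshold:
            break
        if max_experts is not None and len(chosen) >= max_experts:
            break
    return chosen


# A confident router concentrates mass on one expert; an uncertain one spreads it.
print(nucleus_select_experts([5.0, 0.0, 0.0, 0.0]))  # [0]
print(nucleus_select_experts([1.0, 1.0, 1.0, 1.0]))  # [0, 1, 2, 3]
```

The contrast with fixed top-k is visible in the two calls: the same rule activates one expert for a peaked distribution and four for a flat one.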

5 · arXiv · cs.CL · 3d ago

Closed-loop validation-repair achieves 99% schema compliance for clinical LLMs across healthcare standards

A new arXiv paper evaluates three open-source models (Qwen2.5 7B, Llama 3.1 8B, Gemma2 9B) on schema compliance with ICD-10, CPT, and HL7 FHIR standards across 960 clinical scenario-model pairs. Baseline compliance ranged from 85.9–91.6%, with 96% of failures being representation-level format violations rather than clinical reasoning errors. A closed-loop validation-repair framework raised overall compliance to 99.0%, with most errors resolving in one or two iterations, suggesting this system-level approach is a viable safeguard for healthcare EHR integration.

Evaluation and Benchmarking Enterprise Deployment Patterns HL7 FHIR R4 Gemma 2 9B Qwen2.5-7B +3 more
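
A closed-loop validation-repair cycle of the kind summarized above might look like the sketch below. Everything here is an assumption for illustration: the function names, the iteration cap, and especially the toy code pattern, which is not the real ICD-10, CPT, or FHIR grammar. In the paper's setting the repair step would presumably be another model call that receives the validator's error messages; a deterministic function stands in for it here.

```python
import re

# Toy pattern for illustration only; NOT the real ICD-10 grammar.
TOY_CODE_PATTERN = re.compile(r"^[A-Z][0-9]{2}\.[0-9]$")


def toy_validate(record):
    """Return a list of error messages; an empty list means compliant."""
    code = record.get("code", "")
    return [] if TOY_CODE_PATTERN.match(code) else [f"bad code format: {code!r}"]


def toy_repair(record, errors):
    """Stand-in for a model-driven repair: normalize case and dot placement."""
    code = record.get("code", "").strip().upper().replace(".", "")
    if len(code) > 3:
        code = code[:3] + "." + code[3:]
    return {**record, "code": code}


def validate_and_repair(output, validate, repair, max_iterations=3):
    """Validate, repair on failure, and re-validate until clean or out of budget.

    Returns (final_output, repair_iterations_used, is_compliant).
    """
    for iteration in range(max_iterations + 1):
        errors = validate(output)
        if not errors:
            return output, iteration, True
        if iteration == max_iterations:
            break
        output = repair(output, errors)
    return output, max_iterations, False


fixed, rounds, ok = validate_and_repair({"code": "e119"}, toy_validate, toy_repair)
print(fixed, rounds, ok)  # {'code': 'E11.9'} 1 True
```

The malformed input is fixed in a single repair round, mirroring the summary's observation that most errors resolve in one or two iterations.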

4 · arXiv · cs.CL · 4d ago

DWT-Fusion: Wavelet-based training-free framework for detecting LLM-generated text

Researchers introduce DWT-Fusion, a training-free method for detecting LLM-generated text that applies discrete wavelet transforms to token-level log-probability sequences from a proxy language model, capturing local and multiscale predictability patterns rather than global statistics. The framework evaluates four voting ensemble variants and is tested on HC3, M4, and MAGE benchmarks using GPT-Neo-2.7B, GPT-J-6B, Falcon-7B, and LLaMA-3-8B as proxy models. Best ensemble results achieve AUROC of 0.9919, 0.8477, and 0.7471 on the three benchmarks respectively. The approach is notable for requiring no supervised training while remaining interpretable.

Evaluation and Benchmarking AI Safety Research Falcon-7B DWT-Fusion MAGE +5 more
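
To make the idea of applying a discrete wavelet transform to token-level log-probabilities concrete, here is a small pure-Python sketch. It assumes a Haar wavelet and uses per-level detail energy as the feature; the summary does not say which wavelet family or features DWT-Fusion actually uses, so both choices, along with the function names, are hypothetical.

```python
import math


def haar_dwt(signal):
    """One level of the orthonormal Haar transform.

    Odd-length inputs are padded by repeating the last value.
    Returns (approximation, detail) coefficient lists.
    """
    values = list(signal)
    if len(values) % 2:
        values.append(values[-1])
    scale = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / scale for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / scale for i in range(0, len(values), 2)]
    return approx, detail


def multiscale_detail_energy(log_probs, levels=3):
    """Mean squared detail coefficient at each scale, finest first."""
    energies = []
    current = list(log_probs)
    for _ in range(levels):
        if len(current) < 2:
            break
        current, detail = haar_dwt(current)
        energies.append(sum(d * d for d in detail) / len(detail))
    return energies


# Hypothetical log-probability sequences from a proxy model.
flat = [-1.0, -1.0, -1.0, -1.0]   # uniformly predictable tokens
spiky = [0.0, -2.0, 0.0, -2.0]    # alternating easy and hard tokens
print(multiscale_detail_energy(flat))
print(multiscale_detail_energy(spiky))
```

The two sequences have the same mean log-probability, so a global statistic cannot tell them apart, while the finest-scale detail energy is zero for the flat sequence and clearly positive for the spiky one.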

5 · arXiv · cs.CL · Jul 22, 2026

MaLoRA and MaRA: Selective state-space adapters improve multi-hop reasoning over LoRA

A new arXiv preprint proposes two adapter families — MaLoRA (token-level dynamic scaling via Mamba recurrence) and MaRA (context-level segment retrieval via cross-segment state tracking) — as improvements over standard LoRA for language model reasoning. Evaluated on three frozen backbones (Qwen-2.5-7B, Llama-3.1-8B, Gemma-2-9B) and two multi-hop QA benchmarks (MuSiQue, 2WikiMultihopQA), the methods yield average gains of +6.8 F1 (+10.5% relative) over LoRA, with up to +18.2% relative improvement on the hardest configuration. Token-level gains also transfer to RULER QA-2 under length-stress conditions.

Long Context Evolution Evaluation and Benchmarking MaRA Gemma 2 9B MaLoRA +5 more

4 · arXiv · cs.CL · Jul 14, 2026

Token probability measurements reveal production-perception asymmetry in LLMs

A new arXiv preprint investigates whether LLMs exhibit a functional analog to the psycholinguistic production-perception distinction, using direct token probability measurements rather than metalinguistic prompting. Using Llama-3.1-8B and four other open-weight models, the authors find that production-perception prompt distances consistently exceed production-production distances by a ratio of ~1.8, with near-ceiling correlations in the production-production control confirming the effect is specific to communicative framing. The effect replicates across five models spanning base and instruction-tuned variants, and temporal analysis shows perception prompts exert strongest influence at sequence beginnings. The findings suggest prompt framing alone induces a production-perception distinction in decoder-only architectures.

Evaluation and Benchmarking Gemma 2 9B Qwen2.5-7B-Instruct-1M Mistral 7B Instruct v0.2 +2 more

4 · arXiv · cs.CL · Jul 9, 2026

PALS: Percentile-aware per-layer sparsity improves LLM pruning on LLaMA-2 but not universally

PALS (Percentile-Aware Layerwise Sparsity) is a one-shot pruning method that assigns per-layer sparsity ratios based on the 99th percentile of activation magnitudes, bounded within ±5% of a target ratio. On LLaMA-2-7B at 50% sparsity, PALS achieves perplexity of 10.96 vs. 12.92 for uniform Wanda, a statistically significant improvement requiring no fine-tuning. However, gains are architecture-dependent: LLaMA-3-8B shows marginal improvement and Mistral-7B shows none. A notable negative finding is that gradient-based allocation performs worse than random, suggesting gradient magnitude is a poor proxy for the impact of discrete weight removal.

Open Weights Progress Inference Economics PALS WikiText-2 LLaMA-7B +5 more

5 · arXiv · cs.CL · Jul 8, 2026

LongCrafter: Evidence-graph-guided synthesis framework for long-context SFT data

LongCrafter is a structured framework for synthesizing long-context supervised fine-tuning data, addressing limitations of prior approaches including narrow task coverage, low difficulty, and lack of faithfulness supervision. The system uses a hierarchical 32-task taxonomy and constructs explicit evidence graphs modeling cross-paragraph dependencies to generate grounded instruction-response pairs. Models fine-tuned on LongCrafter data outperform SFT baselines and official post-trained models on LongBench, LongBench v2, and LooGLE for both Qwen2.5-7B and LLaMA-3.1-8B, with notable gains on high-difficulty tasks and improved robustness to the 'lost in the middle' problem.

Long Context Evolution Evaluation and Benchmarking Qwen2.5-7B LongCrafter LooGLE +2 more

5 · arXiv · cs.CL · Jul 8, 2026

RL reward function design for LLM-generated BPMN process models: systematic study across 48 configurations

Researchers present a systematic study of reward function design for reinforcement learning applied to LLM-based BPMN process model generation, training Llama 3.1 8B and Qwen 2.5 14B across 48 configurations using Group Sequence Policy Optimization. Key findings: RL substantially improves syntactic and pragmatic quality while preserving semantic fidelity, equal reward weighting outperforms targeted weighting, and reward design effects interact with model architecture in non-trivial ways. The paper argues reward composition is as consequential as the decision to apply RL at all, with implications for any multi-dimensional structured generation task.

Evaluation and Benchmarking Alignment and RLHF Qwen2.5-7B Improving LLM-Generated Process Model Quality Through Reinforcement Learning: The Role of Reward Function Design GSPO (Group Sequence Policy Optimization) +1 more

5 · arXiv · cs.CL · Jun 29, 2026

VASAE: Vocabulary-Aligned Sparse Autoencoder assigns intrinsic token names to SAE features during training

Researchers introduce VASAE (Vocabulary-Aligned Sparse Autoencoder), a method that trains SAE features with vocabulary-aligned anchoring so each feature is intrinsically named by the nearest token in the model's embedding space. Applied to GPT-2-small and Llama-3.1-8B, VASAE achieves ~90% feature alignment in shallow-to-middle layers without degrading reconstruction quality, though final-layer alignment is limited. The work addresses a longstanding interpretability bottleneck where SAE dictionary features require expensive post-hoc labeling, potentially enabling more scalable mechanistic analysis.

Evaluation and Benchmarking AI Safety Research GPT-2-small VASAE Llama-3.1-8B
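
The notion of naming a feature by "the nearest token in the model's embedding space" can be illustrated with a toy lookup. This sketch assumes cosine similarity as the distance measure and uses made-up two-dimensional embeddings; it shows only the naming step, not VASAE's training-time anchoring, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def name_feature(feature_direction, token_embeddings):
    """Return the token whose embedding is most similar to the feature direction."""
    return max(
        token_embeddings,
        key=lambda token: cosine(feature_direction, token_embeddings[token]),
    )


# Made-up embeddings for illustration only.
toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
print(name_feature([0.9, 0.1], toy_embeddings))  # cat
```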

4 · arXiv · cs.CL · Jun 26, 2026

FisherSketch: Efficient Fisher Alignment for Training-Free Source Selection in LLM Fine-Tuning

A new arXiv preprint introduces FisherSketch, a method for estimating head Fisher alignment between LLM tasks using a streaming, memory-efficient sketch (16 KB task signature, 192 KB streaming state) without materializing the full Fisher matrix. The work targets training-free source corpus selection for LLM families with shared vocabularies, particularly in scientific string domains like SMILES, protein, and genomic sequences. The authors demonstrate that representation similarity metrics like CKA are non-identifiable for transfer in shared-output-head settings, and validate FisherSketch on Llama-3.1-8B with verbalizer-shift experiments.

Evaluation and Benchmarking Alignment and RLHF CKA The Geometry of Updates: Fisher Alignment at Vocabulary Scale FisherSketch +1 more

5 · arXiv · cs.CL · Jun 24, 2026

AdversaBench: Automated LLM red-teaming pipeline with multi-judge confirmation and cross-model transferability

AdversaBench is a new end-to-end red-teaming pipeline that mutates seed prompts using five structured operators and confirms failures via a three-judge panel with a meta-judge tiebreaker. Experiments on 45 seeds across reasoning, instruction-following, and tool-use categories produced confirmed failures for every seed. Key findings include sharp variation in operator effectiveness by category, misleading binary failure rates, judge agreement metrics distorted by label skew, and zero-shot transferability of adversarial prompts from Llama 3.1 8B to Llama 3.3 70B. Code and dataset are publicly released.

Evaluation and Benchmarking AI Safety Research Llama 3.1 70B AdversaBench Meta +1 more

4 · arXiv · cs.CL · Jun 24, 2026

Multi-agent semantic rewriting framework for privacy-preserving RAG

A new arXiv preprint proposes a three-agent framework for sanitizing retrieved content in RAG pipelines by performing privacy extraction, semantic analysis, and reconstruction as an offline preprocessing step. Evaluated on ChatDoctor and Wiki-PII datasets across six LLMs, the approach reduces targeted information exposure in LLaMA-3-8B from 144 baseline instances to 1, while maintaining contextual fidelity (BLEU-1 of 0.122 vs. SAGE's 0.117). The framework introduces no additional online inference latency since rewriting is done offline. Source code is publicly released.

AI Safety Research Enterprise Deployment Patterns Privacy-Preserving RAG via Multi-Agent Semantic Rewriting Wiki-PII SAGE +2 more

5 · arXiv · cs.CL · Jun 23, 2026

ORBIT: Training-free multi-attribute behavioral steering via orthogonal subspace rotation

Researchers introduce ORBIT (Orthogonal Rotation-Based Intervention Technique), a training-free activation steering method that simultaneously controls multiple behavioral attributes in language models. The approach constructs a joint subspace from per-attribute steering planes via SVD and applies a single norm-preserving rotation, avoiding the norm imbalance and directional cancellation problems of naive vector summation. The authors also release TraitFactory, a new multi-attribute behavioral benchmark, and evaluate across Llama-3.2-3B, Qwen-2.5-7B, and Llama-3.1-8B. ORBIT outperforms existing training-free baselines on multi-attribute steering while better preserving output coherence.

Evaluation and Benchmarking Alignment and RLHF TraitFactory Llama 3.2 ORBIT +3 more

6 · arXiv · cs.CL · Jun 17, 2026

Location metadata causes systematic geographic bias leakage in LLMs, even with 'Unknown' placeholders

Researchers evaluate 'location leakage' — the phenomenon where LLMs generate geographically biased outputs when exposed to location metadata in user profiles, even when prompts are geographically neutral. Across creative writing and Q&A tasks, leakage spikes up to 793x above baseline for models including Llama 3.1-8B, Qwen3-8B, and Claude Sonnet 4.6. A novel structural finding shows that replacing location with 'Unknown' still elevates leakage by up to 72x, indicating the user profile frame itself acts as a conditioning signal independent of geographic content. This has direct implications for AI systems that use user metadata for localization.

Evaluation and Benchmarking AI Safety Research Claude Sonnet 4 Alibaba Qwen3-4B +4 more

7 · arXiv · cs.CL · Jun 9, 2026

RLHF produces shallow political neutrality by severing causal pathways, not erasing partisan structure

Researchers compare internal representations of Llama 3.1 8B before and after RLHF, finding that alignment training does not remove partisan political geometry from the model but instead compresses output variance to produce balanced responses. Sparse autoencoder decomposition shows that policy-encoding features active in the base model become completely inactive in the instruction-tuned version, while feature-level steering experiments confirm the causal disconnect is real. The underlying partisan structure remains intact and can be reactivated by inferring and amplifying a user's partisan identity, suggesting RLHF alignment is functionally fragile. The authors argue this 'disconnection rather than removal' pattern may generalize to other value domains beyond political orientation.

AI Safety Research Alignment and RLHF Reinforcement Learning from Human Feedback The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model Sparse Autoencoder +2 more

5 · arXiv · cs.CL · Jun 5, 2026

LLMs fail to consistently simulate demographic perspective-taking in hate speech annotation

A new arXiv paper evaluates whether persona-conditioned LLMs can replicate how different demographic groups perceive hate speech, testing three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. No model consistently captures all three dimensions, and performance is highly model-dependent rather than emerging reliably from identity prompts alone. Vicarious prompting with Llama 3.1 provides the closest approximation to human disagreement patterns across demographic axes. The findings have implications for using LLMs as proxies for diverse human annotators in content moderation tasks.

Evaluation and Benchmarking AI Safety Research From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation Meta Llama-3.1-8B

3 · arXiv · cs.AI · Jun 4, 2026

Fine-tuned PEGASUS-large outperforms LLaMA-3 and GPT-3.5 for automatic research paper title generation

Researchers propose a system for generating research paper titles from abstracts using pre-trained and large language models, evaluated on CSPubSum, LREC-COLING-2024, and a new dataset SpringerSSAT. Fine-tuned PEGASUS-large outperforms fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo across most metrics including ROUGE, METEOR, BERTScore, and SciBERTScore. The work is a narrow NLP application study with limited broader implications for the AI/ML landscape.

Evaluation and Benchmarking GPT-3.5 Turbo SciBERTScore SpringerSSAT +3 more

4 · arXiv · cs.CL · Jun 2, 2026

Sentence-Level Clinical Provenance Categorization for Multidisciplinary Hospital Summarization Using Fine-Tuned Llama-3

This pilot study presents a pipeline for categorizing sentence-level clinical provenance across multi-source hospital notes, targeting structured summarization in high-complexity settings like the NICU. The authors fine-tune Llama-3 8B and 70B models on MedSecId (MIMIC-III annotations), achieving Macro F1 above 92% in-domain. Cross-domain evaluation reveals a scale-dependent transfer effect: SFT substantially improves the 70B model (+7% Macro F1) but yields only marginal gains for the 8B model. A quantized fine-tuned 70B model outperforms its full-precision baseline while reducing compute, suggesting quantized adaptation is viable for structured clinical NLP tasks.

Inference Economics Enterprise Deployment Patterns MIMIC-III Llama 3.1 70B quantization +4 more

7 · Mistral Ai News · Jun 1, 2026

Mistral NeMo: 12B Open-Weights Model with 128k Context, Built with NVIDIA

Mistral AI and NVIDIA jointly release Mistral NeMo, a 12B parameter model under Apache 2.0 license featuring a 128k token context window and a new tokenizer called Tekken based on Tiktoken. The model is designed as a drop-in replacement for Mistral 7B, supports multilingual applications across 11+ languages, and was trained with quantization awareness enabling FP8 inference without performance loss. Benchmark comparisons show competitive performance against Gemma 2 9B and Llama 3 8B. Weights are available on HuggingFace and the model is also packaged as an NVIDIA NIM inference microservice.

Long Context Evolution Frontier Model Releases Mistral AI Gemma 2 9B Apache 2.0 +9 more

7 · Mistral Ai News · Jun 1, 2026

Mistral AI Releases Ministral 3B and 8B Edge Models

Mistral AI has introduced two new small language models, Ministral 3B and Ministral 8B, targeting on-device and edge computing use cases. Both models support up to 128k context length and claim state-of-the-art performance in the sub-10B parameter category, outperforming comparable models from Google and Meta on internal benchmarks. Ministral 8B features an interleaved sliding-window attention mechanism for memory-efficient inference and is priced at $0.1/M tokens via API, while Ministral 3B is priced at $0.04/M tokens. Weights for Ministral 8B Instruct are available for research use, with commercial licensing available on request.

Long Context Evolution Frontier Model Releases Mistral AI Gemma 2 9B Ministral 8B +12 more

6 · arXiv · cs.CL · May 26, 2026

Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA

This paper investigates why NLI-based claim checkers used as process rewards in RL-trained medical RAG agents succeed or fail during training. The authors find that a checker's output distribution during training—not its held-out accuracy—determines whether it provides useful gradient signal, with LLM log-probability scoring causing near-total signal collapse (97%+ neutral labels) while a calibrated MedNLI classifier avoids this. A key finding is that stronger checkers can trigger reward hacking cascades (ultra-short answers, search avoidance, language collapse), while moderate-signal local classifiers yield better final model quality (+12% BERTScore over zero-shot). The work frames these as boundary conditions for verifier-as-reward systems in RLVR pipelines.

Evaluation and Benchmarking Agent and Tool Ecosystem Qwen2.5-7B GRPO reward hacking +8 more

4 · arXiv · cs.CL · May 21, 2026

Automated ICD Classification of Psychiatric Diagnoses Using NLP and LLMs

This study evaluates NLP and ML approaches for automating the mapping of free-text psychiatric descriptions to ICD diagnostic codes, using a dataset of 145,513 Spanish clinical records. Methods range from classical BoW/TF-IDF representations to transformer-based embeddings including e5_large, BioLORD, and Llama-3-8B. Fine-tuned e5_large achieved the best performance with a micro-F1 of 0.866, outperforming classical methods by capturing semantic nuance and medical terminology. The work highlights challenges of long-tail label distributions and ambiguity specific to psychiatric clinical language.

Enterprise Deployment Patterns Agent and Tool Ecosystem International Classification of Diseases (ICD) e5_large Bag of Words (BoW) +3 more

6 · arXiv · cs.CL · May 21, 2026

ChunkFT: Memory-Efficient Full Fine-Tuning via Byte-Streamed Chunk Optimization

ChunkFT is a fine-tuning framework that reformulates full-parameter optimization around a dynamically activated working set of sub-tensors, enabling gradient computation without dense gradient materialization. It achieves full-parameter fine-tuning of a 7B model in 13.72GB GPU memory on a single RTX 4090, and scales Llama 3-70B fine-tuning to 2×H800 GPUs. Downstream evaluations on language understanding, math reasoning, and MT-Bench show ChunkFT matches or exceeds full-parameter fine-tuning quality while outperforming existing memory-efficient baselines such as LoRA-class methods. A theoretical convergence analysis in the deterministic setting is also provided.

Training Infrastructure Open Weights Progress Llama 3.1 70B MT-Bench Meta AI +5 more

9 · Hugging Face Blog · May 19, 2026

Llama 3.1 Released: 405B, 70B & 8B Models with Multilinguality and Long Context

Meta released Llama 3.1, a family of open-weights models at three scales (405B, 70B, 8B) featuring multilingual support and extended context windows. The 405B model represents Meta's largest open-weights release to date, positioning it as a frontier-class open model. Hugging Face published a blog post covering the release, integration details, and deployment options across the ecosystem.

Long Context Evolution Frontier Model Releases Llama 3.1 70B Meta Llama 3.1 405B Hugging Face +5 more

5 · arXiv · cs.CL · May 19, 2026

DiSP: A Sample-and-Judge Framework for Efficient In-Context Learning Demonstration Selection

DiSP reframes ICL demonstration selection as a prediction problem rather than a search problem, arguing it is cheaper to judge whether a query-context pair will succeed than to find an optimal context. The framework stratifies queries by difficulty using a lightweight router, trains level-specific judges, and applies stop-on-acceptance judging under an explicit budget. Evaluated on five classification datasets with Llama 3-8B and Qwen 2.5-7B, DiSP improves over strong learned selection baselines by up to 3.4% accuracy while achieving up to 23x wall-clock speedup.

Inference Economics Agent and Tool Ecosystem DiSP Qwen 2.5-7B in-context learning +1 more
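
The "stop-on-acceptance judging under an explicit budget" described in the DiSP summary can be sketched as a simple loop. This is a loose illustration rather than the paper's method: the difficulty router and level-specific judges are left out, the judge is any callable returning a success score, and the names `select_context` and `accept_threshold` are hypothetical.

```python
def select_context(candidates, judge, budget, accept_threshold=0.5):
    """Judge sampled contexts one by one and stop at the first acceptable one.

    judge(context) is assumed to return a predicted success score in [0, 1].
    Returns (best_context_seen, judge_calls_used).
    """
    best, best_score = None, float("-inf")
    calls = 0
    for context in candidates:
        if calls >= budget:
            break
        score = judge(context)
        calls += 1
        if score > best_score:
            best, best_score = context, score
        if score >= accept_threshold:
            break  # accept immediately instead of searching for the optimum
    return best, calls


# Hypothetical judge scores for three sampled demonstration sets.
toy_scores = {"ctx_a": 0.2, "ctx_b": 0.7, "ctx_c": 0.9}
print(select_context(["ctx_a", "ctx_b", "ctx_c"], toy_scores.get, budget=3))
# ('ctx_b', 2)
```

The loop accepts `ctx_b` after two judge calls even though `ctx_c` would score higher, which is the trade the summary describes: predicting that a context is good enough is cheaper than searching for the best one.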

5 · arXiv · cs.LG · May 18, 2026

Layer Equivalence Is Not a Property of Layers Alone: How You Test Redundancy Changes What You Find

This paper distinguishes two protocols for measuring transformer layer redundancy—replacement (can one layer substitute for another in place?) and interchange (do two layers approximately commute when swapped?)—and shows they can disagree substantially. Experiments on Pythia (410M, 1.4B) and 8B-scale models (Qwen3-8B, Llama-3.1-8B) reveal that the protocol gap grows during training and can change which layers appear safe to prune by several-fold. Notably, Qwen3-8B shows interchange-guided removal is far safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols despite lower interchange KL. The authors recommend scoring both swap-KL metrics before any layer removal or merging, requiring only unlabeled forward passes.

Evaluation and Benchmarking Inference Economics WikiText-2 layer pruning Pythia +3 more