Entity · model

LLaMA-7B

model · active · llama-7b-e0b996cb · 6 events · first seen May 19, 2026

Aliases: LLaMA-7B, LLaMA-2 7B, LLaMA-2-7B, LLaMA 2-7B

More like this (12)

LLaVA-1.5-7B, LLaMA-2-13B, SaulLM-7B, OLMoE-1B-7B, Falcon-7B, LLaDA-8B, LLaDA-8B-Base, LLaVA-1.5-13B, Llama-Krikri-8B, Llama 2 70B, Qwen-7B, Mistral 7B

Recent events (6)

4 · arXiv · cs.CL · Jul 9, 2026

PALS: Percentile-aware per-layer sparsity improves LLM pruning on LLaMA-2 but not universally

PALS (Percentile-Aware Layerwise Sparsity) is a one-shot pruning method that assigns per-layer sparsity ratios based on the 99th percentile of activation magnitudes, with each ratio bounded within ±5% of the target. On LLaMA-2-7B at 50% sparsity, PALS reaches a perplexity of 10.96 versus 12.92 for Wanda with uniform sparsity, a statistically significant improvement that requires no fine-tuning. The gains are architecture-dependent, however: LLaMA-3-8B shows only marginal improvement and Mistral-7B shows none. A notable negative finding is that gradient-based allocation performs worse than random allocation, suggesting that gradient magnitude is a poor proxy for the impact of discrete weight removal.
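The allocation idea in this summary can be illustrated with a minimal sketch. The summary does not state how PALS maps the 99th-percentile statistic to a sparsity ratio, so the linear mapping below, the choice to prune high-percentile layers less, and the names `percentile` and `allocate_sparsity` are all illustrative assumptions rather than the authors' method.

```python
from typing import List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def allocate_sparsity(
    layer_activations: List[Sequence[float]],
    target: float = 0.5,
    band: float = 0.05,
) -> List[float]:
    """Assign one sparsity ratio per layer inside [target - band, target + band].

    Assumption (not from the source): layers with a larger 99th-percentile
    activation magnitude are treated as more sensitive and are pruned less.
    """
    scores = [percentile([abs(a) for a in acts], 99.0) for acts in layer_activations]
    mean = sum(scores) / len(scores)
    spread = max(abs(s - mean) for s in scores)
    if spread == 0.0:
        # All layers look alike, so fall back to uniform sparsity.
        return [target] * len(scores)
    # Linear map: deviation from the mean score becomes a deviation from the
    # target ratio, scaled so that no layer leaves the +/- band.
    return [target - band * (s - mean) / spread for s in scores]


if __name__ == "__main__":
    # Toy activations for three hypothetical layers.
    layers = [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
    print(allocate_sparsity(layers))
```

Because the deviations from the mean score sum to zero, the per-layer ratios in this sketch average to the target, so the overall sparsity budget is preserved while individual layers move within the ±5% band.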

Open Weights Progress, Inference Economics, PALS, WikiText-2, LLaMA-7B, +5 more

4 · arXiv · cs.CL · Jun 23, 2026

SVD-Surgeon: Training-free optimal singular value compensation for LLM compression

SVD-Surgeon applies the Optimal Brain Surgeon (OBS) framework to singular-value decomposition-based LLM compression, computing closed-form updates to retained singular values that compensate for pruned ones to second order in the model loss. The method is training-free and modular, designed to layer on top of existing SVD compressors. Applied to SVD-LLM on OPT and LLaMA 2-7B, it improves the perplexity-compression trade-off without retraining.

Open Weights Progress, Inference Economics, SVD-LLM, LLaMA-7B, Optimal Brain Surgeon, +2 more

4 · arXiv · cs.LG · Jun 8, 2026

SETA: Sparse Subspace-to-Expert Sharing for Continual Learning in LLMs

Researchers introduce SETA (Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning), a framework that addresses catastrophic forgetting in LLMs by adaptively decomposing sparse subspaces into task-specific and shared expert modules. The approach uses adaptive elastic anchoring and routing-aware regularization to protect shared knowledge at both the weight and routing levels. Experiments on LLaMA-2 7B and Qwen3-4B show competitive or superior performance versus continual learning baselines, with strong retention of early-task knowledge.

Evaluation and Benchmarking, Open Weights Progress, LLaMA-7B, Qwen3-4B, Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning, +1 more

6 · arXiv · cs.AI · May 26, 2026

OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

This paper introduces Orthogonal Residual Projection (ORP), an algorithm-hardware co-design framework for ultra-low-bit quantization of LLMs and Vision Transformers targeting edge deployment. ORP addresses the structural limitations of Power-of-Two (PoT) quantization by formulating quantization as a dual-basis geometric projection that synthesizes higher-resolution residual lattices using only shift-and-add operations, eliminating multipliers. At 3-bit (W3/A16), ORP achieves 6.10 perplexity on LLaMA-2-7B, competitive with MAC-intensive baselines like AWQ, while reducing full-model calibration time to ~15 minutes. RTL synthesis at 28nm confirms hardware efficiency by mitigating timing bottlenecks from dense multiplier trees.
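To make the shift-and-add idea concrete, here is a minimal sketch of a generic two-term power-of-two approximation: a weight is rounded to the nearest signed power of two, and the leftover residual is rounded to a second power of two. This is not the ORP algorithm itself, whose dual-basis projection is not specified in this summary. The function names and the exponent range are hypothetical.

```python
import math
from typing import List, Tuple


def nearest_pot(x: float, min_exp: int = -8, max_exp: int = 0) -> float:
    """Round x to the nearest signed power of two with exponent in [min_exp, max_exp].

    Zero maps to zero, and magnitudes far below 2**min_exp underflow to zero.
    """
    if x == 0.0:
        return 0.0
    sign = 1.0 if x > 0 else -1.0
    mag = abs(x)
    exp = math.floor(math.log2(mag))
    # Pick whichever of 2**exp and 2**(exp + 1) is closer in linear distance.
    if mag - 2.0 ** exp > 2.0 ** (exp + 1) - mag:
        exp += 1
    if exp < min_exp:
        # Underflow: snap to zero or to the smallest representable magnitude.
        return 0.0 if mag < 2.0 ** (min_exp - 1) else sign * 2.0 ** min_exp
    return sign * 2.0 ** min(exp, max_exp)


def two_term_pot(w: float) -> Tuple[float, float]:
    """Approximate w as a sum of two signed powers of two (main term + residual)."""
    first = nearest_pot(w)
    second = nearest_pot(w - first)
    return first, second


def shift_add(x: int, shifts: List[int]) -> int:
    """Multiply integer x by sum(2**s for s in shifts) using only shifts and adds."""
    return sum(x << s for s in shifts)


if __name__ == "__main__":
    first, second = two_term_pot(0.3)
    print(first, second, first + second)
    print(shift_add(5, [2, 1]))  # 5 * (4 + 2)
```

The second term is what recovers resolution between adjacent powers of two. In hardware, each term corresponds to a bit shift, so a product with the quantized weight needs two shifts and one add instead of a multiplier.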

Training Infrastructure, Evaluation and Benchmarking, ViT (Vision Transformer), Orthogonal Residual Projection, AWQ, +5 more

5 · arXiv · cs.CL · May 21, 2026

Conditional Scale Entropy: A Wavelet-Derived Tool for Mechanistic Interpretability of Metaphor Processing in Transformers

This paper introduces Conditional Scale Entropy (CSE), a wavelet-derived measure of how transformer computation engages across frequency scales at each layer, and applies it to study metaphor processing in decoder-only language models. The authors prove CSE is invariant to update magnitude, isolating structural computation patterns from intensity. Across architectures ranging from GPT-2 (124M) to LLaMA-2 7B and GPT-oss 20B, metaphorical tokens consistently produce higher spectral breadth than literal tokens in early-to-mid layers, with the effect surviving permutation correction and specificity controls. The work establishes multi-scale coordination as a consistent mechanistic signature of metaphorical language processing and positions CSE as a general interpretability tool for cross-depth structure in transformers.

Evaluation and Benchmarking, AI Safety Research, Conditional Scale Entropy, mechanistic interpretability, GPT-2, +3 more

6 · Hugging Face Blog · May 19, 2026

GaLore: Advancing Large Model Training on Consumer-grade Hardware

GaLore (Gradient Low-Rank Projection) is a memory-efficient training technique that reduces optimizer state memory by projecting gradients into a low-rank subspace during training, enabling large model training on consumer-grade hardware. The Hugging Face blog post covers integration of GaLore into the transformers and peft ecosystems. Unlike LoRA, which constrains weight updates to low-rank adapter matrices, GaLore applies the low-rank projection to the gradients, so every parameter is still trained while the optimizer state occupies far less memory. This makes training models like LLaMA-7B feasible on a single consumer GPU.
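The memory saving can be illustrated with a minimal sketch. The code below is a simplified stand-in, not the GaLore implementation. It uses a rank-1 projection found by power iteration and plain momentum SGD, whereas the published method is understood to use a rank-r projection that is refreshed periodically, together with an Adam-style optimizer. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def top_left_singular_vector(grad: Matrix, iters: int = 50) -> List[float]:
    """Approximate the top left singular vector of grad by power iteration on G G^T."""
    rows, cols = len(grad), len(grad[0])
    u = [1.0 / math.sqrt(rows)] * rows
    for _ in range(iters):
        # v = G^T u, then u = G v, followed by normalization.
        v = [sum(grad[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        u = [sum(grad[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            raise ValueError("gradient is zero or orthogonal to the start vector")
        u = [x / norm for x in u]
    return u


def galore_like_step(
    weights: Matrix,
    grad: Matrix,
    u: List[float],
    momentum: List[float],
    lr: float = 0.1,
    beta: float = 0.9,
) -> None:
    """One momentum-SGD step whose optimizer state lives in a rank-1 subspace.

    weights and grad are m x n, u has length m, and momentum has length n
    (instead of m x n), which is where the memory saving comes from.
    """
    rows, cols = len(grad), len(grad[0])
    # Project the gradient into the subspace: r = u^T G (length n).
    projected = [sum(u[i] * grad[i][j] for i in range(rows)) for j in range(cols)]
    # Update the low-dimensional optimizer state in place.
    for j in range(cols):
        momentum[j] = beta * momentum[j] + (1.0 - beta) * projected[j]
    # Project back to full size (u m^T) and update every weight.
    for i in range(rows):
        for j in range(cols):
            weights[i][j] -= lr * u[i] * momentum[j]


if __name__ == "__main__":
    grad = [[3.0, 4.0, 5.0], [6.0, 8.0, 10.0]]  # toy rank-1 gradient
    weights = [[0.0] * 3 for _ in range(2)]
    momentum = [0.0] * 3
    u = top_left_singular_vector(grad)
    galore_like_step(weights, grad, u, momentum)
    print(weights)
```

Note that the full weight matrix is updated on every step. Only the optimizer state is stored in the reduced space, which is the contrast with LoRA drawn in the summary.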

Training Infrastructure, Open Weights Progress, PEFT, LoRA, LLaMA-7B, +3 more