Entity · model

Qwen2.5-7B

model · active · qwen2-5-7b-859ec241 · 10 events · first seen May 22, 2026

Aliases: Qwen2.5-7B, Qwen 2.5-14B, Qwen 2.5 14B, Qwen2.5 7B

More like this (12)

Qwen2.5-8B · Qwen2.5-3B · Qwen2.5-14B · Qwen2.5-0.5B · Qwen2.5-1.5B · Qwen 2.5-7B · Qwen3.5-0.8B · Qwen2.5-1.5B-Base · Qwen1.5-7B · Qwen3.5-2B-Base · Qwen3.6-27B · Qwen-0.5B

Recent events (10)

5 · arXiv · cs.LG · 3d ago

CARE: Confidence-Adaptive Routing for Mixture-of-Experts LoRA adjusts expert count per token

Researchers introduce CARE (Confidence-Adaptive Routing of Experts), a drop-in routing rule for MoE-LoRA that dynamically adjusts the number of active experts per token based on router output uncertainty rather than using a fixed top-k. The method uses nucleus-style cumulative mass thresholding with a budget thermostat to hit any target average expert count. Evaluated on LLaMA-3.1-8B and Qwen2.5-7B across commonsense, math, code, and knowledge benchmarks, CARE matches or outperforms fixed top-k baselines at equal compute while also improving out-of-distribution detection.

Open Weights Progress · Inference Economics · Qwen2.5-7B · CARE (Confidence-Adaptive Routing of Experts) · Spend Experts Where You Are Unsure: Confidence-Adaptive Routing for Mixture-of-Experts LoRA · +1 more
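The routing idea in the CARE summary above (nucleus-style cumulative mass thresholding plus a budget thermostat) can be illustrated roughly as follows. This is a minimal sketch, not the paper's implementation: `nucleus_select`, `thermostat_step`, the step size, and the simple up/down update rule are all hypothetical names and choices made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_select(router_logits, mass_threshold):
    """Pick the smallest set of experts whose cumulative router
    probability reaches mass_threshold (nucleus-style selection)."""
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= mass_threshold:
            break
    return chosen

def thermostat_step(mass_threshold, avg_experts_used, target_experts, step=0.01):
    """Hypothetical budget controller (an assumption, not the paper's rule):
    lower the threshold when too many experts are active on average,
    raise it when too few are, and clamp the result to [0, 1]."""
    if avg_experts_used > target_experts:
        mass_threshold -= step
    elif avg_experts_used < target_experts:
        mass_threshold += step
    return min(max(mass_threshold, 0.0), 1.0)

# A confident router output activates one expert; a uniform one activates all four.
confident = nucleus_select([10.0, 0.0, 0.0, 0.0], mass_threshold=0.9)
uncertain = nucleus_select([0.0, 0.0, 0.0, 0.0], mass_threshold=0.9)
```

Under these assumptions, tokens with a peaked router distribution get few experts and tokens with a flat distribution get many, while the controller keeps the average near a chosen budget.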

5 · arXiv · cs.CL · 4d ago

Closed-loop validation-repair achieves 99% schema compliance for clinical LLMs across healthcare standards

A new arXiv paper evaluates three open-source models (Qwen2.5 7B, Llama 3.1 8B, Gemma2 9B) on schema compliance with ICD-10, CPT, and HL7 FHIR standards across 960 clinical scenario-model pairs. Baseline compliance ranged from 85.9–91.6%, with 96% of failures being representation-level format violations rather than clinical reasoning errors. A closed-loop validation-repair framework raised overall compliance to 99.0%, with most errors resolving in one or two iterations, suggesting this system-level approach is a viable safeguard for healthcare EHR integration.

Evaluation and Benchmarking · Enterprise Deployment Patterns · HL7 FHIR R4 · Gemma 2 9B · Qwen2.5-7B · +3 more
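A closed-loop validation-repair cycle of the kind summarized above might look like the sketch below. Everything here is an assumption for illustration: the `repair_loop` and `validate_icd10_shape` names, the repair budget, and the stub generator standing in for a model call. The regular expression checks only the general shape of an ICD-10 code (letter, digit, alphanumeric, optional decimal part); it does not confirm that a code actually exists in the code set.

```python
import re

# Simplified shape check only; real validation would consult the ICD-10 code set.
ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")

def validate_icd10_shape(code):
    """Return a list of error messages; an empty list means the shape is valid."""
    if ICD10_SHAPE.match(code):
        return []
    return [f"'{code}' does not match the ICD-10 code format"]

def repair_loop(generate, validate, prompt, max_repairs=2):
    """Generate once, then feed validator errors back to the generator
    until the output validates or the repair budget is exhausted.
    Returns (output, repairs_used, is_valid)."""
    output = generate(prompt, None)
    errors = validate(output)
    repairs = 0
    while errors and repairs < max_repairs:
        output = generate(prompt, errors)
        repairs += 1
        errors = validate(output)
    return output, repairs, not errors

def stub_generate(prompt, feedback):
    """Hypothetical stand-in for an LLM call: emits a malformed code first,
    then a well-formed one once it receives validator feedback."""
    return "E11.9" if feedback else "e11 9"

result = repair_loop(stub_generate, validate_icd10_shape, "Code for type 2 diabetes")
```

The design point is that the validator, not the model, decides when the loop stops, which matches the summary's framing of compliance as a system-level safeguard.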

5 · arXiv · cs.CL · Jul 21, 2026

DeLIVeR: Reinforced knowledge graph exploration for LLM fact-checking

DeLIVeR is a new framework for automated fact-checking that decomposes complex claims into targeted questions and traverses structured Knowledge Graphs for evidence retrieval, optimizing a Planner LLM via Group Relative Policy Optimization (GRPO). Evaluated on LIAR, FEVER, and PolitiFact benchmarks using Qwen2.5-7B, the system achieves F1-scores of 83.73, 84.57, and 79.70 respectively, representing a 10-15% improvement over HippoRAG2. The approach addresses 'query brittleness' in traditional retrieval by framing evidence gathering as a reinforced strategic exploration task, yielding auditable reasoning paths.

Evaluation and Benchmarking · Agent and Tool Ecosystem · LIAR · Qwen2.5-7B · PolitiFact · +4 more
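For readers unfamiliar with GRPO, which the DeLIVeR summary above names as the optimizer for its Planner LLM, the core step is scoring each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that normalization, assuming population standard deviation and a small epsilon; it is not DeLIVeR's training code, and implementations differ on these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group:
    advantage_i = (reward_i - group_mean) / (group_std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled plans for one claim, two rewarded and two not.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, all advantages are zero
# and the group contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```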

6 · arXiv · cs.AI · Jul 17, 2026

PRISM probe separates physical danger from text-level safety in LLM hidden states for embodied agents

Researchers introduce PRISM, a single-layer logistic probe over LLM hidden states that distinguishes physically grounded danger (e.g., unsafe robot actions) from ordinary text-level content danger, showing these are separable signals in representations across multiple model families. PRISM achieves 86.2–87.7% accuracy on SafeAgentBench with substantially lower false-positive rates than same-scale LLM judges, which over-block safe tasks at 24.7–39.0% FPR. The authors also release PhysicalSafetyBench-1K (PSB-1K), a 1,000-pair contrastive benchmark for evaluating physical-risk detection without relying on explicit harm keywords. The work is relevant to safety in embodied AI and agentic systems where linguistic safety filters are insufficient.

Evaluation and Benchmarking · AI Safety Research · SafeAgentBench · Phi-3.5 · Qwen2.5-7B · +9 more
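A single-layer logistic probe, as described in the PRISM summary above, is a linear classifier trained on frozen hidden-state vectors. The sketch below trains one with plain gradient descent on toy two-dimensional vectors that stand in for hidden states. The data, learning rate, and function names are invented for illustration and do not come from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights by batch gradient descent
    on the mean cross-entropy loss."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return 1 (flagged as physically dangerous) or 0 (safe)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy stand-ins for hidden states: label 1 = physically dangerous action.
toy_states = [[-1.0, -0.8], [-0.9, -1.1], [1.0, 0.9], [0.8, 1.2]]
toy_labels = [0, 0, 1, 1]
w, b = train_probe(toy_states, toy_labels)
```

In practice the feature vectors would be activations extracted from one layer of the model, and the probe's weights would be the only trained parameters.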

5 · arXiv · cs.CL · Jul 8, 2026

LongCrafter: Evidence-graph-guided synthesis framework for long-context SFT data

LongCrafter is a structured framework for synthesizing long-context supervised fine-tuning data, addressing limitations of prior approaches including narrow task coverage, low difficulty, and lack of faithfulness supervision. The system uses a hierarchical 32-task taxonomy and constructs explicit evidence graphs modeling cross-paragraph dependencies to generate grounded instruction-response pairs. Models fine-tuned on LongCrafter data outperform SFT baselines and official post-trained models on LongBench, LongBench v2, and LooGLE for both Qwen2.5-7B and LLaMA-3.1-8B, with notable gains on high-difficulty tasks and improved robustness to the 'lost in the middle' problem.

Long Context Evolution · Evaluation and Benchmarking · Qwen2.5-7B · LongCrafter · LooGLE · +2 more

5 · arXiv · cs.CL · Jul 8, 2026

RL reward function design for LLM-generated BPMN process models: systematic study across 48 configurations

Researchers present a systematic study of reward function design for reinforcement learning applied to LLM-based BPMN process model generation, training Llama 3.1 8B and Qwen 2.5 14B across 48 configurations using Group Sequence Policy Optimization. Key findings: RL substantially improves syntactic and pragmatic quality while preserving semantic fidelity, equal reward weighting outperforms targeted weighting, and reward design effects interact with model architecture in non-trivial ways. The paper argues reward composition is as consequential as the decision to apply RL at all, with implications for any multi-dimensional structured generation task.

Evaluation and Benchmarking · Alignment and RLHF · Qwen2.5-7B · Improving LLM-Generated Process Model Quality Through Reinforcement Learning: The Role of Reward Function Design · GSPO (Group Sequence Policy Optimization) · +1 more

3 · arXiv · cs.CL · Jun 8, 2026

Supervised vs. in-context learning for Turkish multiword expression classification

A new arXiv paper evaluates Turkish idiomatic light verb construction (LVC) detection as a binary classification task, comparing a supervised BERTurk baseline against three instruction-tuned LLMs under zero-shot, one-shot, and few-shot prompting. Results show LLMs have very low LVC recall in zero-shot but improve substantially with demonstrations, though one-shot prompting can introduce strong model-specific biases. The supervised baseline remains competitive, while carefully constructed few-shot prompts allow GPT-OSS-20B and Qwen 2.5-14B to match or exceed it. The study highlights significant prompt sensitivity in Turkish metalinguistic classification tasks.

Evaluation and Benchmarking · Qwen2.5-7B · BERTurk · gpt-oss-20b

7 · arXiv · cs.CL · Jun 3, 2026

PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards

Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Qwen2.5-7B · GRPO · Qwen3-4B · +7 more

6 · arXiv · cs.CL · May 26, 2026

Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA

This paper investigates why NLI-based claim checkers used as process rewards in RL-trained medical RAG agents succeed or fail during training. The authors find that a checker's output distribution during training—not its held-out accuracy—determines whether it provides useful gradient signal, with LLM log-probability scoring causing near-total signal collapse (97%+ neutral labels) while a calibrated MedNLI classifier avoids this. A key finding is that stronger checkers can trigger reward hacking cascades (ultra-short answers, search avoidance, language collapse), while moderate-signal local classifiers yield better final model quality (+12% BERTScore over zero-shot). The work frames these as boundary conditions for verifier-as-reward systems in RLVR pipelines.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Qwen2.5-7B · GRPO · reward hacking · +8 more
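The signal-collapse finding above concerns the distribution of checker labels seen during training rather than held-out accuracy. One simple way to monitor that distribution is sketched below. The function names and the 0.97 cutoff are assumptions for illustration (the cutoff echoes the "97%+ neutral labels" figure in the summary) and are not taken from the paper's code.

```python
from collections import Counter

def label_distribution(labels):
    """Fraction of checker outputs falling on each NLI label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def is_collapsed(labels, dominant_threshold=0.97):
    """Flag a batch where one label dominates, so the resulting rewards
    are nearly constant and carry little gradient signal."""
    distribution = label_distribution(labels)
    return max(distribution.values()) >= dominant_threshold

# Hypothetical batches of checker labels logged during training.
collapsed_batch = ["neutral"] * 98 + ["entailment"] * 2
healthy_batch = ["neutral"] * 40 + ["entailment"] * 35 + ["contradiction"] * 25
```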

7 · arXiv · cs.AI · May 22, 2026

The Matching Principle: A Geometric Theory Unifying Robustness, Domain Adaptation, and Alignment via Nuisance Covariance

This paper proposes the 'matching principle': a unified geometric framework arguing that robustness methods (CORAL, IRM, adversarial training, augmentation, metric learning, Jacobian penalties, alignment constraints) are all estimators of the same object—the covariance of label-preserving deployment nuisance—and that regularizing the encoder Jacobian along this covariance's range is the core statistical problem. The authors prove closed-form optimality results in a linear-Gaussian model, introduce the Trajectory Deviation Index (TDI) as a label-free embedding sensitivity probe, and validate predictions across 13 pre-registered experimental blocks including Qwen2.5-7B. At 7B scale, matched style-PMH improves selective honesty while standard DPO degrades Style TDI, connecting the theory to alignment safety.

Evaluation and Benchmarking · AI Safety Research · Invariant Risk Minimization · Matching Principle · Qwen2.5-7B · +5 more