4 · arXiv cs.CL (Computation and Language) · 47h ago

Adaptive LLM tutoring system with subject-aware prompt routing improves high-school student engagement

Researchers develop and evaluate an LLM-based tutoring system that uses a learned prompt routing model to dynamically select pedagogical strategies based on 14 features extracted from conversation transcripts. The system was trained in simulation and deployed in an A/B test with 359 high-school students (656 conversations), showing sim-to-real transfer and reducing required interactions by ~3 turns. A stochastic routing strategy achieved a notably higher exercise conversion rate (28.1%) than either a greedy router (19.1%) or a static baseline (19.6%).
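The difference between a greedy and a stochastic router can be illustrated with a minimal Python sketch. The summary does not say how the paper's stochastic router samples, so the softmax-with-temperature scheme here is an assumption, and the function names and example scores are hypothetical rather than taken from the paper.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn raw strategy scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_route(scores):
    """Greedy routing: always pick the highest-scoring strategy."""
    return max(range(len(scores)), key=lambda i: scores[i])


def stochastic_route(scores, temperature=1.0, rng=random):
    """Stochastic routing (assumed scheme): sample a strategy index
    in proportion to its softmax probability."""
    probs = softmax(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


# Hypothetical scores a routing model might assign to three
# pedagogical strategies for one conversation state.
example_scores = [0.1, 0.7, 0.2]
greedy_choice = greedy_route(example_scores)
sampled_choices = [
    stochastic_route(example_scores, rng=random.Random(seed))
    for seed in range(200)
]
```

Under this sketch the greedy router picks the same strategy every time for a given state, while the stochastic router still favours the top-scoring strategy but occasionally tries the others.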

Enterprise Deployment Patterns Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

Related guides (1)

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Related events (8)

5 · arXiv · cs.AI · 10d ago

EEVEE: Multi-dataset test-time prompt learning framework for self-improving LLM agents

EEVEE is a new framework enabling LLM agents to perform test-time prompt learning across heterogeneous multi-dataset task streams, addressing a gap where prior methods only handled single-dataset settings. The system uses a router to partition inputs into task clusters and assigns them to suitable prompt configurations, optimized via a router-prompt co-evolution strategy. Experiments show improvements of 10.38 and 24.32 average points over Qwen3-4B-Instruct and DeepSeek-V3.2 respectively, outperforming prior SOTA methods GEPA and ACE by up to 48.2%.

Evaluation and Benchmarking Agent and Tool Ecosystem ACE DeepSeek V4 Qwen3-4B-Instruct +2 more

5 · arXiv · cs.CL · 9d ago

Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods

A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.

Evaluation and Benchmarking Alignment and RLHF On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

4 · arXiv · cs.LG · 5d ago

Dual-adapter routing system improves knowledge editing precision in LLMs

A new arXiv paper introduces a route-specialized dual-adapter architecture for knowledge editing in LLMs, separating the concerns of writing edits (edit adapter) and suppressing them when irrelevant (locality adapter). A relevance router gates which adapter is applied, addressing the locality problem in memory-assisted editing. Evaluated on CounterFact, zsRE, and MQuAKE benchmarks using Llama-3.1-8B-Instruct and Qwen3-8B, the method achieves best-in-class probability-preference accuracy across all three datasets. Ablations show the gain comes from the architectural separation rather than increased parameter capacity.

Evaluation and Benchmarking Alignment and RLHF BGE Llama3-8B-Instruct Qwen3-4B +4 more

5 · arXiv · cs.CL · 25d ago

Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

This paper investigates multi-objective prompt optimization for LLM-as-judge systems, testing five decomposition modes of textual gradient optimizers across varying levels of cross-task information sharing. In 6 of 10 configurations, optimization fails to improve over the initial prompt, with gradient specificity dropping 59% when multiple criteria are processed jointly. The authors identify two separable failure modes: gradient dilution at optimization time and instruction interference at inference time. These findings constrain the design space for customizing LLM judges via textual feedback across multiple evaluation criteria simultaneously.

Evaluation and Benchmarking Agent and Tool Ecosystem MGDA Multi-Task Learning LLM-as-a-Judge +4 more

5 · arXiv · cs.CL · 3d ago

Study of security and privacy prompts in the wild reveals LLM response quality gaps and inconsistency

Researchers analyzed 14,727 security and privacy (S&P) prompts drawn from WildChat's 3.2M real user-LLM conversations, categorizing them into nine topic areas and evaluating response quality across 270 advice-seeking prompts. Commercial models substantially outperformed open-weight models (GPT achieving 98% 'good enough' responses vs. Llama 4 at 47%), but even high-performing commercial models showed inconsistent responses across repeated runs of the same prompt. The study is the first to analyze real user S&P queries to LLMs rather than expert-authored test sets, surfacing both a capability gap and a reliability concern.

Evaluation and Benchmarking AI Safety Research Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond WildChat Llama +1 more

5 · arXiv · cs.CL · 4d ago

RL-trained LLMs learn retriever-specific query formulation strategies for RAG

A new arXiv paper presents the first systematic study of using reinforcement learning to teach LLMs to adapt query formulation strategies to different retrieval backends. The authors find that different retrievers have surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), making cross-retriever strategy transfer ineffective. They introduce a branching-based rollout technique to stabilize training over multi-step retrieval trajectories and show gains from retriever-specific human guidance and model scaling.

Evaluation and Benchmarking Agent and Tool Ecosystem Understanding the Behaviors of Environment-aware Information Retrieval LCO-Embedding

4 · arXiv · cs.AI · 1mo ago

Structured Prompt Checklists Outperform Raw and Clarifying-Question Prompts Across LLMs

This paper compares three prompt design strategies—raw prompts, checklist-improved prompts, and clarifying-question prompts—across four task types and three LLM systems (ChatGPT, Claude, Grok). Checklist-improved prompts achieved the highest mean rubric score (7.50/8) versus 5.67 for raw and 6.67 for clarifying-question prompts. Checklist prompts also used fewer tokens on average, suggesting a favorable quality-effort tradeoff. The study provides empirical grounding for structured prompt engineering as a practical technique to reduce multi-turn interaction overhead.

Agent and Tool Ecosystem clarifying-question prompting ChatGPT Grok +2 more

5 · arXiv · cs.CL · 10d ago

RL-based alignment improves interactivity in full-duplex spoken dialogue models

Researchers propose a post-training alignment method using reinforcement learning to improve interactivity in full-duplex spoken dialogue models, which can listen and speak simultaneously. The method addresses four canonical axes of interactivity—pause handling, turn-taking, backchanneling, and user interruption—each with axis-specific reward functions, plus an LLM-based reward to prevent semantic degradation. The approach is applied to two open-source models, Moshi and PersonaPlex, showing consistent improvements in both offline and real-time multi-turn evaluation.

Alignment and RLHF Multimodal Progress Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models PersonaPlex Moshi