4 · arXiv cs.CL (Computation and Language) · 11d ago

Pipeline detects curriculum knowledge gaps from student-AI conversational logs using prerequisite graphs

Researchers present a pipeline that assigns student questions posed to a conversational AI teaching assistant to curriculum topics, using a few-shot classifier grounded in a prerequisite knowledge graph extracted with GPT-4. Evaluated on 1,340 questions from 164 graduate students, the classifier achieves 80% accuracy across 43 labels. Topic-level question volume correlates significantly with student-reported difficulty (rho=0.491), which supports the claim that AI interaction logs carry actionable diagnostic signals about knowledge gaps.
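The correlation step can be illustrated with a minimal sketch. It assumes the reported rho is a Spearman rank correlation (the summary does not name the statistic), and the topic names, predicted labels, and difficulty ratings below are invented for illustration; none of them come from the paper.

```python
from collections import Counter
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical classifier output: one predicted topic per student question.
predicted_topics = ["recursion", "graphs", "recursion", "sorting",
                    "graphs", "recursion", "hashing"]
# Hypothetical survey data: mean self-reported difficulty per topic.
reported_difficulty = {"recursion": 4.2, "graphs": 3.9,
                       "sorting": 2.1, "hashing": 2.8}

volume = Counter(predicted_topics)  # question count per topic
topics = sorted(reported_difficulty)
rho = spearman_rho([volume[t] for t in topics],
                   [reported_difficulty[t] for t in topics])
print(f"rho = {rho:.3f}")
```

A rank-based statistic is a natural choice here because question counts and survey ratings sit on different scales; only their ordering across topics is compared.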

Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs · OpenAI · GPT-4

Related guides (1)

OpenAI

OpenAI: The Lab That Made AI a Household Word


Related events (8)

6 · arXiv · cs.CL · 25d ago

Peak-Then-Collapse: RLVR Tool-Use Failures on Knowledge-Graph APIs

This paper investigates RLVR-based tool-use training (GRPO on Qwen2.5-7B-Instruct) on a minimal knowledge-graph API (Freebase over Complex WebQuestions) and documents a 'peak-then-collapse' pattern where tool-grounded answer rates rise then fall to zero within 50 steps, replicated across four seeds and seven reward designs. The authors identify a key structural difference between knowledge-graph APIs and other tool types (Python, web search, JSON): sparse, non-natural-language feedback signals (e.g., empty brackets '[]') prevent the model from recovering via pretraining-familiar error signals. A direct oracle ablation shows relation selection is not the bottleneck—95.4% of errors are retrieval-composition failures—and self-distillation reaches 40% EM at 7B, with capacity scaling to 14B yielding only marginal gains, suggesting an interface-bound ceiling.

Evaluation and Benchmarking · Agent and Tool Ecosystem · RLVR · Self-Distillation · GRPO · +4 more

6 · arXiv · cs.CL · 29d ago

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · BBC News · GPT-4o mini · +11 more

6 · arXiv · cs.CL · 3d ago

ZPPO: Teacher-in-prompt training method outperforms distillation and GRPO for small vision-language models

Researchers introduce Zone of Proximal Policy Optimization (ZPPO), a training method inspired by Vygotsky's zone of proximal development that embeds teacher guidance in prompts rather than policy gradients or logit imitation. On hard questions where student rollouts fail, ZPPO constructs Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) to help the student discriminate correct from incorrect responses, with a replay buffer that recirculates hard questions until mastered. Evaluated on the Qwen3 family (0.8B–9B) with a 27B teacher across a 31-benchmark suite covering VLM, LLM, and video tasks, ZPPO outperforms both distillation and GRPO baselines, with the largest gains at the smallest model scale. The method addresses a known failure mode of RL training where zero-reward rollouts produce no gradient signal.
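The zero-reward failure mode mentioned above can be made concrete with a small sketch of group-relative advantage normalization as GRPO is commonly described: each rollout's reward is centered on the group mean and divided by the group standard deviation. This is not code from the ZPPO paper; the function name, the `eps` constant, and the use of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    eps (an assumed small constant) avoids division by zero when every
    rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where some rollouts succeed: advantages are nonzero,
# so the policy gradient has something to push on.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A hard question where every rollout fails: all advantages are zero,
# so this prompt contributes no learning signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_failed)
```

Under this formulation, a question the student never solves is invisible to the optimizer, which is the gap the summary says ZPPO targets by placing teacher guidance in the prompt.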

Open Weights Progress · Alignment and RLHF · GRPO · Proximal Policy Optimization · Qwen3 · +1 more

6 · OpenAI Blog · 1mo ago

OpenAI and the CSU System Bring ChatGPT to 500,000 Students & Faculty

OpenAI has partnered with the California State University system to deploy ChatGPT to approximately 500,000 students and faculty, described as the largest single deployment of ChatGPT to date. The initiative aims to expand AI use in higher education and develop an AI-ready workforce in the United States. No technical details about the deployment configuration or specific product tier are disclosed in the announcement.

Enterprise Deployment Patterns · California State University · ChatGPT · OpenAI

6 · arXiv · cs.CL · 19d ago

Question-Answering as Hidden State Probing for Test-Time Reasoning Intervention

This paper proposes using question-asking as an inference-time intervention to surface information about an LLM's hidden state during chain-of-thought reasoning. The authors train a probe on a student model's hidden states before and after question generation, finding it predictive of final answer correctness even before the teacher responds—suggesting self-diagnosis during question generation carries meaningful signal. They frame question-asking as a sequential decision problem with a gating policy, but find a gap between detection and recovery: interventions are as likely to harm correct trajectories as to fix incorrect ones. The results have implications for the limits of LLM self-refinement under uncertainty.

Evaluation and Benchmarking · Agent and Tool Ecosystem · student-teacher prompting · Chain-of-Thought Reasoning · inference-time intervention · +4 more

4 · Anthropic News · 19d ago

Anthropic Education Report: How Educators Use Claude in Higher Education

Anthropic analyzed ~74,000 anonymized conversations from higher education professionals on Claude.ai during May–June 2025, finding that curriculum development dominates educator AI use (57% of conversations), followed by academic research (13%) and student assessment (7%). Faculty are not only using Claude as a chatbot but also building custom interactive tools via Claude Artifacts, such as chemistry simulations and grading rubrics. The study, complemented by qualitative research with 22 Northeastern University faculty, reveals a spectrum from augmentation (lesson design, advising) to automation (routine administrative tasks), with grading being a contested and relatively rare but automation-heavy use case.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · claude.ai · Claude · O*NET · +4 more

7 · arXiv · cs.LG · 9d ago

Interpretability-based pipeline for auditing and shaping post-training learning signals

Researchers introduce a data-centric post-training pipeline that applies interpretability methods to preference datasets before optimization, surfacing latent concepts that separate preferred from dispreferred generations. The approach unifies several interpretability-based training protocols as feature or data interventions that shape reward signals. Empirically, the pipeline diagnoses undesirable signals such as sycophancy and over-stylization, mitigates off-target learning, and can amplify desired properties like safety behaviors and model personality. The work reframes post-training from opaque scalar reward optimization into an auditable, concept-level sculpting process.

Evaluation and Benchmarking · AI Safety Research · Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal · +1 more

10 · OpenAI Blog · 1mo ago

Introducing ChatGPT

OpenAI announced ChatGPT, a conversational model trained to engage in dialogue, answer follow-up questions, acknowledge errors, challenge incorrect premises, and decline inappropriate requests. The model's dialogue format represented a significant step in making large language models accessible and interactive for general users. This November 2022 launch marked a pivotal moment in public AI adoption.

Frontier Model Releases · Enterprise Deployment Patterns · ChatGPT · OpenAI · +2 more