WorkflowView: LLM-based framework abstracts user action logs into interpretable workflows
Researchers introduce WorkflowView, a framework using LLMs to convert low-level interaction logs into high-level activity descriptions across diverse domains. The system achieves strong results on three tasks: zero-shot browser log reconstruction (semantic similarity 0.91), few-shot MOOC dropout prediction (F1=0.90 with five examples), and privacy-preserving analysis of AI tool usage in Microsoft Word. The work addresses limitations of prior deep learning clustering approaches, which struggled with noise and cross-application generalization, and discusses deployment considerations including computational efficiency and privacy.
Related events (8)
Survey: Human-View Video Understanding with MLLMs — Watch, Remember, Reason Framework
A new arXiv survey paper proposes a unified 'human-view' framework for analyzing multimodal LLM-based video understanding, organized around three functional abilities: watching (perception), remembering (memory), and reasoning. The authors introduce a formulation characterizing video understanding systems by perceptual representations, memory states, reasoning traces, and predictions, then survey methods, datasets, and benchmarks across these dimensions. The work covers challenges including spatio-temporal perception, long-video processing, streaming understanding, and faithful reasoning, with application domains spanning egocentric, sports, medical, and narrative video.
IFLLM dataset uses mouse and eye-tracking signals to improve LLM alignment via implicit feedback
Researchers introduce IFLLM, a dataset of 1,336 multi-turn interactions from 59 Mechanical Turk workers, capturing mouse trajectories and webcam-derived eye gaze to study implicit user feedback for LLM alignment. A reward model trained on this implicit feedback improves text-based reward model accuracy from 55% to 64% and, when combined with DPO, nearly triples the relative improvement in response quality across eight LLMs. The work addresses the scarcity and cost of explicit preference annotations by mining behavioral signals already present in user interactions.
Text Analytics Evaluation Framework: Benchmarking LLMs on Social Media NLP Tasks
Researchers introduce a 470-question evaluation framework to assess LLM performance on aggregated social media text, applied to Twitter datasets across sentiment analysis, hate speech detection, and emotion recognition. Results show performance degrades substantially as input scale exceeds 500 instances, particularly for open-weights models on numerical tasks. Multi-label and target-dependent scenarios also show notable performance drops, and task complexity progressively erodes accuracy from basic semantic identification to comparison and counting operations. The findings point to architectural bottlenecks in current LLMs for rigorous quantitative analysis over large text collections.
LLM vs. first-year PhD student on EconCS research: workflow study using stable menus of public goods
A preprint uses an open problem from EC 2025 as a testbed to evaluate AI-assisted research workflows in economics and computer science. The study examines how three factors (human intuition in prompts, multi-turn interaction, and LLM capability) affect the results, and compares the LLM's contributions with those of a first-year PhD student. Key findings: human intuition in prompts improves LLM 'taste', multi-turn workflows help when they encourage ambitious steps, and the LLM performs slightly below the first-year PhD student on the same problem. The work contributes empirical evidence on the practical utility and limits of LLMs as research collaborators in formal theory domains.
LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?
This Hugging Face blog post introduces LAVE (LLM-Assisted Visual Evaluation), a zero-shot VQA evaluation methodology applied to the Docmatix dataset. The post investigates whether large vision-language models can perform document visual question answering without task-specific fine-tuning by leveraging LLM-based evaluation metrics. The analysis probes the gap between zero-shot and fine-tuned performance on document understanding tasks, raising questions about the continued necessity of supervised adaptation for VQA.
Semi-supervised framework scales LLM reasoning with minimal labeled data via lightweight verifier
A new arXiv preprint proposes a semi-supervised framework for training LLMs to reason with very few labeled examples, using a lightweight classifier to judge the validity of intermediate reasoning traces. An entropy-based confidence threshold filters unreliable pseudo-labels before fine-tuning. Experiments on math reasoning (Orca-Math subset) and visual QA (GQA) show accuracy comparable to using 10-15x more labeled data. The approach reduces dependence on expensive answer-level supervision by turning verification into a data-creation mechanism.
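The entropy-based filtering step can be illustrated with a minimal sketch. The summary does not give the preprint's exact rule, so everything here is an assumption for illustration: the verifier is taken to output a probability distribution per reasoning trace, entropy is measured in nats, and the names `entropy`, `filter_pseudo_labels`, and `max_entropy` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def filter_pseudo_labels(
    predictions: List[Tuple[str, Sequence[float]]],
    max_entropy: float,
) -> List[Tuple[str, int]]:
    """Keep only traces whose verifier distribution is confident enough.

    Each input item is (trace_id, class probabilities). A trace is kept
    when its entropy is at or below `max_entropy`; its pseudo-label is the
    index of the most probable class. The threshold is a hypothetical
    hyperparameter, not a value taken from the preprint.
    """
    kept = []
    for trace_id, probs in predictions:
        if entropy(probs) <= max_entropy:
            label = max(range(len(probs)), key=lambda i: probs[i])
            kept.append((trace_id, label))
    return kept


# Illustrative verifier outputs (made-up numbers, not data from the paper).
preds = [
    ("trace-a", [0.97, 0.03]),  # low entropy: confident
    ("trace-b", [0.55, 0.45]),  # high entropy: uncertain
    ("trace-c", [0.10, 0.90]),  # low entropy: confident
]
kept = filter_pseudo_labels(preds, max_entropy=0.4)
```

In this toy run the near-uniform prediction for `trace-b` is discarded, while the two confident traces would pass on to fine-tuning with their pseudo-labels.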
PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards
Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.
LLMs automate reproducibility assessments in social and behavioral sciences, outperforming human reanalysts
An arXiv preprint reports that an LLM pipeline can automate reproducibility assessments of published social and behavioral science studies, recovering original effect sizes in 41% of cases (vs. 34% for human reanalysts) and reaching the same qualitative conclusion in 96% of cases (vs. 74% for humans). The study evaluated 76 published studies with predefined claims. The results suggest LLMs could serve as a scalable tool for systematic auditing of empirical research, addressing the resource-intensive nature of traditional reproducibility efforts.


