4 · arXiv cs.AI (Artificial Intelligence) · 5d ago

WorkflowView: LLM-based framework abstracts user action logs into interpretable workflows

Researchers introduce WorkflowView, a framework that uses LLMs to convert low-level interaction logs into high-level activity descriptions across diverse domains. The system achieves strong results on three tasks: zero-shot browser log reconstruction (semantic similarity 0.91), few-shot MOOC dropout prediction (F1 = 0.90 with five examples), and privacy-preserving analysis of AI tool usage in Microsoft Word. The work addresses limitations of prior deep-learning clustering approaches, which struggled with noise and cross-application generalization, and discusses deployment considerations including computational efficiency and privacy.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Microsoft · Microsoft Word · WorkflowView

Related guides (3)

Microsoft

Microsoft: The AI Infrastructure Giant Betting on Every Horse

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

4 · arXiv · cs.AI · 12d ago

Survey: Human-View Video Understanding with MLLMs — Watch, Remember, Reason Framework

A new arXiv survey paper proposes a unified 'human-view' framework for analyzing multimodal LLM-based video understanding, organized around three functional abilities: watching (perception), remembering (memory), and reasoning. The authors introduce a formulation characterizing video understanding systems by perceptual representations, memory states, reasoning traces, and predictions, then survey methods, datasets, and benchmarks across these dimensions. The work covers challenges including spatio-temporal perception, long-video processing, streaming understanding, and faithful reasoning, with application domains spanning egocentric, sports, medical, and narrative video.

Long Context Evolution · Multimodal Progress · Watch, Remember, Reason: Human-View Video Understanding with MLLMs

5 · arXiv · cs.CL · 47h ago

IFLLM dataset uses mouse and eye-tracking signals to improve LLM alignment via implicit feedback

Researchers introduce IFLLM, a dataset of 1,336 multi-turn interactions from 59 Mechanical Turk workers capturing mouse trajectories and webcam-derived eye gaze to study implicit user feedback for LLM alignment. A reward model trained on this implicit feedback improves text-based reward model accuracy from 55% to 64% and nearly triples relative response quality improvements when combined with DPO across eight LLMs. The work addresses the scarcity and cost of explicit preference annotations by mining behavioral signals already present in user interactions.

Evaluation and Benchmarking · Alignment and RLHF · Direct Preference Optimization (DPO) · IFLLM

5 · arXiv · cs.CL · 1mo ago

Text Analytics Evaluation Framework: Benchmarking LLMs on Social Media NLP Tasks

Researchers introduce a 470-question evaluation framework to assess LLM performance on aggregated social media text, applied to Twitter datasets across sentiment analysis, hate speech detection, and emotion recognition. Results show performance degrades substantially as input scale exceeds 500 instances, particularly for open-weights models on numerical tasks. Multi-label and target-dependent scenarios also show notable performance drops, and task complexity progressively erodes accuracy from basic semantic identification to comparison and counting operations. The findings point to architectural bottlenecks in current LLMs for rigorous quantitative analysis over large text collections.

Long Context Evolution · Evaluation and Benchmarking · Emotion Recognition · Text Analytics Evaluation Framework · X (Twitter) · +3 more

5 · arXiv · cs.AI · 4d ago

LLM vs. first-year PhD student on EconCS research: workflow study using stable menus of public goods

A preprint uses an open problem from EC 2025 as a testbed to evaluate AI-assisted research workflows in economics and computer science. The study examines how human intuition in prompts, multi-turn interaction, and LLM capability affect the results, and how the LLM compares with a first-year PhD student working on the same problem. Key findings: human intuition in prompts improves LLM 'taste', multi-turn workflows help when they encourage ambitious steps, and the LLM performs slightly below the first-year PhD student. The work contributes empirical evidence on the practical utility and limits of LLMs as research collaborators in formal theory domains.

Evaluation and Benchmarking · Stable Menus of Public Goods · EC 2025

4 · Hugging Face Blog · 1mo ago

LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?

This Hugging Face blog post introduces LAVE (LLM-Assisted Visual Evaluation), a zero-shot VQA evaluation methodology applied to the Docmatix dataset. The post investigates whether large vision-language models can perform document visual question answering without task-specific fine-tuning by leveraging LLM-based evaluation metrics. The analysis probes the gap between zero-shot and fine-tuned performance on document understanding tasks, raising questions about the continued necessity of supervised adaptation for VQA.

Evaluation and Benchmarking · Multimodal Progress · Visual Question Answering · LAVE · Hugging Face · +1 more

5 · arXiv · cs.CL · 4d ago

Semi-supervised framework scales LLM reasoning with minimal labeled data via lightweight verifier

A new arXiv preprint proposes a semi-supervised framework for training LLMs to reason with very few labeled examples, using a lightweight classifier to judge the validity of intermediate reasoning traces. An entropy-based confidence threshold filters unreliable pseudo-labels before fine-tuning. Experiments on math reasoning (Orca-Math subset) and visual QA (GQA) show accuracy comparable to using 10-15x more labeled data. The approach reduces dependence on expensive answer-level supervision by turning verification into a data-creation mechanism.

Evaluation and Benchmarking · Alignment and RLHF · GQA · Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier · Orca-Math

7 · arXiv · cs.CL · 17d ago

PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards

Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Qwen2.5-7B · GRPO · Qwen3-4B · +7 more

6 · arXiv · cs.AI · 8d ago

LLMs automate reproducibility assessments in social and behavioral sciences, outperforming human reanalysts

An arXiv preprint demonstrates that an LLM pipeline can automate reproducibility assessments of published social and behavioral science studies, recovering original effect sizes in 41% of cases (vs. 34% for human reanalysts) and reaching the same qualitative conclusion in 96% of cases (vs. 74% for humans). The study evaluated 76 published studies with predefined claims. The results suggest LLMs could serve as a scalable tool for systematic auditing of empirical research, reducing the resource demands of traditional reproducibility efforts.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Automated reproducibility assessments in the social and behavioral sciences using large language models