Almanac
4 · arXiv · cs.AI (Artificial Intelligence) · 31h ago

Study finds GitHub Copilot dialogue accuracy low for HIPAA compliance NFR assessment despite high developer agreement

A controlled study with 49 programmers using GitHub Copilot to assess 148 HIPAA-derived non-functional requirements (NFRs) against a real codebase finds that developers tend to agree with the LLM's assessments, even though the accuracy of those assessments against expert ground truth is low. The paper evaluates multi-turn dialogue quality along three dimensions: requirement satisfaction, reasoning, and code localization. User satisfaction modeling reveals that longer responses and more information-providing turns hurt satisfaction, while proactive interactions help. The work highlights a gap in current LLM evaluation benchmarks, which focus on functional correctness and single-turn accuracy rather than multi-turn NFR assessment.
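The gap between agreement and accuracy can be illustrated with a minimal sketch. The labels below are invented toy data, not results from the paper, and the helper name `match_rate` is hypothetical; the point is only that developer-LLM agreement and LLM-expert accuracy are measured against different references and can diverge.

```python
def match_rate(pairs):
    """Fraction of (a, b) pairs whose two labels are identical."""
    pairs = list(pairs)
    return sum(a == b for a, b in pairs) / len(pairs)

# Toy verdicts for four requirements (illustrative only, not study data).
llm       = ["satisfied", "satisfied", "not_satisfied", "satisfied"]
developer = ["satisfied", "satisfied", "not_satisfied", "not_satisfied"]
expert    = ["not_satisfied", "satisfied", "satisfied", "not_satisfied"]

# Agreement: how often the developer accepts the LLM's verdict.
agreement = match_rate(zip(developer, llm))
# Accuracy: how often the LLM's verdict matches expert ground truth.
accuracy = match_rate(zip(llm, expert))

print(f"agreement={agreement:.2f} accuracy={accuracy:.2f}")
```

In this toy case the developer agrees with the LLM on three of four requirements, while the LLM matches the expert on only one.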

Related events (8)

5 · arXiv · cs.CL · 15d ago

NCRE-based benchmark reveals frontier LLMs top out at 68.8% on professional Office automation tasks

Researchers introduce an evaluation suite derived from China's National Computer Rank Examination (NCRE), comprising 200 practical tasks across Word, Excel, and PowerPoint scored via 7,118 machine-gradable criteria. Seven frontier LLMs are benchmarked: single-turn models peak at 36.6% Score Rate, while a full agentic system with execution feedback and iterative repair reaches 68.8%, still well below the 95.5% community-reference score. The results demonstrate that fine-grained, long-horizon Office document automation remains a significant unsolved challenge for current LLM and agent systems despite strong code-generation capabilities.

6 · arXiv · cs.AI · 8d ago

Empirical study finds 80% of AI agent-authored test patches lack meaningful verification logic

A large-scale empirical study of 86,156 test-file patches from 33,596 agent-authored GitHub PRs finds that 80.2% contain weak or no explicit oracle signals, meaning they execute code without verifying its behavior. The study covers five coding agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code) across 2,807 repositories and introduces a syntactic taxonomy of eight oracle signal categories. Although patches with strong oracles show lower raw merge rates, regression analysis indicates that strong oracles significantly improve merge likelihood (OR=1.28), suggesting that quality gates based only on test-file presence substantially overestimate verification strength.

5 · arXiv · cs.CL · 8d ago

Study of security and privacy prompts in the wild reveals LLM response quality gaps and inconsistency

Researchers analyzed 14,727 security and privacy (S&P) prompts drawn from WildChat's 3.2M real user-LLM conversations, categorizing them into nine topic areas and evaluating response quality across 270 advice-seeking prompts. Commercial models substantially outperformed open-weight models (GPT achieving 98% 'good enough' responses vs. Llama 4 at 47%), but even high-performing commercial models showed inconsistent responses across repeated runs of the same prompt. The study is the first to analyze real user S&P queries to LLMs rather than expert-authored test sets, surfacing both a capability gap and a reliability concern.

4 · Hugging Face Blog · 1mo ago

Personal Copilot: Train Your Own Coding Assistant

This Hugging Face blog post walks through fine-tuning an open-weights code model to create a personalized coding assistant. It covers dataset preparation, training techniques (likely LoRA/PEFT), and deployment considerations for self-hosted code completion. The post targets practitioners who want a GitHub Copilot-like experience without relying on proprietary APIs.

6 · arXiv · cs.CL · 28d ago

Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

This paper introduces a large, consensus-labeled benchmark of 6,675 prompts drawn from eight existing corpora (ASTRA, CySecBench, AdvBench, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) to evaluate whether coding-specialized LLMs refuse malicious requests. A key contribution is the distinction between requests for executable malicious code (4,748 prompts) versus harmful security knowledge (1,923 prompts), arguing that coding models should face a stricter refusal standard given their outputs can be directly weaponized. A five-judge consensus protocol achieves Fleiss' kappa of 0.767, providing a reliability-quantified substrate for cross-corpus compliance measurement that the field has previously lacked.
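For readers unfamiliar with the agreement statistic, Fleiss' kappa can be computed from a table of per-item category counts. The sketch below uses the standard textbook formula; the function name `fleiss_kappa` and the toy rating tables are illustrative assumptions and do not reproduce the paper's data or its five-judge protocol.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j. Every row must sum to the
    same number of raters n (n >= 2)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Per-item observed agreement P_i.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement P_e from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in proportions)

    return (p_bar - p_e) / (1 - p_e)

# Toy tables: five raters, two categories (e.g. malicious vs. benign).
perfect = [[5, 0], [0, 5]]   # all raters agree on every item
split   = [[3, 2], [2, 3]]   # raters split 3-2 on every item

print(fleiss_kappa(perfect), fleiss_kappa(split))
```

Values near 1 indicate agreement well above chance, while values at or below 0 indicate agreement no better than chance.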

6 · Latent Space · 22d ago

GitHub's plan for agentic coding — Kyle Daigle interview on Latent Space

Latent Space interviews Kyle Daigle of GitHub about the company's strategy for agentic coding workflows and the platform pressures created by the explosion in AI-assisted development following Copilot. The discussion covers how GitHub is adapting its infrastructure and product direction to support agents operating at scale. This is a strategic signal from one of the most central platforms in the developer AI ecosystem.

5 · arXiv · cs.LG · 8d ago

ReproRepo: Scalable LLM agent framework for reproducibility auditing using GitHub issues

ReproRepo is a new framework for evaluating LLM agents on reproducibility auditing of ML research, using naturally occurring GitHub issues as supervision signals rather than costly manual curation. The framework is instantiated on 1,149 recent ML papers from major conferences and benchmarks four frontier model-agent configurations. The best-performing agent (Codex with GPT-5.5) surfaces at least one semantically related human-reported reproduction blocker for ~90% of papers, though exact localization of issues remains a weakness. The work provides a reusable, scalable evaluation harness for this underexplored agentic task.

5 · arXiv · cs.CL · 19d ago

Pre-registered study finds Popperian code-generation prompt skills add no benefit beyond structural scaffolding

A pre-registered two-tier ablation study tests whether 'Popperian falsificationist' prompt skills improve LLM code generation through their procedural content or merely through structural scaffolding. Using Claude Sonnet 4.6 and Qwen2.5-Coder-0.5B with execution-based evaluation (HumanEval+ unit tests) rather than LLM-as-judge, the authors find that on the small model, structured prompts lift correctness by 20-22 points but the full Popperian skill shows no separable benefit over a labels-only scaffold. The paper contributes a calibrated negative result and a reusable disambiguation protocol for evaluating prompt-skill families, while also documenting that LLM self-judges at 0.5B scale perform no better than random selection.