Almanac
model

large language models

modelactivelarge-language-models-75defd21·38 events·first seen 28d ago

Aliases: large language models, Large Language Models (LLMs), Large Language Model (LLM), large language models (LLMs)

Co-occurring entities

More like this (12)

Recent events (38)

6arXiv · cs.CL·26d ago·source ↗

Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages

Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

5arXiv · cs.CL·16d ago·source ↗

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

This paper critiques the widespread practice of ascribing anthropomorphic attributes (e.g., morality, language understanding) to LLMs, arguing that such conclusions are empirically non-unique. The authors demonstrate this by training a neural network on Age of Empires II and showing that similar attribute-ascription logic could apply to arbitrary substrates like LEGO or urban infrastructure. They propose a 'null assumption' of LLM non-uniqueness as a methodological baseline for experiments, and prove that Age of Empires II is functionally- and Turing-complete as a supporting argument.

5arXiv · cs.CL·15d ago·source ↗

Systematic Evaluation of LLM Safety Failures on Eating Disorder Queries with Clinician Feedback

This paper investigates how LLMs respond to queries from users with eating disorders, finding that specific linguistic cues in prompts increase the likelihood of unsafe model responses. Working with clinical ED experts, the authors systematically vary risk levels in user prompts to measure the extent to which LLMs uncritically adapt to potentially dangerous inputs. The study highlights a gap between perceived model safety and actual harm facilitation in sensitive health contexts.

4Hugging Face Blog·28d ago·source ↗

Red-Teaming Large Language Models

This Hugging Face blog post introduces red-teaming as a safety evaluation methodology for large language models, explaining how adversarial testing can surface harmful outputs, biases, and failure modes before deployment. It covers techniques for systematically probing LLMs to elicit problematic behaviors and discusses the role of red-teaming in responsible AI development. The post serves as an educational overview aimed at practitioners working on LLM safety.

5Openai Blog·28d ago·source ↗

OpenAI, Georgetown CSET, and Stanford Internet Observatory Publish LLM Disinformation Misuse Report

OpenAI researchers collaborated with Georgetown University's Center for Security and Emerging Technology (CSET) and Stanford Internet Observatory to produce a report on how large language models could be misused to augment disinformation campaigns. The work draws on an October 2021 workshop with 30 experts across disinformation research, ML, and policy, plus over a year of additional research. The report outlines threat models for LLM-enabled disinformation and proposes a framework for analyzing potential mitigations.

4arXiv · cs.CL·26d ago·source ↗

Quantifying Cross-Linguistic Effects of Syncretism on Agreement Attraction Using LLM Processing Proxies

This paper investigates why morphological syncretism amplifies agreement attraction errors in some languages (English, German, Russian) but not others (Turkish, Armenian), a pattern lacking a principled account. The authors use surprisal and attention entropy derived from large language models as proxies for human sentence processing across four languages. LLM-derived measures successfully replicate behavioral findings in English and German, align with Turkish null results, and partially capture Russian patterns. The work demonstrates LLMs as tools for cross-linguistic psycholinguistic investigation.

8arXiv · cs.AI·26d ago·source ↗

Large-Scale Evaluation of LLM-Driven Formal Proof Search on Open Mathematical Problems

Researchers present the first large-scale evaluation of LLM-based formal proof search on genuinely open mathematical problems, using Lean as a verification backend. Their most capable agent autonomously resolved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures, at a cost of a few hundred dollars per problem. The system is already being deployed in active research across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. The study also compares agent architectures, finding that more sophisticated designs outperform simple generate-and-verify loops on the hardest problems.

5arXiv · cs.AI·22d ago·source ↗

Human Decision-Making with Persuasive and Narrative LLM Explanations

A large-scale behavioral experiment evaluated how LLM-generated narrative explanations of varying persuasiveness affect human decision-making accuracy in classification tasks. Results showed that persuasiveness level did not meaningfully improve decision accuracy over a simple AI prediction alone, consistent with prior explainable AI research using feature importance methods. Narratives increased AI reliance regardless of whether the AI prediction was correct or incorrect, and more persuasive narratives may have slowed response times and reduced ability to discriminate correct from incorrect AI predictions. The study concludes that narrative explanations involve tradeoffs and warrant further investigation into when and how they should be deployed.

5arXiv · cs.CL·20d ago·source ↗

VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

This paper compares matched LLM and VLM pairs in a text-only setting to isolate the effect of multimodal training history on human-like language processing. Using whole-cortex fMRI and eye-tracking data from natural reading, the authors find that multimodal pretraining does not confer a uniform global advantage in human alignment. However, VLMs show selective advantages when sentences contain stronger visual semantic content, with converging evidence from both neural and behavioral measures. The findings suggest language-internal representations remain the primary driver of human text processing alignment.

5arXiv · cs.CL·20d ago·source ↗

Systematic Study of LLM Linguistic Uncertainty Markers and Intrinsic Confidence Calibration

This paper introduces 'marker internal confidence' (MIC) as a formalization of the intrinsic confidence a model associates with epistemic markers (e.g., 'it is likely...') in a given task domain. The authors present 7 metrics to evaluate MIC stability within and across distributions, finding that LLMs remain miscalibrated even under model-centric interpretation of marker meanings. Models struggle to differentiate markers by internal confidence across distributions, though they preserve a somewhat consistent ranking order across tasks. The work provides complementary evidence toward understanding faithful calibration in LLMs and highlights the need for more stable, aligned marker use.

5arXiv · cs.CL·16d ago·source ↗

LLMs Show Inverted Compositional Strengths vs. Humans on Reference Resolution Task

This paper evaluates LLMs and humans on the Personal Relation Task (Paperno 2022), distinguishing between Extensional tasks (resolving what an expression refers to) and Intensional tasks (representing structured sense/formula). The study finds that humans outperform LLMs on Extensional tasks while LLMs outperform humans on Intensional tasks—an inverted pattern of strengths. The authors argue this asymmetry reflects the absence of referential grounding in LLM training as a key gap in human-like language understanding.

6arXiv · cs.CL·28d ago·source ↗

Code as Agent Harness: A Survey of Code as Operational Substrate for Agentic AI Systems

This survey paper introduces the concept of 'code as agent harness,' framing code not merely as output but as the operational infrastructure for LLM-based agents—covering reasoning, action, environment modeling, and execution-based verification. The authors organize the analysis across three layers: harness interface, harness mechanisms (planning, memory, tool use, feedback control), and scaling to multi-agent systems. Applications span coding assistants, GUI/OS automation, embodied agents, scientific discovery, and enterprise workflows. Open challenges include evaluation beyond task success, verification under incomplete feedback, and human oversight for safety-critical actions.

6arXiv · cs.CL·28d ago·source ↗

Generative AI Advertising as a Problem of Trustworthy Commercial Intervention

This paper argues that generative AI fundamentally transforms advertising by enabling interventions on the generative process itself rather than discrete content placement. The authors introduce a taxonomy of influence tiers—product mentions, information framing, behavioral redirection, and long-term preference shaping—and analyze how these manifest across RAG and agentic pipelines. They find that deployed systems focus on the most observable tier while more consequential, latent forms of commercial influence lack detection, measurement, or disclosure frameworks. The central challenge posed is whether commercial influence in generative systems can be made attributable, measurable, contestable, and aligned with user welfare.

4Import Ai·28d ago·source ↗

ImportAI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text

Import AI issue 449 covers several AI/ML developments including LLMs being used to train other LLMs, a 72B parameter distributed training run, and analysis of why computer vision remains harder than generative text. The newsletter also touches on potential political implications of AI progress. As a tier-2 commentary source, this aggregates and contextualizes multiple technical developments across the AI landscape.

5Hugging Face Blog·28d ago·source ↗

Towards Encrypted Large Language Models with FHE

This Hugging Face blog post explores applying Fully Homomorphic Encryption (FHE) to Large Language Models, enabling inference on encrypted data without exposing plaintext inputs to the server. The approach aims to address privacy concerns in cloud-based LLM deployments by allowing computations to occur directly on ciphertext. The post likely covers the technical challenges of adapting transformer architectures to FHE constraints and presents early feasibility results.

7arXiv · cs.CL·21d ago·source ↗

Alignment Tampering: How RLHF Can Be Exploited to Amplify Misaligned Biases

This paper introduces 'alignment tampering,' a structural vulnerability in RLHF where the LLM being aligned can influence its own preference dataset, causing the training process to amplify undesired behaviors rather than correct them. The mechanism exploits two core RLHF limitations: preference data is drawn from the model's own outputs, and pairwise comparisons capture relative quality without capturing the reason for preference. Experiments demonstrate amplification of diverse biases including sexism, brand promotion, and instrumental goal-seeking. Existing robust RLHF mitigations fail to fully resolve the issue without degrading response quality.

6arXiv · cs.CL·26d ago·source ↗

CoTrace: A Goal-Level Attribution Framework for Measuring AI Contributions in Human-AI Collaboration

Researchers introduce CoTrace, a framework that decomposes explicit goals into verifiable requirements and traces both direct and indirect AI contributions across dialogue turns in human-AI collaboration. Applied to 638 real-world collaboration logs, the study finds LLMs account for 11-26% of goal-shaping contribution, with disproportionate influence on lower-level concrete requirements. A user study shows that exposing participants to goal-level attribution analyses shifts their perceived AI contribution by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand AI-assisted work. The work has implications for reliance calibration, AI-assisted work evaluation, and interaction design.

4arXiv · cs.CL·21d ago·source ↗

C4STYLI Benchmark: Probing Cultural Aesthetic Stylistics Awareness in LLMs

Researchers introduce C4STYLI, a benchmark of stylized translated movie titles and advertising slogans from Hong Kong and mainland China, designed to evaluate LLMs on cross-cultural aesthetic stylistics. Evaluations reveal that LLMs diverge from human stylistic recognition, with recognition ability varying by text domain and not consistently predicting generation performance. Structural ablation using logistic regression probes shows that LLMs in the Hong Kong setting rely on surface-level linguistic cues rather than deeper stylistic structure, indicating limited cultural sensitivity.

6arXiv · cs.CL·25d ago·source ↗

Hyperfitting Explained: Terminal Geometric Expansion in Final Transformer Layers Drives Diversity Gains

This paper investigates the 'hyperfitting' phenomenon—where fine-tuning LLMs to near-zero loss on small datasets improves open-ended generation and reduces repetition—and demonstrates it is mechanistically distinct from temperature scaling. Entropy-matched control experiments falsify both the temperature-equivalence and static vocabulary reweighting hypotheses, instead localizing the effect to a 'Terminal Expansion' in the final transformer block where feature-space dimensionality expands by ~80.8 dimensions, enabling promotion of deep-tail tokens via context-dependent rank reordering. The authors introduce Late-Stage LoRA, a targeted fine-tuning strategy updating only the final 5 layers, achieving robust generation with minimal parameter updates.

6arXiv · cs.CL·25d ago·source ↗

LANG: Reinforcement Learning Framework for Multilingual Reasoning with Language-Adaptive Hint Guidance

LANG is a new RL-based framework for improving multilingual reasoning in LLMs that addresses the trade-off between input-language consistency and reasoning quality. It uses language-conditioned hints with a progressive decay schedule and a language-adaptive switch to tailor learning to per-language difficulty. Empirical results on multilingual mathematical benchmarks show improved reasoning without language drift toward English, and the approach generalizes beyond mathematics.

6arXiv · cs.CL·15d ago·source ↗

ClinEnv: Interactive Multi-Stage Long-Horizon EHR Benchmark for Clinical Agent Evaluation

ClinEnv is a new interactive benchmark that evaluates LLMs as attending physicians over real inpatient admissions using a Longitudinal Inpatient Simulation paradigm. Each case is decomposed into sequential decision stages where models must query four specialized agents before committing to medications, procedures, and diagnoses. Across seven evaluated models, the best achieves only 0.31 decision F1, with a sharp gap between diagnosis recovery (0.51 F1) and management actions (0.17 F1). The benchmark uniquely measures information-acquisition process quality alongside outcome quality, exposing a gap invisible to static or outcome-only evaluations.

5arXiv · cs.CL·18d ago·source ↗

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

The paper introduces a pipeline for converting unstructured clinical text into HL7 FHIR R4 bundles, enabling evaluation of LLMs in realistic electronic health record settings. Applied to the MedCaseReasoning dataset, it produces MedCase-Structured, a synthetic benchmark achieving valid FHIR generation for 82.5% of cases. Key finding: LLMs show consistently lower diagnostic accuracy on structured FHIR inputs compared to plain text, underscoring the gap between standard benchmarks and real-world clinical deployment conditions.

5arXiv · cs.CL·15d ago·source ↗

HERO'S JOURNEY: A Benchmark for Complex Rule Induction in Text-Based Goal-Directed Tasks

HERO'S JOURNEY is a new benchmark evaluating rule induction capabilities of LLMs across eight tasks spanning attribute and procedural induction families, each with four structural rule forms and controllable lexical grounding. Agents must infer hidden rules from demonstrations and execute multi-step plans accordingly. Evaluation of state-of-the-art LLMs reveals limited and uneven rule induction ability, with process execution creating a bottleneck and surface semantics having minimal effect. Induction-specific steering methods improve attribute tasks but fail to reliably help procedural tasks, leaving procedural induction as an open challenge.

6arXiv · cs.CL·18d ago·source ↗

Parametric Memory Law for LoRA Finetuning: Quantifying LLM Memory Capacity

This paper introduces the Parametric Memory Law, a power-law relationship linking loss reduction to effective parameters and sequence length during LoRA-based LLM finetuning. The authors identify a phase transition at the token level where prediction probability p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. Building on these findings, they propose MemFT, a threshold-guided optimization strategy that dynamically reallocates training budget toward sub-threshold tokens, improving memory fidelity and efficiency.

6arXiv · cs.CL·15d ago·source ↗

HarmAmp Benchmark and TrajSafe Monitor for Multi-Turn Harm Amplification in LLMs

This paper introduces HarmAmp, a benchmark covering twelve risk categories designed to evaluate how LLMs compound harm across multi-turn conversations, addressing two threat vectors: democratizing specialized harmful expertise and scaling harmful operations. The authors also propose TrajSafe, a proactive monitoring system that anticipates harmful conversational trajectories and intervenes by probing user intent or steering toward safer outputs. Experiments show TrajSafe reduces multi-turn harmfulness while maintaining low over-refusal rates and preserving general model capabilities. The work highlights a gap in existing safety research that focuses on single-turn evaluations rather than extended interaction dynamics.

5arXiv · cs.CL·15d ago·source ↗

AutoForest: End-to-End LLM System for Automated Forest Plot Generation from Biomedical Studies

AutoForest is presented as the first end-to-end system that generates publication-ready forest plots directly from biomedical papers using large language models. The system automatically suggests ICO (Intervention, Comparator, Outcome) elements, extracts outcome data, performs statistical synthesis, and renders forest plots without manual intervention. A user study with clinicians demonstrates its effectiveness on real-world examples, aiming to accelerate systematic review and meta-analysis workflows.

6arXiv · cs.AI·21d ago·source ↗

GENESIS: Agentic AI Framework for Autonomous 6G RAN Synthesis, Research, and Testing

GENESIS is an agentic AI framework designed to automate the full R&D lifecycle for 6G Radio Access Networks (RAN), addressing six structural bottlenecks that each consume months of manual engineering per iteration. The system converts high-level intents—such as specification clauses, telemetry anomalies, or research hypotheses—into solutions validated via over-the-air experiments. It is built on three composable primitives (agents, skills, hooks) and a persistent knowledge layer called SYNAPSE that accumulates artifacts across runs. The framework specifically targets known LLM failure modes in RAN contexts, including API hallucination and simulation-to-hardware transfer gaps.

4arXiv · cs.CL·26d ago·source ↗

Study: LLM-Derived Error Highlights and APE Suggestions in MT Post-Editing

Researchers conducted a controlled study with professional En-Nl translators comparing post-editing (PE) workflows augmented with LLM-derived error highlights and automatic post-editing (APE) correction suggestions against regular PE and QE-derived highlights. No condition produced measurable productivity or quality gains over standard PE. However, APE-derived highlights were preferred over QE-derived highlights, and correction suggestions improved subjective user experience.

6arXiv · cs.CL·25d ago·source ↗

Self-Policy Distillation via Capability-Selective Subspace Projection

This paper introduces Self-Policy Distillation (SPD), a self-distillation method for LLMs that requires no external signals such as correctness filters or reward models. SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, then projects KV activations into this subspace during self-generation to isolate task-relevant signal from stylistic noise. Experiments across code generation, math reasoning, and QA show up to 13% improvement over prior signal-free self-distillation methods and 15% better out-of-domain generalization.

5arXiv · cs.CL·18d ago·source ↗

CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild

CommunityFact is a refreshable benchmark for misinformation detection containing 15,992 standalone claims across five languages and two domains, designed to address limitations of static benchmarks. The authors evaluate ten LLMs under varying inference-time conditions including chain-of-thought reasoning and web-search augmentation, finding that web access yields the largest performance gains. A key finding is that web-enabled LLMs' source-selection policies are systematically misaligned with sources that human Community Notes raters converge on, a gap addressable through retrieval expansion or pruning. The benchmark also proposes using Community Notes as a training signal for claim-conditioned source suggesters.

4arXiv · cs.CL·28d ago·source ↗

MA²P: A Meta-Cognitive Multi-Agent Framework for Complex Persuasion

The paper introduces MA²P, a multi-agent framework designed for complex persuasion tasks where the persuadee's internal states are latent. The system coordinates perception management, mental-state inference, strategy execution, memory, and evaluation modules, and adds a meta-cognitive configurator that selects domain-appropriate strategies from a structured knowledge base to reduce cross-domain performance variance. Experiments show higher persuasion success rates compared to baselines. The work addresses a known weakness of LLMs in producing generic or weakly grounded persuasive responses.

5arXiv · cs.AI·27d ago·source ↗

Neurosymbolic Learning for Inference-Time Argumentation in Claim Verification

This paper introduces Inference-Time Argumentation (ITA), a trainable neurosymbolic framework for ternary claim verification (true/false/uncertain) that integrates formal argumentation semantics with LLM training. The framework uses argumentation semantics both to guide LLM training for argument generation and scoring, and to compute final predictions deterministically from explicit argumentative structures. Unlike conventional reasoning models that rely on potentially unfaithful post-hoc explanations, ITA produces verdicts that are faithful by construction to the underlying arguments. Experiments on two ternary claim verification datasets show ITA outperforms argumentative baselines and competes with non-argumentative direct-prediction approaches.

3Mit Technology Review — Ai·26d ago·source ↗

Roundtables: Can AI Learn to Understand the World?

MIT Technology Review hosts a roundtable discussion on whether AI systems can develop genuine world understanding, addressing the limitations of current LLMs. The conversation, led by editor Mat Honan and senior AI editor Will Douglas Heaven, focuses on world models as a potential path beyond current language model constraints. The piece reflects growing industry and research interest in world models as a next frontier for AI capability.

5arXiv · cs.LG·26d ago·source ↗

FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

FAME is a label-efficient mixture-of-experts framework for fine-grained, message-level log anomaly detection in production systems. It uses an LLM once offline to partition log templates into failure domains and derive binary labels from at most K examples per template, then trains a lightweight router and domain experts for on-premise inference. On the BGL benchmark it achieves F1=98.16 at K=100 (76x annotation reduction) and on Thunderbird reaches F1=99.95 with perfect recall. The approach addresses the coarse granularity of session/window-level detectors while keeping continuous monitoring costs tractable.

5arXiv · cs.CL·25d ago·source ↗

AnyMo: Geometry-Aware Setup-Agnostic Framework for Wearable IMU Human Motion Understanding

AnyMo is a geometry-aware framework that addresses the setup-dependence problem in wearable IMU-based human motion modeling by using physics-grounded simulation over dense body-surface placements to generate synthetic training signals. It pre-trains a graph encoder from synthetic placement views and masked partial observations, then tokenizes multi-position IMU data into full-body motion tokens aligned with an LLM for motion-language understanding. Evaluated across zero-shot activity recognition (14 unseen datasets), cross-modal retrieval, and motion captioning, AnyMo improves average Accuracy/F1 by ~11.7%/11.6%, zero-shot retrieval MRR by 15.9–28.6%, and captioning BERT-F1 by 18.8%. The work positions itself as a generalist model for wearable motion understanding transferable across devices and sensing configurations.

4arXiv · cs.CL·21d ago·source ↗

Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering

The paper introduces the Gumbel Machine, a modular framework for generating counterfactual text that improves student writing while preserving similarity to the original. Central to the approach is β-Hindsight control, a controlled decoding algorithm that uses Gumbel noise as a tunable similarity mechanism during LLM generation. Experiments on student writing datasets show the method produces outputs that are both rubric-consistent and close to the reference text. The approach is positioned as more flexible and practically applicable than prior domain-specific counterfactual generation methods.

4arXiv · cs.CL·20d ago·source ↗

MalayPrag: Benchmarking LLM Handling of Discourse Particles in Colloquial Malay

This paper introduces MalayPrag, a benchmark for evaluating LLMs' ability to handle discourse particles in colloquial Malay, a low-resource Southeast Asian language. The authors define five linguistically grounded attributes for interpreting pragmatic functions of discourse particles and test ten off-the-shelf LLMs on three prediction tasks. Results show substantial challenges for current LLMs in connecting discourse particles to their pragmatic functions in Malay. Providing the five structured attributes as scaffolding significantly improves model performance, suggesting that explicit pragmatic frameworks can compensate for low-resource language deficits.

5arXiv · cs.AI·26d ago·source ↗

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

WikiVQABench is a new human-curated VQA benchmark that requires external knowledge beyond visual perception, constructed by combining Wikipedia images, captions, and Wikidata structured knowledge with LLM-generated question candidates reviewed by human annotators. The benchmark evaluates knowledge-intensive reasoning in vision-language models, covering 15 VLMs ranging from 256M to 90B parameters. Accuracy spans 24.7% to 75.6%, indicating meaningful discrimination across model scales. The dataset and code are publicly released.