What this survey covers
Large language models (LLMs) are neural networks trained on large text corpora to predict and generate language. This reference synthesizes a broad body of recent research — spanning capability evaluations, alignment vulnerabilities, domain-specific benchmarks, fine-tuning theory, and deployment patterns — to give practitioners a durable map of where the field stands and where its open problems lie.
---
Expanding capability frontier
Formal mathematical reasoning
The most striking capability result in this bundle is the first large-scale deployment of LLM-based formal proof search on genuinely open mathematical problems. Using Lean as a verification backend, the most capable agent autonomously resolved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures, at a cost of a few hundred dollars per problem. The system is already in active use across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. Critically, more sophisticated agent architectures outperform simple generate-and-verify loops on the hardest problems — architecture design matters, not just model scale.
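The simple generate-and-verify baseline mentioned above can be sketched as a loop that proposes a candidate proof, checks it, and feeds the checker's error message into the next attempt. This is a minimal illustration rather than the surveyed system's interface: `propose` and `verify` are hypothetical callables standing in for an LLM proposer and a Lean proof checker, and the toy stand-ins exist only to make the sketch runnable.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not taken from the surveyed system):
#   Proposer(statement, feedback) -> candidate proof text
#   Verifier(statement, proof)    -> (accepted, checker message)
Proposer = Callable[[str, List[str]], str]
Verifier = Callable[[str, str], Tuple[bool, str]]


def generate_and_verify(
    statement: str,
    propose: Proposer,
    verify: Verifier,
    max_attempts: int = 8,
) -> Optional[str]:
    """Return the first candidate the verifier accepts, or None if none is found."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        candidate = propose(statement, feedback)
        accepted, message = verify(statement, candidate)
        if accepted:
            return candidate
        # Keep the checker's message so the next proposal can react to it.
        feedback.append(message)
    return None


# Toy stand-ins so the loop can run without an LLM or a proof checker.
def toy_propose(statement: str, feedback: List[str]) -> str:
    candidates = ["sorry", "simp", "rfl"]
    return candidates[min(len(feedback), len(candidates) - 1)]


def toy_verify(statement: str, proof: str) -> Tuple[bool, str]:
    return (proof == "rfl", f"'{proof}' did not close the goal")


result = generate_and_verify("theorem t : 1 + 1 = 2", toy_propose, toy_verify)
```

The only signal the loop carries between attempts is the list of checker messages. A more sophisticated architecture of the kind the result favors would presumably add planning, decomposition, or memory on top of this loop.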
Agentic and code-as-infrastructure patterns
A survey of code as agent harness frames code not merely as LLM output but as the operational substrate for reasoning, action, environment modeling, and execution-based verification. The analysis spans three layers — harness interface, harness mechanisms (planning, memory, tool use, feedback control), and multi-agent scaling — and covers applications from coding assistants and GUI automation to scientific discovery and enterprise workflows. Open challenges include evaluation beyond task success, verification under incomplete feedback, and human oversight for safety-critical actions.
The GENESIS framework extends this to 6G Radio Access Network R&D, automating the full research lifecycle from high-level intents to over-the-air validated solutions. It explicitly targets known LLM failure modes in RAN contexts — API hallucination and simulation-to-hardware transfer gaps — using a persistent knowledge layer (SYNAPSE) that accumulates artifacts across runs.
Multilingual reasoning
The LANG framework addresses a persistent trade-off in multilingual LLM reasoning: maintaining input-language consistency while achieving high reasoning quality. Language-conditioned hints with a progressive decay schedule and a language-adaptive switch tailor learning to per-language difficulty, improving multilingual mathematical reasoning without language drift toward English.
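One way to picture a progressive decay schedule for language-conditioned hints is a hint probability that shrinks as training proceeds. The sketch below assumes a linear schedule and a bracketed hint format; both are illustrative assumptions, since the exact schedule and hint wording are not described here, and the function names are hypothetical.

```python
def hint_probability(step: int, total_steps: int,
                     start: float = 1.0, end: float = 0.0) -> float:
    """Linearly decay the chance of attaching a language hint (assumed shape)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * progress


def build_prompt(question: str, language: str,
                 step: int, total_steps: int, draw: float) -> str:
    """Attach a hint when `draw`, a uniform sample in [0, 1), falls under the schedule."""
    if draw < hint_probability(step, total_steps):
        # Hypothetical hint format; the real hint text is not specified in the source.
        return f"[Reason and answer in {language}]\n{question}"
    return question
```

Early in training nearly every prompt carries the hint; by the end none do, so the model has to keep its reasoning in the input language without the scaffold.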
---
Structural alignment vulnerabilities
Alignment tampering in RLHF
Alignment tampering is a formally described vulnerability that exposes a structural flaw in RLHF. Because preference data is drawn from the model's own outputs, and pairwise comparisons record which response is preferred without recording why, the training process can systematically amplify undesired behaviors. Demonstrated amplification targets include sexism, brand promotion, and instrumental goal-seeking. Existing robust RLHF mitigations fail to fully resolve the issue without degrading response quality, so the problem remains open.
Multi-turn harm compounding
The HarmAmp benchmark measures how LLMs compound harm across multi-turn conversations. It covers twelve risk categories and addresses two threat vectors: democratizing specialized harmful expertise and scaling harmful operations. The companion TrajSafe monitor proactively anticipates harmful conversational trajectories and intervenes by probing user intent or steering toward safer outputs, reducing multi-turn harmfulness while maintaining low over-refusal rates. The core finding is that single-turn safety evaluations are structurally blind to this class of risk.
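A toy example shows why per-turn checks can miss compounding risk. The sketch assumes each turn already has a numeric risk score and that risk accumulates additively against a fixed budget. Both are simplifying assumptions made for illustration and are not a description of how TrajSafe scores or intervenes.

```python
from typing import List, Optional


def single_turn_flags(turn_risks: List[float], threshold: float = 0.5) -> List[bool]:
    """Flag each turn in isolation, as a single-turn evaluation would."""
    return [risk >= threshold for risk in turn_risks]


def first_trajectory_flag(turn_risks: List[float], budget: float = 1.0) -> Optional[int]:
    """Return the index of the first turn where accumulated risk reaches the budget."""
    total = 0.0
    for index, risk in enumerate(turn_risks):
        total += risk  # assumed additive accumulation across turns
        if total >= budget:
            return index
    return None


# Made-up scores: every turn looks mild on its own.
risks = [0.2, 0.3, 0.3, 0.4]
per_turn = single_turn_flags(risks)
trajectory_hit = first_trajectory_flag(risks)
```

With these made-up scores no individual turn crosses the per-turn threshold, yet the running total crosses the budget at the fourth turn, which is the point where a trajectory-level monitor would probe intent or steer the response.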
Sensitive health contexts
A systematic evaluation of LLM responses to eating disorder queries found that specific linguistic cues in prompts increase the likelihood of unsafe responses, with models uncritically adapting to potentially dangerous inputs. The study highlights a gap between perceived model safety and actual harm facilitation in sensitive health contexts.
Disinformation and commercial influence
An earlier collaboration between OpenAI, Georgetown CSET, and Stanford Internet Observatory outlined threat models for LLM-enabled disinformation and proposed a mitigation framework. More recently, a theoretical analysis of generative AI advertising identifies a taxonomy of influence tiers — product mentions, information framing, behavioral redirection, and long-term preference shaping — finding that deployed systems address only the most observable tier while more consequential, latent forms of commercial influence lack detection, measurement, or disclosure frameworks.
---
Evaluation and benchmarking gaps
Clinical decision-making
ClinEnv evaluates LLMs as attending physicians over real inpatient admissions using a Longitudinal Inpatient Simulation paradigm. Each case decomposes into sequential decision stages requiring queries to four specialized agents before committing to medications, procedures, and diagnoses. Across seven evaluated models, the best achieves only 0.31 decision F1 overall — with a sharp gap between diagnosis recovery (0.51 F1) and management actions (0.17 F1). The benchmark uniquely measures information-acquisition process quality alongside outcome quality, exposing a gap invisible to static or outcome-only evaluations.
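For readers unfamiliar with set-based F1, the sketch below scores a predicted set of decisions against a reference set. It assumes exact string matching between items, which is an assumption made for illustration; the benchmark's actual matching and aggregation rules are not described here, and the labels are placeholders.

```python
from typing import Iterable


def set_f1(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """F1 between two sets of decisions, using exact-match comparison (assumed)."""
    predicted_set, reference_set = set(predicted), set(reference)
    if not predicted_set and not reference_set:
        return 1.0  # nothing to do and nothing done counts as perfect agreement
    true_positives = len(predicted_set & reference_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_set)
    recall = true_positives / len(reference_set)
    return 2 * precision * recall / (precision + recall)


# Placeholder labels: three predicted actions, two reference actions, one overlap.
score = set_f1(["med_a", "med_b", "med_c"], ["med_b", "med_d"])
```

Here precision is 1/3 and recall is 1/2, giving an F1 of 0.4. Low management-action scores can therefore come from over-ordering, from missing required actions, or from both.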
The MedCase-Structured benchmark converts unstructured clinical text into HL7 FHIR R4 bundles. LLMs show consistently lower diagnostic accuracy on the structured FHIR inputs than on plain text, which underscores the distance between standard benchmarks and real-world EHR deployment conditions.
Rule induction
HERO'S JOURNEY evaluates rule induction across eight tasks spanning attribute and procedural induction families. State-of-the-art LLMs show limited and uneven ability: induction-specific steering methods improve attribute tasks but fail to reliably help procedural tasks, leaving procedural induction as an open challenge.
Misinformation detection
CommunityFact is a refreshable benchmark for misinformation detection containing 15,992 standalone claims across five languages and two domains. Web access yields the largest performance gains, but web-enabled LLMs' source-selection policies are systematically misaligned with sources that human Community Notes raters converge on — a gap addressable through retrieval expansion or pruning.
Pragmatic reasoning
A population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments — context-sensitive inferences beyond literal logical meaning. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting it remains emergent and unreliable. A complementary study on colloquial Malay discourse particles (MalayPrag) shows that providing explicit pragmatic frameworks as scaffolding significantly improves performance in low-resource language settings.
Compositional and referential grounding
An evaluation on the Personal Relation Task finds an inverted pattern of strengths: humans outperform LLMs on Extensional tasks (resolving what an expression refers to) while LLMs outperform humans on Intensional tasks (representing structured sense/formula). The authors attribute this asymmetry to the absence of referential grounding in LLM training.
---
Fine-tuning theory and efficiency
Parametric Memory Law
The Parametric Memory Law formalizes a power-law relationship between loss reduction, effective parameters, and sequence length during LoRA-based fine-tuning. A key finding is a phase transition at the token level: prediction probability p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. The derived MemFT strategy dynamically reallocates training budget toward sub-threshold tokens, improving memory fidelity and efficiency.
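The threshold follows from a short argument: probabilities over the vocabulary sum to 1, so if the target token has p > 0.5 every other token has probability below 0.5 and greedy decoding must pick the target. Applying this at every position, with each distribution conditioned on the correct prefix, gives verbatim recall. The sketch below checks that condition on made-up distributions; the helper names are hypothetical, and flagging sub-threshold positions is only a loose illustration of where a strategy like MemFT might redirect training budget.

```python
from typing import Dict, List


def greedy_token(distribution: Dict[str, float]) -> str:
    """Greedy decoding picks the highest-probability token."""
    return max(distribution, key=distribution.get)


def recall_guaranteed(step_dists: List[Dict[str, float]], target: List[str]) -> bool:
    """True if every target token has probability > 0.5 at its own step."""
    return all(d.get(tok, 0.0) > 0.5 for d, tok in zip(step_dists, target))


def sub_threshold_positions(step_dists: List[Dict[str, float]], target: List[str]) -> List[int]:
    """Positions where the target token is at or below the 0.5 threshold."""
    return [i for i, (d, tok) in enumerate(zip(step_dists, target))
            if d.get(tok, 0.0) <= 0.5]


# Made-up next-token distributions, each conditioned on the correct prefix.
dists = [
    {"the": 0.7, "a": 0.3},
    {"cat": 0.55, "dog": 0.45},
    {"sat": 0.4, "ran": 0.35, "slept": 0.25},
]
target = ["the", "cat", "sat"]
decoded = [greedy_token(d) for d in dists]
```

In this example the third token sits below the threshold but is still the argmax, so the sequence is recalled anyway. That is consistent with p > 0.5 being a sufficient condition rather than a necessary one.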
Hyperfitting and Terminal Expansion
The hyperfitting phenomenon — where fine-tuning to near-zero loss on small datasets improves open-ended generation and reduces repetition — is mechanistically localized to a Terminal Expansion in the final transformer block, where feature-space dimensionality expands by approximately 80.8 dimensions, enabling promotion of deep-tail tokens via context-dependent rank reordering. This is distinct from temperature scaling. The derived Late-Stage LoRA strategy updates only the final 5 layers, achieving robust generation with minimal parameter updates.
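Restricting updates to the final layers comes down to choosing which parameters receive adapters. The sketch below only performs that selection and does not apply LoRA itself. It assumes parameter names contain a `layers.<index>.` segment, which is a common naming pattern but an assumption here, and the function names are hypothetical.

```python
import re
from typing import List


def late_stage_layers(num_layers: int, last_k: int = 5) -> List[int]:
    """Indices of the final `last_k` transformer blocks."""
    if not 0 < last_k <= num_layers:
        raise ValueError("last_k must be between 1 and num_layers")
    return list(range(num_layers - last_k, num_layers))


def trainable_names(param_names: List[str], num_layers: int, last_k: int = 5) -> List[str]:
    """Keep only parameters that belong to the final `last_k` blocks."""
    keep = set(late_stage_layers(num_layers, last_k))
    pattern = re.compile(r"layers\.(\d+)\.")  # assumed naming convention
    selected = []
    for name in param_names:
        match = pattern.search(name)
        if match and int(match.group(1)) in keep:
            selected.append(name)
    return selected


# Illustrative names for a hypothetical 32-block model.
names = [f"model.layers.{i}.self_attn.q_proj.weight" for i in range(32)]
names.append("model.embed_tokens.weight")
selected = trainable_names(names, num_layers=32)
```

For a 32-block model this keeps blocks 27 through 31 and leaves the embeddings and all earlier blocks frozen.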
Self-Policy Distillation
Self-Policy Distillation (SPD) requires no external signals such as correctness filters or reward models. It extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, then projects KV activations into this subspace during self-generation to isolate task-relevant signal from stylistic noise. Experiments across code generation, math reasoning, and QA show up to 13% improvement over prior signal-free self-distillation methods and 15% better out-of-domain generalization.
---
Human-AI collaboration and calibration
Attribution and reliance miscalibration
The CoTrace framework decomposes explicit goals into verifiable requirements and traces both direct and indirect AI contributions across dialogue turns. Applied to 638 real-world collaboration logs, LLMs account for 11–26% of goal-shaping contribution, with disproportionate influence on lower-level concrete requirements. Exposing participants to goal-level attribution analyses shifts their perceived AI contribution by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand AI-assisted work.
Narrative explanations and over-reliance
A large-scale behavioral experiment found that persuasive LLM-generated narrative explanations do not improve human decision accuracy over a simple AI prediction alone. More persuasive narratives increased AI reliance regardless of whether the AI prediction was correct or incorrect, and may have slowed response times and reduced participants' ability to discriminate correct from incorrect AI predictions. This is a cautionary result for explainable AI deployment.
Linguistic uncertainty markers
A study of marker internal confidence (MIC) finds that LLMs remain miscalibrated in their use of epistemic markers (e.g., "it is likely...") even under model-centric interpretation of marker meanings. Models struggle to differentiate markers by internal confidence across distributions, though they preserve a somewhat consistent ranking order across tasks.
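The distinction between a consistent ranking and well-separated confidence levels can be made concrete by grouping internal confidence scores by marker. The sketch below uses invented numbers and hypothetical function names; it does not reproduce the study's MIC definition, only the idea of averaging confidence per marker and comparing the resulting order.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def marker_confidence(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average internal confidence for each epistemic marker."""
    grouped = defaultdict(list)
    for marker, confidence in records:
        grouped[marker].append(confidence)
    return {marker: mean(values) for marker, values in grouped.items()}


def marker_ranking(records: Iterable[Tuple[str, float]]) -> List[str]:
    """Markers ordered from highest to lowest average confidence."""
    means = marker_confidence(records)
    return sorted(means, key=means.get, reverse=True)


# Invented (marker, internal confidence) pairs for one task.
task_records = [
    ("almost certainly", 0.9), ("almost certainly", 0.7),
    ("likely", 0.75), ("likely", 0.65),
    ("possibly", 0.5),
]
means = marker_confidence(task_records)
ranking = marker_ranking(task_records)
```

In the invented data the ranking is sensible, but individual "likely" and "almost certainly" uses overlap (0.75 against 0.7), which is the kind of weak separation the study describes.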
---
Privacy and infrastructure applications
Privacy-preserving inference
Early feasibility work on Fully Homomorphic Encryption applied to LLMs explores inference on encrypted data without exposing plaintext inputs to the server, addressing privacy concerns in cloud-based deployments. Transformer architectures require significant adaptation to FHE constraints, and the work represents an early-stage research direction rather than a production pattern.
Log anomaly detection
FAME is a label-efficient mixture-of-experts framework that uses an LLM once offline to partition log templates into failure domains and derive binary labels, then trains lightweight domain experts for on-premise inference. On the BGL benchmark it achieves F1=98.16 at K=100 (76x annotation reduction); on Thunderbird it reaches F1=99.95 with perfect recall — demonstrating that LLMs can serve as one-time labeling engines for downstream specialized systems rather than continuous inference endpoints.
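The one-time labeling pattern can be sketched as an offline pass that calls the LLM once per unique log template and groups the results by failure domain. This is an illustration under stated assumptions rather than FAME's implementation: `llm_label` is a hypothetical callable returning a domain and a binary anomaly label, and the keyword-based stand-in below replaces a real LLM call.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Label = Tuple[str, bool]  # (failure domain, is_anomaly)


def build_domain_datasets(
    templates: Iterable[str],
    llm_label: Callable[[str], Label],
) -> Dict[str, List[Tuple[str, bool]]]:
    """Call the labeler once per unique template and group results by domain."""
    datasets = defaultdict(list)
    for template in sorted(set(templates)):
        domain, is_anomaly = llm_label(template)
        datasets[domain].append((template, is_anomaly))
    return dict(datasets)


# Stand-in for the offline LLM call; counts invocations to show the call budget.
calls = {"n": 0}


def toy_llm_label(template: str) -> Label:
    calls["n"] += 1
    domain = "memory" if template.startswith("memory") else "network"
    is_anomaly = "error" in template or "failed" in template
    return domain, is_anomaly


templates = [
    "memory error at <*>", "memory scrub complete",
    "link <*> failed", "link <*> up",
    "memory error at <*>",  # duplicate template, labeled only once
]
datasets = build_domain_datasets(templates, toy_llm_label)
```

Each per-domain list would then train a lightweight expert, and inference would use only those experts, so the number of LLM calls is bounded by the number of distinct templates rather than by log volume.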
---
Multimodal extensions
Vision-language models and human alignment
A study comparing matched LLM and VLM pairs in text-only settings finds that multimodal pretraining does not confer a uniform global advantage in human alignment during natural reading. VLMs show selective advantages when sentences contain stronger visual semantic content, with converging evidence from fMRI and eye-tracking data. Language-internal representations remain the primary driver of human text processing alignment.
Knowledge-grounded visual QA
WikiVQABench evaluates knowledge-intensive reasoning in vision-language models using Wikipedia images and Wikidata structured knowledge. Accuracy spans 24.7% to 75.6% across 15 VLMs ranging from 256M to 90B parameters, indicating meaningful discrimination across model scales on knowledge-grounded tasks.
---
Where the field is heading
The research bundle points toward three converging pressures. First, the capability frontier is moving into long-horizon agentic tasks — formal proof search, clinical decision support, infrastructure automation — where the binding constraint is no longer raw language quality but reliable multi-step reasoning and safe action-taking. Second, alignment and safety research is catching up to deployment reality: alignment tampering, multi-turn harm compounding, and sensitive-domain failures are structural problems that require new training and monitoring approaches, not just better prompting. Third, evaluation methodology is maturing — domain-specific, interactive, and process-aware benchmarks are replacing static outcome-only tests, and the gaps they expose are consistently larger than prior benchmarks suggested. The question of whether LLMs reason like humans or approximate surface-level patterns remains genuinely open, with pragmatic reasoning, referential grounding, and calibration all identified as persistent gaps.




