Almanac
4 · arXiv cs.LG (Machine Learning) · 11d ago

Constrained LLM interface for multi-physics finite element simulations in FEniCS

Researchers present a natural-language interface for FEniCS finite element simulations that deliberately constrains LLM involvement to front-end parsing and geometry generation, while a deterministic dispatcher routes validated specifications to human-written solver templates. The system achieves a 100% final valid parse rate on a 15-prompt benchmark and a 90% success rate on custom geometry generation. Validation against analytical solutions shows sub-percent agreement for smooth cases and 2-5% agreement for harder nonlinear cases. The architecture is positioned as a reliability-focused alternative to open-ended LLM code generation for scientific computing.
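The split between LLM parsing and deterministic dispatch can be illustrated with a minimal Python sketch. This is not the paper's implementation: the spec fields (`physics`, `mesh_resolution`, `parameters`), the supported physics names, and the template functions are all hypothetical, and the templates return a description string instead of calling FEniCS. The only point shown is that the LLM's output is treated as untrusted data, validated against a fixed schema, and then routed through a lookup table to human-written code.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class SimulationSpec:
    """Validated simulation request (hypothetical schema)."""
    physics: str
    mesh_resolution: int
    parameters: Dict[str, float]


# Hypothetical required parameters for each supported physics type.
REQUIRED_PARAMETERS = {
    "poisson": {"source_term"},
    "linear_elasticity": {"youngs_modulus", "poissons_ratio"},
}


def validate_spec(raw: dict) -> SimulationSpec:
    """Reject anything the LLM parsed that falls outside the fixed schema."""
    physics = raw.get("physics")
    if not isinstance(physics, str) or physics not in REQUIRED_PARAMETERS:
        raise ValueError(f"unsupported physics: {physics!r}")
    resolution = raw.get("mesh_resolution")
    if isinstance(resolution, bool) or not isinstance(resolution, int) or resolution <= 0:
        raise ValueError("mesh_resolution must be a positive integer")
    parameters = raw.get("parameters")
    if not isinstance(parameters, dict):
        raise ValueError("parameters must be a mapping")
    missing = REQUIRED_PARAMETERS[physics] - parameters.keys()
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return SimulationSpec(
        physics, resolution, {k: float(v) for k, v in parameters.items()}
    )


# Stand-ins for human-written solver templates; a real one would build and
# solve the finite element problem instead of returning a description.
def poisson_template(spec: SimulationSpec) -> str:
    n = spec.mesh_resolution
    return f"poisson solve on {n}x{n} mesh"


def elasticity_template(spec: SimulationSpec) -> str:
    n = spec.mesh_resolution
    return f"linear_elasticity solve on {n}x{n} mesh"


TEMPLATES: Dict[str, Callable[[SimulationSpec], str]] = {
    "poisson": poisson_template,
    "linear_elasticity": elasticity_template,
}


def dispatch(raw: dict) -> str:
    """Deterministic routing: validate first, then look up the template."""
    spec = validate_spec(raw)
    return TEMPLATES[spec.physics](spec)
```

In this sketch a request for an unsupported physics type fails at validation with a `ValueError` and never reaches a solver, which is the reliability property the summary attributes to the constrained design.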

Related events (8)

4 · GitHub Trending · 24d ago

Langfuse: Open Source LLM Engineering Platform Trending on GitHub

Langfuse is an open-source LLM engineering platform providing observability, metrics, evaluations, prompt management, and dataset tooling. It integrates with OpenTelemetry, LangChain, OpenAI SDK, and LiteLLM. The project has accumulated 28,075 GitHub stars with 89 new stars today, indicating sustained community traction. Backed by Y Combinator (W23), it represents a notable entry in the LLM ops/tooling ecosystem.

8 · arXiv · cs.AI · 29d ago

Large-Scale Evaluation of LLM-Driven Formal Proof Search on Open Mathematical Problems

Researchers present the first large-scale evaluation of LLM-based formal proof search on genuinely open mathematical problems, using Lean as a verification backend. Their most capable agent autonomously resolved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures, at a cost of a few hundred dollars per problem. The system is already being deployed in active research across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics. The study also compares agent architectures, finding that more sophisticated designs outperform simple generate-and-verify loops on the hardest problems.

7 · arXiv · cs.LG · 25d ago

DiscoverPhysics: Interactive Benchmark for LLM Scientific Discovery in Novel Physics Worlds

DiscoverPhysics is a new interactive benchmark that tests LLM agents on their ability to discover laws of motion in 22 simulated worlds with deliberately non-standard physics, including screened gravity, fractional-power interactions, and hidden dark-matter-like particles. Agents must propose experiments, observe N-body trajectory data, and submit both natural-language explanations and Python implementations of inferred laws. Evaluation across eleven frontier models shows the best agents pass only half the worlds, with consistent failures on latent-structure problems and a substantial gap between open-source and commercial models. The benchmark reveals that predictive accuracy and conceptual understanding are dissociable, and that genuine hypothesis refinement through well-designed experiments is required for high explanation scores.

5 · arXiv · cs.CL · 19d ago

PowerCodeBench: Knowledge Boundary Probing and Intervention for LLM-Based Power System Code Generation

This paper introduces PowerCodeBench, an execution-validated benchmark for evaluating LLMs on power-system simulation code generation using the pandapower library. The authors identify that failures are dominated by API-knowledge boundary errors (hallucinated function names, misused parameters) rather than reasoning failures, and propose a boundary-aware intervention combining API demand estimation with targeted documentation injection. Evaluated across ten open-weight models (1.5B–480B) and four commercial APIs on 2,000 tasks, the intervention yields 32–56 accuracy point improvements while using only 41% of baseline prompt-token cost. Open-weight models in the 70B–120B range match commercial mid-tier accuracy, with Llama-3.1-405B and Qwen3-Coder-480B leading.

5 · arXiv · cs.AI · 4d ago

LLM vs. first-year PhD student on EconCS research: workflow study using stable menus of public goods

A preprint uses an open problem from EC 2025 as a testbed to evaluate AI-assisted research workflows in economics and computer science. The study examines how human intuition in prompts, multi-turn interaction, and LLM capability affect the outcome, and compares the results with a first-year PhD student's contributions. Key findings: human intuition in prompts improves LLM 'taste', multi-turn workflows help when they encourage ambitious steps, and the LLM performs slightly below the first-year PhD student on the same problem. The work contributes empirical evidence on the practical utility and limits of LLMs as research collaborators in formal theory domains.

4 · Hugging Face Blog · 1mo ago

Introducing the Open FinLLM Leaderboard

Hugging Face has launched the Open FinLLM Leaderboard, a benchmarking platform specifically designed to evaluate large language models on financial domain tasks. The leaderboard aims to provide standardized, open evaluation of LLMs across finance-specific capabilities such as financial reasoning, document understanding, and numerical analysis. This fills a gap in domain-specific evaluation infrastructure for the financial sector.

5 · Hugging Face Blog · 1mo ago

Consilium: When Multiple LLMs Collaborate

Hugging Face introduces Consilium, a framework for multi-LLM collaboration where multiple language models work together on tasks rather than relying on a single model. The approach explores how ensembling or deliberation among diverse LLMs can improve output quality and robustness. This fits into the broader agent-tool ecosystem trend of orchestrating multiple AI models for better results.

4 · Hugging Face Blog · 1mo ago

Investing in Performance: Fine-tune small models with LLM insights — a CFM case study

This Hugging Face blog post presents a case study from CFM (Capital Fund Management) on using large language model outputs to guide fine-tuning of smaller, more efficient models for financial applications. The approach leverages LLM-generated signals or labels to train compact models that can be deployed at lower cost and latency. The case study illustrates an enterprise pattern of distilling LLM capabilities into task-specific smaller models for production use.