6 · arXiv cs.LG (Machine Learning) · 1mo ago

Universal Magnetic Structure Prediction from Atomic Coordinates with Near-Experimental Accuracy

The paper introduces MSN (Magnetic Structure Network), an E(3)-equivariant graph neural network that predicts collinear and non-collinear magnetic structures directly from atomic crystal coordinates. Trained on experimentally determined structures from the MAGNDATA database, it uses a novel Primitive Modulated Structure Representation (PMSR) to handle both commensurate and incommensurate magnetic orders in a unified framework without symmetry assumptions. The model is reported to achieve near-experimental accuracy across diverse magnetic structure types, offering a scalable alternative to costly experiments and computationally demanding first-principles methods for magnetic materials discovery.

Evaluation and Benchmarking · Magnetic Structure Network (MSN) · Primitive Modulated Structure Representation (PMSR) · E(3) equivariant graph neural network · MAGNDATA

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · arXiv · cs.LG · 1mo ago

EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation

EvoStruct addresses vocabulary collapse in GNN-based antibody CDR design by combining a frozen protein language model with an E(3)-equivariant GNN through a cross-attention adapter. The method introduces progressive PLM unfreezing and R-Drop consistency regularization to recover functionally important amino acid diversity. On CHIMERA-Bench, EvoStruct improves sequence recovery by 16%, reduces perplexity by 43%, and achieves 2.3x greater amino acid diversity compared to the best GNN baselines.

Evaluation and Benchmarking · Multimodal Progress · EvoStruct · protein language models · E(3)-equivariant GNN
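For readers unfamiliar with R-Drop consistency regularization, the following is a minimal sketch of the general technique, not EvoStruct's implementation. It assumes two predicted distributions `p1` and `p2` from two dropout passes over the same input; the function names, the averaging of the two cross-entropy terms, and the `alpha` weight are illustrative choices.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def r_drop_loss(p1, p2, target, alpha=1.0):
    """Cross-entropy on both passes plus a symmetric KL consistency term.

    p1, p2: class probabilities from two stochastic (dropout) forward passes.
    target: index of the correct class.
    alpha:  hypothetical weight on the consistency term.
    """
    # Average negative log-likelihood over the two passes.
    ce = -(math.log(p1[target]) + math.log(p2[target])) / 2
    # Symmetric KL penalizes disagreement between the two passes.
    consistency = (kl(p1, p2) + kl(p2, p1)) / 2
    return ce + alpha * consistency

# Two slightly different predictions for the same input (made-up numbers).
pass_a = [0.7, 0.2, 0.1]
pass_b = [0.6, 0.3, 0.1]
loss = r_drop_loss(pass_a, pass_b, target=0)
```

When the two passes agree exactly, the consistency term is zero and only the cross-entropy remains.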

5 · arXiv · cs.AI · 1mo ago

Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment

This paper introduces a framework for evaluating alignment between artificial vision models and the human visual cortex that goes beyond scalar prediction accuracy. Using repeated fMRI data from the Natural Scenes Dataset, the authors decompose brain response spaces into reproducible dimensions and measure which of these dimensions are recovered by model predictions. A key finding is that pretrained and randomly initialized models can achieve similar prediction accuracy while showing distinct recovery profiles, revealing that accuracy alone can mask fundamental model-brain mismatches. The framework also enables brain-to-brain comparisons as a diagnostic human reference baseline.

Evaluation and Benchmarking · Multimodal Progress · Natural Scenes Dataset · human visual cortex · target-space recovery profiles

4 · arXiv · cs.CL · 11d ago

Multilingual word-level forced alignment using MMS and learned dynamic programming outperforms MFA

Researchers present a forced alignment system combining Meta's Massively Multilingual Speech (MMS) model with a self-supervised phoneme boundary detector (UnSupSeg) and a learned dynamic programming decoder. Trained on TIMIT and Buckeye, the system outperforms Montreal Forced Aligner and MMS-based alignment on both datasets and generalizes to unseen languages (Dutch, German, Hebrew) without additional training. The authors suggest the approach could scale to the 1100+ languages supported by MMS, which would make it relevant for low-resource speech processing pipelines.

Multimodal Progress · MMS (Massively Multilingual Speech) · Montreal Forced Aligner · Buckeye

7 · arXiv · cs.LG · 25d ago

DiscoverPhysics: Interactive Benchmark for LLM Scientific Discovery in Novel Physics Worlds

DiscoverPhysics is a new interactive benchmark that tests LLM agents on their ability to discover laws of motion in 22 simulated worlds with deliberately non-standard physics, including screened gravity, fractional-power interactions, and hidden dark-matter-like particles. Agents must propose experiments, observe N-body trajectory data, and submit both natural-language explanations and Python implementations of inferred laws. Evaluation across eleven frontier models shows the best agents pass only half the worlds, with consistent failures on latent-structure problems and a substantial gap between open-source and commercial models. The benchmark reveals that predictive accuracy and conceptual understanding are dissociable, and that genuine hypothesis refinement through well-designed experiments is required for high explanation scores.

Frontier Model Releases · Evaluation and Benchmarking · LLM-judged explanation score · N-body simulator · trajectory MSE

5 · arXiv · cs.LG · 11d ago

Local linear structures in LLM weights and activations are dynamic, not fixed global directions

A new arXiv paper investigates the nature of linear structures in transformer weights and activations, finding strong local low-rank task-gradient structure but rejecting the hypothesis that fixed task planes exist. The authors show that useful bases drift substantially within 100 optimization steps, yet early recovery updates form a trajectory-prefix basis capturing 77% of LoRA recovery displacement. They also establish a formal connection between parameter perturbations and activation steering, finding a 0.58 cosine similarity between gradient-step-induced activation shifts and CAA steering vectors, suggesting linear structures are evolving local geometries rather than stable global task directions.

Evaluation and Benchmarking · Alignment and RLHF · CAA · Qwen-0.5B · LoRA
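To put the reported 0.58 cosine similarity in context, here is a small sketch of how cosine similarity between two vectors is computed. The vectors `shift` and `steering` are made-up stand-ins for an activation shift and a steering vector; nothing here reproduces the paper's data or code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D example: the vectors are 45 degrees apart.
shift = [1.0, 0.0]
steering = [1.0, 1.0]
similarity = cosine_similarity(shift, steering)
```

The measure ignores vector length, so it compares directions only; a value of 0.58 indicates partial alignment, well short of identical directions.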

6 · arXiv · cs.CL · 22d ago

COMPOSE: Dual-Graph Framework for Generating Future Mathematical Theorems from Citations and Formal Structure

COMPOSE is a framework that generates plausible future mathematical theorem-like claims by conditioning a language model on both a scientific citation graph and a formal theorem dependency graph simultaneously. The authors construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, plus a benchmark of 47K future papers from 2024–2025. Experiments show COMPOSE outperforms baselines on retrieval to real future papers and LLM-judge evaluation, producing more grounded and mathematically richer outputs. The work advances AI-assisted mathematical reasoning by combining informal scientific context with formal proof structure.

Frontier Model Releases · Evaluation and Benchmarking · COMPOSE · Mathlib · grounded future mathematical generation

5 · arXiv · cs.CL · 9d ago

Manifold Power Iteration redesigns MoE routers by aligning rows with expert singular directions

A new arXiv preprint proposes Manifold Power Iteration (MPI), a principled redesign of Mixture-of-Experts router matrices that aligns each router row with the principal singular direction of its associated expert. The method uses a 'Power-then-Retract' paradigm to enforce norm constraints while driving convergence toward these singular directions. Empirical validation spans MoE pretraining at scales from 1B to 11B parameters, showing improved model effectiveness.

Training Infrastructure · Frontier Model Releases · Redesign Mixture-of-Experts Routers with Manifold Power Iteration · Manifold Power Iteration
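As a rough illustration of the "Power-then-Retract" idea in the summary above, the sketch below runs textbook power iteration on W^T W and renormalizes the vector after every step. It is a generic procedure and not the preprint's algorithm: `power_then_retract`, the toy `expert` matrix, the step count, and the unit-norm constraint are all assumptions made for the example.

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    """Return the transpose of matrix m as a list of rows."""
    return [list(col) for col in zip(*m)]

def power_then_retract(expert, row, steps=50):
    """Drive `row` toward the principal right singular direction of `expert`."""
    expert_t = transpose(expert)
    for _ in range(steps):
        # Power step: apply W^T W, which amplifies the top singular direction.
        row = matvec(expert_t, matvec(expert, row))
        # Retract step: project back onto the unit sphere (norm constraint).
        norm = math.sqrt(sum(x * x for x in row))
        row = [x / norm for x in row]
    return row

# Toy expert weight with singular values 3 and 1; the top direction is [1, 0].
expert = [[3.0, 0.0], [0.0, 1.0]]
router_row = power_then_retract(expert, [1.0, 1.0])
```

In this toy case the router row converges to the first coordinate axis, because that is where the expert matrix stretches its input the most.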

3 · arXiv · cs.LG · 2d ago

P-K-GCN: Physics-augmented Koopman-enhanced Graph Convolutional Network for spatiotemporal super-resolution

Researchers propose P-K-GCN, a framework combining graph convolutional networks, Koopman operator theory, and physics-informed loss functions for spatiotemporal super-resolution on irregular geometries. The method linearizes nonlinear dynamics in a latent space and enforces physical constraints to improve reconstruction fidelity. Theoretical analysis claims guaranteed error reduction via Rademacher complexity bounds. The framework is evaluated on reconstructing high-resolution cardiac electrodynamics from sparse 3D heart geometry measurements.

P-K-GCN