4 · arXiv · cs.CL (Computation and Language) · 14h ago

CCG directed types improve structural generalization on SLOG, surpassing AM-Parser on position-shift categories

A new arXiv preprint redesigns the symbolic backend of a semantic parsing system around CCG directed types, a deterministic CKY decoder, and only 30K learnable parameters. With a BERT-base encoder it reaches 75.9% logical-form (LF) exact match on the SLOG benchmark, surpassing the previous SOTA, AM-Parser (70.8%). Gains are highly category-specific: the CCG system outperforms AM-Parser on all 5 position-shift categories (+29.9pp), while AM-Parser retains an edge on recursive-depth categories. Swapping in DeBERTa-v3-large as the encoder pushes performance to 90.7%, and the encoder gains fall in different category groups from the directionality gains, so the two complement each other. The work argues that directional representations shift the generalization bottleneck from the symbolic layer to the neural encoder, enabling further improvement through encoder scaling.

Evaluation and Benchmarking · Combinatory Categorial Grammar · AM-Parser · DeBERTa-v3 · BERT-base · SLOG
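For readers unfamiliar with the terms in the summary above, the following is a minimal sketch of what "directed types" and a deterministic CKY chart can look like. It is not the preprint's system: the class names, the toy lexicon, and the restriction to forward and backward application only are assumptions made for illustration, and real CCG parsers typically add further rules such as composition and type raising.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Func:
    result: "Cat"
    slash: str  # "/" seeks its argument to the right, "\\" to the left
    arg: "Cat"


Cat = Union[Atom, Func]


def combine(left, right):
    """Return every category derivable from two adjacent categories."""
    out = []
    # Forward application: X/Y  Y  =>  X
    if isinstance(left, Func) and left.slash == "/" and left.arg == right:
        out.append(left.result)
    # Backward application: Y  X\Y  =>  X
    if isinstance(right, Func) and right.slash == "\\" and right.arg == left:
        out.append(right.result)
    return out


def cky(words, lexicon):
    """Fill a CKY chart bottom-up; return the categories spanning all words."""
    n = len(words)
    if n == 0:
        return set()
    chart = {(i, i + 1): set(lexicon[w]) for i, w in enumerate(words)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = set()
            for k in range(i + 1, j):  # every split point of the span
                for l_cat in chart[(i, k)]:
                    for r_cat in chart[(k, j)]:
                        cell.update(combine(l_cat, r_cat))
            chart[(i, j)] = cell
    return chart[(0, n)]


# Hypothetical toy lexicon (not taken from SLOG or the preprint).
NP, S = Atom("NP"), Atom("S")
VP = Func(S, "\\", NP)   # S\NP: takes a subject NP on its left
TV = Func(VP, "/", NP)   # (S\NP)/NP: takes an object NP on its right
LEXICON = {"Emma": [NP], "Liam": [NP], "saw": [TV], "slept": [VP]}

print(cky("Emma saw Liam".split(), LEXICON))
print(cky("saw Emma Liam".split(), LEXICON))
```

In this toy setup the slash direction carries the word-order constraint: "Emma saw Liam" yields a sentence category, while "saw Emma Liam" yields nothing, because the verb phrase looks for its subject on the left only.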

Related guides (1)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement


Related events (8)

6 · arXiv · cs.LG · 1mo ago

Positional vs. Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization

Researchers train a decoder-only Transformer (GPT-J) on two structurally equivalent multi-hop reasoning tasks to study how attention heads specialize into positional or symbolic roles during learning. They find that successful task learning correlates with the emergence of 'pure' heads—exclusively positional or symbolic—and provide theoretical constructions showing how single-layer RoPE-based attention realizes these functions geometrically. A novel 'discrepancy' metric formalizes the robustness difference between the two head types, with symbolic mechanisms shown to extrapolate more reliably to longer sequences than positional ones. The findings have implications for understanding length generalization failures in RoPE-based models.

Long Context Evolution · Evaluation and Benchmarking · Transformers · multi-hop reasoning · Rotary Position Embedding (RoPE)

4 · arXiv · cs.CL · 10d ago

CTC oracle gap anatomy: acoustic scoring saturates, linguistic MBR decoding recovers WER

A new arXiv paper systematically diagnoses why CTC-internal N-best rescoring fails to improve over greedy decoding on LibriSpeech, showing that blank-path proliferation causes a 53% degradation in rank correlation between CTC scores and WER as beam size grows. The authors demonstrate that the bottleneck is linguistic rather than acoustic: MBR decoding with RoBERTa pseudo-log-likelihood achieves 9% relative WER reduction on LibriSpeech test-other and generalizes across two architectures and three domains. The paper also analyzes MWER sequence-level fine-tuning failure at near-converged checkpoints, attributing collapse to a vanishingly small training oracle gap.

Evaluation and Benchmarking · RoBERTa · LibriSpeech · The Anatomy of the CTC Oracle Gap: Acoustic Exhaustion and Linguistic Recovery

4 · arXiv · cs.CL · 18d ago

Synthetic data generation method enables small LLMs to match large models on Text-To-Cypher tasks

A new arXiv paper presents an automatic synthetic data generation method for fine-tuning small LLMs on Text-To-Cypher (Text2Cypher) parsing, enabling natural language interfaces to property graph databases. Experiments across major Text-To-Cypher benchmarks show that small fine-tuned models can compete with much larger proprietary models. The approach is positioned as a solution for local deployment scenarios requiring data sovereignty without expensive annotation.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Cypher · Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

6 · arXiv · cs.CL · 7d ago

CARVE: Content-aware gating for linear attention recurrent models improves efficiency and quality over GDN-2

CARVE (Content-Aware Recurrent with Value Efficiency) is a new linear attention architecture that addresses three coupled defects in the GDN-2 delta-rule architecture by restricting erasure to the key axis rather than the value axis. This design choice is proven necessary and sufficient for the WY-form triangular chunk solver, which gives training throughput competitive with Transformers. At 1.3B parameters trained on 100B tokens, CARVE achieves lower perplexity than GDN-2, leads recurrent baselines on nine commonsense reasoning benchmarks, and sets state-of-the-art on RULER retrieval probes, while using 13% less peak memory and 19% fewer parameters at 0.4% throughput overhead.

Training Infrastructure · Long Context Evolution · WikiText-2 · CARVE · GDN-2

5 · arXiv · cs.CL · 22d ago

AGDO: Attention-guided denoising and optimization framework improves diffusion language model reasoning

Researchers propose AGDO, a framework that replaces random masking in diffusion large language models (dLLMs) with attention-guided denoising order and token weighting during fine-tuning and reinforcement learning. The work is motivated by an empirical finding that tokens with stronger attention to unmasked context are more stable and critical for reasoning. Experiments on math and coding benchmarks show AGDO outperforms existing post-training methods for dLLMs, advancing the case for attention-aware training in parallel-decoding language models.

Alignment and RLHF · AGDO · Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

5 · arXiv · cs.CL · 24d ago

GGRO: Gradient-Guided Reward Optimization for inference-time LLM alignment

Researchers introduce Gradient-Guided Reward Optimization (GGRO), an inference-time alignment method that uses gradient signals from a reward model to inject 'nudging tokens' at high-uncertainty decoding steps, rather than relying on sampling-intensive re-ranking approaches like Best-of-N. The method monitors token-level entropy to detect distribution drift and steers generation trajectories directly, claiming improved robustness to reward hacking with minimal computational overhead. Experiments show gains across safety, helpfulness, and reasoning benchmarks compared to standard inference-time alignment baselines.

Inference Economics · Alignment and RLHF · Best-of-N Sampling · Gradient-Guided Reward Optimization

3 · arXiv · cs.CL · 17d ago

Revisiting LLM systematicity in negation understanding via in-context learning

A new arXiv preprint analyzes how well large language models handle negation from two angles: behavioral systematicity (whether models correctly recognize negation expressions and scope) and representational systematicity (whether function vectors can be reliably constructed from in-context examples). Results show LLMs partially succeed at negation cue recognition via in-context learning but struggle with scope recognition, with performance varying by output format. Function vectors can be composed for cue extraction but are harder to extract for scope recognition tasks.

Evaluation and Benchmarking · Revisiting the Systematicity in Negation in the Era of In-Context Learning

4 · arXiv · cs.CL · 25d ago

HKVM-RAG: Hypergraph key-value separation improves multi-hop retrieval-augmented generation

A new arXiv preprint introduces HKVM-RAG, an evidence-organization layer for multi-hop RAG that uses weighted hyperedges as retrieval keys while retaining passage text as answer values. Under a fixed-substrate protocol controlling for tuple cache, reader, and evaluation budget, the hypergraph key-value approach improves over KG-PPR by +3.4 F1 on 2WikiMultiHopQA and +3.6 F1 on MuSiQue. A dense-aware controller combining frozen ColBERTv2 with HKVM features reaches 88.8, 65.1, and 85.8 F1 on three benchmarks, outperforming ColBERTv2 alone by 5–11 F1 points. The work positions hypergraph organization as a reusable evidence-control mechanism rather than a dense-retrieval replacement.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ColBERTv2 · MuSiQue · 2WikiMultiHopQA