3 · arXiv cs.CL (Computation and Language) · 17h ago

Tree-of-Thoughts hybrid approach for legal case judgement summarization using LLMs

A new arXiv preprint proposes a tree-of-thoughts-inspired extractive-abstractive summarization method for legal case judgements. The authors evaluate DeepSeek and LLaMA models across extractive, abstractive, and hybrid summarization strategies, finding that the hybrid prompt approach yields better summaries. The work targets a narrow but practically relevant application of LLMs in legal NLP.

Evaluation and Benchmarking · DeepSeek V4 · A Tree-of-Thoughts Inspired Hybrid Approach for Legal Case Judgement Summarization using LLMs · Tree of Thoughts · Llama

Related guides (2)

DeepSeek V4

DeepSeek V4: The Open-Weights Giant Reshaping AI Economics

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.CL · 26d ago

Training-free mixture-of-agents framework combines LLMs and knowledge graphs for multi-document summarization

A new arXiv preprint proposes a training-free multi-agent framework for multi-document summarization (MDS) that decomposes the task into specialized agents for extractive selection, knowledge-aware abstraction, and iterative refinement, unified via a multi-perspective consistency mechanism. The system integrates LLMs with knowledge graphs without task-specific fine-tuning. Experiments across four datasets in English and Vietnamese show state-of-the-art or competitive performance, with the authors emphasizing cross-domain and cross-lingual generalization.

Evaluation and Benchmarking · Agent and Tool Ecosystem · A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs

4 · arXiv · cs.CL · 24d ago

Study compares human and LLM active causal reasoning, finding LLMs less efficient but near human-level on conjunctive rules

A new arXiv paper investigates whether active exploration reduces the 'conjunctive handicap' in causal learning, using a blicket detector task with adult participants who could freely intervene to identify causal objects. Results show active exploration substantially improves human conjunctive causal reasoning, though conjunctive rules still require more tests than disjunctive ones. State-of-the-art LLMs approach human-level hypothesis inference accuracy but show less efficient exploration strategies and similar conjunctive-disjunctive performance gaps, raising questions about the nature of LLM causal reasoning.

Evaluation and Benchmarking · Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?

5 · arXiv · cs.CL · 4d ago

Unified defense framework detects and remediates data poisoning in text summarization fine-tuning

A new arXiv preprint introduces a post-hoc defense framework for detecting and recovering from training-time data poisoning in LLMs fine-tuned for abstractive summarization. The framework uses influence-function analysis in white-box settings and behavioral perturbation auditing in black-box settings, achieving 85-92% detection precision across nine architectures and six benchmarks. Gradient-ascent unlearning restores up to 96% of original model behavior with less than 0.6% ROUGE degradation. The authors also introduce novel attacks targeting factual distortion and representational bias that evade conventional evaluation metrics.

Evaluation and Benchmarking · AI Safety Research · ROUGE-L · Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning

4 · arXiv · cs.CL · 3d ago

Judge-Aware Gated Multi-Task Learning achieves state-of-the-art on UK Employment Tribunal outcome prediction

Researchers propose a Judge-Aware Gated Multi-Task Learning architecture for legal outcome prediction that explicitly disentangles factual case merits from judicial discretion via a gated fusion mechanism conditioned on judge identity. Evaluated on 13,937 UK Employment Tribunal decisions, the approach outperforms supervised fine-tuning of a Gemma-4 26B backbone while requiring an order of magnitude fewer trainable parameters. The key finding is that differentiable structured composition of identity signals outperforms prompt-based composition over a much larger generative model, suggesting conditioning interface choice dominates scale for identity-conditioned classification tasks.

Evaluation and Benchmarking · Gemma-4 E4B-it · Towards Explainable Adjudicative Variance: Quantifying Judicial Discretion via Gated Multi-Task Learning · LoRA

6 · arXiv · cs.CL · 1mo ago

Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages

Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

Frontier Model Releases · Evaluation and Benchmarking · large language models · Population-Matching Experiment · Pragmatic Reasoning

5 · arXiv · cs.CL · 1mo ago

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking · Basque language · LLM-as-a-Judge · mJudge

4 · arXiv · cs.CL · 17h ago

LLMs outperform traditional methods on single-truth and multi-truth data fusion tasks

A new arXiv preprint investigates using LLMs for data fusion (truth discovery) over tabular data, covering both single-truth and multi-truth scenarios. The authors evaluate domain-dependent, domain-independent, zero-shot, and one-shot prompting strategies across three benchmark datasets. LLM-based approaches outperform traditional unsupervised methods including DART and LTM on all datasets, with code released publicly.

Evaluation and Benchmarking · Enterprise Deployment Patterns · DART · LTM · Single and Multi Truth Data Fusion using Large Language Models

4 · Hugging Face Blog · 1mo ago

Open-Source Text Generation & LLM Ecosystem at Hugging Face

Hugging Face published a blog post surveying the open-source LLM ecosystem as of mid-2023, covering text generation models, tooling, and deployment patterns available on the platform. The post highlights the breadth of open-weight models and associated infrastructure for inference and fine-tuning. It serves as a reference overview of the state of open-source LLMs at that point in time.

Open Weights Progress · Inference Economics · Hugging Face