Consensus-based multi-agent LLM framework for HTS tariff code classification in maritime logistics
Researchers propose a multi-agent LLM framework for classifying Canadian 10-digit Harmonized Tariff Schedule codes, targeting smart-port and maritime logistics applications. The system combines semantic retrieval over official tariff documents, consensus-based validation, element-wise voting across hierarchical code components, and human-in-the-loop escalation. In an evaluation on 3,300 expert-labeled product records, exact 10-digit classification remains difficult even for advanced LLMs, with accuracy degrading at finer granularity levels. The work argues for uncertainty-aware, evidence-grounded workflows over fully autonomous classification.
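The element-wise voting and escalation steps described above can be pictured with a minimal sketch. It is not the authors' implementation: the split of a 10-digit code into five 2-digit components, the `min_agreement` threshold, and the function names are all assumptions made for illustration, and the input strings are example codes rather than data from the paper.

```python
from collections import Counter

# Assumption: a 10-digit code is treated as five 2-digit hierarchical components.
SEGMENTS = [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10)]


def split_code(code: str) -> list[str]:
    """Split a 10-digit code (dots allowed) into its assumed components."""
    digits = code.replace(".", "")
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError(f"expected a 10-digit code, got {code!r}")
    return [digits[a:b] for a, b in SEGMENTS]


def elementwise_vote(
    predictions: list[str], min_agreement: float = 0.6
) -> tuple[str, bool]:
    """Vote component by component, stopping at the first level without consensus.

    Returns (agreed_prefix, needs_human_review).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    candidates = [split_code(p) for p in predictions]
    agreed: list[str] = []
    for level in range(len(SEGMENTS)):
        # Only agents consistent with the prefix agreed so far vote at this
        # level, so the result is always a prefix some agent actually proposed.
        voters = [c for c in candidates if c[:level] == agreed]
        winner, count = Counter(c[level] for c in voters).most_common(1)[0]
        if count / len(predictions) < min_agreement:
            # No consensus at this granularity: escalate to a human reviewer.
            return "".join(agreed), True
        agreed.append(winner)
    return "".join(agreed), False


# Three hypothetical agent outputs that agree except on the last component.
code, escalate = elementwise_vote(["8471300010", "8471300010", "8471300090"])
print(code, escalate)
```

Voting level by level means that when the agents disagree, the sketch still returns the coarser prefix they share and flags the record for review, which mirrors the summary's point that accuracy degrades at finer granularity.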
Related events (8)
TradingAgents: Multi-Agent LLM Financial Trading Framework
TradingAgents is an open-source Python framework by TauricResearch that applies multi-agent LLM architectures to financial trading tasks. The repository has accumulated 81,650 GitHub stars, 284 of them added on the day of this listing, indicating strong community traction. It represents a concrete deployment pattern for agentic AI systems in quantitative finance.
Multi-agent LLM framework for Chinese civil court simulation with five-stage trial procedure
Researchers present a multi-agent LLM framework for simulating Chinese civil court proceedings, organized around a five-stage civil trial procedure with memory modules and statute retrieval. The system targets civil litigation specifically, which is more common and harder to simulate than criminal cases due to flexible claims and remedies. Experiments show reliable judgment outputs with particular strengths in liability allocation, and find that memory quality substantially affects downstream simulation quality. Code and dataset are publicly released.
AgentMob: Training-free LLM agent framework for evidence-grounded mobility prediction
AgentMob is a training-free LLM-driven agent framework that formulates next-location prediction as adaptive evidence-controlled decision making, using a fast path for routine cases and iterative tool use for ambiguous ones. Evaluated on three mobility datasets, it achieves the strongest overall performance among training-free LLM-based methods, with GPT-5.4 reaching 71.42% Acc@1 on the BW dataset. The results suggest that LLM controllers add the most value when resolving ambiguous predictions through adaptive evidence gathering, rather than in routine cases.
LLM Agent Framework for Last-Mile Time Series Forecasting Revision
This paper introduces a 'last-mile forecasting' framework where an LLM agent sits atop a statistical forecasting backbone to incorporate weakly structured business context—holidays, campaigns, expert feedback, external events—into decision-ready forecasts. The system uses tool-invocation for contextual retrieval, converts reasoning into explicit revision actions under safety constraints, and supports long-horizon forecasting via map-reduce decomposition with a memory bank for post-hoc reflection. The authors validate the approach through real-world case studies, positioning it as a bridge between statistical prediction and operationally usable forecasts.
Open-source LLMs as LangChain Agents
This Hugging Face blog post explores using open-source LLMs as agents within the LangChain framework. It examines the capability of various open-weight models to perform tool use, reasoning, and multi-step task execution in agentic settings. The post likely benchmarks or compares several models on agent-relevant tasks, providing practical guidance for deploying open-source alternatives to proprietary models in agent pipelines.
Structure-Aware Code Change Labeling with LLMs via Two-Stage Taxonomy Pipeline
This paper presents a systematic study of using LLMs for taxonomy-based labeling of code diff hunks, going beyond summarization to assign structured labels capturing semantic attributes like renames, moves, and logic modifications. The authors introduce a two-stage pipeline combining diff-hunk labeling with structural refinement, using few-shot prompting to remain language-agnostic. Evaluated across four LLMs on a curated benchmark of natural and synthetic patches, the best configuration achieves 84% recall and 81% precision. Results suggest LLM-based structured labeling can complement static analysis tools in code review workflows.
HLL: Benchmark for Evaluating Multimodal Agents on CAPTCHA Human-Verification Boundaries
The paper introduces Humanity's Last Line of Verification (HLL), a controlled benchmark that tests whether multimodal agents can solve CAPTCHA challenges through grounded, human-like GUI interaction rather than mere recognition. Eight frontier multimodal agents are evaluated in a closed-loop environment across diverse CAPTCHA types with realism stressors including cluttered interfaces, harder variants, and trace-conditioned validation. Results show current agents remain brittle at this human-substitution boundary, with performance degrading under realistic conditions and when action traces must be consistent with correct answers. The benchmark exposes specific gaps in localization, action calibration, state tracking, and process consistency.
Automated ICD Classification of Psychiatric Diagnoses Using NLP and LLMs
This study evaluates NLP and ML approaches for automating the mapping of free-text psychiatric descriptions to ICD diagnostic codes, using a dataset of 145,513 Spanish clinical records. Methods range from classical BoW/TF-IDF representations to transformer-based embeddings including e5_large, BioLORD, and Llama-3-8B. Fine-tuned e5_large achieved the best performance with a micro-F1 of 0.866, outperforming classical methods by capturing semantic nuance and medical terminology. The work highlights challenges of long-tail label distributions and ambiguity specific to psychiatric clinical language.

