4 · arXiv cs.AI (Artificial Intelligence) · 20h ago

KnowsTFM: Knowledge graph-informed fine-tuning of small tabular foundation models

A new arXiv preprint introduces KnowsTFM, a method for fine-tuning small tabular foundation models (nanoscale TabPFN and TabICL variants) using structural attention priors derived from knowledge graphs and parameter-efficient low-rank updates. The approach targets niche domains with scarce, high-dimensional data shifted from pretraining distributions, showing meaningful gains in specialist settings but marginal gains on general tasks. The paper also reports that continual fine-tuning of frontier tabular models can trigger collapse of pretrained knowledge, a notable failure mode.
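
As a rough illustration of what a "parameter-efficient low-rank update" means in general, the sketch below applies a LoRA-style adapter to one frozen weight matrix. It is not KnowsTFM's implementation: the matrix sizes, the `alpha` scaling, and the helper names are assumptions made for the example, and the knowledge-graph attention prior is not modelled here.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(lora_a)
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

random.seed(0)
d_out, d_in, rank = 8, 8, 2  # hypothetical sizes, chosen only for illustration

# Frozen pretrained weight (random stand-in for a real layer).
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable factors: A is small random, B starts at zero so training begins at W.
lora_a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]

w_eff = lora_effective_weight(w, lora_b, lora_a, alpha=4.0)

full_params = d_out * d_in              # parameters touched by full fine-tuning
adapter_params = rank * (d_in + d_out)  # parameters touched by the adapter
print(full_params, adapter_params)
```

Only `lora_a` and `lora_b` would be trained. Because `lora_b` starts at zero, the adapted layer initially reproduces the pretrained one, and the number of trainable parameters grows with the rank rather than with the full matrix size.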

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.AI · 1mo ago

Distilling Tabular Foundation Models for Structured Health Data

This paper investigates knowledge distillation from tabular foundation models (TFMs) to lightweight student models for healthcare applications. The authors address context leakage in in-context TFMs via stratified out-of-fold teacher labeling, evaluating across 19 healthcare datasets, 6 TFM teachers, and 4 student families. Distilled students retain at least 90% of teacher AUC while running 26× faster on CPU, with preserved calibration and fairness properties. Multi-teacher ensembles do not consistently outperform the best single teacher.
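
The idea behind stratified out-of-fold teacher labeling can be sketched as follows: each row receives its soft label from a teacher whose context excludes that row, so the teacher cannot simply echo a label it was shown. This is a minimal sketch under stated assumptions, not the authors' pipeline; `prior_teacher` is a hypothetical stand-in for an in-context TFM, and the fold assignment is a simple round-robin within each class.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign each row a fold index so every class is spread evenly across folds."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of

def out_of_fold_labels(features, labels, n_folds, fit_teacher):
    """Label each row with a teacher fitted only on the other folds."""
    fold_of = stratified_folds(labels, n_folds)
    soft = [None] * len(labels)
    for fold in range(n_folds):
        train = [i for i, f in enumerate(fold_of) if f != fold]
        held_out = [i for i, f in enumerate(fold_of) if f == fold]
        predict = fit_teacher([features[i] for i in train],
                              [labels[i] for i in train])
        for i in held_out:
            soft[i] = predict(features[i])  # this row was never in the teacher's context
    return soft, fold_of

def prior_teacher(train_x, train_y):
    """Hypothetical teacher: predicts the positive rate of its context for every row."""
    rate = sum(train_y) / len(train_y)
    return lambda row: rate

features = [[i] for i in range(6)]
labels = [0, 0, 0, 0, 1, 1]
soft_labels, fold_of = out_of_fold_labels(features, labels, 2, prior_teacher)
print(fold_of, soft_labels)
```

A student model would then be trained on `soft_labels` instead of labels produced by a teacher that had seen every row in its context.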

5 · arXiv · cs.AI · 1mo ago

Ensembling Tabular Foundation Models: A Diversity Ceiling and a Calibration Trap

This paper benchmarks six ensemble strategies across six tabular foundation models (TFMs) on 153 OpenML classification tasks, finding that ensembling provides minimal gains over the best single TFM. The best ensemble strategy (two-level cascade stacking) achieves only +0.18% accuracy improvement at 253× the compute cost. A key finding is that logistic-regression meta-learner stacking improves accuracy while severely degrading calibration (log-loss), because sharpening class boundaries destroys probability estimates. The authors recommend greedy ensemble selection as the practical default.
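
For readers unfamiliar with greedy ensemble selection, the sketch below shows the general technique: models are added one at a time, with replacement, choosing whichever addition gives the lowest validation log-loss for the averaged probabilities. The paper's exact variant is not described in this summary, so the fixed number of rounds, the log-loss objective, and the toy validation predictions are all assumptions made for illustration.

```python
import math

def log_loss(probs, labels, eps=1e-12):
    """Mean binary log-loss; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(labels)

def greedy_ensemble_selection(model_probs, labels, n_rounds):
    """Add models greedily (with replacement) to minimise validation log-loss."""
    chosen = []
    totals = [0.0] * len(labels)  # running sum of the selected models' probabilities
    for _ in range(n_rounds):
        best_name, best_score = None, float("inf")
        for name, probs in model_probs.items():
            size = len(chosen) + 1
            candidate = [(t + p) / size for t, p in zip(totals, probs)]
            score = log_loss(candidate, labels)
            if score < best_score:
                best_name, best_score = name, score
        chosen.append(best_name)
        totals = [t + p for t, p in zip(totals, model_probs[best_name])]
    return chosen

# Toy validation predictions (made up for the example, not taken from the paper).
labels = [1, 0, 1, 0]
model_probs = {
    "a": [0.9, 0.1, 0.3, 0.1],
    "b": [0.6, 0.4, 0.9, 0.4],
    "c": [0.5, 0.5, 0.5, 0.5],
}
chosen = greedy_ensemble_selection(model_probs, labels, n_rounds=3)
print(chosen)
```

The final ensemble averages the selected models' probabilities, so a model picked twice counts twice. Scoring on log-loss rather than accuracy keeps calibration in the objective, which is relevant to the calibration problem the summary describes for logistic-regression stacking.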

5 · arXiv · cs.CL · 7d ago

Sub-billion parameter SLMs outperform zero-shot GPT-5.4 and Claude Sonnet 4.6 on relation extraction benchmarks

A new arXiv paper demonstrates that small language models (360M–3B parameters) fine-tuned on task-specific data can substantially outperform zero-shot frontier LLMs on relation extraction tasks. The best sub-billion model, Qwen2.5-0.5B fine-tuned on pooled general-domain data, achieves micro-F1 of 0.83 versus 0.69 for GPT-5.4 and 0.66 for Claude Sonnet 4.6 in zero-shot settings. The authors attribute the gains to task adaptation rather than model architecture, with a discriminative RoBERTa baseline also exceeding frontier models, and show that 4-bit quantized models deployable on consumer GPUs can match or beat proprietary API-based systems for this narrow task. The work provides evidence that for well-defined NLP tasks with available training data, compact adapted models offer a practical, private, and hardware-efficient alternative to frontier APIs.

7 · arXiv · cs.CL · 28d ago

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters

This paper reframes parameter-efficient fine-tuning (PEFT) not merely as a cheaper alternative to full fine-tuning, but as a substrate for persistent, instance-specific personal models layered atop shared foundation models. The authors analyze three scaling axes: Scale Up (stronger base models amplifying adapter utility), Scale Down (minimum viable adapter size), and Scale Out (managing millions of concurrent adapted instances). They introduce MinT as an infrastructure reference for adapter identity, versioning, provenance, evaluation, and serving at scale.

6 · arXiv · cs.CL · 1mo ago

ChunkFT: Memory-Efficient Full Fine-Tuning via Byte-Streamed Chunk Optimization

ChunkFT is a fine-tuning framework that reformulates full-parameter optimization around a dynamically activated working set of sub-tensors, enabling gradient computation without dense gradient materialization. It achieves full-parameter fine-tuning of a 7B model in 13.72GB GPU memory on a single RTX 4090, and scales Llama 3-70B fine-tuning to 2×H800 GPUs. Downstream evaluations on language understanding, math reasoning, and MT-Bench show ChunkFT matches or exceeds full-parameter fine-tuning quality while outperforming existing memory-efficient baselines such as LoRA-class methods. A theoretical convergence analysis in the deterministic setting is also provided.

4 · Hugging Face Blog · 1mo ago

Investing in Performance: Fine-tune small models with LLM insights — a CFM case study

This Hugging Face blog post presents a case study from CFM (Capital Fund Management) on using large language model outputs to guide fine-tuning of smaller, more efficient models for financial applications. The approach leverages LLM-generated signals or labels to train compact models that can be deployed at lower cost and latency. The case study illustrates an enterprise pattern of distilling LLM capabilities into task-specific smaller models for production use.

6 · arXiv · cs.LG · 1mo ago

PEFT-Arena: Benchmarking Parameter-Efficient Finetuning via Stability-Plasticity Trade-offs

PEFT-Arena is a new benchmark that evaluates parameter-efficient finetuning methods jointly on downstream task performance and retention of pretrained general capabilities, framing the problem as a stability-plasticity dilemma. Across methods tested under comparable parameter budgets, orthogonal finetuning achieves the best Pareto frontier. The paper provides geometric analyses in both weight space (spectral/singular-value structure) and activation space (representation distortion metrics) to explain why different PEFT methods differ in forgetting behavior. A practical finding is that final SFT checkpoints often overshoot an optimal retention operating point, motivating path-wise rewinding as a post-hoc correction.

3 · arXiv · cs.LG · 15d ago

HumP-KD: Uncertainty-aware multi-stage knowledge distillation for efficient fire classification

Researchers propose HumP-KD, a knowledge distillation framework that compresses two heterogeneous transformer teachers (Swin-Tiny and ViT-Base) into a lightweight MobileViT-S student for real-time fire classification. The student model achieves 0.9876 mean F1 on a 31K-image dataset while retaining only 4.94M parameters—a 5.7× reduction over Swin-Tiny—and runs at 37.72 CPU FPS. The framework combines hierarchical feature alignment, spatial attention masking, and progressive multi-stage distillation to maintain accuracy under degraded visual conditions.