arXiv cs.CL (Computation and Language) · 5d ago

Synthetic data generation method enables small LLMs to match large models on Text-To-Cypher tasks

A new arXiv paper presents an automatic synthetic data generation method for fine-tuning small LLMs on Text-To-Cypher (Text2Cypher) parsing, enabling natural language interfaces to property graph databases. Experiments across major Text-To-Cypher benchmarks show that small fine-tuned models can compete with much larger proprietary models. The approach is positioned as a solution for local deployment scenarios requiring data sovereignty without expensive annotation.
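To make the task concrete, the sketch below shows what a synthetic Text-To-Cypher training pair could look like when it is derived from facts already stored in a graph. It is an illustration only and not the paper's method: the `Person`/`Movie` schema, the `ACTED_IN` edge list, the question template, and the `make_examples` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One (natural-language question, Cypher query) training pair."""
    question: str
    cypher: str


# Hypothetical toy graph content: (person name, movie title) pairs that
# stand in for ACTED_IN relationships in a property graph.
ACTED_IN = [
    ("Keanu Reeves", "The Matrix"),
    ("Carrie-Anne Moss", "The Matrix"),
]


def make_examples(edges):
    """Fill one fixed question/query template per edge (assumed template)."""
    examples = []
    for person, _movie in edges:
        examples.append(
            Example(
                question=f"Which movies did {person} act in?",
                # Doubled braces emit literal { } for the Cypher property map.
                cypher=(
                    f"MATCH (p:Person {{name: '{person}'}})"
                    "-[:ACTED_IN]->(m:Movie) RETURN m.title"
                ),
            )
        )
    return examples


if __name__ == "__main__":
    for ex in make_examples(ACTED_IN):
        print(ex.question)
        print(ex.cypher)
```

Because each pair is built from values that exist in the graph, the generated query should return a non-empty result when run against that graph. A single template like this one yields little linguistic variety, so a real pipeline would need many more query shapes and paraphrases.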

Evaluation and Benchmarking · Enterprise Deployment Patterns · Cypher · Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

Related guides (2)

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

arXiv · cs.CL · 17d ago

Synthetic LLM-generated conversations improve ASR training for low-resource languages

Researchers propose a pipeline that uses LLMs to generate scenario-level dialogues and TTS to synthesize multi-speaker audio, creating simulated conversational training data for ASR systems. Evaluated on the Hungarian BEA-Dialogue benchmark, a model trained on 67 hours of real plus 636 hours of synthetic data outperforms a zero-shot model trained on 2,700 hours of real Hungarian speech. The study tests five LLM families under multiple budget and mixing configurations using a FastConformer-Large backbone, finding that generator choice and data composition significantly affect gains.

Evaluation and Benchmarking · FastConformer-Large · Efficient ASR Training with Conversations that Never Happened · BEA-Dialogue

Hugging Face Blog · 1mo ago

Cosmopedia: Creating Large-Scale Synthetic Data for Pre-training LLMs

Hugging Face introduces Cosmopedia, a large-scale synthetic dataset designed for pre-training large language models. The blog post details the methodology for generating diverse, high-quality synthetic text at scale using existing LLMs as data generators. The work addresses the growing challenge of data scarcity and quality in LLM pre-training pipelines.

Training Infrastructure · Evaluation and Benchmarking · Hugging Face · Synthetic Data Generator · Cosmopedia

arXiv · cs.CL · 11d ago

Synthetic data bootstrapping and LoRA fine-tuning for Q'eqchi' Mayan NMT without web scraping

Researchers introduce a data synthesis methodology for low-resource neural machine translation of Q'eqchi' Mayan, converting community-sourced dictionaries into a synthetic parallel corpus to avoid scraping target-language data. Using LoRA adapters on mT5-base, the approach achieves BLEU 42.02 on in-domain evaluation but only 0.59 against organic text, revealing a structural-semantic gap. An ablation with multi-task learning produced negative transfer, suggesting LoRA capacity limits conflict with auxiliary objectives. The study concludes synthetic bootstrapping is effective for structural priming but requires authentic data for semantic refinement via curriculum learning.

Evaluation and Benchmarking · Open Weights Progress · BLEU · LoRA · mT5

Hugging Face Blog · 1mo ago

Introducing the Synthetic Data Generator - Build Datasets with Natural Language

Hugging Face has launched a Synthetic Data Generator tool that allows users to create datasets using natural language descriptions. The tool is designed to lower the barrier for dataset creation, enabling practitioners to generate training data without writing code. This is relevant to the broader trend of synthetic data as a scalable alternative to manual data collection and annotation.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Hugging Face · Synthetic Data Generator

Hugging Face Blog · 1mo ago

SyGra: The One-Stop Framework for Building Data for LLMs and SLMs

ServiceNow AI introduces SyGra, a framework designed to streamline synthetic and curated data generation for training large and small language models. The framework aims to provide a unified pipeline covering data synthesis, filtering, and quality control for LLM/SLM development. The blog post appears on Hugging Face, positioning SyGra as a practical tooling contribution to the data preparation ecosystem.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ServiceNow AI · SyGra · Hugging Face

arXiv · cs.CL · 17d ago

Synthetic linguistic reasoning traces improve low-resource machine translation via in-context learning

Researchers propose a pipeline that generates step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks to assist LLMs in translating extremely low-resource languages. Evaluated on Xibe and Chintang across ICL, SFT, and RFT settings, the traces prove most effective as inference-time guidance rather than training data. Models can leverage reliable grammatical analyses at inference time but struggle to learn to generate accurate traces themselves, identifying trace generation quality as the key bottleneck.

Evaluation and Benchmarking · Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation? · Universal Dependencies

Hugging Face Blog · 1mo ago

Synthetic Data: Save Money, Time and Carbon with Open Source

A Hugging Face blog post advocates for using synthetic data generation with open-source tools as a cost-effective, time-efficient, and environmentally friendlier alternative to real data collection and labeling. The post likely covers techniques and tooling available in the open-source ecosystem for generating synthetic training data. This is relevant to the broader trend of reducing dependency on expensive human-labeled datasets in ML pipelines.

Open Weights Progress · Agent and Tool Ecosystem · Hugging Face · Synthetic Data Generator

Hugging Face Blog · 1mo ago

Investing in Performance: Fine-tune small models with LLM insights — a CFM case study

This Hugging Face blog post presents a case study from CFM (Capital Fund Management) on using large language model outputs to guide fine-tuning of smaller, more efficient models for financial applications. The approach leverages LLM-generated signals or labels to train compact models that can be deployed at lower cost and latency. The case study illustrates an enterprise pattern of distilling LLM capabilities into task-specific smaller models for production use.

Inference Economics · Enterprise Deployment Patterns · knowledge distillation · Hugging Face · Capital Fund Management