Synthetic data generation method enables small LLMs to match large models on Text-To-Cypher tasks
A new arXiv paper presents an automatic synthetic data generation method for fine-tuning small LLMs on Text-To-Cypher (Text2Cypher) parsing, the task of translating natural language questions into Cypher queries so that users can query property graph databases in plain language. Experiments across major Text-To-Cypher benchmarks show that small fine-tuned models can compete with much larger proprietary models. The approach is positioned as a solution for local deployments that require data sovereignty and cannot rely on expensive manual annotation.
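To make the task concrete, here is a minimal sketch of what synthetic Text-To-Cypher training pairs can look like. It is not the paper's method, which this summary does not describe. It is a simple template-filling illustration, and the schema, templates, values, and the `generate_pairs` helper are all hypothetical.

```python
import random

# Hypothetical toy schema: node label -> property used for lookups.
SCHEMA = {"Person": "name", "Movie": "title"}

# Hypothetical example values for each node label.
VALUES = {"Person": ["Alice", "Bob"], "Movie": ["Heat", "Alien"]}

# Hypothetical templates: a question pattern paired with a Cypher pattern.
# Doubled braces are literal braces in str.format.
TEMPLATES = [
    (
        "Find the {label} whose {prop} is {value}.",
        "MATCH (n:{label} {{{prop}: '{value}'}}) RETURN n",
    ),
    (
        "How many {label} nodes are there?",
        "MATCH (n:{label}) RETURN count(n)",
    ),
]


def generate_pairs(n, seed=0):
    """Return n synthetic (question, cypher) pairs built from the templates."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    pairs = []
    for _ in range(n):
        label = rng.choice(sorted(SCHEMA))
        question_t, cypher_t = rng.choice(TEMPLATES)
        fields = {
            "label": label,
            "prop": SCHEMA[label],
            "value": rng.choice(VALUES[label]),
        }
        pairs.append(
            {
                "question": question_t.format(**fields),
                "cypher": cypher_t.format(**fields),
            }
        )
    return pairs


if __name__ == "__main__":
    for pair in generate_pairs(3):
        print(pair["question"], "->", pair["cypher"])
```

Pairs like these could serve as supervised fine-tuning examples, with the question as input and the Cypher query as the target. A realistic pipeline would also need to validate the generated queries against an actual graph schema.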
Related events (8)
Synthetic LLM-generated conversations improve ASR training for low-resource languages
Researchers propose a pipeline that uses LLMs to generate scenario-level dialogues and TTS to synthesize multi-speaker audio, creating simulated conversational training data for ASR systems. On the Hungarian BEA-Dialogue benchmark, a model trained on 67 hours of real plus 636 hours of synthetic data outperforms a model trained on 2,700 hours of real Hungarian speech and evaluated zero-shot. The study tests five LLM families under multiple budget and mixing configurations using a FastConformer-Large backbone, and finds that generator choice and data composition significantly affect the gains.
Cosmopedia: Creating Large-Scale Synthetic Data for Pre-training LLMs
Hugging Face introduces Cosmopedia, a large-scale synthetic dataset designed for pre-training large language models. The blog post details the methodology for generating diverse, high-quality synthetic text at scale using existing LLMs as data generators. The work addresses the growing challenge of data scarcity and quality in LLM pre-training pipelines.
Synthetic data bootstrapping and LoRA fine-tuning for Q'eqchi' Mayan NMT without web scraping
Researchers introduce a data synthesis methodology for low-resource neural machine translation of Q'eqchi' Mayan, converting community-sourced dictionaries into a synthetic parallel corpus to avoid scraping target-language data. Using LoRA adapters on mT5-base, the approach achieves BLEU 42.02 on in-domain evaluation but only 0.59 against organic text, revealing a structural-semantic gap. An ablation with multi-task learning produced negative transfer, suggesting LoRA capacity limits conflict with auxiliary objectives. The study concludes synthetic bootstrapping is effective for structural priming but requires authentic data for semantic refinement via curriculum learning.
Introducing the Synthetic Data Generator - Build Datasets with Natural Language
Hugging Face has launched a Synthetic Data Generator tool that allows users to create datasets using natural language descriptions. The tool is designed to lower the barrier for dataset creation, enabling practitioners to generate training data without writing code. This is relevant to the broader trend of synthetic data as a scalable alternative to manual data collection and annotation.
SyGra: The One-Stop Framework for Building Data for LLMs and SLMs
ServiceNow AI introduces SyGra, a framework designed to streamline synthetic and curated data generation for training large and small language models. The framework aims to provide a unified pipeline covering data synthesis, filtering, and quality control for LLM/SLM development. The blog post appears on Hugging Face, positioning SyGra as a practical tooling contribution to the data preparation ecosystem.
Synthetic linguistic reasoning traces improve low-resource machine translation via in-context learning
Researchers propose a pipeline that generates step-by-step linguistic reasoning traces from Universal Dependencies treebanks, dictionaries, and grammar-rule banks to assist LLMs in translating extremely low-resource languages. Evaluated on Xibe and Chintang across ICL, SFT, and RFT settings, the traces prove most effective as inference-time guidance rather than as training data. Models can leverage reliable grammatical analyses at inference time but struggle to learn to generate accurate traces themselves, which makes trace generation quality the key bottleneck.
Synthetic Data: Save Money, Time and Carbon with Open Source
A Hugging Face blog post advocates for using synthetic data generation with open-source tools as a cost-effective, time-efficient, and environmentally friendlier alternative to real data collection and labeling. The post likely covers techniques and tooling available in the open-source ecosystem for generating synthetic training data. This is relevant to the broader trend of reducing dependency on expensive human-labeled datasets in ML pipelines.
Investing in Performance: Fine-tune small models with LLM insights — a CFM case study
This Hugging Face blog post presents a case study from CFM (Capital Fund Management) on using large language model outputs to guide fine-tuning of smaller, more efficient models for financial applications. The approach leverages LLM-generated signals or labels to train compact models that can be deployed at lower cost and latency. The case study illustrates an enterprise pattern of distilling LLM capabilities into task-specific smaller models for production use.

