4 · arXiv cs.AI (Artificial Intelligence) · 2d ago

FreeStyle: Dual-Reference Image Generation via Community LoRA Mining with Leakage Suppression

FreeStyle is a new framework for style-content dual-reference image generation that mines community LoRA models as compositional anchors to construct large-scale training triplets. The approach uses a two-stage curriculum with attention-level enrichment constraints and frequency-aware RoPE modulation to suppress semantic leakage between style and content references. The authors also introduce a benchmark with novel metrics including a style-invariant Content Alignment Score and a VLM-based Rejection Score for evaluating leakage suppression.

Multimodal Progress · FreeStyle · FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining · RoPE

Related guides (1)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Related events (8)

6 · Hugging Face Blog · 1mo ago

SDXL in 4 Steps with Latent Consistency LoRAs

Hugging Face demonstrates combining Latent Consistency Models (LCMs) with LoRA adapters to enable high-quality image generation with Stable Diffusion XL in as few as 4 inference steps. This approach dramatically reduces the number of diffusion steps required compared to standard SDXL, lowering inference latency and compute cost. The technique leverages consistency distillation applied via lightweight LoRA weights, making it accessible without full model retraining.

Inference Economics · Agent and Tool Ecosystem · LoRA · Stable Diffusion 3 · Hugging Face

4 · Hugging Face Blog · 1mo ago

Fast LoRA inference for Flux with Diffusers and PEFT

Hugging Face published a technical blog post detailing optimizations for LoRA inference speed with the Flux image generation model using the Diffusers and PEFT libraries. The post covers techniques to accelerate adapter loading and inference throughput for diffusion models. This is relevant to practitioners deploying fine-tuned image generation models in production or research settings.

Inference Economics · Agent and Tool Ecosystem · PEFT · LoRA · Hugging Face

5 · arXiv · cs.LG · 26d ago

Squeezing Capacity from MLLMs for Subject-driven Image Generation via Dual Layer Aggregation

This paper proposes conditioning diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, augmented with VAE-based identity conditioning to address copy-paste artifacts and identity preservation failures in subject-driven image generation. A Dual Layer Aggregation (DLA) module aggregates multi-level MLLM features, and a multi-stage denoising strategy progressively balances semantic and fine-detail identity signals during inference. Experiments show improved human preference scores on subject-driven generation benchmarks compared to prior approaches that encode text and reference images separately.

Agent and Tool Ecosystem · Multimodal Progress · Multimodal Large Language Models · Dual Layer Aggregation (DLA) · Subject-driven Image Generation

4 · Hugging Face Blog · 1mo ago

LoRA Training Scripts of the World, Unite!

Hugging Face published a blog post consolidating and comparing advanced LoRA fine-tuning scripts for Stable Diffusion XL, covering techniques such as pivotal tuning, custom captions, and various regularization strategies. The post aims to unify fragmented community training approaches into a more coherent set of best practices. It serves as a practical guide for practitioners fine-tuning SDXL models with LoRA adapters.

Open Weights Progress · Agent and Tool Ecosystem · LoRA · Stable Diffusion 3 · Pivotal Tuning

6 · arXiv · cs.CL · 22d ago

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

This paper identifies a 'carrier sensitivity' problem in Vision-Language Models (VLMs), where replacing textual queries with rendered-image equivalents causes significant performance degradation due to asymmetric roles of text and images in training data. The authors propose Local Modality Substitution (LoMo), a data curation paradigm that reformulates single-modality prompts into interleaved multimodal sequences by dynamically rendering text spans as images, enforcing cross-modal representational invariance. Evaluated across 13 multimodal benchmarks, LoMo improves over standard supervised fine-tuning by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B. The approach is architecture-agnostic and lightweight, requiring no changes to model architecture.

Evaluation and Benchmarking · Alignment and RLHF · LoMo · LLaVA-OneVision-1.5-8B · Qwen3-4B

5 · arXiv · cs.AI · 15d ago

Code2LoRA: Hypernetwork generates repository-specific LoRA adapters for code models with zero token overhead

Code2LoRA is a hypernetwork framework that generates repository-specific LoRA adapters for code language models, eliminating the inference-time token overhead of RAG or long-context injection. It supports both static repository snapshots and evolving codebases via a GRU-backed adapter updated per code diff. The authors introduce RepoPeftBench, a new benchmark of 604 Python repositories with static and evolution tracks, on which Code2LoRA-Static matches per-repository LoRA fine-tuning upper bounds and Code2LoRA-Evo outperforms a shared LoRA by 5.2 percentage points.

Evaluation and Benchmarking · Agent and Tool Ecosystem · RepoPeftBench · LoRA · GRU

5 · arXiv · cs.CL · 2d ago

StylisticBias benchmark reveals a small set of visual cues drives most social bias in MLLMs

Researchers introduce StylisticBias, a controlled benchmark of ~25K photorealistic face images with single-attribute variations designed to isolate how specific visual cues shift social judgments in multimodal LLMs. Evaluating six MLLMs across 25 binary social judgment scenarios, they find that age and body type dominate identity-level effects, while fashion style drives the largest attribute-level shifts, with ~15 attributes accounting for ~80% of total bias variation. The benchmark is released publicly on GitHub and Hugging Face, enabling fine-grained bias auditing of multimodal models.

Evaluation and Benchmarking · AI Safety Research · StylisticBias · StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

5 · Hugging Face Blog · 1mo ago

(LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware

This Hugging Face blog post covers techniques for fine-tuning the FLUX.1-dev image generation model using LoRA (Low-Rank Adaptation) on consumer-grade hardware. The post likely addresses quantization strategies (QLoRA) to reduce memory requirements, enabling training on GPUs with limited VRAM. This is relevant to the open-weights and accessible fine-tuning ecosystem for diffusion models.

Open Weights Progress · Inference Economics · Black Forest Labs · FLUX.1-dev · LoRA
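
Most of the items above build on LoRA (Low-Rank Adaptation), which keeps a pretrained weight matrix W frozen and trains only a low-rank update, so the adapted layer computes W x + (alpha / r) · B (A x). The sketch below is a generic illustration of that idea in plain Python and is not taken from any of the linked papers or posts. The helper names (`lora_param_counts`, `lora_forward`) and the example layer width of 3072 with rank 16 are assumptions chosen only for illustration.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one linear layer: (full fine-tune, LoRA).

    Full fine-tuning updates every entry of the d_out x d_in matrix W.
    LoRA trains only A (rank x d_in) and B (d_out x rank).
    """
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora


def lora_forward(W, A, B, x, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x) with W kept frozen."""
    base = matvec(W, x)                # frozen pretrained path
    delta = matvec(B, matvec(A, x))    # low-rank adapter path
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


# Hypothetical layer size, chosen only to show the order of magnitude.
full, lora = lora_param_counts(d_out=3072, d_in=3072, rank=16)
print(f"full: {full:,} params, LoRA: {lora:,} params ({lora / full:.2%})")

# Tiny numeric check: a 2x2 identity W with a rank-1 adapter.
W = [[1, 0], [0, 1]]
A = [[1, 1]]        # rank x d_in
B = [[1], [2]]      # d_out x rank
print(lora_forward(W, A, B, x=[2, 3], alpha=2, rank=1))
```

Because only A and B are trained, an adapter is small enough to share and swap independently of the base model, which is what makes community LoRA collections and fine-tuning on consumer hardware practical.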