Almanac
technique

Graft

techniqueactivegraft-bda63d87·1 events·first seen 27d ago

Aliases: Graft

Co-occurring entities

More like this (12)

Recent events (1)

6arXiv · cs.AI·27d ago·source ↗

Graft: Hybrid Tree Construction for Speculative Decoding via Prune-Then-Retrieve

Graft is a training-free framework that improves speculative decoding by coupling dynamic-depth pruning with retrieval-based token compensation. Pruning reduces VRAM and compute overhead while freeing budget for retrieval, which fills topological gaps in the draft tree with near-zero additional cost. On short-context benchmarks, Graft achieves up to 5.41× speedup and improves average speedup over EAGLE-3 by up to 21.8% on Qwen3-235B. The method is evaluated across short- and long-context settings and extended to block-drafting paradigms.