Tagged "speculative-decoding"

Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference 27 May 2026
DFlash Speculative Decoding Delivers 8.5x Speed Improvement for LLM Inference 11 May 2026
Google Releases Gemma 4 Multi-Token Prediction Drafters To Accelerate AI Inference 8 May 2026
Google Accelerates Gemma 4 Inference Speed 3x With Multi-Token Prediction Drafters 6 May 2026
Supercharging LLM Inference on Google TPUs: Achieving 3X Speedups With Diffusion-Style Speculative Decoding 5 May 2026
Prefill Is Compute-Bound, Decode Is Memory-Bound: Optimizing GPU Utilization for LLM Inference 16 April 2026
DFlash Doubles Token Generation Speed of Qwen3.5 27B on Mac M5 Max 15 April 2026
Speculative Decoding Achieves 29% Speed Boost for Gemma-4 31B 13 April 2026
DFlash Speculative Decoding Achieves 3.3x Speedup on Apple Silicon 12 April 2026
Speculative Decoding Made My Local LLM Actually Usable 9 April 2026
I Replaced My Local LLM With a Model Half Its Size and Got Better Results — and It Wasn't About the Parameters 9 April 2026
Qwen 3.5 27B Achieves 1.1M Tokens/Second on B200 GPUs with Optimized vLLM Config 27 March 2026
P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM 14 March 2026
Qwen 3.5-27B Demonstrates Exceptional Performance with Thoughtful Prompt Engineering 28 February 2026