Tagged "speculative-decoding"
- Prefill Is Compute-Bound, Decode Is Memory-Bound: Optimizing GPU Utilization for LLM Inference
- DFlash Doubles Token Generation Speed of Qwen3.5 27B on Mac M5 Max
- Speculative Decoding Achieves 29% Speed Boost for Gemma-4 31B
- DFlash Speculative Decoding Achieves 3.3x Speedup on Apple Silicon
- Speculative Decoding Made My Local LLM Actually Usable
- I Replaced My Local LLM With a Model Half Its Size and Got Better Results — and It Wasn't About the Parameters
- Qwen 3.5 27B Achieves 1.1M Tokens/Second on B200 GPUs with Optimized vLLM Config
- P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM
- Qwen 3.5-27B Demonstrates Exceptional Performance with Thoughtful Prompt Engineering