This paper introduces SPIN, an efficient LLM inference serving system that leverages speculative decoding. SPIN addresses the limitations of current approaches through three techniques: dynamic selection among heterogeneous small speculative models (SSMs), batch-optimized request decomposition during verification, and pipelined coordination of speculation and verification execution on the GPU. Together, these techniques achieve a 2.28× performance improvement over state-of-the-art methods.
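The summary above does not spell out the selection policy, so as a rough illustration only, the sketch below shows one plausible way dynamic SSM selection could work: track each SSM's draft-token acceptance rate online and pick, per speculation step, the model with the best expected accepted-tokens-per-second of drafting time. The epsilon-greedy policy, the `SSMStats`/`select_ssm` names, and the model pool are all hypothetical illustrations, not SPIN's actual algorithm.

```python
import random
from dataclasses import dataclass


@dataclass
class SSMStats:
    """Running statistics for one small speculative model (hypothetical)."""
    draft_latency: float  # avg seconds to draft `gamma` tokens
    accepted: int = 1     # accepted draft tokens (optimistic prior)
    proposed: int = 1     # proposed draft tokens

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.proposed


def select_ssm(stats: dict[str, SSMStats], gamma: int, epsilon: float = 0.1) -> str:
    """Epsilon-greedy choice of the SSM maximizing expected accepted tokens
    per second of drafting time (a proxy for speculation goodput).
    This policy is an assumption, not necessarily what SPIN uses."""
    if random.random() < epsilon:
        return random.choice(list(stats))  # explore a random SSM
    return max(
        stats,
        key=lambda name: (stats[name].acceptance_rate * gamma)
        / stats[name].draft_latency,
    )  # exploit the best-scoring SSM


def update_stats(stats: dict[str, SSMStats], name: str,
                 proposed: int, accepted: int) -> None:
    """Fold verification feedback back into the selector's statistics."""
    stats[name].proposed += proposed
    stats[name].accepted += accepted


if __name__ == "__main__":
    # Hypothetical pool of heterogeneous SSMs with different draft costs.
    pool = {
        "ssm-68m": SSMStats(draft_latency=0.004),
        "ssm-160m": SSMStats(draft_latency=0.009),
        "ssm-410m": SSMStats(draft_latency=0.020),
    }
    gamma = 4  # draft tokens per speculation step
    for _ in range(100):
        chosen = select_ssm(pool, gamma)
        # Stand-in for real drafting plus target-model verification:
        accepted = random.randint(0, gamma)
        update_stats(pool, chosen, proposed=gamma, accepted=accepted)
```

In this toy setup the selector converges toward whichever SSM offers the best acceptance-rate-to-latency trade-off for the current workload, which captures the intuition behind choosing among heterogeneous SSMs dynamically rather than fixing one draft model for all requests.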