LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, Min Zhang
Harbin Institute of Technology, Shenzhen

Abstract

The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads.

To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism with a hardware-efficient top-k selection strategy. Specifically, a novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse those tokens for efficient computation.

Through extensive experiments on leading models such as Llama3 and Qwen3 across diverse benchmarks (LongBench, RULER, AIME24), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, the full-attention baseline. Crucially, this is accomplished with up to a 2.7× speedup at a 128K context length.


Methodology

To address the limitations of coarse-grained token sharing, we propose LycheeDecode, an efficient decoding method centered on a fine-grained Hybrid-Head attention mechanism. Specifically, we employ a HardKuma-based mechanism to partition attention heads into:

  • 🔍 Retrieval Heads: A small subset that dynamically identifies crucial tokens using full attention.
  • ⚡ Sparse Heads: A majority subset that reuses the identified tokens for efficient computation.
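One way to picture this division of labor is a single decode step in which retrieval heads run full attention and nominate their top-k positions, while sparse heads attend only to the pooled crucial-token set. The sketch below is illustrative only: the function name, tensor shapes, and union-of-top-k pooling are our assumptions, not the paper's fused kernels.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_head_decode_step(q, k_cache, v_cache, retrieval_mask, top_k=4):
    """One decoding step of a hybrid-head attention layer (illustrative sketch).
    q: (H, d) queries for the new token; k_cache, v_cache: (H, T, d);
    retrieval_mask: (H,) bool, True marks retrieval heads."""
    H, T, d = k_cache.shape
    out = np.empty((H, d))
    scale = 1.0 / np.sqrt(d)

    # 1) Retrieval heads: full attention over the entire KV cache,
    #    each nominating its top-k highest-scoring positions.
    votes = []
    for h in np.flatnonzero(retrieval_mask):
        s = (k_cache[h] @ q[h]) * scale               # (T,) scores
        out[h] = softmax(s) @ v_cache[h]
        votes.append(np.argpartition(-s, top_k)[:top_k])
    crucial = np.unique(np.concatenate(votes))        # shared crucial tokens

    # 2) Sparse heads: score and attend only the shared crucial tokens,
    #    so their per-step cost is O(|crucial|) instead of O(T).
    for h in np.flatnonzero(~retrieval_mask):
        s = (k_cache[h, crucial] @ q[h]) * scale      # (|crucial|,)
        out[h] = softmax(s) @ v_cache[h, crucial]
    return out, crucial
```

Because the sparse heads never touch the full cache, the decoding cost is dominated by the small retrieval subset, which is where the speedup comes from.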

By preserving the functional diversity of attention heads, LycheeDecode achieves generative quality comparable to the full-attention baseline while delivering significant speedups. Furthermore, the HardKuma-based formulation tackles the discrete optimization problem of identifying head types, allowing the model to learn a stable, near-binary selection mechanism end-to-end and thereby bridging the gap between training and inference.
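For intuition, the standard HardKuma construction (Bastings et al., 2019) stretches a Kumaraswamy sample beyond [0, 1] and then rectifies it, so that probability mass pushed past either boundary collapses to exactly 0 or 1 while staying differentiable in the shape parameters. The sketch below uses that standard recipe; the stretch bounds and parameterization are assumptions, not necessarily the paper's exact choices.

```python
import numpy as np

def hard_kuma_gate(a, b, u, l=-0.1, r=1.1):
    """One stretched-and-rectified Kumaraswamy ('HardKuma') sample.
    a, b > 0: learned shape parameters of a head's gate.
    u in (0, 1): uniform reparameterization noise.
    l < 0 < 1 < r: stretch bounds (illustrative values)."""
    # Inverse-CDF sample from Kumaraswamy(a, b); differentiable in a and b.
    k = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
    # Stretch to (l, r), then rectify back into [0, 1]: mass beyond the
    # boundaries collapses to exactly 0 (sparse head) or 1 (retrieval head).
    t = l + (r - l) * k
    return float(np.clip(t, 0.0, 1.0))
```

As training sharpens the shape parameters, samples land at exactly 0 or exactly 1 with high probability, which is what lets the learned selection transfer to inference without a relaxation gap.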

LycheeDecode Framework

Experiments

Training Dynamics of Kuma Distribution

Drag the training step slider below, or click on any cell in the heatmap to visualize the evolution of the specific Kuma distribution Probability Density Function (PDF) for individual attention heads.

  • Heatmap: Illustrates the expected probability of each head being identified as a Retrieval Head across training steps. As training progresses, the values quickly converge to either 0 (Sparse Head, blue) or 1 (Retrieval Head, red).
  • PDF Chart: Provides a microscopic view of this process for the selected head. Initially uniform at Step 0, the distributions undergo a dramatic transformation. You can observe how the probability mass effectively concentrates at the boundaries: shifting almost entirely to the right for Retrieval Heads, or collapsing to the left for Sparse Heads.
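The density in the PDF chart is the standard Kumaraswamy density, f(x; a, b) = a·b·x^(a−1)·(1 − x^a)^(b−1) on (0, 1). A minimal sketch for reproducing the curves (the shape values below are illustrative, not trained parameters from the paper):

```python
import numpy as np

def kuma_pdf(x, a, b):
    """PDF of the Kumaraswamy(a, b) distribution on (0, 1):
    f(x) = a * b * x**(a-1) * (1 - x**a)**(b-1)."""
    return a * b * x ** (a - 1.0) * (1.0 - x ** a) ** (b - 1.0)

# a = b = 1 recovers the uniform density seen at step 0; skewed shapes
# (e.g. a >> 1 with b = 1) push the mass toward the right boundary,
# matching the convergence toward Retrieval-Head gates in the chart.
```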

BibTeX

    @misc{lin2026lycheedecodeacceleratinglongcontextllm,
          title={LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding},
          author={Gang Lin and Dongfang Li and Zhuoen Chen and Yukun Shi and Xuhui Chen and Baotian Hu and Min Zhang},
          year={2026},
          eprint={2602.04541},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2602.04541}
    }