Abstract
The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads.
To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism with a hardware-efficient top-k selection strategy. Specifically, a novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads, which dynamically identify crucial tokens, and a majority of sparse heads, which reuse those tokens for efficient computation.
Through extensive experiments on leading models such as Llama3 and Qwen3 across diverse benchmarks (LongBench, RULER, AIME24), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, the full-attention baseline. Crucially, this is accomplished with up to 2.7× speedup at a 128K context length.
Methodology
To address the limitations of coarse-grained token sharing, we propose LycheeDecode, an efficient decoding method centered on a fine-grained Hybrid-Head attention mechanism. Specifically, we employ a HardKuma-based mechanism to partition attention heads into:
- 🔍 Retrieval Heads: A small subset that dynamically identifies crucial tokens using full attention.
- ⚡ Sparse Heads: A majority subset that reuses the identified tokens for efficient computation.
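The division of labor above can be illustrated with a minimal single-step decoding sketch. This is not the paper's implementation: the function name, the mean-pooling of retrieval-head scores, and the per-head loop are all simplifying assumptions made for clarity (a real kernel would batch the sparse gather and run on GPU).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_head_decode_step(q, K, V, retrieval_mask, k=4):
    """One decoding step with hybrid-head sparse attention (illustrative).

    q: (H, d) query of the newly decoded token
    K, V: (H, T, d) key-value cache
    retrieval_mask: (H,) bool, True for retrieval heads
    k: number of crucial tokens the sparse heads reuse
    """
    H, T, d = K.shape
    scores = np.einsum('hd,htd->ht', q, K) / np.sqrt(d)   # (H, T)
    # Retrieval heads score the full cache; pool their scores
    # (assumption: mean pooling) to pick the crucial token set.
    pooled = scores[retrieval_mask].mean(axis=0)          # (T,)
    topk = np.argsort(pooled)[-k:]                        # crucial token indices
    out = np.empty((H, d))
    for h in range(H):
        if retrieval_mask[h]:
            out[h] = softmax(scores[h]) @ V[h]            # full attention
        else:
            w = softmax(scores[h, topk])                  # sparse attention
            out[h] = w @ V[h, topk]                       # over crucial tokens only
    return out, topk
```

Because only the few retrieval heads touch the full cache, the memory traffic of the majority sparse heads scales with k rather than with the context length T.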
By preserving the functional diversity of attention heads, LycheeDecode achieves generative quality comparable to the full-attention baseline while delivering significant speedups. Furthermore, the HardKuma-based formulation tackles the discrete optimization problem of identifying head types: the model learns a stable, near-binary selection mechanism end-to-end, bridging the gap between training and inference.
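The near-binary gating can be sketched with the standard stretched-and-rectified Kumaraswamy ("HardKuma") sampler. The function name, the stretch bounds (-0.1, 1.1), and the per-head scalar shape parameters are assumptions for illustration; the key property is that the rectification places non-zero probability mass exactly at 0 and 1, so head-type gates become effectively discrete while staying differentiable in the interior.

```python
import numpy as np

def hardkuma_sample(a, b, rng, l=-0.1, r=1.1):
    """Sample hard-rectified Kumaraswamy gates (one per head, illustrative).

    a, b: positive shape parameters (learned per head in practice)
    l, r: stretch interval around [0, 1] (assumed values)
    Returns gates in [0, 1] that hit exactly 0 or 1 with non-zero probability.
    """
    u = rng.uniform(size=np.shape(a))
    t = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)  # Kumaraswamy inverse CDF
    s = t * (r - l) + l                               # stretch to (l, r)
    return np.clip(s, 0.0, 1.0)                       # rectify: mass at {0, 1}
```

During training, gradients flow through `a` and `b` for samples landing strictly inside (0, 1); at convergence most gates sit at exactly 0 (sparse head) or 1 (retrieval head), so inference needs no sampling at all.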
Experiments
Long Context Understanding: Evaluated on LongBench across 8 tasks using Llama-3-8B and Qwen3-8B. LycheeDecode achieves the highest average score, outperforming advanced sparse attention baselines and even surpassing the Full Attention model.
Complex Reasoning Task: Tested on challenging math reasoning benchmarks (e.g., AIME24, Minerva) with DeepSeek-R1-Distill models. Combined with a Cache Correction strategy, LycheeDecode significantly outperforms the Full Attention baseline by effectively filtering out irrelevant noisy context.
End-to-End Latency: Measured via Time Per Output Token (TPOT). As context length grows, LycheeDecode consistently maintains low latency, achieving up to 2.7× speedup over the Full Attention baseline at a 128K context length.
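The TPOT metric is simply wall-clock decode time divided by the number of generated tokens. A minimal harness, with `generate_fn` as a hypothetical stand-in for the model's decoding loop:

```python
import time

def measure_tpot(generate_fn, prompt, max_new_tokens):
    """Time Per Output Token (seconds/token).

    generate_fn: hypothetical callable (prompt, max_new_tokens) -> list of tokens;
    a real benchmark would exclude the prefill phase and warm-up iterations.
    """
    start = time.perf_counter()
    tokens = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return elapsed / len(tokens)
```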
Kernel-level Latency: A custom hybrid-head block-sparse decoding kernel implemented via TileLang. It effectively overcomes I/O bottlenecks, delivering a peak speedup of up to 7× compared to FlashAttention-2 under high sparsity and large batch sizes.
Training Dynamics of Kuma Distribution
Drag the training-step slider below, or click any cell in the heatmap, to visualize how the probability density function (PDF) of the Kuma distribution evolves for individual attention heads over training.
BibTeX
@misc{lin2026lycheedecodeacceleratinglongcontextllm,
  title={LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding},
  author={Gang Lin and Dongfang Li and Zhuoen Chen and Yukun Shi and Xuhui Chen and Baotian Hu and Min Zhang},
  year={2026},
  eprint={2602.04541},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.04541},
}