TurboSparse: Elite Inference Speed via dReLU Sparsity

3 Mar 2026

Abstract and 1. Introduction

  2. Related Work and Background

  3. Analysis

    3.1 Limitations of Existing ReLUfication

    3.2 dReLU

  4. Are Neurons in Expert still Sparsely Activated?

  5. dReLU Sparsification

  6. Experiments Results

    6.1 Downstream Tasks Performance

    6.2 Sparsity of Sparsified Models

  7. Practical Inference Speedup Evaluation

    7.1 Experiments Setting

    7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

    7.4 Deploy LLMs on mobile phones

  8. Conclusion and References

A. Appendix / supplemental material

B. Limitation

C. Broader Impact

7.1 Experiments Setting

Baselines. We take llama.cpp [20] as our baseline for comparison, as it is the most representative inference framework.

Models. For PowerInfer and PowerInfer-2 [62], we deployed our sparsified models, while for llama.cpp, we employed the original models for speed comparison.

Hardware Configurations. All experiments were conducted on three distinct configurations:

• PC-Laptop: Intel i9-14900HX processor, 32GB host memory (67.2 GB/s bandwidth), an NVIDIA RTX 4090 GPU (16GB), and PCIe 4.0 interface (64GB/s bandwidth).

• PC-2080Ti: Intel i7-12700K processor (eight 4.9GHz cores), 64GB host memory (38.4 GB/s bandwidth), an NVIDIA RTX 2080Ti GPU (11GB), and PCIe 3.0 interface (32GB/s bandwidth).

• OnePlus-12: Equipped with a Snapdragon 8 Gen 3 SoC, 24 GB DRAM, and UFS 4.0 storage.
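The host-memory bandwidths above largely determine CPU decoding speed: generating one token streams the activated weights through memory once, so throughput is roughly bandwidth divided by activated-weight bytes. The sketch below is a back-of-the-envelope roofline, not a result from the paper; the FP16 weight size and the ~2.5B activated-parameter figure for a sparsified 7B model are illustrative assumptions.

```python
def decode_tokens_per_s(activated_params: float,
                        bytes_per_param: float,
                        bandwidth_gbs: float) -> float:
    """Memory-bandwidth roofline for autoregressive decoding:
    each generated token must read the activated weights once."""
    bytes_per_token = activated_params * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

# Host-memory bandwidths from the hardware configurations above (GB/s).
laptop_bw = 67.2    # PC-Laptop
pc2080ti_bw = 38.4  # PC-2080Ti

# Illustrative: dense FP16 7B model vs. a hypothetical ~2.5B activated subset.
dense = decode_tokens_per_s(7.0e9, 2.0, laptop_bw)
sparse = decode_tokens_per_s(2.5e9, 2.0, laptop_bw)

print(f"dense:  {dense:.1f} tok/s")
print(f"sparse: {sparse:.1f} tok/s ({sparse / dense:.1f}x)")
```

Under this model, the attainable speedup is simply the ratio of dense to activated parameters, which is why activation sparsity translates so directly into decoding throughput on bandwidth-bound CPU hardware.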

Authors:

(1) Yixin Song, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Haotong Xie, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zhengyan Zhang, Department of Computer Science and Technology, Tsinghua University;

(4) Bo Wen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Li Ma, Shanghai Artificial Intelligence Laboratory;

(6) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);

(7) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.


This paper is available on arxiv under CC BY 4.0 license.