Fast inference with Kronecker-sparse matrices - Université Lumière Lyon 2
Preprint, Working Paper. Year: 2024

Fast inference with Kronecker-sparse matrices

Abstract

This paper benchmarks and improves existing GPU matrix multiplication algorithms specialized for Kronecker-sparse matrices, whose sparsity patterns are described by Kronecker products. These matrices have recently gained popularity as replacements for dense matrices in neural networks because they preserve accuracy while using fewer parameters. We present the first energy and time benchmarks for multiplication with such matrices, helping users identify scenarios where Kronecker-sparse matrices are more time- and energy-efficient than their dense counterparts. Our benchmark also reveals that specialized implementations spend up to 50% of their total runtime on memory rewriting operations. To reduce these memory transfers, we introduce a new tiling strategy adapted to the Kronecker-sparsity structure, which reduces reads and writes between levels of GPU memory. We implement this tiling strategy in a new CUDA kernel that achieves a median speed-up of 1.4× while also cutting energy consumption by 15%. We further demonstrate the broader impact of our results by applying the new kernel to accelerate transformer inference.
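To make "sparsity patterns described by Kronecker products" concrete, here is a minimal sketch in plain Python. It assumes a support of the form I_a ⊗ 1_{b×c} ⊗ I_d (identity, all-ones block, identity), which is the kind of pattern found in butterfly-style factorizations. The abstract does not spell out the exact form, so the parameters `a`, `b`, `c`, `d` and all function names below are illustrative assumptions, not the authors' code.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of lists."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in A
        for row_b in B
    ]


def identity(n):
    """n x n identity matrix with integer entries."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def ones(rows, cols):
    """rows x cols matrix filled with ones."""
    return [[1] * cols for _ in range(rows)]


def kronecker_sparse_support(a, b, c, d):
    """0/1 support I_a (x) 1_{b x c} (x) I_d, of shape (a*b*d, a*c*d).

    Hypothetical helper: the (a, b, c, d) parametrization is an assumption.
    """
    return kron(kron(identity(a), ones(b, c)), identity(d))


support = kronecker_sparse_support(2, 2, 2, 2)
nnz = sum(map(sum, support))               # allowed nonzero entries
dense = len(support) * len(support[0])     # entries of a dense matrix of the same shape

for row in support:
    print("".join("x" if v else "." for v in row))
print(f"{nnz} nonzeros out of {dense} entries")
```

Under this assumed pattern, only a·b·c·d entries may be nonzero, compared with (a·b·d)·(a·c·d) for a dense matrix of the same shape, which is where the parameter savings mentioned in the abstract would come from.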
Main file
main.pdf (1.16 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-04584450, version 1 (23-05-2024)
hal-04584450, version 2 (23-05-2024)
hal-04584450, version 3 (08-10-2024)
hal-04584450, version 4 (03-11-2024)

Identifiers

  • HAL Id: hal-04584450, version 4

Cite

Antoine Gonon, Léon Zheng, Pascal Carrivain, Quoc-Tung Le. Fast inference with Kronecker-sparse matrices. 2024. ⟨hal-04584450v4⟩