Fast inference with Kronecker-sparse matrices

Antoine Gonon; Léon Zheng; Pascal Carrivain; Quoc-Tung Le

Preprint, Working Paper. Year: 2024

Antoine Gonon

Role: Author

Optimisation, Connaissances pHysiques, Algorithmes et Modèles

Arithmétiques des ordinateurs, méthodes formelles, génération de code

Léon Zheng

Role: Author
PersonId : 752426
IdHAL : leonzheng

Valeo.ai

Optimisation, Connaissances pHysiques, Algorithmes et Modèles

Pascal Carrivain

Role: Author
PersonId : 938245

Optimisation, Connaissances pHysiques, Algorithmes et Modèles

Quoc-Tung Le

Role: Author
PersonId : 752424
IdHAL : quoc-tung-le
ORCID : 0009-0009-9952-2194

Toulouse School of Economics

Abstract

This paper benchmarks and improves existing GPU matrix multiplication algorithms specialized for Kronecker-sparse matrices, whose sparsity patterns are described by Kronecker products. These matrices have recently gained popularity as replacements for dense matrices in neural networks because they preserve accuracy while using fewer parameters. We present the first energy and time benchmarks for multiplication with such matrices, helping users identify scenarios where Kronecker-sparse matrices are more time- and energy-efficient than their dense counterparts. Our benchmark also reveals that specialized implementations spend up to 50% of their total runtime on memory rewriting operations. To reduce these memory transfers, we introduce a new tiling strategy adapted to the Kronecker-sparsity structure, which reduces reads and writes between levels of GPU memory. We implement this tiling strategy in a new CUDA kernel that achieves a median speed-up of 1.4× while also cutting energy consumption by 15%. We further demonstrate the broader impact of our results by applying the new kernel to accelerate transformer inference.
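
To illustrate what a sparsity pattern "described by Kronecker products" can look like, here is a minimal pure-Python sketch. It assumes the support has the form I_a ⊗ 1_{b×c} ⊗ I_d for a pattern (a, b, c, d). That parametrization is an assumption made for illustration and is not stated in this record. All function names are hypothetical, and the code is a plain CPU reference, not the CUDA kernel the abstract refers to.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of lists."""
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]


def identity(n):
    """n x n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def ones(rows, cols):
    """rows x cols all-ones matrix."""
    return [[1] * cols for _ in range(rows)]


def kronecker_support(a, b, c, d):
    """0/1 mask I_a (x) 1_{b x c} (x) I_d (assumed pattern parametrization)."""
    return kron(kron(identity(a), ones(b, c)), identity(d))


def masked_matvec(values, support, x):
    """Matrix-vector product that only uses entries inside the support."""
    return [
        sum(v * xj for v, s, xj in zip(row_v, row_s, x) if s)
        for row_v, row_s in zip(values, support)
    ]


# Hypothetical small pattern chosen only for illustration.
support = kronecker_support(2, 2, 2, 2)
n_rows, n_cols = len(support), len(support[0])
nnz = sum(map(sum, support))
print(f"{n_rows}x{n_cols} matrix, {nnz} nonzeros out of {n_rows * n_cols}")

# Multiply an all-ones Kronecker-sparse matrix by an all-ones vector.
values = ones(n_rows, n_cols)
y = masked_matvec(values, support, [1] * n_cols)
```

Under this assumption, the pattern (2, 2, 2, 2) gives an 8×8 matrix with 16 nonzero entries out of 64, which shows how such a matrix can use fewer parameters than a dense matrix of the same size.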

Domains

Artificial Intelligence [cs.AI]; Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

main.pdf (1.16 MB)

Origin: Files produced by the author(s)

https://hal.science/hal-04584450

Submitted on: Sunday, November 3, 2024, 18:27:53

Last modified on: Tuesday, November 12, 2024, 15:50:02

Dates and versions

hal-04584450 , version 1 (23-05-2024)

hal-04584450 , version 2 (23-05-2024)

hal-04584450 , version 3 (08-10-2024)

hal-04584450 , version 4 (03-11-2024)

Identifiers

HAL Id : hal-04584450 , version 4

Cite

Antoine Gonon, Léon Zheng, Pascal Carrivain, Quoc-Tung Le. Fast inference with Kronecker-sparse matrices. 2024. ⟨hal-04584450v4⟩

Collections

ENS-LYON UNIV-LYON3 UGA CNRS INRIA UNIV-LYON1 UNIV-LYON2 INSA-LYON EHESS UT1-CAPITOLE INRIA2 INSA-GROUPE UDL INRAE ANR PEPR_IA

226 views

133 downloads
