GPUs All Grown-Up: Fully Device-Driven SpMV Using GPU Work Graphs
Fabian Wildgrube¹, Pete Ehrett¹, Paul Trojahn¹, Richard Membarth²,⁴, Bradford Beckmann¹, Dominik Baumeister¹, Matthäus Chajdas¹,³
¹AMD, ²Technische Hochschule Ingolstadt, ³Intel, ⁴DFKI
Accepted to ISCA ’25: Proceedings of the 52nd Annual International Symposium on Computer Architecture
Abstract
Sparse matrix-vector multiplication (SpMV) is a key operation across high-performance computing, graph analytics, and many other applications. In these applications, the matrix characteristics, notably the number of non-zero elements per row, can vary widely and impact which algorithm performs best. Thus, Graphics Processing Unit (GPU) SpMV algorithms often rely on costly preprocessing to determine which per-row algorithm to select for high performance. In this work, we combine SpMV preprocessing and the subsequent per-row processing on the GPU by leveraging the novel "Work Graphs" GPU programming model (initially designed for graphics applications) for dynamic on-device self-scheduling. Work Graphs allow for fine-grain dataflow execution of individual workgroups using emerging hardware and firmware support. As soon as preprocessing has generated sufficient work, workgroups of the individual processing kernels are self-scheduled and executed, interleaved with those of other kernels (see the illustrative sketch following this abstract). This improves cache locality and eliminates host interaction altogether.
Across a suite of 59 sparse matrices, the best of several novel Work Graphs SpMV implementations outperforms the state-of-the-art rocSPARSE "LRB" algorithm for a single SpMV by up to 7.19× (mean: 3.35×, SD: 1.89). Furthermore, it achieves much more stable performance across varied sparsity patterns than the rocSPARSE CSR-General algorithm, and even beats the advanced rocSPARSE CSR-Adaptive algorithm for up to 92 consecutive SpMV calculations. In addition, compared to rocSPARSE LRB, it reduces code complexity by 75%. Its memory footprint for supporting data structures is a fixed ∼25 MiB, independent of matrix size, whereas rocSPARSE LRB's data structures scale with matrix size to hundreds of megabytes. Overall, this work showcases the performance potential of emerging dynamic on-device scheduling techniques for GPU compute applications.
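To illustrate the programming model the abstract describes, the HLSL sketch below shows a minimal hypothetical two-node work graph, not the paper's actual implementation: a broadcasting "classifier" node bins CSR rows by non-zero count, handles short rows inline, and emits records that a coalescing "long row" node consumes with a full workgroup. All record layouts, node and buffer names, and the 64-nnz threshold are illustrative assumptions.

```hlsl
// Illustrative sketch only: record types, node names, bindings, and the
// 64-nnz threshold are assumptions, not the paper's implementation.

struct RowBatch
{
    uint3 grid : SV_DispatchGrid; // workgroup count, filled in by the producer
    uint  firstRow;
    uint  numRows;
};

struct LongRow { uint row; };

StructuredBuffer<uint>    rowOffsets : register(t0); // CSR row pointers
StructuredBuffer<uint>    colIndices : register(t1); // CSR column indices
StructuredBuffer<float>   values     : register(t2); // CSR non-zero values
StructuredBuffer<float>   x          : register(t3); // input vector
RWStructuredBuffer<float> y          : register(u0); // output vector

// Preprocessing node: one thread per row. Short rows are computed in place;
// long rows are forwarded as records to the ProcessLongRow node, whose
// workgroups the GPU self-schedules as soon as records become available.
[Shader("node")]
[NodeLaunch("broadcasting")]
[NodeMaxDispatchGrid(65535, 1, 1)]
[NumThreads(64, 1, 1)]
void ClassifyRows(
    DispatchNodeInputRecord<RowBatch> batch,
    [MaxRecords(64)] NodeOutput<LongRow> ProcessLongRow,
    uint dtid : SV_DispatchThreadID)
{
    uint row    = batch.Get().firstRow + dtid;
    bool inGrid = dtid < batch.Get().numRows;
    uint nnz    = inGrid ? rowOffsets[row + 1] - rowOffsets[row] : 0;
    bool isLong = nnz > 64; // illustrative threshold

    // The allocation call is group-uniform; each thread requests 0 or 1 records.
    ThreadNodeOutputRecords<LongRow> rec =
        ProcessLongRow.GetThreadNodeOutputRecords(isLong ? 1 : 0);
    if (isLong)
        rec.Get().row = row;
    rec.OutputComplete();

    if (inGrid && !isLong) // short row: sequential dot product on one thread
    {
        float sum = 0.0f;
        for (uint i = rowOffsets[row]; i < rowOffsets[row + 1]; ++i)
            sum += values[i] * x[colIndices[i]];
        y[row] = sum;
    }
}

groupshared float partial[64];

// Processing node: one full workgroup strides across a single long row.
[Shader("node")]
[NodeLaunch("coalescing")]
[NumThreads(64, 1, 1)]
void ProcessLongRow(
    [MaxRecords(1)] GroupNodeInputRecords<LongRow> rows,
    uint gtid : SV_GroupThreadID)
{
    uint row = rows[0].row;
    float sum = 0.0f;
    for (uint i = rowOffsets[row] + gtid; i < rowOffsets[row + 1]; i += 64)
        sum += values[i] * x[colIndices[i]];

    // Tree reduction in LDS; one thread writes the final dot product.
    partial[gtid] = sum;
    GroupMemoryBarrierWithGroupSync();
    for (uint s = 32; s > 0; s >>= 1)
    {
        if (gtid < s)
            partial[gtid] += partial[gtid + s];
        GroupMemoryBarrierWithGroupSync();
    }
    if (gtid == 0)
        y[row] = partial[0];
}
```

A full implementation would add further nodes for additional row-length buckets; the point of the sketch is only that producer and consumer workgroups interleave on-device, with no host round trip between preprocessing and processing.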