
GPUs All Grown-Up: Fully Device-Driven SpMV Using GPU Work Graphs

Fabian Wildgrube¹, Pete Ehrett¹, Paul Trojahn¹, Richard Membarth²,⁴, Bradford Beckmann¹, Dominik Baumeister¹, Matthäus Chajdas¹,³

¹AMD  ²Technische Hochschule Ingolstadt  ³Intel  ⁴DFKI

Accepted to ISCA ’25: Proceedings of the 52nd Annual International Symposium on Computer Architecture

Abstract

Sparse matrix-vector multiplication (SpMV) is a key operation across high-performance computing, graph analytics, and many other applications. In these applications, the matrix characteristics, notably the number of non-zero elements per row, can vary widely and impact which algorithm performs best. Thus, Graphics Processing Unit (GPU) SpMV algorithms often rely on costly preprocessing to determine which per-row algorithm to select to achieve high performance. In this work, we combine SpMV preprocessing and the subsequent per-row processing on the GPU by leveraging the novel “Work Graphs” GPU programming model (initially designed for graphics applications) for dynamic on-device self-scheduling. Work Graphs allow for fine-grained dataflow execution of individual workgroups using emerging hardware and firmware support. As soon as preprocessing has generated sufficient work, workgroups of the individual processing kernels are self-scheduled and executed, interleaved with those of other kernels. This improves cache locality and eliminates host interaction altogether.

Across a suite of 59 sparse matrices, the best of our various novel Work Graphs SpMV implementations outperforms the state-of-the-art rocSPARSE “LRB” algorithm for a single SpMV by up to 7.19× (mean: 3.35×, SD: 1.89). Furthermore, it achieves much more stable performance across varying sparsity patterns than the rocSPARSE CSR-General algorithm, and even beats the advanced rocSPARSE CSR-Adaptive algorithm for up to 92 consecutive SpMV calculations. In addition, compared to rocSPARSE LRB, it reduces code complexity by 75%. Its memory footprint for supporting data structures is a fixed ∼25 MiB independent of matrix size, whereas rocSPARSE LRB’s data structures scale with matrix size to hundreds of megabytes. Overall, this work showcases the performance potential of emerging dynamic on-device scheduling techniques for GPU compute applications.

Download

  • Paper
