Loop Unrolling Impact on CUDA Matrix Multiplication Operations
Date Issued
2024-11-26
Author(s)
Stefkovski, Vojdan
Mileski, Dimitar
Gusev, Marjan
DOI
10.1109/telfor63250.2024.10819077
Abstract
This paper investigates the impact of loop unrolling on the performance of CUDA matrix multiplication across NVIDIA GPUs. We benchmarked basic and unrolled kernels with unroll factors of 2, 4, 8, and 16 and CUDA block sizes of 8, 16, and 32 on matrices ranging from 128 × 128 to 4096 × 4096. Using two GPUs, the GeForce RTX 4060 and the GTX TITAN X, we analyzed how the unroll factor affects execution time. Our findings indicate that loop unrolling, particularly with factors of 8 and 16 and a block size of 32, yields significant performance gains on larger matrices. These results confirm loop unrolling as an effective optimization technique for CUDA matrix operations and offer guidance for developers seeking to improve computational efficiency across different GPU architectures.
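To make the transformation concrete, the following is a minimal host-side sketch in C++ of what unrolling the inner product loop of a matrix multiplication looks like. It is not the paper's kernel code: the function names `matmul_basic` and `matmul_unrolled4` are hypothetical, the unroll factor of 4 is picked as one of the factors the abstract lists, and the matrices are assumed to be square and stored in row-major order. In a CUDA kernel, the two outer loops would typically be replaced by per-thread row and column indices, while the inner loop over `k` is the one being unrolled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline: C = A * B for square n x n matrices in row-major order.
void matmul_basic(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            float sum = 0.0f;
            // One multiply-add and one loop test per iteration.
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[row * n + k] * b[k * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}

// Same computation with the inner loop manually unrolled by a factor of 4.
void matmul_unrolled4(const std::vector<float>& a, const std::vector<float>& b,
                      std::vector<float>& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            float sum = 0.0f;
            std::size_t k = 0;
            // Four multiply-adds per loop test, so loop overhead is amortized.
            for (; k + 4 <= n; k += 4) {
                sum += a[row * n + k]     * b[k * n + col];
                sum += a[row * n + k + 1] * b[(k + 1) * n + col];
                sum += a[row * n + k + 2] * b[(k + 2) * n + col];
                sum += a[row * n + k + 3] * b[(k + 3) * n + col];
            }
            // Remainder loop handles n that is not a multiple of 4.
            for (; k < n; ++k) {
                sum += a[row * n + k] * b[k * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}
```

Both functions accumulate the products in the same order, so the unrolled version changes only how often the loop condition is evaluated, not the arithmetic. Whether this pays off on a given GPU is exactly what the benchmarks described above measure.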
File(s)
Name
Loop Unrolling Impact on CUDA Matrix Multiplication Operations - accepted version.pdf
Description
Accepted version
Size
217.67 KB
Format
Adobe PDF
Checksum
(MD5): 22be5bc806c3588309197c1d37429481
