Loop Unrolling Impact on CUDA Matrix Multiplication Operations

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.12188/32233

Title:	Loop Unrolling Impact on CUDA Matrix Multiplication Operations
Authors:	Stefkovski, Vojdan; Mileski, Dimitar; Gusev, Marjan
Keywords:	Processor scheduling, Graphics processing units, Computer architecture, Performance gain, Distance measurement, Telecommunications, Registers, Computational efficiency, Kernel, Optimization
Issue Date:	26-Nov-2024
Publisher:	IEEE
Conference:	2024 32nd Telecommunications Forum (TELFOR)
Abstract:	This paper investigates the impact of loop unrolling on the performance of CUDA matrix multiplication across NVIDIA GPUs. We benchmark both basic and unrolled kernels with unroll factors of 2, 4, 8, and 16 and CUDA block sizes of 8, 16, and 32 on matrices ranging from 128 × 128 to 4096 × 4096. Using two GPUs, the GeForce RTX 4060 and the GTX TITAN X, we analyze how the unroll factor affects execution time. Our findings indicate that loop unrolling, particularly with factors of 8 and 16 and a block size of 32, yields significant performance gains on larger matrices. These results confirm loop unrolling as an effective optimization technique for CUDA matrix operations and give developers guidance for improving computational efficiency across different GPU architectures.
URI:	http://hdl.handle.net/20.500.12188/32233
DOI:	10.1109/telfor63250.2024.10819077
Appears in Collections:	Faculty of Computer Science and Engineering: Conference papers
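
Illustration (not from the paper): the abstract contrasts a basic kernel with kernels whose inner loop is unrolled by a fixed factor. The sketch below shows that transformation on the inner dot-product loop in plain, CPU-side C++ so it can be compiled without a GPU. The function names, the row-major layout, and the choice of factor 4 are assumptions made for this example; the authors' actual kernels may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix stored in a flat vector (assumed layout).
using Matrix = std::vector<float>;

// Basic version: one multiply-add per iteration of the k loop.
Matrix matmul_basic(const Matrix& A, const Matrix& B, std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    Matrix C(n * n, 0.0f);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                acc += A[row * n + k] * B[k * n + col];
            }
            C[row * n + col] = acc;
        }
    }
    return C;
}

// Unrolled version (hypothetical name): four multiply-adds per iteration,
// followed by a remainder loop for the case where n is not a multiple of 4.
Matrix matmul_unrolled4(const Matrix& A, const Matrix& B, std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    Matrix C(n * n, 0.0f);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            float acc = 0.0f;
            std::size_t k = 0;
            for (; k + 4 <= n; k += 4) {
                acc += A[row * n + k]     * B[k * n + col];
                acc += A[row * n + k + 1] * B[(k + 1) * n + col];
                acc += A[row * n + k + 2] * B[(k + 2) * n + col];
                acc += A[row * n + k + 3] * B[(k + 3) * n + col];
            }
            for (; k < n; ++k) {  // leftover iterations
                acc += A[row * n + k] * B[k * n + col];
            }
            C[row * n + col] = acc;
        }
    }
    return C;
}
```

In a CUDA kernel the two outer loops would typically be replaced by the thread and block indices, so each thread runs only the k loop for one output element; nvcc also accepts a `#pragma unroll` directive in front of that loop as an alternative to unrolling by hand. Both functions here accumulate in the same order, so they are expected to return the same matrix.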

File	Size	Format
Loop Unrolling Impact on CUDA Matrix Multiplication Operations - accepted version.pdf	217.67 kB	Adobe PDF