Loop Unrolling Impact on CUDA Matrix Multiplication Operations

Stefkovski, Vojdan; Mileski, Dimitar; Gushev, Marjan

doi:10.1109/telfor63250.2024.10819077

Date Issued

2024-11-26

Author(s)

Stefkovski, Vojdan

DOI

10.1109/telfor63250.2024.10819077

Abstract

This paper investigates the impact of loop unrolling on the performance of CUDA matrix multiplication across NVIDIA GPUs. We benchmark basic and unrolled kernels with unroll factors of 2, 4, 8, and 16 and CUDA block sizes of 8, 16, and 32 on matrices ranging from 128 × 128 to 4096 × 4096. Using two GPUs, the GeForce RTX 4060 and the GTX TITAN X, we analyze how the unroll factor affects execution time. Our findings indicate that loop unrolling, particularly with factors of 8 and 16 and a block size of 32, yields significant performance gains on larger matrices. These results confirm loop unrolling as an effective optimization technique for CUDA matrix operations, providing insights for developers to enhance computational efficiency across different GPU architectures.
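
To make the comparison in the abstract concrete, the following is a minimal CUDA sketch of a basic kernel and an unrolled kernel of the kind described. It is not the paper's code: the kernel names, the naive one-thread-per-element layout, the use of a template parameter with `#pragma unroll`, the input values, and the helper `unrolled_matches_basic` are all assumptions made for illustration. The sketch only checks that both kernels produce the same result; it does not measure execution time.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Basic kernel (assumed form): one thread computes one element of C = A * B.
// Matrices are square, n x n, stored row-major.
__global__ void matmul_basic(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Unrolled kernel (assumed form): the inner product advances UNROLL steps per
// iteration; a scalar tail loop covers n not divisible by UNROLL.
template <int UNROLL>
__global__ void matmul_unrolled(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    int k = 0;
    for (; k + UNROLL <= n; k += UNROLL) {
        #pragma unroll
        for (int u = 0; u < UNROLL; ++u)  // fixed trip count, fully unrollable
            acc += A[row * n + k + u] * B[(k + u) * n + col];
    }
    for (; k < n; ++k)  // remainder iterations
        acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Hypothetical helper: runs both kernels on the same inputs and compares them.
// Returns false on any CUDA error or on a mismatch.
template <int UNROLL>
bool unrolled_matches_basic(int n, int block) {
    size_t count = static_cast<size_t>(n) * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> hA(count), hB(count), hRef(count), hOut(count);
    for (size_t i = 0; i < count; ++i) {
        hA[i] = static_cast<float>(i % 7) * 0.5f;   // arbitrary small test values
        hB[i] = static_cast<float>(i % 5) - 2.0f;
    }
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 threads(block, block);  // block x block threads per CUDA block
        dim3 grid((n + block - 1) / block, (n + block - 1) / block);
        matmul_basic<<<grid, threads>>>(dA, dB, dC, n);
        ok = cudaMemcpy(hRef.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        matmul_unrolled<UNROLL><<<grid, threads>>>(dA, dB, dC, n);
        ok = ok && cudaMemcpy(hOut.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (!ok) return false;
    for (size_t i = 0; i < count; ++i)
        if (std::fabs(hRef[i] - hOut[i]) > 1e-3f * (1.0f + std::fabs(hRef[i])))
            return false;
    return true;
}

int main() {
    // n = 130 is deliberately not a multiple of 8, so the tail loop is exercised.
    bool same = unrolled_matches_basic<8>(130, 16);
    std::printf("unroll=8, block=16, n=130: %s\n", same ? "match" : "MISMATCH or CUDA error");
    return same ? 0 : 1;
}
```

Making the unroll factor a template parameter gives the inner loop a compile-time trip count, so other factors (2, 4, 16) and block sizes (8, 32) can be tried by changing the arguments in `main`. Whether a given factor actually reduces execution time depends on the GPU and matrix size, which is what the paper's benchmarks measure.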

Subjects

Processor scheduling ...

File(s)

Name

Loop Unrolling Impact on CUDA Matrix Multiplication Operations - accepted version.pdf

Description

Accepted version

Size

217.67 KB

Format

Adobe PDF

Checksum

MD5: 22be5bc806c3588309197c1d37429481