<h1>How to Optimize a CUDA Matmul Kernel for cuBLAS-Like Performance: A Worklog<\/h1>\n<p>December 31, 2022<\/p>\n<p>In this post, I\u2019ll iteratively optimize an implementation of matrix multiplication written in CUDA.<br \/>\nMy goal is not to build a cuBLAS replacement, but to deeply understand the most important performance characteristics of the GPUs that are used for modern deep learning.<br \/>\nThis includes coalescing global memory accesses, shared memory caching and occupancy optimizations, among others.<label for="3"><\/label><span>You can download the code for all kernels from <a href="https:\/\/github.com\/siboehm\/SGEMM_CUDA">Github<\/a>. Also check out <a href="https:\/\/github.com\/wangzyon\/NVIDIA_SGEMM_PRACTICE">wangzyon\u2019s repo<\/a>, which I used as a starting point for my re-write.<\/span> <label for="1"><\/label><span>This post is less polished than my normal uploads and includes many more sidenotes. I used it as a notepad for ideas and scribbles while writing the kernels. 
That\u2019s why I called it a worklog \ud83d\ude42<\/span><\/p>\n<p>Matrix multiplication on GPUs may currently be the most important algorithm that exists, considering it makes up almost all the FLOPs during the training and inference of large deep-learning models.<br \/>\nSo how much work is it to write a performant CUDA SGEMM<label for=\"2\"><\/label><span>SGEMM performs <code>C=\u03b1AB+\u03b2C<\/code> at single (=32b) precision.<\/span> from scratch?<br \/>\nI\u2019ll start with a naive kernel and step-by-step apply optimizations until we get within 80% of the performance of cuBLAS (NVIDIA\u2019s official matrix library):<label for=\"4\"><\/label><span>cuBLAS at FP32 that is. In my setting, doing the matmul using TF32 or BF16 precision allows cuBLAS to use the tensor cores, which increases FLOPS by 2.5x or 3.5x. I may look into tensor cores \/ warp matrix functions in a future post.<\/span><\/p>\n<table readabilityDataTable=\"1\">\n<thead>\n<tr>\n<th>Kernel<\/th>\n<th>GFLOPs<\/th>\n<th>Performance relative to cuBLAS (fp32)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1: Naive<\/td>\n<td>309<\/td>\n<td>1.3%<\/td>\n<\/tr>\n<tr>\n<td>2: GMEM Coalescing<\/td>\n<td>2006<\/td>\n<td>8.2%<\/td>\n<\/tr>\n<tr>\n<td>3: SMEM Blocktiling<\/td>\n<td>2984<\/td>\n<td>12.2%<\/td>\n<\/tr>\n<tr>\n<td>4: 1D Warptiling<\/td>\n<td>8626<\/td>\n<td>35.3%<\/td>\n<\/tr>\n<tr>\n<td>5: 2D Warptiling<\/td>\n<td>16134<\/td>\n<td>66.0%<\/td>\n<\/tr>\n<tr>\n<td>6: Vectorize loads<\/td>\n<td>20358<\/td>\n<td>83.2%<\/td>\n<\/tr>\n<tr>\n<td>0: cuBLAS<\/td>\n<td>24441<\/td>\n<td>100.0%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2 id=\"kernel-1-naive-implementation\">Kernel 1: Naive Implementation<\/h2>\n<p>In the CUDA programming model, computation is ordered in a three-level hierarchy.<br \/>\nEach invocation of a CUDA kernel creates a new grid, which consists of multiple blocks.<br \/>\nEach block consists of up to 1024 individual threads.<label for=\"5\"><\/label><span>These constants can be 
looked up in the <a href="https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#compute-capabilities">CUDA Programming guide<\/a>.<\/span><br \/>\nThreads that are in the same block have access to the same shared memory region (SMEM).<\/p>\n<p>The number of threads in a block can be configured using a variable normally called <code>blockDim<\/code>, which is a vector consisting of three ints.<br \/>\nThe three entries <code>blockDim.x<\/code>, <code>blockDim.y<\/code> and <code>blockDim.z<\/code> specify the block\u2019s extent along each dimension, as visualized below:<\/p>\n<p><img decoding="async" src="http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/CUDA_thread_hierarchy.png" alt><\/p>\n<p>Similarly, the number of blocks in a grid is configurable using the <code>gridDim<\/code> variable.<br \/>\nWhen we launch a new kernel from the host<label for="6"><\/label><span>In accelerator lingo, <em>host<\/em> refers to the CPU and <em>device<\/em> is the accelerator, here the GPU.<\/span>, it creates a single grid, containing the blocks and threads as specified.<label for="7"><\/label><span>From here on I\u2019ll only be talking about 2D grids and blocks, partly because the 3D-structure is seldom used and because drawing in 3D is too hard.<\/span><br \/>\nIt\u2019s important to keep in mind that the thread hierarchy we just talked about mostly concerns program correctness.<br \/>\nFor program performance, as we\u2019ll see later, it\u2019s not a good idea to treat all threads in the same block as equals.<\/p>\n<p>For our first kernel, we\u2019ll use the grid, block and thread hierarchy to assign each thread a unique entry in the result matrix C.<br \/>\nThen that thread will compute the dot product of the corresponding row of A and column of B, and write the result to C.<br \/>\nBecause each location of C is written to by only one thread, we don\u2019t need any synchronization.<br \/>\nWe\u2019ll launch the kernel like so:<\/p>\n<div>\n<pre><code><span>\/\/ create as many 
blocks as necessary to map all of C<\/span>\n<span>dim3<\/span> <span>gridDim<\/span><span>(<\/span><span>CEIL_DIV<\/span><span>(<\/span><span>M<\/span><span>,<\/span> <span>32<\/span><span>),<\/span> <span>CEIL_DIV<\/span><span>(<\/span><span>N<\/span><span>,<\/span> <span>32<\/span><span>),<\/span> <span>1<\/span><span>);<\/span>\n<span>\/\/ 32 * 32 = 1024 threads per block<\/span>\n<span>dim3<\/span> <span>blockDim<\/span><span>(<\/span><span>32<\/span><span>,<\/span> <span>32<\/span><span>,<\/span> <span>1<\/span><span>);<\/span>\n<span>\/\/ launch the asynchronous execution of the kernel on the device<\/span>\n<span>\/\/ The function call returns immediately on the host<\/span>\n<span>sgemm_naive<\/span><span><<<<\/span><span>gridDim<\/span><span>,<\/span> <span>blockDim<\/span><span>>>><\/span><span>(<\/span><span>M<\/span><span>,<\/span> <span>N<\/span><span>,<\/span> <span>K<\/span><span>,<\/span> <span>alpha<\/span><span>,<\/span> <span>A<\/span><span>,<\/span> <span>B<\/span><span>,<\/span> <span>beta<\/span><span>,<\/span> <span>C<\/span><span>);<\/span>\n<\/code><\/pre>\n<\/div>\n<p>CUDA code is written from a single-thread perspective.<br \/>\nIn the code of the kernel, we access the <code>blockIdx<\/code> and <code>threadIdx<\/code> built-in variables.<br \/>\nThese will return different values based on the thread that\u2019s accessing them.<label for="8"><\/label><span>In our example, <code>threadIdx.x<\/code> and <code>threadIdx.y<\/code> will vary from 0 to 31 based on the position of the thread within its block. Same for <code>blockIdx.x<\/code> and <code>blockIdx.y<\/code>, which will vary from 0 to <code>CEIL_DIV(M, 32)-1<\/code> or <code>CEIL_DIV(N, 32)-1<\/code> based on the position of the thread\u2019s block in the grid.<\/span> <label for="9"><\/label><span>We\u2019ll do a lot of indexing into strided in-memory representations of matrices. 
Edward Yang\u2019s post on <a href=\"http:\/\/blog.ezyang.com\/2019\/05\/pytorch-internals\/\">PyTorch Internals<\/a> contains a good explanation of strided tensors.<\/span><\/p>\n<div>\n<pre><code><span>__global__<\/span> <span>void<\/span> <span>sgemm_naive<\/span><span>(<\/span><span>int<\/span> <span>M<\/span><span>,<\/span> <span>int<\/span> <span>N<\/span><span>,<\/span> <span>int<\/span> <span>K<\/span><span>,<\/span> <span>float<\/span> <span>alpha<\/span><span>,<\/span> <span>const<\/span> <span>float<\/span> <span>*<\/span><span>A<\/span><span>,<\/span>\n                            <span>const<\/span> <span>float<\/span> <span>*<\/span><span>B<\/span><span>,<\/span> <span>float<\/span> <span>beta<\/span><span>,<\/span> <span>float<\/span> <span>*<\/span><span>C<\/span><span>)<\/span> <span>{<\/span>\n  <span>\/\/ compute position in C that this thread is responsible for<\/span>\n  <span>const<\/span> <span>uint<\/span> <span>x<\/span> <span>=<\/span> <span>blockIdx<\/span><span>.<\/span><span>x<\/span> <span>*<\/span> <span>blockDim<\/span><span>.<\/span><span>x<\/span> <span>+<\/span> <span>threadIdx<\/span><span>.<\/span><span>x<\/span><span>;<\/span>\n  <span>const<\/span> <span>uint<\/span> <span>y<\/span> <span>=<\/span> <span>blockIdx<\/span><span>.<\/span><span>y<\/span> <span>*<\/span> <span>blockDim<\/span><span>.<\/span><span>y<\/span> <span>+<\/span> <span>threadIdx<\/span><span>.<\/span><span>y<\/span><span>;<\/span>\n\n  <span>\/\/ `if` condition is necessary for when M or N aren't multiples of 32.<\/span>\n  <span>if<\/span> <span>(<\/span><span>x<\/span> <span><<\/span> <span>M<\/span> <span>&&<\/span> <span>y<\/span> <span><<\/span> <span>N<\/span><span>)<\/span> <span>{<\/span>\n    <span>float<\/span> <span>tmp<\/span> <span>=<\/span> <span>0.0<\/span><span>;<\/span>\n    <span>for<\/span> <span>(<\/span><span>int<\/span> <span>i<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>i<\/span> <span><<\/span> 
<span>K<\/span><span>;<\/span> <span>++<\/span><span>i<\/span><span>)<\/span> <span>{<\/span>\n      <span>tmp<\/span> <span>+=<\/span> <span>A<\/span><span>[<\/span><span>x<\/span> <span>*<\/span> <span>K<\/span> <span>+<\/span> <span>i<\/span><span>]<\/span> <span>*<\/span> <span>B<\/span><span>[<\/span><span>i<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>y<\/span><span>];<\/span>\n    <span>}<\/span>\n    <span>\/\/ C = \u03b1*(A@B)+\u03b2*C<\/span>\n    <span>C<\/span><span>[<\/span><span>x<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>y<\/span><span>]<\/span> <span>=<\/span> <span>alpha<\/span> <span>*<\/span> <span>tmp<\/span> <span>+<\/span> <span>beta<\/span> <span>*<\/span> <span>C<\/span><span>[<\/span><span>x<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>y<\/span><span>];<\/span>\n  <span>}<\/span>\n<span>}<\/span>\n<\/code><\/pre>\n<\/div>\n<p>To visualize this simple kernel:<label for="10"><\/label><span>If the size of the matrix is not divisible by the size of the block, we\u2019ll have to launch extra blocks to process the remainder. For example, in the picture below, we\u2019ll create 9 equally-sized blocks, but only 4 of those fully utilize their 1024 threads. 
This artifact is called <a href="https:\/\/docs.nvidia.com\/deeplearning\/performance\/dl-performance-matrix-multiplication\/index.html#tile-quant">tile quantization<\/a>, and appears whenever we try to map a fixed-sized volume across a variable-sized input.<img decoding="async" src="http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/Tile_quantization.png" alt><\/span><\/p>\n<p><img decoding="async" src="http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/naive-kernel.png" alt><\/p>\n<p>This kernel takes about 0.5s to process three 4092\u00b2 fp32 matrices on my A6000 GPU.<br \/>\nLet\u2019s do some non-implementation-specific calculations:<\/p>\n<h3 id="lower-bounding-the-fastest-possible-runtime">Lower Bounding the Fastest Possible Runtime<\/h3>\n<p>For a matrix multiplication of two 4092\u00b2 matrices, followed by an addition of a 4092\u00b2 matrix (to make the <a href="https:\/\/en.wikipedia.org\/wiki\/Basic_Linear_Algebra_Subprograms#Level_3">GEMM<\/a>):<\/p>\n<ol>\n<li>Total FLOPs:<label for="11"><\/label><span>For each of the 4092\u00b2 entries of C, we have to perform a dot product of two vectors of size 4092, involving a multiply and an add at each step. \u201cMultiply then add\u201d is often mapped to a single assembly instruction called FMA (fused multiply-add), but still counts as two FLOPs.<\/span> <code>2*4092\u00b3 + 4092\u00b2 = 137 GFLOPs<\/code><\/li>\n<li>Total data to read (minimum!): <code>3 * 4092\u00b2 * 4B = 201MB<\/code><\/li>\n<li>Total data to store: <code>4092\u00b2 * 4B = 67MB<\/code><\/li>\n<\/ol>\n<p>So 268MB is the absolute minimum of memory that any implementation would have to transfer from\/to global GPU memory,<label for="12"><\/label><span>Global memory is the GPU\u2019s main memory region. If Nvidia sells you a GPU advertised with 80GB of memory and 1TB\/s of bandwidth, they\u2019re talking about the capacity and bandwidth of global memory. 
Later we\u2019ll talk about other memory regions on the GPU, like the shared memory, which is physically distinct and has very different performance characteristics.<\/span> assuming it has a big enough cache.<label for="13"><\/label><span>The cuBLAS kernel loads a total of 500MB of GMEM during the whole calculation. We\u2019ll see later how increasing arithmetic intensity allows us to achieve an access volume that low.<\/span><br \/>\nLet\u2019s calculate some upper bounds on kernel performance.<br \/>\nThe GPU is advertised with 30TFLOPs\/s of fp32 compute throughput and 768GB\/s of global memory bandwidth.<br \/>\nIf we achieved those numbers,<label for="14"><\/label><span>Reminder that peak FLOPs is a reductionist metric, since it depends on the instruction mix. There\u2019s no way you\u2019d reach those 30TFLOPs\/s if your FLOP of choice is DIV. However, since matmul uses mainly FMA instructions, which tend to be the fastest FLOPs, we have a good chance of actually getting close to that peak FLOP value.<\/span> <label for="15"><\/label><span>Similar story for the bandwidth: Peak bandwidth can only be reached if the access pattern suits the hardware.<\/span> we\u2019d need 4.5ms for the calculation and 0.34ms for the memory transfers.<br \/>\nSo in our napkin math, the calculation takes ~10x more time than the memory accesses.<br \/>\nThis means our final optimized kernel will be compute-bound, as long as we end up having to transfer <10x the absolute minimum memory volume of 268MB.<label for="16"><\/label><span>The A6000 is advertised with 309TFLOPs\/s of tensor core performance. If we could use tensor cores for our fp32 matmul, the calculation would only take 0.44ms, and an optimized kernel doing 4092^2 matrix mul would almost surely still be memory bound. 
This puts into perspective just how fast the tensor cores are.<\/span><\/p>\n<p>Now that we\u2019ve calculated some lower bounds for our fp32 GEMM calculation, let\u2019s get back to the kernel at hand, to figure out why it\u2019s so much slower than it could be.<\/p>\n<h3 id="memory-access-pattern-of-the-naive-kernel">Memory Access Pattern of the Naive Kernel<\/h3>\n<p>In our kernel, two threads in the same block with ThreadIds (0, 0) and (1, 0) will load the same column of B but different rows of A.<br \/>\nIf we assume the worst case of zero caching, then each thread has to load <code>2*4092+1<\/code> floats from global memory.<br \/>\nAs we have 4092\u00b2 threads total, this would result in 548GB of memory traffic.<\/p>\n<p>Below is a visualization of the memory access pattern of our naive kernel, taking two threads A (red) and B (green) as an example:<\/p>\n<p><img decoding="async" src="http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/naive_kernel_mem_access.png" alt><\/p>\n<p>So to recap, when I run this kernel on an A6000 GPU it achieves ~300GFLOPs when multiplying two 4092&#215;4092 float32 matrices.<br \/>\nPretty bad, considering that the A6000 is advertised as being able to achieve almost 30 TFLOPs.<label for="17"><\/label><span>Just for comparison, 300 GFLOPs is also roughly the performance achieved by the optimized BLAS library on the 2015 Haswell CPU that I used in my <a href="http:\/\/siboehm.com\/articles\/22\/Fast-MMM-on-CPU">earlier post<\/a> on CPU matmul.<\/span><br \/>\nSo how can we start to make this faster?<br \/>\nOne way is to optimize the memory access pattern of our kernel such that global memory accesses can be coalesced (=combined) into fewer accesses.<\/p>\n<h2 id="kernel-2-global-memory-coalescing">Kernel 2: Global Memory Coalescing<\/h2>\n<p>Before we get into global memory coalescing, we need to learn about the concept of a warp.<br \/>\nFor execution, the threads of a block are grouped into so-called warps, consisting of 
32 threads.<br \/>\nA warp is then assigned to a warp scheduler, which is the physical core that executes the instructions.<label for="18"><\/label><span>Before the Volta architecture, it used to be the case that all threads of a warp were fed from the same instruction stream. On a branch, the threads that didn\u2019t take the branch were deactivated using the so-called active mask. However, since Volta, it\u2019s no longer a good idea to rely on this \u2018warp-synchronous\u2019 behaviour, as instructions from different branches may be interleaved even for the same threads within a warp.<\/span><br \/>\nThere are four warp schedulers per multiprocessor.<br \/>\nThe grouping into warps happens based on a consecutive <code>threadId<\/code>.<br \/>\nIf we set the <code>blockDim<\/code> to be multi-dimensional, then the threadId is calculated like so:<\/p>\n<div>\n<pre><code>threadId = threadIdx.x+blockDim.x*(threadIdx.y+blockDim.y*threadIdx.z)\n<\/code><\/pre>\n<\/div>\n<p>Then, threads with neighbouring <code>threadId<\/code> become part of the same warp.<br \/>\nBelow I tried to illustrate this, using a smaller \u201cwarpsize\u201d of 8 threads (real warps always contain 32 threads):<label for="19"><\/label><span>I like to think of the three dimensions <code>x,y,z<\/code> of <code>threadId<\/code> as being \u201ccolumn-major\u201d, due to the first dimension <code>x<\/code> being the one that\u2019s continuous in \u201cwarpspace\u201d. 
I don\u2019t know if others use that term, but it makes the concept more clear to me.<\/span><\/p>\n<p><img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/threadId_to_warp_mapping.png\" alt><\/p>\n<p>The concept of a warp is relevant for this second kernel, as sequential memory accesses by threads that are part of the same warp can be grouped and executed as one.<br \/>\nThis is referred to as <strong>global memory coalescing<\/strong>.<br \/>\nIt\u2019s the most important thing to keep in mind when optimizing a kernel\u2019s GMEM memory accesses toward achieving the peak bandwidth.<\/p>\n<p>Below is an example, where consecutive memory accesses by threads in the same warp are grouped, allowing each warp to execute 8 memory accesses using only 2 32B loads:<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/GMEM_coalescing.png\" alt><\/p>\n<p>In reality, the GPU supports 32B, 64B and 128B memory accesses.<br \/>\nSo, if each thread is loading a 32bit float from global memory, the warp scheduler (probably the MIO) can coalesce this <code>32*4B=128B<\/code> load into a single transaction.<br \/>\nThis is only possible if the floats loaded are consecutive in memory, and if access is aligned.<label for=\"20\"><\/label><span>In that way, optimizing for global memory coalescing on GPU has a lot of similarities to optimizing for cache line utilization on CPU. Interestingly, to allow coalescing the threads within a warp have to access consecutive addresses, but the accesses don\u2019t have to be consecutive within-warp. 
Illustrated below: <img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/random_access_coalescing.png\" alt><\/span><br \/>\nIf they aren\u2019t, or if access cannot be coalesced for some other reason, then the GPU will execute as many 32B loads as necessary to fetch all floats, leading to a lot of wasted bandwidth.<br \/>\nProfiling our naive kernel, we can observe the detrimental effect of non-coalesced access as we achieve only 15GB\/s of GMEM throughput.<\/p>\n<p>Looking back at the previous kernel, we assigned threads their entry of C like so:<\/p>\n<div>\n<pre><code><span>const<\/span> <span>uint<\/span> <span>x<\/span> <span>=<\/span> <span>blockIdx<\/span><span>.<\/span><span>x<\/span> <span>*<\/span> <span>blockDim<\/span><span>.<\/span><span>x<\/span> <span>+<\/span> <span>threadIdx<\/span><span>.<\/span><span>x<\/span><span>;<\/span>\n<span>const<\/span> <span>uint<\/span> <span>y<\/span> <span>=<\/span> <span>blockIdx<\/span><span>.<\/span><span>y<\/span> <span>*<\/span> <span>blockDim<\/span><span>.<\/span><span>y<\/span> <span>+<\/span> <span>threadIdx<\/span><span>.<\/span><span>y<\/span><span>;<\/span>\n<\/code><\/pre>\n<\/div>\n<p>Hence, threads of the same warp (those with consecutive <code>threadIdx.x<\/code>) were loading the rows of A non-consecutively from memory.<br \/>\nThe naive kernel\u2019s pattern of accessing the memory of A looked more like so:<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/Naive_kernel_mem_coalescing.png\" alt><\/p>\n<p>To enable coalescing, we can change how we assign positions of the result matrix C to threads.<br \/>\nThis change in the global memory access pattern is illustrated below:<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/Naive_kernel_improved_access.png\" alt><\/p>\n<p>To implement this, we only need to change the first two lines:<\/p>\n<div>\n<pre><code><span>const<\/span> <span>int<\/span> <span>x<\/span> 
<span>=<\/span> <span>blockIdx<\/span><span>.<\/span><span>x<\/span> <span>*<\/span> <span>BLOCKSIZE<\/span> <span>+<\/span> <span>(<\/span><span>threadIdx<\/span><span>.<\/span><span>x<\/span> <span>\/<\/span> <span>BLOCKSIZE<\/span><span>);<\/span>\n<span>const<\/span> <span>int<\/span> <span>y<\/span> <span>=<\/span> <span>blockIdx<\/span><span>.<\/span><span>y<\/span> <span>*<\/span> <span>BLOCKSIZE<\/span> <span>+<\/span> <span>(<\/span><span>threadIdx<\/span><span>.<\/span><span>x<\/span> <span>%<\/span> <span>BLOCKSIZE<\/span><span>);<\/span>\n\n<span>if<\/span> <span>(<\/span><span>x<\/span> <span><<\/span> <span>M<\/span> <span>&&<\/span> <span>y<\/span> <span><<\/span> <span>N<\/span><span>)<\/span> <span>{<\/span>\n  <span>float<\/span> <span>tmp<\/span> <span>=<\/span> <span>0.0<\/span><span>;<\/span>\n  <span>for<\/span> <span>(<\/span><span>int<\/span> <span>i<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>i<\/span> <span><<\/span> <span>K<\/span><span>;<\/span> <span>++<\/span><span>i<\/span><span>)<\/span> <span>{<\/span>\n    <span>tmp<\/span> <span>+=<\/span> <span>A<\/span><span>[<\/span><span>x<\/span> <span>*<\/span> <span>K<\/span> <span>+<\/span> <span>i<\/span><span>]<\/span> <span>*<\/span> <span>B<\/span><span>[<\/span><span>i<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>y<\/span><span>];<\/span>\n  <span>}<\/span>\n  <span>C<\/span><span>[<\/span><span>x<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>y<\/span><span>]<\/span> <span>=<\/span> <span>alpha<\/span> <span>*<\/span> <span>tmp<\/span> <span>+<\/span> <span>beta<\/span> <span>*<\/span> <span>C<\/span><span>[<\/span><span>x<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>y<\/span><span>];<\/span>\n<span>}<\/span>\n<\/code><\/pre>\n<\/div>\n<p>And we call it like so:<label for=\"21\"><\/label><span>This wasn\u2019t immediately obvious to me, but enabling GMEM coalescing changes nothing in the assembly, see the SASS output 
on <a href=\"https:\/\/godbolt.org\/#z:OYLghAFBqd5TKALEBjA9gEwKYFFMCWALugE4A0BIEAZgQDbYB2AhgLbYgDkAjF%2BTXRMiAZVQtGIHgBYBQogFUAztgAKAD24AGfgCsp5eiyahUAV0wtyKxqiIEh1ZpgDC6embZMQAJnLOAGQImbAA5TwAjbFJfWQAHdCViByY3Dy9fcgSk%2ByEgkPC2KJifWRtsOxSRIhZSIjTPbz9yyqFq2qJ8sMjo2OsauoaM5oHO4O6i3tKASmt0M1JUTi4AUh8AZjjSFmA2FgBqISWVrQBBE\/ON4NQPHH2V9ZdxJRU6h9wLtfXr2%2Bx7x9QSiIhHQ70%2BVyYNwsfweTyBmHoBAiYLOXx%2B0P%2BTzMESMSgA%2BgA3HwAOiQKMu30hv0x5kseNIZmEBA4pPJFzxeOA9HQEQkHP2BPQBEw%2ByUwGwbDYnO5vPoeI4UowEmwSiWEGCRH2AFlyPsNftQrr9QBpXU0bksTUSOJIKwXfYOx1O50u11ul0YJhA\/bm9CW\/YAKlOus93t9\/oDACFyPb3XH4\/Hw5qojUzRbNQGXNN7gB2SOx\/WRgIAeRcxpEAEkAFofdYAEXWPge%2BbODtDmv1qAASugAO7\/Ov7HHoVAAawrmHUxPUgf2RdL5eruHuPkj%2BwgRCQpGwLEwE6nM4A9HOS2XKzXps3Y%2B29cJ9qg0gOh9yx\/viQBPWfzs9Lldrjdbjue6TtOK4AKwngu564Je6wtucrb7Iex4EDQoo1EQErMB2Sj7CESwvLUn4kPseyjn8m7BMAuG9mQo77IyOCkPs9iMPsACOZjGPYABelopAWaEQN2faYtqK4AGxrBJ96PrCBrZiseaxg6SYsWwcRPloxJaFeiGqWQ676gQWnNnqYnGmZaz5quBCKcp%2BmOkQGl\/g8g7nGBkYif2Ab7Maf56isYGDr5%2BaeSZvmhAFD7uEFdZ6acTpKfFjkuEFXk9j5BrRWkcVPtatqzs5mnWUO2A1LOaWed5s5RaVMX0HFCUOslnw5nWXCzPQ3Bgfw3hcDo5DoNwLgKHWiVpaupVKPMiwwhsfDkEQ2idbMSBAb0ECzKOsTEgAnAdh1HUdEmGNw0h9StQ3cPwSggFoS0rbMcCwCgGAaQw0SUNQ71xJ9MRMASqCoDwPA5uQOAEgQSwAGoENgvbFnEzBXXQ9CYaQd0QBEV0RMERHcIt70cMIxZMPQ75XTgewmJIA38IQ26VASKpXdg6gVGYmGE\/wGrYN19OGEi2ykO%2Bbg4FdRCkMyPOzOaOxKHDCNIyjvD8IIwhiMqUiyBrigqBoV36DwhjGKYFhWIiER3ZAszoHEuRetwAC0xbrPszt1qEda4JGCgAOIe874ohNsmHOxgOBubUqBkvWv1c9geIABynZH2BuelCeYSnp1KFKqd5c7zsO%2BoLBKM79uO0obnOwSHvwm5qDWdZYO3fzFSO04TCuO4jQGIE4yFMUBjZMkQhDN4Jtj47XTD70JstI77SDH3GSLx3rRMCvYwFD0MSL6Mk8GECHRz\/vUizDNCxLL4XU9ZdgvDVwBowy4Lj7KDxI5jp674MQhkvg%2BGmPwZa9NphrQ2jELa5AdqlH2sdRBB1ToCwuuQfqg1n63Xuo9cB5AXrIDQFgPAhASAUCoLQT6rAOA8zkJrcQkgZB0P1moTQgt9B%2BCMCYNAFtrCby7hAZwx8\/CDz3pMEo8REjj1SGvJoWQpGzyHhfUofDbDLyPrIzIS8qijHPuIvop9V7pDkYY3eEwR4zDmDfZYXwtg7D2IcSEGdUSoghFCO4sJnivCIGyNx1JPHwgcL4yk7iYQAnhFbYJ6IPEAmxLiQkJI44fFcSE\/xsS6QMiZCyJJnwzgci5DyPkeIBRChFGKCUUoCmynlBKPEjIlSMFVNgTA6o7w6lvJqQ0HS\/Jpj9FaegNo7SOQTCMkZN41JBhDE
IMM6ZAzRhUqMxZcY1IpisD6WZmZ7LwQdIWU8i4axuUbM1e80yOx3hqm5Z8I5xwgRnKFPZ0EAoAW3LuN8M41gQW\/PsmCxybydjkvWK5r4QKfnuVBX8pVnlATeUhSCP4LwJVjMhPUaFT6YRJjhPCFQVRKCIixdApEWDkRYkgKiNE6IMR7tEFin0OJcSZHxR2gl1wXMeOJaS0lZLuDEqELZCy1LFS0jpY5BlmKtI7KZOC5l5KWSlS3WyfLhnqRKlNQFHkMqiV8v5UqBA8qhXShFbK9VcrBRFbmFKiVHRVQ1VlOqq4uWNWCvlAZhVfKCtKqsyqWdMq1RyrFU1cFYytVRO1TqZ0uC9XQVdZ%2Bo1xr7EmjZNc185orkbKAp6kDdybW2rtJBSCUHnX4GwEA6wf5gSjU\/G61gcFgJ0M9RAEA3roA%2BowchP1m1\/VbSAQGwNQbg0htDbAStEbIwwerBgGMsY40FnjVgotaHE2wmTCmVMJRmzpoNRmncCAszuoLdmnNuZq0oMIfmV0rYizFsQyW0ti1qzlkYaiw6Vb9UWnrLWjDdbyGUKwo2mQuHm0sELa28A7YOxSHu127tPbe19gHIOIdoiWmwBHYh0dFhxzrNnJOqdyDp0ztVDtidc7WALlJJ1xdS7l0ruB6Ztd67O0bvHeVkY26qO3Y4QRPdj4m1EeYhe8icgpB44J6ReiR4bzUTojoIntFtF0Uo\/Rh8ZOaOU3UcTC8r6zVvjwe%2BEbH6YO4K\/d%2Bn8eDf1\/hAf%2BZDU26fTeAzNTFqA5vgXmxBBauBoLHddLg2CHq1tWuGnwRaQBgW0snHge0JI5h8DwdYoNpA5jAuWrzWDcF1vwQ2htKAB1LG%2BsJDt\/1QjsGWKEN%2BH8v4\/0Gs0gB0tMAGHfQwzgTC9Y\/sNuwrR\/CUjd17sYgePcNMH1E47ETM8UgDZPp1%2BTKneuSY49vBTYiJP9Gm\/3NTZj54H1mFLbA2BhT3T05GlL3A6zYChksfYw7qWlZMxVizVnAELV1G4Ft1KgEgLSwF9aWboE5ukBJYkElwtA54MDoH4a0HFp4FoB6R2fPVr809DL8BG0gBy9gPLv1CvFe4Nd8rZnKv8Gq2QvbfgGva2a9%2Bg2bDBrG3Y1vbrIm%2BMbdHgo4TqmhtjcU0tuT82Vvrzp%2Bos%2BXOBOmNkwt\/jm2lrbl23VrQB2DP8Gfids7fxLvMVx6Z8zWg\/6kIe42J7BXW02fe\/5iB5AvuOZgTtMC6xiRxdByDx3ydOGFvIJD6HFbDNw7ugj%2BzgXgvSD2sSEHWhGw5lTsnZOYFSgSVOrDuzdac1Q4egLdYCvvMJ4CyzTGXXpBAA%3D%3D\">Godbolt<\/a>. Access coalescing is done at kernel runtime by the hardware. This makes sense since coalescing requires aligned access, which cannot be guaranteed at compile time as we pass the matrix pointers as function arguments. Also: the assembly features partial unrolling of our inner loop even though the loop count <code>K<\/code> is not known at compile time. 
Exciting!<\/span><\/p>\n<div>\n<pre><code><span>\/\/ gridDim stays the same<\/span>\n<span>dim3<\/span> <span>gridDim<\/span><span>(<\/span><span>CEIL_DIV<\/span><span>(<\/span><span>M<\/span><span>,<\/span> <span>32<\/span><span>),<\/span> <span>CEIL_DIV<\/span><span>(<\/span><span>N<\/span><span>,<\/span> <span>32<\/span><span>));<\/span>\n<span>\/\/ make blockDim 1-dimensional, but don't change number of threads<\/span>\n<span>dim3<\/span> <span>blockDim<\/span><span>(<\/span><span>32<\/span> <span>*<\/span> <span>32<\/span><span>);<\/span>\n<span>sgemm_coalescing<\/span><span><<<<\/span><span>gridDim<\/span><span>,<\/span> <span>blockDim<\/span><span>>>><\/span><span>(<\/span><span>M<\/span><span>,<\/span> <span>N<\/span><span>,<\/span> <span>K<\/span><span>,<\/span> <span>alpha<\/span><span>,<\/span> <span>A<\/span><span>,<\/span> <span>B<\/span><span>,<\/span> <span>beta<\/span><span>,<\/span> <span>C<\/span><span>);<\/span>\n<\/code><\/pre>\n<\/div>\n<p>Global memory coalescing increases memory throughput from 15GB\/s to 110GB\/s.<br \/>\nPerformance reaches 2000 GFLOPS, a big improvement compared to the 300 GFLOPS of the first, naive kernel.<br \/>\nFor the next kernel, we\u2019ll use the GPU\u2019s fast on-chip memory, called shared memory, to cache data that will be re-used.<\/p>\n<p>Next to the large global memory, a GPU has a much smaller region of memory that is physically located on the chip, called shared memory (SMEM).<br \/>\nPhysically, there\u2019s one shared memory per SM.<label for=\"22\"><\/label><span>Here\u2019s a helpful illustration of the memory hierarchy on an A100 GPU (<a href=\"https:\/\/developer.nvidia.com\/blog\/cuda-refresher-cuda-programming-model\/\">source<\/a>):<img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/memory-hierarchy-in-gpus.png\" alt><\/span><br \/>\nLogically, this shared memory is partitioned among the blocks.<br \/>\nThis means that a thread can communicate with the other threads in its 
block via the shared memory chunk.<br \/>\nOn my A6000 GPU, each block has access to a maximum of 48KB of shared memory.<label for="23"><\/label><span>The amount of SMEM is configurable, by trading off a larger shared memory for a smaller L1 cache. For specifics, see the <a href="https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#compute-capability-8-x">compute capability documentation<\/a>. Also, it\u2019s possible to use more than 48KB of SMEM per block by utilizing dynamic shared memory.<\/span><\/p>\n<p>As the shared memory is located on-chip, it has a much lower latency and higher bandwidth than global memory.<br \/>\nI couldn\u2019t find good benchmark results for the Ampere architecture, but for Volta (released in 2017) the benchmarks performed in <a href="https:\/\/arxiv.org\/abs\/1804.06826">this paper<\/a> report 750GiB\/s of global memory bandwidth, and 12,080GiB\/s of shared memory bandwidth.<label for="24"><\/label><span>It doesn\u2019t look like these numbers have changed much since Volta. Nvidia reports ~750GB\/s of max GMEM bandwidth for my A6000 (Ampere).<\/span><\/p>\n<p>So for this next kernel, we\u2019ll load a chunk of A and a chunk of B from global memory into shared memory.<br \/>\nThen we\u2019ll perform as much work as possible on the two chunks, with each thread still being assigned one entry of C.<br \/>\nWe\u2019ll move the chunks along the columns of A and the rows of B, performing partial sums on C, until the result is computed.<\/p>\n<p>This is illustrated below:<\/p>\n<p><img decoding="async" src="http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/cache-blocking.png" alt><\/p>\n<p>The important parts of the code are below, with variable names corresponding to the plot above:<label for="25"><\/label><span>In general, I didn\u2019t write the code to work for arbitrary sizes of M, N and K, as the condition checking introduces a lot of clutter and isn\u2019t very interesting. 
To make sure the kernel works correctly, I test it with random data and a few different matrix sizes by comparing to cuBLAS.<\/span><\/p>\n<div>\n<pre><code><span>\/\/ advance pointers to the starting position for this block<\/span>\n<span>A<\/span> <span>+=<\/span> <span>cRow<\/span> <span>*<\/span> <span>CHUNKSIZE<\/span> <span>*<\/span> <span>K<\/span><span>;<\/span>                    <span>\/\/ row=cRow, col=0<\/span>\n<span>B<\/span> <span>+=<\/span> <span>cCol<\/span> <span>*<\/span> <span>CHUNKSIZE<\/span><span>;<\/span>                        <span>\/\/ row=0, col=cCol<\/span>\n<span>\/\/ the pointer to the output location stays fixed during the loop<\/span>\n<span>C<\/span> <span>+=<\/span> <span>cRow<\/span> <span>*<\/span> <span>CHUNKSIZE<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>cCol<\/span> <span>*<\/span> <span>CHUNKSIZE<\/span><span>;<\/span> <span>\/\/ row=cRow, col=cCol<\/span>\n\n<span>\/\/ allocate buffer for current chunk of A and chunk of B in shared mem<\/span>\n<span>\/\/ the amount of SMEM required by this kernel is fixed at compile time<\/span>\n<span>\/\/ this will become important for occupancy calculations later on.<\/span>\n<span>__shared__<\/span> <span>float<\/span> <span>As<\/span><span>[<\/span><span>CHUNKSIZE<\/span> <span>*<\/span> <span>CHUNKSIZE<\/span><span>];<\/span>\n<span>__shared__<\/span> <span>float<\/span> <span>Bs<\/span><span>[<\/span><span>CHUNKSIZE<\/span> <span>*<\/span> <span>CHUNKSIZE<\/span><span>];<\/span>\n\n<span>float<\/span> <span>tmp<\/span> <span>=<\/span> <span>0.0<\/span><span>;<\/span>\n<span>\/\/ the outer loop advances A along the columns and B along<\/span>\n<span>\/\/ the rows until we have fully calculated the result in C.<\/span>\n<span>for<\/span> <span>(<\/span><span>int<\/span> <span>outer<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>outer<\/span> <span><<\/span> <span>numBlockSteps<\/span><span>;<\/span> <span>++<\/span><span>outer<\/span><span>)<\/span> 
<span>{<\/span>\n  <span>\/\/ Have each thread load one of the elements in A & B from<\/span>\n  <span>\/\/ global memory into shared memory.<\/span>\n  <span>\/\/ Make the innerRow (=threadIdx.x) the consecutive index<\/span>\n  <span>\/\/ to allow for global memory access coalescing<\/span>\n  <span>As<\/span><span>[<\/span><span>innerCol<\/span> <span>*<\/span> <span>CHUNKSIZE<\/span> <span>+<\/span> <span>innerRow<\/span><span>]<\/span> <span>=<\/span> <span>A<\/span><span>[<\/span><span>innerCol<\/span> <span>*<\/span> <span>K<\/span> <span>+<\/span> <span>innerRow<\/span><span>];<\/span>\n  <span>Bs<\/span><span>[<\/span><span>innerCol<\/span> <span>*<\/span> <span>CHUNKSIZE<\/span> <span>+<\/span> <span>innerRow<\/span><span>]<\/span> <span>=<\/span> <span>B<\/span><span>[<\/span><span>innerCol<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>innerRow<\/span><span>];<\/span>\n\n  <span>\/\/ block threads in this block until cache is fully populated<\/span>\n  <span>__syncthreads<\/span><span>();<\/span>\n\n  <span>\/\/ advance pointers onto next chunk<\/span>\n  <span>A<\/span> <span>+=<\/span> <span>CHUNKSIZE<\/span><span>;<\/span>\n  <span>B<\/span> <span>+=<\/span> <span>CHUNKSIZE<\/span> <span>*<\/span> <span>N<\/span><span>;<\/span>\n\n  <span>\/\/ execute the dotproduct on the currently cached chunk<\/span>\n  <span>for<\/span> <span>(<\/span><span>int<\/span> <span>inner<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>inner<\/span> <span><<\/span> <span>CHUNKSIZE<\/span><span>;<\/span> <span>++<\/span><span>inner<\/span><span>)<\/span> <span>{<\/span>\n    <span>tmp<\/span> <span>+=<\/span>\n        <span>As<\/span><span>[<\/span><span>innerCol<\/span> <span>*<\/span> <span>CHUNKSIZE<\/span> <span>+<\/span> <span>inner<\/span><span>]<\/span> <span>*<\/span> <span>Bs<\/span><span>[<\/span><span>inner<\/span> <span>*<\/span> <span>CHUNKSIZE<\/span> <span>+<\/span> <span>innerRow<\/span><span>];<\/span>\n  <span>}<\/span>\n  
<span>\/\/ need to sync again at the end, to avoid faster threads<\/span>\n  <span>\/\/ fetching the next chunk into the cache before slower threads are done<\/span>\n  <span>__syncthreads<\/span><span>();<\/span>\n<span>}<\/span>\n<span>C<\/span><span>[<\/span><span>innerCol<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>innerRow<\/span><span>]<\/span> <span>=<\/span> <span>alpha<\/span> <span>*<\/span> <span>tmp<\/span> <span>+<\/span> <span>beta<\/span> <span>*<\/span> <span>C<\/span><span>[<\/span><span>innerCol<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>innerRow<\/span><span>];<\/span>\n<\/code><\/pre>\n<\/div>\n<p>This kernel achieves ~3000 GFLOPS, a 50% improvement over the previous version.<label for=\"26\"><\/label><span>There\u2019s only a 50% improvement partly because our previous kernel already had pretty good L1 cache hit rates.<\/span><br \/>\nWe\u2019re still far away from hitting the ~30 TFLOPs that the GPU can provide.<br \/>\nThis is obvious from the roofline plot below:<label for=\"27\"><\/label><span>Notice how we\u2019re achieving a higher memory bandwidth than cuBLAS. 
But because we\u2019re doing much less work per byte loaded from memory (=lower arithmetic intensity), overall performance is worse.<\/span><\/p>\n<p><img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/roofline_kernel_3.png\" alt=\"Roofline analysis of kernel 3\"><\/p>\n<p>At a CHUNKSIZE of 32, this uses <code>2*32*32*4B=8KB<\/code> of shared memory space.<label for=\"28\"><\/label><span>This info can also be obtained by compiling with <code>--ptxas-options=-v<\/code>, which outputs: <code>Used 37 registers, 8192 bytes smem, 400 bytes cmem[0]<\/code>.<\/span><br \/>\nMy A6000 GPU has a maximum of 48KB of shared memory space available for each block, so we\u2019re far away from hitting that limit.<br \/>\nThis is not necessarily a problem, as there are downsides to increasing per-block shared-memory usage.<br \/>\nEach multiprocessor (SM) has a maximum of 100KB of SMEM available.<br \/>\nThis means that if we\u2019d modify our kernel to use the full 48KB of SMEM available, each SM could only keep two blocks loaded at the same time.<br \/>\nIn CUDA parlance, increasing per-block SMEM utilization can decrease <a href=\"https:\/\/docs.nvidia.com\/cuda\/cuda-c-best-practices-guide\/index.html#occupancy\">occupancy<\/a>.<br \/>\nOccupancy is defined as the ratio between the number of active warps per SM and the maximum possible number of active warps per SM.<\/p>\n<p>High occupancy is useful because it allows us to hide the high latency of our operations, by having a bigger pool of issue-able instructions available.<label for=\"29\"><\/label><span>On GPUs, math operations like FMA have a latency of 4 cycles which is equal to 2.6ns at a 1.5GHz clock. 
Compare this to a recent x86 CPU, where FMA has a 6 cycle latency or 1.8ns at a 3.5GHz clock.<\/span><br \/>\nThere are three main limits to keeping more active blocks loaded on an SM: register count, warp count and SMEM capacity.<br \/>\nLet\u2019s do an example calculation for our current kernel.<\/p>\n<h3 id=\"occupancy-calculation-for-kernel-3\">Occupancy Calculation for Kernel 3<\/h3>\n<p>Here are the relevant hardware stats for my GPU, obtained from the <code>cudaGetDeviceProperties<\/code> API (Multiprocessors are the SMs we talked about earlier):<label for=\"30\"><\/label><span>The amount of shared memory is configurable by using a feature called <code>SharedMemoryCarveout<\/code>. The so-called unified data cache is partitioned into L1 cache and shared memory, so we can trade-off less shared-memory for more L1 cache.<\/span><\/p>\n<table readabilityDataTable=\"1\">\n<thead>\n<tr>\n<th>Metric<\/th>\n<th>Value<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Name<\/td>\n<td>NVIDIA RTX A6000<\/td>\n<\/tr>\n<tr>\n<td>Compute Capability<\/td>\n<td>8.6<\/td>\n<\/tr>\n<tr>\n<td>max threads per block<\/td>\n<td>1024<\/td>\n<\/tr>\n<tr>\n<td>max threads per multiprocessor<\/td>\n<td>1536<\/td>\n<\/tr>\n<tr>\n<td>threads per warp<\/td>\n<td>32<\/td>\n<\/tr>\n<tr>\n<td>warp allocation granularity<\/td>\n<td>4<\/td>\n<\/tr>\n<tr>\n<td>max regs per block<\/td>\n<td>65536<\/td>\n<\/tr>\n<tr>\n<td>max regs per multiprocessor<\/td>\n<td>65536<\/td>\n<\/tr>\n<tr>\n<td>reg allocation unit size<\/td>\n<td>256<\/td>\n<\/tr>\n<tr>\n<td>reg allocation granularity<\/td>\n<td>warp<\/td>\n<\/tr>\n<tr>\n<td>total global mem<\/td>\n<td>48685 MB<\/td>\n<\/tr>\n<tr>\n<td>max shared mem per block<\/td>\n<td>48 KB<\/td>\n<\/tr>\n<tr>\n<td>CUDA runtime shared mem overhead per block<\/td>\n<td>1024 B<\/td>\n<\/tr>\n<tr>\n<td>shared mem per multiprocessor<\/td>\n<td>102400 B<\/td>\n<\/tr>\n<tr>\n<td>multiprocessor count<\/td>\n<td>84<\/td>\n<\/tr>\n<tr>\n<td>max warps per 
multiprocessor<\/td>\n<td>48<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>And here are the resource demands for our kernel:<\/p>\n<table readabilityDataTable=\"0\">\n<tbody>\n<tr>\n<td>Registers per Thread<\/td>\n<td>37<\/td>\n<\/tr>\n<tr>\n<td>SMEM per Block<\/td>\n<td>8192 B<\/td>\n<\/tr>\n<tr>\n<td>Threads per Block<\/td>\n<td>1024<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Work is scheduled onto the SMs at block granularity.<br \/>\nEach SM will load more blocks, as long as it has enough resources to accommodate them.<br \/>\nCalculation:<label for=\"31\"><\/label><span>I found lots of official and unofficial occupancy calculators, but no official formula for calculating the occupancy. The results are correct (I checked using NVIDIA\u2019s official tools), but there may be small errors, e.g. in the application of rounding.<\/span><\/p>\n<ul>\n<li><strong>Shared memory<\/strong>: 8192B\/Block + 1024B\/Block for CUDA runtime usage = 9216B\/Block. (102400B per SM) \/ (9216B per Block) = 11.11 \u21d2 11 Blocks upper limit.<\/li>\n<li><strong>Threads<\/strong>: 1024 Threads per Block, max 1536 threads per SM \u21d2 Upper limit 1 block.<\/li>\n<li><strong>Registers<\/strong>: 37 regs per thread * 32 threads per warp = 1184 regs per warp. Register allocation granularity is 256 regs on a warp level, hence rounding up to 1280 regs per warp. We have (1024 threads \/ 32) = 32 warps per block, hence 1280 regs per warp * 32 warps per block = 40960 regs per block. Max 65536 regs per SM \u21d2 upper limit 1 block.<\/li>\n<\/ul>\n<p>So this kernel is limited by the number of threads per block and the number of registers per thread.<br \/>\nWe cannot load more than one block per SM, giving us a final occupancy of 32 active warps \/ 48 max active warps = 66%.<\/p>\n<p>A 66% occupancy is not too bad, so this doesn\u2019t explain why our kernel runs so slowly.<br \/>\nLooking at the profiler gives us some hints. 
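As an aside, the block-limit arithmetic above is easy to script. Below is a sketch, not NVIDIA's official calculator: the hardware numbers are hard-coded from the tables above, and the rounding follows my bullet-point calculation, which may differ slightly from the official tools.

```python
# Occupancy sketch for kernel 3 on the A6000 (numbers from the tables above).

# hardware limits (from cudaGetDeviceProperties)
SMEM_PER_SM = 102400       # bytes
MAX_THREADS_PER_SM = 1536
MAX_REGS_PER_SM = 65536
MAX_WARPS_PER_SM = 48
WARP_SIZE = 32
REG_ALLOC_UNIT = 256       # registers are allocated per warp in units of 256

# kernel resource demands
smem_per_block = 8192 + 1024   # +1024B CUDA runtime overhead per block
threads_per_block = 1024
regs_per_thread = 37

# upper limit on blocks per SM imposed by each resource
smem_limit = SMEM_PER_SM // smem_per_block              # 102400 // 9216 = 11
thread_limit = MAX_THREADS_PER_SM // threads_per_block  # 1536 // 1024 = 1

warps_per_block = threads_per_block // WARP_SIZE        # 32
# round per-warp register demand up to the allocation granularity: 1184 -> 1280
regs_per_warp = -(-(regs_per_thread * WARP_SIZE) // REG_ALLOC_UNIT) * REG_ALLOC_UNIT
reg_limit = MAX_REGS_PER_SM // (regs_per_warp * warps_per_block)  # 65536 // 40960 = 1

blocks_per_sm = min(smem_limit, thread_limit, reg_limit)
occupancy = blocks_per_sm * warps_per_block / MAX_WARPS_PER_SM
print(f"blocks/SM: {blocks_per_sm}, occupancy: {occupancy:.1%}")  # 1 block, 66.7%
```

This reproduces the 32 active warps / 48 max active warps = 66% figure derived above.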
First, if we look at the mix of executed instructions, most of them are memory loads:<label for=\"32\"><\/label><span><code>LDS<\/code> are shared memory loads. <code>FMA<\/code> is our fused multiply add. <code>IADD3<\/code> is a \u201c3 input integer addition\u201d, which we need for moving the pointers along the K dimension.<\/span><\/p>\n<p><img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/kernel_3_profiler_instr_mix.png\" alt><\/p>\n<p>Our inner loop looks like this in PTX (<a href=\"https:\/\/godbolt.org\/#z:OYLghAFBqd5TKALEBjA9gEwKYFFMCWALugE4A0BIEAZgQDbYB2AhgLbYgDkAjF%2BTXRMiAZVQtGIHgBYBQogFUAztgAKAD24AGfgCsp5eiyahUAV0wtyKxqiIEh1ZpgDC6embZMQAVgBs5M4AMgRM2AByngBG2KQgAMwA7OQADuhKxA5Mbh5evgFpGfZCIWGRbDFxSdbYtsVMIkQspEQ5nt7SyTbYdlmNzUSlEdGxCV1NLW15ndYTg6HDFaNJAJTW6GakqJxcAKQATPGhqB44ANS78S4SwGTESGyXuLtaAIIHR0wnFtgXV6hKIiEdBPF7vQ7HU6\/S4uAFA%2BgEKKgt4fSE\/P6wsxRIxKAD6ADd9gA6JDI8Gfb7nGHmSy40hmYQEDgkslgj44OhhM4uXAASSCuIAIryAGoQACy5DO4RWZyg4tlBwAQnKZQBaHiygD0qpWbLeuNxwHo6CiEkNZ3x6AImDOSmA2DYbFxSiQzWwmFxHGd2PQqAA1hBQkQzpKzsHpVKIwBpKU0E0sEMSFJu8hgs4ZzNZ7M53N5zMYJiAs7x9CJs4AKleUsLxdL5YrSrTb3zrbb%2BfrIZiTTjCZDFZcisSSvTZ1rIYjSqCAHkXNGRLyAFrPeKC%2BL7S4jlsXbdanVEJC\/DZEFJmLsmgNnA\/lgDuvxvxhDJDH6DYp6Iv1CV6QBCU39I2AsJgvoBqO45nGYEaoAASugN5\/IKZwgf6vKYOoRLqJuYFCMWkHCGOOQIUhF4oWhRIAJ5YSiu46hIF6Jr8URmDQNCxCWZBjpsAH4ch4ZMCWLDFq67q2t6o57naboAaJjrhn%2BwnSUh2BEHezBnHR\/6AZgf5fiwxF%2Bv6o6GgpHoWp2ZyvEouw%2BEqU6zvOS64JWZx2XOC7LtZgpUa8GbGVJpm4iWfYuVZNmuQ5y7OeF7nPD4XnxFu7w0d%2Bn5MGEpBnKQcEXPsfgvvQ363tgYBcAB6moNsSgZCYfHfr%2BmlAdhRYhnhT5IABQGEZciEHh1mCoeh6g5T4Lkzm5jneRm4GtQ1mCwfB3WzQNGFnDq0UTQl%2Bo%2BattGYPixjbGcaTBrEf7PgevyAgMoTAEd6SZDho4%2Bcqi0wdlFajfZMXOdGm7tnmElZTe3VvTeNbuN1WijiqL2rgR7hRWNEUriq\/2toDcGQ%2BD9Agzko4uDlI5w6DiNfY5znhIT8MFR960eQlO2ZZjq6g9juMQ9R23mUQb5EVoRJQ5t26CBlQY8aRQ2LYLKpRBLGJnL9DOywNhOLXTuBDolWYSQAEiw%2BK\/IBqBILNZwJraQhHjQKVnLUjrMEQOn8c9uUuaOGYSeKLD%2Br8F2zYREDdb1WnLeosp%2B7WPRngQBt8TgmHbh7%2B7oOp9AmvBxqmhIZzemQ5HlZVf4YBI2BKKgN3uxZoVKsHQHzaT42
Rcq\/sQ3FRHvDZtdze9CtU13eNxZNmZKtXXf17TSPfc3\/et4havWTX7VaeP0p90vnWz95lcSbxXdO3Vf68Qy9gFeIxufn%2BNBmGn%2BdpKeRgfpglfGeRXx7xAepC9tGYu0TiHq0PDMMN9h\/0%2Bo3JyH1whb0TozbA6go4fhtpgdAJ4sqYDMHYM4QgbbmFINxIg9B85n0PLaZClcRZygjMgogKspZ\/WobQq4YDkZ\/ReiAhhaFNaVwzDzFIqs4aWQXmPHu6sqYcMwm3Cs3C8wjwXuIhuyM159QHvFLWmZdiJC8jAiSYQPRXhTkoV%2BqB1LABYLpNqhsmCYClM%2BfW1pbQ0EEh%2BDKe9t46lYkQY2N0bZhHUOeAyfFzqHjHCwc%2BSkRaXXTmxPe6kyrILCM\/F0Rj36fzURorR20XBCPXt3eCkClFaRUQhaRyY3TOV4VTbsekPpZM7jklelNp45JUVvTRXA1j0G4D4fg3guA6HIOgbgLgFCCh8lkkBzclAbC2NCQ4fByBEG0O0tYh4gKjA\/uQf0vgtCGG4NIHpSyBncH4EoEAOzFl9PaeQOAsAUAYDfAwWIlBqD3JSI8uITB8QVR4DwZIOB8QEG2CKAg2AbzThSMwQ5dB6DONORAKIhyoihGaORbg8z7kcGENOJghDDk4DYMYY0Ox%2BmEAAr0A2pzLmBHgeYD8aL%2BAnU6VShEURSAorcDgQ5RBSBMnpWseMLBgBKGBaC8FkLeD8EEMIMQJcpCyClYoFQGhDn6H2IYQlaALBWBZacyAax0ApHqJSk5dteiOAgM4KY3geCBCsUMcolQDCFAetkdw7QnX3XqPakYcQbXdDNQ0OYVqDD%2BvqP0Fo3qli%2BtmAMYNfq5iRsdZqdYmxthSA6V0g5VLBlcGlCKFwBMflEkSALOU%2BBiAcQ%2BJqfgFydArBWVpdZawtk%2BB2Uy\/Z5Ben9JzScs5Cyln1t2VwfY\/A2AgGkPzPwPhEgAE5DhaH2IkeIWgfAzp8D4Tthye39suWsG5yAQD\/MBdgZ5EBXnvPCOwHY4R82Fp4MWgW\/APQVp5ZgAwCqZWSBkHIYQyg1CaCpaqmodQshOCscG6QAAOW1mBE2jB4D4G1zr6hxtSJ6rIcHfXSBncBnoYag1uumNB0NfQE0LAdfBxDMbJiEetdR%2BYZQfVypnWsbl2BsA2jORmrg3TN3Zu4IKbAALDoirvBlG9BazhFpLVoMthASAZSrVKNwDzGCKcOPsFYNaB0NrWXEDZWyeBaDbXs0dvgZ1EniNIfYWhEh%2BBnYkXKM6J2yC7fwbdpzzk6c2VIYzQ74hZu7ccndda1gG1IBkRw0ggA%3D\">Godbolt link<\/a>):<\/p>\n<div>\n<pre><code><span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f91<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>3456<\/span><span>];<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f92<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r7<\/span><span>+<\/span><span>108<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      
<span>%<\/span><span>f93<\/span><span>,<\/span> <span>%<\/span><span>f92<\/span><span>,<\/span> <span>%<\/span><span>f91<\/span><span>,<\/span> <span>%<\/span><span>f90<\/span><span>;<\/span>\n<\/code><\/pre>\n<\/div>\n<p>That\u2019s not good, given that a memory load is bound to have a higher latency than a simple FMA, and given that we know our kernel should be compute bound.<br \/>\nWe see this effect when looking at the profiler\u2019s sampling of warp states.<br \/>\nThis quantifies how many cycles were spent in each state per executed instruction:<label for=\"33\"><\/label><span><code>Stall Not Selected<\/code> means that the warp was eligible to be scheduled, but the scheduler selected another eligible warp instead. This adds evidence to our earlier hypothesis that occupancy is currently not a problem.<\/span><\/p>\n<p><img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/kernel_3_profiler_warp_stalls.png\" alt><\/p>\n<p>The meaning of the states is documented in the <a href=\"https:\/\/docs.nvidia.com\/nsight-compute\/ProfilingGuide\/index.html#metrics-reference\">Kernel Profiling Guide<\/a>.<br \/>\nFor <code>Stall MIO Throttle<\/code> it reads:<\/p>\n<blockquote>\n<p>Warp was stalled waiting for the MIO (memory input\/output) instruction queue to be not full. 
This stall reason is high in cases of extreme utilization of the MIO pipelines, which include special math instructions, dynamic branches, as well as shared memory instructions<\/p>\n<\/blockquote>\n<p>We\u2019re not using special math instructions or dynamic branches, so it\u2019s clear that we\u2019re stalling waiting for our SMEM accesses to return.<br \/>\nSo how do we make our kernel issue fewer SMEM instructions?<br \/>\nOne way is to have each thread compute more than one output element, which allows us to perform more of the work in registers and rely less on SMEM.<\/p>\n<h2 id=\"kernel-4-1d-blocktiling-for-calculating-multiple-results-per-thread\">Kernel 4: 1D Blocktiling for Calculating Multiple Results per Thread<\/h2>\n<p>So this next kernel works like our last kernel, but adds a new inner loop for calculating multiple C entries per thread.<br \/>\nWe now use a SMEM cache size of <code>BM*BK + BN*BK = 64*8 + 64*8 = 1024<\/code> floats, for a total of 4KB per block.<br \/>\nBelow is a visualization.<br \/>\nI have highlighted two of the threads and the values they access in the inner loop in orange and red.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/kernel_4_1D_blocktiling.png\" alt><\/p>\n<p>All of the important changes for this kernel happen in the inner loop.<br \/>\nThe loading from GMEM to SMEM stays largely the same as before.<br \/>\nLet\u2019s have a look:<label for=\"34\"><\/label><span><a 
href=\"https:\/\/godbolt.org\/#z:OYLghAFBqd5TKALEBjA9gEwKYFFMCWALugE4A0BIEAZgQDbYB2AhgLbYgDkAjF%2BTXRMiAZVQtGIHgBYBQogFUAztgAKAD24AGfgCsp5eiyahUAV0wtyKxqiIEh1ZpgDC6embZMQs5wBkCJmwAOU8AI2xSEAB2LXIAB3QlYgcmNw8vHwSklKEAoNC2CKjY62xbeyERIhZSInTPb1kbbDtU6tqifJDwyJi4pRq6hszmoa7AnqK%2B2IBKa3QzUlROLgBSACYAZkDUDxwAajWtlwlgMmIkNmPcNa0AQU2dpj2LbCOT8SUVOpu7x%2B2u3272OLlQg0I6D%2BDyeQLeHzBEPoBDC0IBz1eh1B5jCRiUAH0AG4bAB0SDRsJewIR5ks%2BNIZmEBA4ZIpMO2ODoQQOLlwAEk\/PiACJ8gBqEAAsuQDsFZgcoBK5ZsAELy2UAWh4coA9GrZv9\/vj8cB6OgwhIjQdCegCJgDkpgNg2GweEKAOq1eIAFQYgWAEECRAOUoOgZl0rDAGlpTRTSwgxJ4kgrP8Dmn0xnM1ns1mMExBgdY%2Bh4wcAFT3aV5gtFkul5UxuNBiI1cipnPtjtZmtB0suJXRZUGh5pqtBsPKiUfIUHABs0mOg%2BHB1HoeEB2VwSns\/nW0X9xHQgL48jW4AHAu2yuw17J8dp%2Bfd23tbq%2BTQDgB3d6xgjxA7qA7GHaACeBxIJE7yfgcjpBgAflsWibAArAcjDfAc8SRIIpBsMYKyFmQKG1I6Bw4UQpAECsSgkk%2BupemBy5LKQzBENK8FIYWLCDJEy5CHQwBLPGqQHMwShLNgSgHEQyZNqaqAANYSe%2Blz2tgACOZjMQQEg0QcuLoPJfJChJLCoJREkYBk%2BYHOgb6qio6maRI9BAdK75IAw7xKMm5EmJJ9FKOw7ykOg77WW%2B9zUUuz4HHRnmmp%2BpA8UwfECZUTAfos9B2l5tTvBZjQSTZBwVrpZgJqZ4kSYG6DrhlZhZTpEQHEwQjqvZGlMhIJIHCINVSV%2BnFENxeYpaQglCKBnG6dgRBDYlSjxIJEgofpEjEEBOmAaBzB4SwUFMfG3F%2BBsoHEAcY1DZF%2B5JQWZhhqgABKIVbnpBmYOoJIbY%2BS4rnda6oOkL2yXJfLvSS6gXjCP2HkGf1BiQNT0A94n1UQSjKsD9iMFuE5luuwSQ9d0X7lJB12gQElMQth4otjWHLhI5hGPYvk3sJjAcMIVXpf1umYx5l4wwccPNZ4dFk%2Bj\/PY3eknoIjyOifQaMY\/pclY%2B8tESoTOkK6jSiqJEKvyQcuq60r%2BuROL2AsHad4y1bNsW6QRtyW2nE\/EQEBMGLSASy76tTjLr1yUKzLg\/q32PFFurW6gSAfp6H4MPQDP0Ezh0HFsGylmz5ROsxSiucpWfTX6fl5e4jS1AchBsFdB75mOa6k9brjuFuLc26DH3\/uxG6Ew3R7N77rdPaFMud5g3fgyb%2BPa9HAH0LJGcLSZX4Ebz5ikExa7BwHgQ9RKuBa0uRo5UxmCWt2xVKGsiHKrjpbrpGd9CgPBxn952CX\/ihaNuut977KhPE\/fuiE36Rx0hKdAhJ3h7w8rLaawBAhMDLkVe4YAuCU2eltZUmDzKVy8G2fcKoZaPWeqAycT8X6RzTKqUhWxpwA3bqAgmtCeRHA2IORhy4x540fjKThqpmEp1YfPImuoSCYHQCAACmBdBmALFJCmiDY7x0noguMdo2Cox\/NjZi5FxIASYJgRqM05rCXUPEU0Z1N4mX8gQAAXuJN23xIiewEcAwOPDg6hzruoCOe40zu3cRADc\/CTx2x8cDPx4d36\/TDCgyI6QSE8MntPXuGxkLAIXCbXU75PTqkYLAlOABxI%2Bk4MASHEqgP0gtG7C0SUwIIpAx6pOnOksG\/5dQ5PYQktcSTSDpHoWkkeXc
umcOyWw1U0UCmkHiEU7AJSDjlOPjxapShakmHqbdJpLSx4jI6WMqeEyenTKHBIxey8hp%2BTJkU1aKdxBx3XolKmetVznWwMgripA6CMDbNfSeZs0Z32VDeV%2BW41gDi0CSBC0QIF7h0osCxpp0C\/hgdxYOkkPK3yXPTCAIswgg3eluBCu5dLEt7icA4NDVREu7kImWwD%2BxBPTNFRI8R6oZ15iICpDNnm4uumme4gDlSDLaREoRq4WkpIhTLR499xUhX3NQqVgzZXgPfnQ0VSr3yqlYWq5pyT3CDnATjUFur9WCJVNK419BTUIrbGmM%2BQEXiTyUBAQJFyMzRRtoSXCcCpbYCdcVRlPDemsroWG6cXin7nKhkKvJqd043IwqQdUGi3nmxDfikW0iiAMplmS1U%2BbC3UojYyrhpb3ospDWmWZ7wcJyXeLzfN8RgqYDMHYFaaLy7WTKskQ4qL4iuXcnHDitTkQ1CGoKrM0UmKKPeEVXmyoJIGJch%2BMd8dILiHSk8%2BiB89pEDYL%2Bf1pB66ZkBSenGorq3\/gNTaye6RX5avTLmsMVMy3TmLZ8pQZaXAxS1uS0hXDP01qOAOOt6YgUo2zffMDEMzUMLfkuDsIrQUQCBRQwDUqENylAZErhNc5bdwhU\/Y98RX1pihShxNNGQ0urdccj1Xqlz0YTfW\/J5EbnIr7VmkFeKCIEo\/eJL9Bwf0IYRIB3JIHlR4Yg5GnkGGsOhSfmzG18m41Sqfe4OVjCoPBPoEmPa5HjnAtFQhuVXCDPTRqHjFwymzPYfU0RzT1qiM6foC%2B9h7HHjwq4PMeg3BEL8G8FwHQ5AZFcBcAoIU%2B4HPWaI0oRYywQTbD4OQIg2gAvzDAjbPonryByRAIhOIQWuDSH4GwHw0gSSnnq1sGcTWNjRGkAATliHOcgYWItRf4EoEAcQsvhYC%2BQOAsAUAYBPR5CgVAIBTfiDNkATBCSmR4DwaI5AcCEgotgUUBBsDvgAPIYR6\/wP5c0BsQDCNl8gYRAi1CAtwDLU3OZECO0wZyt2cA4RMJIEbW2CBMTaLAgbAPsDqFaGVVYGXAzlFu8iMIY1SBATcDgW7ZFmTPdG7GFgwAlD7cOyd5g2O5DCDENUqQshBDCGUGoTQAP9AbEMMYUwFgrCI4G5AeYaK0pg\/VEdrYBx1RCmCEKXAyoFClOF%2BqR0LTDrqgwDgO8tQ453gW9D\/Ep4ZzkCV8G\/T98NdDS1zrpQbATcQvVOqeIRB1CcXVLz1It9GHqkJMLiE6vZMbf63nNojgIDOBGN4Hg5B\/CTEKMUAwiRkhpSD1HnIaVugR76CHlofumAdGGIQ4PZQKjtHGEn3oURU\/jDjyXzohfpjF\/mMlpYKwDAFOECdgtyUot8EC8F0Lt2osylFC4AD62STRFhfKfAxACJPC1PwYbOhZi5dbgVjvFWqsgC2DwEkPANgtdPIhU80hpDRD39ELY3Xu\/cH64NzL2X5jjeQCAZFnKWJzYWzN4IgVuDBD7wP9fw%2BIvf3H%2BRJgAYDTqIOIJIDIGTooCoBoLdkzuQO%2BGNPENjkviFqfgDlFkdmVI\/mFL3v3gcIPr\/vKG4NNowIlE8BsLMNPtfkvpVuQNVtsCSG1vVkfohC1tEDwKeBsFoJtmdpFuftYJfjPjlkViVmVtwFsF3ugfwUITfogBACgNtrtpQNQC\/qQW\/hwB\/l\/vgT\/rCvwP\/iQIAcAfIBTuAdTvIHTjAYziAMzmnmlE4CYmXqHiYpXpHiHtHrkGkNnvHjHqkC4Snrnq0GlJnvUF4anr7kEQXuHkXgYIMJ0I4bEXUH4dXplkxNgLaINigZIb1twEKEsrtgcITglLgd\/kPiPhAGPgYZwmvtKMQYtqQVURQVQSNnPuQHljgFEIVsVqVoYNwLQbwX1gIUNtQT0VwBsCvjwFoHEP0dIcMbAqQMkI4NIEAA%3D%3D%3D\">Go
dbolt link<\/a>.<\/span><\/p>\n<div>\n<pre><code><span>\/\/ allocate thread-local cache for results in registerfile<\/span>\n<span>float<\/span> <span>threadResults<\/span><span>[<\/span><span>TM<\/span><span>]<\/span> <span>=<\/span> <span>{<\/span><span>0.0<\/span><span>};<\/span>\n\n<span>\/\/ outer loop over block tiles<\/span>\n<span>for<\/span> <span>(<\/span><span>uint<\/span> <span>bkIdx<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>bkIdx<\/span> <span><<\/span> <span>K<\/span><span>;<\/span> <span>bkIdx<\/span> <span>+=<\/span> <span>BK<\/span><span>)<\/span> <span>{<\/span>\n  <span>\/\/ populate the SMEM caches (same as before)<\/span>\n  <span>As<\/span><span>[<\/span><span>innerRowA<\/span> <span>*<\/span> <span>BK<\/span> <span>+<\/span> <span>innerColA<\/span><span>]<\/span> <span>=<\/span> <span>A<\/span><span>[<\/span><span>innerRowA<\/span> <span>*<\/span> <span>K<\/span> <span>+<\/span> <span>innerColA<\/span><span>];<\/span>\n  <span>Bs<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>BN<\/span> <span>+<\/span> <span>innerColB<\/span><span>]<\/span> <span>=<\/span> <span>B<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>innerColB<\/span><span>];<\/span>\n  <span>__syncthreads<\/span><span>();<\/span>\n\n  <span>\/\/ advance blocktile for outer loop<\/span>\n  <span>A<\/span> <span>+=<\/span> <span>BK<\/span><span>;<\/span>\n  <span>B<\/span> <span>+=<\/span> <span>BK<\/span> <span>*<\/span> <span>N<\/span><span>;<\/span>\n\n  <span>\/\/ calculate per-thread results<\/span>\n  <span>for<\/span> <span>(<\/span><span>uint<\/span> <span>dotIdx<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>dotIdx<\/span> <span><<\/span> <span>BK<\/span><span>;<\/span> <span>++<\/span><span>dotIdx<\/span><span>)<\/span> <span>{<\/span>\n    <span>\/\/ we make the dotproduct loop the outside loop, which facilitates<\/span>\n    <span>\/\/ reuse of the Bs entry, which we 
can cache in a tmp var.<\/span>\n    <span>float<\/span> <span>Btmp<\/span> <span>=<\/span> <span>Bs<\/span><span>[<\/span><span>dotIdx<\/span> <span>*<\/span> <span>BN<\/span> <span>+<\/span> <span>threadCol<\/span><span>];<\/span>\n    <span>for<\/span> <span>(<\/span><span>uint<\/span> <span>resIdx<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>resIdx<\/span> <span><<\/span> <span>TM<\/span><span>;<\/span> <span>++<\/span><span>resIdx<\/span><span>)<\/span> <span>{<\/span>\n      <span>threadResults<\/span><span>[<\/span><span>resIdx<\/span><span>]<\/span> <span>+=<\/span>\n          <span>As<\/span><span>[(<\/span><span>threadRow<\/span> <span>*<\/span> <span>TM<\/span> <span>+<\/span> <span>resIdx<\/span><span>)<\/span> <span>*<\/span> <span>BK<\/span> <span>+<\/span> <span>dotIdx<\/span><span>]<\/span> <span>*<\/span> <span>Btmp<\/span><span>;<\/span>\n    <span>}<\/span>\n  <span>}<\/span>\n  <span>__syncthreads<\/span><span>();<\/span>\n<span>}<\/span>\n<\/code><\/pre>\n<\/div>\n<p>This kernel achieves ~8600 GFLOPs, 2.9x faster than our previous kernel.<br \/>\nLet\u2019s calculate how many memory accesses each thread performed in our previous kernel, where each thread calculated one result:<\/p>\n<ul>\n<li>GMEM: K\/32 iterations of outer loop * 2 loads<\/li>\n<li>SMEM: K\/32 iterations of outer loop * BLOCKSIZE (=32) * 2 loads<\/li>\n<li>Memory accesses per result: K\/16 GMEM, K*2 SMEM<\/li>\n<\/ul>\n<p>And for our new kernel, where each thread calculates eight results:<\/p>\n<ul>\n<li>GMEM: K\/8 iterations of outer loop * 2 loads<\/li>\n<li>SMEM: K\/8 iterations of outer loop * BK(=8) * (1 + TM(=8))<\/li>\n<li>Memory accesses per result: K\/32 GMEM, K*9\/8 SMEM<\/li>\n<\/ul>\n<p>As expected, we now spend far fewer cycles per instruction stalling due to memory pressure:<label for=\"35\"><\/label><span>Careful: The axis has changed compared to the previous plot.<\/span><\/p>\n<p><img decoding=\"async\" 
src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/Kernel_4_profiler_warp_stalls.png\" alt><\/p>\n<h3 id=\"sidenote-on-compiler-optimizations\">Sidenote on Compiler Optimizations<\/h3>\n<p>Above we explicitly cached the entry of B into <code>Btmp<\/code> and reordered the two inner loops for efficiency.<br \/>\nIf we don\u2019t do that, then the code looks like this:<\/p>\n<div>\n<pre><code><span>for<\/span> <span>(<\/span><span>uint<\/span> <span>resIdx<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>resIdx<\/span> <span><<\/span> <span>TM<\/span><span>;<\/span> <span>++<\/span><span>resIdx<\/span><span>)<\/span> <span>{<\/span>\n  <span>for<\/span> <span>(<\/span><span>uint<\/span> <span>dotIdx<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>dotIdx<\/span> <span><<\/span> <span>BK<\/span><span>;<\/span> <span>++<\/span><span>dotIdx<\/span><span>)<\/span> <span>{<\/span>\n    <span>threadResults<\/span><span>[<\/span><span>resIdx<\/span><span>]<\/span> <span>+=<\/span>\n      <span>As<\/span><span>[(<\/span><span>threadRow<\/span> <span>*<\/span> <span>TM<\/span> <span>+<\/span> <span>resIdx<\/span><span>)<\/span> <span>*<\/span> <span>BK<\/span> <span>+<\/span> <span>dotIdx<\/span><span>]<\/span> <span>*<\/span> <span>Bs<\/span><span>[<\/span><span>dotIdx<\/span> <span>*<\/span> <span>BN<\/span> <span>+<\/span> <span>threadCol<\/span><span>];<\/span>\n  <span>}<\/span>\n<span>}<\/span>\n<\/code><\/pre>\n<\/div>\n<p>Interestingly, this has no adverse effect on performance.<br \/>\nThis is surprising since our inner two loops now incur BK (=8) * TM (=8) * 2 = 128 SMEM accesses, instead of the previous 72.<br \/>\nLooking at the assembly (<a 
href=\"https:\/\/godbolt.org\/#z:OYLghAFBqd5TKALEBjA9gEwKYFFMCWALugE4A0BIEAZgQDbYB2AhgLbYgDkAjF%2BTXRMiAZVQtGIHgBYBQogFUAztgAKAD24AGfgCsp5eiyahUAV0wtyKxqiIEh1ZpgDC6embZMQAVnLOAGQImbAA5TwAjbFIQAHYtcgAHdCViByY3Dy9fJJS0oSCQ8LYomPjrbFt7IREiFlIiTM9vPxtsO3Ta%2BqJCsMjouISlOoam7NaRnuC%2BkoH4gEprdDNSVE4uAFIAJgBmYNQPHABqDZ2XCWAyYiQ2U9wNrQBBbb2mA4tsE7PxJRUGu4ez12%2B0On1OLlQw0I6ABTxeII%2BXwhUPoBAisKBr3ex3B5giRiUAH0AG5bAB0SAx8LeoKR5kshNIZmEBA4FKpcN2ODoISOLlwAEkAoSACICgBqEAAsuQjqF5kcoFKFdsAEKK%2BUAWh4CoA9Br5oDAYTCcB6OgIhITUdiegCJgjkpgNg2GweCKAOr1RIAFQYwWAEGCRCOMqOwblsojAGlZTRzSwQxJEkgrICjhnM1nsznczmMExhkd4%2BhE0cAFSPWUFoslsvl1VxhMhqJ1cjpvOdrs5ush8suFWxVVGp4ZmshiOqqVfEVHABs0lOw9HR3H4eER1VoRn88XO2XjzHQiLk%2BjO4AHEuO2uIz7p6dZ5f9x3dfqBTQjgB3T7xgiJI7qEcxgOgAnkcSDRJ835HM6IYAH47Fo2w%2BEcjC\/EciTRIIpBsMYazFmQqH1M6Ry4UQpAEGsShki%2B%2Bo%2BhBq4rKQzBELKiHIcWLDDNEq5CHQwArIm6RHMwSgrNgShHEQqYtuaqAANaSZ%2B1yOtgACOZgsQQEi0Uc%2BLoApAoipJLCoFRkkYFkhZHOgH7qioGlaRI9AgbKn5IAwnxKKmFEmFJDFKOwnykOgn42R%2Bjw0Sur5HPRXnmt%2BpC8Uw\/GCdUTBfss9AOt59SfJZzSSbZRxVnpZhJmZEmScG6CbplZjZbpURHEwQiag5mkshIZJHCItXST%2BXFEDxBapaQQlCOBXF6dgRDDUlSiJEJEioQZEjECBunAeBzD4SwMHMYmPEBFs4HEEc43DVFh7JUWZgRqgABKoU7vphmYOoZKbc%2BK5rvdG6oJkr1yfJAofWS6hXnCv3HiG\/0hiQdT0I9EkNUQSiqiD9iMDuU4VpuoRQzdMWHtJh0OgQknMYtx5ojj2GrhI5hGPYfl3iJjAcMI1UZQNelY5516w0c8MtZ49HkxjAs4w%2BUnoEjKNifQ6OYwZ8nY58dFSkTumK2jSiqNEqsKUc%2Bp68rBvRBL2AsA6D6y9btuW6QxvyR2XF\/EQEBMOLSCS67GszrLb3ySKrIQ4aP3PNF%2Bo26gSBft6X4MPQjP0MzR1HDsWzluzlQuixShuSp2czQG\/n5e4zT1EchBsNdR6FhOG5kzbrjuDure22Dn2ARxW5E43J4t37bfPWFstd5gPcQ6bBM6zHQH0HJmeLaZP6EXz5ikMxG4h4HwS9VKuDayuJq5cxmDWr2JVKBsPiqnj5abtG98ioPRznz52BX4SxbNpuO%2BD9VRnmfgPHw78o66SlOgYknx96eTljNYAwQmDl2Ko8MAXAqYvW2qqLBFkq5eA7IeNUssnovTAdOZ%2Br8o4ZnVGQnYs5AYdzAYTOhfIThbGHEw1c498ZPzlFw9ULDU5sIXsTfUJBMDoBAEBTAugzBFmkpTJBccE5TyQQmB0bA0Z\/hxixCiEkgJMEwE1Wa80RLqESOac6W9TIBQIAALwku7X40QvaCJAUHXhIcw713UJHA8GYPYeIgFuARZ57a%2BJBv4iOH8\/oRlQdETIpDeFTxnn3LYKEQFLlNvqT83pNSMDganAA4sfacGAJASVQAGIWTcRZJKYCEUg480mzgyeDQC%2BpckcMSRuZJpBMgMPSaPbu3
SuE5PYeqGKhTSCJGKdgUpRwKkn14jUpQdSTANLus01p49RmdPGdPSZvSZkjkkUvFew1\/Lk2KWtVO4h44byStTfW64LrYBQdxUgdBGAdhvlPc26N76qjvG\/HcGwhxaDJEhWIkCDy6WWJY806B\/ywJ4iHKSnk74rgZhAUWERQYfR3EhfcekSV9zOEcWh6piU92EbLEBg5gmZhiskRIDVM58xEJUxmLy8U3QzI8IBqohntMicI9crTUmQtls8B%2BErQqHhodKoZcqIEf3oWK5Vn51RsPVS0lJ7hhwQNxmCvVBqhFqhlSa%2BgZrEUdgzOfECbwp5KAgEEy5WYYq22JHheB0tsDOpKky3hfS2X0PDbObxz8LnQ2FQRJKhKIzU0ZbLcl6p02kvBLFbWFKyHcJzYEk4Q5Q0ZgJaLGRRAM28KzbXeWdaXAvzyUW1UNae6sorZmYFqMLZgpLfK7h9sVxdlFWCiAwLKH5ulSWhUYConcMbbWj6kKwFis7aSw1tqp6ZDftqzM0L35jrLSepNrr3UnM9d6lcx6fX5K\/BRW5KKK5fKVqC\/FhFU0bhLWSvJf680QsLSO4tEku1lqjXySd06wrP3Zra%2Bd%2BNty7pOfu81o6k1ZmTKmfGfaP1iqHRhkdp6sytn2s\/FwMGTn8Pg\/eZdSH43Sr3e4A9HD71wgRVwRY9BuA%2BH4N4LgOhyCyK4C4BQIpDxUdA\/ZZYqwwS7D4OQIg2huOLAgrbAYXryDyV8AkXjXBpD8DYCAaQ0gyTnkszsOcNmtixGkAATniAucggnhOif4EoEACQVNCe4%2BQOAsAUAYDYIkTyFAqAQBC2FxgMQmDEjMjwHgsRyA4GJJRbA4oCDYE\/AAeUwm5\/g\/z5peYgBEVT5AIjBHqCBbgSmQtcyILlpgLkKs4FwiYSQfnUsEGYh0OBXnuvYHUO0cq6wlPBkqBV1EERxqkBAm4HAFXyKsjq\/5%2BMLBgBKCyzl\/LzA1tyGEGIGpUhZCCGEMoNQmhuv6C2IYYwpgLBWBm15yAix0XpUG5qXLOwjiahFKEEUuBVQKDKX9zUzpWlHU1BgHAD56jxwfNFsbhJzxznILDkNTCwXI%2BGqj9HSg2D48hZqTUiQiDqC4pqD76Q75MM1MSP7UIkftuS55\/OHRHAQGcGMbwPB\/CmN6MUUoBhkipHSrz0XeR0pC\/6DEfnbROdMC6KMIhfOKhVE6JMWXsx5fWEmJLhX2vpjC4GDqJYKw1gGEKcIfLtaUqib4DxvjAmKuiblOKFwLaktkliHCxU%2BBiCEReDqfgvmdDzHU23LTzvDPGZADsHgZIeBbDs%2BeHw54zOxEz7EHYrm3fcE8955TqnFiBeQCAFFXLWKRei%2BF0IQVuChE997pPfvhM\/yDxRTABhzuiHEJIGQh3FAqA0BV275BPzjUSGt2P\/H8\/ddE7l8q1fwoe690cH37fFRuFC%2BFrh2d5hh9L7Hoz5ATO7DJA5yzOefB2diDwc8WwtApcKyJwv1hi\/h7UzpvThhuA7Cu6L4f7f5l6IAQAoBpYZaUDUB16xYN4cBN4t6b5t5wr8Cd4kDd697yDHaD5nbyCXZj43YgB3aK7pROCmKG4C6YA64i785i75AZBq5S7i7pC0Fm4a7tDpQq6NDMEK4c7cHG5FBy4GDDDdBUFiENDsHy6LDkTYDYD2jeZz5AHubcAijLIZZHA7aJTr6t6%2B7%2B4QCB6YEH7858joB76xYH5bBH4l5%2BaR7kAaY4AxDaa6Y%2BD6bcBn5v4eaf4%2BYn7\/5cBbDx48BaAJBeEgF%2BFwKkCpCODSBAA%3D\">Godbolt link<\/a>) has the answer:<\/p>\n<div>\n<pre><code><span>\/\/ first inner-most 
loop<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f45<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r9<\/span><span>];<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f46<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f47<\/span><span>,<\/span> <span>%<\/span><span>f46<\/span><span>,<\/span> <span>%<\/span><span>f45<\/span><span>,<\/span> <span>%<\/span><span>f212<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f48<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r9<\/span><span>+<\/span><span>256<\/span><span>];<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f49<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>4<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f50<\/span><span>,<\/span> <span>%<\/span><span>f49<\/span><span>,<\/span> <span>%<\/span><span>f48<\/span><span>,<\/span> <span>%<\/span><span>f47<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f51<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r9<\/span><span>+<\/span><span>512<\/span><span>];<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f52<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>8<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      
<span>%<\/span><span>f53<\/span><span>,<\/span> <span>%<\/span><span>f52<\/span><span>,<\/span> <span>%<\/span><span>f51<\/span><span>,<\/span> <span>%<\/span><span>f50<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f54<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r9<\/span><span>+<\/span><span>768<\/span><span>];<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f55<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>12<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f56<\/span><span>,<\/span> <span>%<\/span><span>f55<\/span><span>,<\/span> <span>%<\/span><span>f54<\/span><span>,<\/span> <span>%<\/span><span>f53<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f57<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r9<\/span><span>+<\/span><span>1024<\/span><span>];<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f58<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>16<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f59<\/span><span>,<\/span> <span>%<\/span><span>f58<\/span><span>,<\/span> <span>%<\/span><span>f57<\/span><span>,<\/span> <span>%<\/span><span>f56<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f60<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r9<\/span><span>+<\/span><span>1280<\/span><span>];<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   
<span>%<\/span><span>f61<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>20<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f62<\/span><span>,<\/span> <span>%<\/span><span>f61<\/span><span>,<\/span> <span>%<\/span><span>f60<\/span><span>,<\/span> <span>%<\/span><span>f59<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f63<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r9<\/span><span>+<\/span><span>1536<\/span><span>];<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f64<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>24<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f65<\/span><span>,<\/span> <span>%<\/span><span>f64<\/span><span>,<\/span> <span>%<\/span><span>f63<\/span><span>,<\/span> <span>%<\/span><span>f62<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f66<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r9<\/span><span>+<\/span><span>1792<\/span><span>];<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f67<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>28<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f212<\/span><span>,<\/span> <span>%<\/span><span>f67<\/span><span>,<\/span> <span>%<\/span><span>f66<\/span><span>,<\/span> <span>%<\/span><span>f65<\/span><span>;<\/span>\n<span>\/\/ second inner-most 
loop<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f68<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>32<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f69<\/span><span>,<\/span> <span>%<\/span><span>f68<\/span><span>,<\/span> <span>%<\/span><span>f45<\/span><span>,<\/span> <span>%<\/span><span>f211<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f70<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>36<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f71<\/span><span>,<\/span> <span>%<\/span><span>f70<\/span><span>,<\/span> <span>%<\/span><span>f48<\/span><span>,<\/span> <span>%<\/span><span>f69<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f72<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>40<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f73<\/span><span>,<\/span> <span>%<\/span><span>f72<\/span><span>,<\/span> <span>%<\/span><span>f51<\/span><span>,<\/span> <span>%<\/span><span>f71<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f74<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>44<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f75<\/span><span>,<\/span> <span>%<\/span><span>f74<\/span><span>,<\/span> <span>%<\/span><span>f54<\/span><span>,<\/span> 
<span>%<\/span><span>f73<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f76<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>48<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f77<\/span><span>,<\/span> <span>%<\/span><span>f76<\/span><span>,<\/span> <span>%<\/span><span>f57<\/span><span>,<\/span> <span>%<\/span><span>f75<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f78<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>52<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f79<\/span><span>,<\/span> <span>%<\/span><span>f78<\/span><span>,<\/span> <span>%<\/span><span>f60<\/span><span>,<\/span> <span>%<\/span><span>f77<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f80<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>56<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f81<\/span><span>,<\/span> <span>%<\/span><span>f80<\/span><span>,<\/span> <span>%<\/span><span>f63<\/span><span>,<\/span> <span>%<\/span><span>f79<\/span><span>;<\/span>\n<span>ld<\/span><span>.<\/span><span>shared<\/span><span>.<\/span><span>f32<\/span>   <span>%<\/span><span>f82<\/span><span>,<\/span> <span>[<\/span><span>%<\/span><span>r8<\/span><span>+<\/span><span>60<\/span><span>];<\/span>\n<span>fma<\/span><span>.<\/span><span>rn<\/span><span>.<\/span><span>f32<\/span>      <span>%<\/span><span>f211<\/span><span>,<\/span> <span>%<\/span><span>f82<\/span><span>,<\/span> 
<span>%<\/span><span>f66<\/span><span>,<\/span> <span>%<\/span><span>f81<\/span><span>;<\/span>\n<span>\/\/ ... continues like this for inner-loops 3-8 ...<\/span>\n<\/code><\/pre>\n<\/div>\n<p>The compiler unrolls both loops<label for=\"36\"><\/label><span>The compiler can unroll them since the loop count is known at compile time.<\/span> and then eliminates the repeated SMEM loads of the <code>Bs<\/code> entries, so we end up with the same number of SMEM accesses as our optimized CUDA code.<\/p>\n<p>When the PTX is compiled to SASS, the SMEM loads from <code>As<\/code> are vectorized:<label for=\"37\"><\/label><span>This already hints at an optimization we\u2019ll perform later: Transposing <code>Bs<\/code> such that we can also vectorize those loads.<\/span><\/p>\n<div>\n<pre><code><span>LDS<\/span>     <span>R26<\/span><span>,<\/span> <span>[<\/span><span>R35<\/span><span>.<\/span><span>X4<\/span><span>+<\/span><span>0x800<\/span><span>]<\/span> <span>\/\/ a 32b load from Bs<\/span>\n<span>LDS<\/span><span>.<\/span><span>128<\/span> <span>R8<\/span><span>,<\/span>  <span>[<\/span><span>R2<\/span><span>]<\/span>           <span>\/\/ a 128b load from As<\/span>\n<span>LDS<\/span><span>.<\/span><span>128<\/span> <span>R12<\/span><span>,<\/span> <span>[<\/span><span>R2<\/span><span>+<\/span><span>0x20<\/span><span>]<\/span> \n<span>LDS<\/span>     <span>R24<\/span><span>,<\/span> <span>[<\/span><span>R35<\/span><span>.<\/span><span>X4<\/span><span>+<\/span><span>0x900<\/span><span>]<\/span> \n<span>LDS<\/span><span>.<\/span><span>128<\/span> <span>R20<\/span><span>,<\/span> <span>[<\/span><span>R2<\/span><span>+<\/span><span>0x60<\/span><span>]<\/span> \n<span>LDS<\/span>     <span>R36<\/span><span>,<\/span> <span>[<\/span><span>R35<\/span><span>.<\/span><span>X4<\/span><span>+<\/span><span>0xb00<\/span><span>]<\/span> \n<span>LDS<\/span><span>.<\/span><span>128<\/span> <span>R16<\/span><span>,<\/span> 
<span>[<\/span><span>R2<\/span><span>+<\/span><span>0x40<\/span><span>]<\/span> \n<span>LDS<\/span><span>.<\/span><span>128<\/span> <span>R4<\/span><span>,<\/span>  <span>[<\/span><span>R2<\/span><span>+<\/span><span>0x80<\/span><span>]<\/span> \n<span>LDS<\/span>     <span>R38<\/span><span>,<\/span> <span>[<\/span><span>R35<\/span><span>.<\/span><span>X4<\/span><span>+<\/span><span>0xd00<\/span><span>]<\/span> \n<\/code><\/pre>\n<\/div>\n<h3 id=\"areas-of-improvement-arithmetic-intensity\">Areas of Improvement: Arithmetic Intensity<\/h3>\n<p>Our current kernel still suffers from the same stalling-for-memory problem as kernel 3, just to a lesser extent.<br \/>\nSo we\u2019ll just apply the same optimization again: computing even more results per thread.<br \/>\nThe main reason this makes our kernel run faster is that it increases arithmetic intensity.<label for=\"38\"><\/label><span>Defined as the number of FLOPs executed per byte transferred (load + store!) between GMEM and SMEM.<\/span><br \/>\nBelow I tried to make it more immediately obvious why calculating more results per thread raises arithmetic intensity:<label for=\"39\"><\/label><span>It\u2019s more efficient to calculate a square of results per thread than a column of results because we can share more of the inputs: <img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/1d_warp_tiling.png\" alt><\/span><\/p>\n<p><img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/raising_arith_inten.png\" alt><\/p>\n<p>In conclusion, all our kernels perform the same number of FLOPs, but we can reduce the number of GMEM accesses by calculating more results per thread.<br \/>\nWe\u2019ll continue optimizing arithmetic intensity for as long as we\u2019re still memory bound.<\/p>\n<h2 id=\"kernel-5-increasing-arithmetic-intensity-via-2d-blocktiling\">Kernel 5: Increasing Arithmetic Intensity via 2D Blocktiling<\/h2>\n<p>The basic idea for kernel 5 will be to compute a grid of 8*8 
elements of C per thread.<br \/>\nThe first stage of the kernel is for all threads to work together to populate the SMEM cache.<br \/>\nWe\u2019ll have each thread load multiple elements.<br \/>\nThis code looks like so:<label for=\"40\"><\/label><span>Here\u2019s a graphical representation of the GMEM loading:<img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/kernel_5_GMEM_loading.png\" alt><\/span><\/p>\n<div>\n<pre><code><span>for<\/span> <span>(<\/span><span>uint<\/span> <span>loadOffset<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>loadOffset<\/span> <span><<\/span> <span>BM<\/span><span>;<\/span> <span>loadOffset<\/span> <span>+=<\/span> <span>strideA<\/span><span>)<\/span> <span>{<\/span>\n  <span>As<\/span><span>[(<\/span><span>innerRowA<\/span> <span>+<\/span> <span>loadOffset<\/span><span>)<\/span> <span>*<\/span> <span>BK<\/span> <span>+<\/span> <span>innerColA<\/span><span>]<\/span> <span>=<\/span>\n      <span>A<\/span><span>[(<\/span><span>innerRowA<\/span> <span>+<\/span> <span>loadOffset<\/span><span>)<\/span> <span>*<\/span> <span>K<\/span> <span>+<\/span> <span>innerColA<\/span><span>];<\/span>\n<span>}<\/span>\n<span>for<\/span> <span>(<\/span><span>uint<\/span> <span>loadOffset<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>loadOffset<\/span> <span><<\/span> <span>BK<\/span><span>;<\/span> <span>loadOffset<\/span> <span>+=<\/span> <span>strideB<\/span><span>)<\/span> <span>{<\/span>\n  <span>Bs<\/span><span>[(<\/span><span>innerRowB<\/span> <span>+<\/span> <span>loadOffset<\/span><span>)<\/span> <span>*<\/span> <span>BN<\/span> <span>+<\/span> <span>innerColB<\/span><span>]<\/span> <span>=<\/span>\n      <span>B<\/span><span>[(<\/span><span>innerRowB<\/span> <span>+<\/span> <span>loadOffset<\/span><span>)<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> 
<span>innerColB<\/span><span>];<\/span>\n<span>}<\/span>\n<span>__syncthreads<\/span><span>();<\/span>\n<\/code><\/pre>\n<\/div>\n<p>Now that the SMEM cache is populated, we have each thread multiply its relevant SMEM entries and accumulate the result into local registers.<br \/>\nBelow I illustrated the (unchanged) outer loop along the input matrices, and the three inner loops for the dot product and the <code>TN<\/code> and <code>TM<\/code> dimension:<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/kernel_5_2D_blocktiling.png\" alt><\/p>\n<p>The interesting parts of the code look like this:<label for=\"41\"><\/label><span><a href=\"https:\/\/godbolt.org\/#z:OYLghAFBqd5TKALEBjA9gEwKYFFMCWALugE4A0BIEAZgQDbYB2AhgLbYgDkAjF%2BTXRMiAZVQtGIHgBYBQogFUAztgAKAD24AGfgCsp5eiyahUAV0wtyKxqiIEh1ZpgDC6embZMQAZnLOAGQImbAA5TwAjbFIQAE4ZAAd0JWIHJjcPL19yJJT7ISCQ8LYomPjrbFt8phEiFlIiDM9vPxtsO3Ta%2BqJCsMjouNklOoam7NaRnuC%2BkoH4gEprdDNSVE4uAFIAJh9g1A8cAGoNnxcJYDJiJDYT3A2tAEFt3aZ9i2xj0\/ElFQbb%2B6eOz2Bw%2BJxcqGGhHQ\/0ez2B70%2B4Mh9AIERhgJebyOYPMESMSgA%2BgA3LYAOiQ6LhrxBiPMlgJpDMwgIHHJlNhOxwdBChxcuAAkgECQARfkANQgAFlyIdQvNDlBJfLtgAhBVygC0PHlAHp1fMAQCCQTgPR0BEJMbDkT0ARMIclMBsGw2FthQB1eoJAAqDGCwAgwSIh2lhyDspl4YA0jKaGaWMGJAkkFYAYd0xnM1nsznsxgmMNDnH0AnDgAqB4y\/OF4ulssq2Px4NROrkNO5jud7O14NllzKgDsKvb1eD4ZVks%2BwsOPC2AA4TsPHunR2HhIcVaEpzP54uR0JC%2BOo9uFz4lw8Vwex%2BvvZOTtPT%2BfLwXr8HvVv74dH\/uX4czOHUAAJXQAB3bc8XQVAAGt%2BUwdRSQATz3ZdDlXf911QDJwLNaDYPg9RkKeFC0PDEg6noQDsCUMx6CIJQVRwqD7EYbcJ3LDdQkI9MdT1C8iCQUhsBYe0CCUQ5BKUJIC1RFjBFIVCJHMIx7BMQ5bzLd9DkqZ1mDotdDn4j4IOg5jsB\/Qt0ODJhPG9AShMwejGNM7cyIkSjqNoxzIKYhgPj1CBb3Y98DTPQ0UJ4w53JoujVGiBjvMOPUos82LSFswThKnT90vspRUvi6D2xYH5oiICBrLYHLhK8kzfKynxp2MqDhRZUl1BC892wizcdU0%2BoPkMw4KtKQ50BoAy7Oqgz0AdBJjEOFhUPcZpzNfCaMtcdwXMmzA8La44tgAVgVTdErUuUuKW39SJ24CwM\/fiNr29QzogU69WCwiur1CQcITD5JJYNYizICaPnMUhBPXJrnOCB0OBuFDjSUFNBMwK0e0OB4lA2Q6VTYssNyjXHhUu5HUewdGCSLJsNxxvGVWPQnNxJr7wr1SV0CJIynLqkhDiiYBgiYf1RvGh4wC4MTSFAhamHtFVJbEjBM
iYdsL1VT8gNl5nJ0J4nQpQtVNYa1CsOZzjDYvXkDuHU3tbA3X2I\/LY1UwraLbZ62IvEeglITUXBuCQg1jE\/jS340T1vsw4QIYehDnjEThBmkRJVwSVVr\/cNheiO6NdNx77Oes7Gcukj11z0gMgL6ci%2BEkvtmOsure4vVff9ogqLBob%2BnksbxNAsSB%2BxibS36gXsFFpPKf0xaUhMFjhmwBJvoFhCFodf0WKarPLIdIhSDtbBa97yqdpqnyWL1Funyuiyc6YEJSDu43C52kub8tu%2BK%2BvZ%2BMjfnXD%2BcF9pNw4pdCKckBboH4ljMSxgFZiRAh8ECxhgxCVQEgROJZ7T80BkwHuNAaIJxVs0DUcdMD8VjGQNeLYu7yQAOLp0nBgCQVFUCiwgMVUaCRcizzwXNJ%2BosiH0ATjLEC5C7SwIQWGehAcTBryBjLH4S1VZKANMRK82d1zDCPjgQBZ8qoOQKlfPy4CrbtjbgtURkF\/rR2EhqP6JCgZIA%2BFAiS0UxJw0EkLZepA6CMHbJjeumBkp0VxiqQKhN3wk23BsIcWhSRaHiaTVuiU9Q%2BNEvQhSmDu5QNHjIlUOMUKYx8ZnPGt5YmfniSqRJySBypLvqU7AwBOIVLadOapCSkkpK9lY5Y9CNRsGSMGM06AEijW5vJJqBlfLFOtlAiA%2B8IgwTgtuZJZ4BarJemCQ4Bs1QrLwrbT8jNBx3ysUkBINE7GDTThnHJrj5mZkWfvJOAB5GgNAVDBk\/BstU7zPnfMRBuTOmyAVfOwD812n5dHHweGcyxGZsYRMDE\/POoENau2wcJD5EKiDymZseVUa5\/7uCeIdTpDVEXZnJSqVFz98622xZgXF3yCV7KZVXGurM0npl6ShdMLzwzgqBb8xczLWWQuBbfCVgKpUm2nLC\/RCKBUZiKSiqur8mUishey06xKuXuGHBSrKqrszGrpZq0CxssU6vxc7TlaLq5Gp5ec44DTEXIwQq8EJSgIAdTCtbKxwkiTGGBjDXyiLMV22nDKzMEVhncyJqo5oYcZpH2AEgIgiKbUxuTYTb%2B6TDiJo%2BIzQeIExKYFAmrWEqqfaKWuV3Q4CRogahCeJKinjEVCvXFWogRyxWbL7QO04RNxWa1dsOuCKqg0ZgijMoM6aWlZOiE8rMPaxzrPFQQYFlTNkTpVAQGduYykRIIFU02yK8YQBCXdIKd4sVHvYmW4lU6CIUsupmfls7BWgyWTnLdmyd27Jifu6Frsn01OpRmHxbTD0XtjfTFUb7n0uzVCE8251OWuug9%2B7sf794STwneU2fyO1KGI7u0FubVREbgkqd1brnkEfDHR9QH5SPirYxxlw51x3gZVNx49HZb2ds8hEtjessPEu41U6FVKzUdlPXjSTsTCawYk1RPCHTP1Zjw3pj1qr9Ppi9T6i%2B\/rLrfrXiBI%2BTaBk9w8eJkpLH1yScA2qNzIHqPHNdpJ4Tv75L\/tc1puCHHpxke47uwtB6hOMegy4FFt6dZqQfR5kL6gGMFqZRhj20msWyZNfeaDmYkwpnYqJjy4SVPpak5pGT6WOnHMU5mFsi1CYJevUlx2KWmV%2BYdcSnLCdolofI9pnDRnDPWysw0rgix6DcEOvwbwXAdDkHQNwFwChhQXgSwJh0yxVigh2HwcgRBtCzcWK44SAx\/XkCgiAQ6WhDDcGkPwNgIBpDSFJHOH7PgABsAOtgDmkLEAcWg\/uyGW6t9bXB%2BBKBAE9s7K3ZvkDgLAFAGA2AJF8hQKgEBMfY8YDEJgRJUCoB4DwAc5AcBEgIGsMUBBsAgTeS2qH\/AAn0PhxACI53yARGCPUBC3ATuY44MIN5TB6BC%2BR9T50xhTTrFW4QQSHRubw5l9gdQ7QzBd2F\/wIMlRecogiKQQXbgcC88PiyPXiw4wsGAEoBnTOWfMD13IYQYg2FSFkIIYQyg1CaBl\/oLYhh5doAsFYY38P
ICLHGdUdXGo3k%2BEOBqYUoRhS4BVAoBhKeNROmfv9DUGAcD3nqJg%2B8BOdfYAJHOP75Bi9mSpXjSvXca916UGwNvsSNQagSEQdQxUNRx7SDjBqGoiQp8hBXg9lO4faQ6I4CAzgxjeB4P4eWvRiilAMLkVIQgV87%2BSHvpgm%2B%2B4GDaAvmokwD9r4v9ULoDRT%2BzBiLf6\/y0siv%2B6E\/7f2olgrDWAYKguLn3vyEwIIDbs9lwItuQGzmttwLKGKC4LxhTqSAOEkgqPgMQKDM8NqPwEjjoPMJdvZDdnNi9m9r4DwKSLOEDnOIdHOJ9gOPQQOH4LATDnDgjqdudosGjsgCAAMlckQJQNQATjjqEOwOsKEIgcgZQWgatpTFgXogYL7qIL7JwDIO7ooCoBoLzsHuQDZiwAkBAfNlAUtrzjDm8jrgIWLAgUgTODIegRAG4FjjjgdD4FsPMHgVwaQVwK9uQO9jsKSLED9kwYdEDgODwHOFsFoFTqwdwOwYjl4Xdg9k9sYT4KYTLmwZwcjtwYgBACgDTnTtgEIfjugM4UTmIRwNwJIbYSgbIfwPISQIoWvsoZ7pIOocof7toUHiACHnfmkE4PLDfuvpgN\/gMGvrvtUEMRMWkKMS\/hUFUJ0G\/qrOfvPvfpMLMefksc0JsV\/tMFvmMYsIfNgFPJgAjt4dAbEVwMKNgLTsDE7sgvJNUdIagQ4ZgY0a4WvryKUYTtEK4e4Z4dkUQddjELdvdo9pAb4ZcfEVkQQd4VsOQTwFoE9lCTCRduQFMikI4NIEAA\">Godbolt link<\/a><\/span><\/p>\n<div>\n<pre><code><span>\/\/ allocate thread-local cache for results in registerfile<\/span>\n<span>float<\/span> <span>threadResults<\/span><span>[<\/span><span>TM<\/span> <span>*<\/span> <span>TN<\/span><span>]<\/span> <span>=<\/span> <span>{<\/span><span>0.0<\/span><span>};<\/span>\n<span>\/\/ register caches for As and Bs<\/span>\n<span>float<\/span> <span>regM<\/span><span>[<\/span><span>TM<\/span><span>]<\/span> <span>=<\/span> <span>{<\/span><span>0.0<\/span><span>};<\/span>\n<span>float<\/span> <span>regN<\/span><span>[<\/span><span>TN<\/span><span>]<\/span> <span>=<\/span> <span>{<\/span><span>0.0<\/span><span>};<\/span>\n\n<span>\/\/ outer-most loop over block tiles<\/span>\n<span>for<\/span> <span>(<\/span><span>uint<\/span> <span>bkIdx<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>bkIdx<\/span> <span><<\/span> <span>K<\/span><span>;<\/span> <span>bkIdx<\/span> <span>+=<\/span> <span>BK<\/span><span>)<\/span> <span>{<\/span>\n  <span>\/\/ populate the SMEM caches<\/span>\n  <span>for<\/span> <span>(<\/span><span>uint<\/span> <span>loadOffset<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> 
<span>loadOffset<\/span> <span><<\/span> <span>BM<\/span><span>;<\/span> <span>loadOffset<\/span> <span>+=<\/span> <span>strideA<\/span><span>)<\/span> <span>{<\/span>\n    <span>As<\/span><span>[(<\/span><span>innerRowA<\/span> <span>+<\/span> <span>loadOffset<\/span><span>)<\/span> <span>*<\/span> <span>BK<\/span> <span>+<\/span> <span>innerColA<\/span><span>]<\/span> <span>=<\/span>\n        <span>A<\/span><span>[(<\/span><span>innerRowA<\/span> <span>+<\/span> <span>loadOffset<\/span><span>)<\/span> <span>*<\/span> <span>K<\/span> <span>+<\/span> <span>innerColA<\/span><span>];<\/span>\n  <span>}<\/span>\n  <span>for<\/span> <span>(<\/span><span>uint<\/span> <span>loadOffset<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>loadOffset<\/span> <span><<\/span> <span>BK<\/span><span>;<\/span> <span>loadOffset<\/span> <span>+=<\/span> <span>strideB<\/span><span>)<\/span> <span>{<\/span>\n    <span>Bs<\/span><span>[(<\/span><span>innerRowB<\/span> <span>+<\/span> <span>loadOffset<\/span><span>)<\/span> <span>*<\/span> <span>BN<\/span> <span>+<\/span> <span>innerColB<\/span><span>]<\/span> <span>=<\/span>\n        <span>B<\/span><span>[(<\/span><span>innerRowB<\/span> <span>+<\/span> <span>loadOffset<\/span><span>)<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>innerColB<\/span><span>];<\/span>\n  <span>}<\/span>\n  <span>__syncthreads<\/span><span>();<\/span>\n\n  <span>\/\/ advance blocktile<\/span>\n  <span>A<\/span> <span>+=<\/span> <span>BK<\/span><span>;<\/span>     <span>\/\/ move BK columns to right<\/span>\n  <span>B<\/span> <span>+=<\/span> <span>BK<\/span> <span>*<\/span> <span>N<\/span><span>;<\/span> <span>\/\/ move BK rows down<\/span>\n\n  <span>\/\/ calculate per-thread results<\/span>\n  <span>for<\/span> <span>(<\/span><span>uint<\/span> <span>dotIdx<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>dotIdx<\/span> <span><<\/span> <span>BK<\/span><span>;<\/span> <span>++<\/span><span>dotIdx<\/span><span>)<\/span> 
<span>{<\/span>\n    <span>\/\/ load relevant As & Bs entries into registers<\/span>\n    <span>for<\/span> <span>(<\/span><span>uint<\/span> <span>i<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>i<\/span> <span><<\/span> <span>TM<\/span><span>;<\/span> <span>++<\/span><span>i<\/span><span>)<\/span> <span>{<\/span>\n      <span>regM<\/span><span>[<\/span><span>i<\/span><span>]<\/span> <span>=<\/span> <span>As<\/span><span>[(<\/span><span>threadRow<\/span> <span>*<\/span> <span>TM<\/span> <span>+<\/span> <span>i<\/span><span>)<\/span> <span>*<\/span> <span>BK<\/span> <span>+<\/span> <span>dotIdx<\/span><span>];<\/span>\n    <span>}<\/span>\n    <span>for<\/span> <span>(<\/span><span>uint<\/span> <span>i<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>i<\/span> <span><<\/span> <span>TN<\/span><span>;<\/span> <span>++<\/span><span>i<\/span><span>)<\/span> <span>{<\/span>\n      <span>regN<\/span><span>[<\/span><span>i<\/span><span>]<\/span> <span>=<\/span> <span>Bs<\/span><span>[<\/span><span>dotIdx<\/span> <span>*<\/span> <span>BN<\/span> <span>+<\/span> <span>threadCol<\/span> <span>*<\/span> <span>TN<\/span> <span>+<\/span> <span>i<\/span><span>];<\/span>\n    <span>}<\/span>\n    <span>\/\/ perform outer product on register cache, accumulate<\/span>\n    <span>\/\/ into threadResults<\/span>\n    <span>for<\/span> <span>(<\/span><span>uint<\/span> <span>resIdxM<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>resIdxM<\/span> <span><<\/span> <span>TM<\/span><span>;<\/span> <span>++<\/span><span>resIdxM<\/span><span>)<\/span> <span>{<\/span>\n      <span>for<\/span> <span>(<\/span><span>uint<\/span> <span>resIdxN<\/span> <span>=<\/span> <span>0<\/span><span>;<\/span> <span>resIdxN<\/span> <span><<\/span> <span>TN<\/span><span>;<\/span> <span>++<\/span><span>resIdxN<\/span><span>)<\/span> <span>{<\/span>\n        <span>threadResults<\/span><span>[<\/span><span>resIdxM<\/span> <span>*<\/span> <span>TN<\/span> <span>+<\/span> 
<span>resIdxN<\/span><span>]<\/span> <span>+=<\/span>\n            <span>regM<\/span><span>[<\/span><span>resIdxM<\/span><span>]<\/span> <span>*<\/span> <span>regN<\/span><span>[<\/span><span>resIdxN<\/span><span>];<\/span>\n      <span>}<\/span>\n    <span>}<\/span>\n  <span>}<\/span>\n  <span>__syncthreads<\/span><span>();<\/span>\n<span>}<\/span>\n<\/code><\/pre>\n<\/div>\n<p>In the inner loop, we can reduce the number of SMEM accesses by making <code>dotIdx<\/code> the outer loop, and explicitly loading the values we need for the two inner loops into registers.<br \/>\nBelow is a drawing of the <code>dotIdx<\/code> loop across time, to visualize which SMEM entries get loaded into thread-local registers at each step:<label for=\"42\"><\/label><span>I had to reduce some dimensions to make it easier to draw. In the kernel: <code>BK=TM=TN=8<\/code>.<\/span><\/p>\n<p><img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/kernel_5_reg_blocking.png\" alt><\/p>\n<p>Resulting performance: 16TFLOPs, another 2x improvement.<br \/>\nLet\u2019s repeat the memory access calculation.<br \/>\nWe\u2019re now calculating <code>TM*TN = 8*8 = 64<\/code> results per thread.<\/p>\n<ul>\n<li>GMEM: K\/8 (outer loop iters) * 2 (A+B) * 1024\/256 (sizeSMEM\/numThreads) loads<\/li>\n<li>SMEM: K\/8 (outer loop iters) * 8 (dotIdx) * 2 (A+B) * 8 loads<\/li>\n<li>Memory accesses per result: K\/64 GMEM, K\/4 SMEM<\/li>\n<\/ul>\n<p>Slowly performance is reaching acceptable levels; however, warp stalls due to memory pipeline congestion are still too frequent.<br \/>\nFor kernel 6 we\u2019ll take two measures to try to improve that: Transposing <code>As<\/code> to enable auto-vectorization of SMEM loads, and promising the compiler alignment on the GMEM accesses.<\/p>\n<h2 id=\"kernel-6-vectorize-smem-and-gmem-accesses\">Kernel 6: Vectorize SMEM and GMEM Accesses<\/h2>\n<p>The first optimization that I already hinted at earlier is to transpose <code>As<\/code>.<br \/>\nThis 
will allow us to load from <code>As<\/code> using vectorized SMEM loads (<code>LDS.128<\/code> in SASS).<br \/>\nBelow the same visualization of the three inner loops as for kernel 5, but now with <code>As<\/code> transposed in memory:<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/siboehm.com\/assets\/img\/CUDA-MMM\/kernel_6_As_transpose.png\" alt><\/p>\n<p>Looking at the assembly<label for=\"43\"><\/label><span><a href=\"https:\/\/godbolt.org\/#z:OYLghAFBqd5TKALEBjA9gEwKYFFMCWALugE4A0BIEAZgQDbYB2AhgLbYgDkAjF%2BTXRMiAZVQtGIHgBYBQogFUAztgAKAD24AGfgCsp5eiyahUAV0wtyKxqiIEh1ZpgDC6embZMQADnLOAGQImbAA5TwAjbFIQAE4ecgAHdCViByY3Dy9fJJS0oSCQ8LYomPjrbFt7IREiFlIiTM9vPxtsO3Ta%2BqJCsMjouISlOoam7NaRnuC%2BkoH4gEprdDNSVE4uAFIAJgBmYNQPHABqDZ2XCWAyYiQ2U9wNrQBBbb2mA4tsE7PxJRUGu4ez12%2B0On1OLlQw0I6ABTxeII%2BXwhUPoBAisKBr3ex3B5giRiUAH0AG5bAB0SAx8LeoKR5kshNIZmEBA4FKpcN2ODoISOLlwAEkAoSACICgBqEAAsuQjqF5kcoFKFdsAEKK%2BUAWh4CoA9Br5oDAYTCcB6OgIhITUdiegCJgjkpgNg2GwtiKAOr1RIAFQYwWAEGCRCOMqOwblsojAGlZTRzSwQxJEkgrICjhnM1nsznczmMExhkd4%2BhE0cAFSPWUFoslsvl1VxhMhqJ1cjpvOdrs5ush8suFUAdlVHZrIYjqqlXxFRx4Wx8pxHTwzY\/DwiOqtC09n88Xo6ERYn0e3C52S8eK4P4\/XPqnpxnp\/Pl8L15DPq396Oj\/3L6OZgjqAAEroAA7tu%2BLoKgADWAqYOoZIAJ57suRyrv%2B66oJk4HmtBsHweoyHPChaERiQdT0IB2BKGY9BEEoqo4VB9iMNuk4VhuoSERmur6heRBIKQ2AsA6BBKEcglKMkhZoixgikKhEjmEY9gmEct7lu%2BRyVC6zB0WuRz8Z8EHQcx2A\/kW6Ehkwng%2BgJQmYPRjGmduZESJR1G0Y5kFMQwnz6hAt7se%2BhpnkaKE8Uc7k0XRqjRAx3lHPqUWebFpC2YJwnTp%2B6X2UoqXxdBHYsL80REBA1lsDlwleSZvlZTsM7GVBIqsmS6gheeHYRZuuqafUnyGUcFWlEc6A0AZdnVQZ6COokxhHCwqHuM05mvhNGWuO4LmTZgeFtScWwAKyKpuiVqfKXFLb%2BpE7cBYGfvxG17eoZ0QKd%2BrBYRXX6hIOGJp8kksGsxZkBNnzmKQgnrk1znBI6HC3ChJpKKmgmYNavZHI8SgbIdqpseWG7RrjIqXcjqPYOjhLFs2G443jqrHoTm4k194X6lK6DEkZTl1SQRxRMAwRMAGo3jY8YBcGJpCgQtTAOqqktiRgWRMB2F5qp%2BQGy8zU6E8ToUoeqmsNahWHM5xhsXnyB0jqb2tgbr7Efls6qYVtFts9bEXiPQSmJqLg3BIQaxifxZb8aJ632UcIEMPQRwJiJwgzSIUq4FKq1\/hGwvRHdGum499nPWdjOXSR6656QmQFzORfCSX2zHWXVvcfqvv%2B0QVFg0N\/TyWN4mgWJA\/YxNZb9QL2Ci0nlP6YtqQmCxwzYIk30CwhC2OgGLFNVnlmOkQpD2tgte95VO0
1T5LH6i3T5XRZOdMCEpB3cbhc7SXN%2BW3fFfXs\/mRvzrh\/OC%2B0m4cUuhFOSAt0D8SxmJYwCsxIgU%2BCBYwIYhKoCQInUsDp%2BaAyYD3GgNEE4q2aJqOOmB%2BJxjIGvVsXd5IAHF05TgwBIKiqBRYQGKqNRIyQVC4JmvgkWqkiH0ATjLEC5D7SwIQeGehAcTBryBjLX4S1VZKENMRK82d1zDCPjgQBZ8qoOQKlfPy4CrYdjbgtMRkF\/rR2EpqP6JCgZIE%2BFAiS0UxJw0EkLZepA6CMA7JjeumBkp0VxqqQKhN3wk23BsYcWgyRaASaTVuiV9S%2BNEvQhSmDu5QNHrI1UOMUKY18ZnPGt44mfgSaqJJKTBxpLvmU7AwBOKVPaTOGpiTkmpK9tY5Y9DNRsBSCGc06BEijW5vJJqBlfIlOtlAiA%2B8IgwTgtuFJZ4BZrJeuCI4Bt1SrLwrbT8jMhx32sckRINF7GDTThnXJbiFlZiWfvJOAB5GgNAVAhk\/Js9UHyvk\/KRBuTOWzAXfOwL812n49HH0eOcqxmZsaRKrjXdibE1SKirvnW22DhKfMhUQQ0h0ukNSRTmZ4eMgxPzzqBDWrt8WYEJT8hU%2Bs8VovcFSppSK%2BkoQzK8iMELgV\/MXEyllUKQW33FUCyVJsZxwoMYi\/lmZimRJpc\/V%2BeLhVQrZRxDltLq7uBHKSrKKqcwmtVBqulIFjaMp1cS52Br\/7GtZukjMfLrYZmRghN4oSlAQA6mFL1GSFqYGJMYYGMNfJIoZXbGc0qswRRGdzImajmhhxmkfYASAiBIrtfGtNhNv6hpTZ8Rmg8QJiUwKBNWcIVU%2B0UjcruRxEjRE1KE8SVEvFIsFeuGtRBjmiq2QOodZwiZis1q7UdcFlUhusbM4MWbWnZOiM87MfbxwbLFQQEFVStlTtVAQOdeZymouqabFFeMZ0vSdli0Jd0gp3kZQQN1FzMyep7KDZZOdt1bN3Xs2JB6YWu2PScYcFLMy%2BPaUei9Cb6aqhvRil26pQnm3Ohyt9kHP0bu\/fvCSeE7ym3%2BV2pQhG91goLWqAjcFlTgffS8vDEYaPqA\/MRsVLG2MuHOpOkDqpOMns7A%2B7tnlIksb1hhrFnHqkwvJeazsZ68bibiYTaDYmqJ4U6ZdHMOGsy6fA6TFVPq\/UX0DZdT1a8QJHxbYMnunjROlKY%2BucTf71QucA5Rk5rtxOCYFU5kMnHXOkc03ukth6BP0cgy4dVD6dZqWfW5jTtG9UoYcZtBOMTUvSdNfeSDWZkypnYsJjyESlNJfUBJzSUnyudJOfJrMrZFqE2i9S2Ljt4t4p806%2B9O10NVcZdlnlKqcMWcaVwRY9BuCHX4N4LgOhyDoG4C4BQIoLzRb446ZYqwwS7D4OQIg2hxuLDccJAYgbyBQRAIdLQhhuDSH4GwEA0hpBkh8G9nYAA2L7WxBzSFiIOLQH3ZCzfm4trg\/AlAgBuwdub43yBwFgCgDAbBEi%2BQoFQCAyPUeMBiEwYkqBUA8B4IOcgOBiQEDWOKAg2AQLvLbSD\/ggT6GQ4gBEQ75AIjBHqAhbge3kccGEO8pg9Aeew9Jy6YwZp1jzcIIJDo3NIdi%2BwOodoZgu68\/4MGSo7PUQRFINztwOB2eH1ZBrxY8YWDACUFTmndPmAa7kMIMQbCpCyEEMIZQahNBi\/0FsQwku0AWCsLryHkBFgTOqIWbgmp3k7COJqEUoQRS4FVAoBh8fNTOmfv9TUGAcD3nqJg%2B8WO1fYEJD4D75A89mXJXjEvXdy%2BV6UGwRvcTNSakSEQdQxVNQR\/SDjBqmpiTx6hMXw9xOIfaQ6I4CAzgxjeASIEaYxRSgGD4fkDIy1sgJHX5H3oK%2BBhDCn5Hroowt8L4qFUTokx999wMMMbo8\/783%2BX3fnUSwVhrAMKgwXneBRMEEDN1uy4Gm3IAZwW24DlH
FBcG4yJzJEHGSUVHwGIFBheB1H4Bhx0HmGO3sjOwmzuwexAB2B4DJDnB%2Bx8EOh8Ge0HCoMHB2DAPZzBwhyh320O0WAR2QBAEGWuSIEoGoCxzR1CHYHWFCGgNgJIIQPm0phQP0QMHd1EF9k4BkEd0UBUA0HZ193ICsxYESCAMmxAJm0YO4HeTVx4LFigJgNnAkMQIgDcBRzRwOh2C2HmAwLYPwK4Hu3IEe12DJFiDe1oMOh%2B0HB4B8C2C0BJ3AKYOsBYMwKOwuyuxu30J2EMLFyiNiPYMQAgBQDJwp2wD4Mx3QHsJxyEI4G4FEMsLgMkP4GkJIFkISHkOd0kGUPkM93UJ9xAD9zaGn28Fn3lif0X3llv1mBiB3zyEj36NyFSD31f2GPv2P2v0f3PzmKvxqBfyKDf2sEmAmIfwaCGNX3f0PmwCnkwCh3cNAMiO4BFGwHJ2Bht2QXknKPEPgJsOQNqMcISD5EKOx2iEcOcNcNh2wPIBOxwBiHO0u2u2AM8IuPB2iOhzcOAK2EIJ4C0Bu2hP%2BKwMWGmVSEcGkCAA\">Godbolt link<\/a><\/span> we see that loading <code>As<\/code> into the registers, which used to be a 32b <code>LDS<\/code> load, is now also a 128b <code>LDS.128<\/code> load, just like it had already been for <code>Bs<\/code>.<br \/>\nThis gives us a 500GFLOPs speedup, or ~3%.<\/p>\n<p>Next, we\u2019ll vectorize all loads and stores from\/to GMEM using <a href=\"https:\/\/developer.nvidia.com\/blog\/cuda-pro-tip-increase-performance-with-vectorized-memory-access\/\">vector datatypes<\/a>, namely <code>float4<\/code>.<\/p>\n<p>The code looks like this:<label for=\"44\"><\/label><span><a 
href=\"https:\/\/godbolt.org\/#z:OYLghAFBqd5TKALEBjA9gEwKYFFMCWALugE4A0BIEAZgQDbYB2AhgLbYgDkAjF%2BTXRMiAZVQtGIHgBYBQogFUAztgAKAD24AGfgCsp5eiyahUAV0wtyKxqiIEh1ZpgDC6embZMQAJlnOAGQImbAA5TwAjbFIpHgAOcgAHdCViByY3Dy9fWWTU%2ByEgkPC2KJieeOtsWwKmESIWUiJMz28\/Kpr0%2BsaiIrDI6NiEpQamluz2kZ6%2BkrKhgEprdDNSVE4uAFIfAGZg1A8cAGoN7ZcJYDJiJDYT3A2tAEEt3aZ9i2xj0\/ElFSbb%2B6eOz2Bw%2BJxcqBGhHQ\/0ez2B70%2B4Mh9AIERhgJebyOYPMESMSgA%2BgA3HwAOiQ6LhrxBiPMlgJpDMwgIHHJlNhOxwdBChxcuAAkgECQARfkANQgAFlyIdQvNDlBJfKtgAhBVygC0PHlAHp1fMAQCCQTgPR0BEJMbDkT0ARMIclMBsGw2GLsHZLgAvbAQYJEQ7Sw5%2B2Uy4MAaRlNDNLH9EkSSCshyj6BjhwAVA9yADDjnc3n8wXC7nk6m0yrI9H\/VEGhWU\/60y5lQB2FXZw4YJgjIPCQ4qyWfYWHHg%2BOInVuPHMdrvBlWhAdDkdjttT\/0zsPz0fbccPSdCac9gAq\/ZOg83293ndXh7nJ8OZ7by73\/rMwdQACV0AB3ed49CoADW\/KYOopIAJ5LhO7ZPocL49qgmQ\/maAFASB6gQU8kErjBwYkA09BvtgShmPQRBKCqSH\/vYjDzn26a9qE6E5jqeo7kQSCkNgLD2gQSiHBxSjJJ2qLUYIpDthI5hGPYJiHEeaYHnO1TOswpHdocbEfL%2BAFUdgj6XthPZMJ4B7sZxmBkRROnzrhEgEURJEWX%2BlEMB8eoQEedEKQaW6GpBzGHHZxGkao0TkU5hx6oFDkhaQJkcVxA63nFZlKDFYUAW2LA\/NERAQEZbDJVxjnaS5iXbIOWn\/sKLKkuo3nbm2\/mzjqCmHI0HwaYc%2BVlIc6A0OpplFep6AOokxhtVBWRMHp%2B7%2Bmx8WuO41mDZgKG1ccPgAKwKrOEWyXKjFQfpOErR%2B363vNZlreoe0QLtepeehjV6hISExh8AksGsSZkANHzmKQHE9pVVnBA6HA3JBxpKAmHGYFaJb%2Bg8SgbJtKq0WmvZhqjwqHdDsPYPDBJJpWvYo2jKrrpjs4409fl6pK6BEpplmlSQhxRMAwRMMEwC9f1DxgFwvGkF%2BbVMPaKpC7xGBTW2O6qre75i9T\/aY9jPmQWqivle2CHUwxms7ryG2trryvfqrdE3j4arwUtBt08b\/niPQkkxrzf3doQay8WxqZsTxA0LYcn4MPQhzRtxwgjSIkq4JKz2h9gQv0BHUcLnEETEHt2w%2BNn\/q3tIhxKRwwi8Yk0TB2ZbX%2BpxqBIA6RDYIkM3PsG3PRGdCu65dXHXbdlN7dI9VtwZV4hKQmQ94OfercB61bNtd3rnqI%2BHS7EnEe9ftIB83VV31fFfrxR\/IwNqbtRz2Ce1HhNqSwDq89RIwt0nESgRNqQmNRlVj7B\/pRafnqKQO02AZ4KnyoVcy6VnLUUxiPPalNDpYQAd2SeZ1ta9xWgPNy91DjryNheLsaDO5T3cFg2eOCF43SXjtOca9R70x%2BmJCI6A2KHHPsYSWvFPwfE\/MYOuX1G4Z3Zp9JgXsaDEQjrLVoGow6YDYpGMg79sBEGbmJAA4vHfsGAJCEVQJ7CAWVeqJDyPfMRY0mA8xklItOx9PzyLtBw7hQYNEexMEnL6osfiTVaEoA0mFoJoKASAsBlCurGRWsVOBrl6HDyYRhZ2L005\/netXLiGo3oyOER8USfFCJBV4mDDiXNX6kDoIwNsiMMmYCiqRVGKoPKYwUjjecGwWxaFJFoDpuMiEkzrAU4Ai
c0ZHjabeDpKouk9KbH088AzUylIYqM5Zg4JmdO6b0p2TE9TLA0RqNgKR\/RmnQIkXqzNWEUXUi5FGkF8kQDQREQCwF5w9K3BzZ5tDTiHA1mqJ5KFTa3kps2eZOzDjJESNvZuXs44J3Eg3QibYwVEFIMYASKQPg7k\/EgUqUdPbECRQsogxciBsDOSeQleYOJ%2BmiIkDiRACTfCIGCRGxc0y3AgFsAAbE8NGZDu50XXKqdB0Rp50WkDjA0aMembTmYS5GjTfTWNFe4HcCDTaHC0PKK2wr%2BVfl5Ws3uZLaqHRzAqtGSrJ5ivVcK7UdFaK6uVaQbu4yjWJDAqazh5MVSWpVfQNVBCNU%2BG1b2Y8tsRXOv1a62exrPSevNT6sh1rA3Cu2CGh14a9WfgNdZY1n5tm5h1OyyCuZMYAHVQHST5jnEx40mSizTvfE5ZyEy8LICocF0RRJsGMGsGUURxBmA7TnTA6BCJMCFsyktOZMbADMI0QR2APgSAIMAJgZd\/RH06lHW5xtS1k0aVmtUBsNVJooeKjVMrDW42nXRXMrY%2BVOswdbU9TrMjHpTeGq9nr7XeqPfam2aoz30A\/cXW10bKWY3vYep9X4P2AYje%2Bi94HZU\/upn%2B2Dn4P27UdVa89Nrw0%2BAg7eqDOYH0qn\/ZjBDwHQNBtpv0\/dKoMMYLgwB19eGQPIfDdsYje6Z15nI5R2U7G\/W0dTfR0FM6dS%2BT49S4QtL6WMqysy04rL0wcu5UxmDLGsNsdw6J8VkrGlXsSre3McmNF0rUUpkYLLKxso0z4LlgnMPwZE%2BQzjCCjPSok4S6GoFXhzyUBARJhL\/JcSJL2lmTkdLysBbrZB7y8z%2BUOczLGfivB%2BxGqA4ASAp18e1rbIFVNZRjginqVLHwh5AN4qOz801YTTs3m7KFHxK6kA1HPAp9kGnTvuWg0dRAAW3jeWqQbw3vmJcK62W243gIgspeVjmVy\/TZewGUjRu6Cz9Y7q8srBBESyUTu8xWtsCALbMzmUpIyKPRq9Y0ubN0dXhrnmdTyYagO%2BbM1sszO2ewHZG\/tw7rSTtFbOxdvjVL1vLNu7Kmi3rHt6ZeytfW%2B1T1fch8cWZi2\/uAMIihY8utRvdYJ8D4703VT8QJxDosuOSfARvETsrVOGfA8NhT22LP1Byix5Jgsr3CkOUaVztWaPhVc9WfFm9mOizXeF\/j4CIzByYyWfLpQKFVk\/tzD9zHOu8x69zP5wL0SQuHR10nT8laPh7K9vxIp1TfoPODCLvb7yXdgiO8zhX6gPu3jtZMwldOJeu7VMHj3IPQ\/e8A0XGn%2BZ\/IZxcNad0JAxKraGTxTbOP7PqTJfOCzCnrNMrs3WBz2xcCcsu9rpzLhFWvZVkdjVIuQ0Ibnqj1q4uo\/eamRjgs\/l2vdsOJohO\/YzCJEsNCkp63FukvdV8wccYEx0QFz171LuWkIYl%2BM8N1ZH6YxnyahjuZ9%2Bf1vAv3ftT6mr%2B96L9v4aw\/hp4Fvv5ajz\/H61\/vz084z9L9OoLhpaM18xc78o86M4dhUd8l9Y13880v96B4xX9f8V81dSd19G8QDxMwDt8X9ID3V81D8wVLdiBNIvp\/xFt89SArMGUi9VNs92Uy8K8Zcq8uUa8LU69LYG8O91dFdm8NVW8HYgDI8uDucu8TMKVK8cwZ9PUDdzdZkuBFh6BuBNp%2BBvAuAdByB0BuAXAFBhQdwa8wc1QlBlhVhQQdg%2BByAiBtA5DFg94uJBgQtyB\/wQBNotBDBuBpB%2BA2AQBpBpBSQ4g\/DtguVAifAmxpAABOJsLQLlWQFQtQjQrgfgJQEAFwiw1QuQ8gOAWAFADAMlFyCgKgCAbIxIXIkAJgIkVAVACoJscgHAIkAgNYMUAgbAT8AAeUrhiP4EqU22oAiEsPIGzlYFIFAm4DMOyI3WaKYHoCGNSOqOdGMFNHWDUM
IA4jsAIGZkSOmOwHUHdDMGbmGP4BpQUOmJRAiFRUGLcBwF6JRRZD2MWCjBYGACUAaKaNaOYD2LkGEDEH0SkFkEEGEGUDUE0GmP0B8EMDmLQAsCsGOMSMgEWFOVqHWI1GaO2EOA1GFFCGFFwBVAUE0RRI1CdEnneg1AwBwBPEaAbhPEKJ2OwAJDiC5XIGJN0nKkaUpObhpLpKUDYDZLaQ1A1ESCIHUCyg1DhPSBRnKg1CJBRMhApNOxVB4CbASKUhWMcAgGcHGG8B4HIECGCH6FKEGA1LyDSCEDVIMANNqBmAGHKA6GTy6FGGaHcFaAMBsGtKEG6CaHNN1MtKmDGHtOyA1K9N6G1NmD1MWEMJWDWAMAEWEFaKGyYEEBuNcK4CUPIHaPUO4FlDFBcETwqFJCbG6QVHwGIF%2BmeG1H4BSJ0HmGsLMjsPkLcI8JAG2B4FJGHGCLiE2jiG8KbHbKbG2GTN6LiISKSPMMsMWAyOQBAD2UhSIEoGoEKNyNCHYHWFCAzKzMbNzLUMJkLNAUwAMF%2BNEFdk4BkHeMUBUA0F6OBPIEtxYESHjMOKTJTLiOaJ2MnP5nTMzKHFXLzIgDcByMYDEmeGDVLOHJrK4HcPIE8J2FJFCL8K7M2mCKbHiB8C0CqPvO4AHOSKAocKcJcMOO2GUL7NQqHNSJHMQAgBQBqLqOwGnIKPQB\/OiHnI4G4CXLfOzLXP4A3JTztB3PkE%2BMkEPN3P%2BNPKBN8CtKVO8BVIlmNI1K1OKAtJAC5XgqSBSENIyB9PVMUvyHSHdLmHkr9MVNqFdLtKmkdL0ptOmEDNkvkpcP9MkusFtK0sGEssWBRSXU4q1QTLvPwq4GFGwFqO%2BieL4TEiYpXJzM\/ILJTw2gbJlG\/KKN\/IioAsIvLMrNsJiHsMcOcITNApQviOsEHLLKsITJ8DrJ4C0BcKysAqIsWAuVSEcGkCAA%3D\">Godbolt link<\/a> for the full kernel<\/span><\/p>\n<div>\n<pre><code><span>float4<\/span> <span>tmp<\/span> <span>=<\/span>\n    <span>reinterpret_cast<\/span><span><<\/span><span>float4<\/span> <span>*><\/span><span>(<\/span><span>&<\/span><span>A<\/span><span>[<\/span><span>innerRowA<\/span> <span>*<\/span> <span>K<\/span> <span>+<\/span> <span>innerColA<\/span> <span>*<\/span> <span>4<\/span><span>])[<\/span><span>0<\/span><span>];<\/span>\n<span>\/\/ transpose A during the GMEM to SMEM transfer<\/span>\n<span>As<\/span><span>[(<\/span><span>innerColA<\/span> <span>*<\/span> <span>4<\/span> <span>+<\/span> <span>0<\/span><span>)<\/span> <span>*<\/span> <span>BM<\/span> <span>+<\/span> <span>innerRowA<\/span><span>]<\/span> <span>=<\/span> <span>tmp<\/span><span>.<\/span><span>x<\/span><span>;<\/span>\n<span>As<\/span><span>[(<\/span><span>innerColA<\/span> <span>*<\/span> <span>4<\/span> <span>+<\/span> <span>1<\/span><span>)<\/span> <span>*<\/span> <span>BM<\/span> <span>+<\/span> 
<span>innerRowA<\/span><span>]<\/span> <span>=<\/span> <span>tmp<\/span><span>.<\/span><span>y<\/span><span>;<\/span>\n<span>As<\/span><span>[(<\/span><span>innerColA<\/span> <span>*<\/span> <span>4<\/span> <span>+<\/span> <span>2<\/span><span>)<\/span> <span>*<\/span> <span>BM<\/span> <span>+<\/span> <span>innerRowA<\/span><span>]<\/span> <span>=<\/span> <span>tmp<\/span><span>.<\/span><span>z<\/span><span>;<\/span>\n<span>As<\/span><span>[(<\/span><span>innerColA<\/span> <span>*<\/span> <span>4<\/span> <span>+<\/span> <span>3<\/span><span>)<\/span> <span>*<\/span> <span>BM<\/span> <span>+<\/span> <span>innerRowA<\/span><span>]<\/span> <span>=<\/span> <span>tmp<\/span><span>.<\/span><span>w<\/span><span>;<\/span>\n\n<span>reinterpret_cast<\/span><span><<\/span><span>float4<\/span> <span>*><\/span><span>(<\/span><span>&<\/span><span>Bs<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>BN<\/span> <span>+<\/span> <span>innerColB<\/span> <span>*<\/span> <span>4<\/span><span>])[<\/span><span>0<\/span><span>]<\/span> <span>=<\/span>\n    <span>reinterpret_cast<\/span><span><<\/span><span>float4<\/span> <span>*><\/span><span>(<\/span><span>&<\/span><span>B<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>innerColB<\/span> <span>*<\/span> <span>4<\/span><span>])[<\/span><span>0<\/span><span>];<\/span>\n<span>__syncthreads<\/span><span>();<\/span>\n<\/code><\/pre>\n<\/div>\n<p>This leads to the 32b GMEM load and store instructions (<code>LDG.E<\/code> and <code>STG.E<\/code>) being replaced with 128b counterparts (<code>LDG.E.128<\/code> and <code>STG.E.128<\/code>).<br \/>\nInitially, I was confused as to why running this:<\/p>\n<div>\n<pre><code><span>reinterpret_cast<\/span><span><<\/span><span>float4<\/span> <span>*><\/span><span>(<\/span><span>&<\/span><span>Bs<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>BN<\/span> <span>+<\/span> <span>innerColB<\/span> <span>*<\/span> 
<span>4<\/span><span>])[<\/span><span>0<\/span><span>]<\/span> <span>=<\/span>\n    <span>reinterpret_cast<\/span><span><<\/span><span>float4<\/span> <span>*><\/span><span>(<\/span><span>&<\/span><span>B<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>innerColB<\/span> <span>*<\/span> <span>4<\/span><span>])[<\/span><span>0<\/span><span>];<\/span>\n<\/code><\/pre>\n<\/div>\n<p>would be any faster than just manually unrolling the access (or using <code>#pragma unroll<\/code>):<\/p>\n<div>\n<pre><code><span>Bs<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>BN<\/span> <span>+<\/span> <span>innerColB<\/span> <span>*<\/span> <span>4<\/span> <span>+<\/span> <span>0<\/span><span>]<\/span> <span>=<\/span> <span>B<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>innerColB<\/span> <span>*<\/span> <span>4<\/span> <span>+<\/span> <span>0<\/span><span>];<\/span>\n<span>Bs<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>BN<\/span> <span>+<\/span> <span>innerColB<\/span> <span>*<\/span> <span>4<\/span> <span>+<\/span> <span>1<\/span><span>]<\/span> <span>=<\/span> <span>B<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>innerColB<\/span> <span>*<\/span> <span>4<\/span> <span>+<\/span> <span>1<\/span><span>];<\/span>\n<span>Bs<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>BN<\/span> <span>+<\/span> <span>innerColB<\/span> <span>*<\/span> <span>4<\/span> <span>+<\/span> <span>2<\/span><span>]<\/span> <span>=<\/span> <span>B<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>innerColB<\/span> <span>*<\/span> <span>4<\/span> <span>+<\/span> <span>2<\/span><span>];<\/span>\n<span>Bs<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>BN<\/span> <span>+<\/span> <span>innerColB<\/span> <span>*<\/span> 
<span>4<\/span> <span>+<\/span> <span>3<\/span><span>]<\/span> <span>=<\/span> <span>B<\/span><span>[<\/span><span>innerRowB<\/span> <span>*<\/span> <span>N<\/span> <span>+<\/span> <span>innerColB<\/span> <span>*<\/span> <span>4<\/span> <span>+<\/span> <span>3<\/span><span>];<\/span>\n<\/code><\/pre>\n<\/div>\n<p>Shouldn\u2019t the compiler just be able to coalesce the 2nd version and also generate 128b loads?<br \/>\nI think the reason is that the compiler has no way to verify that the <code>float* B<\/code> pointer that is passed to the kernel is 128b aligned, which would be a requirement for using <code>LDG.E.128<\/code>.<br \/>\nSo the <code>reinterpret_cast<\/code>\u2019s only purpose is to promise the compiler that the <code>float* B<\/code> pointer will be aligned.<label for=\"45\"><\/label><span>Compare this to SMEM loads, where the compiler automatically generates vectorized loads because that memory is not user-managed.<\/span><\/p>\n<p>Kernel 6 achieves 20TFLOPs, which is as close to the 24TFLOPs of the cuBLAS implementation as we\u2019ll get for now.<\/p>\n<h2 id=\"kernel-7-tbd\">Kernel 7: TBD<\/h2>\n<p>I wrote this post as my worklog while I optimized the SGEMM kernel from scratch.<br \/>\nAs such, these are the optimizations I want to experiment with next:<\/p>\n<ul>\n<li><strong>Bank conflicts<\/strong>: Kernel 6 runs into SMEM bank conflicts while loading from <code>As<\/code> &#038; <code>Bs<\/code>. I should try to find a way to avoid this.<\/li>\n<li><strong>Increasing register usage<\/strong>: Kernel 6 has higher occupancy than necessary. Each warp spends 1.5 cycles per instruction stalling while waiting to get scheduled. 
Therefore, it should be possible to use more registers, which would lower occupancy but may make it possible to use double buffering or to increase arithmetic intensity.<\/li>\n<\/ul>\n<h2 id=\"conclusion\">Conclusion<\/h2>\n<p>Writing this post was a similar experience to my previous post on <a href=\"http:\/\/siboehm.com\/articles\/22\/Fast-MMM-on-CPU\">optimizing SGEMM on CPU<\/a>: Optimizing SGEMM iteratively is one of the best ways to deeply understand the performance characteristics of the hardware.<br \/>\nWhen writing the CUDA programs, I was surprised by how easy it was to implement the code once I had made a good visualization of how I wanted the kernel to work.<\/p>\n<p>As always, all my code is available on <a href=\"https:\/\/github.com\/siboehm\/SGEMM_CUDA\">Github<\/a>.<\/p>\n<p>Lastly, a big thanks to the creators of <a href=\"https:\/\/godbolt.org\/\">Godbolt.org<\/a> (for looking at PTX and SASS assembly) and <a href=\"https:\/\/excalidraw.com\/\">Excalidraw<\/a> (for drawing the kernels)!<br \/>\nBoth of these tools are a joy to use and have helped me learn much faster.<\/p>\n<h2 id=\"further-resources-and-references\">Further Resources and References<\/h2>\n<ul>\n<li>I started writing this post because I stumbled over <a href=\"https:\/\/github.com\/wangzyon\/NVIDIA_SGEMM_PRACTICE\">wangzyon\u2019s Github repository<\/a>, first experimenting with his kernels and then rewriting everything from scratch. Also relevant is this <a href=\"https:\/\/developer.nvidia.com\/blog\/cutlass-linear-algebra-cuda\/\">Nvidia Blogpost about the CUTLASS library<\/a>.<\/li>\n<li>Mandatory references: the official <a href=\"https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html\">CUDA Toolkit Programming Guide<\/a> and the <a href=\"https:\/\/docs.nvidia.com\/cuda\/cuda-c-best-practices-guide\">CUDA Best Practices Guide<\/a>. 
The <a href=\"https:\/\/docs.nvidia.com\/nsight-compute\/ProfilingGuide\/index.html\">Kernel Profiling Guide<\/a> contains even more info on low-level hardware details like caches and pipelines, and on the various metrics that can be collected.<\/li>\n<li>Onur Mutlu is a professor at ETH who uploads his lectures to Youtube. Particularly relevant for this post are <a href=\"https:\/\/www.youtube.com\/playlist?list=PL5Q2soXY2Zi-Mnk1PxjEIG32HAGILkTOF\">Computer Architecture<\/a> and <a href=\"https:\/\/www.youtube.com\/playlist?list=PL5Q2soXY2Zi_OwkTgEyA6tk3UsoPBH737\">Acceleration on Heterogeneuous Systems<\/a>.<\/li>\n<li><a href=\"https:\/\/www2.eecs.berkeley.edu\/Pubs\/TechRpts\/2016\/EECS-2016-143.pdf\">Understanding Latency Hiding on GPUs<\/a>, a Ph.D. thesis that goes in-depth on how to design workloads such that they fully utilize memory bandwidth and computation.<\/li>\n<li>Lei Mao (an engineer at Nvidia) has good CUDA content on his <a href=\"https:\/\/leimao.github.io\/tags\/CUDA\/\">blog<\/a>, including about <a href=\"https:\/\/leimao.github.io\/blog\/Proper-CUDA-Error-Checking\/\">proper CUDA error handling<\/a>.<\/li>\n<li>It seems like there aren\u2019t any good official resources for understanding SASS. There\u2019s <a href=\"https:\/\/docs.nvidia.com\/cuda\/cuda-binary-utilities\/index.html\">Nvidia\u2019s Docs on CUDA binary utilities<\/a>. More useful might be looking at Open Source SASS assemblers, like Da Yan\u2019s <a href=\"https:\/\/github.com\/daadaada\/turingas\">turingas<\/a>.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/siboehm.com\/articles\/22\/CUDA-MMM\" class=\"button purchase\" rel=\"nofollow noopener\" target=\"_blank\">Read More<\/a><br \/>\n Yuri Noren<\/p>\n","protected":false},"excerpt":{"rendered":"<p>December 31, 2022 In this post, I\u2019ll iteratively optimize an implementation of matrix multiplication written in CUDA. 
My goal is not to build a cuBLAS replacement, but to deeply understand the most important performance characteristics of the GPUs that are used for modern deep learning. This includes coalescing global memory accesses, shared memory caching and<\/p>\n","protected":false},"author":1,"featured_media":594177,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[117651,40953,46],"tags":[],"class_list":{"0":"post-594176","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-matmul","8":"category-optimize","9":"category-technology"},"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/posts\/594176","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/comments?post=594176"}],"version-history":[{"count":0,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/posts\/594176\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/media\/594177"}],"wp:attachment":[{"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/media?parent=594176"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/categories?post=594176"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/newsycanuse.com\/index.php\/wp-json\/wp\/v2\/tags?post=594176"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}