<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Rohan Reddy / Notes</title>
<link>https://rohan-reddy.github.io/</link>
<atom:link href="https://rohan-reddy.github.io/index.xml" rel="self" type="application/rss+xml"/>
<description>A blog built with Quarto</description>
<generator>quarto-1.8.27</generator>
<lastBuildDate>Fri, 06 Feb 2026 05:00:00 GMT</lastBuildDate>
<item>
  <title>Note 001: GEMM Optimization</title>
  <link>https://rohan-reddy.github.io/posts/001-gemm-optimization/</link>
  <description><![CDATA[ 





<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>General Matrix Multiply, or GEMM, is the linear algebra operation that accounts for the majority of the computation performed by modern deep learning models. In this note, I will explain how we can iteratively optimize a GEMM implementation in CUDA until it comes close to saturating the capability of modern NVIDIA GPUs.</p>
<section id="mathematical-definition" class="level3">
<h3 class="anchored" data-anchor-id="mathematical-definition">Mathematical definition</h3>
<p>Formally, GEMM is defined as an operation on two input matrices <img src="https://latex.codecogs.com/png.latex?A"> and <img src="https://latex.codecogs.com/png.latex?B">, and an accumulation matrix <img src="https://latex.codecogs.com/png.latex?C">, scaled by scalars <img src="https://latex.codecogs.com/png.latex?%5Calpha"> and <img src="https://latex.codecogs.com/png.latex?%5Cbeta">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AC%20=%20%5Calpha%20%5Ccdot%20(A%20%5Ctimes%20B)%20+%20%5Cbeta%20%5Ccdot%20C%0A"></p>
<p>Where:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?A"> is an <img src="https://latex.codecogs.com/png.latex?M%20%5Ctimes%20K"> matrix.</li>
<li><img src="https://latex.codecogs.com/png.latex?B"> is a <img src="https://latex.codecogs.com/png.latex?K%20%5Ctimes%20N"> matrix.</li>
<li><img src="https://latex.codecogs.com/png.latex?C"> is an <img src="https://latex.codecogs.com/png.latex?M%20%5Ctimes%20N"> matrix.</li>
</ul>
<p><img src="https://rohan-reddy.github.io/posts/001-gemm-optimization/images/01_matrix_dims.svg" class="img-fluid" style="width:80.0%"></p>
<p>In deep learning contexts, <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> is often 0 (overwriting the output) or 1 (accumulating gradients), and <img src="https://latex.codecogs.com/png.latex?%5Calpha"> is typically 1.</p>
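<p>To make the definition concrete, here is a minimal host-side sketch of the same operation in plain C++. The function name <code>gemm_reference</code> and the use of <code>std::vector&lt;float&gt;</code> are my own choices for illustration. It assumes row-major storage, which matches the indexing used by the kernels later in this note.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: C = alpha * (A x B) + beta * C.
// A is M x K, B is K x N, C is M x N, all stored row-major.
void gemm_reference(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int M, int N, int K,
                    float alpha, float beta) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    assert(C.size() == static_cast<std::size_t>(M) * N);
    for (int row = 0; row < M; ++row) {
        for (int col = 0; col < N; ++col) {
            // Dot product of one row of A with one column of B.
            float acc = 0.0f;
            for (int i = 0; i < K; ++i) {
                acc += A[row * K + i] * B[i * N + col];
            }
            // Scale the product and blend it with the existing value of C.
            C[row * N + col] = alpha * acc + beta * C[row * N + col];
        }
    }
}
```

<p>A reference like this is slow, but it is handy for checking the output of an optimized kernel on small inputs.</p>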
</section>
<section id="why-gemm" class="level3">
<h3 class="anchored" data-anchor-id="why-gemm">Why GEMM?</h3>
<p>In modern Transformer architectures, GEMM operations account for the vast majority of total Floating Point Operations (FLOPs). This is due to the structure of the Attention operation: <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bsoftmax%7D(%5Cfrac%7BQ%20%5Ctimes%20K%5ET%7D%7B%5Csqrt%7Bd%7D%7D)%20%5Ctimes%20V">. Aside from the softmax operation, everything else can be represented as GEMM:</p>
<ol type="1">
<li>Calculating the scaled attention scores (<img src="https://latex.codecogs.com/png.latex?%5Cfrac%7BQ%20%5Ctimes%20K%5ET%7D%7B%5Csqrt%7Bd%7D%7D">).</li>
<li>Calculating the weighted sum of values (<img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bscores%7D%20%5Ctimes%20V">).</li>
</ol>
<p>Since GEMM dominates the runtime, even a small percentage improvement in kernel efficiency can realize massive savings in training and inference costs at scale.</p>
</section>
<section id="problem-setup" class="level3">
<h3 class="anchored" data-anchor-id="problem-setup">Problem setup</h3>
<p>As I iterate on GEMM kernels, I will test them on the General Matrix Multiplication test suite and infrastructure on LeetGPU <span class="citation" data-cites="leetgpu">(<span>“LeetGPU: Competitive GPU Programming”</span> 2026)</span>. As per the problem setup there, I will only be using native capabilities of the GPUs, so no libraries like CuTe or cuBLAS. The test suite is hidden, but the known constraint is that each of the matrix dimensions <img src="https://latex.codecogs.com/png.latex?M">, <img src="https://latex.codecogs.com/png.latex?N">, and <img src="https://latex.codecogs.com/png.latex?K"> is between 16 and 4096. So the input matrices range from very small (a few hundred elements) to fairly large (about 16 million elements). The platform reports the runtime of the kernel on one particular large test case whose dimensions are unknown to us. The input matrices A and B are given as type <code>half</code> (half-precision floating point). Lower-precision floats are common in AI workloads because they take up less space and allow for higher throughput. For improved accuracy, the GEMM output will be computed in full-precision floats, but the final result will be stored as a half-precision float, like the inputs.</p>
<p>For each kernel, I will explain the algorithm, how it interacts with the GPU architecture and memory hierarchy, and show the full code in CUDA C++. Finally, I will discuss the arithmetic intensity of the kernel and benchmark its performance on the following NVIDIA GPUs: Turing T4 (2018), Ampere A100-80GB (2020), Hopper H100 (2022), Hopper H200 (2023), and Blackwell B200 (2024).</p>
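<p>The accuracy argument for accumulating in full precision can be checked on the host without a GPU. The sketch below is only a simulation: the hypothetical helper <code>round_to_fp16_precision</code> rounds a value to the 11 significant bits that FP16 carries (10 stored mantissa bits plus the implicit leading one) and ignores FP16's exponent range and subnormals. It is not the <code>half</code> type from <code>cuda_fp16.h</code>.</p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: rounds x to 11 significant bits, the precision of FP16.
// It models precision only; overflow and subnormals are not simulated.
float round_to_fp16_precision(float x) {
    if (x == 0.0f) return 0.0f;
    int exponent = 0;
    // x = mantissa * 2^exponent, with |mantissa| in [0.5, 1).
    float mantissa = std::frexp(x, &exponent);
    // Keep 11 bits of the mantissa and round the rest away.
    mantissa = std::nearbyint(mantissa * 2048.0f) / 2048.0f;
    return std::ldexp(mantissa, exponent);
}

// Adds `update` to `start` `count` times, rounding the sum after every add.
float accumulate_fp16_like(float start, float update, int count) {
    assert(count >= 0);
    float sum = round_to_fp16_precision(start);
    for (int i = 0; i < count; ++i) {
        sum = round_to_fp16_precision(sum + update);
    }
    return sum;
}

// Same accumulation, but the running sum keeps full FP32 precision.
float accumulate_fp32(float start, float update, int count) {
    assert(count >= 0);
    float sum = start;
    for (int i = 0; i < count; ++i) {
        sum += update;
    }
    return sum;
}
```

<p>Between 512 and 1024, values at FP16 precision are spaced 0.5 apart, so adding 0.01 to a running sum of 1000 a thousand times leaves the simulated low-precision sum at 1000, while the FP32 sum ends up close to 1010.</p>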
</section>
<section id="assumed-background" class="level3">
<h3 class="anchored" data-anchor-id="assumed-background">Assumed background</h3>
<p>I will assume the reader understands the basics of the CUDA programming model. If not, I recommend reading the first 6 chapters of Programming Massively Parallel Processors <span class="citation" data-cites="pmpp">(Kirk and Hwu 2022)</span>, an excellent resource and probably the canonical text on this topic.</p>
</section>
</section>
<section id="naive-matrix-multiplication" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="naive-matrix-multiplication">1. Naive Matrix Multiplication</h2>
<p>In a naive parallel computing model, we can have every thread be solely responsible for computing exactly one output element in the final matrix. Each thread would load the row from A and column from B that it needs for the dot product for that output element.</p>
<p>Hover over the numbered annotations for explanations of key parts.</p>
<section id="annotated-code" class="level3 page-columns page-full">
<h3 class="anchored" data-anchor-id="annotated-code">Annotated Code</h3>
<div class="column-screen-inset">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-1" style="background: #f1f3f5;"><pre class="sourceCode cpp code-annotation-code code-with-copy code-annotated"><code class="sourceCode cpp"><span id="annotated-cell-1-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_fp16.h&gt;</span></span>
<span id="annotated-cell-1-2"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_runtime.h&gt;</span></span>
<span id="annotated-cell-1-3"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-1" data-target-annotation="1">1</button><span id="annotated-cell-1-4" class="code-annotation-target">__global__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> gemm_naive_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="annotated-cell-1-5">                                  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> </span>
<span id="annotated-cell-1-6">                                  <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span> </span>
<span id="annotated-cell-1-7">    </span>
<span id="annotated-cell-1-8">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Calculate global row and column indices for this thread</span></span>
<span id="annotated-cell-1-9">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> blockIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-1-10">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> blockIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-1-11">    </span>
<span id="annotated-cell-1-12">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Boundary check: ensure we don't access memory outside the matrix</span></span>
<span id="annotated-cell-1-13">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-1-14">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-1-15">        </span>
<span id="annotated-cell-1-16">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// The K-loop: Perform the dot product</span></span>
<span id="annotated-cell-1-17">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-1" data-target-annotation="2">2</button><span id="annotated-cell-1-18" class="code-annotation-target">            val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">])</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]);</span></span>
<span id="annotated-cell-1-19">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-1-20">        </span>
<span id="annotated-cell-1-21">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Write result back to C</span></span>
<span id="annotated-cell-1-22">        val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]);</span></span>
<span id="annotated-cell-1-23">        C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __float2half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-1-24">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-1-25"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-1-26"></span>
<span id="annotated-cell-1-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Wrapper function to be called from Host</span></span>
<span id="annotated-cell-1-28"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">extern</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> solve<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span> </span>
<span id="annotated-cell-1-29"></span>
<span id="annotated-cell-1-30">    dim3 block<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-1-31">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Grid calculation: ensures we cover the entire matrix (ceiling division)</span></span>
<span id="annotated-cell-1-32">    dim3 grid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="annotated-cell-1-33">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="annotated-cell-1-34">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span></span>
<span id="annotated-cell-1-35">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-1-36"></span>
<span id="annotated-cell-1-37">    gemm_naive_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;&lt;</span>grid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> block<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-1-38">    cudaDeviceSynchronize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-1-39"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-hidden code-annotation-container-grid">
<dt data-target-cell="annotated-cell-1" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-1" data-code-lines="4" data-code-annotation="1"><strong>half vs.&nbsp;float</strong>: We use <code>half</code> precision (FP16) for storage but perform accumulation in <code>float</code> (FP32). Storing in FP16 lets us move data faster from global memory (only 2 bytes per element rather than 4), while accumulating in FP32 means we don’t lose small updates to the smaller mantissa of FP16. (For example, imagine adding 0.01 to a running sum of 1000: FP16 stores only 10 mantissa bits, so representable values near 1000 are 0.5 apart and the update is rounded away entirely.)</span>
</dd>
<dt data-target-cell="annotated-cell-1" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-1" data-code-lines="18" data-code-annotation="2"><strong>The Bottleneck</strong>: This line is the performance killer. For every single element of C, we are fetching an entire row of A and an entire column of B from Global Memory (DRAM).</span>
</dd>
</dl>
</div>
</section>
<section id="arithmetic-intensity" class="level3">
<h3 class="anchored" data-anchor-id="arithmetic-intensity">Arithmetic Intensity</h3>
<p>For each output element of C, we load K elements of A and K elements of B in order to compute a dot product. For each pair of elements in the dot product, we multiply them together and then add the result to the running sum. Therefore, for every 2 halves we load from global memory (a total of 4 bytes), we perform 2 floating point operations. So our arithmetic intensity is 2 FLOPs divided by 4 bytes, or 0.5 FLOP/B.</p>
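<p>The same count can be written out for a whole matrix. This is a small sketch with a hypothetical helper name. Like the estimate above, it counts only the loads of A and B inside the K-loop and ignores the single read and write of each element of C.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: FLOP per byte of global-memory loads for the naive
// kernel, counting only the loads of A and B inside the K-loop.
double naive_arithmetic_intensity(std::int64_t M, std::int64_t N, std::int64_t K) {
    assert(M > 0 && N > 0 && K > 0);
    const double bytes_per_half = 2.0;
    // Each of the M * N threads does one multiply and one add per step of K.
    const double flops = 2.0 * M * N * K;
    // Each of the M * N threads loads one half from A and one from B per step.
    const double bytes = 2.0 * M * N * K * bytes_per_half;
    return flops / bytes;
}
```

<p>The matrix dimensions cancel, so the naive kernel sits at 0.5 FLOP/B regardless of problem size.</p>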
</section>
<section id="benchmarks" class="level3">
<h3 class="anchored" data-anchor-id="benchmarks">Benchmarks</h3>
<p>Below, we can see the runtime of our kernel on the same test suite for each GPU. We can also compare the arithmetic intensity of the kernel to the ridge point of each GPU (the arithmetic intensity at which kernels switch from memory-bound to compute-bound). This kernel is highly memory-bound on every GPU. Our first course of action to improve the performance of our kernel should be to rethink our memory access pattern.</p>
<table class="table">
<caption>A kernel whose arithmetic intensity is below the ridge point is memory-bound; above the ridge point, it is compute-bound.</caption>
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">GPU Model</th>
<th style="text-align: left;">Memory Bandwidth</th>
<th style="text-align: left;">Peak FP16 Compute (Tensor Core)</th>
<th style="text-align: left;">Ridge Point (FLOP/Byte)</th>
<th style="text-align: left;">Runtime (ms)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA T4</strong></td>
<td style="text-align: left;">320 GB/s</td>
<td style="text-align: left;">65 TFLOPS</td>
<td style="text-align: left;">203</td>
<td style="text-align: left;">8.49</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>NVIDIA A100 (80GB)</strong></td>
<td style="text-align: left;">2,039 GB/s</td>
<td style="text-align: left;">312 TFLOPS</td>
<td style="text-align: left;">153</td>
<td style="text-align: left;">1.03</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA H100 (SXM)</strong></td>
<td style="text-align: left;">3,350 GB/s</td>
<td style="text-align: left;">989 TFLOPS</td>
<td style="text-align: left;">295</td>
<td style="text-align: left;">0.54</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>NVIDIA H200 (SXM)</strong></td>
<td style="text-align: left;">4,800 GB/s</td>
<td style="text-align: left;">989 TFLOPS</td>
<td style="text-align: left;">206</td>
<td style="text-align: left;">0.53</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA B200</strong></td>
<td style="text-align: left;">8,000 GB/s</td>
<td style="text-align: left;">2,500 TFLOPS</td>
<td style="text-align: left;">312</td>
<td style="text-align: left;">0.50</td>
</tr>
</tbody>
</table>
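<p>The ridge point column can be reproduced from the other two columns: it is peak compute divided by peak memory bandwidth. The helper names below are hypothetical, and the inputs are simply the figures listed in the table.</p>

```cpp
#include <cassert>

// Ridge point in FLOP/byte: peak compute divided by peak memory bandwidth.
double ridge_point(double peak_tflops, double bandwidth_gb_per_s) {
    assert(peak_tflops > 0.0 && bandwidth_gb_per_s > 0.0);
    return (peak_tflops * 1e12) / (bandwidth_gb_per_s * 1e9);
}

// A kernel is memory-bound when its arithmetic intensity is below the ridge point.
bool is_memory_bound(double arithmetic_intensity, double peak_tflops,
                     double bandwidth_gb_per_s) {
    return arithmetic_intensity < ridge_point(peak_tflops, bandwidth_gb_per_s);
}
```

<p>For example, <code>ridge_point(65, 320)</code> gives roughly 203 FLOP/B for the T4, and an intensity of 0.5 FLOP/B falls far below the ridge point for every row in the table.</p>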
</section>
</section>
<section id="tiled-matrix-multiplication" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="tiled-matrix-multiplication">2. Tiled Matrix Multiplication</h2>
<p>The main issue with our memory access pattern above is that we redundantly load each row of A N times and each column of B M times. Why? Recall that the output C is an M x N matrix. For <img src="https://latex.codecogs.com/png.latex?C_%7B1,1%7D">, we compute the dot product of row 1 of A with column 1 of B; then for <img src="https://latex.codecogs.com/png.latex?C_%7B2,1%7D">, we compute the dot product of row 2 of A with column 1 of B again. So we retrieve column 1 of B from global memory a total of M times, once for each element in column 1 of the output. Similarly, row 1 of A is retrieved from global memory a total of N times, since we access it once for each element in row 1 of the output.</p>
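<p>To put a number on the redundancy, the sketch below compares the elements the naive kernel loads with the number of distinct elements in A and B. The helper names are hypothetical, and the count ignores any help from hardware caches.</p>

```cpp
#include <cassert>
#include <cstdint>

// Elements loaded from global memory by the naive kernel: each of the
// M * N threads reads K elements of A and K elements of B.
std::int64_t naive_elements_loaded(std::int64_t M, std::int64_t N, std::int64_t K) {
    assert(M > 0 && N > 0 && K > 0);
    return M * N * 2 * K;
}

// Distinct elements in A (M x K) and B (K x N), which is the number of
// loads we would need if every input element were fetched exactly once.
std::int64_t unique_input_elements(std::int64_t M, std::int64_t N, std::int64_t K) {
    assert(M > 0 && N > 0 && K > 0);
    return M * K + K * N;
}
```

<p>For the largest allowed problem, M = N = K = 4096, the naive kernel requests each input element 4096 times.</p>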
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p><img src="https://rohan-reddy.github.io/posts/001-gemm-optimization/images/02_a100_memory.png" class="img-fluid figure-img" style="width:100.0%"></p>
<figcaption class="margin-caption">Memory hierarchy of an A100-40GB <span class="citation" data-cites="memoryhierarchy">(<span>“Memory Hierarchy of GPUs”</span> 2025)</span></figcaption>
</figure>
</div>
<p>When we launch our kernel, we pass it a launch configuration that defines the number of blocks in the grid and the number of threads per block, along with how each is indexed. Each block is assigned to a single Streaming Multiprocessor (SM) for its entire lifetime, and an SM may host several blocks at once. So all threads in an individual block have access to the same Shared Memory and L1 Cache on their resident SM during execution. We can take advantage of this fast on-chip memory to reduce our global memory accesses: data loaded once can be reused by every thread in the block. This is known as exploiting data locality.</p>
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p><img src="https://rohan-reddy.github.io/posts/001-gemm-optimization/images/03_tiled_mm.png" class="img-fluid figure-img" style="width:95.0%"></p>
<figcaption class="margin-caption">Visualization of tiled matrix multiplication <span class="citation" data-cites="tiledmatmul">(Matthes et al. 2017)</span></figcaption>
</figure>
</div>
<p>In tiled matrix multiplication, we choose a tile size, which also determines the number of threads in a single block. We will choose 16 x 16 as our tile size so that we have a nice total of 256 threads per block. (32 x 32 would also work, but it is the largest square tile possible, because the hardware limits a block to 1024 threads.) We then loop over a wide row in A and a wide column in B, one tile at a time, as shown above. During each loop iteration, we have a single tile in A and a single tile in B to process. Each thread is responsible for loading one element from A and one element from B into the block’s shared memory. Then, in an inner loop, we compute the product of those tiles and add it to the running sum for the output tile. By the end of the outer loop, we have loaded and processed all elements required for the final values in the 16 x 16 output tile, and so we can write them to global memory.</p>
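<p>The loop structure described above can be sketched on the host before moving to CUDA. This is not the kernel itself: the two outermost loops stand in for thread blocks, the <code>ty</code>/<code>tx</code> loops stand in for the threads of a block, and the <code>tileA</code>/<code>tileB</code> buffers stand in for shared memory. To stay short, the sketch assumes all dimensions are multiples of the tile size and computes only A x B (that is, alpha = 1 and beta = 0). The name <code>gemm_tiled_host</code> is hypothetical.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int TILE = 16;  // tile width, matching the 16 x 16 block in the text

// Host-side sketch of the tiled loop structure; computes C = A x B.
void gemm_tiled_host(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, int M, int N, int K) {
    assert(M % TILE == 0 && N % TILE == 0 && K % TILE == 0);
    std::vector<float> tileA(TILE * TILE), tileB(TILE * TILE), acc(TILE * TILE);
    for (int by = 0; by < M / TILE; ++by) {        // "block" row
        for (int bx = 0; bx < N / TILE; ++bx) {    // "block" column
            std::fill(acc.begin(), acc.end(), 0.0f);
            for (int t = 0; t < K / TILE; ++t) {   // outer loop over tiles along K
                // Load phase: each (ty, tx) "thread" copies one element of A and one of B.
                for (int ty = 0; ty < TILE; ++ty) {
                    for (int tx = 0; tx < TILE; ++tx) {
                        tileA[ty * TILE + tx] = A[(by * TILE + ty) * K + (t * TILE + tx)];
                        tileB[ty * TILE + tx] = B[(t * TILE + ty) * N + (bx * TILE + tx)];
                    }
                }
                // Compute phase: multiply the two tiles and add to the running sums.
                for (int ty = 0; ty < TILE; ++ty) {
                    for (int tx = 0; tx < TILE; ++tx) {
                        for (int i = 0; i < TILE; ++i) {
                            acc[ty * TILE + tx] += tileA[ty * TILE + i] * tileB[i * TILE + tx];
                        }
                    }
                }
            }
            // Write the finished output tile back to C.
            for (int ty = 0; ty < TILE; ++ty) {
                for (int tx = 0; tx < TILE; ++tx) {
                    C[(by * TILE + ty) * N + (bx * TILE + tx)] = acc[ty * TILE + tx];
                }
            }
        }
    }
}
```

<p>On the GPU, the load and compute phases run across many threads at once, so a barrier such as <code>__syncthreads()</code> is needed between them. The sequential host version does not need one.</p>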
<p>One additional optimization we introduce here is thread coarsening, which means that each thread is tasked with doing more work on its own. The motivation is that if our grid launches more blocks than the hardware can keep resident on its SMs at once, the excess blocks are queued and run only after earlier blocks finish. Those blocks are effectively executed serially anyway, so we may as well have each thread do more work in the first place and avoid some redundant data loading and synchronization overhead. However, we must be careful not to coarsen so much that we no longer take full advantage of the hardware. For our tiled matrix multiplication kernel, some coarsening can make sense for large matrices. Although tiling has reduced redundancy in global memory accesses, two different blocks still load the same “wide row” of A when they compute two side-by-side output tiles in C. We can experiment with a thread coarsening factor of 2, which means each block processes two adjacent output tiles in C rather than one and loads each tile of A only once for both.</p>
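<p>As a rough back-of-the-envelope check on that saving, the sketch below counts how many tile elements are fetched from global memory in total. It is a simplified model under stated assumptions, not a measurement: it ignores caches, counts zero-padded slots as loads, and assumes each block loads one tile of A plus one tile of B per coarsened output tile in every phase. The helper names <code>ceil_div</code> and <code>tiled_global_loads</code> are hypothetical.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: integer division rounded up.
std::int64_t ceil_div(std::int64_t a, std::int64_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Hypothetical model of total global-memory element loads for the tiled
// scheme. Ignores caching and counts padded (out-of-range) slots as loads.
std::int64_t tiled_global_loads(std::int64_t M, std::int64_t N, std::int64_t K,
                                std::int64_t tile, std::int64_t coarse) {
    assert(tile > 0 && coarse > 0);
    // Each block covers one tile of rows and `coarse` tiles of columns.
    std::int64_t blocks = ceil_div(M, tile) * ceil_div(N, tile * coarse);
    // Number of steps along the shared K dimension.
    std::int64_t phases = ceil_div(K, tile);
    // Per phase: one tile of A, plus one tile of B per coarsened output tile.
    std::int64_t loads_per_phase = tile * tile * (1 + coarse);
    return blocks * phases * loads_per_phase;
}
```

<p>Under this model, with M = N = K = 1024 and a 16 x 16 tile, a coarsening factor of 1 gives 134,217,728 element loads and a factor of 2 gives 100,663,296, a 25% reduction that comes entirely from sharing the A tile between the two output tiles.</p>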
<section id="annotated-code-1" class="level3 page-columns page-full">
<h3 class="anchored" data-anchor-id="annotated-code-1">Annotated Code</h3>
<div class="column-screen-inset">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-2" style="background: #f1f3f5;"><pre class="sourceCode cpp code-annotation-code code-with-copy code-annotated"><code class="sourceCode cpp"><span id="annotated-cell-2-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_fp16.h&gt;</span></span>
<span id="annotated-cell-2-2"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_runtime.h&gt;</span></span>
<span id="annotated-cell-2-3"></span>
<span id="annotated-cell-2-4"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#define TILE_WIDTH </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span></span>
<span id="annotated-cell-2-5"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#define COARSE_FACTOR </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="annotated-cell-2-6"></span>
<span id="annotated-cell-2-7">__global__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> gemm_tiled_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-2-8">    </span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-2" data-target-annotation="1">1</button><span id="annotated-cell-2-9" class="code-annotation-target">    __shared__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> As<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-2-10">    __shared__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> Bs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-2-11"></span>
<span id="annotated-cell-2-12">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> blockIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-2" data-target-annotation="2">2</button><span id="annotated-cell-2-13" class="code-annotation-target">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> colStart <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> COARSE_FACTOR <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> blockIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-2-14"></span>
<span id="annotated-cell-2-15">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> sum<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>COARSE_FACTOR<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span> </span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-2" data-target-annotation="3">3</button><span id="annotated-cell-2-16" class="code-annotation-target">    <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-2-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> COARSE_FACTOR<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-2-18">        sum<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-2-19">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-2-20"></span>
<span id="annotated-cell-2-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Loop over the K-dimension (shared dimension)</span></span>
<span id="annotated-cell-2-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> phase <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> phase <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> phase<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-2-23">        </span>
<span id="annotated-cell-2-24">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// --- Load A ---</span></span>
<span id="annotated-cell-2-25">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// A is (M x K). </span></span>
<span id="annotated-cell-2-26">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Row comes from global 'row'. </span></span>
<span id="annotated-cell-2-27">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Col comes from 'phase' and 'threadIdx.x'.</span></span>
<span id="annotated-cell-2-28">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> a_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> phase <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-2-29">        As<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> </span>
<span id="annotated-cell-2-30">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> a_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">?</span> </span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-2" data-target-annotation="4">4</button><span id="annotated-cell-2-31" class="code-annotation-target">            __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> a_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">])</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-2-32"></span>
<span id="annotated-cell-2-33">        <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-2-34">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> COARSE_FACTOR<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-2-35">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> colStart <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-2-36"></span>
<span id="annotated-cell-2-37">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// --- Load B ---</span></span>
<span id="annotated-cell-2-38">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// B is (K x N). </span></span>
<span id="annotated-cell-2-39">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Row comes from 'phase' and 'threadIdx.y'. </span></span>
<span id="annotated-cell-2-40">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Col comes from global 'col'.</span></span>
<span id="annotated-cell-2-41">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> b_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> phase <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-2-42">            </span>
<span id="annotated-cell-2-43">            Bs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> </span>
<span id="annotated-cell-2-44">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>b_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">?</span></span>
<span id="annotated-cell-2-45">                __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>b_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">])</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> </span>
<span id="annotated-cell-2-46">            </span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-2" data-target-annotation="5">5</button><span id="annotated-cell-2-47" class="code-annotation-target">            __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-2-48"></span>
<span id="annotated-cell-2-49">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-2-50">                sum<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> As<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> Bs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-2-51">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-2-52">            __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-2-53">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-2-54">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-2-55"></span>
<span id="annotated-cell-2-56">    <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-2-57">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> COARSE_FACTOR<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-2-58">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> colStart <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-2" data-target-annotation="6">6</button><span id="annotated-cell-2-59" class="code-annotation-target">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-2-60">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// C is (M x N), stride is N</span></span>
<span id="annotated-cell-2-61">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> initial_val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>idx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]);</span></span>
<span id="annotated-cell-2-62">            C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>idx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __float2half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> sum<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> initial_val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-2-63">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-2-64">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-2-65"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-2-66"></span>
<span id="annotated-cell-2-67"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">extern</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> solve<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="annotated-cell-2-68">                      <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-2-69"></span>
<span id="annotated-cell-2-70">    dim3 block<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-2-71">    </span>
<span id="annotated-cell-2-72">    dim3 grid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-2" data-target-annotation="7">7</button><span id="annotated-cell-2-73" class="code-annotation-target">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> COARSE_FACTOR<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> COARSE_FACTOR<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">),</span></span>
<span id="annotated-cell-2-74">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> TILE_WIDTH</span>
<span id="annotated-cell-2-75">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> </span>
<span id="annotated-cell-2-76"></span>
<span id="annotated-cell-2-77">    gemm_tiled_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;&lt;</span>grid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> block<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-2-78">    cudaDeviceSynchronize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-2-79"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-hidden code-annotation-container-grid">
<dt data-target-cell="annotated-cell-2" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-2" data-code-lines="9" data-code-annotation="1"><strong>Shared memory</strong>: We declare our block shared memory. One can also pass the total size of block shared memory to the kernel dynamically at launch if desired. In our case, we have a predetermined tile width, so a static declaration is enough. Note that we need to be cognizant of the total shared memory available on an SM. Our oldest GPU, the T4, has 64 KB of shared memory per SM. Here, we have two arrays of 16 x 16 floats each, so 512 floats in total, or 2 KB at 4 bytes per float. We’re well within the limits. I went ahead and converted the halves to floats at this stage since we’re so far within shared memory limits, but to halve the shared memory allocation, we could declare the shared memory arrays as type half and convert them at compute time.</span>
</dd>
<dt data-target-cell="annotated-cell-2" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-2" data-code-lines="13" data-code-annotation="2"><strong>Coarsening</strong>: We set COARSE_FACTOR to 2, so each thread is going to load in 2 elements each from A and B, and compute 2 output elements in C. We are loading in two horizontal tiles at a time per block, so we need to apply our coarsening factor to our column computation.</span>
</dd>
<dt data-target-cell="annotated-cell-2" data-target-annotation="3">3</dt>
<dd>
<span data-code-cell="annotated-cell-2" data-code-lines="16" data-code-annotation="3"><strong>Loop unrolling</strong>: <code>#pragma unroll</code> is a directive that asks the compiler to try to unroll the loop fully, especially if the total number of iterations is known at compile time. To unroll a loop means to duplicate the code in the loop body rather than perform a condition check and a jump back to the start of the loop body. This allows us to avoid the execution speed cost of checking the loop condition, with the tradeoff of increasing code size. From here on out, we will typically unroll any loop with a constant number of iterations.</span>
</dd>
<dt data-target-cell="annotated-cell-2" data-target-annotation="4">4</dt>
<dd>
<span data-code-cell="annotated-cell-2" data-code-lines="31" data-code-annotation="4"><strong>Boundary checks</strong>: Our tiles are a fixed size. So if our matrix dimensions are not all multiples of 16, some tiles won’t be fully contained within the input matrices, and their threads would try to access out-of-bounds indices. We simply set these values to 0 in shared memory so that they contribute nothing to the accumulated sum and don’t impact the result.</span>
</dd>
<dt data-target-cell="annotated-cell-2" data-target-annotation="5">5</dt>
<dd>
<span data-code-cell="annotated-cell-2" data-code-lines="47" data-code-annotation="5"><strong><code>__syncthreads()</code></strong>: This instruction forces each thread in the block to halt here and wait until every other thread in the block reaches this point. This first <code>__syncthreads()</code> call guards against a Read-After-Write hazard, and the one after it guards against a Write-After-Read hazard. In the first case, individual threads read shared memory that other threads in their block are writing to, so the writes must complete before any reads begin. In the second case, if we don’t have a barrier, then some threads risk proceeding to the next loop iteration and overwriting shared memory before other threads have read it for their computation on the previous iteration.</span>
</dd>
<dt data-target-cell="annotated-cell-2" data-target-annotation="6">6</dt>
<dd>
<span data-code-cell="annotated-cell-2" data-code-lines="59" data-code-annotation="6"><strong>Another boundary check</strong>: When we write to C, we again need to check that we are within bounds, since some tiles may not be fully contained at the end of the grid.</span>
</dd>
<dt data-target-cell="annotated-cell-2" data-target-annotation="7">7</dt>
<dd>
<span data-code-cell="annotated-cell-2" data-code-lines="73" data-code-annotation="7"><strong>Grid calculation with coarsening</strong>: We adjust our grid calculation to account for the coarsening in the horizontal dimension; this impacts the total number of blocks we need horizontally.</span>
</dd>
</dl>
</div>
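<p>The grid calculation in annotation 7 is a ceiling division, which can be sketched on its own in plain host-side C++. This is a minimal sketch, not the kernel's launch code: <code>ceil_div</code>, <code>GridDims</code>, and <code>grid_for</code> are hypothetical names introduced here for illustration, and <code>TILE_WIDTH = 16</code> and <code>COARSE_FACTOR = 2</code> are taken from the annotations above.</p>

```cpp
#include <cassert>

// Values taken from the annotations above.
constexpr int TILE_WIDTH = 16;
constexpr int COARSE_FACTOR = 2;

// Hypothetical helper: integer ceiling division for positive operands.
constexpr int ceil_div(int a, int b) {
    return (a + b - 1) / b;
}

// Hypothetical stand-in for dim3 so the sketch compiles without CUDA headers.
struct GridDims {
    int x;  // number of blocks along the columns of C
    int y;  // number of blocks along the rows of C
};

// Each block covers TILE_WIDTH rows and TILE_WIDTH * COARSE_FACTOR columns of C,
// so only the horizontal block count is affected by coarsening.
inline GridDims grid_for(int M, int N) {
    assert(M > 0 && N > 0);
    return GridDims{ceil_div(N, TILE_WIDTH * COARSE_FACTOR),
                    ceil_div(M, TILE_WIDTH)};
}
```

<p>For example, under these assumptions a 100 x 100 output needs 4 blocks horizontally (100 / 32, rounded up) and 7 blocks vertically (100 / 16, rounded up). The partially filled blocks at the edges are the reason the kernel needs its boundary checks.</p>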
</section>
<section id="arithmetic-intensity-1" class="level3">
<h3 class="anchored" data-anchor-id="arithmetic-intensity-1">Arithmetic Intensity</h3>
<p>Now that we are reusing data loaded from global memory, our arithmetic intensity is higher. The coarsening factor doesn’t impact the arithmetic intensity, so let’s ignore it for the calculation. A single thread still computes a single output element in C, but it no longer has to load every element of the row of A and the column of B used in that dot product. For each tile, it loads only one element of A and one element of B from global memory; the other 15 elements it needs from each matrix were loaded into shared memory by other threads in the block. Therefore we have reduced the number of global memory accesses by a factor of 16. We are still performing the same number of floating point operations, so our arithmetic intensity is 16 times higher than that of the naive kernel. Hence the arithmetic intensity of this kernel is 8 FLOPs/B.</p>
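<p>The arithmetic behind that figure can be checked with a short sketch. It assumes FP16 inputs at 2 bytes each, counts one multiply and one add per step of the dot product, and ignores the traffic for reading and writing C; the function names are hypothetical and exist only for this illustration.</p>

```cpp
#include <cassert>

// Assumption: inputs are FP16, so each element loaded from global memory is 2 bytes.
constexpr double BYTES_PER_HALF = 2.0;

// Naive kernel, per step of the dot product: one multiply and one add (2 FLOPs)
// for one element of A and one element of B loaded from global memory (4 bytes).
constexpr double naive_intensity() {
    return 2.0 / (2.0 * BYTES_PER_HALF);
}

// Tiled kernel, per tile: each thread loads one element of A and one of B (4 bytes)
// but performs tile_width multiply-adds (2 * tile_width FLOPs) out of shared memory.
inline double tiled_intensity(int tile_width) {
    assert(tile_width > 0);
    return (2.0 * tile_width) / (2.0 * BYTES_PER_HALF);
}
```

<p>Under these assumptions <code>naive_intensity()</code> is 0.5 FLOPs/B and <code>tiled_intensity(16)</code> is 8 FLOPs/B, which is the factor of 16 described above.</p>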
</section>
<section id="benchmarks-1" class="level3">
<h3 class="anchored" data-anchor-id="benchmarks-1">Benchmarks</h3>
<p>The runtime improved thanks to our increase in arithmetic intensity. However, at 8 FLOPs/B the kernel is still far below the ridge point of every GPU in the table, so it remains memory-bound on all of them. In the next section, we will address this by taking advantage of a fundamental hardware capability that happens to be available in every GPU in our test set.</p>
<table class="table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">GPU Model</th>
<th style="text-align: left;">Memory Bandwidth</th>
<th style="text-align: left;">Peak FP16 Compute</th>
<th style="text-align: left;">Ridge Point (FLOP/Byte)</th>
<th style="text-align: left;">Runtime (ms)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA T4</strong></td>
<td style="text-align: left;">320 GB/s</td>
<td style="text-align: left;">65 TFLOPS</td>
<td style="text-align: left;">203</td>
<td style="text-align: left;">6.73</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>NVIDIA A100 (80GB)</strong></td>
<td style="text-align: left;">2,039 GB/s</td>
<td style="text-align: left;">312 TFLOPS</td>
<td style="text-align: left;">153</td>
<td style="text-align: left;">0.72</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA H100 (SXM)</strong></td>
<td style="text-align: left;">3,350 GB/s</td>
<td style="text-align: left;">989 TFLOPS</td>
<td style="text-align: left;">295</td>
<td style="text-align: left;">0.37</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>NVIDIA H200 (SXM)</strong></td>
<td style="text-align: left;">4,800 GB/s</td>
<td style="text-align: left;">989 TFLOPS</td>
<td style="text-align: left;">206</td>
<td style="text-align: left;">0.36</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA B200</strong></td>
<td style="text-align: left;">8,000 GB/s</td>
<td style="text-align: left;">2,500 TFLOPS</td>
<td style="text-align: left;">312</td>
<td style="text-align: left;">0.33</td>
</tr>
</tbody>
</table>
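<p>The ridge point column follows directly from the two columns before it: peak compute divided by memory bandwidth. As a minimal sketch (the helper names are hypothetical, and the inputs are the figures from the table):</p>

```cpp
#include <cassert>
#include <cmath>

// Ridge point in FLOP/byte: peak compute (in TFLOPS) divided by
// memory bandwidth (in GB/s), after converting both to base units.
inline double ridge_point(double peak_tflops, double bandwidth_gb_per_s) {
    assert(bandwidth_gb_per_s > 0.0);
    return (peak_tflops * 1e12) / (bandwidth_gb_per_s * 1e9);
}

// A kernel is memory-bound when its arithmetic intensity is below the ridge point.
inline bool is_memory_bound(double intensity_flop_per_byte, double ridge) {
    return intensity_flop_per_byte < ridge;
}
```

<p>For the T4, <code>ridge_point(65, 320)</code> is about 203 FLOP/byte, and for the B200 the same division gives 312.5, shown as 312 in the table. With an arithmetic intensity of 8 FLOPs/B, <code>is_memory_bound</code> returns true for every row.</p>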
</section>
</section>
<section id="warp-matrix-multiply-accumulate" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="warp-matrix-multiply-accumulate">3. Warp Matrix Multiply Accumulate</h2>
<p>Every GPU in our test suite is modern enough to be equipped with Tensor Cores: programmable matrix-multiply-and-accumulate units that deliver much higher matrix-math throughput than the general-purpose CUDA cores we have used so far. Each SM has several of these Tensor Cores. An individual Tensor Core performs the operation <img src="https://latex.codecogs.com/png.latex?D%20=%20A%20%5Ctimes%20B%20+%20C">, where every matrix in the operation has size 4x4. We call the shape of this operation 4x4x4. Additionally, Tensor Cores natively handle mixed precision: the input matrices A and B are expected to be half-precision (FP16), while the accumulators C and D can be either FP16 or FP32.</p>
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p><img src="https://rohan-reddy.github.io/posts/001-gemm-optimization/images/04_tensor_core.png" class="img-fluid figure-img" style="width:90.0%"></p>
<figcaption class="margin-caption">Tensor Core performing a 4x4x4 matrix multiply and accumulate operation <span class="citation" data-cites="tensorcores">(2024)</span></figcaption>
</figure>
</div>
<p>This capability is exposed to us as the Warp Matrix Multiply Accumulate (WMMA) API. During program execution, all 32 threads of a warp cooperate, using multiple Tensor Cores at a time, to process a single 16x16x16 MMA operation.</p>
<p>There are several advantages of using WMMA rather than manually programming the matrix multiply and accumulate operation like we did in previous kernels.</p>
<ol type="1">
<li>Single instruction: As opposed to issuing separate multiplication and addition instructions manually, the warp scheduler issues a single instruction to the Tensor Core hardware, which proceeds to take over the rest of the operation. GPUs have a limited rate at which they can feed instructions to the execution units, so this allows us to issue memory requests much faster and get closer to saturating the memory bus.</li>
<li>Matrix loading: The <code>load_matrix_sync</code> function in WMMA is optimized to use 128-bit global loads, so it retrieves 16 bytes (8 halves) in a single transaction. Meanwhile, when we manually load half data, we load 2 bytes at a time unless we specify otherwise (discussed in a subsequent section, when we explicitly issue vectorized loads).</li>
<li>Dedicated registers: Tensor Cores have dedicated register file data paths and accumulation buffers, laid out to maximize efficiency. We don’t have to deal with register pressure (the risk of allocating too many local variables in registers, which spill over to slower memory if we exceed the register capacity) or bank conflicts (discussed in a subsequent section). None of this has to be managed by hand, as it’s already optimized when we use the Tensor Cores.</li>
</ol>
<p>One disadvantage of WMMA is that we are locked into a fixed operation shape, 16x16x16 in our case. Later on, we’ll adapt our kernel to handle arbitrary matrix sizes. For now, our host code decides whether to use the WMMA kernel based on the input matrix sizes.</p>
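<p>One way the host-side choice could look is sketched below. This is an illustration under stated assumptions, not the dispatch code used in this note: <code>can_use_wmma</code>, <code>choose_kernel</code>, and <code>GemmKernel</code> are hypothetical names, the rule assumed here is that M, N, and K must all be multiples of 16, and falling back to the tiled kernel from the previous section is also an assumption.</p>

```cpp
#include <cassert>

// Assumption: the WMMA kernel works on 16x16x16 tiles only.
constexpr int WMMA_DIM = 16;

// Hypothetical predicate: true when every dimension is a multiple of the WMMA tile size.
constexpr bool can_use_wmma(int M, int N, int K) {
    return M % WMMA_DIM == 0 && N % WMMA_DIM == 0 && K % WMMA_DIM == 0;
}

// Hypothetical labels for the two kernels the host can launch.
enum class GemmKernel { Tiled, Wmma };

// Pick the WMMA kernel when the shapes allow it; otherwise fall back to
// the tiled kernel, which handles ragged edges with boundary checks.
inline GemmKernel choose_kernel(int M, int N, int K) {
    assert(M > 0 && N > 0 && K > 0);
    return can_use_wmma(M, N, K) ? GemmKernel::Wmma : GemmKernel::Tiled;
}
```

<p>With this rule, a 4096 x 4096 x 4096 problem would go to the WMMA kernel, while a 100 x 100 x 100 problem would fall back to the tiled kernel.</p>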
<section id="annotated-code-2" class="level3 page-columns page-full">
<h3 class="anchored" data-anchor-id="annotated-code-2">Annotated Code</h3>
<div class="column-screen-inset">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-3" style="background: #f1f3f5;"><pre class="sourceCode cpp code-annotation-code code-with-copy code-annotated"><code class="sourceCode cpp"><span id="annotated-cell-3-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_runtime.h&gt;</span></span>
<span id="annotated-cell-3-2"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_fp16.h&gt;</span></span>
<span id="annotated-cell-3-3"></span>
<span id="annotated-cell-3-4"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;mma.h&gt;</span></span>
<span id="annotated-cell-3-5"></span>
<span id="annotated-cell-3-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">using</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">namespace</span> nvcuda<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-7"></span>
<span id="annotated-cell-3-8"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#define WARP_SIZE </span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span></span>
<span id="annotated-cell-3-9"></span>
<span id="annotated-cell-3-10">__global__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> gemm_wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-3-11">   </span>
<span id="annotated-cell-3-12">   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Leading dimensions for Row-Major matrices</span></span>
<span id="annotated-cell-3-13">   <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> lead_dim_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// A: M x K. Stride between rows is K</span></span>
<span id="annotated-cell-3-14">   <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> lead_dim_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// B: K x N. Stride between rows is N</span></span>
<span id="annotated-cell-3-15">   <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> lead_dim_C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// C: M x N. Stride between rows is N</span></span>
<span id="annotated-cell-3-16"></span>
<span id="annotated-cell-3-17">   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 2D grid tiling. We will have multiple warps worth of threads in the x dimension.</span></span>
<span id="annotated-cell-3-18">   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Hence warp_col is divided by warp size. </span></span>
<span id="annotated-cell-3-19">   <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> warp_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> blockIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-20">   <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> warp_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> blockIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> WARP_SIZE<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-21"></span>
<span id="annotated-cell-3-22">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Declare fragments</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-3" data-target-annotation="1">1</button><span id="annotated-cell-3-23" class="code-annotation-target">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>matrix_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> A_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-24">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>matrix_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> B_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-25">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>accumulator<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-26">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>accumulator<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> C_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-27"></span>
<span id="annotated-cell-3-28">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Initialize the accumulator fragment for A * B with zeroes.</span></span>
<span id="annotated-cell-3-29">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fill_fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-3-30"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-3" data-target-annotation="2">2</button><span id="annotated-cell-3-31" class="code-annotation-target">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-3-32">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Get the starting row and column of our 16 x 16 tiles in both A and B.</span></span>
<span id="annotated-cell-3-33">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> row_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-34">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-35">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> row_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-36">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-37"></span>
<span id="annotated-cell-3-38">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Check bounds</span></span>
<span id="annotated-cell-3-39">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>row_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> col_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> row_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> col_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-3-40"></span>
<span id="annotated-cell-3-41">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load matrices. </span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-3" data-target-annotation="3">3</button><span id="annotated-cell-3-42" class="code-annotation-target">            wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>load_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>A_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> lead_dim_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> lead_dim_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-3-43">            wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>load_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>B_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> lead_dim_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> lead_dim_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-3-44"></span>
<span id="annotated-cell-3-45">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Perform MMA. </span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-3" data-target-annotation="4">4</button><span id="annotated-cell-3-46" class="code-annotation-target">            wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>mma_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> A_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> B_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-3-47">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-3-48">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-3-49"></span>
<span id="annotated-cell-3-50">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> row_C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-51">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col_C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-52"></span>
<span id="annotated-cell-3-53">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Complete the GEMM operation: scale and add result fragments, then write to global memory</span></span>
<span id="annotated-cell-3-54">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>row_C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> col_C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-3-55">        wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>load_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>C_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row_C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> lead_dim_C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> lead_dim_C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>mem_row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-3-56"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-3" data-target-annotation="5">5</button><span id="annotated-cell-3-57" class="code-annotation-target">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> C_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>num_elements<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-3-58">            C_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __float2half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>C_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]));</span></span>
<span id="annotated-cell-3-59">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-3-60"></span>
<span id="annotated-cell-3-61">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Store the result in global memory</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-3" data-target-annotation="6">6</button><span id="annotated-cell-3-62" class="code-annotation-target">        wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>store_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row_C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> lead_dim_C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> lead_dim_C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>mem_row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-3-63">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-3-64"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-3-65"></span>
<span id="annotated-cell-3-66"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Same as in Tiled Matrix Multiplication</span></span>
<span id="annotated-cell-3-67">__global__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> gemm_tiled_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">...</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-3-68"></span>
<span id="annotated-cell-3-69"></span>
<span id="annotated-cell-3-70"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">extern</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> solve<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-3-71"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-3" data-target-annotation="7">7</button><span id="annotated-cell-3-72" class="code-annotation-target">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-3-73">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> WARPS_X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WARPS_Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-3" data-target-annotation="8">8</button><span id="annotated-cell-3-74" class="code-annotation-target">        dim3 blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>WARPS_X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> WARP_SIZE<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WARPS_Y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-3-75">        </span>
<span id="annotated-cell-3-76">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> num_col_tiles <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-77">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> num_row_tiles <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-3-78">        </span>
<span id="annotated-cell-3-79">        dim3 gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="annotated-cell-3-80">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>num_col_tiles <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> WARPS_X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> WARPS_X<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="annotated-cell-3-81">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>num_row_tiles <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> WARPS_Y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> WARPS_Y</span>
<span id="annotated-cell-3-82">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-3-83">        </span>
<span id="annotated-cell-3-84">        gemm_wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;&lt;</span>gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-3-85">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-3-86">        dim3 block<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-3-87">        dim3 grid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="annotated-cell-3-88">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> COARSE_FACTOR<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> COARSE_FACTOR<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">),</span> </span>
<span id="annotated-cell-3-89">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> TILE_WIDTH</span>
<span id="annotated-cell-3-90">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-3-91"></span>
<span id="annotated-cell-3-92">        gemm_tiled_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;&lt;</span>grid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> block<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-3-93">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-3-94"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-hidden code-annotation-container-grid">
<dt data-target-cell="annotated-cell-3" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-3" data-code-lines="23" data-code-annotation="1"><strong>Fragments</strong>: The operand matrices must be loaded into registers before the Tensor Cores can perform MMA. Since MMA is a warp-wide operation, these registers are distributed across the 32 threads of the warp, and each thread holds a fragment of the overall matrix. A fragment is a templated type that accepts parameters for: the role of the matrix the fragment holds (matrix_a, matrix_b, or accumulator), the shape of the overall operation, the data type, and, for the operand matrices, whether the data is row- or column-major. We pass in 16 three times for the shape of the overall operation, that is M = N = K = 16: the output tile has 16 rows and 16 columns, and each dot product has length 16.</span>
</dd>
<dt data-target-cell="annotated-cell-3" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-3" data-code-lines="31" data-code-annotation="2"><strong>The K-loop</strong>: Each warp computes one 16 x 16 tile of A * B. That tile depends on a strip of 16 rows of A and a strip of 16 columns of B, and each of these rows and columns has K elements. Overall, we are computing a 16 x 16 output tile in C: C (16 x 16) = A (16 x K) * B (K x 16). However, a single MMA operation can only store and use 16 x 16 chunks of A and B at once. Therefore we split K into chunks of 16 and step along the K dimension, for K / 16 iterations in total. On each loop iteration, we accumulate into accum_frag: C (16 x 16) += A (16 x 16) * B (16 x 16).</span>
</dd>
<dt data-target-cell="annotated-cell-3" data-target-annotation="3">3</dt>
<dd>
<span data-code-cell="annotated-cell-3" data-code-lines="42" data-code-annotation="3"><strong>Loading into a fragment</strong>: To load data into a fragment, we need to specify the fragment to load into, the pointer to the memory we are loading from, and the leading dimension of the matrix (so that the operation knows the stride length between rows for a row-major matrix, or between columns for a column-major matrix).</span>
</dd>
<dt data-target-cell="annotated-cell-3" data-target-annotation="4">4</dt>
<dd>
<span data-code-cell="annotated-cell-3" data-code-lines="46" data-code-annotation="4"><strong>Matrix Multiply Accumulate</strong>: Computes Arg1 = Arg2 * Arg3 + Arg4.</span>
</dd>
<dt data-target-cell="annotated-cell-3" data-target-annotation="5">5</dt>
<dd>
<span data-code-cell="annotated-cell-3" data-code-lines="57" data-code-annotation="5"><strong>Modifying data within fragments</strong>: There are 16 x 16 = 256 elements in C_frag and 32 threads per warp. Each thread therefore holds 256 / 32 = 8 elements. So the loop will have 8 iterations. The fragment’s internal storage is opaque - we don’t know which thread holds each element. Luckily, this doesn’t matter for element-wise operations like scaling. What about for accum_frag and C_frag? As they are declared with identical template parameters, they are guaranteed to have the same internal layout. Hence we can be sure we are adding the correct corresponding elements.</span>
</dd>
<dt data-target-cell="annotated-cell-3" data-target-annotation="6">6</dt>
<dd>
<span data-code-cell="annotated-cell-3" data-code-lines="62" data-code-annotation="6"><strong>Storing back to global memory</strong>: Here we need to pass the pointer to the memory that we are storing into, the fragment we are storing from, the leading dimension of the matrix, and whether the matrix is row or column major.</span>
</dd>
<dt data-target-cell="annotated-cell-3" data-target-annotation="7">7</dt>
<dd>
<span data-code-cell="annotated-cell-3" data-code-lines="72" data-code-annotation="7"><strong>Restrictions on WMMA</strong>: With the 16x16x16 fragment shape used here, each WMMA operation works on full 16 x 16 tiles only, so we need to check that our matrix dimensions are multiples of 16. If not, we’ll launch our tiled GEMM kernel instead. In a later section, we will adjust our WMMA kernel to handle arbitrary matrix dimensions.</span>
</dd>
<dt data-target-cell="annotated-cell-3" data-target-annotation="8">8</dt>
<dd>
<span data-code-cell="annotated-cell-3" data-code-lines="74" data-code-annotation="8"><strong>Grid Dimensions</strong>: The block dimensions work out to be (128, 4), so we have 512 total threads per block. Each row in our block has 128 threads, so a total of 4 warps, and then we have 4 rows, so we essentially have a 4x4 grid of warps in each block. Since each warp computes a 16x16 output tile, each warp is handling the same output as each block did in our tiled GEMM kernel. Since each block has a 4x4 grid of warps, we are then computing a 64x64 output tile of C for each block. We know that our matrix dimensions are divisible by 16, but they may not be divisible by 64. So in the blocks at the edge of our grid, we may have some warps that fall out of bounds of C. Luckily we have the necessary boundary checks in our kernel, so we just need to do our ceiling division here to ensure our blocks fully cover C, without worrying about whether some of them go beyond the edges of C.</span>
</dd>
</dl>
</div>
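<p>To make the ceiling division in the last annotation concrete, here is a minimal host-side sketch in plain C++ (no GPU needed). The names <code>ceil_div</code> and <code>active_warps_in_last_block</code> are my own illustrative helpers, not part of the kernel above, and the sketch assumes the matrix dimension is already a multiple of 16.</p>

```cpp
// Host-side sketch of the launch geometry; all names here are illustrative.
constexpr int WMMA_TILE = 16;                           // each warp owns one 16 x 16 tile of C
constexpr int WARPS_PER_SIDE = 4;                       // 4 x 4 warps per block
constexpr int BLOCK_TILE = WMMA_TILE * WARPS_PER_SIDE;  // so one block covers 64 x 64

// Ceiling division: the smallest number of tiles that covers `extent`.
constexpr int ceil_div(int extent, int tile) {
    return (extent + tile - 1) / tile;
}

// Along one axis, how many warps of the last block still land inside C.
// The remaining warps are the ones stopped by the kernel's boundary check.
// Assumes `extent` is a multiple of 16.
constexpr int active_warps_in_last_block(int extent) {
    return WARPS_PER_SIDE
         - (ceil_div(extent, BLOCK_TILE) * BLOCK_TILE - extent) / WMMA_TILE;
}

// Example: an 80 x 80 output matrix (a multiple of 16, but not of 64).
static_assert(ceil_div(80, BLOCK_TILE) == 2, "two blocks per axis cover 80");
static_assert(active_warps_in_last_block(80) == 1, "three of four warps are out of bounds");
static_assert(active_warps_in_last_block(128) == 4, "no idle warps when divisible by 64");
```

<p>For the 80 x 80 case, the second block along each axis spans rows 64 to 127, but only rows 64 to 79 exist, so one warp does useful work and three exit early.</p>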
</section>
<section id="arithmetic-intensity-2" class="level3">
<h3 class="anchored" data-anchor-id="arithmetic-intensity-2">Arithmetic Intensity</h3>
<p>To calculate the arithmetic intensity of this kernel, we will focus on the main loop where the loading from global memory and MMA operations happen. On each loop iteration, a warp collectively loads one 16x16 tile from each of A and B. So we retrieve 512 half-precision floats for a total of 1024 bytes. Then we are modifying the running sums for a 16x16 output tile in C. For each element in this output tile, we are taking a dot product of two 16-element vectors, so we perform 16 multiplications and 16 additions. Therefore we perform 32 FLOPs for each of the 256 elements in the 16x16 output tile, for a total of 8192 FLOPs. Therefore, our arithmetic intensity is approximately 8192 / 1024 = 8 FLOPs/B.</p>
<p>Notice that this is exactly the same as the arithmetic intensity of our previous tiled matrix multiplication kernel. In this kernel, I avoided using shared memory so that I could have a very simple and clear WMMA implementation. In practice, however, we can use the same collaborative shared memory loading technique from our prior kernel to improve the arithmetic intensity of our WMMA kernel even further. I will do exactly this (among other improvements) in subsequent sections. The other thing I observed with this kernel is that, despite having the same arithmetic intensity as our tiled matrix multiplication, it is significantly faster. This is because WMMA is a hardware-native operation. In the section introduction, we discussed the anatomy of a WMMA operation and why it is so fast, but I’ll call out a few ways the arithmetic intensity here is misleading. First, although it is standard to count multiplication and addition as separate FLOPs, they are fused into a single operation on the hardware when using Tensor Cores. Second, we discussed that WMMA fragments live in registers instead of shared memory. This is not reflected in our arithmetic intensity (which only takes into account global memory accesses). In our tiled GEMM kernel, a global memory access only moves data into shared memory, so we still have to pull that data from shared memory to our compute cores. Here, we load from global memory directly into the registers that feed the Tensor Cores.</p>
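<p>The arithmetic intensity calculation from the start of this section can be written out as a small host-side C++ sketch. The function names are illustrative and only restate the counting argument: two 16 x 16 half-precision tiles loaded per iteration, and one multiply plus one add per term of each 16-element dot product.</p>

```cpp
// Per-warp, per-K-iteration accounting for the simple WMMA kernel.
// All names are illustrative; the numbers come from the text above.
constexpr int TILE = 16;           // WMMA tile edge (M = N = K = 16)
constexpr int BYTES_PER_HALF = 2;  // one FP16 value is 2 bytes

// One 16 x 16 tile of A plus one 16 x 16 tile of B, in bytes.
constexpr int bytes_loaded_per_iteration() {
    return 2 * TILE * TILE * BYTES_PER_HALF;
}

// 16 x 16 outputs, each a 16-term dot product (1 multiply + 1 add per term).
constexpr int flops_per_iteration() {
    return TILE * TILE * (2 * TILE);
}

constexpr double arithmetic_intensity() {
    return double(flops_per_iteration()) / bytes_loaded_per_iteration();
}

static_assert(bytes_loaded_per_iteration() == 1024, "512 halves = 1024 bytes");
static_assert(flops_per_iteration() == 8192, "256 outputs x 32 FLOPs");
static_assert(arithmetic_intensity() == 8.0, "8 FLOPs per byte");
```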
</section>
<section id="benchmarks-2" class="level3">
<h3 class="anchored" data-anchor-id="benchmarks-2">Benchmarks</h3>
<table class="table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">GPU Model</th>
<th style="text-align: left;">Memory Bandwidth</th>
<th style="text-align: left;">Peak FP16 Compute</th>
<th style="text-align: left;">Ridge Point (FLOP/Byte)</th>
<th style="text-align: left;">Runtime (ms)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA T4</strong></td>
<td style="text-align: left;">320 GB/s</td>
<td style="text-align: left;">65 TFLOPS</td>
<td style="text-align: left;">203</td>
<td style="text-align: left;">1.68</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>NVIDIA A100 (80GB)</strong></td>
<td style="text-align: left;">2,039 GB/s</td>
<td style="text-align: left;">312 TFLOPS</td>
<td style="text-align: left;">153</td>
<td style="text-align: left;">0.17</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA H100 (SXM)</strong></td>
<td style="text-align: left;">3,350 GB/s</td>
<td style="text-align: left;">989 TFLOPS</td>
<td style="text-align: left;">295</td>
<td style="text-align: left;">0.10</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>NVIDIA H200 (SXM)</strong></td>
<td style="text-align: left;">4,800 GB/s</td>
<td style="text-align: left;">989 TFLOPS</td>
<td style="text-align: left;">206</td>
<td style="text-align: left;">0.10</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA B200</strong></td>
<td style="text-align: left;">8,000 GB/s</td>
<td style="text-align: left;">2,500 TFLOPS</td>
<td style="text-align: left;">312</td>
<td style="text-align: left;">0.10</td>
</tr>
</tbody>
</table>
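<p>The ridge point column is peak compute divided by memory bandwidth. As a rough check, the host-side C++ sketch below recomputes it from the two columns next to it. The helper names are illustrative, and I am assuming the table truncates the result to a whole number (the B200 works out to 312.5).</p>

```cpp
// Roofline ridge point: the arithmetic intensity (FLOP/byte) above which a
// kernel stops being limited by memory bandwidth. Names are illustrative.
constexpr double ridge_point(double peak_tflops, double bandwidth_gb_per_s) {
    // TFLOPS -> FLOP/s is a factor of 1e12; GB/s -> byte/s is a factor of 1e9.
    return (peak_tflops * 1e12) / (bandwidth_gb_per_s * 1e9);
}

// Assumption: the table shows the ridge point truncated to an integer.
constexpr int truncated_ridge_point(double peak_tflops, double bandwidth_gb_per_s) {
    return int(ridge_point(peak_tflops, bandwidth_gb_per_s));
}

// A kernel whose intensity sits below the ridge point is memory-bound.
constexpr bool is_memory_bound(double intensity, double ridge) {
    return ridge > intensity;
}

static_assert(truncated_ridge_point(65, 320) == 203, "NVIDIA T4");
static_assert(truncated_ridge_point(312, 2039) == 153, "NVIDIA A100 (80GB)");
static_assert(truncated_ridge_point(2500, 8000) == 312, "NVIDIA B200");

// The 8 FLOPs/B computed for this kernel is far below every ridge point here.
static_assert(is_memory_bound(8.0, ridge_point(312, 2039)), "memory-bound on A100");
```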
</section>
</section>
<section id="double-buffer" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="double-buffer">4. Double Buffer</h2>
<p>The next improvement we can make to our kernel is the use of a double buffer. The goal of a double buffer is to hide the latency of fetching data from global memory. In our current implementation, when threads request data from global memory, the compute cores have to pause while we wait for the data to arrive. Then we start computing, but our memory units are now sitting idle. When we’re done, we request data again and repeat the cycle. At any given time, either our compute cores or memory units are sitting idle.</p>
<p>Instead, before we compute the current tile, we can issue an asynchronous request to load data for the next tile. The memory system then loads the next tile while we compute on the current one. There is a dedicated hardware unit in the GPU that handles this asynchronous loading, the Async Copy Engine (asynchronous copies from global to shared memory are hardware-accelerated on GPUs with compute capability 8.0 and newer).</p>
<p>The double buffer is so named because we declare shared memory that is double the size of what we need to compute on. That way, we can use half of the buffer to load the next tiles of A and B from global memory to shared memory, and the other half of the buffer holds the currently loaded data that we feed to our Tensor Cores. We can track which half of the buffer is ready and which is being loaded. So our process is as follows within each loop iteration:</p>
<ol type="1">
<li>Asynchronously request data for the next tile, writing it into the half of the buffer we are not about to use.</li>
<li>WMMA compute on the current tile, using the half of the buffer that is ready.</li>
<li>Barrier wait until the asynchronous request is complete. Then swap the <code>stage</code> index that tells us which half of the buffer is ready, and proceed to the next loop iteration.</li>
</ol>
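<p>The three steps above can be traced on the host without any GPU code. The sketch below is only a model of the bookkeeping, not the kernel itself: the function names are illustrative, and it assumes tile 0 has already been loaded into half 0 before the loop starts.</p>

```cpp
// Host-side model of the double-buffer schedule. Names are illustrative.

// Tiles alternate between the two halves of the buffer.
constexpr int stage_of_tile(int tile) { return tile % 2; }

// Flip the stage index between 0 and 1.
constexpr int swap_stage(int stage) { return stage ^ 1; }

// Walk the K-loop and check that we never load into the half we compute on,
// and that the "ready" half always holds the tile we are about to consume.
constexpr bool schedule_is_consistent(int num_tiles) {
    int stage = 0;  // assumption: tile 0 was loaded into half 0 before the loop
    for (int tile = 0; tile != num_tiles; ++tile) {
        int load_stage = swap_stage(stage);               // step 1: prefetch tile + 1 here
        if (load_stage == stage) return false;            // must not overwrite live data
        if (stage != stage_of_tile(tile)) return false;   // step 2: compute on `tile`
        stage = load_stage;                               // step 3: wait, then swap
    }
    return true;
}

static_assert(schedule_is_consistent(8), "halves alternate cleanly over 8 K-tiles");
```

<p>In the real kernel, step 3 also needs a block-wide barrier so that no warp starts reading a half that another warp's asynchronous copy has not finished filling.</p>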
<p>There are a few other optimizations related to the data loading and grid configuration that we’ll pack into this kernel that warrant some explanation ahead of time. First, we will have each block be composed of 4 warps in a 2 x 2 grid (so 128 total threads). Each warp will be responsible for computing a 32 x 32 output tile of C, so in total one block will compute a 64 x 64 output tile.</p>
<p>To accomplish this, we will still loop over the K-dimension in a wide row in A and wide column in B, just as pictured in the image from tiled matrix multiplication. However, we will specify the wide row in A to have 64 rows, and the wide column in B to have 64 columns. We still loop over K via increments of 16 at a time. So in each loop iteration over K, we will use a 64 x 16 chunk of A and a 16 x 64 chunk of B. This is the same process as tiled matrix multiplication, but we are now using a non-square tile.</p>
<p>Because we have a 2 x 2 grid of warps, each warp will use a 32 x 16 chunk of A and a 16 x 32 chunk of B, and perform 4 WMMA operations (since they only take matrices of size 16 x 16). We then add their output to our accumulator fragments (4 for each warp, since each WMMA operation accumulates to a different 16 x 16 output tile) in each loop iteration. By the time our K loop is complete, our block has fully computed the value of <img src="https://latex.codecogs.com/png.latex?A%20%5Ctimes%20B"> for a 64 x 64 tile of C.</p>
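<p>Here is a minimal host-side sketch of how the 2 x 2 warp layout maps onto the block's 64 x 64 tile. The <code>warp_row</code> and <code>warp_col</code> formulas match the kernel below, while the fragment indices <code>i</code> and <code>j</code> (each 0 or 1) and the helper names <code>frag_row</code>, <code>frag_col</code>, and <code>wmma_ops_per_warp</code> are illustrative assumptions of mine.</p>

```cpp
// Where each warp and each of its accumulator fragments sits inside the
// block's 64 x 64 output tile.
constexpr int WMMA = 16;       // edge of one WMMA tile
constexpr int WARP_TILE = 32;  // each warp owns a 32 x 32 region

// Warps 0..3 are laid out row by row in a 2 x 2 grid.
constexpr int warp_row(int warp_id) { return (warp_id / 2) * WARP_TILE; }
constexpr int warp_col(int warp_id) { return (warp_id % 2) * WARP_TILE; }

// Top-left corner of accumulator fragment (i, j) of a warp, with i, j in {0, 1}.
constexpr int frag_row(int warp_id, int i) { return warp_row(warp_id) + i * WMMA; }
constexpr int frag_col(int warp_id, int j) { return warp_col(warp_id) + j * WMMA; }

// WMMA operations each warp issues per 16-wide step along K.
constexpr int wmma_ops_per_warp() {
    return (WARP_TILE / WMMA) * (WARP_TILE / WMMA);
}

static_assert(warp_row(3) == 32, "warp 3 starts at row 32");
static_assert(warp_col(3) == 32, "warp 3 starts at column 32");
static_assert(frag_row(3, 1) == 48, "last fragment row starts at 48");
static_assert(frag_col(3, 1) == 48, "last fragment column starts at 48");
static_assert(wmma_ops_per_warp() == 4, "four 16 x 16 tiles per warp");
```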
<p>The reason we do this is similar to why we loaded to shared memory in our tiled GEMM: we want to avoid redundant data loading and load as much data from global memory at once as we can usefully share across our block. By arranging our warps in a 2 x 2 grid, we are also able to reuse more memory than if they were arranged in a straight line. For the collaborative data loading, we will use the thread ID in the block to determine what part of the current A (64 x 16) and B (16 x 64) chunks this thread will load. Each of these chunks holds 1024 halves, so it can be treated as 128 8-half vectors, and each of our 128 threads should load 8 elements from each chunk. To reduce the number of instructions to load from global memory, we will employ vectorized loads to load 8 halves at once. Therefore, our A chunk can be viewed as 64 rows of 2 vectors, and our B chunk can be viewed as 16 rows of 8 vectors. We will use a vectorized store to global memory in the final section too, when possible.</p>
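<p>As a sanity check on the load mapping, the host-side sketch below replays the thread-to-vector assignment for all 128 threads. It confirms that every element of the 64 x 16 A chunk and the 16 x 64 B chunk is loaded exactly once. The index formulas match the kernel below, and the checking function is an illustrative helper of mine (it needs C++14 or newer for the loops inside <code>constexpr</code>).</p>

```cpp
// Replay the collaborative load mapping on the host.
constexpr int THREADS = 128;
constexpr int VEC = 8;  // halves per vectorized load
constexpr int BLOCK_M = 64;
constexpr int BLOCK_N = 64;
constexpr int BLOCK_K = 16;

// A chunk (64 x 16): 2 vectors per row.
constexpr int row_A(int tid) { return tid / 2; }          // 0 to 63
constexpr int col_A(int tid) { return (tid % 2) * VEC; }  // 0 or 8
// B chunk (16 x 64): 8 vectors per row.
constexpr int row_B(int tid) { return tid / 8; }          // 0 to 15
constexpr int col_B(int tid) { return (tid % 8) * VEC; }  // 0, 8, ..., 56

// Count how often each element is touched; every count must be exactly 1.
constexpr bool loads_cover_chunks_exactly_once() {
    int hits_A[BLOCK_M * BLOCK_K] = {};
    int hits_B[BLOCK_K * BLOCK_N] = {};
    for (int tid = 0; tid != THREADS; ++tid) {
        for (int v = 0; v != VEC; ++v) {
            ++hits_A[row_A(tid) * BLOCK_K + col_A(tid) + v];
            ++hits_B[row_B(tid) * BLOCK_N + col_B(tid) + v];
        }
    }
    for (int i = 0; i != BLOCK_M * BLOCK_K; ++i) {
        if (hits_A[i] != 1) return false;
    }
    for (int i = 0; i != BLOCK_K * BLOCK_N; ++i) {
        if (hits_B[i] != 1) return false;
    }
    return true;
}

static_assert(row_B(THREADS - 1) == 15, "the last thread loads from row 15 of B");
static_assert(loads_cover_chunks_exactly_once(), "no gaps and no duplicate loads");
```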
<section id="annotated-code-3" class="level3 page-columns page-full">
<h3 class="anchored" data-anchor-id="annotated-code-3">Annotated Code</h3>
<div class="column-screen-inset">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-4" style="background: #f1f3f5;"><pre class="sourceCode cpp code-annotation-code code-with-copy code-annotated"><code class="sourceCode cpp"><span id="annotated-cell-4-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_runtime.h&gt;</span></span>
<span id="annotated-cell-4-2"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_fp16.h&gt;</span></span>
<span id="annotated-cell-4-3"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;mma.h&gt;</span></span>
<span id="annotated-cell-4-4"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_pipeline_primitives.h&gt;</span></span>
<span id="annotated-cell-4-5"></span>
<span id="annotated-cell-4-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">using</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">namespace</span> nvcuda<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-7"></span>
<span id="annotated-cell-4-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- CONFIGURATION -------------</span></span>
<span id="annotated-cell-4-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> BLOCK_M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-10"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// One block computes a 64 x 64 tile of the output matrix</span></span>
<span id="annotated-cell-4-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Accumulation step</span></span>
<span id="annotated-cell-4-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> WARP_SIZE <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> THREAD_COUNT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-14"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> WMMA <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-15"></span>
<span id="annotated-cell-4-16">__global__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> gemm_buffer_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-17"></span>
<span id="annotated-cell-4-18">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- INDEX CALCULATIONS -------------</span></span>
<span id="annotated-cell-4-19">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Linear view for data loading: which worker out of 128 threads am I?</span></span>
<span id="annotated-cell-4-20">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-21"></span>
<span id="annotated-cell-4-22">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Global position: what tile of the output matrix am I calculating?</span></span>
<span id="annotated-cell-4-23">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> block_row_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> blockIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-24">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> block_col_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> blockIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-25"></span>
<span id="annotated-cell-4-26">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// What warp am I in the 2x2 grid?</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-4" data-target-annotation="1">1</button><span id="annotated-cell-4-27" class="code-annotation-target">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> warp_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> WARP_SIZE<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-28">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> warp_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>warp_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-29">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> warp_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>warp_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-30"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-4" data-target-annotation="2">2</button><span id="annotated-cell-4-31" class="code-annotation-target"></span>
<span id="annotated-cell-4-32">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// A tile: 64 x 16. Each row has 2 8-element vectors. </span></span>
<span id="annotated-cell-4-33">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> row_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span>       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 0 to 63</span></span>
<span id="annotated-cell-4-34">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 0 or 8</span></span>
<span id="annotated-cell-4-35">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// B tile: 16 x 64. Each row has 8 8-element vectors. </span></span>
<span id="annotated-cell-4-36">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> row_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span>       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 0 to 15</span></span>
<span id="annotated-cell-4-37">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 0, 8, 16, 24, 32, 40, 48, or 56</span></span>
<span id="annotated-cell-4-38">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ----------------------------------------------</span></span>
<span id="annotated-cell-4-39"></span>
<span id="annotated-cell-4-40"></span>
<span id="annotated-cell-4-41">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- MEMORY INITIALIZATION ----------</span></span>
<span id="annotated-cell-4-42">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Double Buffer: Shared Memory</span></span>
<span id="annotated-cell-4-43">    __shared__ half sA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>BLOCK_M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 64 rows, 16 cols (K)</span></span>
<span id="annotated-cell-4-44">    __shared__ half sB<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 16 rows (K), 64 cols</span></span>
<span id="annotated-cell-4-45"></span>
<span id="annotated-cell-4-46">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Declare fragments and initialize accumulator. </span></span>
<span id="annotated-cell-4-47">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>matrix_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> a_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-48">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>matrix_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> b_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-4" data-target-annotation="3">3</button><span id="annotated-cell-4-49" class="code-annotation-target">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>accumulator<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-4-50"></span>
<span id="annotated-cell-4-51">    <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-4-52">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-53">        <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-4-54">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-55">            wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fill_fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-56">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-57">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-58"></span>
<span id="annotated-cell-4-59">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Pipeline setup</span></span>
<span id="annotated-cell-4-60">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> stage <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Alternates between 0 and 1</span></span>
<span id="annotated-cell-4-61">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ----------------------------------------------</span></span>
<span id="annotated-cell-4-62"></span>
<span id="annotated-cell-4-63"></span>
<span id="annotated-cell-4-64">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- PROLOGUE -------------</span></span>
<span id="annotated-cell-4-65">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load the first tile (k = 0) into stage 0 before the main loop starts.</span></span>
<span id="annotated-cell-4-66">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-67">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> src_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>block_row_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-68">        half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> dst_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>row_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-4-69"></span>
<span id="annotated-cell-4-70">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> src_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>block_col_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-71">        half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> dst_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sB<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>row_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-4-72"></span>
<span id="annotated-cell-4-73">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Async copy. int4 is the size of 8 half elements</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-4" data-target-annotation="4">4</button><span id="annotated-cell-4-74" class="code-annotation-target">        __pipeline_memcpy_async<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>dst_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> src_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">sizeof</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">));</span></span>
<span id="annotated-cell-4-75">        __pipeline_memcpy_async<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>dst_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> src_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">sizeof</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">));</span> </span>
<span id="annotated-cell-4-76"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-4" data-target-annotation="5">5</button><span id="annotated-cell-4-77" class="code-annotation-target">        __pipeline_commit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-4" data-target-annotation="6">6</button><span id="annotated-cell-4-78" class="code-annotation-target">        __pipeline_wait_prior<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-79">        __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-4-80">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-81">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------------------------------</span></span>
<span id="annotated-cell-4-82"></span>
<span id="annotated-cell-4-83">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- MAIN LOOP -------------</span></span>
<span id="annotated-cell-4-84">    <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-4-85">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> BLOCK_K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-86"></span>
<span id="annotated-cell-4-87">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> k_next <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> BLOCK_K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-88"></span>
<span id="annotated-cell-4-89">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 1. LOAD the next tile asynchronously</span></span>
<span id="annotated-cell-4-90">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>k_next <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-91">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Flip the buffer index: 0 becomes 1, 1 becomes 0</span></span>
<span id="annotated-cell-4-92">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> next_stage <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-93"></span>
<span id="annotated-cell-4-94">            <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> src_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>block_row_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>k_next <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-4" data-target-annotation="7">7</button><span id="annotated-cell-4-95" class="code-annotation-target">            half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> dst_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>next_stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>row_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-4-96">            </span>
<span id="annotated-cell-4-97">            <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> src_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>k_next <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>block_col_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-98">            half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> dst_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sB<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>next_stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>row_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-4-99"></span>
<span id="annotated-cell-4-100">            __pipeline_memcpy_async<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>dst_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> src_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">sizeof</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">));</span> </span>
<span id="annotated-cell-4-101">            __pipeline_memcpy_async<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>dst_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> src_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">sizeof</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">));</span> </span>
<span id="annotated-cell-4-102"></span>
<span id="annotated-cell-4-103">            __pipeline_commit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-4-104">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-105"></span>
<span id="annotated-cell-4-106">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 2. MATH: process the current tile. Recall we have a 2 x 2 grid of 16 x 16 subtiles for each warp.</span></span>
<span id="annotated-cell-4-107">        <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-4-108">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-109">            <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-4-110">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-111">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Calculate pointer into shared memory for this sub-tile</span></span>
<span id="annotated-cell-4-112">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> smem_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-113">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> smem_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-114"></span>
<span id="annotated-cell-4-115">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load fragments from shared memory</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-4" data-target-annotation="8">8</button><span id="annotated-cell-4-116" class="code-annotation-target">                half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> tile_ptr_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>smem_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-4-117">                half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> tile_ptr_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sB<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>smem_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-4-118"></span>
<span id="annotated-cell-4-119">                wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>load_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>a_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> tile_ptr_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> BLOCK_K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-120">                wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>load_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>b_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> tile_ptr_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-121"></span>
<span id="annotated-cell-4-122">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Multiply matrices and accumulate</span></span>
<span id="annotated-cell-4-123">                wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>mma_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> a_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> b_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]);</span></span>
<span id="annotated-cell-4-124">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-125">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-126"></span>
<span id="annotated-cell-4-127">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 3. WAIT for next tile</span></span>
<span id="annotated-cell-4-128">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-4" data-target-annotation="9">9</button><span id="annotated-cell-4-129" class="code-annotation-target">            __pipeline_wait_prior<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-130">            __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-4-131">            stage <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-132">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-133">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-134">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------------------------------</span></span>
<span id="annotated-cell-4-135"></span>
<span id="annotated-cell-4-136">    __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// The __syncthreads() inside the loop is skipped on the last iteration, so sync here</span></span>
<span id="annotated-cell-4-137">   </span>
<span id="annotated-cell-4-138">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------- EPILOGUE: Store C ----------</span></span>
<span id="annotated-cell-4-139">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Size: 64 * 64 floats = 64 * 64 * 4 bytes = 16 KB, which fits easily in shared memory on modern GPUs</span></span>
<span id="annotated-cell-4-140">    __shared__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> sC<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>BLOCK_M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-4-141"></span>
<span id="annotated-cell-4-142">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Warps dump their fragments to shared memory, one 16x16 subtile at a time.</span></span>
<span id="annotated-cell-4-143">    <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-4-144">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-145">        <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-4-146">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-147">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> subtile_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sC <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>warp_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>warp_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-148">            wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>store_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>subtile_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>mem_row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-149">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-150">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-151"></span>
<span id="annotated-cell-4-152">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Wait for all threads to write to sC</span></span>
<span id="annotated-cell-4-153">    __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-4-154"></span>
<span id="annotated-cell-4-155">    <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-4" data-target-annotation="10">10</button><span id="annotated-cell-4-156" class="code-annotation-target">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> BLOCK_M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> THREAD_COUNT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-157">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-158">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-159"></span>
<span id="annotated-cell-4-160">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_row_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-161">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_col_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-162"></span>
<span id="annotated-cell-4-163">        alignas(16) half buffer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-4-164"></span>
<span id="annotated-cell-4-165">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Boundary check: fast path when all 8 columns are in bounds</span></span>
<span id="annotated-cell-4-166">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-167">            <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-4-168">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-169">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> sC<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-4-170"></span>
<span id="annotated-cell-4-171">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-172">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> old_c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]);</span></span>
<span id="annotated-cell-4-173">                    val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> old_c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-174">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> </span>
<span id="annotated-cell-4-175"></span>
<span id="annotated-cell-4-176">                buffer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __float2half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-177">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-178">            </span>
<span id="annotated-cell-4-179">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Vectorized store: 8 halves (16 bytes) at once; assumes a 16-byte-aligned address (e.g. N % 8 == 0)</span></span>
<span id="annotated-cell-4-180">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*)&amp;</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> global_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*)</span>buffer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-181"></span>
<span id="annotated-cell-4-182">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-183">            <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-4-184">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-185">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-186">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> out_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-187"></span>
<span id="annotated-cell-4-188">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> sC<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-4-189"></span>
<span id="annotated-cell-4-190">                    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-191">                        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> old_c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>out_idx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]);</span></span>
<span id="annotated-cell-4-192">                        val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> old_c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-4-193">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> </span>
<span id="annotated-cell-4-194"></span>
<span id="annotated-cell-4-195">                    C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>out_idx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __float2half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-196">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-197">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-198">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-199">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-200"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-201"></span>
<span id="annotated-cell-4-202"></span>
<span id="annotated-cell-4-203"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Same as in Tiled Matrix Multiplication</span></span>
<span id="annotated-cell-4-204">__global__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> gemm_tiled_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">...</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-205"></span>
<span id="annotated-cell-4-206"></span>
<span id="annotated-cell-4-207"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">extern</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> solve<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-208">    </span>
<span id="annotated-cell-4-209">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-210">        </span>
<span id="annotated-cell-4-211">        dim3 blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>THREAD_COUNT<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-212">        dim3 gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> BLOCK_M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-213">        gemm_buffer_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;&lt;</span>gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-214"></span>
<span id="annotated-cell-4-215">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-4-216"></span>
<span id="annotated-cell-4-217">        dim3 block<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-218">        dim3 grid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="annotated-cell-4-219">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> COARSE_FACTOR<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> COARSE_FACTOR<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">),</span> </span>
<span id="annotated-cell-4-220">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> TILE_WIDTH</span>
<span id="annotated-cell-4-221">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-222"></span>
<span id="annotated-cell-4-223">        gemm_tiled_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;&lt;</span>grid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> block<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-4-224"></span>
<span id="annotated-cell-4-225">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-4-226"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-hidden code-annotation-container-grid">
<dt data-target-cell="annotated-cell-4" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-4" data-code-lines="27" data-code-annotation="1"><strong>Warp Grid</strong>: As we have 128 threads per block, we have 4 warps per block, which we arrange in a 2x2 grid. Each block computes a 64 x 64 output tile, so we need to assign each warp a 32 x 32 output tile.</span>
</dd>
<dt data-target-cell="annotated-cell-4" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-4" data-code-lines="31" data-code-annotation="2"><strong>Data Loading</strong>: We treat the A and B tiles, 64 x 16 and 16 x 64 respectively, as linear arrays of 128 8-element vectors each (1,024 halves per tile). With 128 threads, each thread is responsible for loading one 8-half vector from each tile to shared memory.</span>
</dd>
<dt data-target-cell="annotated-cell-4" data-target-annotation="3">3</dt>
<dd>
<span data-code-cell="annotated-cell-4" data-code-lines="49" data-code-annotation="3"><strong>Accumulator Grid</strong>: The accumulator is a 2 x 2 grid of fragments because each warp is assigned a 32 x 32 output tile but a single WMMA operation only covers 16 x 16.</span>
</dd>
<dt data-target-cell="annotated-cell-4" data-target-annotation="4">4</dt>
<dd>
<span data-code-cell="annotated-cell-4" data-code-lines="74" data-code-annotation="4"><code>__pipeline_memcpy_async</code>: Instructs the Async Copy Engine to copy data from global memory to shared memory. As this is an asynchronous operation, the command returns immediately and allows us to continue with other instructions while the memory loads. We issue a vectorized load for 8 halves worth of data at once (<code>sizeof(int4)</code>).</span>
</dd>
<dt data-target-cell="annotated-cell-4" data-target-annotation="5">5</dt>
<dd>
<span data-code-cell="annotated-cell-4" data-code-lines="77" data-code-annotation="5"><code>__pipeline_commit</code>: Marks the end of a batch of copy commands. Effectively, <code>memcpy_async</code> adds the copy instruction to our shopping cart, and <code>commit</code> places the order.</span>
</dd>
<dt data-target-cell="annotated-cell-4" data-target-annotation="6">6</dt>
<dd>
<span data-code-cell="annotated-cell-4" data-code-lines="78" data-code-annotation="6"><code>__pipeline_wait_prior</code>: Since we pass in 0, we are pausing thread execution until all asynchronous loads that were issued are complete (in our case, only a single load). In any case, after this line, we have to issue <code>__syncthreads</code>, because each thread loads only its own piece of A and B, while every thread needs the complete tiles for compute.</span>
</dd>
<dt data-target-cell="annotated-cell-4" data-target-annotation="7">7</dt>
<dd>
<span data-code-cell="annotated-cell-4" data-code-lines="95" data-code-annotation="7"><strong>Writing to the Double Buffer</strong>: We load into the half of the double buffer that we’re not using in this loop iteration.</span>
</dd>
<dt data-target-cell="annotated-cell-4" data-target-annotation="8">8</dt>
<dd>
<span data-code-cell="annotated-cell-4" data-code-lines="116" data-code-annotation="8"><strong>Reading from the Double Buffer</strong>: We pull data for the WMMA operation from the half of the double buffer that is ready.</span>
</dd>
<dt data-target-cell="annotated-cell-4" data-target-annotation="9">9</dt>
<dd>
<span data-code-cell="annotated-cell-4" data-code-lines="129" data-code-annotation="9">Notice the location of this command in the main loop compared to in the prologue. We only had to issue it immediately after placing the copy command in the prologue because we needed to load the very first tile for compute. In the main loop, we don’t need to hold up threads on the copy completion until we have finished all compute for this iteration.</span>
</dd>
<dt data-target-cell="annotated-cell-4" data-target-annotation="10">10</dt>
<dd>
<span data-code-cell="annotated-cell-4" data-code-lines="156" data-code-annotation="10"><strong>Vectorized Store</strong>: In this loop, we complete our GEMM operation by taking our accumulated result of A x B, scaling it by alpha, adding it to beta * C, and finally storing it in global memory. We have 64 * 64 = 4096 total elements to process and store, and 128 threads to do this. So we must process 32 elements per thread. If we vectorize this into processing 8 elements per step, we need only 4 steps per thread. However, we have an else block here that covers the tail elements once we have fewer than 8 elements left and can’t do a vectorized store.</span>
</dd>
</dl>
</div>
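<p>The bookkeeping behind the vectorized store in annotation 10 can be checked with a small host-side sketch. The sizes come from the annotations above, but <code>store_coord</code> and <code>covers_tile_once</code> are hypothetical helpers, and the thread-to-vector mapping is an assumption made for illustration, not necessarily the indexing the kernel uses.</p>

```cpp
#include <vector>

// Sizes taken from the annotations above.
constexpr int BLOCK_M = 64;
constexpr int BLOCK_N = 64;
constexpr int THREAD_COUNT = 128;
constexpr int VEC = 8;  // halves per 16-byte vectorized store

constexpr int VECS_PER_ROW = BLOCK_N / VEC;                     // 8
constexpr int VECS_PER_TILE = BLOCK_M * BLOCK_N / VEC;          // 512
constexpr int STEPS_PER_THREAD = VECS_PER_TILE / THREAD_COUNT;  // 4

struct VecCoord {
    int row;  // row inside the 64 x 64 output tile
    int col;  // first column covered by the 8-element vector
};

// Hypothetical mapping (an assumption, not the kernel's actual indexing):
// at each step, consecutive threads take consecutive vectors of the tile.
inline VecCoord store_coord(int tid, int step) {
    int vec_id = step * THREAD_COUNT + tid;
    return {vec_id / VECS_PER_ROW, (vec_id % VECS_PER_ROW) * VEC};
}

// Returns true when every 8-element vector of the tile is written exactly once.
inline bool covers_tile_once() {
    std::vector<int> hits(VECS_PER_TILE, 0);
    for (int step = 0; step < STEPS_PER_THREAD; ++step) {
        for (int tid = 0; tid < THREAD_COUNT; ++tid) {
            VecCoord c = store_coord(tid, step);
            ++hits[c.row * VECS_PER_ROW + c.col / VEC];
        }
    }
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}
```

<p>With this mapping, consecutive threads write consecutive 16-byte vectors, so a warp’s stores stay contiguous within each row of the tile.</p>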
</section>
<section id="arithmetic-intensity-3" class="level3">
<h3 class="anchored" data-anchor-id="arithmetic-intensity-3">Arithmetic Intensity</h3>
<p>We will examine a single iteration of the main loop. We load in a 64 x 16 chunk of A and a 16 x 64 chunk of B from global memory, for a total of 2,048 halves, which is 4,096 bytes. Our output tile is 64 x 64, and on each loop iteration, we accumulate a dot product of two 16-element vectors into each element of the output tile. This dot product consists of 16 multiplications and 16 additions, so 32 FLOPs per element. In total then, we perform 64 * 64 * 32 = 131,072 FLOPs per loop iteration. Dividing this by our global memory load of 4,096 bytes, we get an arithmetic intensity of 32 FLOPs/B. This figure comes from our increased tile size, not from our double buffer, which mainly helps with hiding memory latency. So we should theoretically have two different improvements that speed up our runtime: reusing more data due to the larger tile size, and latency hiding due to the double buffer. Thankfully, the benchmarks confirm this, as we can see considerable speedup on all GPUs.</p>
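<p>The same arithmetic can be written as a short host-side sketch. The constants mirror the tile sizes quoted above; <code>arithmetic_intensity</code> and <code>print_intensity</code> are illustrative helper names, not part of the kernel. Note that the ratio simplifies to <code>BLOCK_M * BLOCK_N / (BLOCK_M + BLOCK_N)</code> FLOPs/B for half-precision inputs, so it depends only on the output tile size and not on <code>BLOCK_K</code>.</p>

```cpp
#include <cstdio>

// Tile sizes taken from the text above (BLOCK_K = 16 in this kernel).
constexpr int BLOCK_M = 64;
constexpr int BLOCK_N = 64;
constexpr int BLOCK_K = 16;
constexpr int BYTES_PER_HALF = 2;

// Bytes read from global memory per main-loop iteration: one A tile plus one B tile.
constexpr long bytes_per_iter =
    static_cast<long>(BLOCK_M * BLOCK_K + BLOCK_K * BLOCK_N) * BYTES_PER_HALF;

// FLOPs per iteration: every output element gets BLOCK_K multiplies and BLOCK_K adds.
constexpr long flops_per_iter = static_cast<long>(BLOCK_M) * BLOCK_N * 2 * BLOCK_K;

constexpr double arithmetic_intensity(long flops, long bytes) {
    return static_cast<double>(flops) / static_cast<double>(bytes);
}

static_assert(bytes_per_iter == 4096, "2,048 halves = 4,096 bytes");
static_assert(flops_per_iter == 131072, "64 * 64 * 32 FLOPs");

// Prints the per-iteration numbers for a quick sanity check.
inline void print_intensity() {
    std::printf("%ld FLOPs / %ld bytes = %.1f FLOPs/B\n", flops_per_iter, bytes_per_iter,
                arithmetic_intensity(flops_per_iter, bytes_per_iter));
}
```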
</section>
<section id="benchmarks-3" class="level3">
<h3 class="anchored" data-anchor-id="benchmarks-3">Benchmarks</h3>
<table class="table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">GPU Model</th>
<th style="text-align: left;">Memory Bandwidth</th>
<th style="text-align: left;">Peak FP16 Compute</th>
<th style="text-align: left;">Ridge Point (FLOP/Byte)</th>
<th style="text-align: left;">Runtime (ms)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA T4</strong></td>
<td style="text-align: left;">320 GB/s</td>
<td style="text-align: left;">65 TFLOPS</td>
<td style="text-align: left;">203</td>
<td style="text-align: left;">1.04</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>NVIDIA A100 (80GB)</strong></td>
<td style="text-align: left;">2,039 GB/s</td>
<td style="text-align: left;">312 TFLOPS</td>
<td style="text-align: left;">153</td>
<td style="text-align: left;">0.12</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA H100 (SXM)</strong></td>
<td style="text-align: left;">3,350 GB/s</td>
<td style="text-align: left;">989 TFLOPS</td>
<td style="text-align: left;">295</td>
<td style="text-align: left;">0.05</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>NVIDIA H200 (SXM)</strong></td>
<td style="text-align: left;">4,800 GB/s</td>
<td style="text-align: left;">989 TFLOPS</td>
<td style="text-align: left;">206</td>
<td style="text-align: left;">0.05</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA B200</strong></td>
<td style="text-align: left;">8,000 GB/s</td>
<td style="text-align: left;">2,500 TFLOPS</td>
<td style="text-align: left;">312</td>
<td style="text-align: left;">0.05</td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="swizzling" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="swizzling">5. Swizzling</h2>
<p>So far, we’ve taken pretty good advantage of NVIDIA GPU architecture. Let’s go down the checklist:</p>
<ul class="task-list">
<li><label><input type="checkbox" checked="">Compute - We’re using the Tensor Cores to perform matrix multiplication and accumulation.</label></li>
<li><label><input type="checkbox" checked="">Registers - The WMMA fragments that feed the Tensor Cores are held in registers, so we’re not slowed down by loading data from shared memory for every operation.</label></li>
<li><label><input type="checkbox" checked="">Shared memory - We’re loading in large tiles from global memory at once per block and reusing as much data as possible between warps.</label></li>
<li><label><input type="checkbox" checked="">Memory latency hiding - With our double buffer, we’re ensuring we’re computing as much of the time as possible while we wait on memory to load.</label></li>
</ul>
<p>I haven’t yet discussed caches in detail. Take a look at the below diagram.</p>
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p><img src="https://rohan-reddy.github.io/posts/001-gemm-optimization/images/02_a100_memory.png" class="img-fluid figure-img" style="width:100.0%"></p>
<figcaption class="margin-caption">Memory hierarchy of an A100-40GB <span class="citation" data-cites="memoryhierarchy">(<span>“Memory Hierarchy of GPUs”</span> 2025)</span></figcaption>
</figure>
</div>
<p>There are two types of caches on the GPU: L1 and L2. A separate L1 cache exists on each Streaming Multiprocessor and is physically shared with Shared Memory, but not logically. We can control the split between shared memory and L1 if we so choose, but we can’t control what goes into L1 like we can for shared memory. L1 is a cache, so it’s hardware-managed and caches global memory accesses automatically. It handles some level of spatial and temporal locality automatically for us.</p>
<p>We discussed spatial locality briefly in the tiled GEMM section, but didn’t put a name to it. When we retrieve data from global memory, the GPU memory controller never fetches just a few bytes. It always fetches an aligned chunk of memory called a Cache Line, typically 128 bytes, which passes through the L1 cache on its way to the thread. Ideally, all of the threads in a warp access contiguous memory addresses (e.g., Thread 0 reads address X, Thread 1 reads X + 4, etc). This is known as memory coalescing, and it reduces the number of requests the memory controller needs to make to global memory, since we are using most or all of the full Cache Line retrieved every time, rather than just a fraction. Temporal locality means that the L1 will cache recently used data until its capacity is full and it needs to evict old data. That way, in case we access the same data multiple times in a short period of time, we don’t need to retrieve it again from global memory as it is still in the cache.</p>
<p>The L2 cache functions in a similar way but is much larger and global to the whole GPU. As a tradeoff, it is also much slower to access for a thread than its local L1 cache. We are already taking advantage of locality in our L1 cache in our previous kernels by ensuring threads in a warp are reading contiguous chunks of data. But we haven’t yet taken advantage of the L2 cache. The particular insight we need is that every block has access to the L2 cache. Ideally, we would figure out a way to establish some inter-block temporal locality: after one block accesses data from global memory, other blocks executing within a short time thereafter will reuse that data before it is evicted from the L2 cache.</p>
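<p>To make coalescing concrete, here is a simplified host-side model that counts how many 128-byte cache lines one warp touches for a given access stride. It is a sketch under stated assumptions: <code>cache_lines_touched</code> is a hypothetical helper, each thread reads a single word, and real hardware tracks finer-grained sectors that this model ignores.</p>

```cpp
#include <cstdint>
#include <set>

constexpr int WARP_SIZE = 32;
constexpr std::uint64_t CACHE_LINE_BYTES = 128;  // typical line size cited in the text

// Counts the distinct 128-byte cache lines touched when thread t of a warp
// reads one word at byte address base + t * stride_bytes.
inline int cache_lines_touched(std::uint64_t base, std::uint64_t stride_bytes) {
    std::set<std::uint64_t> lines;
    for (int t = 0; t < WARP_SIZE; ++t) {
        lines.insert((base + t * stride_bytes) / CACHE_LINE_BYTES);
    }
    return static_cast<int>(lines.size());
}
```

<p>In this model, a warp reading consecutive 4-byte words from an aligned base touches a single line, while a 128-byte stride between threads touches 32 separate lines for the same amount of useful data.</p>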
<p>Let’s think about what’s happening in our standard grid and tiling logic. We defined a 2D grid of blocks, and each block corresponds to a certain output tile in the matrix C. The hardware does not guarantee an exact execution order, but blocks are typically dispatched by increasing block index, so we end up executing our blocks in roughly row-major order. Look at the first row of tiles in matrix C below.</p>
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p><img src="https://rohan-reddy.github.io/posts/001-gemm-optimization/images/03_tiled_mm.png" class="img-fluid figure-img" style="width:95.0%"></p>
<figcaption class="margin-caption">Visualization of tiled matrix multiplication <span class="citation" data-cites="tiledmatmul">(Matthes et al. 2017)</span></figcaption>
</figure>
</div>
<p>Imagine that our L2 cache can only fit 16 tiles, and that each row of tiles of A and each column of tiles of B is 4 tiles long. For each tile in that first row in C, we are repeatedly using the first row of tiles of A. However, we use a different column of B each time. By the time we’re on the fourth tile of C in the first row of tiles, we’re now loading in the fourth column of tiles of B, but we already have 16 tiles in our L2 cache (one row of tiles from A and three columns of tiles from B). So we have to evict some data to make room. We’ve been continuously reusing the first row of A, so that won’t be evicted; instead, we’ll evict the first column of B. But the next output tile we will compute for C after this one is the first tile in the second row, which would have reused the first column of B. Sadly, we just evicted it, so we’ll have to pull it from global memory again.</p>
<p>Instead, what we could do, given our L2 cache size, is split C into “newspaper columns”, each having a width of 2 tiles. We will adjust our block execution order so that we traverse the first newspaper column fully before we proceed to the second one. Now what happens? For the first two tiles of C, it’s the same logic as before. Our L2 cache now has the first row of A and first two columns of B. But now we hit the edge of our newspaper column, so we go down to the first tile in the second row of C. We load in the second row of A to the L2 cache, and now we have actually reached the cache capacity of 16 tiles. However, we are going to reuse the columns of B that are already in the L2 cache for the next two output tiles. Therefore, we loaded 16 tiles a single time from global memory and computed 4 output tiles. In the row-major order, by contrast, the first 4 output tiles needed one row of A and four different columns of B, so we loaded 20 tiles from global memory to compute 4 output tiles, and every tile in the second row of C then had to reload its column of B.</p>
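<p>A minimal host-side sketch of the “newspaper column” traversal follows. The helper names (<code>row_major_tile</code>, <code>swizzled_tile</code>, <code>tiles_loaded</code>) are hypothetical, and the counting function assumes nothing is evicted while the counted blocks run. The value <code>tiles_k = 4</code> used in the example is inferred from the 16-tile walkthrough above rather than stated in it.</p>

```cpp
#include <algorithm>
#include <set>

struct TileCoord {
    int row;  // tile row in C
    int col;  // tile column in C
};

// Row-major order: walk across a full row of C tiles before moving down.
inline TileCoord row_major_tile(int block_id, int tiles_n) {
    return {block_id / tiles_n, block_id % tiles_n};
}

// "Newspaper column" order: finish a strip of group_width tile columns from
// top to bottom before moving to the next strip. The last strip may be narrower.
inline TileCoord swizzled_tile(int block_id, int tiles_m, int tiles_n, int group_width) {
    int blocks_per_group = tiles_m * group_width;
    int group = block_id / blocks_per_group;
    int in_group = block_id % blocks_per_group;
    int first_col = group * group_width;
    int width = std::min(group_width, tiles_n - first_col);
    return {in_group / width, first_col + in_group % width};
}

// Tiles of A and B needed by the first `count` blocks, assuming each row of A
// tiles and each column of B tiles is tiles_k tiles long and nothing is evicted.
inline int tiles_loaded(bool swizzle, int count, int tiles_m, int tiles_n,
                        int tiles_k, int group_width) {
    std::set<int> rows, cols;
    for (int id = 0; id < count; ++id) {
        TileCoord t = swizzle ? swizzled_tile(id, tiles_m, tiles_n, group_width)
                              : row_major_tile(id, tiles_n);
        rows.insert(t.row);
        cols.insert(t.col);
    }
    return static_cast<int>(rows.size() + cols.size()) * tiles_k;
}
```

<p>For a 4 x 4 grid of output tiles with <code>tiles_k = 4</code> and a strip width of 2, the first four blocks need 20 tiles in row-major order and 16 tiles in swizzled order, matching the counts above. In a kernel, one way to apply such a mapping would be to compute a linear block id from <code>blockIdx</code> and remap it to a tile coordinate before any loads are issued.</p>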
<p>One way to think of this is that this is very similar to the rationale for tiled matrix multiplication. We are just adding another layer of tiling to the traversal. This block execution order is called grid swizzling and will allow us to get the most possible out of the L2 cache.</p>
<p>There is another memory bottleneck in our previous kernels that has to do with shared memory. To understand this bottleneck, we have to discuss the physical constraints of shared memory. Shared Memory is not a monolithic block of RAM. It is divided into physical banks. For the A100, Shared Memory in each SM is divided into 32 banks, each 4 bytes wide. These banks are effectively parallel lanes that the GPU can read from. The catch is that if we have multiple data requests to shared memory and these requests live in the same bank, then we have to serialize them. If the requests are each for memory in a different bank, then we can fully parallelize them.</p>
<p>Memory addresses are mapped to shared memory banks sequentially. So for 32 banks, we will have bytes 0-3 in Bank 0, 4-7 in Bank 1, …, 124-127 in Bank 31. And then bytes 128-131 wrap around and are placed in Bank 0 again. What we have been doing is defining a 2D array of shared memory that is exactly the size we need, such as a 64 x 64 array of shared memory to hold a 64 x 64 tile of half-precision float data. Since a half is 2 bytes, one row of this array consumes 128 bytes of shared memory. Therefore, when we access a row from this array, the row spans all 32 banks exactly once (two adjacent halves share each 4-byte bank word), so the request is highly parallelizable. But when we access a column from this array, it’s disastrous: every element in a column will be in the same bank! The request must be completely serialized.</p>
<p>The solution to this is shared memory swizzling: basically storing data to shared memory in a pattern that minimizes bank conflicts. In the below implementation, I use padding to add some dummy elements at the end of each row. In the above example, if we pad each row with 8 dummy halves (16 bytes, or 4 banks), then the start of the second row will be Bank 4, the start of the third row will be Bank 8, and so on. So we won’t run into extreme bank conflicts with column access. The disadvantage of this approach is that it does add a slight shared memory footprint, which can be an issue if we’re already using it heavily and near capacity. This isn’t the case for our kernel, but in production libraries the additional memory footprint is undesirable, so an approach called XOR swizzling is used instead. In XOR swizzling, the XOR operator is used (since it is computationally inexpensive) to permute the bank mapping of data based on its row and column. Modern libraries handle this swizzling for us, but since we are not using them as part of the problem constraints, I will stick with padding-based swizzling for readability.</p>
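<p>The bank arithmetic can be checked with a small host-side sketch, assuming 32 banks of 4 bytes and a row-major array of halves. <code>bank_of</code>, <code>distinct_banks_in_column</code>, and <code>xor_swizzled_chunk</code> are hypothetical helpers. The XOR function is only one illustrative permutation, treating a 128-byte row as eight 16-byte chunks, and is not taken from any particular library.</p>

```cpp
#include <set>

constexpr int NUM_BANKS = 32;
constexpr int BANK_WIDTH_BYTES = 4;
constexpr int BYTES_PER_HALF = 2;

// Bank holding element (row, col) of a row-major array of halves whose rows
// are row_stride halves long (tile width plus any padding).
inline int bank_of(int row, int col, int row_stride) {
    int byte_offset = (row * row_stride + col) * BYTES_PER_HALF;
    return (byte_offset / BANK_WIDTH_BYTES) % NUM_BANKS;
}

// Number of distinct banks hit when walking down one column for `rows` rows.
inline int distinct_banks_in_column(int rows, int col, int row_stride) {
    std::set<int> banks;
    for (int r = 0; r < rows; ++r) {
        banks.insert(bank_of(r, col, row_stride));
    }
    return static_cast<int>(banks.size());
}

// Illustrative XOR swizzle (an assumption, not the kernel's method): permute the
// 16-byte chunk index within a row using the low three bits of the row index.
inline int xor_swizzled_chunk(int row, int chunk) {
    return chunk ^ (row & 7);
}
```

<p>With an unpadded stride of 64 halves, a 32-row column walk lands in a single bank. With a stride of 72 halves, each row start shifts by 4 banks and the same walk spreads across 8 banks. A smaller pad would spread the column further, but a pad of 8 halves keeps every row 16-byte aligned, which is one plausible reason to prefer it when using 16-byte vectorized copies.</p>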
<p>There is one more slight optimization included below. We double BLOCK_K to 32 and double our number of fragments for A and B. Doubling BLOCK_K means we load twice the data at the start of our K-loop and then have a new loop wrapping our warp math that executes exactly twice. The benefit is that we’re loading more data at once and have half as many iterations in our K-loop, as it increments by 32 rather than 16, so we have to issue the <code>__syncthreads</code> and pipeline wait commands fewer times. Doubling the number of fragments means in our warp math loop, when a warp computes WMMA 4 times for its 2 x 2 grid of subtiles, we can load the necessary data into fragments all at once and then perform the MMA. Previously, we were loading into the same fragments 4 separate times.</p>
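<p>As a rough illustration of the BLOCK_K trade-off, the sketch below counts loop trips for a given K. <code>LoopCounts</code> and <code>loop_counts</code> are hypothetical names, and the model assumes K is a multiple of the block size and that synchronization cost scales with the number of K-loop iterations.</p>

```cpp
constexpr int WMMA_K = 16;  // K extent of one 16 x 16 x 16 WMMA step

struct LoopCounts {
    int k_iterations;        // trips through the main K-loop (each pays for a barrier and a pipeline wait)
    int mma_steps_per_iter;  // inner warp-math steps per trip
    int total_mma_steps;     // WMMA K-steps per accumulator over the whole loop
};

// Assumes K is a multiple of block_k and block_k is a multiple of WMMA_K.
inline LoopCounts loop_counts(int K, int block_k) {
    int iterations = K / block_k;
    int inner = block_k / WMMA_K;
    return {iterations, inner, iterations * inner};
}
```

<p>For K = 1024, moving from a block size of 16 to 32 halves the number of K-loop iterations (64 to 32) while the total number of WMMA K-steps stays at 64, so the math is unchanged and only the per-iteration overhead shrinks.</p>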
<section id="annotated-code-4" class="level3 page-columns page-full">
<h3 class="anchored" data-anchor-id="annotated-code-4">Annotated Code</h3>
<div class="column-screen-inset">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-5" style="background: #f1f3f5;"><pre class="sourceCode cpp code-annotation-code code-with-copy code-annotated"><code class="sourceCode cpp"><span id="annotated-cell-5-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_runtime.h&gt;</span></span>
<span id="annotated-cell-5-2"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_fp16.h&gt;</span></span>
<span id="annotated-cell-5-3"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;mma.h&gt;</span></span>
<span id="annotated-cell-5-4"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_pipeline_primitives.h&gt;</span></span>
<span id="annotated-cell-5-5"></span>
<span id="annotated-cell-5-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">using</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">namespace</span> nvcuda<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-7"></span>
<span id="annotated-cell-5-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- CONFIGURATION -------------</span></span>
<span id="annotated-cell-5-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> BLOCK_M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-10"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// One block computes a 64 x 64 tile of the output matrix</span></span>
<span id="annotated-cell-5-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// WMMA accumulates along K in steps of 16, but each tile load covers 32 to hide latency</span></span>
<span id="annotated-cell-5-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> WARP_SIZE <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> THREAD_COUNT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-14"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> WMMA <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-15"></span>
<span id="annotated-cell-5-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Pad the row stride to avoid bank conflicts in shared memory.</span></span>
<span id="annotated-cell-5-17"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> SMEM_PAD <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-18"></span>
<span id="annotated-cell-5-19">__global__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> gemm_swizzled_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-20"></span>
<span id="annotated-cell-5-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- GRID SWIZZLING (L2 Cache Optimization) -------------</span></span>
<span id="annotated-cell-5-22">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Remap the linear block index to a "Swizzled" 2D grid.</span></span>
<span id="annotated-cell-5-23"></span>
<span id="annotated-cell-5-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Usually 2, 4, or 8</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-5" data-target-annotation="1">1</button><span id="annotated-cell-5-25" class="code-annotation-target">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> swizzle_factor <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-26"></span>
<span id="annotated-cell-5-27">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Calculate linear block ID and grid dimensions</span></span>
<span id="annotated-cell-5-28">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> idx_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> blockIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> blockIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-29">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> grid_m_blocks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-30">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> grid_n_blocks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-31"></span>
<span id="annotated-cell-5-32">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Swizzle logic: Map linear ID to (block_row, block_col) in a localized pattern.</span></span>
<span id="annotated-cell-5-33">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// This traverses the grid in 'thick columns' of width 'swizzle_factor'</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-5" data-target-annotation="2">2</button><span id="annotated-cell-5-34" class="code-annotation-target">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> panel_number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> idx_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>swizzle_factor <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> grid_m_blocks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-35">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> block_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>idx_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> swizzle_factor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> grid_m_blocks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-36">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> block_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>idx_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> swizzle_factor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> panel_number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> swizzle_factor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-37">    </span>
<span id="annotated-cell-5-38">    <span class="co" style="color: #5E5E5E;
background-color: null;
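font-style: inherit;">// Guard for irregular grids: block_col overruns when gridDim.x is not a multiple of swizzle_factor. Tiles in that last partial panel are then not fully covered, so prefer grid widths divisible by swizzle_factor.</span></span>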
<span id="annotated-cell-5-39">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>block_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> grid_m_blocks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">||</span> block_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> grid_n_blocks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-40"></span>
<span id="annotated-cell-5-41">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Calculate offsets based on swizzled coordinates</span></span>
<span id="annotated-cell-5-42">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> block_row_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-43">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> block_col_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-44">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// --------------------------------------------------------------------</span></span>
<span id="annotated-cell-5-45"></span>
<span id="annotated-cell-5-46">    </span>
<span id="annotated-cell-5-47">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- INDEX CALCULATIONS -------------</span></span>
<span id="annotated-cell-5-48">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Linear view for data loading: which worker out of 128 threads am I?</span></span>
<span id="annotated-cell-5-49">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-50"></span>
<span id="annotated-cell-5-51">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// As we have 128 threads per block, we have 4 warps per block, which we arrange in a 2x2 grid.</span></span>
<span id="annotated-cell-5-52">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// As each block computes a 64 x 64 output tile, we need to assign each warp a 32 x 32 output tile.</span></span>
<span id="annotated-cell-5-53">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> warp_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> WARP_SIZE<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-54">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> warp_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>warp_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-55">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> warp_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>warp_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-56">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ----------------------------------------------</span></span>
<span id="annotated-cell-5-57"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-5" data-target-annotation="3">3</button><span id="annotated-cell-5-58" class="code-annotation-target"></span>
<span id="annotated-cell-5-59">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- MEMORY INITIALIZATION ----------</span></span>
<span id="annotated-cell-5-60">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Double-buffered shared memory, padded to reduce bank conflicts</span></span>
<span id="annotated-cell-5-61">    __shared__ half sA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>BLOCK_M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)];</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 64 rows, 40 cols (K + pad)</span></span>
<span id="annotated-cell-5-62">    __shared__ half sB<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)];</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 32 rows (K), 72 cols (N + pad)</span></span>
<span id="annotated-cell-5-63"></span>
<span id="annotated-cell-5-64">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Declare fragments and initialize accumulator</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-5" data-target-annotation="4">4</button><span id="annotated-cell-5-65" class="code-annotation-target">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>matrix_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> a_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-5-66">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>matrix_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> b_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-5-67">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>accumulator<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-5-68"></span>
<span id="annotated-cell-5-69">    <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-5-70">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-71">        <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-5-72">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-73">            wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fill_fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-74">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-75">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-76"></span>
<span id="annotated-cell-5-77">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Pipeline setup</span></span>
<span id="annotated-cell-5-78">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> stage <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Alternates between 0 and 1</span></span>
<span id="annotated-cell-5-79">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ----------------------------------------------</span></span>
<span id="annotated-cell-5-80"></span>
<span id="annotated-cell-5-81"></span>
<span id="annotated-cell-5-82">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- PROLOGUE -------------</span></span>
<span id="annotated-cell-5-83">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load the first tile (k=0). A: 64x32. B: 32x64.</span></span>
<span id="annotated-cell-5-84">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// We have 128 threads. We need to load 64*32 = 2048 halves per matrix.</span></span>
<span id="annotated-cell-5-85">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// So each thread must load 16 halves (32 bytes, i.e. two 16-byte int4 vectors) from each matrix.</span></span>
<span id="annotated-cell-5-86"></span>
<span id="annotated-cell-5-87">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> src_A_base <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> block_row_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-88">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> src_B_base <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> block_col_start<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-89"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-5" data-target-annotation="5">5</button><span id="annotated-cell-5-90" class="code-annotation-target">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">auto</span> load_tile_async <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[&amp;](</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> stage_idx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> k_step<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-91">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> src_A_base <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> k_step<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// The row * K offset is added in the loop below</span></span>
<span id="annotated-cell-5-92">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> src_B_base <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> k_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-93"></span>
<span id="annotated-cell-5-94">        half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> sA_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>stage_idx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-5-95">        half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> sB_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sB<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>stage_idx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-5-96"></span>
<span id="annotated-cell-5-97">        <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-5-98">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-99">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Calculate which vector of 8 halves this thread is moving</span></span>
<span id="annotated-cell-5-100">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> tid_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> THREAD_COUNT<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 0..127, then 128..255</span></span>
<span id="annotated-cell-5-101"></span>
<span id="annotated-cell-5-102">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Map linear ID to (row, col) for A (64x32)</span></span>
<span id="annotated-cell-5-103">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Width is 32 (4 vectors of 8 halves).</span></span>
<span id="annotated-cell-5-104">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> vec_row_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-105">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> vec_col_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>tid_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-106"></span>
<span id="annotated-cell-5-107">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>vec_row_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> BLOCK_M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-108">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Async Copy</span></span>
<span id="annotated-cell-5-109">                __pipeline_memcpy_async<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="annotated-cell-5-110">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sA_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>vec_row_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vec_col_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Padded shared ptr</span></span>
<span id="annotated-cell-5-111">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>A_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>vec_row_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vec_col_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span>                     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Global ptr</span></span>
<span id="annotated-cell-5-112">                    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">sizeof</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span></span>
<span id="annotated-cell-5-113">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-114">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-115"></span>
<span id="annotated-cell-5-116">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Map linear ID to (row, col) for B (32x64)</span></span>
<span id="annotated-cell-5-117">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Width is 64 (8 vectors of 8 halves).</span></span>
<span id="annotated-cell-5-118">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> vec_row_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-119">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> vec_col_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>tid_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-120"></span>
<span id="annotated-cell-5-121">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>vec_row_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> BLOCK_K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-122">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Async Copy</span></span>
<span id="annotated-cell-5-123">                __pipeline_memcpy_async<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="annotated-cell-5-124">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sB_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>vec_row_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vec_col_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Padded shared ptr</span></span>
<span id="annotated-cell-5-125">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>B_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>vec_row_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vec_col_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span>                     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Global ptr</span></span>
<span id="annotated-cell-5-126">                    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">sizeof</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span></span>
<span id="annotated-cell-5-127">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-128">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-129">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-130">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">};</span></span>
<span id="annotated-cell-5-131">    </span>
<span id="annotated-cell-5-132">    load_tile_async<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-133">    __pipeline_commit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-5-134">    __pipeline_wait_prior<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-135">    __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-5-136">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------------------------------</span></span>
<span id="annotated-cell-5-137"></span>
<span id="annotated-cell-5-138">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- MAIN LOOP -------------</span></span>
<span id="annotated-cell-5-139">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> BLOCK_K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-140"></span>
<span id="annotated-cell-5-141">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> k_next <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> BLOCK_K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-142"></span>
<span id="annotated-cell-5-143">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 1. LOAD the next tile asynchronously</span></span>
<span id="annotated-cell-5-144">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>k_next <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-145">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Turns 1 into 0 or 0 into 1</span></span>
<span id="annotated-cell-5-146">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> next_stage <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-147">            load_tile_async<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>next_stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> k_next<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-148">            __pipeline_commit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-5-149">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-150"></span>
<span id="annotated-cell-5-151">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 2. MATH: process the current tile. Recall we have a 2 x 2 grid of 16 x 16 subtiles for each warp.</span></span>
<span id="annotated-cell-5-152">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// BLOCK_K = 32, and WMMA accumulates 16x16x16 at a time, so this loop runs twice (k_step = 0, then 16).</span></span>
<span id="annotated-cell-5-153">        <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-5" data-target-annotation="6">6</button><span id="annotated-cell-5-154" class="code-annotation-target">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> k_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> k_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> BLOCK_K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> k_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-155">            </span>
<span id="annotated-cell-5-156">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// --- STEP A: Load Fragments into Registers (Pre-Load) ---</span></span>
<span id="annotated-cell-5-157">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Each warp computes a 32 x 32 output tile.</span></span>
<span id="annotated-cell-5-158">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// This requires 32 rows of A (2 fragments) and 32 cols of B (2 fragments).</span></span>
<span id="annotated-cell-5-159">            </span>
<span id="annotated-cell-5-160">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load the 2 fragments of Matrix A needed for this warp</span></span>
<span id="annotated-cell-5-161">            <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-5-162">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-163">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> smem_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-164">                half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> tile_ptr_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>smem_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> k_step<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-5-165">                </span>
<span id="annotated-cell-5-166">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load into specific index [i]</span></span>
<span id="annotated-cell-5-167">                wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>load_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>a_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> tile_ptr_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-168">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-169"></span>
<span id="annotated-cell-5-170">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load the 2 fragments of Matrix B needed for this warp</span></span>
<span id="annotated-cell-5-171">            <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-5-172">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-173">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> smem_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-174">                half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> tile_ptr_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sB<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>k_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> smem_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-5-175">                </span>
<span id="annotated-cell-5-176">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load into specific index [j]</span></span>
<span id="annotated-cell-5-177">                wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>load_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>b_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> tile_ptr_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-178">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-179"></span>
<span id="annotated-cell-5-180">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// --- STEP B: Compute (Reuse Registers) ---</span></span>
<span id="annotated-cell-5-181">            <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-5" data-target-annotation="7">7</button><span id="annotated-cell-5-182" class="code-annotation-target">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-183">                <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-5-184">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-185">                    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Reuse a_frag[i] and b_frag[j] multiple times</span></span>
<span id="annotated-cell-5-186">                    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>mma_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> a_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> b_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]);</span></span>
<span id="annotated-cell-5-187">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-188">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-189">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-190">       </span>
<span id="annotated-cell-5-191"></span>
<span id="annotated-cell-5-192">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 3. WAIT for next tile</span></span>
<span id="annotated-cell-5-193">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-194">            __pipeline_wait_prior<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-195">            __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-5-196">            stage <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-197">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-198">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-199">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------------------------------</span></span>
<span id="annotated-cell-5-200"></span>
<span id="annotated-cell-5-201">    __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Final barrier: the in-loop __syncthreads() is skipped on the last iteration</span></span>
<span id="annotated-cell-5-202">   </span>
<span id="annotated-cell-5-203">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------- EPILOGUE: Store C ----------</span></span>
<span id="annotated-cell-5-204">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// We need a Shared Memory buffer for the floats from the Accumulators.</span></span>
<span id="annotated-cell-5-205">    __shared__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> sC<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>BLOCK_M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-5-206"></span>
<span id="annotated-cell-5-207">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 1. Store Accumulators (Registers) -&gt; Shared Memory (Float)</span></span>
<span id="annotated-cell-5-208">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Each warp holds a 32x32 tile distributed across 2x2 fragments (16x16 each).</span></span>
<span id="annotated-cell-5-209">    <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-5-210">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-211">        <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-5-212">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-213">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Calculate where this 16x16 fragment belongs in the 64x64 block</span></span>
<span id="annotated-cell-5-214">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> row_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-215">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-216">            </span>
<span id="annotated-cell-5-217">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> smem_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sC <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_offset<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-218"></span>
<span id="annotated-cell-5-219">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Store fragment to shared memory (Stride is BLOCK_N)</span></span>
<span id="annotated-cell-5-220">            wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>store_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>smem_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>mem_row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-221">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-222">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-223"></span>
<span id="annotated-cell-5-224">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Wait for all warps to finish writing to sC</span></span>
<span id="annotated-cell-5-225">    __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-5-226"></span>
<span id="annotated-cell-5-227">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 2. Write Shared Memory (Float) -&gt; Global Memory (Half)</span></span>
<span id="annotated-cell-5-228">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Total Elements: 64 * 64 = 4096.</span></span>
<span id="annotated-cell-5-229">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Threads: 128.</span></span>
<span id="annotated-cell-5-230">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Elements per thread: 32.</span></span>
<span id="annotated-cell-5-231">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Vectors per thread: 32 / 8 = 4 vectors (int4).</span></span>
<span id="annotated-cell-5-232"></span>
<span id="annotated-cell-5-233">    <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-5-234">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-235">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Calculate the linear index for this vector of 8 elements</span></span>
<span id="annotated-cell-5-236">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Stride by THREAD_COUNT to ensure coalescing (Thread 0 takes 0..7, Thread 1 takes 8..15)</span></span>
<span id="annotated-cell-5-237">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> vec_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> THREAD_COUNT<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> </span>
<span id="annotated-cell-5-238">        </span>
<span id="annotated-cell-5-239">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> base_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> vec_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// The starting element index</span></span>
<span id="annotated-cell-5-240">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> base_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-241">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> base_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-242"></span>
<span id="annotated-cell-5-243">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_row_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-244">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_col_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-245"></span>
<span id="annotated-cell-5-246">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Boundary check: guards against out-of-bounds accesses at the edges of C</span></span>
<span id="annotated-cell-5-247">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// A vector is stored only if all 8 elements fit, so N must be a multiple of 8 (this also keeps the int4 accesses 16-byte aligned)</span></span>
<span id="annotated-cell-5-248">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-249">            </span>
<span id="annotated-cell-5-250">            __align__(16) half out_buffer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Staging buffer, 16-byte aligned so it can be stored as one int4</span></span>
<span id="annotated-cell-5-251"></span>
<span id="annotated-cell-5-252">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// OPTIONAL: Beta Handling (Load old C)</span></span>
<span id="annotated-cell-5-253">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// If beta is non-zero, we must load the existing values from Global Memory first</span></span>
<span id="annotated-cell-5-254">            __align__(16) half old_c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span> </span>
<span id="annotated-cell-5-255">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">bool</span> use_beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-256"></span>
<span id="annotated-cell-5-257">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>use_beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-258">                 <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Vectorized Load of old C</span></span>
<span id="annotated-cell-5-259">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*)</span>old_c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*)&amp;</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> global_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-5-260">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-261"></span>
<span id="annotated-cell-5-262">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Compute scaling and conversion</span></span>
<span id="annotated-cell-5-263">            <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-5-264">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-265">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Read float from Shared</span></span>
<span id="annotated-cell-5-266">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sC<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>base_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span> </span>
<span id="annotated-cell-5-267">                </span>
<span id="annotated-cell-5-268">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Apply Alpha</span></span>
<span id="annotated-cell-5-269">                val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*=</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-270"></span>
<span id="annotated-cell-5-271">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Apply Beta</span></span>
<span id="annotated-cell-5-272">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>use_beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-273">                    val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>old_c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]);</span></span>
<span id="annotated-cell-5-274">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-275"></span>
<span id="annotated-cell-5-276">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Convert to Half</span></span>
<span id="annotated-cell-5-277">                out_buffer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __float2half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-278">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-279"></span>
<span id="annotated-cell-5-280">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Vectorized Store to Global Memory</span></span>
<span id="annotated-cell-5-281">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*)&amp;</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> global_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*)</span>out_buffer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-5-282"></span>
<span id="annotated-cell-5-283">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-284">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Edge Case: Partial vector write (at the edge of the matrix)</span></span>
<span id="annotated-cell-5-285">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-286">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-287">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> sC<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>base_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-5-288">                    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-289">                        val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]);</span></span>
<span id="annotated-cell-5-290">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-291">                    C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __float2half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-292">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-293">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-294">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-295">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> </span>
<span id="annotated-cell-5-296"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-297"></span>
<span id="annotated-cell-5-298"></span>
<span id="annotated-cell-5-299"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Same as before</span></span>
<span id="annotated-cell-5-300">__global__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> gemm_tiled_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">...</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-301"></span>
<span id="annotated-cell-5-302"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">extern</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> solve<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-5" data-target-annotation="8">8</button><span id="annotated-cell-5-303" class="code-annotation-target">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-304">        dim3 blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>THREAD_COUNT<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-305">        dim3 gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> BLOCK_M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-306">        gemm_swizzled_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;&lt;</span>gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-307">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-5-308">        dim3 blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-309">        dim3 gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span></span>
<span id="annotated-cell-5-310">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> TILE_WIDTH<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span></span>
<span id="annotated-cell-5-311">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> TILE_WIDTH <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> TILE_WIDTH</span>
<span id="annotated-cell-5-312">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-313"></span>
<span id="annotated-cell-5-314">        gemm_tiled_kernel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;&lt;</span>gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-5-315">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-5-316"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-hidden code-annotation-container-grid">
<dt data-target-cell="annotated-cell-5" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-5" data-code-lines="25" data-code-annotation="1"><strong>Newspaper Panels</strong>: Our “newspaper panels” are 4 blocks wide. This is a standard, balanced choice: if the panel is too narrow, we are effectively traversing column-major, and if it is too wide, we might as well traverse row-major.</span>
</dd>
<dt data-target-cell="annotated-cell-5" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-5" data-code-lines="34" data-code-annotation="2"><strong>Grid Swizzling</strong>: Computing the panel index first tells us the column at which the left edge of the current panel starts. By the end of these few lines, we have the block’s new row and column indices in the output matrix, which traverse each newspaper panel first instead of going row-major across the whole matrix.</span>
</dd>
<dt data-target-cell="annotated-cell-5" data-target-annotation="3">3</dt>
<dd>
<span data-code-cell="annotated-cell-5" data-code-lines="58" data-code-annotation="3"><strong>Shared Memory Swizzling</strong>: This is where we swizzle the shared memory by adding padding to the shared memory declaration, which helps avoid bank conflicts.</span>
</dd>
<dt data-target-cell="annotated-cell-5" data-target-annotation="4">4</dt>
<dd>
<span data-code-cell="annotated-cell-5" data-code-lines="65" data-code-annotation="4"><strong>Doubled Fragments</strong>: We double the number of fragments so that all the data the warp needs for its math loop can be loaded into separate fragments at once, before we do all the math.</span>
</dd>
<dt data-target-cell="annotated-cell-5" data-target-annotation="5">5</dt>
<dd>
<span data-code-cell="annotated-cell-5" data-code-lines="90" data-code-annotation="5"><strong>Async Loading Lambda</strong>: We moved the async loading to a lambda for readability.</span>
</dd>
<dt data-target-cell="annotated-cell-5" data-target-annotation="6">6</dt>
<dd>
<span data-code-cell="annotated-cell-5" data-code-lines="154" data-code-annotation="6"><strong>Doubled Data Loading</strong>: Since we’re loading twice the data now per K-loop iteration, we need a new k-step loop that performs the warp math twice.</span>
</dd>
<dt data-target-cell="annotated-cell-5" data-target-annotation="7">7</dt>
<dd>
<span data-code-cell="annotated-cell-5" data-code-lines="182" data-code-annotation="7"><strong>Warp Math</strong>: Before this point, we load all the data into our 4 fragments at once. We can then simply loop 4 times, calling <code>mma_sync</code> to perform the math.</span>
</dd>
<dt data-target-cell="annotated-cell-5" data-target-annotation="8">8</dt>
<dd>
<span data-code-cell="annotated-cell-5" data-code-lines="303" data-code-annotation="8"><strong>Dimension Check</strong>: Since we doubled <code>BLOCK_K</code>, K must now be a multiple of 32, so we need to update this dimension check too. It may seem disappointing that our optimized kernel now handles even fewer matrices. Don’t worry, we’ll fix this in the next kernel!</span>
</dd>
</dl>
</div>
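<p>As a rough illustration of the panel traversal described in the annotations above, here is a minimal host-side C++ sketch that maps a linear block index to a (row, column) block coordinate. The names <code>PANEL_WIDTH</code>, <code>BlockCoord</code>, and <code>swizzle_block</code> are hypothetical and chosen for illustration. The kernel’s own indexing may differ in its details, for example in how it treats a narrower final panel.</p>
<div class="sourceCode"><pre class="sourceCode cpp"><code class="sourceCode cpp">#include &lt;algorithm&gt;

// Hypothetical panel width, matching the 4-block panels described above.
constexpr int PANEL_WIDTH = 4;

struct BlockCoord {
    int row;  // block row in the output grid
    int col;  // block column in the output grid
};

// Map a linear block id to a swizzled (row, col) coordinate.
// grid_m and grid_n are the number of block rows and block columns.
inline BlockCoord swizzle_block(int linear_id, int grid_m, int grid_n) {
    const int blocks_per_panel = PANEL_WIDTH * grid_m;    // blocks in a full-width panel
    const int panel_idx = linear_id / blocks_per_panel;   // which panel we are in
    const int panel_col_start = panel_idx * PANEL_WIDTH;  // left edge of that panel
    // The last panel may be narrower if grid_n is not a multiple of PANEL_WIDTH.
    const int width = std::min(PANEL_WIDTH, grid_n - panel_col_start);
    const int offset = linear_id % blocks_per_panel;      // position inside the panel
    // Walk across the panel first, then move down one block row.
    return BlockCoord{offset / width, panel_col_start + offset % width};
}</code></pre></div>
<p>For a grid of 3 block rows by 6 block columns, linear ids 0 through 3 cover the first row of the first panel, id 4 drops to the second row of that same panel, and id 12 starts the second (2-block-wide) panel at column 4.</p>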
</section>
<section id="arithmetic-intensity-4" class="level3">
<h3 class="anchored" data-anchor-id="arithmetic-intensity-4">Arithmetic Intensity</h3>
<p>The swizzling didn’t change our actual FLOP count or memory volume. We did double <code>BLOCK_K</code>, but that just doubled the number of FLOPs per K-loop iteration while also doubling the global memory loaded per iteration. So the arithmetic intensity is unchanged from our prior kernel: we’re still sitting at 32 FLOPs/B. Even so, we see a considerable speedup from these optimizations.</p>
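<p>The arithmetic behind that claim can be checked with a small compile-time sketch. It assumes the intensity is counted per block tile as FLOPs over bytes of A and B loaded from global memory for one K-step, with FP16 inputs (2 bytes each) and one multiply-add counted as 2 FLOPs. The helper <code>tile_intensity</code> is a hypothetical name, not part of the kernel.</p>
<div class="sourceCode"><pre class="sourceCode cpp"><code class="sourceCode cpp">// FLOPs per byte of global memory loaded for one K-step of a block tile.
// Numerator: 2 FLOPs per multiply-add over a block_m x block_n x block_k volume.
// Denominator: 2 bytes per FP16 element of the A tile plus the B tile.
constexpr double tile_intensity(int block_m, int block_n, int block_k) {
    return (2.0 * block_m * block_n * block_k) /
           (2.0 * (block_m * block_k + block_k * block_n));
}

// block_k appears in both numerator and denominator, so it cancels out.
static_assert(tile_intensity(64, 64, 16) == 32.0, "BLOCK_K = 16");
static_assert(tile_intensity(64, 64, 32) == 32.0, "BLOCK_K = 32");</code></pre></div>
<p>Under these assumptions, only <code>BLOCK_M</code> and <code>BLOCK_N</code> move the ratio, which is why doubling <code>BLOCK_K</code> leaves it at 32 FLOPs/B.</p>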
</section>
<section id="benchmarks-4" class="level3">
<h3 class="anchored" data-anchor-id="benchmarks-4">Benchmarks</h3>
<table class="table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">GPU Model</th>
<th style="text-align: left;">Memory Bandwidth</th>
<th style="text-align: left;">Peak FP16 Compute</th>
<th style="text-align: left;">Ridge Point (FLOP/Byte)</th>
<th style="text-align: left;">Runtime (ms)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA T4</strong></td>
<td style="text-align: left;">320 GB/s</td>
<td style="text-align: left;">65 TFLOPS</td>
<td style="text-align: left;">203</td>
<td style="text-align: left;">0.49</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>NVIDIA A100 (80GB)</strong></td>
<td style="text-align: left;">2,039 GB/s</td>
<td style="text-align: left;">312 TFLOPS</td>
<td style="text-align: left;">153</td>
<td style="text-align: left;">0.07</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA H100 (SXM)</strong></td>
<td style="text-align: left;">3,350 GB/s</td>
<td style="text-align: left;">989 TFLOPS</td>
<td style="text-align: left;">295</td>
<td style="text-align: left;">0.04</td>
</tr>
<tr class="even">
<td style="text-align: left;"><strong>NVIDIA H200 (SXM)</strong></td>
<td style="text-align: left;">4,800 GB/s</td>
<td style="text-align: left;">989 TFLOPS</td>
<td style="text-align: left;">206</td>
<td style="text-align: left;">0.04</td>
</tr>
<tr class="odd">
<td style="text-align: left;"><strong>NVIDIA B200</strong></td>
<td style="text-align: left;">8,000 GB/s</td>
<td style="text-align: left;">2,500 TFLOPS</td>
<td style="text-align: left;">312</td>
<td style="text-align: left;">0.03</td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="arbitrary-matrix-dimensions" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="arbitrary-matrix-dimensions">6. Arbitrary Matrix Dimensions</h2>
<p>So far, we have still needed our tiled GEMM kernel as a fallback. This next kernel modifies the prior swizzled kernel to handle input matrices of arbitrary dimensions. With boundary checks and zero-padding of shared memory, we can use WMMA 16x16x16 operations across the entire matrix. The padded zeroes are harmless: they contribute nothing to the accumulated products, so they don’t affect the final result.</p>
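<p>The idea of boundary checks plus zero padding can be sketched on the host before looking at the kernel itself. The snippet below is a simplified illustration, not the kernel’s code: <code>TILE</code>, <code>ceil_div</code>, and <code>load_tile_padded</code> are hypothetical names, and <code>float</code> stands in for <code>half</code> so that it runs without a GPU.</p>
<div class="sourceCode"><pre class="sourceCode cpp"><code class="sourceCode cpp">#include &lt;vector&gt;

// Hypothetical tile edge, matching the 16x16x16 WMMA shape.
constexpr int TILE = 16;

// Number of tiles needed to cover dim elements, rounding up.
constexpr int ceil_div(int dim, int tile) { return (dim + tile - 1) / tile; }

// Copy the TILE x TILE tile whose top-left corner is (row0, col0) out of a
// row-major rows x cols matrix. Out-of-range elements are left as zero.
inline std::vector&lt;float&gt; load_tile_padded(const std::vector&lt;float&gt;&amp; src,
                                           int rows, int cols,
                                           int row0, int col0) {
    std::vector&lt;float&gt; tile(TILE * TILE, 0.0f);  // start fully zero-padded
    for (int r = 0; r &lt; TILE; ++r) {
        for (int c = 0; c &lt; TILE; ++c) {
            const int gr = row0 + r;  // global row
            const int gc = col0 + c;  // global column
            if (gr &lt; rows &amp;&amp; gc &lt; cols) {  // boundary check
                tile[r * TILE + c] = src[gr * cols + gc];
            }
        }
    }
    return tile;
}</code></pre></div>
<p>A 20 x 20 matrix needs <code>ceil_div(20, TILE)</code> = 2 tiles per dimension. The bottom-right tile holds only a 4 x 4 corner of real data, and every other entry is zero, so a full-size multiply over it adds nothing extra to the result.</p>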
<section id="annotated-code-5" class="level3 page-columns page-full">
<h3 class="anchored" data-anchor-id="annotated-code-5">Annotated Code</h3>
<div class="column-screen-inset">
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="annotated-cell-6" style="background: #f1f3f5;"><pre class="sourceCode cpp code-annotation-code code-with-copy code-annotated"><code class="sourceCode cpp"><span id="annotated-cell-6-1"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_runtime.h&gt;</span></span>
<span id="annotated-cell-6-2"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_fp16.h&gt;</span></span>
<span id="annotated-cell-6-3"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;mma.h&gt;</span></span>
<span id="annotated-cell-6-4"><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#include </span><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">&lt;cuda_pipeline_primitives.h&gt;</span></span>
<span id="annotated-cell-6-5"></span>
<span id="annotated-cell-6-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">using</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">namespace</span> nvcuda<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-7"></span>
<span id="annotated-cell-6-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- CONFIGURATION -------------</span></span>
<span id="annotated-cell-6-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> BLOCK_M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-10"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// One block computes a 64 x 64 tile of the output matrix</span></span>
<span id="annotated-cell-6-11"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Accumulation step will be in terms of 16 but we load 32 at once to hide latency</span></span>
<span id="annotated-cell-6-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> WARP_SIZE <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> THREAD_COUNT <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">128</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-14"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> WMMA <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-15"></span>
<span id="annotated-cell-6-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Pad to avoid bank conflicts in shared memory.</span></span>
<span id="annotated-cell-6-17"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">constexpr</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> SMEM_PAD <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-18"></span>
<span id="annotated-cell-6-19">__global__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> gemm_swizzled_all<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-20"></span>
<span id="annotated-cell-6-21">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- GRID SWIZZLING (L2 Cache Optimization) -------------</span></span>
<span id="annotated-cell-6-22">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Remap the linear block index to a "Swizzled" 2D grid.</span></span>
<span id="annotated-cell-6-23"></span>
<span id="annotated-cell-6-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Usually 2, 4, or 8</span></span>
<span id="annotated-cell-6-25">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> swizzle_factor <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-26"></span>
<span id="annotated-cell-6-27">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Calculate linear block ID and grid dimensions</span></span>
<span id="annotated-cell-6-28">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> idx_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> blockIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> blockIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-29">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> grid_m_blocks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-30">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> grid_n_blocks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-31"></span>
<span id="annotated-cell-6-32">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Swizzle logic: Map linear ID to (block_row, block_col) in a localized pattern.</span></span>
<span id="annotated-cell-6-33">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// This traverses the grid in 'thick columns' of width 'swizzle_factor'</span></span>
<span id="annotated-cell-6-34">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> panel_number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> idx_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>swizzle_factor <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> grid_m_blocks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-35">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> block_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>idx_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> swizzle_factor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> grid_m_blocks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-36">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> block_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>idx_linear <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> swizzle_factor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> panel_number <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> swizzle_factor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-37">    </span>
<span id="annotated-cell-6-38">    <span class="co" style="color: #5E5E5E;
background-color: null;
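font-style: inherit;">// Safety check for irregular grids: if gridDim.x is not a multiple of swizzle_factor, blocks mapped past the last column exit here (the tiles they would have covered are left uncomputed)</span></span>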
<span id="annotated-cell-6-39">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>block_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> grid_m_blocks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">||</span> block_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> grid_n_blocks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-40"></span>
<span id="annotated-cell-6-41">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Calculate offsets based on swizzled coordinates</span></span>
<span id="annotated-cell-6-42">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> block_row_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-43">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> block_col_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-44">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// --------------------------------------------------------------------</span></span>
<span id="annotated-cell-6-45"></span>
<span id="annotated-cell-6-46">    </span>
<span id="annotated-cell-6-47">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- INDEX CALCULATIONS -------------</span></span>
<span id="annotated-cell-6-48">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Linear view for data loading: which worker out of 128 threads am I?</span></span>
<span id="annotated-cell-6-49">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> threadIdx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-50"></span>
<span id="annotated-cell-6-51">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// As we have 128 threads per block, we have 4 warps per block, which we arrange in a 2x2 grid.</span></span>
<span id="annotated-cell-6-52">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// As each block computes a 64 x 64 output tile, we need to assign each warp a 32 x 32 output tile.</span></span>
<span id="annotated-cell-6-53">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> warp_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> WARP_SIZE<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-54">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> warp_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>warp_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-55">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> warp_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>warp_id <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-56">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ----------------------------------------------</span></span>
<span id="annotated-cell-6-57"></span>
<span id="annotated-cell-6-58"></span>
<span id="annotated-cell-6-59">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- MEMORY INITIALIZATION ----------</span></span>
<span id="annotated-cell-6-60">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Double-buffered shared memory, padded to reduce bank conflicts</span></span>
<span id="annotated-cell-6-61">    __shared__ half sA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>BLOCK_M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)];</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 64 rows, 40 cols (K + pad)</span></span>
<span id="annotated-cell-6-62">    __shared__ half sB<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)];</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 32 rows (K), 72 cols (N + pad)</span></span>
<span id="annotated-cell-6-63"></span>
<span id="annotated-cell-6-64">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Declare fragments and initialize accumulator</span></span>
<span id="annotated-cell-6-65">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>matrix_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> a_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// x2 for K=32</span></span>
<span id="annotated-cell-6-66">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>matrix_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> b_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-6-67">    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>accumulator<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-6-68"></span>
<span id="annotated-cell-6-69">    <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-70">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-71">        <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-72">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-73">            wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>fill_fragment<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-74">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-75">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-76"></span>
<span id="annotated-cell-6-77">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Pipeline setup</span></span>
<span id="annotated-cell-6-78">    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> stage <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Alternates between 0 and 1</span></span>
<span id="annotated-cell-6-79">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ----------------------------------------------</span></span>
<span id="annotated-cell-6-80"></span>
<span id="annotated-cell-6-81"></span>
<span id="annotated-cell-6-82">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- PROLOGUE -------------</span></span>
<span id="annotated-cell-6-83">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load the first tile (k=0). A: 64x32. B: 32x64.</span></span>
<span id="annotated-cell-6-84">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// We have 128 threads. We need to load 64*32 = 2048 halves per matrix.</span></span>
<span id="annotated-cell-6-85">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// So each thread must load 16 halves (two 16-byte int4 vectors of 8 halves each) from each matrix.</span></span>
<span id="annotated-cell-6-86"></span>
<span id="annotated-cell-6-87">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> src_A_base <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> block_row_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-88">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> src_B_base <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> block_col_start<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-89"></span>
<span id="annotated-cell-6-90">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">auto</span> load_tile_async <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[&amp;](</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> stage_idx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> k_step<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-91">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> src_A_base <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> k_step<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Base pointer for this tile</span></span>
<span id="annotated-cell-6-92">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> src_B_base <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> k_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
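font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// B is row-major [K x N]: skip k_step rows</span></span>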
<span id="annotated-cell-6-93"></span>
<span id="annotated-cell-6-94">        half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> sA_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>stage_idx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-6-95">        half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> sB_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sB<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>stage_idx<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-6-96"></span>
<span id="annotated-cell-6-97">        <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-98">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-99">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> tid_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> THREAD_COUNT<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-100"></span>
<span id="annotated-cell-6-101">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// --- LOAD MATRIX A (Row-Major: [M x K]) ---</span></span>
<span id="annotated-cell-6-102">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> vec_row_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span>        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Local Row (0..63)</span></span>
<span id="annotated-cell-6-103">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> vec_col_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>tid_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Local Col (0, 8, 16, 24)</span></span>
<span id="annotated-cell-6-104">            </span>
<span id="annotated-cell-6-105">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> global_row_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_row_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vec_row_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-106">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> global_col_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> k_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vec_col_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-107"></span>
<span id="annotated-cell-6-108">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Address of the shared memory destination</span></span>
<span id="annotated-cell-6-109">            half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> dst_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sA_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>vec_row_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vec_col_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-6-110"></span>
<span id="annotated-cell-6-111">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 1. Strict bounds check: is the whole 8-half vector inside the matrix?</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-6" data-target-annotation="1">1</button><span id="annotated-cell-6-112" class="code-annotation-target">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">bool</span> a_fully_valid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_row_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_col_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-113"></span>
<span id="annotated-cell-6-114">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>a_fully_valid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-115">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Fast path: 16-byte async copy (both addresses must be 16-byte aligned)</span></span>
<span id="annotated-cell-6-116">                 __pipeline_memcpy_async<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>dst_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>A_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>vec_row_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vec_col_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">sizeof</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">));</span></span>
<span id="annotated-cell-6-117">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> </span>
<span id="annotated-cell-6-118">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-119">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Slow / edge path: copy element by element, zero-filling out-of-bounds slots</span></span>
<span id="annotated-cell-6-120">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Shared memory must hold zeros wherever the tile extends past the matrix edge</span></span>
<span id="annotated-cell-6-121">                <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-122">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-123">                    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_row_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_col_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-124">                        dst_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> A_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>vec_row_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>vec_col_a <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)];</span></span>
<span id="annotated-cell-6-125">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-126">                        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Pad with zeroes</span></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-6" data-target-annotation="2">2</button><span id="annotated-cell-6-127" class="code-annotation-target">                        dst_a<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __float2half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-128">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-129">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-130">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-131"></span>
<span id="annotated-cell-6-132">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// --- LOAD MATRIX B (Row-Major: [K x N]) ---</span></span>
<span id="annotated-cell-6-133">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> vec_row_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span>        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Local Row (0..31)</span></span>
<span id="annotated-cell-6-134">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> vec_col_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>tid_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Local Col (0..56)</span></span>
<span id="annotated-cell-6-135"></span>
<span id="annotated-cell-6-136">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> global_row_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> k_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vec_row_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-137">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> global_col_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_col_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vec_col_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-138"></span>
<span id="annotated-cell-6-139">            half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> dst_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sB_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>vec_row_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vec_col_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-6-140"></span>
<span id="annotated-cell-6-141">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">bool</span> b_fully_valid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_row_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_col_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-142"></span>
<span id="annotated-cell-6-143">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>b_fully_valid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-144">                 __pipeline_memcpy_async<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>dst_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>B_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>vec_row_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> vec_col_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">sizeof</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">));</span></span>
<span id="annotated-cell-6-145">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-146">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Edge path: same element-wise copy with zero padding as for A</span></span>
<span id="annotated-cell-6-147">                <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-148">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-149">                    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_row_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_col_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-150">                        dst_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> B_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>vec_row_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>vec_col_b <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)];</span></span>
<span id="annotated-cell-6-151">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-152">                        dst_b<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __float2half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Pad with zeroes</span></span>
<span id="annotated-cell-6-153">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-154">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-155">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-156">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-157">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">};</span></span>
<span id="annotated-cell-6-158">    </span>
<span id="annotated-cell-6-159">    load_tile_async<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-160">    __pipeline_commit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-6-161">    __pipeline_wait_prior<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-162">    __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-6-163">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------------------------------</span></span>
<span id="annotated-cell-6-164"></span>
<span id="annotated-cell-6-165">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------- MAIN LOOP -------------</span></span>
<span id="annotated-cell-6-166">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> BLOCK_K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-167"></span>
<span id="annotated-cell-6-168">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> k_next <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> BLOCK_K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-169"></span>
<span id="annotated-cell-6-170">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 1. LOAD the next tile asynchronously</span></span>
<span id="annotated-cell-6-171">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>k_next <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-172">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Flip between the two buffers: 0 becomes 1, 1 becomes 0</span></span>
<span id="annotated-cell-6-173">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> next_stage <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-174">            load_tile_async<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>next_stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> k_next<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-175">            __pipeline_commit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-6-176">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-177"></span>
<span id="annotated-cell-6-178">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 2. MATH: process the current tile. Recall we have a 2 x 2 grid of 16 x 16 subtiles for each warp.</span></span>
<span id="annotated-cell-6-179">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// BLOCK_K = 32, and WMMA accumulates 16x16x16 at a time, so we need two steps: k_step = 0 and k_step = 16.</span></span>
<span id="annotated-cell-6-180">        <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-181">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> k_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> k_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> BLOCK_K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> k_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> WMMA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-182">            </span>
<span id="annotated-cell-6-183">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// --- STEP A: Load Fragments into Registers (Pre-Load) ---</span></span>
<span id="annotated-cell-6-184">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// A Warp computes a 32x32 output tile.</span></span>
<span id="annotated-cell-6-185">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// This requires 32 rows of A (2 fragments) and 32 cols of B (2 fragments).</span></span>
<span id="annotated-cell-6-186">            </span>
<span id="annotated-cell-6-187">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load the 2 fragments of Matrix A needed for this warp</span></span>
<span id="annotated-cell-6-188">            <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-189">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-190">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> smem_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-191">                half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> tile_ptr_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sA<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>smem_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> k_step<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-6-192">                </span>
<span id="annotated-cell-6-193">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load into specific index [i]</span></span>
<span id="annotated-cell-6-194">                wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>load_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>a_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> tile_ptr_A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-195">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-196"></span>
<span id="annotated-cell-6-197">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load the 2 fragments of Matrix B needed for this warp</span></span>
<span id="annotated-cell-6-198">            <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-199">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-200">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> smem_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-201">                half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> tile_ptr_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span>sB<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>k_step <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> smem_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-6-202">                </span>
<span id="annotated-cell-6-203">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Load into specific index [j]</span></span>
<span id="annotated-cell-6-204">                wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>load_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>b_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> tile_ptr_B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> SMEM_PAD<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-205">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-206"></span>
<span id="annotated-cell-6-207">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// --- STEP B: Compute (Reuse Registers) ---</span></span>
<span id="annotated-cell-6-208">            <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-209">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-210">                <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-211">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-212">                    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Each a_frag[i] and each b_frag[j] is reused by two MMAs</span></span>
<span id="annotated-cell-6-213">                    wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>mma_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> a_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> b_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]);</span></span>
<span id="annotated-cell-6-214">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-215">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-216">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-217">       </span>
<span id="annotated-cell-6-218"></span>
<span id="annotated-cell-6-219">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 3. WAIT for the next tile's async copies to finish, then swap stages</span></span>
<span id="annotated-cell-6-220">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> BLOCK_K <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-221">            __pipeline_wait_prior<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-222">            __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-6-223">            stage <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> stage<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-224">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-225">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-226">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------------------------------------</span></span>
<span id="annotated-cell-6-227"></span>
<span id="annotated-cell-6-228">    __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Needed because the __syncthreads() inside the loop is skipped on the last iteration</span></span>
<span id="annotated-cell-6-229">   </span>
<span id="annotated-cell-6-230">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// ------- EPILOGUE: Store C ----------</span></span>
<span id="annotated-cell-6-231">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// We need a Shared Memory buffer for the floats from the Accumulators.</span></span>
<span id="annotated-cell-6-232">    __shared__ <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> sC<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>BLOCK_M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-6-233"></span>
<span id="annotated-cell-6-234">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 1. Store Accumulators (Registers) -&gt; Shared Memory (Float)</span></span>
<span id="annotated-cell-6-235">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Each warp holds a 32x32 tile distributed across 2x2 fragments (16x16 each).</span></span>
<span id="annotated-cell-6-236">    <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-237">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-238">        <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-239">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-240">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Calculate where this 16x16 fragment belongs in the 64x64 block</span></span>
<span id="annotated-cell-6-241">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> row_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>i <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-242">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> warp_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>j <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-243">            </span>
<span id="annotated-cell-6-244">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> smem_ptr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sC <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row_offset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col_offset<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-245"></span>
<span id="annotated-cell-6-246">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Store fragment to shared memory (Stride is BLOCK_N)</span></span>
<span id="annotated-cell-6-247">            wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>store_matrix_sync<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>smem_ptr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> accum_frag<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">][</span>j<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">],</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> wmma<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>mem_row_major<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-248">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-249">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-250"></span>
<span id="annotated-cell-6-251">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Wait for all warps to finish writing to sC</span></span>
<span id="annotated-cell-6-252">    __syncthreads<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">();</span></span>
<span id="annotated-cell-6-253"></span>
<span id="annotated-cell-6-254">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// 2. Write Shared Memory (Float) -&gt; Global Memory (Half)</span></span>
<span id="annotated-cell-6-255">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Total Elements: 64 * 64 = 4096.</span></span>
<span id="annotated-cell-6-256">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Threads: 128.</span></span>
<span id="annotated-cell-6-257">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Elements per thread: 32.</span></span>
<span id="annotated-cell-6-258">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Vectors per thread: 32 / 8 = 4 vectors (one int4 = 16 bytes = 8 halves).</span></span>
<span id="annotated-cell-6-259"></span>
<span id="annotated-cell-6-260">    <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-261">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> v<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-262">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Calculate the linear index for this vector of 8 elements</span></span>
<span id="annotated-cell-6-263">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Stride by THREAD_COUNT to ensure coalescing (Thread 0 takes 0..7, Thread 1 takes 8..15)</span></span>
<span id="annotated-cell-6-264">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> vec_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> v <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> THREAD_COUNT<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> </span>
<span id="annotated-cell-6-265">        </span>
<span id="annotated-cell-6-266">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> base_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> vec_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// The starting element index</span></span>
<span id="annotated-cell-6-267">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> base_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-268">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> base_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-269"></span>
<span id="annotated-cell-6-270">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_row_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> row<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-271">        <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_col_start <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-272"></span>
<span id="annotated-cell-6-273">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Boundary and alignment check for the vectorized path</span></span>
<span id="annotated-cell-6-274">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// The whole vector of 8 must fit, and N % 8 == 0 keeps the int4 access 16-byte aligned</span></span>
<span id="annotated-cell-6-275">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;&amp;</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-276">            </span>
<span id="annotated-cell-6-277">            half out_buffer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Register buffer for formatting</span></span>
<span id="annotated-cell-6-278"></span>
<span id="annotated-cell-6-279">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// OPTIONAL: Beta Handling (Load old C)</span></span>
<span id="annotated-cell-6-280">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// If beta is non-zero, we must load the existing values from Global Memory first</span></span>
<span id="annotated-cell-6-281">            half old_c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span> </span>
<span id="annotated-cell-6-282">            <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">bool</span> use_beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-283"></span>
<span id="annotated-cell-6-284">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>use_beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-285">                 <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Vectorized Load of old C</span></span>
<span id="annotated-cell-6-286">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*)</span>old_c <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*)&amp;</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> global_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-6-287">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-288"></span>
<span id="annotated-cell-6-289">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Compute scaling and conversion</span></span>
<span id="annotated-cell-6-290">            <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">#pragma unroll</span></span>
<span id="annotated-cell-6-291">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-292">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Read float from Shared</span></span>
<span id="annotated-cell-6-293">                <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sC<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>base_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span> </span>
<span id="annotated-cell-6-294">                </span>
<span id="annotated-cell-6-295">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Apply Alpha</span></span>
<span id="annotated-cell-6-296">                val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*=</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-297"></span>
<span id="annotated-cell-6-298">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Apply Beta</span></span>
<span id="annotated-cell-6-299">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>use_beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-300">                    val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>old_c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]);</span></span>
<span id="annotated-cell-6-301">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-302"></span>
<span id="annotated-cell-6-303">                <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Convert to Half</span></span>
<span id="annotated-cell-6-304">                out_buffer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __float2half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-305">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-306"></span>
<span id="annotated-cell-6-307">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Vectorized Store to Global Memory</span></span>
<span id="annotated-cell-6-308">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*)&amp;</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> global_col<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*(</span>int4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*)</span>out_buffer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span></span>
<span id="annotated-cell-6-309"></span>
<button class="code-annotation-anchor" data-target-cell="annotated-cell-6" data-target-annotation="3">3</button><span id="annotated-cell-6-310" class="code-annotation-target">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-311">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">// Edge Case: Partial vector write (at the edge of the matrix)</span></span>
<span id="annotated-cell-6-312">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">++)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-313">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-314">                    <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> sC<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>base_idx <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">];</span></span>
<span id="annotated-cell-6-315">                    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">f</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-316">                        val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> beta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> __half2float<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]);</span></span>
<span id="annotated-cell-6-317">                    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-318">                    C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span>global_row <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> global_col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> __float2half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>val<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-319">                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-320">            <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-321">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-322">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span> </span>
<span id="annotated-cell-6-323"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span>
<span id="annotated-cell-6-324"></span>
<span id="annotated-cell-6-325"></span>
<span id="annotated-cell-6-326"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">extern</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">void</span> solve<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">const</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> half<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">int</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">float</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span></span>
<span id="annotated-cell-6-327">    dim3 blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>THREAD_COUNT<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-328">    dim3 gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">((</span>N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> BLOCK_N <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> BLOCK_N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">(</span>M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> BLOCK_M <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">)</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> BLOCK_M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-329">    gemm_swizzled_all<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;&lt;&lt;</span>gridDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> blockDim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;(</span>A<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> B<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> C<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> M<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> N<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">,</span> beta<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">);</span></span>
<span id="annotated-cell-6-330"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span></span><div class="code-annotation-gutter-bg"></div><div class="code-annotation-gutter"></div></code></pre></div></div>
<dl class="code-annotation-container-hidden code-annotation-container-grid">
<dt data-target-cell="annotated-cell-6" data-target-annotation="1">1</dt>
<dd>
<span data-code-cell="annotated-cell-6" data-code-lines="112" data-code-annotation="1"><strong>Valid flag</strong>: We check if we are far enough within bounds to still do a vectorized load, or if we would go beyond the edges of the input matrices. If we’re far enough within bounds, we can issue our <code>pipeline_memcpy_async</code> command as before.</span>
</dd>
<dt data-target-cell="annotated-cell-6" data-target-annotation="2">2</dt>
<dd>
<span data-code-cell="annotated-cell-6" data-code-lines="127" data-code-annotation="2"><strong>Zero padding</strong>: If we’re too close to the edge of the matrix, we loop element by element, loading from global memory where we’re still in bounds and padding with zeroes wherever we’re not.</span>
</dd>
<dt data-target-cell="annotated-cell-6" data-target-annotation="3">3</dt>
<dd>
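<span data-code-cell="annotated-cell-6" data-code-lines="310" data-code-annotation="3"><strong>Boundary check for writing</strong>: The epilogue applies the same kind of boundary check as the loads. A full vector is stored only when all 8 elements fit within the row; otherwise we fall back to element-by-element writes, so padding zeroes or junk are never written to global memory.</span>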
</dd>
</dl>
</div>
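<p>To make the epilogue’s index arithmetic easier to check, here is a minimal host-side sketch in plain C++. It replays the <code>vec_idx</code>, <code>base_idx</code>, <code>row</code> and <code>col</code> calculation for every thread and confirms that the 128 threads together touch each of the 4096 tile elements exactly once. The tile constants are assumptions taken from the comments in the kernel (a 64 by 64 tile), and both helper functions, <code>covers_tile_exactly_once</code> and <code>can_use_vector_store</code>, are hypothetical names that do not appear in the kernel. The second helper spells out when a 16-byte store is in bounds and aligned, assuming <code>C</code> itself is at least 16-byte aligned, as pointers returned by <code>cudaMalloc</code> are.</p>
<div class="sourceCode"><pre class="sourceCode cpp"><code class="sourceCode cpp">#include &lt;array&gt;

// Assumed values, taken from the comments in the epilogue
// (64 * 64 = 4096 elements, 128 threads, 8 halves per int4).
constexpr int BLOCK_M = 64;
constexpr int BLOCK_N = 64;
constexpr int THREAD_COUNT = 128;
constexpr int VEC = 8;              // half elements per 16-byte vector
constexpr int VECS_PER_THREAD = 4;  // (64 * 64) / 128 / 8

// Hypothetical helper: replays the epilogue's index arithmetic on the host
// and reports whether every tile element is covered exactly once.
bool covers_tile_exactly_once() {
    std::array&lt;int, BLOCK_M * BLOCK_N&gt; hits{};  // zero-initialized counters
    for (int tid = 0; tid &lt; THREAD_COUNT; ++tid) {
        for (int v = 0; v &lt; VECS_PER_THREAD; ++v) {
            int vec_idx = tid + v * THREAD_COUNT;
            int base_idx = vec_idx * VEC;
            int row = base_idx / BLOCK_N;
            int col = base_idx % BLOCK_N;
            // A vector must stay inside the tile and not straddle two rows.
            if (row &gt;= BLOCK_M || col + VEC &gt; BLOCK_N) return false;
            for (int x = 0; x &lt; VEC; ++x) ++hits[base_idx + x];
        }
    }
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}

// Hypothetical helper: true when a 16-byte store to
// C[global_row * N + global_col] is both in bounds and aligned,
// assuming the base pointer C is itself 16-byte aligned.
bool can_use_vector_store(int global_row, int global_col, int M, int N) {
    bool in_bounds = global_row &lt; M &amp;&amp; global_col + VEC - 1 &lt; N;
    long long offset = static_cast&lt;long long&gt;(global_row) * N + global_col;
    bool aligned = offset % VEC == 0;
    return in_bounds &amp;&amp; aligned;
}
</code></pre></div>
<p>Calling <code>covers_tile_exactly_once()</code> from a small test program is a cheap way to catch an off-by-one in the stride before launching the kernel. The alignment predicate is deliberately per element offset rather than per matrix, so it also accepts the occasional aligned row when <code>N</code> is not a multiple of 8.</p>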
</section>
<section id="arithmetic-intensity-5" class="level3">
<h3 class="anchored" data-anchor-id="arithmetic-intensity-5">Arithmetic Intensity</h3>
<p>We’re not performing any more FLOPs or global memory accesses than the previous kernel. However, we now avoid the tiled GEMM kernel entirely, so in practice our overall arithmetic intensity across all test cases will be closer to the optimized kernel’s 32 FLOPs/B. Straggler test cases with awkward dimensions are no longer funneled to the tiled kernel, which only had 8 FLOPs/B.</p>
</section>
<section id="benchmarks-5" class="level3">
<h3 class="anchored" data-anchor-id="benchmarks-5">Benchmarks</h3>
<p>I omit the benchmark table here because the runtimes matched the prior kernel on the LeetGPU test suite, plus or minus some run-to-run variation. This is consistent with my understanding that the reported runtime is for one particular test case, which was probably already compatible with the WMMA dimension checks in the previous kernel.</p>
</section>
</section>
<section id="final-performance-analysis" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="final-performance-analysis">Final Performance Analysis</h2>
<p>The graph below shows the final runtime of each kernel on each GPU.</p>
<div id="cell-fig-gpu-optimization" class="cell" data-message="false" data-execution_count="1">
<div class="cell-output cell-output-display">
<div id="fig-gpu-optimization" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center" data-cap-location="bottom">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-gpu-optimization-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://rohan-reddy.github.io/posts/001-gemm-optimization/index_files/figure-html/fig-gpu-optimization-output-1.png" class="quarto-figure quarto-figure-center figure-img" width="655" height="562">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-gpu-optimization-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: GPU Runtime by Kernel Optimization Step
</figcaption>
</figure>
</div>
</div>
</div>
<p>LeetGPU has a leaderboard for each GPU for the GEMM problem, as well as a list of public solutions ordered by runtime. The leaderboard considers both private and public solutions (whether your solutions are public is a user preference; I left mine public because I benefited greatly from reading others’ solutions to understand their approaches). At the time of writing, I am not in the top 3 on the leaderboard for most of the GPUs, but my solution is in the top 5 on all of them. In particular, for the Blackwell B200, my solution sits in 1st place by a whopping 0.1 microseconds over the next best solution. Not bad!</p>
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p><img src="https://rohan-reddy.github.io/posts/001-gemm-optimization/images/05_leaderboard.png" class="img-fluid figure-img" style="width:90.0%"></p>
<figcaption class="margin-caption">“ShaderShinobi” is my pseudonym. I debated whether to omit the other leaderboard usernames for anonymity, but those usernames already look pretty pseudonymous. Also, I’m hoping that if either of those authors sees this post, they’ll contact me to nerd out about GPUs.</figcaption>
</figure>
</div>
</section>
<section id="further-optimizations" class="level2">
<h2 class="anchored" data-anchor-id="further-optimizations">Further Optimizations</h2>
<p>Because their solutions are public, though, I can say with near certainty that the author of the next best solution had a generally superior kernel. In particular, they used Warp Group MMA, a capability introduced in the Hopper generation that is much more efficient than standard WMMA. The cleanest way to use Warp Group MMA is through an external library, which the problem constraints prohibit, so I considered it out of scope for this problem. Admirably, this author went ahead and called it directly with PTX code. I assumed this would be very messy, but their solution was surprisingly nice to read. The architecture of Hopper and Blackwell GPUs is quite different from that of previous generations and more heavily optimized for GEMM operations. In a future post, I will explore Warp Group MMA, the CuTe library, and various optimizations only available on the current generation of GPUs.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0">
<div id="ref-tensorcores" class="csl-entry">
<span>“Programming Tensor Cores in CUDA 9.”</span> 2024. <em>NVIDIA Technical Blog</em>. NVIDIA. <a href="https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/">https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/</a>.
</div>
<div id="ref-pmpp" class="csl-entry">
Kirk, David B, and Wen-mei W Hwu. 2022. <em>Programming Massively Parallel Processors: A Hands-on Approach</em>. 4th ed. Morgan Kaufmann.
</div>
<div id="ref-leetgpu" class="csl-entry">
<span>“LeetGPU: Competitive GPU Programming.”</span> 2026. <a href="https://leetgpu.com" class="uri">https://leetgpu.com</a>.
</div>
<div id="ref-tiledmatmul" class="csl-entry">
Matthes, Alexander, Rene Widera, Erik Zenker, Benjamin Worpitz, Axel Huebl, and Michael Bussmann. 2017. <span>“Tuning and Optimization for a Variety of Many-Core Architectures Without Changing a Single Line of Implementation Code Using the Alpaka Library.”</span> In, 496–514. <a href="https://doi.org/10.1007/978-3-319-67630-2_36">https://doi.org/10.1007/978-3-319-67630-2_36</a>.
</div>
<div id="ref-memoryhierarchy" class="csl-entry">
<span>“Memory Hierarchy of GPUs.”</span> 2025. Arc Compute. <a href="https://www.arccompute.io/arc-blog/gpu-101-memory-hierarchy">https://www.arccompute.io/arc-blog/gpu-101-memory-hierarchy</a>.
</div>
</div>


</section>

 ]]></description>
  <category>CUDA</category>
  <category>GEMM</category>
  <category>Linear Algebra</category>
  <guid>https://rohan-reddy.github.io/posts/001-gemm-optimization/</guid>
  <pubDate>Fri, 06 Feb 2026 05:00:00 GMT</pubDate>
  <media:content url="https://rohan-reddy.github.io/posts/001-gemm-optimization/images/fig-gpu-optimization-output-1.png" medium="image" type="image/png" height="123" width="144"/>
</item>
</channel>
</rss>
