Here we present our experience tuning double-precision
matrix-matrix multiplication (DGEMM) on the NVIDIA Fermi GPU architecture. Our optimizations
include software pipelining, vector memory operations, and instruction scheduling.
The optimization strategy is further guided by a performance model based on
micro-architecture benchmarks. Our best CUDA algorithm achieves performance comparable
to the latest vendor-supplied library, CUBLAS 3.2. We further improve upon this with
an implementation in the native machine language, yielding a 20% performance increase
over CUBLAS. That is, the achieved peak performance (efficiency) improves from 302 Gflop/s
(58%) to 362 Gflop/s (70%).

**A blocking algorithm of DGEMM as a baseline** The BLAS specification defines DGEMM as C := alpha*A*B + beta*C, where A, B, and C are m*k, k*n, and m*n matrices, respectively. A straightforward implementation of DGEMM is three nested loops, yet a blocking algorithm usually performs better on a processor with a memory hierarchy, because blocked matrix-matrix multiplication exploits more data reuse and achieves higher effective memory bandwidth.

For a blocking DGEMM on the GPU, the three matrices are partitioned into blocks of bm*bk, bk*bn, and bm*bn, and these blocks are laid out as grids of M*K, K*N, and M*N, where M=m/bm, N=n/bn, K=k/bk.

```
The size of the thread block: vlx*vly
Register:      accum[rx*ry]          // rx*ry is the register-blocking factor
               rA[rx], rB[ry]
Shared memory: smA[bk][bm], smB[bk][bn]
------------------------------------------------------------------------
accum[0...rx][0...ry] = 0
load one bm*bk block of A into smA[bk][bm]
load one bk*bn block of B into smB[bk][bn]
synch
while (--K > 0) {
    for (ki = 0; ki < bk; ki++) {
        load one column of A in smA into rA[0...rx]
        load one row of B in smB into rB[0...ry]
        accum[0...rx][0...ry] += rA[0...rx] * rB[0...ry]
    } // end for
    load one bm*bk block of A into smA[bk][bm]
    load one bk*bn block of B into smB[bk][bn]
    synch
} // end while
Merge accum[0...rx][0...ry] with the bm*bn block of C.
```

**Algorithm 1.** The basic framework of DGEMM routines.
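As a concrete (but hedged) illustration, Algorithm 1 might be written in CUDA roughly as follows. The tile sizes (bm=bn=64, bk=16), the 64x4 thread block, the 4x4 register blocking, the row-major layout, and the assumption that m, n, k are multiples of the tile sizes are all illustrative choices for this sketch, not necessarily the paper's exact configuration; the loop is also written in a simpler load-then-compute form rather than the preload/while form of the pseudocode.

```cuda
#include <cuda_runtime.h>

// Illustrative parameters (assumptions): 64x64 C tile, bk = 16,
// 256 threads per block (launched as dim3(64, 4)), 4x4 register blocking.
#define BM 64
#define BN 64
#define BK 16

__global__ void dgemm_baseline(int m, int n, int k, double alpha,
                               const double *A, const double *B,
                               double beta, double *C)
{
    __shared__ double smA[BK][BM];   // bm*bk block of A, transposed in smem
    __shared__ double smB[BK][BN];   // bk*bn block of B

    double accum[4][4] = {{0.0}};    // register blocking: 4x4 per thread
    double rA[4], rB[4];

    const int tid  = threadIdx.y * blockDim.x + threadIdx.x;  // 0..255
    const int tx   = (tid % 16) * 4;   // column offset of this thread's sub-tile
    const int ty   = (tid / 16) * 4;   // row offset of this thread's sub-tile
    const int row0 = blockIdx.y * BM;
    const int col0 = blockIdx.x * BN;

    for (int kt = 0; kt < k; kt += BK) {
        // Cooperative loads: 256 threads move the bm*bk and bk*bn blocks.
        for (int e = tid; e < BM * BK; e += 256)
            smA[e / BM][e % BM] = A[(row0 + e % BM) * k + kt + e / BM];
        for (int e = tid; e < BK * BN; e += 256)
            smB[e / BN][e % BN] = B[(kt + e / BN) * n + col0 + e % BN];
        __syncthreads();

        for (int ki = 0; ki < BK; ++ki) {         // bk rank-1 updates
            for (int i = 0; i < 4; ++i) rA[i] = smA[ki][ty + i];
            for (int j = 0; j < 4; ++j) rB[j] = smB[ki][tx + j];
            for (int i = 0; i < 4; ++i)
                for (int j = 0; j < 4; ++j)
                    accum[i][j] += rA[i] * rB[j];
        }
        __syncthreads();              // before smA/smB are overwritten
    }

    for (int i = 0; i < 4; ++i)       // merge with the bm*bn block of C
        for (int j = 0; j < 4; ++j) {
            double *c = &C[(row0 + ty + i) * n + col0 + tx + j];
            *c = alpha * accum[i][j] + beta * *c;
        }
}
```

A launch for this sketch would look like `dgemm_baseline<<<dim3(n/BN, m/BM), dim3(64, 4)>>>(m, n, k, alpha, dA, dB, beta, dC)`.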

**Software prefetching in registers** The cost (latency) of a global memory operation is about 100 times that of a shared memory operation, so the eight global memory operations take much longer to finish. Therefore, the top priority is hiding the latency of global memory operations. We can use software prefetching to hide this long latency: one way is to load the next block into the register file just before the current block is computed. Algorithm 2 describes the software-prefetching strategy. As shown by the green line in Figure 1, this optimization improves the efficiency to 57%, very close to the efficiency of CUBLAS 3.2.

```
The size of the thread block: vlx*vly
Register:      accum[rx*ry]          // rx*ry is the register-blocking factor
               rA[rx], rB[ry], nrA[rx], nrB[ry]
Shared memory: smA[bk][bm], smB[bk][bn]
------------------------------------------------------------------------
accum[0...rx][0...ry] = 0
load one bm*bk block of A into smA[bk][bm]
load one bk*bn block of B into smB[bk][bn]
synch
while (--K > 0) {
    prefetch one bm*bk block of A into nrA[rx]      // new
    prefetch one bk*bn block of B into nrB[ry]      // new
    for (ki = 0; ki < bk; ki++) {
        load one column of A in smA into rA[0...rx]
        load one row of B in smB into rB[0...ry]
        accum[0...rx][0...ry] += rA[0...rx] * rB[0...ry]
    } // end for
    store nrA[rx] into smA[bk][bm]                  // new
    store nrB[ry] into smB[bk][bn]                  // new
    synch
} // end while
Merge accum[0...rx][0...ry] with the bm*bn block of C.
```

**Algorithm 2.** The algorithm with software prefetching in registers. The lines marked `// new` highlight the changes relative to Algorithm 1.

**Figure 1.** The performance of our initial DGEMM routines.

However, we note that a disadvantage of Algorithm 2 is its use of extra registers: eight additional registers are temporarily needed to hold the next blocks of A and B. The increased register pressure leads to register spilling into local memory.
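To make the prefetch pattern concrete, the main loop of Algorithm 2 might look like the kernel excerpt below. This is a sketch, not the paper's code: BM=BN=64, BK=16, 256 threads per block, and the cooperative index scheme (tid, row0, col0, tx, ty, and the smA/smB/accum declarations) are assumptions carried over from the baseline sketch. A barrier is placed before the stores so that no thread overwrites shared memory another thread is still reading.

```cuda
// Main-loop excerpt showing register prefetching (Algorithm 2 pattern;
// not a complete kernel, surrounding declarations assumed as in the
// baseline sketch).
double nrA[4], nrB[4];            // the 8 extra registers for the next blocks
for (int kt = BK; kt < k; kt += BK) {
    // 1. Issue global loads for the NEXT blocks early; they stay in flight
    //    while the FMAs below run on the CURRENT blocks in shared memory.
    for (int i = 0; i < 4; ++i) {
        int e = tid + i * 256;
        nrA[i] = A[(row0 + e % BM) * k + kt + e / BM];
        nrB[i] = B[(kt + e / BN) * n + col0 + e % BN];
    }
    // 2. Compute on the current blocks (inner loop as in the baseline).
    for (int ki = 0; ki < BK; ++ki) {
        for (int i = 0; i < 4; ++i) rA[i] = smA[ki][ty + i];
        for (int j = 0; j < 4; ++j) rB[j] = smB[ki][tx + j];
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                accum[i][j] += rA[i] * rB[j];
    }
    __syncthreads();              // all threads done reading smA/smB
    // 3. Move the prefetched data from registers into shared memory.
    for (int i = 0; i < 4; ++i) {
        int e = tid + i * 256;
        smA[e / BM][e % BM] = nrA[i];
        smB[e / BN][e % BN] = nrB[i];
    }
    __syncthreads();              // next blocks now visible to all threads
}
```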

**Data thread mapping & double buffering** CUDA 3.2 on Fermi supports 128-bit load/store operations. The use of 128-bit load/store instructions increases the ratio of floating-point instructions to 256/(64+4+4+256) = 78%, which means we may achieve up to 78% of peak performance.

The use of 128-bit memory operations leads to a different data-thread mapping. If 128-bit load instructions are used, we only need 32 threads (one warp), with each thread loading two doubles (128 bits). Thus we change the thread block shape from 64*4 to 32*8.
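Assuming 16-byte-aligned data, the 128-bit transfers map onto CUDA's built-in `double2` type. As a minimal sketch (the kernel name and the 64-element row width are illustrative), one warp can move a 64-element row with a single vector load and store per thread:

```cuda
#include <cuda_runtime.h>

// One warp copies 64 consecutive doubles: each of the 32 threads issues a
// single 128-bit load and a single 128-bit store via the built-in double2
// type. src and dst must be 16-byte aligned.
__global__ void copy64_vec(const double *src, double *dst)
{
    const double2 *s = reinterpret_cast<const double2 *>(src);
    double2       *d = reinterpret_cast<double2 *>(dst);
    d[threadIdx.x] = s[threadIdx.x];  // thread t moves doubles 2t and 2t+1
}
// launch: copy64_vec<<<1, 32>>>(src, dst);
```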

**Figure 2.** Data-thread mapping for data transfer between global memory and shared memory. The picture is split into two parts by the dashed line: the left part illustrates the mapping in Algorithm 2 using 64-bit memory operations, and the right part illustrates the mapping in Algorithm 3 using 128-bit memory operations.

Software pipelining with double buffers in low-latency memory is a proven, efficient method of overlapping computation with communication through the memory hierarchy. The double-buffering algorithm is outlined in Algorithm 3. In the pseudo-code we map *smA/B[0...bk/2-1]* to buffer 0 in Figure 2 and *smA/B[bk/2...bk-1]* to buffer 1. Since computation and data transfer operate on different shared memory buffers, the memory operations can proceed in parallel with computing.

```
The size of the thread block: vlx*vly
Register:      accum[rx*ry]          // rx*ry is the register-blocking factor
               rA[rx], rB[ry]
Shared memory: smA[bk][bm], smB[bk][bn]
------------------------------------------------------------------------
accum[0...rx][0...ry] = 0
load one bm*bk/2 block of A into smA[0...bk/2-1][bm]
load one bk/2*bn block of B into smB[0...bk/2-1][bn]
synch
while (--K > 0) {
    load one bm*bk/2 block of A into smA[bk/2...bk-1][bm]
    load one bk/2*bn block of B into smB[bk/2...bk-1][bn]
    for (ki = 0; ki < bk/2; ki++) {
        load one column of A in smA[0...bk/2-1][bm] into rA[0...rx]
        load one row of B in smB[0...bk/2-1][bn] into rB[0...ry]
        accum[0...rx][0...ry] += rA[0...rx] * rB[0...ry]
    } // end for
    synch
    load one bm*bk/2 block of A into smA[0...bk/2-1][bm]
    load one bk/2*bn block of B into smB[0...bk/2-1][bn]
    for (ki = bk/2; ki < bk; ki++) {
        load one column of A in smA[bk/2...bk-1][bm] into rA[0...rx]
        load one row of B in smB[bk/2...bk-1][bn] into rB[0...ry]
        accum[0...rx][0...ry] += rA[0...rx] * rB[0...ry]
    } // end for
    synch
} // end while
Merge accum[0...rx][0...ry] with the bm*bn block of C.
```

**Algorithm 3.** The algorithm with the double-buffering strategy.
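The overlap in Algorithm 3 can be seen more concretely in a loop excerpt like the following sketch. As before, BM=BN=64, BK=16, 256 threads, and the cooperative index scheme are illustrative assumptions, not the paper's code; the two half-iterations of the pseudocode are folded into one loop over half-blocks, which is equivalent.

```cuda
// Main-loop excerpt of the double-buffered scheme (Algorithm 3 pattern;
// not a complete kernel, surrounding declarations assumed as in the
// baseline sketch). Buffer 0 = smem rows 0..BK/2-1, buffer 1 = rows
// BK/2..BK-1, mirroring Figure 2.
const int H = k / (BK / 2);                   // number of half-blocks
for (int h = 1; h < H; ++h) {
    const int buf = (h % 2) * (BK / 2);       // smem rows of the idle buffer
    const int cur = ((h - 1) % 2) * (BK / 2); // smem rows being computed on
    const int kt  = h * (BK / 2);             // k-offset of the incoming half

    // Global loads into the idle buffer are issued first, so they overlap
    // the FMAs on the current buffer below.
    for (int e = tid; e < BM * (BK / 2); e += 256)
        smA[buf + e / BM][e % BM] = A[(row0 + e % BM) * k + kt + e / BM];
    for (int e = tid; e < (BK / 2) * BN; e += 256)
        smB[buf + e / BN][e % BN] = B[(kt + e / BN) * n + col0 + e % BN];

    for (int ki = 0; ki < BK / 2; ++ki) {     // rank-1 updates on 'cur'
        for (int i = 0; i < 4; ++i) rA[i] = smA[cur + ki][ty + i];
        for (int j = 0; j < 4; ++j) rB[j] = smB[cur + ki][tx + j];
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                accum[i][j] += rA[i] * rB[j];
    }
    __syncthreads();     // incoming buffer complete; safe to swap roles
}
// epilogue: one final half-block of FMAs on the last buffer, then merge C
```

One barrier per half-block suffices here because each iteration writes only the buffer the previous iteration read from, and vice versa.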

Algorithm 3 in CUDA C achieves an efficiency of only 55%, still far from the theoretical efficiency of 78%.

**Figure 3.** The performance comparison of Algorithms 2 and 3.

**Instruction scheduling** The 128-bit memory operations have longer latency. Besides, in order to make sure that the data for the next block is fully loaded into shared memory before computation, double buffering forces us to use one more synchronization instruction in the while-loop. Since the major penalty is extra latency, we believe there is room to hide these latencies through better instruction scheduling.

More detail on the instruction-reordering process can be found in our paper.

**Figure 4.** The performance of our final version of DGEMM.

*version 1*: Based on Algorithm 3, modified to use only 128-bit memory operations, without implementing double buffering. This implementation eliminates one synchronization instruction in the while-loop.

*version 2*: Algorithm 3 translated directly into assembly code without any instruction-scheduling optimization.

*version 3*: Based on version 2, with the instructions in all inner for-loops reordered by instruction scheduling. That is, only latency hiding for shared memory accesses is optimized.

*version 4*: Based on version 3, with latency hiding for global memory accesses further optimized by instruction scheduling. This is our final version.

**Figure 5.** The incremental improvement from the optimization strategies, also compared to CUBLAS 3.2.

In the experimental evaluation, the four versions above are written in assembly code; the full details appear in our paper:

- Guangming Tan, Linchuan Li, Sean Treichler, Everett Phillips, Yungang Bao, Ninghui Sun. Fast Implementation of DGEMM on Fermi GPU. In Proceedings of the 2011 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11), Seattle, WA, USA, 2011.


**Test description**

- C-bench
  - inst-bench: measures the latency and throughput of the arithmetic pipeline
  - gmem-bench: evaluates effective global memory bandwidth under different access patterns
- sass-bench
  - inst_latency: obtains accurate latency figures for the relevant instructions
  - smem_latency: obtains accurate latency figures for shared memory loads of different widths
  - dual_issue_test: verifies the dual-issue mechanism for floating-point operations

**Source code**

### Microbenchmarks for Fermi GPU