S6349 - XMP LIBRARY INTERNALS Niall Emmart University of Massachusetts
Follow on to S6151 – XMP: An NVIDIA CUDA – Accelerated Big Integer Library
Niall Emmart University of Massachusetts Follow on to S6151 XMP: An - - PowerPoint PPT Presentation
High Performance Modular Exponentiation: A^K mod P, where A, K and P are hundreds to thousands of bits.
Applications: data centers, content-centric networking.
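A minimal host-side sketch (not XMP's implementation) of the core operation: left-to-right square-and-multiply modular exponentiation, shown with single-word operands for illustration. XMP performs the same algorithm on multi-word values.

```c
#include <stdint.h>

/* Square-and-multiply A^K mod P. Valid for moduli below 2^32, so the
 * intermediate products fit in 64 bits; the library's job is to do
 * exactly this when A, K and P are hundreds to thousands of bits. */
static uint64_t modexp(uint64_t a, uint64_t k, uint64_t p) {
    uint64_t r = 1 % p;
    a %= p;
    while (k) {
        if (k & 1)
            r = (r * a) % p;   /* multiply step for a set key bit */
        a = (a * a) % p;       /* square step every iteration */
        k >>= 1;
    }
    return r;
}
```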
The Kepler multiplier is a full 32 bit by 32 bit multiplier. The basic form is: imad.x.{lo,hi}.cc d, a, b, c; It selects the low or high word of the 64 bit product a*b and adds c, with carry in and carry out.
[Diagram: 4x4 schoolbook multiplication of A3..A0 by B3..B0. Each row j contributes the low halves L(AiBj) and the high halves H(AiBj) of the 32x32 products, accumulated with a one-word offset per row. Use madc.lo.cc and madc.hi.cc to chain the additions.]
The Maxwell multiplier is a 16 bit by 16 bit multiplier. The basic form is: xmad.x.cc d, a.{h0|h1}, b.{h0|h1}, c; It selects a half word from a and a half word from b, computes the 32 bit full product, and adds c, with carry in and carry out. Consider the simple case of computing A*B where A and B are each 32 bits, to generate a 64 bit result: it takes a lot of work to integrate the half word aligned products.
With A = AH:AL and B = BH:BL split into 16 bit halves, A*B = AL*BL + ((AL*BH + AH*BL) << 16) + (AH*BH << 32). The two cross products AL*BH and AH*BL are half word aligned.
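The decomposition can be checked with a short host-side model (illustrative, not device code): four 16x16 products reassembled into the exact 64 bit result.

```c
#include <stdint.h>

/* 32x32 -> 64 multiply built from 16x16 half-word products, mirroring
 * what the Maxwell XMAD hardware works with:
 *   A*B = AL*BL + ((AL*BH + AH*BL) << 16) + (AH*BH << 32)           */
static uint64_t mul32x32(uint32_t a, uint32_t b) {
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;
    uint64_t lo    = (uint64_t)al * bl;                      /* word aligned */
    uint64_t cross = (uint64_t)al * bh + (uint64_t)ah * bl;  /* half-word aligned */
    uint64_t hi    = (uint64_t)ah * bh;                      /* aligned at bit 32 */
    return lo + (cross << 16) + (hi << 32);
}
```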
[Diagram: the 16 bit partial products for A3..A0 times B1:B0, grouped into B0 terms and B1 terms. For each Bj the grid holds AiL*BjL, AiH*BjL, AiL*BjH and AiH*BjH for i = 0..3. Green terms (AiL*BjL and AiH*BjH) are full word aligned; red terms (AiH*BjL and AiL*BjH) are half word aligned.]
Sum the red terms, shift the sum left 16 bits using PRMT, then add in the green terms. Roughly 4N^2 instructions.
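The 16 bit shift via PRMT can be modeled on the host (a sketch, not XMP code): a byte-permute that merges adjacent words acts as a funnel shift across the multiword red-term sum.

```c
#include <stdint.h>

/* Host emulation of CUDA's __byte_perm (PTX PRMT): select four bytes
 * from the 8-byte pair {hi:lo} by the 4-bit indices in s. Simplified:
 * PRMT's sign-replication modes (selector nibbles with the high bit
 * set) are ignored here. */
static uint32_t byte_perm(uint32_t lo, uint32_t hi, uint32_t s) {
    uint64_t pair = ((uint64_t)hi << 32) | lo;
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t idx = (s >> (4 * i)) & 0x7;
        r |= (uint32_t)((pair >> (8 * idx)) & 0xFF) << (8 * i);
    }
    return r;
}

/* Shift an n-word value left by 16 bits: selector 0x5432 combines the
 * high half of the previous word with the low half of the current
 * word, which is the role PRMT plays when aligning the half-word
 * aligned (red) terms before the green terms are added. */
static void shift_left_16(uint32_t *w, int n) {
    for (int i = n - 1; i > 0; i--)
        w[i] = byte_perm(w[i - 1], w[i], 0x5432);
    w[0] <<= 16;
}
```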
[Diagram: squaring A3..A0. By symmetry the off-diagonal products AiAj with i != j (red) each appear twice: compute the red values once, double the sum, and add in the grey diagonal AiAi values.]
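The squaring optimization in miniature (host-side sketch, 16 bit halves of one word): compute each off-diagonal product once, double it, and add the diagonal terms.

```c
#include <stdint.h>

/* x^2 from half words: the cross product xl*xh appears twice in the
 * schoolbook grid, so compute it once and double it (shift by 16+1),
 * then add the diagonal terms xl*xl and xh*xh. */
static uint64_t square32(uint32_t x) {
    uint32_t xl = x & 0xFFFFu, xh = x >> 16;
    uint64_t diag  = (uint64_t)xl * xl + ((uint64_t)xh * xh << 32);
    uint64_t cross = (uint64_t)xl * xh;   /* red term, computed once */
    return diag + (cross << 17);          /* doubled and aligned     */
}
```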
Register layout for Montgomery multiplication: the A value occupies N registers (A0 .. An-1) and the B value N registers (B0 .. Bn-1). Step 1, multiply: form the 2N-word product AB0 .. AB2n-1, overwriting B. Step 2, reduce: apply Montgomery reduction to bring the product back to N words.
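A word-level Montgomery reduction sketch on the host (a classic CIOS-style formulation, assumed here; not necessarily XMP's exact code). It assumes an odd modulus p with p below half of its n-word range, so carries stay inside the 2n-word buffer.

```c
#include <stdint.h>

/* np0 = -p[0]^(-1) mod 2^32 by Newton iteration on the odd low word;
 * each step doubles the number of correct bits. */
static uint32_t mont_np0(uint32_t p0) {
    uint32_t inv = p0;                  /* correct to 3 bits for odd p0 */
    for (int i = 0; i < 5; i++)
        inv *= 2 - p0 * inv;
    return (uint32_t)(0u - inv);
}

/* Reduce the 2n-word value t modulo the n-word odd modulus p, writing
 * t * 2^(-32n) mod p into r (t is destroyed). Each pass picks q so
 * that word i of the running sum becomes zero. */
static void mont_redc(uint32_t *r, uint32_t *t, const uint32_t *p,
                      int n, uint32_t np0) {
    for (int i = 0; i < n; i++) {
        uint32_t q = t[i] * np0;        /* makes t[i] vanish */
        uint64_t carry = 0;
        for (int j = 0; j < n; j++) {
            uint64_t s = (uint64_t)t[i + j] + (uint64_t)q * p[j] + carry;
            t[i + j] = (uint32_t)s;
            carry = s >> 32;
        }
        for (int j = i + n; carry && j < 2 * n; j++) {
            uint64_t s = (uint64_t)t[j] + carry;
            t[j] = (uint32_t)s;
            carry = s >> 32;
        }
    }
    for (int j = 0; j < n; j++)         /* result is in the top n words */
        r[j] = t[n + j];
    int ge = 1;                         /* one conditional subtraction  */
    for (int j = n - 1; j >= 0; j--)
        if (r[j] != p[j]) { ge = r[j] > p[j]; break; }
    if (ge) {
        uint64_t borrow = 0;
        for (int j = 0; j < n; j++) {
            uint64_t s = (uint64_t)r[j] - p[j] - borrow;
            r[j] = (uint32_t)s;
            borrow = (s >> 32) & 1;
        }
    }
}
```

Feeding it a value already in Montgomery form (aR with R = 2^32n) times a plain value b yields a*b mod p directly.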
[Diagram: the same L/H partial-product grid for A3..A0 times B3..B0, traversed two ways. Row oriented: N accesses across all of A. Column oriented: N accesses across all of A and all of B.]
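The two traversals can be contrasted in a host-side sketch (illustrative, not XMP code; N fixed at 4 words). The row version streams across A once per B word; the column version sums one anti-diagonal per output word.

```c
#include <stdint.h>
#define N 4

/* Row oriented: for each word of B, the inner loop sweeps all of A and
 * accumulates one row of partial products into the result. */
static void mul_row(uint32_t r[2 * N], const uint32_t a[N], const uint32_t b[N]) {
    for (int i = 0; i < 2 * N; i++) r[i] = 0;
    for (int j = 0; j < N; j++) {
        uint64_t carry = 0;
        for (int i = 0; i < N; i++) {
            uint64_t s = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)s;
            carry = s >> 32;
        }
        r[j + N] = (uint32_t)carry;
    }
}

/* Column oriented: each output word sums one anti-diagonal, touching a
 * word of A and a word of B per term; a wide accumulator carries the
 * running column sum forward. */
static void mul_col(uint32_t r[2 * N], const uint32_t a[N], const uint32_t b[N]) {
    unsigned __int128 acc = 0;
    for (int k = 0; k < 2 * N - 1; k++) {
        for (int i = 0; i < N; i++) {
            int j = k - i;
            if (j >= 0 && j < N)
                acc += (unsigned __int128)a[i] * b[j];
        }
        r[k] = (uint32_t)acc;
        acc >>= 32;
    }
    r[2 * N - 1] = (uint32_t)acc;
}
```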
A = 1024 bits, split into four 256-bit digits (Digit 0 .. Digit 3); likewise B = 1024 bits as four 256-bit digits.
[Chart: XMP speedup versus precision (256, 512, 1024, 2048, 4096 bits), y-axis roughly 5x to 25x, comparing M2090, K10, K40m, K80, M6000 and M60 GPUs against an E5-2698 v3 baseline. The E5-2698 v3 is a 16-core Xeon at 2.3 GHz.]
Unfortunately, there is no single PTX instruction that corresponds to a Maxwell XMAD. However, we can define a sequence of instructions:
__device__ __forceinline__
void XMADLL(uint32_t &d, uint32_t a, uint32_t b, uint32_t c) {
  asm volatile ("{\n\t"
                ".reg .u16 al, ah, bl, bh;\n\t"
                "mov.b32 {al,ah}, %1;\n\t"
                "mov.b32 {bl,bh}, %2;\n\t"
                "mul.wide.u16 %0, al, bl;\n\t"
                "add.u32 %0, %0, %3;\n\t"
                "}"
                : "=r"(d) : "r"(a), "r"(b), "r"(c));
}
As of CUDA 8, the compiler recognizes this and related sequences and converts them into a single XMAD instruction on Maxwell.
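A host-side model of the sequence above can help check device results (a hypothetical helper, not part of XMP): XMADLL computes the low 16 bits of a times the low 16 bits of b, plus c, modulo 2^32.

```c
#include <stdint.h>

/* Host model of the XMADLL PTX sequence: mul.wide.u16 of the low
 * halves, then a 32-bit add of c; uint32_t arithmetic gives the
 * mod 2^32 wraparound for free. */
static uint32_t xmadll_model(uint32_t a, uint32_t b, uint32_t c) {
    return (a & 0xFFFFu) * (b & 0xFFFFu) + c;
}
```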