Brian Hickmann, Dennis Bradford Motivation AI is driving - PowerPoint PPT Presentation

Brian Hickmann, Dennis Bradford

Motivation
• AI is driving development of several new matrix-multiplication accelerators
• However, the IEEE 754 standard gives significant implementation-specific flexibility in its definition of the dot product operation
  • Rounding points
  • Summation order
  • Internal format width
  • Exception reporting
• Accelerator microarchitecture details are typically not well documented
• This work details a series of experiments that can be used to better understand the design of these accelerators
  • Exploit above flexibility to gain insight into design
  • Applied this method to Tensor Core within NVIDIA V100 GPUs

Methodology
• Wanted to investigate several properties of the design:
  • Internal precision width?
  • NaN/Exception behavior?
  • Order of operations?
  • Rounding modes / locations?
  • Interconnection of design units?
  • How is the accumulator integrated?
• First explored available documentation to understand:
  • What is the SW interface?
  • What is the smallest design unit?
• Next we designed several rounds of experiments to try to answer each question
  • Test vectors always permuted values across all inputs to understand any ordering dependencies

Test Vector Examples
• Question: Order of operations?
  Test: 2^30 + -2^30 + 2^-14, or Big + -Big + Small. Depending on operation order, expect 0.0 or 2^-14 as result.
• Question: Internal precision?
  Test: 2^30 + -2^30 + 2^N, where N in {30,-28}. Expect 2^N to disappear at edge of datapath width.
• Question: Rounding points and modes?
  Test: Selected products to create various “L”, “R”, and “S” (least-significant, round, and sticky) bits with both positive and negative results.
• Question: Accumulator ordering and rounding?
  Test: Repeated above testing, introducing C accumulator value to understand order of summation and rounding.
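The order-of-operations test can be tried in software before it is run on hardware. The sketch below is an illustration, not the authors' test code: it assumes a plain left-to-right sum with a binary32 round-to-nearest-even rounding after every add (the helper names `f32` and `sequential_sum_f32` are hypothetical), and it permutes Big, -Big, Small, and 0 across all four positions, as the methodology describes.

```python
import itertools
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sequential_sum_f32(terms):
    """Left-to-right sum, rounding to binary32 after every addition."""
    acc = 0.0
    for t in terms:
        acc = f32(acc + t)
    return acc

BIG, SMALL = 2.0 ** 30, 2.0 ** -14
terms = (BIG, -BIG, SMALL, 0.0)

# Group every ordering of the four terms by the result it produces.
by_result = {}
for perm in itertools.permutations(terms):
    by_result.setdefault(sequential_sum_f32(perm), []).append(perm)

for value, perms in sorted(by_result.items()):
    print(f"result {value!r}: {len(perms)} orderings")
```

Under this assumed model, Small survives only in orderings where Big and -Big have both been added before it, so the same four values yield either 0.0 or 2^-14 depending on position.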

Volta Tensor Cores
• Each Tensor Core performs a matrix multiply-accumulate or dot-product operation
  • Input data size is FP16, accumulator is FP16 or FP32
• Exposed through CUDA “wmma” instruction
  • 16x16, 4x32, and 32x4 matrices supported
  • Wrote test software using 16x16 matrix size
• Initial testing done on smallest 4-input dot product element:
  D_0 = a_3*b_3 + a_2*b_2 + a_1*b_1 + a_0*b_0 + C_0
*Images from Volta Whitepaper
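Because the inputs are FP16, the large and small terms in the test vectors must be built as products of FP16-representable values. The following sketch shows one way to do that and computes an exact (unrounded) reference for the 4-input element. The factorisation 2^15 * 2^15 and 2^-7 * 2^-7 is an assumption chosen for illustration, and `is_fp16` and `exact_dot4` are hypothetical helper names.

```python
from fractions import Fraction
import struct

def is_fp16(x):
    """True if x survives a round trip through IEEE binary16 unchanged."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0] == x
    except OverflowError:  # magnitude too large for binary16
        return False

def exact_dot4(a, b, c):
    """Exact value of a_3*b_3 + a_2*b_2 + a_1*b_1 + a_0*b_0 + C_0 (no rounding)."""
    total = Fraction(c)
    for x, y in zip(a, b):
        total += Fraction(x) * Fraction(y)
    return total

# Products are 2^30, -2^30, 2^-14 and 0, built from FP16-representable inputs.
a = [2.0 ** 15, -(2.0 ** 15), 2.0 ** -7, 0.0]
b = [2.0 ** 15, 2.0 ** 15, 2.0 ** -7, 0.0]
assert all(is_fp16(v) for v in a + b)

reference = exact_dot4(a, b, 0.0)
print(reference == Fraction(1, 2 ** 14))
```

Any deviation of a hardware result from this exact reference then reflects a rounding, truncation, or ordering decision inside the unit.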


Possible Micro-Architectures – Chain of FMAs
Test 1: 2^30 + -2^30 + 2^-14 + 0 = 2^-14
Test 2: 2^-14 + 0 + 2^30 + -2^30 = 0
[Figure: inputs (a_0, b_0) through (a_3, b_3) feed a chain of four FMA units that produces D_0. Legend: “No Rounding”, “Output Rounded”.]

Possible Micro-Architectures – Tree of FP Adds
Test 1: 2^30 + -2^30 + 2^-14 + 0 = 2^-14
Test 2: 2^-14 + 2^30 + 0 + -2^30 = 0
[Figure: inputs (a_0, b_0) through (a_3, b_3) feed four MULT units, then two FP ADD units, then a final FP ADD that produces D_0. Legend: “No Rounding”, “Output Rounded”.]
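The expected results quoted for the two floating-point candidates (chain of FMAs and tree of FP adds) can be reproduced with a small software model. This is a sketch under stated assumptions: each product is treated as exact, and every add is rounded to binary32 with round-to-nearest-even. The function names are hypothetical.

```python
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def chain_of_fmas(products):
    """Accumulate products one at a time, rounding to binary32 after each step."""
    acc = 0.0
    for p in products:
        acc = f32(acc + p)
    return acc

def tree_of_fp_adds(products):
    """Add (p0 + p1) and (p2 + p3) first, then add the two partial sums."""
    p0, p1, p2, p3 = products
    return f32(f32(p0 + p1) + f32(p2 + p3))

BIG, SMALL = 2.0 ** 30, 2.0 ** -14

chain_test1 = chain_of_fmas([BIG, -BIG, SMALL, 0.0])
chain_test2 = chain_of_fmas([SMALL, 0.0, BIG, -BIG])
tree_test1 = tree_of_fp_adds([BIG, -BIG, SMALL, 0.0])
tree_test2 = tree_of_fp_adds([SMALL, BIG, 0.0, -BIG])

print(chain_test1, chain_test2, tree_test1, tree_test2)
```

In both models Test 1 returns 2^-14 and Test 2 returns 0.0, so a design built either way would show an order-dependent result for these vectors.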

Possible Micro-Architectures – Tree of INT Adds
Test 1: 2^30 + -2^30 + 2^-14 + 0 = 0
Test 2: 2^-14 + 2^30 + 0 + -2^30 = 0
[Figure: inputs (a_0, b_0) through (a_3, b_3) feed four MULT units, then an “ALIGN and Round/Truncate” stage, a 4:2 compressor, and an integer adder (Int ADD) that produces D_0. Legend: “No Rounding”, “Output Rounded”.]


Internal Datapath Width Experimental Results
Test (N=7): 2^30 + -2^30 + 2^7 + 0 = 2^7
Test (N=6): 2^30 + -2^30 + 2^6 + 0 = 0
[Figure: inputs (a_0, b_0) through (a_3, b_3) feed four MULT units, then “ALIGN and Round/Truncate to 24b”, a 4:2 compressor (24b), and an integer adder (Int ADD) that produces D_0. Legend: “No Rounding”, “Output Rounded”.]
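The N=7 and N=6 results are what an aligned integer datapath of 24 bits would produce. The sketch below models that idea: every product is shifted to the largest exponent in the group, bits below the kept window are dropped, and the remainders are added as integers. It is a behavioural illustration only. Truncation toward zero on a sign-magnitude value is an assumption, and `aligned_int_sum` is a hypothetical name.

```python
import math

def aligned_int_sum(products, width):
    """Align products to the largest exponent, truncate each to `width` bits,
    and add the results as integers."""
    nonzero = [p for p in products if p != 0.0]
    if not nonzero:
        return 0.0
    # math.frexp returns (m, e) with p == m * 2**e and 0.5 <= |m| < 1.
    e_max = max(math.frexp(p)[1] for p in nonzero)
    lsb = e_max - width  # exponent of the least significant bit that is kept
    total = 0
    for p in products:
        # Scaling by a power of two is exact; trunc drops bits below 2**lsb.
        total += math.trunc(math.ldexp(p, -lsb))
    return math.ldexp(float(total), lsb)

BIG = 2.0 ** 30

# Sweep the exponent of the small term and record where it still survives.
surviving = [n for n in range(30, -29, -1)
             if aligned_int_sum([BIG, -BIG, 2.0 ** n, 0.0], 24) != 0.0]
print(min(surviving))
```

With `width` set to 24 the smallest surviving exponent is 7, matching the slide, and the result no longer depends on the order of the inputs because nothing is rounded between adds.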


Best Estimate of Tensor Core Microarchitecture
• FP32 Round Test: +/- (1.0 + 2^-23 + 2^-24)
  • No round up, so result truncated
• Integer overflow test: +/- (1.0 + 1.0 + 2^-23 + 2^-24)
  • Result is normalized and truncated to 24b
  • 2nd rounding point for FP32
• FP16 Round Test: +/- (1.0 + 2^-10 + 2^-11)
  • Indicates FP16 results use RNE
• FP16 Datapath width: +/- (1.0 + 2^-10 + 2^N)
  • N in {-12, -30}, sticky bit for rounding
  • Sticky bit truncated off at 24b.
[Figure: inputs (a_0, b_0) through (a_7, b_7) feed MULT units, then “ALIGN and Truncate (24b)”, a 4:2 compressor, an integer adder (Int ADD), “Normalize / Truncate (24b)”, and “Round FP16”, producing D_0.]
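The two round tests work because each value sits exactly halfway between two representable results with an odd last bit, so round-to-nearest-even (RNE) and truncation give different answers. The sketch below checks this with exact rational arithmetic. `round_sig` is a hypothetical helper that rounds to a given number of significant bits (24 for FP32, 11 for FP16) and ignores exponent range limits.

```python
from fractions import Fraction

def round_sig(x, bits, mode):
    """Round Fraction x to `bits` significant bits using 'rne' or 'trunc'.

    Exponent range limits (overflow, subnormals) are ignored.
    """
    if mode not in ("rne", "trunc"):
        raise ValueError("mode must be 'rne' or 'trunc'")
    if x == 0:
        return Fraction(0)
    sign = -1 if x < 0 else 1
    mag = abs(x)
    # Find e such that 2**e <= mag < 2**(e + 1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - bits + 1)
    q = mag // ulp         # integer significand after truncation
    rem = mag - q * ulp    # discarded part, 0 <= rem < ulp
    if mode == "rne" and (2 * rem > ulp or (2 * rem == ulp and q % 2 == 1)):
        q += 1
    return sign * q * ulp

fp32_vec = 1 + Fraction(1, 2 ** 23) + Fraction(1, 2 ** 24)
fp16_vec = 1 + Fraction(1, 2 ** 10) + Fraction(1, 2 ** 11)

print(round_sig(fp32_vec, 24, "rne") == 1 + Fraction(1, 2 ** 22))
print(round_sig(fp32_vec, 24, "trunc") == 1 + Fraction(1, 2 ** 23))
print(round_sig(fp16_vec, 11, "rne") == 1 + Fraction(1, 2 ** 9))
print(round_sig(fp16_vec, 11, "trunc") == 1 + Fraction(1, 2 ** 10))
```

A hardware result of 1.0 + 2^-23 for the FP32 vector is therefore consistent with truncation, and 1.0 + 2^-9 for the FP16 vector is consistent with RNE.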

Test Vector Examples (continued)
• Also expanded testing to full 16x16 matrix

Best Estimate of Tensor Core Microarchitecture
[Figure: inputs (a_0, b_0) through (a_15, b_15) feed two parallel pipelines. Each pipeline consists of MULT units, “ALIGN and Truncate (24b)”, a 5:2 compressor, an integer adder (Int ADD), “Normalize / Truncate (24b)”, “Round FP16”, and an FP ADD. C_0 is the accumulator input and D_0 is the output.]

Conclusions / Future Work
• Described a testing methodology that uses software-visible inputs/outputs to explore a matrix multiplication unit microarchitecture
• Iterative testing exploits rounding modes and order of operations to gain insight into design
• Applied the methodology to Tensor Core units in NVIDIA V100 GPU
• By analyzing many rounds of testing, we were able to synthesize a detailed estimate of the design microarchitecture.
• In future work we would like to apply this same method to other designs, such as Google’s TPU

Results – Internal Architecture
• FP16 results are rounded using Round to Nearest Even
  • FP16 subnormals correctly handled
• FP32 results are rounded using truncation (Round to Zero)
  • FP32 subnormals NOT correctly handled, flushed to zero
• Internal architecture is NOT a chain of FMAs or a tree of FP adders
  • FP32 test vector (products): 2^30 + -2^30 + 2^-14, or Big + -Big + Small
  • Expect result of 0.0 or 2^-14 depending on order of summation and rounding.
  • Tensor Core results were always 0.0, which implies no internal rounding.
• Internal datapath width is truncated to 24 bits, even on integer overflow
  • By varying the exponent difference between largest and smallest product, we found that all bits after the 24th bit were truncated (not rounded) away.

Results – Top-level Architecture
• Testing for interconnection between dot-product units
  • Expanded testing to all 16 elements of A=[a_0..a_15] and B=[b_0..b_15] inputs
  • Call each dot-product result T_0, T_1, T_2, and T_3 (T_0 = [a_0, a_1, a_2, a_3] * [b_0, b_1, b_2, b_3])
• FP16: Tensor results always rounded using RNE
• FP32: Tensor results rounded with RNE or with truncation
  • Division found when inputs permuted between groups (T_0, T_1) and (T_2, T_3)
  • Implies that intermediate summation results are added with products directly
• C Matrix always added to result using RNE
• Summation order is: (C_0 + (T_0 + T_1)) + (T_2 + T_3)
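The reported summation order explains why moving a value between the groups (T_0, T_1) and (T_2, T_3) changes the result. The sketch below models only that ordering. Rounding every add to binary32 with RNE is a simplifying assumption, since the slide reports that FP32 tensor results may also be truncated, and `reported_order` is a hypothetical name.

```python
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def reported_order(c0, t):
    """(C_0 + (T_0 + T_1)) + (T_2 + T_3), rounding to binary32 after each add."""
    t0, t1, t2, t3 = t
    left = f32(c0 + f32(t0 + t1))
    right = f32(t2 + t3)
    return f32(left + right)

BIG, SMALL = 2.0 ** 30, 2.0 ** -14

# Big and -Big in the same group cancel before C_0 is added.
same_group = reported_order(SMALL, [BIG, -BIG, 0.0, 0.0])
# Big and -Big in different groups: C_0 is absorbed by Big before cancellation.
split_groups = reported_order(SMALL, [BIG, 0.0, -BIG, 0.0])

print(same_group, split_groups)
```

Under this model the small accumulator value survives when Big and -Big share a group and is lost when they are split across groups, which is the kind of difference the permutation tests look for.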
