ACACES 2018 Summer School GPU Architectures: Basic to Advanced - PowerPoint PPT Presentation

ACACES 2018 Summer School GPU Architectures: Basic to Advanced Concepts Adwait Jog, Assistant Professor College of William & Mary (http://adwaitjog.github.io/)

Course Outline
• Lectures 1 and 2: Basic Concepts
  ● Basics of GPU Programming
  ● Basics of GPU Architecture
• Lecture 3: GPU Performance Bottlenecks
  ● Memory Bottlenecks
  ● Compute Bottlenecks
  ● Possible Software and Hardware Solutions
• Lecture 4: GPU Security Concerns
  ● Timing channels
  ● Possible Software and Hardware Solutions

Era of Heterogeneous Architectures Intel Coffee Lake and AMD Raven Ridge Kaby Lake

Discrete GPUs

Discrete GPUs + Intel Processors

Security Concerns
• GPUs may be accelerating applications that use user-sensitive data (e.g., genomics, financial)
• GPUs may be accelerating cryptographic applications (e.g., AES, RSA) and authentication algorithms on behalf of CPUs
• Given the popularity of GPUs, it is imperative to keep GPUs secure against a variety of side-channel attacks and other security vulnerabilities

Security Attacks
• A user's web activity on a GPU can be tracked by a malicious attacker who is co-located on the same card [Oakland’14]
• AES keys can be recovered by correlation timing attacks [HPCA’16]
• Accelerating attacks via GPUs [Oakland’18]
  ● Glitch: Accelerating row hammer attacks

Correlation Timing Attacks
[Figure: an outside attacker sends plaintexts to an encryption server running on the GPU (Server@GPU), receives the ciphertexts, and records the time duration of each encryption, e.g. time 1 = time stop - time start. The attacker then tests key guesses K 1, K 2, …, K i, … to find the correct key.]
  Plaintext # 1    Ciphertext # 1    time 1
  Plaintext # 2    Ciphertext # 2    time 2
  Plaintext # 3    Ciphertext # 3    time 3
  …                …                 …

Memory Access Coalescing in GPUs
[Figure: a computing unit containing a wavefront pool, a scheduler, a LD/ST unit, and a coalescing unit in front of global memory. Each wavefront holds Thread # 1 through Thread # 32.]

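Memory Access Coalescing in GPUs
[Figure: a wavefront of four threads (tid = thread id) accesses addresses 0x00, 0x04, 0x07, and 0x09. Handled one request per thread, this touches four blocks.]
  tid = 0    0x00    Block Address # 0 (0x00 0x01 0x02 0x03)
  tid = 1    0x04    Block Address # 1 (0x04 0x05 0x06 0x07)
  tid = 2    0x07    Block Address # 1 (0x04 0x05 0x06 0x07)
  tid = 3    0x09    Block Address # 2 (0x08 0x09 0x0A 0x0B)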

Memory Access Coalescing in GPUs
[Figure: the same wavefront (tid = 0 to 3, addresses 0x00, 0x04, 0x07, 0x09) passes through the coalescing unit, which merges requests to the same block. Only three blocks are accessed.]
  Block Address # 0 (0x00 0x01 0x02 0x03)
  Block Address # 1 (0x04 0x05 0x06 0x07)
  Block Address # 2 (0x08 0x09 0x0A 0x0B)
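The coalescing example on these slides can be reproduced with a few lines of Python. This is only a sketch of the idea, assuming 4-byte blocks as drawn in the figure; the function name `count_coalesced_accesses` is made up for illustration and does not model real hardware.

```python
def count_coalesced_accesses(addresses, block_size=4):
    """Number of distinct memory blocks touched by one wavefront.

    Assumption: a block covers `block_size` consecutive addresses, as in
    the slide figure (Block # 0 = 0x00-0x03, Block # 1 = 0x04-0x07, ...).
    """
    blocks = {addr // block_size for addr in addresses}
    return len(blocks)


# Addresses requested by tid = 0..3 in the slide example.
wavefront = [0x00, 0x04, 0x07, 0x09]

requests_without_coalescing = len(wavefront)                # one per thread
requests_with_coalescing = count_coalesced_accesses(wavefront)

print(requests_without_coalescing, requests_with_coalescing)  # 4 3
```

Threads 1 and 2 fall into the same block, so the coalescing unit serves them with a single access.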

AES implementation on GPU
• Symmetric encryption with a 128-bit key and 10 rounds.
• The S-box implementation involves table lookups.
• [Jiang/Fei/Kaeli, HPCA’16] demonstrated that the last round is vulnerable.

Last Round of AES on GPU
$c_j^{tid} = T_4[t_i^{tid}] \oplus k_j$

Last Round of AES on GPU
$c_j^{tid} = T_4[t_i^{tid}] \oplus k_j$
[Figure: each of the 32 threads handles one line. Thread # n takes the input to the last round, $t_i^{n}$ (LINE # n), and issues Request # n for the table lookup $T_4[t_i^{n}]$. The 32 requests pass through the coalescing unit. Reply # n is XORed with $k_j$ to produce the ciphertext byte $c_j^{n}$.]

Correlation Timing Attack on GPU
• Goal of the attack: recover the AES key (byte by byte)
• The last round of AES is vulnerable: $c_j^{tid} = T_4[t_i^{tid}] \oplus k_j$
• The last round is invertible: $t_i^{tid} = T_4^{-1}[c_j^{tid} \oplus k_j]$, where $t_i^{tid}$ is the memory access of thread tid
How can an attacker calculate the number of coalesced accesses?

Attacker calculates the # of coalesced accesses
$t_i^{tid} = T_4^{-1}[c_j^{tid} \oplus k_j]$
[Figure: for a key-byte guess $k_{jm}$, the attacker XORs the ciphertext bytes $c_j^{1}, \dots, c_j^{32}$ with $k_{jm}$ and applies $T_4^{-1}$ to obtain the guessed table lookup indices $t_{i,m}^{1}, \dots, t_{i,m}^{32}$. From these indices the attacker counts the coalesced accesses $A_{jm,n}$. Is this the correct value of the key byte?]
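The computation on this slide can be sketched in Python. The sketch makes two assumptions that are not stated on the slides: the last-round table T 4 is modelled as the plain AES S-box, and 16 table entries share one memory line (`ENTRIES_PER_LINE`). The name `guessed_accesses` is hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8), computed as a^254 (0 maps to 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0


def rotl8(x, shift):
    """Rotate a byte left by `shift` bits."""
    return ((x << shift) | (x >> (8 - shift))) & 0xFF


def sbox_entry(a):
    """AES S-box: field inverse followed by the affine transformation."""
    x = gf_inv(a)
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


SBOX = [sbox_entry(a) for a in range(256)]
INV_SBOX = [0] * 256
for index, value in enumerate(SBOX):
    INV_SBOX[value] = index

ENTRIES_PER_LINE = 16  # assumption: table entries per memory line


def guessed_accesses(cipher_bytes, key_guess):
    """Coalesced accesses the attacker predicts for one key-byte guess.

    cipher_bytes holds byte j of the ciphertext produced by each thread.
    """
    indices = [INV_SBOX[c ^ key_guess] for c in cipher_bytes]
    return len({t // ENTRIES_PER_LINE for t in indices})
```

With the correct guess, the predicted indices equal the real last-round inputs, so the predicted count matches what the coalescing unit actually did. A wrong guess yields unrelated indices.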

Coalesced Accesses and Execution Time Associate the number of coalesced accesses with execution time

Finding the Correct Key Value
• Attacker encrypts N plaintexts on the server
  ● Records the ciphertext and execution time of each
• For each key-byte guess, the calculated numbers of coalesced accesses are correlated with the recorded execution times $E_1, E_2, \dots, E_N$
• The guess with the maximum correlation (Key Guess α) is the correct key byte

  Key guess    # of coalesced accesses                      Correlation
  0            $A_{j0,1}, A_{j0,2}, \dots, A_{j0,N}$         $Corr_{j0}$
  1            $A_{j1,1}, A_{j1,2}, \dots, A_{j1,N}$         $Corr_{j1}$
  …            …                                            …
  α            $A_{jα,1}, A_{jα,2}, \dots, A_{jα,N}$         $Corr_{jα}$ (maximum)
  …            …                                            …
  255          $A_{j255,1}, A_{j255,2}, \dots, A_{j255,N}$   $Corr_{j255}$
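The selection step in the table can be sketched as a Pearson correlation per key guess followed by an arg-max. The data below is synthetic and exists only to exercise the code: the access counts are random, the "true" guess 42 and the linear timing model are assumptions, and `best_key_guess` is a hypothetical name.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_key_guess(accesses_per_guess, exec_times):
    """Return (guess with maximum correlation, list of all correlations).

    accesses_per_guess[m][n] plays the role of A_{jm,n};
    exec_times[n] plays the role of E_n.
    """
    correlations = [pearson(row, exec_times) for row in accesses_per_guess]
    best = max(range(len(correlations)), key=correlations.__getitem__)
    return best, correlations


# Synthetic data: 256 guesses, N = 200 samples each.
rng = random.Random(0)
N = 200
accesses = [[rng.randint(1, 16) for _ in range(N)] for _ in range(256)]

TRUE_GUESS = 42  # assumption for the demo only
# Assumed timing model: time grows linearly with accesses, plus noise.
times = [100 + 10 * a + rng.gauss(0, 2) for a in accesses[TRUE_GUESS]]

best, correlations = best_key_guess(accesses, times)
print(best)
```

Returning 0.0 for a constant series also covers the case where coalescing is disabled: every sample then has the same access count, so no guess stands out.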

Simulating Timing Attack on our Set-up
[Figure: plot labels "Correct guess" and "Incorrect guesses".]

Why is Correlation Timing Attack possible?
• AES key value affects the coalesced accesses
• # coalesced accesses affects the execution time

How to mitigate Correlation Timing Attacks on GPU?
• The baseline attack leverages the deterministic nature of the coalescing mechanism
Answer: By making it harder for the attacker to correctly calculate the number of coalesced accesses

Naïve Solution
• Disable coalescing altogether?
  ● Correlation drops to ~0
  ● Correct key byte is indistinguishable
• Up to 178% performance degradation
  ● Degradation increases with plaintext size
[Figure: plot label "Correct guess".]
Naïve solution is good for security, bad for performance. It offers no tradeoff.

RCoal to mitigate the correlation timing attacks
• Targets the deterministic nature of the coalescing mechanism
  • Fixed number of subwarps (or subwavefronts)
  • Fixed sizes of subwarps (or subwavefronts)
  • Deterministic mapping of the thread elements to subwarps (or subwavefronts)

RCoal: Fixed Sized Subwarp (FSS)
[Figure: four threads (tid = 0 to 3) access addresses 0x00, 0x04, 0x07, 0x09; sid = subwarp id.]
DEFAULT: number of subwarps = 1. All threads have sid = 0. The coalescing unit accesses three blocks: 0x00-0x03, 0x04-0x07, 0x08-0x0B.
FSS: number of subwarps = 2. Threads are split between sid = 0 and sid = 1, and each subwarp is coalesced separately. Four blocks are accessed: 0x00-0x03, 0x04-0x07, 0x04-0x07, 0x08-0x0B.
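The FSS figure can be checked with a small Python sketch. It assumes 4-byte blocks as drawn, equal-sized subwarps filled in thread order, and a thread count divisible by the number of subwarps; `fss_accesses` is a hypothetical name, not taken from the RCoal work.

```python
def fss_accesses(addresses, num_subwarps, block_size=4):
    """Total block accesses when each fixed-size subwarp is coalesced alone."""
    if len(addresses) % num_subwarps != 0:
        raise ValueError("thread count must be divisible by num_subwarps")
    size = len(addresses) // num_subwarps
    total = 0
    for start in range(0, len(addresses), size):
        subwarp = addresses[start:start + size]
        # Coalescing only merges requests inside the same subwarp.
        total += len({addr // block_size for addr in subwarp})
    return total


threads = [0x00, 0x04, 0x07, 0x09]
print(fss_accesses(threads, 1))  # 3  (DEFAULT)
print(fss_accesses(threads, 2))  # 4  (FSS with two subwarps)
print(fss_accesses(threads, 4))  # 4  (one thread per subwarp: no coalescing)
```

With two subwarps, tid = 1 and tid = 2 land in different subwarps, so their shared block 0x04-0x07 is fetched twice.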

FSS Security against Baseline Attack
• Correlation between the number of coalesced accesses and the execution time drops
• Correct key byte is harder to find
• Improved security

FSS Performance
• Memory accesses increase with the number of subwarps
• Execution time increases with the number of subwarps
• Performance degrades as the number of subwarps increases
Can the attacker still recover the AES key?

FSS against FSS attack
• Attacker can figure out the number of subwarps
• Attacker can calculate per-subwarp accesses
[Figure: plot label "Correct guess".]

FSS against FSS attack
• Attack is possible when the attacker can figure out the number of subwarps!
  ● Coalescing is still deterministic

RCoal to mitigate the correlation timing attacks
• Targets the deterministic nature of the coalescing mechanism
  • Fixed number of subwarps
  • Fixed sizes of subwarps
  • Deterministic mapping of the thread elements to subwarps

RCoal: Random Sized Subwarp (RSS)
• Size distribution
  Normal Distribution
  • Mean of the distribution is the same as FSS
  • Security and performance similar to FSS
  Skewed Distribution
  • Mean of the distribution is different from FSS
  • Large subwarp offers better coalescing
  • Improved security compared to FSS
  • Improved performance compared to FSS
We select RSS with Skewed Distribution

RCoal to mitigate the correlation timing attacks
• Targets the deterministic nature of the coalescing mechanism
  • Fixed number of subwarps
  • Fixed sizes of subwarps
  • Deterministic mapping of the thread elements to subwarps
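A rough Python sketch of random-sized subwarps follows. The slides do not specify how the skewed or normal distributions are sampled, so this sketch simply draws random cut points, which is an assumption made for illustration. Both function names are hypothetical.

```python
import random


def random_subwarp_sizes(warp_size, num_subwarps, rng):
    """Split warp_size threads into num_subwarps non-empty random-sized groups.

    Assumption: sizes come from uniformly chosen cut points, not from the
    skewed or normal distributions discussed on the slide.
    """
    cuts = sorted(rng.sample(range(1, warp_size), num_subwarps - 1))
    bounds = [0] + cuts + [warp_size]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


def rss_accesses(addresses, sizes, block_size=4):
    """Total block accesses when threads fill the subwarps in order."""
    total = 0
    start = 0
    for size in sizes:
        subwarp = addresses[start:start + size]
        total += len({addr // block_size for addr in subwarp})
        start += size
    return total


rng = random.Random(2018)
sizes = random_subwarp_sizes(32, 4, rng)
print(sizes, sum(sizes))

# Same four-thread example as the FSS slide, with uneven subwarps.
print(rss_accesses([0x00, 0x04, 0x07, 0x09], [3, 1]))  # 3
```

Because the sizes change from run to run, an attacker who knows only the number of subwarps can no longer reproduce the per-subwarp access counts exactly.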

RCoal: Random-Threaded Subwarp (RTS)
[Figure: four threads (tid = 0 to 3) access addresses 0x00, 0x01, 0x06, 0x07; sid = subwarp id. Two panels, each followed by a coalescing unit and the accessed blocks 0x00-0x03 and 0x04-0x07.]
FSS: number of subwarps = 2
FSS+RTS: number of subwarps = 2

RCoal: Random-Threaded Subwarp (RTS)
[Figure: four threads access addresses 0x00, 0x01, 0x06, 0x08; sid = subwarp id. Two panels, each followed by a coalescing unit and the accessed blocks among 0x00-0x03, 0x04-0x07, 0x08-0x0B. One panel lists the threads in order (tid = 0, 1, 2, 3), the other in shuffled order (tid = 2, 0, 1, 3).]
RSS: number of subwarps = 2
RSS+RTS: number of subwarps = 2
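The effect of randomizing the thread-to-subwarp mapping can be sketched in Python. This is one reading of the RTS slides, with made-up names (`rts_accesses`, `random_permutation`) and 4-byte blocks as in the figures. It is not the hardware mechanism itself.

```python
import random


def random_permutation(n, rng):
    """Random order in which thread ids are dealt to subwarps."""
    order = list(range(n))
    rng.shuffle(order)
    return order


def rts_accesses(addresses, sizes, permutation, block_size=4):
    """Block accesses when threads are placed into subwarps in permuted order.

    sizes gives the subwarp sizes (fixed for FSS+RTS, random for RSS+RTS).
    """
    reordered = [addresses[tid] for tid in permutation]
    total = 0
    start = 0
    for size in sizes:
        subwarp = reordered[start:start + size]
        total += len({addr // block_size for addr in subwarp})
        start += size
    return total


threads = [0x00, 0x01, 0x06, 0x07]

# In-order mapping (plain FSS): {0x00, 0x01} and {0x06, 0x07}.
print(rts_accesses(threads, [2, 2], [0, 1, 2, 3]))  # 2

# One possible random mapping: {0x00, 0x06} and {0x01, 0x07}.
print(rts_accesses(threads, [2, 2], [0, 2, 1, 3]))  # 4

rng = random.Random(7)
print(rts_accesses(threads, [2, 2], random_permutation(4, rng)))
```

The same addresses now yield different access counts depending on a mapping the attacker cannot observe.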

Evaluation Set-up
• AES-128
• Plaintext with 32 lines
• GPGPU-Sim
  ● 15 SMs, 32 threads/warp, one subwarp per coalescing unit (base case)
  ● GDDR5 memory with 6 MCs, 16 DRAM banks, 4 bank-groups/MC
• Enhanced attack algorithms
  ● Corresponding attacks

Performance/Security Trade-off
[Figure, two panels, both plotted against the number of subwarps (1, 2, 4, 8, 16, 32) for FSS, FSS+RTS, RSS, and RSS+RTS. Top panel: Security, measured as correlation (lower is better). Bottom panel: Execution time (lower is better).]
Offers Security/Performance Trade-off
