Common Subexpression Convergence (CSC), by Sana Damani and Vivek Sarkar

SLIDE 1

Common Subexpression Convergence (CSC)

Sana Damani and Vivek Sarkar
Habanero Extreme Scale Software Research Lab, Georgia Institute of Technology

Short paper at LCPC ’19, Atlanta, GA

SLIDE 2

Agenda

  • Motivation
  • Common Subexpression Convergence Transformations
  • Approach
  • Preliminary Results and Discussion

SLIDE 3

SLIDE 4

Divergence in SIMT processors

  • SIMT (Single Instruction Multiple Threads)
      • All threads in a warp execute the same instruction in parallel
  • Divergence
      • A conditional branch dependent on thread-local values
      • Threads in the warp execute different paths
      • Serialized execution of a warp

[Figure: warp execution of a divergent branch; threads labeled by threadIdx.x, steps labeled (1) to (5)]

Image credits: https://devblogs.nvidia.com/inside-volta
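The serialization described above can be illustrated with a toy model. The sketch below is not from the slides: it assumes a hypothetical 8-thread warp and charges one issue slot per instruction for every branch path that has at least one active thread, which simplifies how real hardware schedules divergent warps.

```python
# Toy model of SIMT divergence (illustrative assumption, not the authors' code).
WARP_SIZE = 8  # hypothetical small warp; NVIDIA warps have 32 threads


def issue_slots(predicate, then_len, else_len):
    """Count warp issue slots for `if (pred) {then} else {else}`.

    Each path with at least one active thread is executed serially,
    so its instructions are issued once for the whole warp.
    """
    taken = [predicate(tid) for tid in range(WARP_SIZE)]
    slots = 0
    if any(taken):        # some thread needs the "then" path
        slots += then_len
    if not all(taken):    # some thread needs the "else" path
        slots += else_len
    return slots


# Convergent branch: every thread takes the same path.
uniform = issue_slots(lambda tid: True, then_len=3, else_len=3)
# Divergent branch: the condition depends on the thread index.
divergent = issue_slots(lambda tid: tid % 2 == 0, then_len=3, else_len=3)
print(uniform, divergent)  # 3 6
```

In this model the divergent branch costs twice as many issue slots as the uniform one, even though each thread still executes only three instructions.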

SLIDE 5

Problem: Serialization of common code

  • Divergent Code
  • Warp Execution

SLIDE 6

SLIDE 7

Hoist

  • Move to convergent common ancestor
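A minimal sketch of hoisting, written as a per-thread Python function rather than a CUDA kernel. The function names, the `tid % 2` condition, and the expression `b * c` are illustrative assumptions; the hoist is legal here only because `b` and `c` are already defined before the branch.

```python
# Per-thread view of a divergent branch (hypothetical example).
def before(tid, b, c):
    if tid % 2 == 0:      # divergent: depends on the thread index
        a = b * c         # common expression, issued during the "then" pass
        return a + 1
    else:
        a = b * c         # same expression, issued again during the "else" pass
        return a - 1


def after_hoist(tid, b, c):
    a = b * c             # hoisted to the convergent block above the branch
    if tid % 2 == 0:
        return a + 1
    else:
        return a - 1
```

Both versions return the same value for every thread; only the placement of the common expression changes.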

SLIDE 8

Sink

  • Move to convergent common successor
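A minimal sketch of sinking, again as a per-thread Python model with hypothetical names. The common store uses a value defined inside each path, so it cannot be hoisted, but it can move to the block where the paths rejoin.

```python
# Per-thread view; `out` stands in for a global array (hypothetical example).
def before(tid, out):
    if tid % 2 == 0:
        x = tid + 1
        out[tid] = x * x   # common store at the end of the "then" path
    else:
        x = tid - 1
        out[tid] = x * x   # same store at the end of the "else" path


def after_sink(tid, out):
    if tid % 2 == 0:
        x = tid + 1
    else:
        x = tid - 1
    out[tid] = x * x       # sunk to the convergent block after the branch
```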

SLIDE 9

Split

  • Move to new convergent join point
  • Duplicate conditional branch
  • Alternative solution: hoist defs/sink uses
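A minimal sketch of splitting, under the assumption that the common code sits in the middle of each path: it depends on a path-specific definition and feeds a path-specific use, so neither a plain hoist nor a plain sink applies. All names are hypothetical.

```python
# Per-thread view of a branch with common code in the middle (hypothetical).
def before(tid, b):
    if tid % 2 == 0:
        x = b + tid        # path-specific definition
        y = x * x          # common code in the middle of the path
        return y + 1       # path-specific use
    else:
        x = b - tid
        y = x * x
        return y - 1


def after_split(tid, b):
    cond = tid % 2 == 0
    if cond:
        x = b + tid
    else:
        x = b - tid
    y = x * x              # new convergent join point
    if cond:               # duplicated conditional branch
        return y + 1
    else:
        return y - 1
```

The price of the new join point is a second evaluation of the branch, which is why the slide lists hoisting the definitions or sinking the uses as an alternative.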

SLIDE 10

Operand Renaming

  • Insert copy instructions then sink/split
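One reading of this slide is that the two paths perform the same operation on differently named operands. The sketch below assumes that reading; the names `p`, `q`, and `t` are hypothetical. A copy into a shared name on each path makes the operation textually identical, so it can then be sunk.

```python
# Per-thread view: same operation, different operand names (hypothetical).
def before(tid, p, q):
    if tid % 2 == 0:
        r = p * 3          # multiply applied to p
    else:
        r = q * 3          # same multiply applied to q
    return r


def after_rename_and_sink(tid, p, q):
    if tid % 2 == 0:
        t = p              # inserted copy instruction
    else:
        t = q              # inserted copy instruction
    r = t * 3              # common multiply, now sunk to the convergent block
    return r
```

The inserted copies are themselves divergent instructions, so the transformation pays off only when the moved operation is more expensive than a copy.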

SLIDE 11

Branches

  • Flatten branch, then sink/split

SLIDE 12

Recursive CSC

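  • Bottom-Up Traversal Through CDG

[Figure: control-flow graph. The entry block defines b = ... and c = ...; branches on tid%2 and tid%3 (edges labeled T and F) lead to three blocks that each compute a=b*c.]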
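A minimal sketch of the recursive case. The exact nesting in the figure is not recoverable from the slide text, so the layout below (the `tid % 3` branch nested inside one side of the `tid % 2` branch) and the values added after `a = b * c` are assumptions. The inner branch is handled first, then the outer one, mirroring a bottom-up traversal.

```python
# Per-thread view of nested divergent branches (hypothetical layout).
def before(tid, b, c):
    if tid % 2:
        if tid % 3:
            a = b * c
            return a + 1
        else:
            a = b * c
            return a + 2
    else:
        a = b * c
        return a + 3


def after_inner(tid, b, c):
    # Step 1: converge the two copies under the inner branch (tid % 3).
    if tid % 2:
        a = b * c
        if tid % 3:
            return a + 1
        else:
            return a + 2
    else:
        a = b * c
        return a + 3


def after_outer(tid, b, c):
    # Step 2: each side of the outer branch (tid % 2) now holds one copy,
    # so the expression can be hoisted to the entry block.
    a = b * c
    if tid % 2:
        if tid % 3:
            return a + 1
        return a + 2
    return a + 3
```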

SLIDE 13

Common Loops

  • Loop distribution
  • Index set splitting
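A minimal sketch of how index set splitting might expose common loop iterations. The slide does not spell out the usage, so this is one plausible reading: both paths run the same loop body over different trip counts `n` and `m` (hypothetical names), and the shared prefix of the iteration space is split off into a convergent loop.

```python
# Per-thread view: same loop body, different bounds per path (hypothetical).
def before(tid, n, m):
    total = 0
    if tid % 2 == 0:
        for i in range(n):
            total += i * i
    else:
        for i in range(m):
            total += i * i
    return total


def after_index_set_split(tid, n, m):
    total = 0
    common = min(n, m)
    for i in range(common):         # iterations shared by both paths, convergent
        total += i * i
    if tid % 2 == 0:
        for i in range(common, n):  # remaining iterations of the "then" loop
            total += i * i
    else:
        for i in range(common, m):  # remaining iterations of the "else" loop
            total += i * i
    return total
```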

SLIDE 14

SLIDE 15

Problem Statement

Given a GPU program, identify divergent common code and move it to a convergent region using Hoist/Sink/Split, such that dependences are preserved and the benefit of code motion is maximized.

SLIDE 16

Algorithm

SLIDE 17

Identifying common code: Dynamic Programming
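The slide names dynamic programming but does not give the recurrence. As a stand-in, the sketch below uses the standard longest-common-subsequence formulation over two instruction lists; this is an assumption about what such a matcher could look like, not the authors' algorithm. Instructions are modeled as plain opcode strings, and real matching would also have to compare operands and check dependences.

```python
def common_instructions(then_block, else_block):
    """Return a longest common subsequence of two instruction lists."""
    n, m = len(then_block), len(else_block)
    # lcs[i][j] = length of the LCS of then_block[i:] and else_block[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if then_block[i] == else_block[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table to recover one optimal alignment.
    i = j = 0
    common = []
    while i < n and j < m:
        if then_block[i] == else_block[j]:
            common.append(then_block[i])
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return common


then_block = ["load", "mul", "add", "store"]
else_block = ["load", "sub", "mul", "store"]
print(common_instructions(then_block, else_block))  # ['load', 'mul', 'store']
```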

SLIDE 18

Profitability Heuristics

  • Benefit:
      • Function Call > Memory Instructions > Math Instructions > Copy Instructions
      • Loop nest depth
  • Cost:
      • Copy Instructions for Operand Renaming
      • Increase in register live range and/or stalls with hoist/sink
      • Increase in branches, smaller blocks, more barriers with Split
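To make the trade-off concrete, the sketch below scores a candidate code motion. Everything numeric in it is hypothetical: the weights only respect the ordering given on the slide, the loop-depth scaling is an arbitrary choice, and only the copy cost is modeled (live-range, stall, and branch costs are left out).

```python
# Hypothetical weights; only their ordering (call > memory > math > copy)
# comes from the slide.
BENEFIT = {"call": 8, "memory": 4, "math": 2, "copy": 1}


def net_benefit(kinds, loop_depth=0, copies_added=0):
    """Estimate the gain of moving instructions of the given kinds.

    kinds        -- instruction categories being made convergent
    loop_depth   -- loop nest depth of the moved code (assumed linear scaling)
    copies_added -- copy instructions inserted for operand renaming
    """
    scale = loop_depth + 1
    gain = sum(BENEFIT[k] for k in kinds) * scale
    cost = copies_added * BENEFIT["copy"] * scale
    return gain - cost


def is_profitable(kinds, loop_depth=0, copies_added=0):
    return net_benefit(kinds, loop_depth, copies_added) > 0
```

Under these assumed weights, moving a single copy instruction that itself requires a renaming copy is not profitable, while moving a memory instruction is.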

SLIDE 19

SLIDE 20

Experimental Setup

  • CUDA
  • NVPTX/LLVM
  • Nvidia Volta V100

SLIDE 21

Preliminary Results: Microbenchmarks

[Chart: SIMT efficiency for the Hoist, Sink, Split, Function, Nested, and Switch microbenchmarks; series "SIMT efficiency Before" and "SIMT efficiency After"; axis from 0% to 120%]

[Chart: Speedup for the Hoist, Sink, Split, and Switch microbenchmarks; axis ticks from 2 to 12]

Note: nvprof shows major gains due to a reduction in global reads of up to 27% with CSC (common address reads/coalesced accesses)
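The slides do not define SIMT efficiency. The sketch below assumes the usual definition: the average fraction of a warp's threads that are active per issued instruction. The function name and the sample counts are hypothetical and are not measurements from the talk.

```python
def simt_efficiency(active_counts, warp_size=32):
    """Average fraction of active threads over all issued warp instructions.

    active_counts -- number of active threads for each issued instruction
    """
    return sum(active_counts) / (warp_size * len(active_counts))


# A common instruction issued once per path, with half the warp active each time.
before = simt_efficiency([16, 16])
# The same instruction issued once in a convergent block, with the full warp active.
after = simt_efficiency([32])
print(before, after)  # 0.5 1.0
```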

SLIDE 22

Preliminary Results: Bitonic Sort

[Chart: SIMT Efficiency for Bitonic Sort; min, max, and avg efficiency, before and after; axis from 0.00% to 120.00%]

[Chart: Run Time for Bitonic Sort; min, max, and avg run time, before and after; axis ticks from 50 to 500]

SLIDE 23

Discussion and Future Work

  • Legality
  • CSE and PRE
  • Interprocedural analysis
  • Opportunity in automatically parallelized programs
  • Profile information for divergence, cost, bottlenecks