A. Baniasadi N. J. Dimopoulos University of Victoria Presenter: S. - - PowerPoint PPT Presentation



slide-1
SLIDE 1
  • A. Jooya
  • A. Baniasadi
  • N. J. Dimopoulos

University of Victoria

Presenter: S. Agathos, University of Ioannina

slide-2
SLIDE 2

The Goal

  • Introduce a fast, low-cost and effective approach to optimize resource allocation in GPUs.
  • Produces the optimum configuration in 84% of the cases.
  • Produces the second optimum configuration for the rest of the cases (less than 3.5% error).
  • Reduces the number of explorations by as much as 78%.

slide-3
SLIDE 3

Outline

  • GPU Architecture
  • Plackett and Burman Design Method
  • Knapsack Optimization Technique
  • Proposed Method
  • Results
  • Conclusion

slide-4
SLIDE 4

GPU Architecture

[Figure: a user program is split into a serial section, which runs on the CPU, and a parallel section, which runs on the GPU.]

slide-5
SLIDE 5

GPU Architecture

[Figure: GPU block diagram. An array of streaming multiprocessors (SMs) is connected to off-chip DRAM memory. Each streaming multiprocessor contains a warp pool, a warp scheduler, a SIMD pipeline, and a memory hierarchy made up of the register file, shared memory, L1 cache, constant cache and texture cache.]

slide-6
SLIDE 6

Design Parameters

  • Number of SMs
  • SIMD pipeline width
  • Warp size
  • Texture cache size
  • L1 cache size
  • Constant cache size
  • Number of memory controllers
  • Register file size
  • ….

slide-7
SLIDE 7

Parameters Under Study

  • Number of SMs
  • SIMD pipeline width
  • Warp size
  • Texture cache size
  • L1 cache size
  • Constant cache size
  • Number of memory controllers
  • Register file size
  • ….

slide-8
SLIDE 8

Proposed method

  • Suggests the best application-specific configuration for different available chip budgets.
  • Plackett & Burman design
      • measures the effect of each parameter on performance.
  • Knapsack problem
      • determines the configuration of parameters, based on their effect on performance, such that it:
          • leads to the optimum performance
          • meets the budget

slide-9
SLIDE 9

Plackett & Burman Design (PB)

  • N parameters; each one takes L values
  • PB considers only the min and max values of each parameter
  • X experiments, where X is the next multiple of 4 strictly greater than N
  • PB with fold-over also accounts for interactions between pairs of parameters (doubles the number of experiments)
  • Example: 4 parameters give X = 8, so fold-over needs 16 experiments
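A minimal sketch of how such a design can be generated, assuming the standard 8-run Plackett-Burman generator row (+ + + - + - -) and the usual convention that +1 means the max value and -1 the min value. The function names are illustrative, and the row order may differ from the table used in the talk.

```python
# Standard 8-run Plackett-Burman generator row (7 columns).
PB8_GENERATOR = [+1, +1, +1, -1, +1, -1, -1]


def pb8_design(num_params):
    """Return an 8-run PB design (rows of +1/-1) for up to 7 parameters."""
    if not 1 <= num_params <= 7:
        raise ValueError("the 8-run design supports 1 to 7 parameters")
    n = len(PB8_GENERATOR)
    # Rows 1-7 are cyclic shifts of the generator; row 8 is all low (-1).
    rows = [[PB8_GENERATOR[(col - shift) % n] for col in range(n)]
            for shift in range(n)]
    rows.append([-1] * n)
    # Keep only as many columns as there are parameters.
    return [row[:num_params] for row in rows]


def with_foldover(design):
    """Append the sign-reversed copy of every run, doubling the run count."""
    return design + [[-level for level in row] for row in design]


design = with_foldover(pb8_design(4))  # 4 parameters -> 16 experiments
for run, row in enumerate(design, start=1):
    print(run, " ".join("+" if level > 0 else "-" for level in row))
```

Every column of the resulting design has as many + entries as - entries, and any two columns are orthogonal, which is what lets each parameter's effect be estimated separately.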

slide-10
SLIDE 10

PB Design Table

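[Table: 16-run PB design with fold-over. Columns: exp (1 to 16); parameters A, B, C and D, where each entry is + for the max value or - for the min value; and Perf, the measured performance T1 to T16. The individual +/- entries are not recoverable from this extraction.]

The effects SA, SB, SC and SD are each computed from the corresponding sign column and the results T1 to T16.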
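As a rough illustration of how an effect such as SA is obtained, the sketch below uses the usual PB convention: each result T_i is added when the parameter was at its max value in experiment i and subtracted when it was at its min value. The tiny two-parameter design and the result values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def pb_effects(design, results):
    """Return one effect per parameter: S_k = sum_i sign[i][k] * T_i."""
    if len(design) != len(results):
        raise ValueError("need exactly one result per experiment")
    num_params = len(design[0])
    return [sum(row[k] * t for row, t in zip(design, results))
            for k in range(num_params)]


# Hypothetical 4-run design for two parameters (A, B) and made-up results.
toy_design = [[+1, +1],
              [+1, -1],
              [-1, +1],
              [-1, -1]]
toy_results = [10, 8, 5, 3]

effects = pb_effects(toy_design, toy_results)
print(effects)  # S_A = 10 + 8 - 5 - 3, S_B = 10 - 8 + 5 - 3
```

The larger the magnitude of an effect, the more that parameter influences performance; only the magnitude matters for ranking.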

slide-11
SLIDE 11

Knapsack Problem

  • A constrained optimization problem

Select a number x_j of items of each type so as to

$$\text{maximize} \quad z = \sum_{j=1}^{n} v_j x_j$$

$$\text{subject to} \quad \sum_{j=1}^{n} w_j x_j \le C,$$

$$1 \le x_j \le b_j \ \text{and integer}, \quad j \in N = \{1, \ldots, n\},$$

where

  • v_j = value of an item of type j; determined by the PB design (S_x)
  • w_j = weight of an item of type j; the transistor count of unit j
  • b_j = upper bound on the availability of items of type j
  • C = capacity of the knapsack; the maximum number of transistors
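The bounded knapsack formulation can be sketched as a brute-force search, which is practical here because only a handful of parameters with small bounds are involved. This is an illustrative sketch, not the solver used in the talk: the function name, the `min_units` argument (assuming every parameter needs at least one unit) and the example numbers are all hypothetical.

```python
from itertools import product


def bounded_knapsack(values, weights, bounds, capacity, min_units=1):
    """Maximize sum(v_j * x_j) subject to sum(w_j * x_j) <= capacity,
    with min_units <= x_j <= b_j and x_j integer.

    Returns (best_value, best_x), or (None, None) if nothing fits.
    """
    best_value, best_x = None, None
    ranges = [range(min_units, b + 1) for b in bounds]
    for x in product(*ranges):
        weight = sum(w * xj for w, xj in zip(weights, x))
        if weight > capacity:
            continue  # over budget
        value = sum(v * xj for v, xj in zip(values, x))
        if best_value is None or value > best_value:
            best_value, best_x = value, x
    return best_value, best_x


# Hypothetical example: three unit types, budget of 12.
best_value, best_x = bounded_knapsack(
    values=[3, 2, 1], weights=[4, 3, 2], bounds=[2, 2, 2], capacity=12)
print(best_value, best_x)  # 8 (1, 2, 1)
```

In the setting of the talk, the values would come from the PB effects, the weights from the transistor cost of one unit, and the capacity from the chip budget.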

slide-12
SLIDE 12

GPU configuration

Parameter                 Value
Number of shaders         30
Shader clock frequency    1.3 GHz
Max threads per shader    1024
SIMD pipeline width       32
Warp width                32
Scheduling                PDOM
Max CTA/shader            8

slide-13
SLIDE 13

Transistor costs

Parameter            Size/number   Cost (million transistors)
Memory controller    1             0.3
DL1 cache            32 KB         87
Constant cache       32 KB         52
Register file        32 KB         170
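To show how these reference costs translate into the transistor count of a whole configuration, here is a small sketch. It assumes that cost scales linearly with size, which the slides do not state, and the dictionary keys and function name are illustrative. The example configuration is the one listed later in the deck for the NN benchmark (4 memory controllers, 16 KB DL1 cache, 512 B constant cache, 4 KB register file).

```python
# Reference points from the transistor-cost slide:
# name -> (reference size, cost in million transistors at that size).
REFERENCE_COST = {
    "mcb": (1, 0.3),          # one memory controller
    "dl1_kb": (32, 87.0),     # 32 KB DL1 cache
    "const_kb": (32, 52.0),   # 32 KB constant cache
    "regfile_kb": (32, 170.0),  # 32 KB register file
}


def config_cost(config):
    """Estimate cost in million transistors, ASSUMING linear scaling."""
    total = 0.0
    for name, amount in config.items():
        ref_size, ref_cost = REFERENCE_COST[name]
        total += ref_cost * amount / ref_size
    return total


nn_region2 = {"mcb": 4, "dl1_kb": 16, "const_kb": 0.5, "regfile_kb": 4}
print(round(config_cost(nn_region2), 2))  # 66.76
```

Under the linear-scaling assumption this configuration comes to roughly 66.8 million transistors, which falls inside the 56 to 67 million budget region the deck gives for it.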

slide-14
SLIDE 14

Benchmarks and PB design values

Benchmark      MCB    DL1 cache         Constant cache    Register file
AES            1-3    1 KB, 2 KB        512 B - 8 KB      4 KB, 8 KB
MonteCarlo     1-5    32 KB, 64 KB      1 KB, 2 KB        8 KB, 16 KB
LIB            3-5    32 KB - 128 KB    64 KB - 256 KB    2 KB, 4 KB
Ray            1-3    1 KB, 2 KB        1 KB - 32 KB      8 KB, 16 KB
NN             1-4    8 KB - 32 KB      512 B, 1 KB       4 KB, 8 KB
Scan           1-3    512 B - 2 KB      N/A               2 KB - 8 KB
Srad           1-6    2 KB - 256 KB     N/A               4 KB - 16 KB
BlackScholes   1-4    2 KB - 8 KB       N/A               4 KB, 8 KB
Hotspot        1, 2   1 KB, 2 KB        N/A               8 KB, 16 KB
Matrix         1-5    1 KB, 2 KB        N/A               4 KB, 8 KB
Backprop       1-3    1 KB, 2 KB        N/A               4 KB, 8 KB
FWT            1-4    4 KB - 32 KB      N/A               4 KB, 8 KB
LPS            1-3    1 KB, 2 KB        N/A               2 KB, 4 KB

slide-15
SLIDE 15

PB results

R is the rank of each effect within its row (1 = largest). Benchmarks with no constant cache parameter in the PB design show N/A in that column.

Benchmark      MCB        R   DL1 cache   R   Const cache   R   Register file   R
AES            57100      1   17742       4   35500         2   18922           3
MonteCarlo     445073     1   293537      2   46469         3   26229           4
LIB            1482938    3   2284182     2   1051960       4   6037368         1
Ray            34441615   2   38417897    1   1405          4   4684461         3
NN             1054553    2   1508787     1   5307          3   535             4
Scan           12372      1   9708        2   N/A           -   1748            3
Srad           139159     1   2125        2   N/A           -   117             3
BlackScholes   3110257    1   142613      2   N/A           -   29369           3
Hotspot        433926     1   133392      2   N/A           -   4584            3
Matrix         13693      1   7931        2   N/A           -   1775            3
Backprop       14864      1   12020       2   N/A           -   476             3
FWT            127799     1   73903       2   N/A           -   459             3
LPS            340704     1   136082      2   N/A           -   16736           3
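The R columns appear to rank each benchmark's parameters by the magnitude of their PB effect. A short sketch of that ranking, using the AES row from the table above (the function name is illustrative):

```python
def rank_effects(effects):
    """Map each parameter name to its rank; rank 1 is the largest |effect|."""
    ordered = sorted(effects, key=lambda name: abs(effects[name]), reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


# PB effects reported for the AES benchmark.
aes_effects = {
    "MCB": 57100,
    "DL1 cache": 17742,
    "Const cache": 35500,
    "Register file": 18922,
}

aes_ranks = rank_effects(aes_effects)
print(aes_ranks)
```

For AES this reproduces the ranks in the table: MCB first, constant cache second, register file third and DL1 cache fourth.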

slide-16
SLIDE 16

Knapsack result for NN benchmark

Design space: 44 - 135 million transistors

  • MCB: 1-4 units
  • DL1 cache: 8-32 KB
  • Const cache: 512 B - 1 KB
  • Register file: 4-8 KB

Example: Region 2, 56 - 67 million transistors

  • MCB: 4 units
  • DL1 cache: 2 units (16 KB)
  • Const cache: 1 unit (512 B)
  • Register file: 1 unit (4 KB)

[Figure: number of units of each parameter versus transistor count.]

slide-17
SLIDE 17

Execution Platform

  • GPGPU-Sim 2.1
  • Computing resource: Hermes cluster (WestGrid)
      • 88 nodes, dual-socket X5550 (@ 2.66 GHz)
  • Exhaustive simulation time: 11 days, 8 hours, 40 minutes and 56 seconds (on a single node)
  • Proposed method time: 1 day, 7 hours, 48 minutes and 21 seconds (on a single node)
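For a sense of scale, the two single-node times can be compared directly. The snippet below is just the arithmetic on the figures quoted above; note that it measures wall-clock simulation time, which is a different quantity from the reduction in the number of explorations reported elsewhere in the deck.

```python
from datetime import timedelta

exhaustive = timedelta(days=11, hours=8, minutes=40, seconds=56)
proposed = timedelta(days=1, hours=7, minutes=48, seconds=21)

# Dividing one timedelta by another yields a plain float ratio.
speedup = exhaustive / proposed
time_saved = 1 - proposed / exhaustive

print(f"speedup: {speedup:.2f}x, time saved: {time_saved:.1%}")
```

On these numbers the proposed method takes a little under one eighth of the exhaustive simulation time.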

slide-18
SLIDE 18

ILP Result Validation

Design space: 44 - 135 million transistors

  • MCB: 1-4 units
  • DL1 cache: 8-32 KB
  • Const cache: 512 B - 1 KB
  • Register file: 4-8 KB

ILP-suggested configuration for Region 2, 56 - 66 million transistors

  • MCB: 4 units
  • DL1 cache: 2 units (16 KB)
  • Const cache: 1 unit (512 B)
  • Register file: 1 unit (4 KB)

[Figure: performance (IPC, axis from 30 to 65) of the configurations in this region, labeled by their transistor counts in millions: 65.4, 65.9, 66.2, 66.7, 65.7, 66.2, 66.5, 66.0, 66.5, 66.8, 66.3, 66.8, 66.8. Markers identify the optimum and the ILP-suggested configuration.]

slide-19
SLIDE 19

ILP Result Validation

[Figure: IPC of the best configuration versus the ILP-suggested configuration, in six panels: AES, BlackScholes and RAY; Srad; LIB and Matrix; FWT, LPS, Scan and BackProp; NN; MonteCarlo and HotSpot.]

slide-20
SLIDE 20

Mismatch Region Details

[Figure: IPC of the best configuration, the ILP-suggested configuration and the other configurations in two mismatch regions, AES Region 1 and Srad Region 5, shown alongside the overview panels for AES, BlackScholes, RAY and Srad. Reported errors for the two regions: 1.2% and 0.66%.]

slide-21
SLIDE 21

Conclusion

  • The proposed method delivers the optimum-performing configuration in 48 out of 57 cases.
  • In the other nine cases, performance lagged the optimum by less than 3.5%.
  • Reduces the number of explorations by as much as 78%.
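As a quick check, the 84% figure quoted in the goal slide is consistent with these case counts:

```latex
\frac{48}{57} \approx 0.842 \;(84\%), \qquad \frac{57 - 48}{57} = \frac{9}{57} \approx 0.158 \;(16\%)
```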

slide-22
SLIDE 22
slide-23
SLIDE 23

Questions?