
slide-1
SLIDE 1

| |

Mauro Calderara, Sascha Brück, Mathieu Luisier

Using Today’s Fastest Chips to Design the Chips of Tomorrow

slide-3
SLIDE 3

| |

  • What we want to do → Quantum Transport: electrons and structures
  • How we do it → How GPUs saved the day

Apr 08 2016 Mauro Calderara 3

Overview

slide-4
SLIDE 4

| | Apr 08 2016 Mauro Calderara 4

Probably you’re familiar with this

slide-5
SLIDE 5

| | Apr 08 2016 Mauro Calderara 5

Zooming in

slide-6
SLIDE 6

| | Apr 08 2016 Mauro Calderara 6

The future?

(link to video: http://iis.ee.ethz.ch/~mauro/movie_SC15.avi)

slide-7
SLIDE 7

| | Apr 08 2016 Mauro Calderara 7

From a somewhat more abstract POV

[Figure: "Device" with electrons (e) and a question mark (?)]

slide-12
SLIDE 12

| |

  • How do electrons behave w.r.t. the device?
  • Change in parameters → change in behavior?
  • Applies not just to transistors
  • Batteries
  • Storage devices
  • ...

Apr 08 2016 Mauro Calderara 8

This is what we’re ultimately interested in!

[Figure: "Device" with electrons (e); parameters shown: gate voltage, dimensions, material properties]

slide-18
SLIDE 18

| | Apr 08 2016 Mauro Calderara 9

How would we do that? The “easy” case:

→ device behaves like bulk material

slide-20
SLIDE 20

| | Apr 08 2016 Mauro Calderara 10

How would we do that? The “difficult” case:

→ device behaves like atomic structure

slide-22
SLIDE 22

| | Apr 08 2016 Mauro Calderara 11

The cost of going small

Why is this “easy” ...
... and this “difficult”?

slide-23
SLIDE 23

| | Apr 08 2016 Mauro Calderara 12

The cost of going small

“Easy” case: can assume the structure is “infinite” and use a semi-empirical model.
“Difficult” case: very finite! Need to do it from first principles.

[Figure: runtime for each case]

slide-30
SLIDE 30

| |

[Figure: runtime for each case]

Apr 08 2016 Mauro Calderara 13

The cost of going small

Semi-empirical → O(Hours)
First principles → O(Months)

slide-33
SLIDE 33

| |

  • What we want to do → Quantum Transport: electrons and structures
  • How we do it → How GPUs saved the day

Apr 08 2016 Mauro Calderara 14

Overview

slide-34
SLIDE 34

| | Apr 08 2016 Mauro Calderara 15

Where does all that time go?

[Figure: runtime, annotated “~ 40x”]

Invert the matrix from before (selectively!) using a recursive algorithm.
Solve an eigenvalue problem (not discussed here).

slide-37
SLIDE 37

| |

  • Instead of trying to invert selectively, solve the system using a generic sparse solver package
  • Gain: speed, parallelism, capacity for somewhat larger systems
  • Cost: code is now memory-bandwidth bound

And: not such a good fit for GPUs ...

Apr 08 2016 Mauro Calderara 16

Avoiding the inversion, use a sparse solver instead

[Figure: runtime, annotated “~ 40x”]

slide-41
SLIDE 41

| |

  • We’ve been able to solve that one

Apr 08 2016 Mauro Calderara 17

Tackling the eigenvalue problem

[Figure: runtime, annotated “~ 200x”]

slide-42
SLIDE 42

| |

  • Good speedup so far (now: O(Days), still not quite there...)
  • But: memory-bandwidth bound by the sparse solver

Apr 08 2016 Mauro Calderara 18

Now what?

[Figure: runtime, annotated “~ 70x overall”; advisor and PhD student, with a question mark]
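
A back-of-the-envelope check of the speedup figures so far, as a sketch only. It assumes the "~ 40x" and "~ 200x" annotations are speedups of the solver phase and the eigenvalue phase respectively, which the slides do not state explicitly. The function names and the 90-day starting point are illustrative, not from the talk.

```python
def overall_speedup(fractions, speedups):
    """Amdahl-style combination: fractions[i] of the original runtime
    is accelerated by speedups[i]; fractions must sum to 1."""
    if abs(sum(fractions) - 1.0) > 1e-12:
        raise ValueError("fractions must sum to 1")
    return 1.0 / sum(f / s for f, s in zip(fractions, speedups))


def fraction_for_target(s_a, s_b, target):
    """Share f of the original runtime spent in phase A such that
    f / s_a + (1 - f) / s_b == 1 / target."""
    return (1.0 / target - 1.0 / s_b) / (1.0 / s_a - 1.0 / s_b)


# Assumed reading of the slides: solver ~40x, eigenvalue problem ~200x, ~70x overall.
f_solver = fraction_for_target(40.0, 200.0, 70.0)
combined = overall_speedup([f_solver, 1.0 - f_solver], [40.0, 200.0])

print(f"implied solver share of the original runtime: {f_solver:.2f}")
print(f"combined speedup: {combined:.1f}x")
print(f"a hypothetical 90-day run at 70x: {90 / 70:.1f} days")
```

Under that reading, a run of a few months drops to roughly a day or two, which is consistent with the "O(Days)" on the slide.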

slide-48
SLIDE 48

| |

  • Inverting sparse system not feasible
  • In our case: also not necessary
  • Need first and last block rows only
  • If we can compute this fast, we can:
    • interleave the solving step with the boundary-condition (BC) computation
    • obtain the full solution very efficiently

Apr 08 2016 Mauro Calderara 19

A Sparse Solver for Transport Problems running on GPUs

[Figure: matrix equation]
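
One way to see why two block columns of the inverse can be enough, written as a short worked equation. This is a sketch under an assumption that the slides do not spell out: the right-hand side b is nonzero only in its first and last block (b_1 and b_N). The symbols x, b, b_1 and b_N are introduced here for illustration.

```latex
% Assumption: b has nonzero entries only in its first and last block.
B x = b, \qquad
b = \begin{pmatrix} b_1 \\ 0 \\ \vdots \\ 0 \\ b_N \end{pmatrix}
\quad\Longrightarrow\quad
x = B^{-1} b
  = \left[ B^{-1} \right]_{:,1} b_1 + \left[ B^{-1} \right]_{:,N} b_N .
```

All other block columns of the inverse are multiplied by zero blocks, so they never need to be formed.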

slide-52
SLIDE 52

| |

  • Recursive algorithm based on the Schwinger-Dyson equation
  • xGEMM + xGESV + xGEMM
  • Very fast on accelerators
  • Parallelizable

Apr 08 2016 Mauro Calderara 20

Obtaining the first and last block columns of the inverse

for i = N:1
    $Y_i \leftarrow (B_{i,i} - B_{i,i+1} Y_{i+1}) \backslash B_{i,i-1}$
for i = 2:N
    $R_i \leftarrow -Y_i R_{i-1}$

[Figure: block indices N-2, N-1, N, N+1 for $Y$ and $B$]
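
A minimal sketch of the two sweeps above for the first block column, in pure Python with exact fractions so the result can be checked. The helper names (`matmul`, `solve`, `first_block_column`) and the boundary handling are assumptions, not taken from the slides: the term with Y_{N+1} is dropped, and the first block is obtained from the first block row. A production code would call xGEMM and xGESV from BLAS/LAPACK on the accelerator instead of these helpers.

```python
from fractions import Fraction


def matmul(a, b):
    """Dense product a @ b (stands in for xGEMM)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matsub(a, b):
    """Elementwise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def identity(n):
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with row pivoting (stands in for xGESV)."""
    n = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0:
            raise ValueError("singular block")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def first_block_column(diag, upper, lower):
    """First block column of the inverse of a block-tridiagonal matrix.

    diag[i] = B[i,i], upper[i] = B[i,i+1], lower[i] = B[i+1,i] (0-based blocks).
    """
    n = len(diag)
    y = [None] * n
    # Backward sweep: Y_i = (B_ii - B_i,i+1 Y_i+1) \ B_i,i-1, with no Y term for the last block.
    for i in range(n - 1, 0, -1):
        s = diag[i]
        if i + 1 < n:
            s = matsub(s, matmul(upper[i], y[i + 1]))
        y[i] = solve(s, lower[i - 1])
    # The first block row fixes the first block: (B_11 - B_12 Y_2) R_1 = I.
    s = diag[0]
    if n > 1:
        s = matsub(s, matmul(upper[0], y[1]))
    r = [solve(s, identity(len(s)))]
    # Forward sweep: R_i = -Y_i R_{i-1}.
    for i in range(1, n):
        r.append([[-v for v in row] for row in matmul(y[i], r[i - 1])])
    return r


# Small illustrative system: three 2x2 blocks with made-up entries.
F = Fraction
D = [[F(6), F(1)], [F(1), F(6)]]
U = [[F(1), F(0)], [F(1), F(1)]]
L = [[F(1), F(2)], [F(0), F(1)]]
R = first_block_column([D, D, D], [U, U], [L, L])
print(R[0])
```

The last block column follows from the mirrored recursion, sweeping forward first and then backward. Each step is a dense product, a dense solve and another dense product, which is the xGEMM + xGESV + xGEMM pattern named on the slide.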

slide-59
SLIDE 59

| |

  • Runs on GPUs, compute bound
  • Interleaves with eigenvalue (EV) computation
  • Memory efficient
  • Much faster than sparse solvers
  • Whole simulation: O(Hours)

Apr 08 2016 Mauro Calderara 21

A Sparse Solver for Transport Problems running on GPUs

[Figure: Performance [log(FLOPS)] vs. Arithmetic Intensity [log(FLOPS/Byte)]; annotation “~ 10x / 80x”]
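
The axes of the plot are those of a roofline model, which makes the "compute bound" claim easy to illustrate. The sketch below compares a dense matrix product with a sparse matrix-vector product, used here as a stand-in for a typical memory-bound sparse kernel. The peak and bandwidth figures are hypothetical and are not those of the machine used in the talk; the traffic counts are simplified (each dense matrix moved once, vector traffic ignored for the sparse kernel).

```python
def attainable_flops(peak_flops, bandwidth_bytes, intensity):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth_bytes)


def gemm_intensity(n, bytes_per_word=8):
    """FLOPs per byte for C = A @ B with n x n real double-precision matrices,
    counting one read of A and B and one write of C."""
    flops = 2 * n**3
    traffic = 3 * n * n * bytes_per_word
    return flops / traffic


def spmv_intensity(bytes_per_value=8, bytes_per_index=4):
    """FLOPs per byte for a CSR sparse matrix-vector product:
    one multiply-add per stored value plus its column index."""
    return 2 / (bytes_per_value + bytes_per_index)


# Hypothetical accelerator: 1 TFLOP/s peak, 200 GB/s memory bandwidth.
PEAK = 1.0e12
BANDWIDTH = 2.0e11

for name, ai in [("GEMM, n=1200", gemm_intensity(1200)), ("SpMV (CSR)", spmv_intensity())]:
    perf = attainable_flops(PEAK, BANDWIDTH, ai)
    bound = "compute" if perf == PEAK else "memory"
    print(f"{name}: {ai:.2f} FLOP/byte, {perf:.2e} FLOP/s attainable, {bound} bound")
```

With these made-up numbers the dense kernel sits on the flat part of the roofline while the sparse kernel is limited by bandwidth, which is the qualitative point of moving from a generic sparse solver to dense block operations.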

slide-66
SLIDE 66

| |

  • Transforming a sparse problem to a dense one can be a good thing
  • Large speedup over state of the art (15x - 150x)
  • Significant increase in capacity (100’000 atoms → 10x - 100x)
  • Uses hybrid resources very efficiently (15 PF sustained)
  • Made ballistic ab-initio QT simulations for realistic structures a reality

Apr 08 2016 Mauro Calderara 22

Summary

slide-72
SLIDE 72

| | Apr 08 2016 Mauro Calderara 23

(link to video: http://iis.ee.ethz.ch/~mauro/movie_Ag_Switch.avi)

slide-73
SLIDE 73

| | Apr 08 2016 Mauro Calderara 24

Questions?