Using Today’s Fastest Chips to Design the Chips of Tomorrow
Mauro Calderara, Sascha Brück, Mathieu Luisier
Apr 08 2016

Overview
- What we want to do → Quantum Transport: electrons and structures
- How we do it → How GPUs saved the day

Probably you’re familiar with this

Zooming in

The future?
(link to video: http://iis.ee.ethz.ch/~mauro/movie_SC15.avi)

From a somewhat more abstract POV
[Figure: electrons (e) around a device marked with a question mark]

This is what we’re ultimately interested in!
- How do electrons behave w.r.t. the device?
- Change in parameters → change in behavior?
- Applies not just to transistors
  - Batteries
  - Storage devices
  - ...
[Figure: electrons (e) and a device; parameters shown: gate voltage, dimensions, material properties]

How would we do that? The ‘‘easy’’ case:
→ device behaves like bulk material

How would we do that? The ‘‘difficult’’ case:
→ device behaves like atomic structure

The cost of going small
Why is this ‘‘easy’’ ... and this ‘‘difficult’’?

The cost of going small
- ‘‘Easy’’ case: can assume the device is ‘‘infinite’’ and use a semi-empirical model.
- ‘‘Difficult’’ case: very finite! Need to do it from first principles.
- Runtime: semi-empirical → O(Hours); first principles → O(Months)

Overview
- What we want to do → Quantum Transport: electrons and structures
- How we do it → How GPUs saved the day

Where does all that time go?
- Invert the matrix from before (selectively!) using a recursive algorithm.
- Solve an eigenvalue problem (not discussed here).
[Figure: runtime, annotated ~ 40x]

Avoiding the inversion, use a sparse solver instead
- Instead of trying to invert selectively, solve the system using a generic sparse solver package
- Gain: speed, parallelism, capacity for somewhat larger systems
- Cost: code is now memory-bandwidth bound
- And: not such a good fit for GPUs ...
[Figure: runtime, annotated ~ 40x]

Tackling the eigenvalue problem
- We’ve been able to solve that one
[Figure: runtime, annotated ~ 200x]

Now what?
- Good speedup so far: ~ 70x overall (now: O(Days), still not quite there...)
- But: memory-bandwidth bound by the sparse solver
[Figure: runtime, annotated ~ 70x overall; ‘‘Advisor’’ and ‘‘PhD student’’ with a question mark]
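
A short sketch of how per-stage speedups combine into an overall figure. It assumes (this is not stated on the slides) that ~ 40x and ~ 200x are the speedups of the solve stage and of the eigenvalue stage, and that ~ 70x is the resulting end-to-end speedup. The function names are illustrative.

```python
def overall_speedup(fraction_a, speedup_a, speedup_b):
    """End-to-end speedup when stage A took `fraction_a` of the original
    runtime and stage B took the rest (Amdahl-style accounting)."""
    new_time = fraction_a / speedup_a + (1.0 - fraction_a) / speedup_b
    return 1.0 / new_time


def implied_fraction(overall, speedup_a, speedup_b):
    """Solve 1/overall = f/speedup_a + (1 - f)/speedup_b for f."""
    return (1.0 / overall - 1.0 / speedup_b) / (1.0 / speedup_a - 1.0 / speedup_b)


# Assumed reading of the slide numbers: solve stage ~40x, eigenvalue stage ~200x.
f_solve = implied_fraction(70.0, 40.0, 200.0)
remaining_solve = f_solve / 40.0
remaining_total = remaining_solve + (1.0 - f_solve) / 200.0

print(f"share of original runtime in the solve stage: {f_solve:.2f}")
print(f"overall speedup: {overall_speedup(f_solve, 40.0, 200.0):.0f}x")
print(f"share of remaining runtime in the solve stage: {remaining_solve / remaining_total:.2f}")
```

Under that reading, a bit under half of the original runtime sat in the solve stage, and after the speedup that stage accounts for most of what is left. This is consistent with the slide's point that the sparse solver is now the bottleneck.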

A Sparse Solver for Transport Problems running on GPUs
- Inverting sparse system not feasible
- In our case: also not necessary
- Need first and last block columns only
- If we can compute this fast, we can
  - interleave the solving step with the boundary-condition (BC) computation
  - obtain the full solution very efficiently
[Figure: schematic of the matrix inverse]
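
The reason only a few blocks of the inverse are needed can be seen in a small NumPy sketch. It assumes (not spelled out on the slide) that the right-hand side is non-zero only in its first and last block rows, as when carriers are injected only at the two ends of the device. In that case the solution depends only on the first and last block columns of the inverse. Sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_blocks, s = 5, 3            # hypothetical sizes: 5 blocks of 3x3
dim = n_blocks * s

# Random block-tridiagonal matrix with a strong diagonal so it is well conditioned.
block_id = np.arange(dim) // s
mask = np.abs(block_id[:, None] - block_id[None, :]) <= 1
A = (rng.standard_normal((dim, dim)) + 20.0 * np.eye(dim)) * mask

# Assumed right-hand side: non-zero only in the first and last block rows.
b = np.zeros((dim, 2))
b[:s, 0] = rng.standard_normal(s)
b[-s:, 1] = rng.standard_normal(s)

# Reference: solve the full system.
x_ref = np.linalg.solve(A, b)

# Same result using only the first and last block columns of the inverse.
A_inv = np.linalg.inv(A)      # dense inverse, formed here only for checking
first_cols, last_cols = A_inv[:, :s], A_inv[:, -s:]
x_cols = first_cols @ b[:s] + last_cols @ b[-s:]

print(np.allclose(x_ref, x_cols))
```

The dense inverse is formed here only to make the comparison; the point of the slide is that the two block columns can be obtained without it.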

Obtaining the first and last block columns of the inverse
- Recursive algorithm based on the Schwinger-Dyson equation
- xGEMM + xGESV + xGEMM
- Very fast on accelerators
- Parallelizable
for i = N, ..., 1: $Y_i \leftarrow (B_{i,i} - B_{i,i+1} Y_{i+1}) \backslash B_{i,i-1}$
for i = 2, ..., N: $R_i \leftarrow -Y_i R_{i-1}$
[Figure: blocks N-2, N-1, N, N+1 of $B$ and $Y$]
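
The two sweeps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' GPU implementation. It computes only the first block column (the last one would follow from the mirrored recursion), and it assumes $Y_{N+1} = 0$ and $R_1 = (B_{1,1} - B_{1,2} Y_2)^{-1}$, neither of which is stated on the slide. The list layout `diag`/`upper`/`lower` and the function names are hypothetical.

```python
import numpy as np


def first_block_column_of_inverse(diag, upper, lower):
    """First block column of the inverse of a block-tridiagonal matrix B.

    diag[k] = B_{k,k}, upper[k] = B_{k,k+1}, lower[k] = B_{k+1,k} (0-based).
    """
    n = len(diag)
    s = diag[0].shape[0]

    # Backward sweep: Y_i <- (B_{i,i} - B_{i,i+1} Y_{i+1}) \ B_{i,i-1},
    # assuming Y_{N+1} = 0 (nothing couples beyond the last block).
    Y = [None] * n
    schur = diag[n - 1]
    for i in range(n - 1, 0, -1):
        Y[i] = np.linalg.solve(schur, lower[i - 1])   # xGESV
        schur = diag[i - 1] - upper[i - 1] @ Y[i]     # xGEMM

    # Assumed start of the forward sweep: R_1 = (B_{1,1} - B_{1,2} Y_2)^{-1}.
    R = [np.linalg.solve(schur, np.eye(s))]
    # Forward sweep: R_i <- -Y_i R_{i-1}.
    for i in range(1, n):
        R.append(-Y[i] @ R[i - 1])                    # xGEMM
    return np.vstack(R)


def assemble_dense(diag, upper, lower):
    """Build the full matrix from its blocks (for checking only)."""
    n, s = len(diag), diag[0].shape[0]
    B = np.zeros((n * s, n * s))
    for k in range(n):
        B[k * s:(k + 1) * s, k * s:(k + 1) * s] = diag[k]
        if k + 1 < n:
            B[k * s:(k + 1) * s, (k + 1) * s:(k + 2) * s] = upper[k]
            B[(k + 1) * s:(k + 2) * s, k * s:(k + 1) * s] = lower[k]
    return B


rng = np.random.default_rng(0)
n, s = 6, 4   # hypothetical sizes
diag = [rng.standard_normal((s, s)) + 20.0 * np.eye(s) for _ in range(n)]
upper = [rng.standard_normal((s, s)) for _ in range(n - 1)]
lower = [rng.standard_normal((s, s)) for _ in range(n - 1)]

B = assemble_dense(diag, upper, lower)
col = first_block_column_of_inverse(diag, upper, lower)
print(np.allclose(col, np.linalg.inv(B)[:, :s]))
```

Every step works on dense s x s blocks only, which is why the slide can describe the whole procedure in terms of xGEMM and xGESV calls.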

A Sparse Solver for Transport Problems running on GPUs
- Runs on GPUs, compute bound
- Interleaves with eigenvalue (EV) computation
- Memory efficient
- Much faster than sparse solvers
- Whole simulation: O(Hours)
[Figure: Performance [log(FLOPS)] against Arithmetic Intensity [log(FLOPS/Byte)], annotated ~ 10x / 80x]
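
The axes of the plot are those of a roofline model, which a few lines of Python can illustrate. This is a sketch under stated assumptions: the peak and bandwidth figures describe a hypothetical accelerator, not a device from the talk, and the intensity used for the sparse-like kernel is an assumed order of magnitude.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: min(peak compute, intensity * memory bandwidth)."""
    return min(peak_flops, intensity * mem_bandwidth)


def gemm_intensity(n, bytes_per_value=8):
    """FLOP per byte for an n x n matrix product that reads A and B and
    writes C once: 2 n^3 flops over 3 n^2 values."""
    return (2.0 * n**3) / (3.0 * n**2 * bytes_per_value)


# Hypothetical accelerator, NOT a measured device.
PEAK = 1.0e12          # FLOP/s
BANDWIDTH = 200.0e9    # bytes/s

dense = gemm_intensity(1024)   # n / 12, about 85 FLOP/byte
sparse_like = 0.25             # assumed low intensity for a sparse kernel

for name, ai in [("dense GEMM", dense), ("sparse-like kernel", sparse_like)]:
    bound = attainable_flops(ai, PEAK, BANDWIDTH)
    kind = "compute bound" if bound == PEAK else "memory-bandwidth bound"
    print(f"{name}: {ai:.2f} FLOP/byte -> {bound:.2e} FLOP/s ({kind})")
```

With these numbers the dense product reaches the compute roof while the low-intensity kernel is capped by memory bandwidth, which is the contrast the slide draws between the block recursion and a generic sparse solver.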

Summary
- Transforming a sparse problem to a dense one can be a good thing
- Large speedup over state of the art (15x - 150x)
- Significant increase in capacity (100’000 atoms → 10x - 100x)
- Uses hybrid resources very efficiently (15 PF sustained)
- Made ballistic ab-initio QT simulations for realistic structures a reality
(link to video: http://iis.ee.ethz.ch/~mauro/movie_Ag_Switch.avi)