Case Study: Gauss-Seidel on Multicores
Erik Hagersten, Uppsala University, Sweden
Thanks: Dan Wallin (arch), Henrik Löf (sci comp) and Sverker Holmgren (sci comp). From Wallin et al., ICS 2006
Institutionen för informationsteknologi | www.it.uu.se
Criteria for HPC Algorithms
Past:
Minimize communication
Maximize scalability (1000s of CPUs)
Multicores today:
Communication is “for free” [on some multicores]
Scalability is limited to 32 threads
The caches are tiny
Memory bandwidth is scarce
Data locality is key!
[Figure: sweep over the grid, iteration 1. Legend: previous value, current value, sweep path, data dependence, iteration number (1,2,3,4), cacheline layout]
IF (convergence_test) <done> else <iterate again>
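The sweep-and-test loop can be sketched as follows. This is a minimal Python sketch, assuming a 5-point Laplace stencil on a square grid with fixed boundary values. The stencil, the tolerance, and the names `make_grid`, `gauss_seidel_sweep` and `solve` are illustrative assumptions, not taken from the slides.

```python
def make_grid(n, top=1.0):
    """n x n grid, zero everywhere except a fixed top boundary (assumed setup)."""
    u = [[0.0] * n for _ in range(n)]
    u[0] = [top] * n
    return u


def gauss_seidel_sweep(u):
    """One in-place sweep in natural (row-major) order.

    Each point uses the already-updated values above and to the left,
    and the previous-iteration values below and to the right.
    Returns the largest change made to any point.
    """
    max_change = 0.0
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[0]) - 1):
            new = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
            max_change = max(max_change, abs(new - u[i][j]))
            u[i][j] = new
    return max_change


def solve(u, tol=1e-8, max_iters=10_000):
    """Iterate until the convergence test passes; returns the sweep count."""
    for it in range(1, max_iters + 1):
        if gauss_seidel_sweep(u) < tol:   # IF (convergence_test) <done>
            return it
    return max_iters                      # else <iterate again> until the cap
```

Calling `solve(make_grid(8))` runs sweeps until no point changes by more than `tol`.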
[Figures: subsequent steps of the sweep (labels 2 and 0,5 in the original). Legend: previous, current, sweep path, data dependence, iteration number, cacheline layout]
Update all blacks <barrier> Update all reds <barrier>
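The two half-sweeps can be sketched as follows. This is a minimal Python sketch, assuming a 5-point Laplace stencil and a checkerboard colouring where a point is black when `i + j` is even. The colouring convention and the names `update_color` and `red_black_iteration` are assumptions for illustration.

```python
BLACK, RED = 0, 1  # assumed convention: colour = (i + j) % 2


def update_color(u, color):
    """Update every interior point whose (i + j) parity equals `color`.

    All neighbours of a point have the other colour, so the points of one
    colour can be updated in any order (or in parallel).
    """
    for i in range(1, len(u) - 1):
        start = 1 + ((i + 1 + color) % 2)   # first interior column of this colour
        for j in range(start, len(u[0]) - 1, 2):
            u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])


def red_black_iteration(u):
    update_color(u, BLACK)   # Update all blacks
    # <barrier>: every black must be finished before any red starts
    update_color(u, RED)     # Update all reds
    # <barrier>
```

Each call to `red_black_iteration` traverses the whole grid twice, once per colour.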
IN PARALLEL { Update all blacks <barrier> Update all reds <barrier> }
[Figure: grid partitioned among thread 0, thread 1, thread 2, thread 3, thread 4]
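The parallel red-black scheme can be sketched with one barrier after each colour. This is a minimal Python sketch, assuming each thread owns a contiguous strip of rows and a 5-point Laplace stencil. The function names and the row partitioning are assumptions; Python threads show the synchronization structure only and will not deliver the speedup of a native implementation.

```python
import threading


def make_grid(n, top=1.0):
    """n x n grid with a fixed top boundary (assumed test setup)."""
    u = [[0.0] * n for _ in range(n)]
    u[0] = [top] * n
    return u


def parallel_red_black(u, iterations, num_threads=4):
    n_int = len(u) - 2                        # number of interior rows
    barrier = threading.Barrier(num_threads)

    def worker(tid):
        # Contiguous strip of interior rows owned by this thread.
        lo = 1 + tid * n_int // num_threads
        hi = 1 + (tid + 1) * n_int // num_threads
        for _ in range(iterations):
            for color in (0, 1):              # 0 = black, 1 = red (assumed)
                for i in range(lo, hi):
                    start = 1 + ((i + 1 + color) % 2)
                    for j in range(start, len(u[0]) - 1, 2):
                        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                          + u[i][j - 1] + u[i][j + 1])
                barrier.wait()                # <barrier> after each colour

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because points of one colour never read each other, the result does not depend on the number of threads.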
Poor Cache Locality of Red-Black:
Each element will be brought into the cache twice per iteration (once for the black update, once for the red update)
Natural Order:
Each element will be brought into the cache once per iteration
You can do even better…
Natural Order with Temporal Blocking ☺
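Temporal blocking can be sketched as several natural-order iterations travelling down the grid together, so each row is reused while it is still in the cache. This is a minimal Python sketch, assuming a 5-point Laplace stencil and row-granularity blocking where each iteration trails the previous one by one row. The names and the one-row lag are illustrative assumptions, not the blocking used by the authors.

```python
def update_row(u, i):
    """Natural-order update of interior row i (left to right)."""
    for j in range(1, len(u[0]) - 1):
        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])


def plain_sweeps(u, sigma):
    """Reference: sigma full sweeps, each traversing the whole grid."""
    for _ in range(sigma):
        for i in range(1, len(u) - 1):
            update_row(u, i)


def temporally_blocked_sweeps(u, sigma):
    """sigma iterations in a single pass over the grid.

    `front` is the row touched by the first iteration; iteration s
    (0-based) works s rows behind it. The rows front - sigma + 1 .. front
    form the active region that has to stay in the cache.
    """
    last = len(u) - 2                     # last interior row
    for front in range(1, last + sigma):
        for s in range(sigma):
            i = front - s
            if 1 <= i <= last:
                update_row(u, i)


def make_grid(n, top=1.0):
    u = [[0.0] * n for _ in range(n)]
    u[0] = [top] * n
    return u
```

Both functions perform the same updates on the same operands, so they produce identical grids; only the traversal order, and therefore the cache behaviour, differs.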
[Figures: temporally blocked sweep with several iterations (1,2,3,4) in flight inside the active region. Legend: active region, previous, current, sweep path, data dependence, iteration number, cacheline layout]
[Figure: cache miss ratio (percent, axis 2 to 14) versus cache size (512 KB, 1 MB, 2 MB, 4 MB, 8 MB, 16 MB, 32 MB)]
[Figures: parallel temporally blocked sweep across thread 0, thread 1, thread 2, thread 3, with synchronization flags labelled by iteration number. Legend: active region, previous, current, sweep path, data dependence, iteration number, cacheline layout, sync flag iteration no]
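Flag-based synchronization between neighbouring threads can be sketched as follows. This is a simplified Python sketch, not the authors' scheme: it assumes each thread owns a block of columns, a 5-point Laplace stencil, and one progress flag per thread counting finished row updates. It omits the temporal blocking inside each thread, and every name is hypothetical.

```python
import threading


def sequential_gauss_seidel(u, iterations):
    """Reference: plain natural-order sweeps."""
    for _ in range(iterations):
        for r in range(1, len(u) - 1):
            for j in range(1, len(u[0]) - 1):
                u[r][j] = 0.25 * (u[r - 1][j] + u[r + 1][j] + u[r][j - 1] + u[r][j + 1])


def pipelined_gauss_seidel(u, iterations, num_threads=4):
    rows = len(u) - 2                       # interior rows per sweep
    cols = len(u[0]) - 2                    # interior columns
    num_threads = min(num_threads, cols)    # every thread needs a column
    progress = [0] * num_threads            # sync flags: row updates finished
    cond = threading.Condition()

    def worker(t):
        lo = 1 + t * cols // num_threads
        hi = 1 + (t + 1) * cols // num_threads
        for k in range(iterations):
            for r in range(1, rows + 1):
                with cond:
                    # Left neighbour must have finished row r of iteration k;
                    # right neighbour must have finished row r of iteration k-1.
                    cond.wait_for(lambda: (
                        (t == 0 or progress[t - 1] >= k * rows + r)
                        and (t == num_threads - 1
                             or progress[t + 1] >= (k - 1) * rows + r)))
                for j in range(lo, hi):
                    u[r][j] = 0.25 * (u[r - 1][j] + u[r + 1][j]
                                      + u[r][j - 1] + u[r][j + 1])
                with cond:
                    progress[t] = k * rows + r   # publish the new flag value
                    cond.notify_all()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
```

No global barrier is needed: each thread waits only on the flags of its two neighbours, so thread 0 runs ahead and the others follow in a pipeline. The result matches `sequential_gauss_seidel` because every point sees the same operand values as in the sequential order.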
Startup cost = (#threads-1)/(Nσ)
[Figure: cacheline layout (size B bytes)]
Communication:
[Figure: execution time per step (sec), axis 0,5 to 2,5; series RBGS (RedBlack) and TBGS (Natural Order); x-axis ticks 0,5, 1, 2, 4, 8, 16]
(Sun E15 K, US IIIcu = SMP!!)
[Figure: performance ratio TBGS/RBGS (axis 1,0 to 3,5) versus number of threads (2 to 30), for σ = 1, 2, 4, 8, 16]
[Figure: performance ratio TBGS/RBGS (axis 0,0 to 2,5) versus number of threads (2 to 30); series: Sun 15k, Simulated Multicore]
G-S: Important part of many real apps!
EX: G-S as a smoother in “Multigrid”
Iterative algorithm
More efficient smoother cuts #iterations
Today’s algorithms assume expensive
The communication cost of [some] multicores is
Locality is becoming key to performance [again]
Redesign HPC algorithms to face this fact!
(For both Capacity and Capability computing)
[Figure: multicore block diagram: 8 CPUs, each with L1I and L1D (wt) caches, connected through a crossbar (Xbar = 134 GB/s) to four L2 banks, each with its own memory controller]
4 x DDR-2 = 25 GB/s (!)
Shared L2! 3 MB
Shared caches: Good or bad?