
Case Study: Gauss-Seidel on Multicores (Erik Hagersten, Uppsala University)



  1. Case Study: Gauss-Seidel on Multicores
     Erik Hagersten, Uppsala University, Sweden
     Thanks: Dan Wallin (arch), Henrik Löf (sci comp), and Sverker Holmgren (sci comp)
     From Wallin et al., ICS 2006

  2. Criteria for HPC Algorithms
     Past:
     • Minimize communication
     • Maximize scalability (1000s of CPUs)
     Multicores today:
     • Communication is "for free" [on some multicores]
     • Scalability is limited to 32 threads
     • The caches are tiny
     • Memory bandwidth is scarce
     • Data locality is key! (Both for Capacity and Capability Computing!)
     Institutionen för informationsteknologi | www.it.uu.se

  3. Natural Order Gauss-Seidel
     [Figure: grid swept in natural order, iteration 1. Legend: sweep path, previous, current, data dependence, iteration number (1, 2, 3, 4), cacheline layout.]

  4. Natural Order Gauss-Seidel
     [Figure: grid swept in natural order, iteration 1 completed. Legend: sweep path, previous, current, data dependence, iteration number (1, 2, 3, 4), cacheline layout.]
     IF (convergence_test) <done> else <iterate again>
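The sweep-then-test loop on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the slides do not name the equation being solved, so a 2D Laplace problem with a 5-point stencil is assumed, and `make_grid`, `gauss_seidel_sweep`, `solve`, and the tolerance are hypothetical names and values.

```python
def make_grid(n):
    """Assumed test problem: n x n grid, top boundary held at 1.0, rest 0.0."""
    u = [[0.0] * n for _ in range(n)]
    u[0] = [1.0] * n
    return u


def gauss_seidel_sweep(u):
    """One natural-order sweep, in place. Returns the largest change seen."""
    n_rows, n_cols = len(u), len(u[0])
    max_change = 0.0
    for i in range(1, n_rows - 1):          # rows, top to bottom
        for j in range(1, n_cols - 1):      # columns, left to right
            # The up and left neighbours already hold this iteration's values;
            # the down and right neighbours still hold the previous ones.
            new = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
            max_change = max(max_change, abs(new - u[i][j]))
            u[i][j] = new
    return max_change


def solve(u, tol=1e-8, max_iters=10_000):
    """Sweep until the convergence test passes; returns the sweeps used."""
    for iteration in range(1, max_iters + 1):
        if gauss_seidel_sweep(u) < tol:     # IF (convergence_test) <done>
            return iteration
    return max_iters                        # gave up before converging
```

Calling `solve(make_grid(9))` iterates until the largest per-element change falls below `tol`. The in-place update is what creates the data dependence shown in the figure: each point needs its left and upper neighbours from the current sweep.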

  5. Natural Order Gauss-Seidel
     [Figure: grid swept in natural order, iteration 2. Legend: sweep path, previous, current, data dependence, iteration number (1, 2, 3, 4), cacheline layout.]
     Data dependence => poor parallelism

  6. Red-Black Gauss-Seidel
     [Figure: red-black grid, iteration 0.5. Legend: sweep path, previous, current, data dependence, iteration number (1, 2, 3, 4), cacheline layout.]

  7. Red-Black Gauss-Seidel
     Step 0.5: update the blacks
     [Figure: red-black grid with the black points updated. Legend: sweep path, previous, current, data dependence, iteration number (1, 2, 3, 4), cacheline layout.]

  8. Red-Black Gauss-Seidel
     Step 1.0: update all reds
     [Figure: red-black grid with the red points updated. Legend: sweep path, previous, current, data dependence, iteration number (1, 2, 3, 4), cacheline layout.]
     Update all blacks
     <barrier>
     Update all reds
     <barrier>
     => great parallelism!!!

  9. Red-Black Gauss-Seidel, parallel version
     [Figure: red-black grid partitioned among threads 0 to 4. Legend: sweep path, previous, current, data dependence, iteration number (1, 2, 3, 4), cacheline layout.]
     IN PARALLEL {
       Update all blacks
       <barrier>
       Update all reds
       <barrier>
     }
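The IN PARALLEL block above maps directly onto a barrier primitive. The following is a rough sketch under stated assumptions, not the implementation measured in the talk: a 2D Laplace problem with a 5-point stencil is assumed, rows are split into contiguous bands per thread, and the choice of which parity counts as "black" is arbitrary. All function names are hypothetical.

```python
import threading


def make_grid(n):
    """Assumed test problem: n x n grid, top boundary held at 1.0, rest 0.0."""
    u = [[0.0] * n for _ in range(n)]
    u[0] = [1.0] * n
    return u


def update_color(u, color, row_start, row_stop):
    """Update interior points with (i + j) % 2 == color in rows [row_start, row_stop)."""
    n_cols = len(u[0])
    for i in range(row_start, row_stop):
        j0 = 1 + ((i + 1 + color) % 2)      # first interior column of this color
        for j in range(j0, n_cols - 1, 2):
            # All four neighbours have the other color, so points of one
            # color can be updated in any order, or concurrently.
            u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])


def red_black_parallel(u, iterations, n_threads):
    """Red-black Gauss-Seidel with one band of rows per thread."""
    interior = len(u) - 2
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        start = 1 + tid * interior // n_threads
        stop = 1 + (tid + 1) * interior // n_threads
        for _ in range(iterations):
            update_color(u, 0, start, stop)  # "blacks" (parity 0, by assumption)
            barrier.wait()                   # <barrier>
            update_color(u, 1, start, stop)  # "reds"
            barrier.wait()                   # <barrier>

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because each half-step reads only points of the other color, the result does not depend on the number of threads. In CPython the threads do not run the arithmetic concurrently, so this only illustrates the synchronization structure, not the speedup.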

  10. Any Drawbacks of the Red-Black?
      Poor cache locality of Red-Black:
      • Each element will be brought into the cache twice per iteration
      Natural Order:
      • Each element will be brought into the cache once per iteration
      You can do even better…
      • Natural Order with Temporal Blocking ☺

  11. G-S, temporal blocking: several iterations per sweep
      [Figure: grid with iterations 1 to 4 in flight within one sweep. Legend: sweep path, previous, current, data dependence, iteration number (1, 2, 3, 4), cacheline layout, active region.]

  12. G-S, temporal blocking: several iterations per sweep
      [Figure: grid with iterations 1 to 4 in flight within one sweep. Legend: sweep path, previous, current, data dependence, iteration number (1, 2, 3, 4), cacheline layout, active region.]
      In this case: 4 iterations per "sweep" (σ = 4).
      σ = 1.0 for natural order G-S
      σ = 0.5 for red-black G-S
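One way to picture temporal blocking is a single pass over the grid in which iteration 2 trails iteration 1 by one row, iteration 3 trails iteration 2 by one row, and so on. The sketch below is an assumption-laden illustration rather than the scheme from the talk: it uses a 2D Laplace problem, whole rows as the unit of work, and a skew of exactly one row, whereas the real 3D code may choose its active region differently. The names `update_row`, `natural_order`, and `temporally_blocked_sweep` are hypothetical.

```python
def make_grid(n):
    """Assumed test problem: n x n grid, top boundary held at 1.0, rest 0.0."""
    u = [[0.0] * n for _ in range(n)]
    u[0] = [1.0] * n
    return u


def update_row(u, i):
    """Natural-order Gauss-Seidel update of interior row i, in place."""
    for j in range(1, len(u[0]) - 1):
        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])


def natural_order(u, iterations):
    """Reference: one full pass over the grid per iteration (sigma = 1)."""
    for _ in range(iterations):
        for i in range(1, len(u) - 1):
            update_row(u, i)


def temporally_blocked_sweep(u, sigma):
    """One pass over the grid that carries out sigma iterations."""
    last = len(u) - 2                       # index of the last interior row
    for step in range(1, last + sigma):
        # Iteration t + 1 works t rows behind iteration 1. Running t in
        # ascending order means row i + 1 has just received iteration t's
        # value and row i - 1 already holds iteration t + 1's value.
        for t in range(sigma):
            i = step - t
            if 1 <= i <= last:
                update_row(u, i)
```

With this ordering the sweep touches only about `sigma + 1` neighbouring rows at any time, which is the "active region" that has to fit in the cache, and it produces the same values as `sigma` separate natural-order sweeps.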

  13. G-S 3D, σ = 2

  14. G-S 3D, σ = 2

  15. Acumem Graph, 3D N=129
      [Figure: cache miss ratio (percent, 0 to 14) versus cache size (512 KB to 32 MB), one curve each for σ = 0.5, 1, 2, 4, 8, 16.]
      Miss ratio ~ Memory bandwidth

  16. Parallel G-S, temporal blocked
      [Figure: grid partitioned among threads 0 to 3, with synchronization flags holding iteration numbers. Legend: sweep path, previous, current, data dependence, iteration number (1, 2, 3, 4), cacheline layout, active region, sync flag (iteration no).]

  17. Parallel G-S, temporal blocked
      [Figure: grid partitioned among threads 0 to 3, with synchronization flags holding iteration numbers. Legend: sweep path, previous, current, data dependence, iteration number (1, 2, 3, 4), cacheline layout, active region, sync flag (iteration no).]
      Synchronization flags: wait until "lefty" is done
      Lots of communication:
      • Producer/Consumer flag
      • Sharing of data values
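The "wait until lefty is done" rule can be sketched with per-thread progress counters acting as producer/consumer flags. This is a simplified illustration and not the scheme from the talk: it assumes a 2D Laplace problem, splits the grid into column bands, and runs plain natural-order sweeps (σ = 1) rather than temporally blocked ones. The extra wait on the right-hand neighbour is needed here because a thread must not start the next iteration of a row before the neighbour has finished the previous one. All names are hypothetical.

```python
import threading


def make_grid(n):
    """Assumed test problem: n x n grid, top boundary held at 1.0, rest 0.0."""
    u = [[0.0] * n for _ in range(n)]
    u[0] = [1.0] * n
    return u


def natural_order(u, iterations):
    """Sequential reference: natural-order Gauss-Seidel sweeps."""
    for _ in range(iterations):
        for i in range(1, len(u) - 1):
            for j in range(1, len(u[0]) - 1):
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])


def parallel_natural_order(u, iterations, n_threads):
    """Same result as natural_order, with one column band per thread."""
    rows, cols = len(u) - 2, len(u[0]) - 2
    progress = [0] * n_threads     # flag: (iteration, row) units finished
    cond = threading.Condition()

    def wait_for(k, units):
        with cond:
            cond.wait_for(lambda: progress[k] >= units)

    def worker(k):
        c0 = 1 + k * cols // n_threads
        c1 = 1 + (k + 1) * cols // n_threads
        for it in range(iterations):
            for r in range(rows):
                unit = it * rows + r
                if k > 0:
                    # Consumer side: lefty must have produced this row
                    # for the current iteration.
                    wait_for(k - 1, unit + 1)
                if k < n_threads - 1 and it > 0:
                    # Righty must have finished this row for the previous
                    # iteration before its edge value is read.
                    wait_for(k + 1, unit - rows + 1)
                i = r + 1
                for j in range(c0, c1):
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
                with cond:         # producer side: publish progress
                    progress[k] = unit + 1
                    cond.notify_all()

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Each flag update is one synchronization, and each read of a neighbour's edge value is one piece of shared data, which matches the two communication costs listed on the slide. A real implementation would presumably spin on cache-line-aligned flags instead of using a condition variable.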

  18. Parallel G-S 3D
      [Figure: 3D grid partitioned among threads t0, t1, t2, t3.]

  19. Parallel G-S 3D
      [Figure: 3D grid partitioned among threads t0 to t3, each with a row of five flags. All flags are 0.]

  20. Parallel G-S 3D
      [Figure: threads t0 to t3 with their flag rows. Values as extracted: t0: 1 0 0 0 0 0; t1: 0 0 0 0 0; t2: 0 0 0 0 0; t3: 0 0 0 0 0.]

  21. Parallel G-S 3D
      [Figure: threads t0 to t3 with their flag rows. Values as extracted: t0: 1 0 0 0 0; t1: 0 0 0 0 0; t2: 0 0 0 0 0; t3: 0 0 0 0 0.]

  22. Parallel G-S 3D
      [Figure: threads t0 to t3 with their flag rows; cacheline layout (size B bytes). Values as extracted: t0: 1 2 1 0 0 0 0; t1: 1 0 0 0 0 0; t2: 0 0 0 0 0; t3: 0 0 0 0 0.]
      Startup cost = (#threads-1)/(N σ)
      Communication:
      • one flag synchronization per N^2/#threads ops
      • one communication miss per B*N/#threads bytes
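To get a feel for the magnitudes, the three expressions above can be evaluated directly. The sketch below only transcribes the slide's formulas; the function name `communication_model` is hypothetical, the example uses N = 257 and 32 threads from the later performance slide, and the cache line size B = 64 bytes is an assumption, since the slides leave B symbolic.

```python
def communication_model(n, threads, sigma, line_bytes):
    """Evaluate the slide's startup and communication expressions.

    n          -- grid points per dimension (N on the slide)
    threads    -- number of threads (#threads)
    sigma      -- iterations per sweep
    line_bytes -- cache line size in bytes (B on the slide)
    """
    return {
        # Startup cost = (#threads - 1) / (N * sigma)
        "startup_cost": (threads - 1) / (n * sigma),
        # One flag synchronization per N^2 / #threads operations
        "ops_per_flag_sync": n ** 2 / threads,
        # One communication miss per B * N / #threads bytes
        "bytes_per_comm_miss": line_bytes * n / threads,
    }


# Example values: N and #threads from the talk, B assumed to be 64 bytes.
model = communication_model(n=257, threads=32, sigma=4, line_bytes=64)
```

For these inputs the startup term is 31/1028, roughly 3 percent, and it shrinks as N or σ grows.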

  23. Parallel Execution Time
      [Figure: execution time per step (sec, 0 to 2.5) for threads = 1, 2, 4, 8. Groups: Red-Black GS (RBGS, σ = 0.5), Natural Order (σ = 1), and Temporally Blocked GS (TBGS, σ = 2, 4, 8, 16).]

  24. Performance comparison with Red-Black
      N = 257, 32 threads (Sun E15K, US IIIcu = SMP!!)
      [Figure: performance ratio TBGS/RBGS (1.0 to 3.5) versus number of threads (0 to 30), one curve each for σ = 0.5, 1, 2, 4, 8, 16.]

  25. Multicore Simulation, σ = 1
      [Figure: performance ratio TBGS/RBGS (0.0 to 2.5) versus number of threads (0 to 30), Sun 15k compared with a simulated multicore.]

  26. Using Gauss-Seidel Smoother in a Multigrid
      • G-S is an important part of many real apps!
      • Example: G-S as a smoother in "Multigrid"
        • Iterative algorithm
        • A more efficient smoother cuts #iterations

  27. One slide summary
      • Today's algorithms assume expensive communication
      • The communication cost of [some] multicores is close to zero
      • Locality is becoming key to performance [again]
      • Redesign HPC algorithms to face this fact! (For both Capacity and Capability computing)
      We show:
      * 3x performance gain
      * ~30x less bandwidth
      Is it time to revisit more algorithms?

  28. Niagara
      [Diagram: four memory controllers, 4 x DDR-2 = 25 GB/s (!); shared 3 MB L2 in four banks; Xbar = 134 GB/s; 8 CPUs, each with L1I and L1D (wt).]
      Shared caches: Good or bad?
