  1. Replicating the Performance Evaluation of an N-Body Application on a Manycore Accelerator
  Vinícius Garcia Pinto, Vinicius Alves Herbstrith, Lucas Mello Schnorr
  October 18, 2015
  6th Workshop on Applications for Multi-Core Architectures (WAMCA)

  2. Outline
  1 Introduction
  2 Background
  3 Related Work
  4 N-Body Performance Evaluation on XeonPhi
  5 Conclusions
  6 References

  3. Introduction
  • Reproducibility: discoveries are replicated and reproduced by independent scientists
  • In computer science: lack of documentation of experiments and their methodology → obstacles to repeating and checking third-party results
  • HPC scenario: few works about platforms with accelerators
  Source: Nature Education 2015 [Stodden et al. 2014; TOP500 2015]

  4. This work
  • Replication of a performance evaluation reported in the book High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches
  • N-Body OpenMP parallel application on a XeonPhi accelerator
  • Our goals:
  • check whether the original results hold on similar but not identical hardware
  • improve the reproducibility of the original experiments, which provide no raw data and only a sparsely detailed description
  • Despite this, we believe that with the source code and a high-level description of the hardware it is possible to replicate and extend this performance analysis.

  5. N-Body Simulation based on Newton's Gravitation Law
  • Algorithm based on Newton's laws of motion: the interaction between bodies and the forces acting on them.

  \forall i \in \{1, \dots, N\}: \quad \frac{d\vec{v}_i}{dt} = G \sum_{j \neq i} m_j \frac{\vec{x}_j - \vec{x}_i}{d_{ij}^3} \qquad (1)

  • N-Body OpenMP Parallel Application for the XeonPhi
  • This kind of application simulates the interaction of particles in space
  • presented in Chapter 11 of the book High Performance Parallelism Pearls
  • open-source code that implements this equation as a parallel OpenMP application. [Reinders et al. '14]

  6. N-Body OpenMP Parallel Application
  • Four versions (v0 also has a single/double-precision split):
  • v0 - able to run natively either on the host or on the device
  • v1 - starts the execution on the host and offloads specific computations to the device
  • v2 - simultaneous computation on both host and device (the host keeps executing while the device is running)
  • v2.1 - overlapping of data transfers between host and device
  • v3 - adds support for an arbitrary number of accelerator devices

  version | tag     | platform       | memory transfers | FP prec.
  v0      | v0s/v0d | host or device | -                | single/double
  v1      | v1s     | both           | -                | single
  v2      | v2s     | both           | one-sided        | single
  v2.1    | v2.1s   | both           | bi               | single
  v3      | v3s     | both           | bi               | single

  7. Related Work
  • XeonPhi supports standard parallel programming tools (e.g. OpenMP)
  • OpenMP 4.0: new directives for accelerators and coprocessors
  • Related works:
  • [Cramer et al. 2012] Evaluation of OpenMP kernel-type benchmarks and a CG solver, XeonPhi vs. a 128-core SMP machine: the overhead of OpenMP synchronization constructs is smaller on the XeonPhi, so scientific applications can run efficiently on the device.
  • [Schmidl et al. 2013] Some applications do not perform well on the XeonPhi because of its relatively slow serial performance.
  • [Tian et al. 2015] N-Body used to evaluate SIMD vectorization on the XeonPhi: SIMD instructions accelerate the execution by 10.52 times.
  • [Borovska et al. 2014] Porting of an MPI/OpenMP astrophysics simulation to the XeonPhi: up to 38% gain from tuning the number of MPI processes per device and the number of threads per process; no comparison with standard processors.

  8. Related Work
  • Our work
  • Like the first two related works, we use an OpenMP application; however, ours uses new OpenMP 4.0 directives such as pragma simd and pragma target.
  • We also use an N-body-like application (as in the other two works), but add two more comparisons: one between the XeonPhi and the host, and another between a GPU and the XeonPhi.

  9. N-Body Performance Evaluation on XeonPhi
  • Platforms Setup and Experimental Methodology
  • Experiments conducted on two machines:
  • orion node at INF/UFRGS with two accelerators (Intel XeonPhi and Nvidia K20)
  • Bree desktop with one accelerator (Nvidia GTX760)

                       Orion              Bree
  Processor            Xeon E5-2630       i7-4770
  N of procs. (NUMA)   2 (two)            1 (one)
  Cores per proc.      6 (12 Hyper. T.)   4 (8 Hyper. T.)
  Max. core freq.      2.30GHz            3.40GHz
  Main memory          32GBytes           8GBytes
  Accelerator #1       XeonPhi 3120A      GTX760
  Accelerator #2       Nvidia K20         -
  OS                   CentOS Linux 7     Ubuntu 14.04
  Kernel               3.10.0 (x86_64)    3.13.0
  MPSS / CUDA          3.4.1 / 6.5        NA / 5.5

                       Phi 3120A          K20m         GTX760
  Processor            in-order x86       cuda cores   cuda cores
  Cores                57 (228 HW T.)     2496         1152
  Max. core freq.      1.10GHz            706MHz       980MHz
  L2 cache             512KBytes          1.3MBytes    768KBytes
  Main memory          6GBytes            5GBytes      2GBytes
  Mem. bandwidth       240 GB/s           208 GB/s     192 GB/s
  TDP                  300 W              225 W        170 W

  10. N-Body Performance Evaluation on XeonPhi
  • Application Input
  • number of particles (50,000)
  • the time step for each iteration (0.01)
  • the number of iterations (100)
  • Experimental Methodology
  • average of at least 31 runs
  • standard error taken as 3 times the standard deviation divided by the square root of the number of observations
  • we also adopted the Speedup-Test to declare whether the observed speedup is statistically significant
  • open-source R-based tool that uses Student's t-test and the Wilcoxon-Mann-Whitney test [Touati et al. 2013]

  11. N-Body Performance Evaluation on XeonPhi
  • Independent executions: v0s/v0d (on host)
  [Figures: execution time and speedup vs. number of threads (2-12), free vs. pinned, for versions v0d and v0s]
  • Not much difference between free and pinned
  • Performance gains seem to be much higher for double precision
  • However, single precision has a good sequential time → similar gains
  • Acceleration is very close no matter which version is used
  • Pinned: after 8 cores, the speedup gets distant from the ideal

  12. N-Body Performance Evaluation on XeonPhi
  • Independent executions: v0s/v0d (on device)
  • Native execution model
  [Figure: speedup vs. number of threads for v0d and v0s; vertical lines mark the # of physical cores (57) and the # of hw threads (228)]
  • Speedup departs from the ideal well before the first vertical line
  • inadequate load / inability to schedule this # of cores
  • irregular acceleration
  • maximum speedup obtained using 224 threads

  13. N-Body Performance Evaluation on XeonPhi
  • Offloading overhead to the XeonPhi - v0s vs v1s
  • v1s offloads the computation to the XeonPhi; the host remains idle during the execution in the accelerator (similar to CUDA).
  • We evaluate here the difference between offloading the computation (v1s) and the previous version (v0s).
  [Figure: execution times of v0s, v1s, and their difference v0s-v1s]
  • Only executions with 228 threads (max)
  • v1s is 11.19% faster
  • scheduling decisions taken on the host?
  • Note: best v0s was with 224 threads
