On the Development and Optimization of Hybrid Parallel Codes for Integral Equation Formulations
7th European Conference on Antennas and Propagation (EuCAP 2013), 08-12 April 2013
Swedish Exhibition & Congress Centre, Gothenburg, Sweden
Alejandro Álvarez-Melcón, Fernando D. Quesada, Domingo Giménez, Carlos Pérez-Alcaraz, Tomás Ramírez, and José Ginés Picón
alejandro.alvarez@upct.es; domingo@um.es
Universidad Politécnica de Cartagena / Universidad de Murcia
ETSI Telecomunicación / Facultad de Informática
Dpto. Tecnologías de la Información y las Comunicaciones / Dpto. de Informática y Sistemas
Signal Theory and Communications
Álvarez/Giménez et al. (UPCT/UMU), EuCAP 2013, 8-12 April 2013

Outline
1. Introduction and motivation
2. Computation of Green’s functions on hybrid systems
3. Parallelization in CC-NUMA at MoM level of a VIE technique
4. Autotuning parallel codes
5. Conclusions

Introduction and motivation
Motivation of the work
1. High interest in the development of full-wave techniques based on Integral Equation formulations for the analysis of microwave components and antennas.
2. Need for efficient software tools that allow optimization of complex devices in real time.
3. As devices become more complex, computational time increases as the cube of the problem size.
Identification of bottlenecks
Two important elements in integral equation formulations:
1. Calculation of Green’s functions inside waveguides may be slow due to the low convergence rate of the series (images, modes).
2. In Volume Integral Equation formulations, the size of the MoM matrices increases as N^3.

Introduction and motivation
Objectives of the work
1. Increase efficiency using parallel computing.
2. Several hybrid-heterogeneous parallelism strategies are proposed in this context.
Strategies explored
1. At a low level, application of hybrid parallelism (MPI+OpenMP+CUDA) to the computation of Green’s functions in rectangular waveguides.
2. At a higher level, combination of two-level parallelism (OpenMP and multithreaded MKL routines) in cc-NUMA systems, applied to accelerate MoM solutions in the VIE formulation.
3. Possibilities for using autotuning strategies.

Computation of Green’s functions on hybrid systems
Hybrid parallelism
1. MPI+OpenMP, OpenMP+CUDA and MPI+OpenMP+CUDA routines are developed to accelerate the calculation of 2D waveguide Green’s functions.
As the pseudocode shows, p MPI processes are started. In addition, h + g threads run inside each process. Threads 0 to h − 1 work on the CPU (OpenMP, OMP). The remaining threads, from h to h + g − 1, work on the GPU by calling CUDA kernels.
For each MPI process P_k, 0 ≤ k < p:
    omp_set_num_threads(h + g)
    for i = k·m/p to (k + 1)·m/p − 1 do
        node = omp_get_thread_num()
        if node < h then
            Compute with OpenMP thread
        else
            Call to CUDA kernel
        end if
    end for
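The split described on this slide can be illustrated with a small Python sketch. It only computes which process, thread and device would handle each work item; it does not call MPI, OpenMP or CUDA. Several things here are assumptions and not taken from the talk: m is read as the total number of work items, the function names block_range and assign_work are hypothetical, the block bounds use integer division, and items are handed to threads in a static round-robin order.

```python
def block_range(k, p, m):
    """Half-open range [start, stop) of work items for process k of p.

    Integer division is an assumption; it keeps every item assigned
    exactly once even when m is not a multiple of p.
    """
    return (k * m) // p, ((k + 1) * m) // p


def assign_work(p, h, g, m):
    """Return a list of (item, process, thread, device) tuples.

    Threads 0..h-1 are CPU workers; threads h..h+g-1 drive a GPU.
    """
    plan = []
    for k in range(p):
        start, stop = block_range(k, p, m)
        for i in range(start, stop):
            # Static round-robin over the h + g threads (assumed schedule).
            thread = (i - start) % (h + g)
            device = "cpu" if thread < h else "gpu"
            plan.append((i, k, thread, device))
    return plan


if __name__ == "__main__":
    for entry in assign_work(p=2, h=2, g=1, m=12):
        print(entry)
```

In a real run the thread index would come from omp_get_thread_num() and the schedule would be whatever the OpenMP runtime applies, so the round-robin mapping above is only one possible outcome.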

Computation of Green’s functions on hybrid systems
Routines developed
p \ h+g    1+0    h+0        0+g         h+g
1          SEQ    OMP        CUDA        OMP+CUDA
p          MPI    MPI+OMP    MPI+CUDA    MPI+OMP+CUDA
Computational systems tested
Saturno is a NUMA system with 24 Intel Xeon cores at 1.87 GHz and 32 GB of shared memory, plus an NVIDIA Tesla C2050 with a total of 448 CUDA cores, 2.8 GB and 1.15 GHz.
Marte and Mercurio are AMD Phenom II X6 1075T (hexa-core) machines at 3 GHz, with 15 GB (Marte) and 8 GB (Mercurio), plus an NVIDIA GeForce GTX 590 with two devices of 512 CUDA cores each; the machines are connected in a homogeneous cluster.
Luna is an Intel Core 2 Quad Q6600 at 2.4 GHz with 4 GB, plus an NVIDIA GeForce 9800 GT with a total of 112 CUDA cores.
All of them are connected in a heterogeneous cluster.
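One way to read the table of routines is as a lookup from the three parameters (p, h, g) to a routine label. The sketch below encodes that reading in Python. The function name routine_name is hypothetical, and treating any run with both h ≥ 1 and g ≥ 1 as OMP+CUDA is an assumption about how the h+g column is meant.

```python
def routine_name(p, h, g):
    """Label the routine variant for p processes, h CPU threads, g GPU threads."""
    if p < 1 or h < 0 or g < 0 or h + g < 1:
        raise ValueError("need p >= 1 and at least one thread per process")
    if g == 0:
        local = "OMP" if h > 1 else ""   # columns 1+0 and h+0
    elif h == 0:
        local = "CUDA"                   # column 0+g
    else:
        local = "OMP+CUDA"               # column h+g (assumed reading)
    parts = [label for label in ("MPI" if p > 1 else "", local) if label]
    return "+".join(parts) or "SEQ"


if __name__ == "__main__":
    for config in [(1, 1, 0), (1, 4, 0), (1, 0, 3), (4, 2, 1)]:
        print(config, routine_name(*config))
```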

Computation of Green’s functions on hybrid systems
Comparison between use of CPU versus use of GPU
Test of computational speed when CPUs or GPUs are used.
The CPU version uses a number of threads equal to the number of cores.
The plot is presented as a function of the problem size (#images, #points).
S = T(#threads=#cores) / T(#kernels=3).
S > 1 means the GPU is preferred over the CPU.
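The ratio S on this slide can be evaluated per problem size once both execution times are known. The Python sketch below shows the arithmetic only. The function names and the timing values are placeholders invented for illustration; they are not measurements from the talk.

```python
def speedup(t_cpu, t_gpu):
    """S = T(#threads=#cores) / T(#kernels=3)."""
    return t_cpu / t_gpu


def preferred_device(t_cpu, t_gpu):
    """S > 1 means the GPU run was faster than the all-cores CPU run."""
    return "GPU" if speedup(t_cpu, t_gpu) > 1.0 else "CPU"


if __name__ == "__main__":
    # (#images, #points) -> (CPU seconds, GPU seconds); placeholder numbers.
    placeholder_timings = {
        (100, 100): (0.02, 0.05),
        (1000, 1000): (4.0, 1.0),
    }
    for size, (t_cpu, t_gpu) in placeholder_timings.items():
        print(size, round(speedup(t_cpu, t_gpu), 2), preferred_device(t_cpu, t_gpu))
```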

Computation of Green’s functions on hybrid systems
Comparison between GPU and optimum parameters
Selecting the optimum values for p, h and g produces lower execution times than blind GPU use.
The plot is presented as a function of the problem size (#images, #points).
S = T(#kernels=3) / T(lowest).
S > 1 means the GPU-only run is slower than the lowest time found.
A speed-up of two is obtained for large problems using the optimum parameters.
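Finding the values of p, h and g with the lowest execution time can be sketched as an exhaustive search over a small grid. The slides do not say how the optimum was located, so the search strategy is an assumption. The function names are hypothetical, and toy_measure is an invented cost model standing in for a real timed run.

```python
import itertools


def best_configuration(measure, p_values, h_values, g_values):
    """Try every (p, h, g) and return ((p, h, g), time) with the lowest time."""
    best = None
    for p, h, g in itertools.product(p_values, h_values, g_values):
        if h + g == 0:
            continue  # a process needs at least one thread
        elapsed = measure(p, h, g)
        if best is None or elapsed < best[1]:
            best = ((p, h, g), elapsed)
    return best


def toy_measure(p, h, g):
    """Invented cost model (not from the talk): work / rate + overheads."""
    work = 120.0
    rate = p * (h * 1.0 + g * 4.0)        # assume one GPU thread = 4 CPU threads
    overhead = 0.5 * (p - 1) + 0.2 * g    # assumed communication and launch costs
    return work / rate + overhead


if __name__ == "__main__":
    config, t_lowest = best_configuration(toy_measure, [1, 2], [0, 2, 4], [0, 1, 3])
    t_gpu_only = toy_measure(1, 0, 3)     # the "#kernels=3" reference run
    print(config, t_lowest, t_gpu_only / t_lowest)
```

In practice measure would wrap a timed execution of the routine for one problem size, and the search would be repeated per size, since the best configuration may change with the number of images and points.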

Computation of Green’s functions on hybrid systems
Comparison homogeneous - heterogeneous cluster
Combining nodes with different computational speeds, different numbers of cores and different GPUs produces an additional reduction of the execution time.
Different values of p, h and g are used for different nodes.
The plot is presented as a function of the problem size (#images, #points).
S = T(#kernels=3*#nodes) / T(lowest).
Important reduction of the execution time with the heterogeneous cluster.
Execution time is closer to the lowest experimental time.
