Large-scale Ultrasound Simulations Using the Hybrid OpenMP/MPI Decomposition - PowerPoint PPT Presentation


SLIDE 1

Large-scale Ultrasound Simulations Using the Hybrid OpenMP/MPI Decomposition

Jiri Jaros*, Vojtech Nikl*, Bradley E. Treeby†

*Department of Computer Systems, Brno University of Technology

†Department of Medical Physics, University College London

SLIDE 2


Outline

  • Ultrasound simulations in soft tissues
    – What is ultrasound
    – Why do we need ultrasound simulations
    – Factors for ultrasound simulations
    – What is the challenge
  • k-Wave toolbox
    – Acoustic model
    – Spectral methods
  • Large-scale ultrasound simulations
    – 1D domain decomposition with pure MPI
    – 2D hybrid decomposition with OpenMP/MPI
  • Achieved results
    – FFTW scaling
    – Strong scaling
  • Conclusions and open questions

Jiri Jaros: Large-scale Ultrasound Simulations…

SLIDE 3

What Is Ultrasound?

Longitudinal (compressional) acoustic waves with a frequency above 20 kHz.

SLIDE 4


Ultrasound Simulation

  • Photoacoustic imaging
  • Aberration correction
  • Training ultrasonographers
  • Ultrasound transducer design
  • Treatment planning (HIFU)


SLIDE 5


HIFU Treatment Planning

Scan
  • CT or MR scan of the patient

Parameter setting
  • Scan segmentation (bones, fat, skin, …)
  • Medium parameters (density, sound speed)

Simulation
  • Ultrasound propagation simulation
  • Dosage, focus position, aberration correction

Operation
  • Application of the ultrasound treatment
SLIDE 6


Factors for Ultrasound Simulation

  • Nonlinear wave propagation
    – Production of harmonics
    – Energy dependent
  • Heterogeneous medium
    – Dispersion
    – Reflection
  • Absorbing medium
    – Frequency dependent
    – Medium dependent

SLIDE 7


How Big Do the Simulations Need to Be?

Speed of sound in water ≈ 1500 m/s.
At 1 MHz, 20 cm ≈ 133 λ; at 10 MHz, 20 cm ≈ 1333 λ.
At 15 grid points per wavelength, each 3D matrix is roughly 30 TB!

| Modeling Scenario | Source Freq [MHz] | Source Type | Nonlinear Harmonics | Max Freq [MHz] | Domain Size [mm] (X × Y × Z) | Domain Size [Wavelengths] (X × Y × Z) |
|---|---|---|---|---|---|---|
| Diagnostic ultrasound: abdominal curvilinear transducer | 3 | Tone burst | 5 | 18 | 150 × 80 × 25 | 1800 × 960 × 300 |
| Diagnostic ultrasound: linear transducer | 10 | Tone burst | 5 | 60 | 50 × 80 × 30 | 2000 × 3200 × 1200 |
| Transrectal prostate HIFU (minimal cavitation) | 4 | CW | 15 | 64 | 80 × 60 × 20 | 3413 × 2560 × 853 |
| MR-guided HIFU (minimal cavitation) | 1.5 | CW | 10 | 15 | 250 × 250 × 150 | 2500 × 2500 × 1500 |
| Histotripsy (intense cavitation) | 1 | CW | 50 | 50 | 250 × 250 × 150 | 8333 × 8333 × 5000 |

SLIDE 8


Acoustic Model for Soft Tissues

  • k-Wave Toolbox (http://www.k-wave.org)
    – 3,385 registered users
  • Full-wave 3D acoustic model
    – including nonlinearity
    – heterogeneities
    – power law absorption
  • Solves coupled first-order equations
    – momentum conservation
    – mass conservation
    – pressure–density relation
    – absorption term

SLIDE 9


k-space Pseudospectral Method in C++

  • Technique
    – Medium properties are generated by Matlab scripts from a medical scan
    – The input signal is injected by a transducer
    – Sensor data are collected as raw time series or aggregated acoustic values
    – Post-processing and visualization are handled by Matlab
  • Operations executed in every time step
    – 6 forward 3D FFTs
    – 8 inverse 3D FFTs
    – 3+3 forward and inverse 1D FFTs in the case of non-staggered velocity
    – About 100 element-wise matrix operations (multiplication, addition, …)
  • Global data set
    – 14 + 3 (scratch) + 3 (unstaggering) real 3D matrices
    – 3+3 complex 3D matrices
    – 6 real 1D vectors
    – 6 complex 1D vectors
    – Sensor mask, source mask, source input
    – 0–20 real buffers for aggregated quantities (max, min, rms, max_all, min_all)

SLIDE 10


K-Wave++ Toolbox: Distributed 1D Decomposition

  • Implementation
    – C/C++ with MPI parallelization
    – MPI-FFTW library – an efficient way to calculate distributed 3D FFTs
    – HDF5 library – hierarchical data format for parallel I/O
  • Data decomposition
    – Data decomposed along the Z dimension
    – Data distributed when read using parallel I/O
    – Frequency-domain operations work on transposed data to reduce the number of global communications (3D transpositions)

SLIDE 11


K-Wave++ Toolbox Strong Scaling (1D decomposition)

[Figure: Strong scaling of ultrasound simulations. Time per timestep [s] (log scale, 0.01–100) vs. configuration, from sequential up to 1024 cores (128 nodes, 8 cores per node), for grid sizes from 128×128×128 up to 4096×2048×2048. The problem size remains constant as the number of cores is increased.]

SLIDE 12
Scalability Problem

  • 1D decomposition
    – The number of cores is limited by the largest dimension
    – It makes some simulations run for too long
    – It makes some simulations not fit into memory
    + It requires less communication
  • 2D decomposition
    + The number of cores is limited by the product of the two largest dimensions
    + It is enough for anything we could think of running
    – It requires more communication and smaller messages
  • Example
    – A 4096³ matrix → ~256 GB in single precision
    – Say 32 matrices → 8 TB RAM in total
    – Max 4096 cores → 2 GB per core (Anselm: 2 GB, Fermi: 1 GB)

SLIDE 13

2D Hybrid Domain Decomposition

+ high core-count limit
+ only 1 MPI global transposition
+ fewer, larger MPI messages

SLIDE 14


FFT libraries strong scaling, Anselm supercomputer (1024³)

SLIDE 15


FFT libraries strong scaling, Fermi supercomputer (1024³)

SLIDE 16


Time distribution of hybrid FFT

SLIDE 17


Simulation scaling (Anselm)

[Figure: Time per timestep [ms] (log scale, 4–65536) vs. core count (16–2048). Curves for grid sizes 128, 256, 512, and 1024 (per side), each in Pure (pure MPI), Socket, and Node hybrid configurations.]

SLIDE 18


Simulation scaling (SuperMUC)

[Figure: Time per timestep [ms] (log scale, 4–8192) vs. core count (128–8192). Curves for grid sizes 128, 256, 512, and 1024 (per side), each in Pure (pure MPI), Socket, and Node hybrid configurations.]

SLIDE 19


Memory scaling (SuperMUC)

[Figure: Memory per core [MB] (log scale, 1–1024) vs. core count (128–8192). Curves for grid sizes 128, 256, 512, and 1024 (per side), each in Pure (pure MPI), Socket, and Node hybrid configurations.]

SLIDE 20


Conclusions

  • Clinically relevant results
    – To get clinically relevant simulations, we need grid sizes of at least 4096³ to 8192³ for 50k simulation timesteps
  • Two different implementations
    – 1D domain decomposition gives better results for small core counts
    – 2D domain decomposition works well on Anselm; however, there is room for improvement on SuperMUC
    – Memory scaling enables us to run much bigger simulations
  • Future work
    – Communication and synchronization reduction via overlapping

SLIDE 21


Our work has been supported by the following institutions and grants.

Questions and Comments


The project is financed from the SoMoPro II programme. The research leading to this invention has acquired a financial grant from the People Programme (Marie Curie action) of the Seventh Framework Programme of EU according to the REA Grant Agreement No. 291782. The research is further co-financed by the South-Moravian Region. This work reflects only the author’s view and the European Union is not liable for any use that may be made of the information contained therein. This work was also supported by the research project "Architecture of parallel and embedded computer systems", Brno University of Technology, FIT-S-14-2297, 2014-2016. This work was supported by the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070), funded by the European Regional Development Fund and the national budget of the Czech Republic via the Research and Development for Innovations Operational Programme, as well as the Czech Ministry of Education, Youth and Sports via the project Large Research, Development and Innovations Infrastructures (LM2011033). We acknowledge CINECA and the PRACE Summer of HPC project for the availability of high performance computing resources. The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer SuperMUC at Leibniz Supercomputing Centre (LRZ, www.lrz.de).