HD GP-GPU Systems for HPC Applications: Engines | SAR | RF Amps



SLIDE 1

Introduction Implementation

HD GP-GPU Systems for HPC Applications:

Engines | SAR | RF Amps Sergio Tafur†, & Christopher Kung‡

†Center for Computational Science | Section Head (Acting), Code 5594. ‡Productivity Enhancement, Technology Transfer and Training, On-Site at NRL. DISTRIBUTION A. Approved for public release: distribution unlimited.

GPU Technology Conference | April 2016

April 1, 2016

  • S. Tafur, & C. Kung | DoDI 5230.24: Distribution Statement A.

HD GP-GPU HPC

SLIDE 2

Introduction Implementation

1. Introduction
   • Applications: SAR | RF Amps | RDEs
   • Y Objective Distributed Architecture (Y.O.D.A.)

2. Implementation
   • Benchmarked Performance
   • Challenges

SLIDE 4

Introduction Implementation | SAR | RF Amps | RDEs | Y Objective Distributed Architecture

Applications

Synthetic Aperture Radar | RF Amps | RDEs

Synthetic Aperture Radar

Investigating the Use of GPU-Accelerated Nodes for SAR Image Formation, IEEE Int. Conf. on Cluster Computing and Workshops, 1-8, 31, 2009.

SLIDE 5

Introduction Implementation | SAR | RF Amps | RDEs | Y Objective Distributed Architecture

Applications

SAR | Radio Frequency Amplifiers | RDEs

RF Amplifiers

Simulation of Klystrons With Slow and Reflected Electrons Using Large-Signal Code TESLA, IEEE Transactions on Electron Devices, 54(6), 1555-1561, 2007.

SLIDE 6

Introduction Implementation | SAR | RF Amps | RDEs | Y Objective Distributed Architecture

Applications

SAR | RF Amps | Rotating Detonation Engines

Rotating Detonation Engines

Thermodynamic Modeling of a Rotating Detonation Engine, AIAA Paper 2011-803, 49th AIAA Aerospace Sciences Meeting, 2011.

SLIDE 7

Introduction Implementation | SAR | RF Amps | RDEs | Y Objective Distributed Architecture

1. Introduction
   • Applications: SAR | RF Amps | RDEs
   • Y Objective Distributed Architecture (Y.O.D.A.)

2. Implementation
   • Benchmarked Performance
   • Challenges

SLIDE 8

Introduction Implementation | SAR | RF Amps | RDEs | Y Objective Distributed Architecture

Y Objective Distributed Architecture

Configuration | FDR Infiniband Fat-Tree(ish)

SLIDE 10

Introduction Implementation | SAR | RF Amps | RDEs | Y Objective Distributed Architecture

Y Objective Distributed Architecture

Hardware | Exxact Quantum TXR410-768R

Motherboard: 2x Intel E5-2600 v3, 4x PLX PEX 8747 PCIe switches
Configuration: 8x GTX Titan Black, 128 GB DDR4 memory

http://www.tyan.com/datasheets/DataSheet_FT77A-B7059.pdf http://tyan.com/manuals/FT77C-B7079_QIG.pdf

SLIDE 12

Introduction Implementation | SAR | RF Amps | RDEs | Y Objective Distributed Architecture

Y Objective Distributed Architecture

Hardware | NVIDIA GK110

GTX TITAN Black GPU engine specs:
  • 2880 CUDA cores / 960 DP units
  • 889 MHz base clock
  • 980 MHz boost clock
GTX TITAN Black memory specs:
  • 7.0 Gbps memory clock
  • 6144 MB standard memory config
  • 336 GB/s memory bandwidth

https://forums.geforce.com/default/topic/531846/geforce-gtx-titan-is-here-/ http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-black/specifications

SLIDE 13

Introduction Implementation | SAR | RF Amps | RDEs | Y Objective Distributed Architecture

Y Objective Distributed Architecture

Hardware | NVIDIA GK110

NVIDIA | GK110 die:
  • 15 Streaming Multiprocessor (SMX) units
  • six 64-bit memory controllers
  • 3 SP cores per DP unit

http://international.download.nvidia.com/pdf/kepler/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf

SLIDE 14

Introduction Implementation | SAR | RF Amps | RDEs | Y Objective Distributed Architecture

Y Objective Distributed Architecture

Hardware | NVIDIA GK110

NVIDIA GK110 GPU, 15 SMX @ 250 W:
  • eight GPUs w/ 2880 SP cores @ 0.98/1.12 GHz | SP perf: 8 x 2.8 TFlops | SP eff: 11.3 GFlops/W
  • eight GPUs w/ 960 DP units @ 0.98/1.12 GHz | DP perf: 8 x 1.11 TFlops | DP eff: 4.44 GFlops/W

http://www.nvidia.com/content/pdf/kepler/nvidia-kepler-gk110-architecture-whitepaper.pdf http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-black/specifications

SLIDE 15

Introduction Implementation | SAR | RF Amps | RDEs | Y Objective Distributed Architecture

Y Objective Distributed Architecture

Hardware | NVIDIA GK110

NVIDIA | GK110 die:
  • 15 Streaming Multiprocessor (SMX) units
  • 1536 kB L2 cache
  • six 64-bit memory controllers
  • 3 SP cores per DP unit

http://international.download.nvidia.com/pdf/kepler/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf

SLIDE 16

Introduction Implementation | SAR | RF Amps | RDEs | Y Objective Distributed Architecture

Y Objective Distributed Architecture

Hardware | NVIDIA GK110

NVIDIA | GK110 SMX unit:
  • 192 SP cores, 64 DP units
  • 64 kB on-chip memory, configurable as:
    - 48 kB shared / 16 kB L1
    - 16 kB shared / 48 kB L1
    - 32 kB shared / 32 kB L1

http://international.download.nvidia.com/pdf/kepler/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf

SLIDE 17

Introduction Implementation | SAR | RF Amps | RDEs | Y Objective Distributed Architecture

Y Objective Distributed Architecture

Expected Performance

GTX Titan Black
  • Single Precision: 2.8 TFlops
  • Double Precision: 922 GFlops
Server
  • Single Precision: 22.4 TFlops
  • Double Precision: 7.4 TFlops
Y.O.D.A.
  • Single Precision: 1.4 PFlops
  • Double Precision: 477 TFlops
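The peak figures above follow directly from the unit counts and clocks on the preceding slides. A quick sanity check (a sketch, assuming the deck's convention of one operation per unit per cycle at the 980 MHz boost clock; the DP result lands near, but not exactly on, the 922 GFlops quoted, suggesting a slightly lower effective clock was used there):

```python
# Peak throughput from unit counts x clock, one op per unit per cycle.
SP_CORES = 2880          # CUDA cores per GTX Titan Black
DP_UNITS = 960           # double-precision units per GPU
BOOST_GHZ = 0.980        # boost clock, GHz
GPUS_PER_SERVER = 8

sp_gpu = SP_CORES * BOOST_GHZ            # GFLOPS, SP, one GPU (~2.8 TFLOPS)
dp_gpu = DP_UNITS * BOOST_GHZ            # GFLOPS, DP, one GPU (~0.94 TFLOPS)
sp_server = sp_gpu * GPUS_PER_SERVER     # ~22.6 TFLOPS SP per server
dp_server = dp_gpu * GPUS_PER_SERVER     # ~7.5 TFLOPS DP per server

print(f"GPU: SP {sp_gpu / 1e3:.2f} TFLOPS, DP {dp_gpu:.0f} GFLOPS")
print(f"Server: SP {sp_server / 1e3:.1f} TFLOPS, DP {dp_server / 1e3:.1f} TFLOPS")
```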

SLIDE 18

Introduction Implementation Benchmarks Challenges

1. Introduction
   • Applications: SAR | RF Amps | RDEs
   • Y Objective Distributed Architecture (Y.O.D.A.)

2. Implementation
   • Benchmarked Performance
   • Challenges

SLIDE 19

Introduction Implementation Benchmarks Challenges

Y Objective Distributed Architecture

Benchmarked Performance: 8 GPUs, cuBLAS - S | D | C | Z - GEMM

SLIDE 20

Introduction Implementation Benchmarks Challenges

Y Objective Distributed Architecture

Benchmarked Performance: 8 GPUs MAGMA - S | D | C | Z - GEMM

SLIDE 21

Introduction Implementation Benchmarks Challenges

Y Objective Distributed Architecture

Benchmarked Performance: 8 GPUs MAGMA - S | D | C | Z - GESV

SLIDE 22

Introduction Implementation Benchmarks Challenges

Y Objective Distributed Architecture

Benchmarked Performance: 2 GPUs MAGMA - S | D | C | Z - GESV

SLIDE 23

Introduction Implementation Benchmarks Challenges

Y Objective Distributed Architecture

Benchmarked Performance: 4 GPUs MAGMA - S | D | C | Z - GESV

SLIDE 24

Introduction Implementation Benchmarks Challenges

Y Objective Distributed Architecture

Benchmarked Performance: 8 GPUs MAGMA - S | D | C | Z - GESV

SLIDE 25

Introduction Implementation Benchmarks Challenges

Y Objective Distributed Architecture

Benchmarked Performance: 2/4/8 GPUs MAGMA - Z - GESV

SLIDE 26

Introduction Implementation Benchmarks Challenges

Y Objective Distributed Architecture

Benchmarked Single GPU Performance: HPL Top 500 Run

SLIDE 27

Introduction Implementation Benchmarks Challenges

Y Objective Distributed Architecture

Benchmarked Aggregate Performance: HPL Top 500 Run

SLIDE 28

Introduction Implementation Benchmarks Challenges

1. Introduction
   • Applications: SAR | RF Amps | RDEs
   • Y Objective Distributed Architecture (Y.O.D.A.)

2. Implementation
   • Benchmarked Performance
   • Challenges

SLIDE 29

Introduction Implementation Benchmarks Challenges

Y Objective Distributed Architecture

Power Distribution

  • PDUs (~30 kW/rack): two 50 A (208 V) 3-phase
  • Overloading breakers: (2+1) 2.4 kW power supplies
  • Load balancing: 1 TXR410-768R power supply per phase
  • Max power supplies: five per 15 A breaker

SLIDE 30

Introduction Implementation Benchmarks Challenges

Y Objective Distributed Architecture

HPL Benchmark Matrix Decomposition

CPU Affinity | nvidia-smi topo -m | numactl --physcpubind=<ids>

        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7   mlx4_0  CPU Affinity
GPU0     X     PIX    PHB    PHB    SOC    SOC    SOC    SOC    SOC     0-5,12-17
GPU1    PIX     X     PHB    PHB    SOC    SOC    SOC    SOC    SOC     0-5,12-17
GPU2    PHB    PHB     X     PIX    SOC    SOC    SOC    SOC    SOC     0-5,12-17
GPU3    PHB    PHB    PIX     X     SOC    SOC    SOC    SOC    SOC     0-5,12-17
GPU4    SOC    SOC    SOC    SOC     X     PIX    PHB    PHB    PHB     6-11,18-23
GPU5    SOC    SOC    SOC    SOC    PIX     X     PHB    PHB    PHB     6-11,18-23
GPU6    SOC    SOC    SOC    SOC    PHB    PHB     X     PIX    PHB     6-11,18-23
GPU7    SOC    SOC    SOC    SOC    PHB    PHB    PIX     X     PHB     6-11,18-23
mlx4_0  SOC    SOC    SOC    SOC    PHB    PHB    PHB    PHB     X
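Given this topology, processes driving GPUs 0-3 belong on socket 0 (cores 0-5,12-17) and those driving GPUs 4-7 on socket 1 (cores 6-11,18-23); numactl --physcpubind does the pinning at launch. A Python sketch of the same idea using only the standard library (Linux-only; the core sets are taken from the table above, and the helper name is ours, not the deck's):

```python
import os

# Socket-local core sets from the `nvidia-smi topo -m` table above.
GPU_AFFINITY = {
    0: {0, 1, 2, 3, 4, 5, 12, 13, 14, 15, 16, 17},    # GPUs 0-3
    1: {6, 7, 8, 9, 10, 11, 18, 19, 20, 21, 22, 23},  # GPUs 4-7
}

def pin_to_gpu_socket(gpu_id):
    """Restrict the current process to cores NUMA-local to gpu_id."""
    wanted = GPU_AFFINITY[0] if gpu_id < 4 else GPU_AFFINITY[1]
    # Intersect with what this host actually allows, so the sketch
    # also runs on machines with fewer cores.
    allowed = wanted & os.sched_getaffinity(0)
    if allowed:
        os.sched_setaffinity(0, allowed)
    return os.sched_getaffinity(0)
```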

SLIDE 31

Introduction Implementation Benchmarks Challenges

Y Objective Distributed Architecture

HPL Benchmark Matrix Decomposition

Performance   Problem Size   Block Size   Rows   Columns
Bad           439,547        768          15     32
Good          491,520        384          24     20
Best          549,504        128          24     20

  • Matrix distribution close to a perfect square | P×Q: 8 GPUs × 60 motherboards
  • Large problem size | N = √(GPUMem × P×Q / 8) ≈ 621,729 (GPUMem in bytes; 8 bytes per double-precision entry)
  • Small block sizes that are a factor of N
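The 621,729 figure can be reproduced from the usual HPL sizing rule: pick N so that the N×N double-precision matrix roughly fills aggregate GPU memory. A sketch (assuming 6 GiB per Titan Black and the 8 GPUs × 60 motherboards from the slide):

```python
import math

BYTES_PER_DOUBLE = 8
GPU_MEM_BYTES = 6 * 2**30     # 6 GiB per GTX Titan Black
GPUS = 8 * 60                 # P x Q: 8 GPUs x 60 motherboards

# Largest N such that the N x N double-precision HPL matrix fits
# in aggregate GPU memory.
n_max = math.isqrt(GPUS * GPU_MEM_BYTES // BYTES_PER_DOUBLE)
print(n_max)   # 621729, matching the slide
```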

SLIDE 32

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

Examples

For CPU Operations:

import time
import numpy as np

dtype = np.complex64          # or np.complex128
axes = (0, 1)
shape = (1024, 1024)          # or 2048, 4096, 8192
data = np.random.normal(size=shape).astype(dtype)

t_start = time.time()
np.fft.fftn(data, axes=axes)
t_cpu_fft = time.time() - t_start

t_start = time.time()
np.fft.fftshift(data, axes=axes)
t_cpu_shift = time.time() - t_start

t_start = time.time()
data_ref = np.fft.fftn(data, axes=axes)
data_ref = np.fft.fftshift(data_ref, axes=axes)
t_cpu_all = time.time() - t_start

http://reikna.publicfields.net/en/latest/api/computations.html#

SLIDE 33

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

For CPU Operations @ 1024x1024:

import time
import numpy as np
# (... data prep ...)

t_start = time.time()
np.fft.fftn(data, axes=axes)
t_cpu_fft = time.time() - t_start

t_start = time.time()
np.fft.fftshift(data, axes=axes)
t_cpu_shift = time.time() - t_start

t_start = time.time()
data_ref = np.fft.fftn(data, axes=axes)
data_ref = np.fft.fftshift(data_ref, axes=axes)
t_cpu_all = time.time() - t_start

t_cpu_fft: 0.0365 | t_cpu_shift: 0.0044 | t_cpu_all: 0.0504

http://reikna.publicfields.net/en/latest/api/computations.html#
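One caveat on timings like these: single-shot time.time() deltas are noisy at millisecond scales. A small helper (not from the deck) that repeats the measurement and keeps the best run gives steadier numbers:

```python
import time

def bench(fn, *args, reps=5):
    """Best-of-reps wall-clock time of fn(*args), in seconds."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best
```

For example, t_cpu_fft = bench(np.fft.fftn, data) in place of the one-shot timing.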

SLIDE 34

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

For GPU Operations:

import time
import numpy as np
from reikna.cluda import any_api, cuda_api, find_devices
from reikna.fft import FFT, FFTShift
import reikna.cluda.dtypes as dtypes
from reikna.core import Transformation, Parameter, Annotation, Type
# (... data prep: create the CLUDA thread `thr` and the `data` array ...)

fft = FFT(data, axes=axes)
fftc = fft.compile(thr)
shift = FFTShift(data, axes=axes)
shiftc = shift.compile(thr)
data_dev = thr.to_device(data)
# (... calculate and get data ...)

http://reikna.publicfields.net/en/latest/api/computations.html#

SLIDE 35

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

For GPU Operations @ 1024x1024:

# (... time & np imports ...)
from reikna.cluda import any_api, cuda_api, find_devices
from reikna.fft import FFT, FFTShift
import reikna.cluda.dtypes as dtypes
from reikna.core import Transformation, Parameter, Annotation, Type
# (... data prep ...)
# (... compile GPU funcs & send data_dev ...)

t_start = time.time()
fftc(data_dev, data_dev)
thr.synchronize()
t_gpu_fft = time.time() - t_start

t_start = time.time()
shiftc(data_dev, data_dev)
thr.synchronize()
t_gpu_shift = time.time() - t_start

t_start = time.time()
fftc(data_dev, data_dev)
shiftc(data_dev, data_dev)
thr.synchronize()
t_gpu_separate = time.time() - t_start

data_gpu = data_dev.get()

t_gpu_fft: 0.0012 | t_gpu_shift: 0.0002 | t_gpu_separate: 0.0013

http://reikna.publicfields.net/en/latest/api/computations.html#

SLIDE 36

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

For GPU Operations @ 1024x1024:

# (... time & np imports, reikna imports as before ...)
# (... data prep ...)
# (... compile GPU funcs & send data_dev ...)

res_dev = thr.empty_like(data_dev)
shift_tr = fftshift(data, axes=axes)   # transformation-based shift (reikna examples)
fft2 = fft.parameter.output.connect(shift_tr, shift_tr.input,
                                    new_output=shift_tr.output)
fft2c = fft2.compile(thr)

t_start = time.time()
fft2c(res_dev, data_dev)
thr.synchronize()
t_gpu_combined = time.time() - t_start

data_gpu2 = res_dev.get()   # the combined result lands in res_dev

t_gpu_fft: 0.0012 | t_gpu_shift: 0.0002 | t_gpu_combined: 0.0010

https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 37

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

For GPU Operations @ 4096x4096:

(... same combined FFT+shift code as on the previous slide ...)

t_cpu_fft: 0.7720 | t_cpu_shift: 0.0645
t_gpu_fft: 0.0151 | t_gpu_shift: 0.0032
t_gpu_combined: 0.01481

https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 38

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

For GPU Operations @ 4096x4096:

(... same combined FFT+shift code as on the previous slides ...)

speedup_fft: 51.2420 | speedup_shift: 19.7710
separate_speedup: 61.5952 | combined_speedup: 74.9029

https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 39

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

For GPU Operations @ 4096x4096 on yoda:

(... same combined FFT+shift code as on the previous slides ...)

t_cpu_fft: 1.3697 | t_cpu_shift: 0.1343
t_gpu_fft: 0.01269 | t_gpu_shift: 0.0015
t_gpu_combined: 0.0128

https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 40

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

For GPU Operations @ 4096x4096 on yoda:

(... same combined FFT+shift code as on the previous slides ...)

speedup_fft: 107.9372 | speedup_shift: 86.0241
separate_speedup: 140.9891 | combined_speedup: 153.7552

https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 41

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

For GPU Operations @ 8192x8192 on yoda:

(... same combined FFT+shift code as on the previous slides ...)

speedup_fft: 118.3925 | speedup_shift: 94.6785
separate_speedup: 140.9891 | combined_speedup: 166.1680

https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 42

Introduction Implementation Benchmarks Challenges

Split-Step Schrödinger Eqn.

Position:
$i\hbar\,\partial_t \Psi(x,t) = \left[ -\frac{\hbar^2}{2m}\,\partial_{xx} + V(x,t) \right] \Psi(x,t)
\;\Rightarrow\; \Psi(x, t+\delta t) = \Psi(x,t)\, e^{-\frac{i}{\hbar} V(x)\,\delta t}$

Momentum:
$i\hbar\,\partial_t \tilde{\Psi}(k,t) = \left[ \frac{\hbar^2}{2m} k^2 + V(i\partial_k) \right] \tilde{\Psi}(k,t)
\;\Rightarrow\; \tilde{\Psi}(k, t+\delta t) = \tilde{\Psi}(k,t)\, e^{-\frac{i\hbar}{2m} k^2\,\delta t}$

https://jakevdp.github.io/blog/2012/09/05/quantum-python/

SLIDE 44

Introduction Implementation Benchmarks Challenges

Split-Step Schrödinger Eqn.

https://jakevdp.github.io/blog/2012/09/05/quantum-python/

Advance position by half step:
$\Psi(x_n, t + \tfrac{\delta t}{2}) = \Psi(x_n, t)\, e^{-\frac{i}{\hbar} V(x_n) \frac{\delta t}{2}}$

Fourier transform:
$\tilde{\Psi}(k_m, t + \tfrac{\delta t}{2}) = \frac{\delta x}{\sqrt{2\pi}} \sum_{n=0}^{N-1} \Psi(x_n, t + \tfrac{\delta t}{2})\, e^{-i k_m x_n}$

Advance momentum by full step:
$\tilde{\Psi}(k_m, t + \tfrac{3\delta t}{2}) = \tilde{\Psi}(k_m, t + \tfrac{\delta t}{2})\, e^{-\frac{i\hbar}{2m} k_m^2\,\delta t}$

Inverse Fourier transform:
$\Psi(x_n, t + \tfrac{3\delta t}{2}) = \frac{\sqrt{2\pi}}{N\,\delta x} \sum_{m=0}^{N-1} \tilde{\Psi}(k_m, t + \tfrac{3\delta t}{2})\, e^{i k_m x_n}$

Advance position by half step:
$\Psi(x_n, t + 2\delta t) = \Psi(x_n, t + \tfrac{3\delta t}{2})\, e^{-\frac{i}{\hbar} V(x_n) \frac{\delta t}{2}}$
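The steps above map almost line-for-line onto numpy's FFT. A minimal sketch (not the deck's code; the grid size, wave packet, and harmonic potential are illustrative choices), evolving one step and preserving the norm as a unitary update should:

```python
import numpy as np

def split_step(psi, V, dx, dt, hbar=1.0, m=1.0):
    """One split-operator update: half step in V, full step in k, half step in V."""
    k = 2 * np.pi * np.fft.fftfreq(len(psi), d=dx)     # momentum grid k_m
    half_v = np.exp(-1j * V * dt / (2 * hbar))         # e^{-(i/hbar) V dt/2}
    full_k = np.exp(-1j * hbar * k**2 * dt / (2 * m))  # e^{-(i hbar/2m) k^2 dt}
    psi = half_v * psi          # advance position by half step
    psi_k = np.fft.fft(psi)     # Fourier transform
    psi_k *= full_k             # advance momentum by full step
    psi = np.fft.ifft(psi_k)    # inverse Fourier transform
    return half_v * psi         # advance position by half step

# Normalized Gaussian packet in a harmonic potential (illustrative setup).
N, dx, dt = 1024, 0.1, 0.01
x = (np.arange(N) - N // 2) * dx
psi = np.exp(-x**2 + 2j * x)
psi /= np.sqrt(np.sum(np.abs(psi)**2) * dx)
V = 0.5 * x**2
psi = split_step(psi, V, dx, dt)   # norm is conserved to rounding error
```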

SLIDE 51

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

For CPU Operations @ N = 2 ** 11:

# (numpy, matplotlib, scipy imports)
# (initialize Ψ(x_n,t), Ψ̃(k_m,t), and V(x_n))
# (time stepping, iterated 50 times per frame)
self.dt = dt
if Nsteps > 0:
    self.psi_mod_x *= self.x_evolve_half
for i in range(Nsteps - 1):
    self.compute_k_from_x()        # FFT
    self.psi_mod_k *= self.k_evolve
    self.compute_x_from_k()        # IFFT
    self.psi_mod_x *= self.x_evolve
self.compute_k_from_x()            # FFT
self.psi_mod_k *= self.k_evolve
self.compute_x_from_k()            # IFFT
self.psi_mod_x *= self.x_evolve_half
self.compute_k_from_x()            # FFT
self.t += dt * Nsteps

cpu_cpu_iteration: 0.0126 | cpu_gpu_iteration: 0.3569 | gpu_gpu_iteration: 0.6801

https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 52

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

For CPU Operations @ N = 2 ** 22:

(... same time-stepping loop as on the previous slide ...)

cpu_cpu_iteration: 22.1887 | gpu_gpu_iteration: 9.0297 | gpu_gpu_speedup: 2.4573

https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 53

Introduction Implementation Benchmarks Challenges

FFTs, CPUs, & GPUs

For CPU Operations @ N = 2 ** 25:

(... same time-stepping loop as above ...)

cpu_cpu_iteration: 212.7478 | gpu_gpu_iteration: 58.6416 | gpu_gpu_speedup: 3.6279

https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 54

Introduction Implementation Benchmarks Challenges

Examples

Spectrogram

from pylab import *

dt = 0.0005
t = arange(0.0, 20.0, dt)
s1 = sin(2*pi*100*t)          # signal 1
s2 = 2*sin(2*pi*400*t)        # signal 2
mask = where(logical_and(t > 10, t < 12), 1.0, 0.0)
s2 = s2 * mask                # masked chirp
nse = 0.01*randn(len(t))      # noise
x = s1 + s2 + nse             # total signal
NFFT = 1024                   # window length
Fs = int(1.0/dt)              # sampling frequency

ax1 = subplot(211)
plot(t, x)
subplot(212, sharex=ax1)
# Pxx: (segments x freqs) array; freqs: frequency vector
Pxx, freqs, bins, im = specgram(x, NFFT=NFFT, Fs=Fs,
                                noverlap=900, cmap=cm.gist_heat)
show()

http://matplotlib.org/examples/pylab_examples/specgram_demo.html
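Under the hood, specgram is a windowed short-time FFT over overlapping segments. A numpy-only sketch of that computation (a hypothetical helper, not matplotlib's implementation):

```python
import numpy as np

def stft_power(x, nfft=256, hop=128, fs=2000):
    """Windowed short-time power spectrum: (frames x freqs) array."""
    win = np.hanning(nfft)
    starts = range(0, len(x) - nfft + 1, hop)
    frames = np.stack([x[s:s + nfft] * win for s in starts])
    Pxx = np.abs(np.fft.rfft(frames, axis=1))**2
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return Pxx, freqs

# A pure 100 Hz tone should peak in the frequency bin nearest 100 Hz.
fs = 2000
t = np.arange(0, 2.0, 1.0 / fs)
x = np.sin(2 * np.pi * 100 * t)
Pxx, freqs = stft_power(x, fs=fs)
peak_hz = freqs[Pxx.mean(axis=0).argmax()]
```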

SLIDE 55

Introduction Implementation Benchmarks Challenges

Examples

Spectrogram

(... same spectrogram script as on the previous slide, with one extra term ...)

import happy as hp
# ...
x = s1 + s2 + hp.smile(t) + nse   # total signal

http://matplotlib.org/examples/pylab_examples/specgram_demo.html
https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 56

Introduction Implementation Benchmarks Challenges

Examples

Spectrogram

(... same spectrogram script as above ...)

from reikna.cluda import any_api
from reikna.cluda import dtypes, functions
from reikna.fft import FFT
from reikna.core import Computation, Transformation, Parameter, Annotation, Type
from reikna.algorithms import Transpose
import reikna.transformations as transformations

http://matplotlib.org/examples/pylab_examples/specgram_demo.html
https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 57

(import ...)

dt = 0.0005
t = arange(0.0, 20.0, dt)
s1 = sin(2*pi*100*t)                 # Signal 1
s2 = 2*sin(2*pi*400*t)               # Signal 2
mask = where(logical_and(t > 10, t < 12), 1.0, 0.0)
s2 = s2 * mask                       # masked chirp
nse = 0.01*randn(len(t))             # Noise
x = s1 + s2 + hp.smile(t) + nse      # Total signal

NFFT = 1024                          # Window length
Fs = int(1.0/dt)                     # Sampling frequency

ax1 = subplot(211)
plot(t, x)
subplot(212, sharex=ax1)
# Pxx: segments x freqs array; freqs: frequency vector
Pxx, freqs, bins, im = specgram(x, NFFT=NFFT, Fs=Fs, noverlap=900,
                                cmap=cm.gist_heat)
show()

http://matplotlib.org/examples/pylab_examples/specgram_demo.html
https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 58

The Transformation API allows one to connect "transformations" to a core computation (here, the spectrogram):

rolling_frame_trf = rolling_frame(x, NFFT, noverlap, pad_to)

complex_dtype = dtypes.complex_for(x.dtype)
fft_arr = Type(complex_dtype, rolling_frame_trf.output.shape)
real_fft_arr = Type(x.dtype, rolling_frame_trf.output.shape)

window_trf = window(real_fft_arr, NFFT)
broadcast_zero_trf = transformations.broadcast_const(real_fft_arr, 0)
to_complex_trf = transformations.combine_complex(fft_arr)
amplitude_trf = transformations.norm_const(fft_arr, 1)
crop_trf = crop_frequencies(amplitude_trf.output)

fft = FFT(fft_arr, axes=(1,))

http://matplotlib.org/examples/pylab_examples/specgram_demo.html
https://github.com/fjarri/reikna/tree/develop/examples
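The transformation chain above amounts to: split the signal into overlapping frames, window each frame, FFT it, and take the squared magnitude of the positive frequencies. A pure-NumPy sketch of that pipeline may clarify what the connected transformations compute; the function names here are illustrative, not reikna's API:

```python
import numpy as np

def rolling_frames(x, NFFT, noverlap):
    """Split a 1-D signal into overlapping frames of length NFFT."""
    step = NFFT - noverlap
    n_frames = (len(x) - noverlap) // step
    return np.stack([x[i*step : i*step + NFFT] for i in range(n_frames)])

def spectrogram_np(x, NFFT=256, noverlap=128):
    """Frame -> window -> FFT -> squared magnitude (positive frequencies)."""
    frames = rolling_frames(x, NFFT, noverlap)
    hann = 0.5 - 0.5*np.cos(2*np.pi*np.arange(NFFT)/(NFFT - 1))
    spec = np.fft.rfft(frames * hann, axis=1)   # FFT along each frame (axes=(1,))
    return np.abs(spec)**2                      # power spectrum per frame

Fs = 4096
t = np.arange(0, 1, 1.0/Fs)
x = np.sin(2*np.pi*440*t)          # 440 Hz test tone
Pxx = spectrogram_np(x)            # shape: (n_frames, NFFT//2 + 1)
```

In reikna, connecting the equivalent transformations to the FFT fuses these stages into the generated GPU kernel, so the framed and windowed intermediates are not materialized as separate device arrays.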

SLIDE 59

def __init__(self, x, NFFT, ..., window=hanning_window):
    rolling_frame_trf = rolling_frame(x, NFFT, noverlap, pad_to)
    (...)
    real_fft_arr = Type(x.dtype, rolling_frame_trf.output.shape)
    window_trf = window(real_fft_arr, NFFT)
    (...)
    fft = FFT(fft_arr, axes=(1,))

def hanning_window(arr, NFFT):
    """Applies the von Hann window to the rows of a 2D array.
    To account for zero padding (which we do not want to window),
    NFFT is provided separately.
    """
    if dtypes.is_complex(arr.dtype):
        coeff_dtype = dtypes.real_for(arr.dtype)
    else:
        coeff_dtype = arr.dtype
    return Transformation(...)

http://matplotlib.org/examples/pylab_examples/specgram_demo.html
https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 60

def hanning_window(arr, NFFT):
    """Applies the von Hann window to the rows of a 2D array.
    To account for zero padding (which we do not want to window),
    NFFT is provided separately.
    """
    if dtypes.is_complex(arr.dtype):
        coeff_dtype = dtypes.real_for(arr.dtype)
    else:
        coeff_dtype = arr.dtype
    return Transformation(
        [Parameter('output', Annotation(arr, 'o')),
         Parameter('input', Annotation(arr, 'i'))],
        """
        ${dtypes.ctype(coeff_dtype)} coeff;
        %if NFFT != output.shape[0]:
        if (${idxs[1]} >= ${NFFT}) { coeff = 1; }
        else
        %endif
        { coeff = 0.5 * (1 - cos(2 * ${numpy.pi} * ${idxs[-1]} / (${NFFT} - 1))); }
        ${output.store_same}(${mul}(${input.load_same}, coeff));
        """,
        render_kwds=dict(
            coeff_dtype=coeff_dtype, NFFT=NFFT,
            mul=functions.mul(arr.dtype, coeff_dtype)))

http://matplotlib.org/examples/pylab_examples/specgram_demo.html
https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 61

[Parameter('output', Annotation(arr, 'o')),
 Parameter('input', Annotation(arr, 'i'))],
"""
${dtypes.ctype(coeff_dtype)} coeff;
%if NFFT != output.shape[0]:
if (${idxs[1]} >= ${NFFT}) { coeff = 1; }
else
%endif
{ coeff = 0.5 * (1 - cos(2 * ${numpy.pi} * ${idxs[-1]} / (${NFFT} - 1))); }
${output.store_same}(${mul}(${input.load_same}, coeff));
""",
render_kwds=dict(
    coeff_dtype=coeff_dtype, NFFT=NFFT,
    mul=functions.mul(arr.dtype, coeff_dtype))

Hanning based: α = 0.5, β ≡ 1 − α = 0.5

    w(n) = α − β·cos(2πn/(N − 1)),   n ≤ N − 1
    w(n) = 1,                        n ≥ N

Here N = NFFT; the n ≥ N branch (coeff = 1) leaves the zero-padded region unwindowed.

http://matplotlib.org/examples/pylab_examples/specgram_demo.html
https://github.com/fjarri/reikna/tree/develop/examples
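With α = β = 0.5 the coefficient computed in the kernel body is the standard symmetric Hann window, which matches NumPy's built-in. A quick host-side check:

```python
import numpy as np

NFFT = 1024
n = np.arange(NFFT)
alpha, beta = 0.5, 0.5
# Same expression the kernel computes: 0.5*(1 - cos(2*pi*n/(NFFT-1)))
w = alpha - beta * np.cos(2 * np.pi * n / (NFFT - 1))

assert np.allclose(w, np.hanning(NFFT))  # matches NumPy's symmetric Hann window
```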

SLIDE 62

[Parameter('output', Annotation(arr, 'o')),
 Parameter('input', Annotation(arr, 'i'))],
"""
${dtypes.ctype(coeff_dtype)} coeff;
%if NFFT != output.shape[0]:
if (${idxs[1]} >= ${NFFT}) { coeff = 1; }
else
%endif
{ coeff = <YOUR FUNCTION HERE>; }
${output.store_same}(${mul}(${input.load_same}, coeff));
""",
render_kwds=dict(
    coeff_dtype=coeff_dtype, NFFT=NFFT,
    mul=functions.mul(arr.dtype, coeff_dtype))

Hamming based: α = 0.54, β ≡ 1 − α = 0.46

    w(n) = α − β·cos(2πn/(N − 1)),   n ≤ N − 1
    w(n) = 1,                        n ≥ N

Blackman-Harris: α0 = 0.358, α1 = 0.488, α2 = 0.141, α3 = 0.012

    w(n) = α0 − α1·cos(2πn/(N − 1)) + α2·cos(4πn/(N − 1)) − α3·cos(6πn/(N − 1)),   n ≤ N − 1
    w(n) = 1,   n ≥ N

Flat top: α0 = 1, α1 = 1.93, α2 = 1.29, α3 = 0.388, α4 = 0.028

    w(n) = α0 − α1·cos(2πn/(N − 1)) + α2·cos(4πn/(N − 1)) − α3·cos(6πn/(N − 1)) + α4·cos(8πn/(N − 1)),   n ≤ N − 1
    w(n) = 1,   n ≥ N

http://matplotlib.org/examples/pylab_examples/specgram_demo.html
https://github.com/fjarri/reikna/tree/develop/examples
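All four windows are instances of one cosine-sum family, so a single coefficient expression parameterized by the α list covers them. A NumPy sketch (coefficients rounded as on the slide; the canonical Blackman-Harris values carry more digits):

```python
import numpy as np

def cosine_sum_window(N, coeffs):
    """w(n) = sum_k (-1)^k * a_k * cos(2*pi*k*n/(N-1)) for n = 0..N-1."""
    n = np.arange(N)
    w = np.zeros(N)
    for k, a in enumerate(coeffs):
        w += (-1)**k * a * np.cos(2 * np.pi * k * n / (N - 1))
    return w

N = 512
hann            = cosine_sum_window(N, [0.5, 0.5])
hamming         = cosine_sum_window(N, [0.54, 0.46])
blackman_harris = cosine_sum_window(N, [0.358, 0.488, 0.141, 0.012])
flat_top        = cosine_sum_window(N, [1.0, 1.93, 1.29, 0.388, 0.028])

# Hann and Hamming reduce to NumPy's built-in windows
assert np.allclose(hann, np.hanning(N))
assert np.allclose(hamming, np.hamming(N))
```

Swapping windows in the kernel above is then only a change of the coefficient list dropped into the <YOUR FUNCTION HERE> slot.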

SLIDE 63

Grid points:    32,768 | CPU Runtime: 0.0199 | GPU reikna Runtime: 0.0654 | GPU to CPU Speedup: 0.3041
Grid points:   524,288 | CPU Runtime: 0.3239 | GPU reikna Runtime: 0.1226 | GPU to CPU Speedup: 2.6416
Grid points: 8,388,608 | CPU Runtime: 4.8352 | GPU reikna Runtime: 0.8795 | GPU to CPU Speedup: 5.4976

http://matplotlib.org/examples/pylab_examples/specgram_demo.html
https://github.com/fjarri/reikna/tree/develop/examples
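The speedup column is the ratio of CPU to GPU runtime; recomputing it from the quoted runtimes (values copied from the slide) reproduces the figures to within rounding:

```python
# (grid points, CPU runtime, GPU reikna runtime) as quoted on the slide
runs = [
    (32_768,    0.0199, 0.0654),
    (524_288,   0.3239, 0.1226),
    (8_388_608, 4.8352, 0.8795),
]
speedups = [cpu / gpu for _, cpu, gpu in runs]
for (n, cpu, gpu), s in zip(runs, speedups):
    print(f"{n:>9,} grid points: GPU to CPU speedup {s:.4f}")
```

The sub-unity speedup at 32,768 points is consistent with fixed per-launch and transfer overheads dominating small transforms; the GPU only pulls ahead once the grid is large enough to amortize them.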

SLIDE 64

Grid points: 32,768 | CPU Runtime: 0.0199 | GPU reikna Runtime: 0.0654 | GPU to CPU Speedup: 0.3041

http://matplotlib.org/examples/pylab_examples/specgram_demo.html
https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 65

Grid points: 8,388,608 | CPU Runtime: 4.8352 | GPU reikna Runtime: 0.8795 | GPU to CPU Speedup: 5.4976

http://matplotlib.org/examples/pylab_examples/specgram_demo.html
https://github.com/fjarri/reikna/tree/develop/examples

SLIDE 66

Appendix References | Further Reading

SAR Image Formation

  • S1. Frey, O., Meier, E.H., and Nuesch, D.R., "Processing SAR Data of Rugged Terrain by Time-Domain Back-Projection," Proceedings of the SPIE, Vol. 5980, 71-79, 2005.

  • S2. Park, J., Tang, P.T.P., Smelyanskiy, M., Kim, D., and Benson, T., "Efficient Backprojection-Based Synthetic Aperture Radar Computation with Many-Core Processors," Int. Conf. for High Performance Computing, Networking, Storage, and Analysis (SC), 2012.

  • S3. Ryland, Robert, "Synthetic aperture radar (SAR) imaging system," U.S. Patent 8,344,934. <http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2

SLIDE 67

Fnetahtml%2FPTO%2Fsearch-bool.html&r=9&f=G&l=50&co1=AND&d=PTXT&s1=ryland.INNM.&OS=IN/ryland&RS=IN/ryland>

  • S4. Hartley, T.D.R., Fasih, A.R., Berdanier, C.A., Ozguner, F., and Catalyurek, U.V., "Investigating the Use of GPU-Accelerated Nodes for SAR Image Formation," IEEE Int. Conf. on Cluster Computing and Workshops, 1-8, 31, 2009.

  • S5. Capozzoli, A., Curcio, C., and Liseno, A., "Fast GPU-Based Interpolation for SAR Back-Projection," Progress in Electromagnetics Research, Vol. 133, 249-283, 2013.

  • S6. Capozzoli, A., Curcio, C., Liseno, A., and Testa, P.V., "NUFFT-based SAR Backprojection on Multiple GPUs," Proc. of the Tyrrhenian Workshop on Advances in Radar and Remote Sensing, Napoli, Italy, 2012.

RF Design

SLIDE 68

  • R1. Cooke, S.J., Chernyavskiy, I.A., Stanchev, G.M., Levush, B., and Antonsen, T.M., "GPU-Accelerated Large-Signal Device Simulation Using the 3D Particle-in-Cell Code 'Neptune'," International Conference on Plasma Science, Edinburgh, UK, July 2012.

  • R2. Antonsen, T.M., Mondelli, A., Levush, B., Verboncoeur, J.P., and Birdsall, C.K., "Advances in modeling and simulation of vacuum electronic devices," Proceedings of the IEEE, 87(5), 804-839, 1999.

  • R3. Petillo, J., et al., "The MICHELLE 3D Electron Gun and Collector Modeling Tool: Theory and Design," IEEE Trans. Plasma Science, vol. 30, no. 3, pp. 1238-1264, June 2002.

SLIDE 69

  • R4. Cooke, S.J., Shtokhamer, R., Mondelli, A.A., and Levush, B., "A finite integration method for conformal, structured-grid, electromagnetic simulation," Journal of Computational Physics, 215(1), 321-347, 2006.

  • R5. Chernyavskiy, I.A., Vlasov, A.N., Antonsen, T.M., Cooke, S.J., Levush, B., and Nguyen, K.T., "Simulation of Klystrons With Slow and Reflected Electrons Using Large-Signal Code TESLA," IEEE Transactions on Electron Devices, 54(6), 1555-1561, 2007.

  • R6. Chernyavskiy, I.A., Cooke, S.J., Vlasov, A.N., Antonsen, T.M., Abe, D.K., Levush, B., and Nguyen, K.T., "Parallel Simulation of Independent Beam-Tunnels in Multiple-Beam Klystrons Using TESLA," IEEE Transactions on Plasma Science, 36(3), 670-681, 2008.

SLIDE 70

  • R7. Vlasov, A.N., Antonsen, T.M., Chernyavskiy, I.A., Chernin, D.P., and Levush, B., "A Computationally Efficient Two-Dimensional Model of the Beam-Wave Interaction in a Coupled-Cavity TWT," IEEE Transactions on Plasma Science, 40(6), 1575-1589, 2012.

  • R8. Adams, B.M., Bohnhoff, W.J., Dalbey, K.R., Eddy, J.P., Eldred, M.S., Gay, D.M., Haskell, K., Hough, P.D., and Swiler, L.P., "DAKOTA, A Multilevel Parallel Object-Oriented Framework for Design Optimization, Parameter Estimation, Uncertainty Quantification, and Sensitivity Analysis: Version 5.0 Reference Manual," Sandia Technical Report SAND2010-2184.

Rotating Detonation Engines

  • D1. Heiser, W.H. and Pratt, D.T., "Thermodynamic Cycle Analysis of Pulse Detonation Engines," JPP, Vol. 18, No. 1, p. 68, 2002.

SLIDE 71

  • D2. Braun, E.M., Lu, F.K., Wilson, D.R., and Camberos, J., "Detonation Engine Performance Comparison Using First and Second Law Analyses," AIAA Paper 2010-7040, 46th Joint Propulsion Conference, 2010.

  • D3. Bykovskii, F.A., Zhdan, S.A., and Vedernikov, E.F., "Continuous spin detonations," J. Propulsion Power, Vol. 22, p. 1204, 2006.

  • D4. Zhdan, S.A., Bykovskii, F.A., and Vedernikov, E.F., "Mathematical Modeling of a Rotating Detonation Wave in a Hydrogen-Oxygen Mixture," Combustion, Explosion, and Shock Waves, Vol. 43, No. 4, p. 449, 2007.

  • D5. Hishida, M., Fujiwara, T., and Wolanski, P., "Fundamentals of rotating detonation engines," Shock Waves, Vol. 19, No. 1, pp. 1-10, 2009.

SLIDE 72

  • D6. Nordeen, C.A., Schwer, D.A., Schauer, F., Hoke, J., Cetegen, B., and Barber, T., "Thermodynamic Modeling of a Rotating Detonation Engine," AIAA Paper 2011-803, 49th AIAA Aerospace Sciences Meeting, 2011.

  • D7. Kindracki, J., Wolanski, P., and Gut, Z., "Experimental research on the rotating detonation in gaseous fuels-oxygen mixtures," Shock Waves, Vol. 21, pp. 75-84, 2011.

  • D8. Schwer, D.A. and Kailasanath, K., "Fluid Dynamics of Rotating Detonation Engines with Hydrogen and Hydrocarbon Fuels," Proc. Combust. Inst., Vol. 34, pp. 1991-1998, 2013.

  • D9. Lu, F.K., Braun, E.M., Massa, L., and Wilson, D.R., "Rotating Detonation Wave Propulsion: Experimental Challenges, Modeling, and Engine Concepts (Invited)," AIAA Paper 2011-6043, 47th Joint Propulsion Conference, 2011.

SLIDE 73

  • D10. Kailasanath, K., Patnaik, G., and Li, C., "On factors controlling the performance of pulsed detonation engines," in High-Speed Deflagration and Detonation: Fundamentals and Control, Eds: G. Roy, S. Frolov, D. Netzer, and A. Borisov, Moscow, Russia: Enas Publ., 193-206, 2001.

  • D11. Jenkins, T.P., Sanders, S.T., Kailasanath, K., Li, C., and Hanson, R.K., "Diode Laser-Based Measurements for Model Validation in Pulse Detonation Flows," Proceedings of the JANNAF Combustion, Airbreathing Propulsion, Modeling and Simulation Joint Meeting, Monterey, CA, Nov. 12-17, 2000 (published by CPIA, JSC CD-05).

SLIDE 74

  • D12. Brophy, C.M., Sinibaldi, J.O., Netzer, D.W., and Kailasanath, K., "Initiator Diffraction Limits in Pulse Detonation Engines," in Confined Detonations and Pulse Detonation Engines, Eds: G.D. Roy, S.M. Frolov, R.J. Santoro, and S.A. Tsyganov, Moscow, Russia: TORUS PRESS Publ., 2003.

  • D13. Li, C., Kailasanath, K., and Patnaik, G., "A numerical study of flow field evolution in a pulsed detonation engine," AIAA Paper 2000-0314, 38th Aerospace Sciences Meeting, Reno, NV, 2000.

  • D14. Schwer, D.A. and Kailasanath, K., "Numerical Investigation of Rotating Detonation Engines," AIAA Paper 2010-6880, 46th Joint Propulsion Conference, 2010.

  • D15. Schwer, D.A. and Kailasanath, K., "Numerical Study of Engine Size Effects on Rotating Detonation Engines," AIAA Paper 2011-581, 49th AIAA Aerospace Sciences Meeting, 2011.

SLIDE 75

  • D16. Schwer, D.A. and Kailasanath, K., "Effect of Inlet on Fill Region and Performance of Rotating Detonation Engines," AIAA Paper 2011-6044, 47th Joint Propulsion Conference, 2011.

  • D17. Schwer, D.A. and Kailasanath, K., "Feedback into Mixture Plenums in Rotating Detonation Engines," AIAA Paper 2012-0617, 50th AIAA Aerospace Sciences Meeting, 2012.

  • D18. Nordeen, C.A., Schwer, D.A., Schauer, F., Hoke, J., Barber, T., and Cetegen, B., "Energy Transfer in a Rotating Detonation Engine," AIAA Paper 2011-6045, 47th Joint Propulsion Conference, 2011.

  • D19. Nordeen, C.A., Schwer, D.A., Schauer, F., Hoke, J., Barber, T., and Cetegen, B., "Divergence and Mixing in a Rotating Detonation Engine," AIAA Paper 2013-1175, 2013.

SLIDE 76

  • D20. Schwer, D.A. and Kailasanath, K., "On Reducing Feedback Pressure in Rotating Detonation Engines," AIAA Paper 2013-1178, 51st AIAA Aerospace Sciences Meeting, 2013.

  • D21. Russo, R.M., King, P.I., Schauer, F.R., and Thomas, L.M., "Characterization of Pressure Rise Across a Continuous Detonation Engine," AIAA Paper 2011-6046, 47th Joint Propulsion Conference, 2011.

  • D22. Dyer, R., Naples, A., Kaemming, T., Hoke, J., and Schauer, F., "Parametric Testing of a Unique Rotating Detonation Engine Design," AIAA Paper 2012-0121, 50th Aerospace Sciences Meeting, 2012.

  • D23. Shank, J.C., King, P.I., Karnesky, J., Schauer, F.R., and Hoke, J.L., "Development and Testing of a Modular Rotating Detonation Engine," AIAA Paper 2012-0120, 50th Aerospace Sciences Meeting, 2012.

SLIDE 77

  • D24. Zalesak, S.T., "Fully multidimensional flux-corrected transport algorithms for fluids," Journal of Computational Physics, Vol. 31, No. 3, pp. 335-362, 1979.

  • D25. The MPI Forum, "MPI: A Message-Passing Interface Standard," July 2011, retrieved from http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf.

  • D26. Karypis, G. and Schloegel, K., "ParMETIS: Parallel Graph Partitioning and Sparse Matrix Ordering Library, Version 4.0," retrieved from http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview, 2011.

  • D27. Hoberock, J. and Bell, N., "Thrust: A Parallel Template Library," Version 1.4.0, 2011.

  • D28. Bell, N. and Hoberock, J., "Thrust: A Productivity-Oriented Library for CUDA," GPU Computing Gems: Jade Edition, Morgan Kaufmann, pp. 359-372, 2011.
