Accelerating the acceleration search: a case study, by Chris Laidler



slide-1
SLIDE 1

Accelerating the acceleration search a case study

By Chris Laidler

slide-2
SLIDE 2

Optimization cycle

Assess → Parallelise → Optimise → Test

slide-3
SLIDE 3

Profile

  • Identify the function or functions in which the application is spending most of its execution time.

  • CPU code:

– gprof
– valgrind
– oprofile

  • Identifying hotspots
slide-4
SLIDE 4

Parallelize

  • Use existing libraries
  • Code to expose parallelism
slide-5
SLIDE 5

Optimizing CUDA code

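  • Using CPU timers

– cudaDeviceSynchronize() (call it before starting and before stopping the timer, since kernel launches are asynchronous)
– cudaEventSynchronize()

  • Using CUDA GPU timers

– cudaEventCreate(&start)
– cudaEventElapsedTime()

  • Bandwidth

– How, when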
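
The bandwidth point can be made concrete with the usual effective-bandwidth figure: bytes read plus bytes written, divided by the elapsed time. The sketch below is a minimal illustration, assuming the elapsed time comes from one of the timers listed on this slide; the function name and the example sizes are hypothetical.

```python
def effective_bandwidth_gb_s(bytes_read, bytes_written, seconds):
    """Effective bandwidth in GB/s, taking 1 GB as 1e9 bytes."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return (bytes_read + bytes_written) / 1e9 / seconds


# Hypothetical kernel that copies a 2048 x 2048 array of 4-byte floats
# (one read and one write per element) in 0.5 ms.
n_bytes = 2048 * 2048 * 4
bandwidth = effective_bandwidth_gb_s(n_bytes, n_bytes, 0.5e-3)
print(f"{bandwidth:.1f} GB/s")
```

Comparing this number with the theoretical peak of the device indicates how much headroom a memory-bound kernel still has.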

slide-6
SLIDE 6

Data Transfer Between Host and Device

  • Minimize data transfer between the host and the device, even if that means running kernels on the GPU that do not demonstrate any speedup compared with running them on the host CPU.

  • Keep it in device memory
  • Batch many small transfers into one larger transfer
  • Use page-locked (or pinned) memory

– cudaHostAlloc()

slide-7
SLIDE 7

Asynchronous and Overlapping Memory Transfers with Computation

  • A stream is simply a sequence of operations that are performed in order on the device. Operations in different streams can be interleaved and in some cases overlapped, a property that can be used to hide data transfers between the host and the device.

– cudaStreamCreate(&stream1);
– Default stream: no explicit synchronization is needed, operations are always sequential.

  • Some devices are capable of concurrent copy and compute

– cudaMemcpy() is blocking
– cudaMemcpyAsync() is non-blocking (the host buffer must be pinned for the copy to overlap with computation)

  • kernel<<<grid, block, 0, stream2>>>(data...);
slide-8
SLIDE 8

Concurrent copy and execute

  • cudaStreamCreate(&stream1);
  • cudaStreamCreate(&stream2);
  • cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);

  • kernel<<<grid, block, 0, stream2>>>(otherData_d);
slide-9
SLIDE 9

Staged concurrent copy and execute

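  • Sequential: one memcpy followed by one compute
  • Concurrent: the work is split into stages, so the memcpy of one stage overlaps the compute of the previous one

[Figure: timelines of memcpy and compute for the sequential and the staged concurrent case]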
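
The benefit of staging can be estimated with a small timeline model. This is a rough sketch, not a measurement: it assumes a single copy engine and a single compute engine, chunks of equal size, and no per-launch overhead. The function names and the example timings are hypothetical.

```python
def sequential_time(t_copy, t_compute):
    """Copy everything, then compute everything."""
    return t_copy + t_compute


def staged_time(t_copy, t_compute, n_chunks):
    """Copy and compute in n_chunks stages.

    The compute for a chunk starts once its copy has finished and the
    compute for the previous chunk is done.
    """
    copy_chunk = t_copy / n_chunks
    compute_chunk = t_compute / n_chunks
    copy_done = 0.0
    compute_done = 0.0
    for _ in range(n_chunks):
        copy_done += copy_chunk  # copies run back to back
        compute_done = max(copy_done, compute_done) + compute_chunk
    return compute_done


# Hypothetical timings in milliseconds.
print(sequential_time(4.0, 8.0))
print(staged_time(4.0, 8.0, 4))
```

In this model the staged time approaches the longer of the two phases as the number of chunks grows, which is why overlapping only pays off when copy and compute times are comparable.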

slide-10
SLIDE 10

Device Memory Spaces

  • Coalesced access to global memory
  • Global memory loads and stores by threads of a warp are coalesced by the device into as few as one transaction when certain access requirements are met.

  • The concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of cache lines necessary to service all of the threads of the warp.

  • By default, all accesses are cached through L1, which has 128-byte lines.
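
The rule above (one transaction per cache line touched) can be checked by counting distinct 128-byte lines for a given access pattern. A minimal sketch, assuming 32 threads per warp and 4-byte words; the helper names are hypothetical.

```python
def warp_addresses(base, stride_words=1, word_bytes=4, warp_size=32):
    """Byte address accessed by each lane of a warp."""
    return [base + lane * stride_words * word_bytes for lane in range(warp_size)]


def cache_lines_touched(byte_addresses, line_bytes=128):
    """Number of distinct cache lines needed to service the accesses."""
    return len({address // line_bytes for address in byte_addresses})


print(cache_lines_touched(warp_addresses(base=0)))                   # aligned, unit stride
print(cache_lines_touched(warp_addresses(base=4)))                   # misaligned by one word
print(cache_lines_touched(warp_addresses(base=0, stride_words=2)))   # stride of two words
print(cache_lines_touched(warp_addresses(base=0, stride_words=32)))  # one line per lane
```

An aligned, unit-stride warp fits in a single line, while a small misalignment or a stride of two doubles the number of lines fetched.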

slide-11
SLIDE 11

Global memory accesses

  • Compute capability 2.x: cached through L1, which has 128-byte lines.
  • Compute capability 3.x: only cached in L2.

– L1 is reserved for local memory accesses.

slide-12
SLIDE 12

A Simple Access Pattern

[Figure: threads 0, 1, ..., 30, 31 of a warp accessing consecutive addresses between 128 and 256]

slide-13
SLIDE 13

Memory Hierarchy

  • Shared Memory

– Minimize bank conflicts

  • Texture Memory
  • Constant Memory
  • Registers
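
To see what "minimize bank conflicts" means in practice, the sketch below counts how many distinct words of one warp-wide access fall into the same shared-memory bank. It assumes 32 banks, each one 4-byte word wide, with successive words mapped to successive banks; lanes reading the same word are treated as a broadcast. The function name is hypothetical.

```python
def bank_conflict_degree(word_indices, n_banks=32):
    """Largest number of distinct words that map to a single bank.

    1 means conflict-free; k means the access is serialised k ways.
    """
    words_per_bank = {}
    for word in set(word_indices):  # identical words are broadcast, count once
        words_per_bank.setdefault(word % n_banks, set()).add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


lanes = range(32)
print(bank_conflict_degree([lane for lane in lanes]))       # unit stride
print(bank_conflict_degree([2 * lane for lane in lanes]))   # stride 2
print(bank_conflict_degree([32 * lane for lane in lanes]))  # stride 32
print(bank_conflict_degree([33 * lane for lane in lanes]))  # stride 33 (padded)
```

The last case illustrates the common trick of padding a 32-wide array to 33 columns so that column accesses become conflict-free.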
slide-14
SLIDE 14

Occupancy

  • Occupancy: number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently

  • Limited by resource usage:

– Registers
– Shared memory

  • Higher occupancy does not necessarily lead to higher performance

– Low-occupancy kernels cannot hide memory latency
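
The definition above can be turned into a rough estimate. The sketch below is simplified: the per-multiprocessor limits are hypothetical defaults rather than the values of a specific GPU, and it ignores the allocation granularity that real occupancy calculators account for.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps=64, max_blocks=16, regs_per_sm=65536,
              smem_per_sm=49152, warp_size=32):
    """Estimated occupancy of one multiprocessor (0.0 to 1.0)."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    # Each resource limits how many blocks can be resident at once.
    limits = [max_blocks, max_warps // warps_per_block]
    if regs_per_thread > 0:
        regs_per_block = regs_per_thread * warps_per_block * warp_size
        limits.append(regs_per_sm // regs_per_block)
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)
    resident_blocks = min(limits)
    return resident_blocks * warps_per_block / max_warps


print(occupancy(256, 32, 0))      # limited by the warp count
print(occupancy(256, 64, 0))      # limited by registers
print(occupancy(256, 32, 16384))  # limited by shared memory
```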

slide-15
SLIDE 15

Case Study

Finding pulsars

slide-16
SLIDE 16

Pulsars

Neutron stars

  • Mass: ∼1.4 M☉
  • Radius: 10 – 80 km
  • Density: 10^14 grams/cm^3
  • Rapidly rotating: up to 716 Hz
  • Highly magnetized: 10^8 – 10^15 Gauss

slide-17
SLIDE 17

Pulsars

Neutron stars

  • Mass: ∼1.4 M☉
  • Radius: 10 – 80 km
  • Density: 10^14 grams/cm^3
  • Rapidly rotating: up to 716 Hz
  • Highly magnetized: 10^8 – 10^15 Gauss

The rotating magnetic field induces an electric field which accelerates charged particles that are then beamed from the poles of the star. If one of these beams passes over us, we can detect it as a broadband periodic signal.

slide-18
SLIDE 18

So how do we find new pulsars?

  • Take a long observation with a radio telescope
  • High sampling rate ∼12 kHz
  • Remove what RFI we can
  • Perform barycentric corrections
  • De-disperse the observation for a number of trial DMs

And then to find a periodic signal….

The good old Fourier Transform!
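
As a toy illustration of why the Fourier transform finds periodic signals, the sketch below builds a noise-free pulse train and locates the strongest bin of its power spectrum. It uses a direct O(N^2) DFT for self-containment; a real search would use an FFT library on far longer, noisy time series. All names and sizes are illustrative.

```python
import cmath


def power_spectrum(samples):
    """Powers of the DFT bins 0 .. N/2 of a real time series."""
    n = len(samples)
    powers = []
    for k in range(n // 2 + 1):
        amplitude = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                        for i, x in enumerate(samples))
        powers.append(abs(amplitude) ** 2)
    return powers


n = 256
period = 16  # pulse period in samples
# Pulses 4 samples wide, repeating every 16 samples.
series = [1.0 if i % period < 4 else 0.0 for i in range(n)]

powers = power_spectrum(series)
# Skip bin 0 (the mean level) when looking for the peak.
peak = max(range(1, len(powers)), key=powers.__getitem__)
print("strongest bin:", peak, "-> period of", n / peak, "samples")
```

The remaining power sits in bins at integer multiples of the peak bin, which is the harmonic structure that the later harmonic-summing slides exploit.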

slide-19
SLIDE 19

Frequency Search – Power Spectra

Let's examine a 7.3-hour observation of Terzan 5 taken on 05/05/05 with the GBT.

[Figure: power spectrum of a 7.3 h observation of Ter 5]

slide-20
SLIDE 20

Frequency Search – Power Spectra

Let's examine a 7.3-hour observation of Terzan 5 taken on 05/05/05 with the GBT.

[Figure labels: J1748-2446A and its harmonics; J1748-2446C]

slide-21
SLIDE 21

Frequency Search – Power Spectra

Let's examine a 7.3-hour observation of Terzan 5 taken on 05/05/05 with the GBT.

[Figure: power spectrum of a 7.3 h observation of Ter 5; label: J1748-2446A, fundamental harmonic]

slide-22
SLIDE 22

Frequency Search – Power Spectra

Power of J1748-2446A, a very strong binary pulsar. Ter A completes ∼4 orbits during the observation. The orbital motion Doppler-shifts the observed spin frequency and smears the power across a number of Fourier bins.
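
The smearing can be reproduced with a toy signal. The sketch below compares a steady tone with one whose frequency drifts linearly by 8 Fourier bins over the observation, and reports the fraction of the total power found in the strongest bin. A linear drift is a simplification chosen for illustration (the true orbital modulation is not linear over several orbits), and all names and numbers are hypothetical.

```python
import cmath


def tone(n, f0_bins, drift_bins=0.0):
    """Complex tone starting at f0_bins and drifting linearly by drift_bins."""
    samples = []
    for i in range(n):
        t = i / n  # time as a fraction of the observation
        phase = 2 * cmath.pi * (f0_bins * t + 0.5 * drift_bins * t * t)
        samples.append(cmath.exp(1j * phase))
    return samples


def peak_power_fraction(signal):
    """Fraction of the total spectral power held by the strongest bin."""
    n = len(signal)
    powers = [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                      for i, x in enumerate(signal))) ** 2
              for k in range(n)]
    return max(powers) / sum(powers)


n = 128
steady = peak_power_fraction(tone(n, 20))
smeared = peak_power_fraction(tone(n, 20, drift_bins=8))
print(f"steady tone:   {steady:.3f}")
print(f"drifting tone: {smeared:.3f}")
```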

slide-23
SLIDE 23

Frequency Search – Power Spectra

Let's examine a 7.3-hour observation of Terzan 5 taken on 05/05/05 with the GBT.

[Figure labels: J1748-2446A and its harmonics; J1748-2446C]

slide-24
SLIDE 24

Frequency Search – Power Spectra

Let's examine a 7.3-hour observation of Terzan 5 taken on 05/05/05 with the GBT.

[Figure: power spectrum of a 7.3 h observation of Ter 5; label: J1748-2446ae, fundamental harmonic]

slide-25
SLIDE 25

Frequency Search – Power Spectra

If we look for J1748-2446ae at ~273.33 Hz, we see that Ter AE is fairly weak: there is no significant detection in the power spectrum.

slide-26
SLIDE 26

Finding new binary pulsars?

Acceleration search

  • Assumes the orbital period is significantly longer than the observation. The acceleration can then be assumed to be close to constant during the observation.

  • This constant acceleration can be compensated for and most of the power regained.

  • This is essentially a 2D parameter search (f and ḟ).
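
The idea of compensating for a constant acceleration can be sketched with a brute-force trial search. Note that this example works in the time domain (it removes a trial frequency drift and then takes the spectrum), whereas the later slides do the equivalent operation in the Fourier domain with precomputed kernels. It is a swapped-in illustration with hypothetical names and toy sizes.

```python
import cmath


def accelerated_tone(n, f0_bins, drift_bins):
    """Complex tone whose frequency drifts linearly by drift_bins bins."""
    return [cmath.exp(2j * cmath.pi * (f0_bins * (i / n)
                                       + 0.5 * drift_bins * (i / n) ** 2))
            for i in range(n)]


def remove_drift(signal, trial_drift_bins):
    """Undo a trial linear frequency drift by counter-rotating the phase."""
    n = len(signal)
    return [x * cmath.exp(-1j * cmath.pi * trial_drift_bins * (i / n) ** 2)
            for i, x in enumerate(signal)]


def peak_power(signal):
    """Strongest spectral power, normalised so a pure tone gives 1.0."""
    n = len(signal)
    return max(abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                       for i, x in enumerate(signal))) ** 2
               for k in range(n)) / n ** 2


n = 64
signal = accelerated_tone(n, f0_bins=12, drift_bins=6)
trials = [-8, -6, -4, -2, 0, 2, 4, 6, 8]  # trial drifts in Fourier bins
best = max(trials, key=lambda z: peak_power(remove_drift(signal, z)))

print("peak power without correction:", round(peak_power(signal), 3))
print("best trial drift:", best)
print("peak power after correction:", round(peak_power(remove_drift(signal, best)), 3))
```

Each trial drift corresponds to one row of the f-ḟ plane; scanning the frequency bins of every row is the 2D search described above.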

slide-27
SLIDE 27

Acceleration search - f and ḟ

slide-28
SLIDE 28

Searching for J1748-2446ae

  • Ter AE has a short orbital period (4 hours)
  • Thus it completes ∼1.8 orbits during the 7.38-hour observation.
  • It is thus not detected with an acceleration search.

So what is next?

slide-29
SLIDE 29

Create an f-fdot plane

  • Prepare kernels (make 2D array)
  • Read FFT

– Prepare (1D data)

  • Create f-fdot plane

– Multiply kernels with data
– FFT
– Powers

  • Search (optional)
slide-30
SLIDE 30

Prepare the kernels

  • This is only done once!
  • Calculate kernel columns, which depend only on width and height (Fresnel integrals)

  • Place data (half and split)
  • Fourier transform (y columns)
slide-31
SLIDE 31

Prepare the input data

  • Read raw FFT data, ~8K values (float2)
  • Calculate powers
  • Calculate median
  • Normalize raw powers (Using median of powers)
  • Spread
  • FFT
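
The normalisation step can be sketched as follows. The slide only says that the median of the powers is used; the specific scale factor here (ln 2 divided by the median, which brings the mean power of exponentially distributed noise to roughly 1) is an assumption for illustration, as are the function name and the sample values.

```python
import math
import statistics


def normalise_by_median(amplitudes):
    """Scale complex amplitudes so the median of their powers equals ln 2."""
    powers = [a.real ** 2 + a.imag ** 2 for a in amplitudes]
    median = statistics.median(powers)
    if median <= 0:
        raise ValueError("median power must be positive")
    scale = math.sqrt(math.log(2.0) / median)
    return [a * scale for a in amplitudes]


# Hypothetical raw values standing in for a block of complex FFT data.
raw = [complex(1, 0), complex(0, 2), complex(3, 0), complex(2, 2), complex(0, 1)]
normalised = normalise_by_median(raw)
print([round(abs(a) ** 2, 3) for a in normalised])
```

Using the median rather than the mean keeps a few very strong bins, such as a bright pulsar or interference, from biasing the normalisation.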
slide-32
SLIDE 32

Create f-fdot

  • Multiply Input (vector) by kernel column by column
  • FFT data
  • Chop ends
  • Calculate powers
  • Copy to f-fdot plane
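
The multiply, transform, chop and powers sequence amounts to an FFT-based convolution of the input with one kernel. The sketch below shows it for a single kernel and a short input. It assumes that "half and split" means placing the centre of the kernel at index 0 with the leading half wrapped to the end of the array, and that "chop ends" discards the wrapped-around samples on both sides. A direct O(N^2) DFT stands in for the FFT, and all names are hypothetical.

```python
import cmath


def dft(values, inverse=False):
    """Direct discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(v * cmath.exp(sign * 2j * cmath.pi * j * k / n)
               for k, v in enumerate(values))
           for j in range(n)]
    return [v / n for v in out] if inverse else out


def place_half_and_split(kernel, n):
    """Zero-pad an odd-length kernel to n, centre tap at index 0."""
    half = len(kernel) // 2
    padded = [0j] * n
    for offset in range(-half, half + 1):
        padded[offset % n] = kernel[offset + half]  # negative offsets wrap
    return padded


def convolve_and_chop(data, kernel):
    """Powers of the wrap-free part of the convolution of data with kernel."""
    n, half = len(data), len(kernel) // 2
    kernel_ft = dft(place_half_and_split(kernel, n))  # prepared once per kernel
    product = [d * k for d, k in zip(dft(data), kernel_ft)]
    full = dft(product, inverse=True)
    valid = full[half:n - half]  # chop the contaminated ends
    return [abs(v) ** 2 for v in valid]


data = [complex(i % 5, (i * 3) % 4) for i in range(16)]
kernel = [1, 2j, 3, -1, 0.5]
row = convolve_and_chop(data, kernel)
print(len(row), "valid powers")
```

Repeating this for every kernel, one per trial ḟ, fills the rows of the f-fdot plane; because the ends are chopped, consecutive input blocks have to overlap by the kernel width.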
slide-33
SLIDE 33

Search f-fdot plane

  • Find values above a threshold
  • Compare to neighbours (16 x 16 block)
  • If it is a local maximum, add it to the list
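
The search steps can be sketched as a threshold test followed by a neighbourhood comparison. The `radius` parameter is a stand-in for the 16 x 16 block mentioned on the slide, whose exact role is not specified here; the function name and the toy plane are hypothetical.

```python
def find_candidates(plane, threshold, radius=1):
    """Return (row, column, power) for cells that exceed the threshold
    and are not smaller than any neighbour within `radius` cells.
    Equal-valued neighbouring maxima are all reported."""
    rows, cols = len(plane), len(plane[0])
    found = []
    for r in range(rows):
        for c in range(cols):
            value = plane[r][c]
            if value <= threshold:
                continue
            neighbours = [plane[rr][cc]
                          for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                          for cc in range(max(0, c - radius), min(cols, c + radius + 1))
                          if (rr, cc) != (r, c)]
            if all(value >= other for other in neighbours):
                found.append((r, c, value))
    return found


plane = [
    [1, 2, 1, 0],
    [2, 9, 3, 0],
    [1, 3, 2, 7],
    [0, 0, 1, 6],
]
print(find_candidates(plane, threshold=5))
```

The cell holding 6 is above the threshold but is rejected because its neighbour holding 7 is larger.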
slide-34
SLIDE 34

Add planes

  • Scale x and y, sum “up” to highest harmonic.
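
One way to picture "scale x and y" is to stretch a plane computed at a fraction of the frequency onto the grid of the full-resolution plane before adding. The sketch below assumes both axes are indexed from zero and uses the nearest lower index; a real f-fdot plane would scale the ḟ axis about its centre. The function name and data are hypothetical.

```python
def add_planes(full_plane, sub_plane, fraction):
    """Add sub_plane, computed at `fraction` of the frequency and drift,
    onto the grid of full_plane."""
    summed = []
    for z, full_row in enumerate(full_plane):
        # Map the row (y) index onto the coarser plane.
        src_z = min(int(z * fraction), len(sub_plane) - 1)
        row = []
        for r, value in enumerate(full_row):
            # Map the column (x) index onto the coarser plane.
            src_r = min(int(r * fraction), len(sub_plane[0]) - 1)
            row.append(value + sub_plane[src_z][src_r])
        summed.append(row)
    return summed


full = [[1, 1, 1, 1],
        [1, 1, 1, 1]]
half = [[10, 20]]  # plane computed at half the frequency
print(add_planes(full, half, fraction=0.5))
```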
slide-35
SLIDE 35

8 Harmonics

  • Create fundamental

– Search fundamental
– For stages (powers of 2: ½, ¼, 1/8, ...)

  • For sub-harmonics

– Create
– Sum with fundamental

  • Search
slide-36
SLIDE 36

n Harmonics v1

  • Make n input data sets: 1 kernel
  • Create n f-fdot planes: n kernels
  • For stages

– Add all subs: s kernels
– Search: 1 kernel

slide-37
SLIDE 37

n Harmonics v2

  • Make n input data sets: 1 kernel
  • Create f-fdot
  • Multiply: n kernels

– FFTs: ? kernels

  • Sum and search: 1 kernel

– For stages

  • Create powers
  • Sum to shared memory
  • Search section of f-fdot plane