a case study By Chris Laidler Optimization cycle Assess - PowerPoint PPT Presentation

Accelerating the acceleration search a case study By Chris Laidler

Optimization cycle Assess Parallelise Test Optimise

Profile le ● Identify the function or functions in which the application is spending most of its execution time. ● CPU code: – gprof – valgrind – oprofile ● Identifying hotspots

Parallelize ● Use existing libraries ● Code to expose parallelism

Optimizing CUDA code ● Using CPU Timers – CudaDeviceSynchronize() – cudaEventSynchronize() ● Using CUDA GPU Timers – cudaEventCreate(&start) – CudaEventElapsedTime() ● Bandwidth – How, when

Data Transfer Between Host and Devic ice ● Minimize data transfer between the host and the device. Even if that means running kernels on the GPU that do not demonstrate any speedup compared with running them on the host CPU. ● Keep it in device memory ● Batch many small transfers into one larger transfer ● Use page-locked (or pinned) memory – CudaHostAlloc()

Asynchronous and Overla lappin ing memory ry Transfers with ith Co Computatio ion ● A stream is simply a sequence of operations that are performed in order on the device. Operations in different streams can be interleaved and in some cases overlapped - a property that can be used to hide data transfers between the host and the device. – cudaStreamCreate(&stream1); – Default stream - no explicit synchronization is needed always sequential. ● Some devices are capable of concurrent copy and compute – cudaMemcpy() is blocking – cudaMemcpyAsync() is a non-blocking ● kernel<<<grid, block, 0, stream2>>>(data...);

Concurrent copy and execute • cudaStreamCreate(&stream1); • cudaStreamCreate(&stream2); • cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1); • kernel<<<grid, block, 0, stream2>>>(otherData_d);

Staged concurrent copy and execute • Sequential memcpy compute • Concurrent memcpy memcpy memcpy memcpy compute compute compute compute

Devic ice Memory ry Spaces ● Coalesced Access to Global Memory ● Global memory loads and stores by threads of a warp are coalesced by the device into as few as one transaction when certain access requirements are met. ● the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of cache lines necessary to service all of the threads of the warp. ● By default, all accesses are cached through L1, which as 128-byte lines.

Global memory ry accesses ● 2.x cached through L1, which has 128-byte lines. ● 3.x is only cached in L2. – L1 is reserved for local memory accesses.

A Simple Access Pattern • A Simple Access Pattern 30 31 30 31 0 0 1 1 256 256 128 30 31 0 1 128 256

Memory ry Hierarchy ● Shared Memory – Minimize Bank conflicts ● Texture Memory ● Constant Memory ● Registers

Occupancy ● Occupancy: number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently ● Limited by resource usage: – Registers – Shared memory ● Higher occupancy does not necessarily lead to higher performance – Low occupancy kernels cannot hide memory latency

Case Study Finding pulsars

Pulsars Neutron stars ∼ 1.4 M • Mass • Radius: 10 – 80 km 10 14 grams/cm 3 • Density: • Rapidly rotating Up to 716 Hz • Highly Magnetized 10 8 - 10 15 Gauss

Pulsars Neutron stars ∼ 1.4 M • Mass • Radius: 10 – 80 km 10 14 grams/cm 3 • Density: • Rapidly rotating Up to 716 Hz • Highly Magnetized 10 8 - 10 15 Gauss The rotating magnetic field induces an electric field which accelerates charged particles that are then beamed from the poles of the star. If one of these beams pass over us we can detect them as a broadband periodic signal.

So how do we fi find new pulsars? • Take a long observe with radio telescope • High sampling rate ∼ 12 kHz • Remove what RIF we can • Perform barycentric corrections • De-disperce the observation – for a number of trial DM’s And then to find a periodic signal…. The good old Fourier Transform!

Frequency Search – Power Spectra Lets examine a 7.3 hour observation of Terzan 5 taken on the 05/05/05 with the GBT. Power spectra of a 7.3 h observation of Ter 5

Frequency Search – Power Spectra Lets examine a 7.3 hour observation of Terzan 5 taken on the 05/05/05 with the GBT. J1748-2446A and its harmonics J1748-2446C

Frequency Search – Power Spectra Lets examine a 7.3 hour observation of Terzan 5 taken on the 05/05/05 with the GBT. Power spectra of a 7.3 h observation of Ter 5 J1748-2446A Fundamental harmonic

Frequency Search – Power Spectra Power of J1748-2446A, a very strong binary pulsar. Ter A completes ∼ 4 orbits during the observation. The orbital motion Doppler shifts observed spin frequency and smears the power across a number of Fourier bins.

Frequency Search – Power Spectra Lets examine a 7.3 hour observation of Terzan 5 taken on the 05/05/05 with the GBT. J1748-2446A and its harmonics J1748-2446C

Frequency Search – Power Spectra Lets examine a 7.3 hour observation of Terzan 5 taken on the 05/05/05 with the GBT. Power spectra of a 7.3 h observation of Ter 5 J1748-2446ae Fundamental harmonic

Frequency Search – Power Spectra If we look for J1748-2446ae at ~273.33 Hz Ter AE is fairly weak and we can see there is no significant detection in the power spectra.

Finding a new binary ry pulsars? Acceleration search • Assumes the orbital period is significantly longer that the observation. The acceleration can be assumed to be close to constant during this observation. • This constant acceleration can be compensated for and most of the power regained. • This is essentially a 2D parameter search. ( 𝑔 and 𝑔 )

Acceleration search - 𝑔 and 𝑔

Searching for J1 J1748-2446ae • Ter AE has short orbital period ( 4 hours ) • Thus completes ∼ 1.8 orbits during the 7.38 hour observation. • It is this not detected with a acceleration search. So what is next?

Create a f-dot plain ● Prepare kernels (make 2d array) ● Read fft – Prepare (1D data) ● Create f-dot plain – Multiply kernels with data – FFT – Powers ● Search (optional)

Preparer the kennels • This is only don once! ● Calculate kernel columns – only dependent on width and height (Fresnel integrals) ● Place data ( half and split ) ● Fourier transform (y columns)

Prepare the input Data ● Read raw powers ~8K ( float2 ) ● Calculate powers ● Calculate median ● Normalize raw powers (Using median of powers) ● Spread ● FFT

Create f-fdot ● Multiply Input (vector) by kernel column by column ● FFT data ● Chop ends ● Calculate powers ● Copy to f-fdot plain

Search f-fdot plain ● Find values above a threshold ● Compare to neibours (block 16 x16) ● If local maxima add to list

Add plains ● Scale x and y, sum “up” to highest harmonic.

8 Harmonics ● Create fundamental – Search fundamental – For stages ( Powers of 2, ½, ¼, 1/8, ...) ● For sub harmonics – Create – Sum with fundamental ● Search

n Harmonics v1 ● Make n input data sets 1 kernel ● Create n f-fdot plains n kernels ● For stages add all subs s kernels search 1 kernel

n Harmonics v2 ● Make n input data sets 1 kernel ● Create f-fdot ● Multiply n kernels – FFT's ? kernels ● Sum and search 1 kernel – For stages ● Create powers ● Sum to shard memory ● Search section of f-fdot plain

a case study By Chris Laidler Optimization cycle Assess - PowerPoint PPT Presentation

Accelerating the acceleration search a case study By Chris Laidler Optimization cycle Assess Parallelise Test Optimise Profile le Identify the function or functions in which the application is spending most of its execution time.

Case study 2 Case study 2 Case study 2 Case study 2 Former Industrial Site, London: How has

How Expert Knowledge Can Three Case Studies Help Measurements: First Case Study Second Case

Case Study A Case fo r Use in Addic tio n Re se arc h De re k Quig le y Unive rsity o f Auc

Analysis Analysis of Analysis Analysis of of a Real Case Study : of a Real Case Study : a

Offload Mode Case Study James Briggs 1 COSMOS DiRAC April 28, 2015 Case Study: Modal2d

Indirect Left Turns Study Indirect Left Turns Study Indirect Left Turns Study Indirect Left

Case Comparisons Department of Government London School of Economics and Political Science Uses

Case Study Presentation Criteria: information for the NetP Registered Nurse Case Study

Case Study Presentation to Utility Districts April 8-9, 2013 Cypress Creek Greenway Case Study

Ricardo Semler Case Study Briefing Advance notice: Leading Change Case Study Please read the

2.7 Case Study: Floorplan Optimization Our second case study is an example of a highly irregular,

Week 1, Video 5 Case Study San Pedro Case Study of Classification With educational data

Use Case Study: Journey - Effects and Terrain Presented by Madis Janno Use Case Study: Journey -

Case Management Best P r actice Case Management regulations. Case Study. Community

ECE444: Software Engineering Case Study Shurui Zhou Learning goal Review a specific case

CASE PRESENTATION CASE PRESENTATION Prepared by: Dr. Lina Raffa Case Report p 14 year old

Data Mining II Data Preprocessing Heiko Paulheim Introduction Give me six hours to

MA/CSSE 474 Theory of Computation Languages, prefixes, sets, cardinality, functions Your

Using seccomp to limit the kernel attack surface Michael Kerrisk, man7.org c 2015 man7.org

Number Systems III MA1S1 Tristan McLoughlin December 4, 2013

Semaphores Semaphores 1 Semaphores Semaphores A semaphore is an object that consists of a

Week 1: Gettin Starte wit R Gettin Starte wit R Week 1: EMSE 4574: Intro to

Moving punctures that are neither moving, nor punctures Mark Hannam Friedrich-Schiller

Computer Graphics Si Lu Fall 2017 http://web.cecs.pdx.edu/~lusi/CS447/CS447_547_Comp