slide-1
SLIDE 1

Statistics of the Universe: Exa-calculations and Cosmology's Data Deluge

Debbie Bard Matt Bellis

slide-2
SLIDE 2
Cosmology: the study of the nature and history of the Universe

  • History of the Universe is driven by competing forces:
    ○ gravitational attraction
    ○ repulsive dark energy

slide-3
SLIDE 3

How we study cosmology

  • Use computer simulations of the Universe to compare theoretical models to data.
  • Comparison of a dark-matter simulation (Bolshoi) to galaxy locations from the Sloan Digital Sky Survey (SDSS).

image credit: Nina McCurdy/University of California, Santa Cruz; Ralf Kaehler and Risa Wechsler/Stanford University; Sloan Digital Sky Survey; Michael Busha/University of Zurich

slide-4
SLIDE 4
  • Two-point function: counting galaxy pairs as a function of distance.

[Figure: two-point function, data vs. simulation, as a function of distance between galaxy pairs]

slide-5
SLIDE 5

Cosmology

  • Three-point function: counting galaxy triplets.

[Figure: three-point function, data vs. simulation, as a function of the opening angle of the triangle]

slide-6
SLIDE 6

Cosmology

  • Three-point function: counting galaxy triplets.

slide-7
SLIDE 7

Three-point function

  • New information about the topology of the Universe becomes accessible in the three-point function.
  • Can use it to distinguish between different theoretical models of cosmology.

[Figure: two-point function vs. distance between galaxy pairs and three-point function vs. opening angle of triangle, data and simulation]

Kulkarni et al., MNRAS 378 3 (2007)

slide-8
SLIDE 8

How we calculate these functions

  • Histogram according to:
    ○ distance between galaxies (two-point) → 1D histogram
    ○ triangle configuration (three-point) → 3D histogram!
  • Count pairs and triplets of galaxies (a minimal sketch follows this list):
    ○ two-point function: O(N²) [Bard, Bellis et al., Astronomy and Computing 1 (2013) 17]
    ○ three-point function: O(N³)!
  • Previously relied on approximation codes...
    ○ insufficient for precision cosmology
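To make the brute-force counting concrete, here is a minimal CPU-side sketch of the two-point step: a double loop over all galaxy pairs, histogrammed by separation. This is not the authors' code; the array layout, linear binning, and function name are illustrative assumptions.

#include <cmath>
#include <vector>

// Brute-force two-point counting: O(N^2) over all galaxy pairs, histogrammed
// by pair separation into nbins linear bins out to rmax. Illustrative sketch.
std::vector<long long> two_point_histogram(const std::vector<float> &x,
                                           const std::vector<float> &y,
                                           const std::vector<float> &z,
                                           int nbins, float rmax)
{
    std::vector<long long> hist(nbins, 0);
    const size_t n = x.size();
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {            // each pair counted once
            const float dx = x[i] - x[j];
            const float dy = y[i] - y[j];
            const float dz = z[i] - z[j];
            const float r = std::sqrt(dx * dx + dy * dy + dz * dz);
            const int bin = static_cast<int>(nbins * r / rmax);
            if (bin >= 0 && bin < nbins)
                ++hist[bin];
        }
    }
    return hist;
}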

slide-9
SLIDE 9

Computational challenges are growing... can GPUs help?

2015: # of galaxies = 100,000 → O(N³) = 10¹⁵ calculations (1 quadrillion)
2025: # of galaxies = 1,000,000 → O(N³) = 10¹⁸ calculations (1 quintillion! Exa-scale!)

slide-10
SLIDE 10

Histogramming

Volume of calculations. Each point represents the 3 numbers that describe the triangle formed by the galaxies indexed along each axis.

slide-11
SLIDE 11

Histogramming

One slice of the histogram calculations represents all the triangles that use one common galaxy.

slide-12
SLIDE 12

Histogramming

Each thread does calculations for one "pivot" galaxy.
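A minimal CUDA sketch of this thread-per-pivot layout (not the original kernel: the coordinate arrays, the number of bins, and binning by the three side lengths rather than the two-distances-plus-opening-angle parameterisation used elsewhere in this talk are illustrative assumptions):

#define NBINS_R 16           // illustrative number of bins per triangle side
#define RMAX    100.0f       // illustrative maximum separation

// Bin one triangle side length into a linear bin (illustrative binning).
__device__ int side_bin(float dx, float dy, float dz)
{
    float r = sqrtf(dx * dx + dy * dy + dz * dz);
    int b = (int)(NBINS_R * r / RMAX);
    return b < NBINS_R ? b : NBINS_R - 1;
}

// One thread per "pivot" galaxy i: loop over the remaining pairs (j, k) to
// complete the triplets, and increment the 3D histogram (flattened to 1D)
// of the three side-length bins. The i < j < k ordering counts each triplet once.
__global__ void three_point_pivot(const float *x, const float *y, const float *z,
                                  int n, unsigned int *hist)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // the pivot galaxy
    if (i >= n) return;

    for (int j = i + 1; j < n; ++j) {
        int b_ij = side_bin(x[i] - x[j], y[i] - y[j], z[i] - z[j]);
        for (int k = j + 1; k < n; ++k) {
            int b_ik = side_bin(x[i] - x[k], y[i] - y[k], z[i] - z[k]);
            int b_jk = side_bin(x[j] - x[k], y[j] - y[k], z[j] - z[k]);
            atomicAdd(&hist[(b_ij * NBINS_R + b_ik) * NBINS_R + b_jk], 1u);
        }
    }
}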

slide-13
SLIDE 13

Histogramming

Each thread does calculations for one "pivot" galaxy.

slide-14
SLIDE 14

Histogramming

Each thread does calculations for one "pivot" galaxy.

slide-15
SLIDE 15

Histogramming

Each thread does calculations for one "pivot" galaxy.

slide-16
SLIDE 16

Histogramming

Each thread does calculations for one "pivot" galaxy.

slide-17
SLIDE 17

Histogramming

We can choose to break up the volume of calculations into subvolumes. These subvolumes can be farmed out to different CPUs/GPUs, and the results combined.
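A hedged host-side sketch of the combining step, assuming each subvolume (e.g. each CPU or GPU chunk of the calculation) has already produced its own partial histogram; the function and variable names are illustrative:

#include <vector>

// Host-side sketch: each subvolume fills its own partial histogram; because
// the histograms are simple counts, the combined result is just the
// element-wise sum of the partial histograms.
std::vector<unsigned long long>
combine_partial_histograms(const std::vector<std::vector<unsigned long long>> &partial)
{
    std::vector<unsigned long long> total(partial.at(0).size(), 0);
    for (const auto &h : partial)                   // one histogram per subvolume
        for (size_t b = 0; b < total.size(); ++b)
            total[b] += h[b];
    return total;
}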

slide-18
SLIDE 18

Histogramming

Challenges arise if multiple threads are trying to increment the same bin.

slide-19
SLIDE 19

Histogramming issues: binning matters!

  • Finer bins are good!
    ○ discern more structure
    ○ less thread blocking
  • Finer bins are bad!
    ○ limited shared memory

Kulkarni et al., MNRAS 378 3 (2007)

Shared memory is capped at 48 kB, but 50 x 16 x 32 bins x (4 bytes) = 102 kB! (binning used in previous measurements)

slide-20
SLIDE 20

Histogramming

Large number of bins to fill if everything is kept. Do we need to keep everything?

slide-21
SLIDE 21

Histogramming

Only record part of the calculations. These samples of the full calculation are enough to test different Cosmologies.

slide-22
SLIDE 22

CPU vs GPU

CPU (desktop): Intel(R) Xeon(R) E5-1620 v2 @ 3.70 GHz
GPU: NVIDIA Tesla K40

  # galaxies    CPU time (minutes)                  GPU time (minutes)
  1,000         3.2                                 0.15
  5,000         500 (8.25 hours)                    19
  10,000        2,790 (46.5 hours)                  120
  50,000        480,000 (8,000 hours / 333 days)    20,400 (340 hours / 14 days)

  • Speedup of ~20x compared to CPU.
  • 50k-galaxy sample run on:
    ○ SLAC, 7,000 CPUs
    ○ XSEDE/Stampede, 128 GPUs
    ○ turnaround time for the researcher is 3-4 days.

slide-23
SLIDE 23

Comparison to approximation code

  • GPU is faster than the KD-tree approximation method
    ○ and it's precise!

  # galaxies    CPU time (minutes)    KD-tree (minutes)    GPU time (minutes)
  1,000         3.4                   0.9                  0.2
  5,000         300                   22                   14

  • Can improve precision in the KD-tree by using smaller leaves, but it then runs much slower (~10x).

KD-tree approximates to a level of 0.05 in each triangle parameter.

slide-24
SLIDE 24

Summary

  • Cosmology is entering the Big Data era
  • Cosmological calculations do not scale well to Big Data!
    ○ 3-point correlation function: O(N³)
  • GPUs enable precise calculations in a reasonable timeframe
    ○ 20x faster than CPU
    ○ faster than approximation code!
  • Interesting issues with histogramming.
  • Easily scales up to multi-GPU clusters
    ○ Exa-scale calculations feasible!

https://github.com/djbard/ccogs

slide-25
SLIDE 25

References

  • Fosalba, P. et al. "The MICE Grand Challenge Lightcone Simulation I: Dark matter clustering." arXiv preprint arXiv:1312.1707 (2013).
  • Kulkarni, Gauri V. et al. "The three-point correlation function of luminous red galaxies in the Sloan Digital Sky Survey." Monthly Notices of the Royal Astronomical Society 378.3 (2007): 1196-1206.
  • Kayo, Issha et al. "Three-Point Correlation Functions of SDSS Galaxies in Redshift Space: Morphology, Color, and Luminosity Dependence." Publications of the Astronomical Society of Japan 56.3 (2004): 415-423.
  • Podlozhnyuk, Victor. "Histogram calculation in CUDA." NVIDIA Corporation, White Paper (2007).
  • Bard, Deborah et al. "Cosmological calculations on the GPU." Astronomy and Computing 1 (2013): 17-22.

slide-26
SLIDE 26

Extra Slides

slide-27
SLIDE 27

Cosmology: the study of the nature and history of the Universe

  • The nature of the Dark Universe is the biggest puzzle facing scientists today.

slide-28
SLIDE 28

Dark Energy and the growth of structure

  • Dark energy affects the growth of structure over time.

These simulations were carried out by the Virgo Supercomputing Consortium using computers based at Computing Centre of the Max-Planck Society in Garching and at the Edinburgh Parallel Computing Centre. The data are publicly available at www.mpa-garching.mpg.de/galform/virgo/int_sims

slide-29
SLIDE 29

Examples of the reduced 3-point function in different triangle parameterisation binnings.

slide-30
SLIDE 30

How we calculate these functions

  • Use estimators (Landy & Szalay 1993; Szapudi & Szalay 1998), applied bin by bin (a minimal sketch follows below):

    ξ = (DD - 2DR + RR) / RR
    ζ = (DDD - 3DDR + 3DRR - RRR) / RRR

    where D denotes counts involving the data galaxies and R counts involving a random catalogue.

  • Count pairs and triplets of galaxies:
    ○ two-point function: O(N²) [Bard, Bellis et al., Astronomy and Computing 1 (2013) 17]
    ○ three-point function: O(N³)!
  • Histogram according to:
    ○ distance between galaxies (two-point) → 1D histogram
    ○ triangle configuration (three-point) → 3D histogram!

[Figure: two-point function vs. distance between galaxy pairs; three-point function vs. opening angle of triangle]
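A minimal host-side sketch of applying these estimators bin by bin, assuming the DD/DR/RR (and DDD/DDR/DRR/RRR) counts have already been normalised to comparable totals; the function and array names are illustrative:

#include <vector>

// Landy-Szalay two-point estimator, per separation bin:
//   xi = (DD - 2*DR + RR) / RR
std::vector<double> landy_szalay_xi(const std::vector<double> &dd,
                                    const std::vector<double> &dr,
                                    const std::vector<double> &rr)
{
    std::vector<double> xi(dd.size(), 0.0);
    for (size_t b = 0; b < dd.size(); ++b)
        if (rr[b] > 0.0)
            xi[b] = (dd[b] - 2.0 * dr[b] + rr[b]) / rr[b];
    return xi;
}

// Szapudi-Szalay three-point estimator, per triangle-configuration bin:
//   zeta = (DDD - 3*DDR + 3*DRR - RRR) / RRR
std::vector<double> szapudi_szalay_zeta(const std::vector<double> &ddd,
                                        const std::vector<double> &ddr,
                                        const std::vector<double> &drr,
                                        const std::vector<double> &rrr)
{
    std::vector<double> zeta(ddd.size(), 0.0);
    for (size_t b = 0; b < ddd.size(); ++b)
        if (rrr[b] > 0.0)
            zeta[b] = (ddd[b] - 3.0 * ddr[b] + 3.0 * drr[b] - rrr[b]) / rrr[b];
    return zeta;
}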

slide-31
SLIDE 31

Binning matters: histogramming is non-trivial!

  • Podlozhnyuk, Victor. "Histogram calculation in CUDA." NVIDIA Corporation, White Paper (2007).

We take a naive but maintainable/implementable approach:

  • Use shared memory for a histogram on each block.
  • Collect the entries at the end of the kernel launch.
  • Sum each block's histogram on the CPU.

We use atomicAdd so that no counts are lost when multiple threads increment the same bin (at the cost of those threads being serialized).

slide-32
SLIDE 32

Challenges of histogramming

Multiple threads want to increment the same bin.
SOLUTION: use atomics and increase the granularity of the bins.
but... increasing the granularity for the 3pt CF goes as granularity³!
Shared memory is capped at 48 kB, but 24 x 24 x 24 bins x (4 bytes) = 55 kB! Yikes!

slide-33
SLIDE 33

Histogramming bottlenecks

Unique issues with histogramming. We've tried:

  • global memory
    ○ can have very fine bins (avoids thread blocking), but data transfer is slow (see the sketch after this list).
  • shared memory
    ○ limited # of bins, so thread blocking is an issue
    ○ nevertheless, faster than using global memory!
  • __shfl
    ○ can share data between threads: only one thread per warp writes to the histogram, which avoids atomicAdd thread-locking within the warp.
    ○ but it actually takes longer to sum across the warp for all histogram bins!
  • randomising the data was vital!
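For reference, a minimal sketch of the global-memory variant in the first bullet above (the precomputed bin indices and names are illustrative assumptions): each thread issues an atomicAdd directly into a device-global histogram, which allows very fine bins but pays for slow global-memory atomics.

// Global-memory histogram variant: every increment is an atomicAdd straight
// into device global memory, so arbitrarily fine binning fits, but each
// update pays for a slow global-memory atomic. bin_idx holds precomputed
// bin indices (illustrative).
__global__ void hist_global(const int *bin_idx, int n, unsigned int *dev_hist)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&dev_hist[bin_idx[i]], 1u);
}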

slide-34
SLIDE 34

Within the kernel...

// On each block, create a histogram that is visible to all the
// threads in that block. (The shared histogram must be zeroed before
// accumulation; that initialisation is omitted from this snippet.)
__shared__ int shared_hist[NUMBER_OF_BINS];

// Run over all the calculations: i2 is the bin index computed from the
// pair/triangle parameters. Increment the appropriate bin.
atomicAdd(&shared_hist[i2], 1);

__syncthreads();

// Copy each block's shared histogram to its own section of dev_hist
// (global memory). The summation will take place on the CPU.
if (threadIdx.x == 0) {
    for (int i = 0; i < tot_hist_size; i++) {
        dev_hist[i + (blockIdx.x * NUMBER_OF_BINS)] = shared_hist[i];
    }
}
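And a hedged sketch of the CPU-side summation mentioned in the final comment above, assuming dev_hist has been copied back to host memory with one sub-histogram per block (host_hist, final_hist, and num_blocks are illustrative names, not from the original code):

// CPU-side summation: add each block's sub-histogram into the final
// histogram. host_hist holds num_blocks consecutive sub-histograms of
// NUMBER_OF_BINS bins each.
void sum_block_histograms(const int *host_hist, int num_blocks, long long *final_hist)
{
    for (int bin = 0; bin < NUMBER_OF_BINS; ++bin)
        for (int block = 0; block < num_blocks; ++block)
            final_hist[bin] += host_hist[block * NUMBER_OF_BINS + bin];
}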

slide-35
SLIDE 35

__shfl

  • The __shfl command allows register values to be shared between threads within a warp.
  • Potential value in histogramming: accumulate the bin count in only one thread
    ○ atomicAdd only once instead of once per thread.
    ○ 1 instruction for each shfl, compared to 3 for an atomicAdd.
  • Usual way: in a 32-thread warp, each thread could be incrementing any one of 640 histogram bins.
    ○ atomicAdd once on each thread → 1920 instructions.
  • With shfl: in a 32-thread warp, shuffle each histogram bin value to the lane-0 thread (see the sketch below).
    ○ still have to atomicAdd once for each histogram bin → 1952 instructions.
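A hedged sketch of the shuffle-based reduction described above (not the original kernel; the precomputed bin indices, the NUM_BINS value, and the use of __shfl_down_sync are illustrative): each bin's count is summed across the warp so that only lane 0 issues the atomicAdd.

#define NUM_BINS 640   // illustrative bin count, taken from the slide

// Shuffle-based histogram: for each bin, sum the per-thread counts across the
// warp with __shfl_down_sync so that only lane 0 issues the atomicAdd. The
// loop over all NUM_BINS per warp is exactly the extra work that made this
// approach slower in practice.
__global__ void hist_shfl(const int *bin_idx, int n, unsigned int *dev_hist)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;                     // lane index within the warp
    int my_bin = (tid < n) ? bin_idx[tid] : -1;      // -1: thread has no entry

    for (int b = 0; b < NUM_BINS; ++b) {
        unsigned int v = (my_bin == b) ? 1u : 0u;    // this thread's count for bin b
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);   // warp-wide sum into lane 0
        if (lane == 0 && v > 0)
            atomicAdd(&dev_hist[b], v);              // one atomicAdd per warp per bin
    }
}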

slide-36
SLIDE 36

MICE Grand Challenge

http://arxiv.org/abs/1312.1707

slide-37
SLIDE 37

XSEDE, TACC, Stampede

Stampede cluster at TACC (Texas Advanced Computing Center), part of XSEDE. 6,400 compute nodes; 128 of the nodes have an NVIDIA K20 GPU. https://portal.tacc.utexas.edu/user-guides/stampede