Big Bang, Big Iron: CMB Data Analysis at the Petascale and Beyond

Julian Borrill
Computational Cosmology Center, LBL & Space Sciences Laboratory, UCB
with Christopher Cantalupo, Theodore Kisner, Radek Stompor et al. and the BOOMERanG, EBEX, MAXIMA, Planck & PolarBear teams

3DAPAS – June 8th, 2011

The Cosmic Microwave Background

About 400,000 years after the Big Bang, the expanding Universe cools through the ionization temperature of hydrogen: p+ + e- => H. Without free electrons to scatter off, the photons free-stream to us today.

  • COSMIC - filling all of space.
  • MICROWAVE - redshifted by the expansion of the Universe from 3000 K to 3 K.
  • BACKGROUND - primordial photons coming from “behind” all astrophysical sources.

CMB Science

  • Primordial photons give the earliest possible image of the Universe.
  • The existence of the CMB supports a Big Bang over a Steady State cosmology (NP1).
  • Tiny fluctuations in the CMB temperature (NP2) and polarization encode the fundamentals of
    – Cosmology: geometry, topology, composition, history, …
    – Highest-energy physics: grand unified theories, the dark sector, inflation, …
  • Current goals:
    – a definitive T measurement provides complementary constraints for all dark energy experiments.
    – a detection of cosmological B-mode polarization gives the energy scale of inflation from primordial gravity waves (NP3).

The Concordance Cosmology

  • High-z Supernova Search Team & Supernova Cosmology Project (1998): Cosmic Dynamics (ΩΛ - Ωm)
  • BOOMERanG & MAXIMA (2000): Cosmic Geometry (ΩΛ + Ωm)
  • 70% Dark Energy + 25% Dark Matter + 5% Baryons: 95% ignorance
  • What (and why) is the Dark Universe?

Observing the CMB

  • With very sensitive, very cold detectors – coloured noise.
  • Scanning all of the sky from space, or some of the sky from the stratosphere or high, dry ground.

Analysing The CMB

CMB Satellite Evolution

Evolving science goals require (i) higher resolution & (ii) polarization sensitivity.

CMB Data Analysis

  • In principle very simple
    – Assume Gaussianity and maximize the likelihood
      1. of maps given the observations and their noise statistics (analytic).
      2. of power spectra given maps and their noise statistics (iterative).
  • In practice very complex
    – Foregrounds, glitches, asymmetric beams, non-Gaussian noise, etc.
    – Algorithm & implementation scaling with the evolution of
      • CMB data-set size
      • HPC architecture

The CMB Data Challenge

  • Extracting fainter signals (polarization, high resolution) from the data requires:
    – larger data volumes to provide higher signal-to-noise.
    – more complex analyses to control fainter systematic effects.

Experiment | Start Date | Goals                      | Nt    | Np
COBE       | 1989       | All-sky, low-res, T        | 10^9  | 10^4
BOOMERanG  | 1997       | Cut-sky, high-res, T       | 10^9  | 10^6
WMAP       | 2001       | All-sky, mid-res, T+E      | 10^10 | 10^7
Planck     | 2009       | All-sky, high-res, T+E(+B) | 10^12 | 10^9
PolarBear  | 2012       | Cut-sky, high-res, T+E+B   | 10^13 | 10^7
QUIET-II   | 2015       | Cut-sky, high-res, T+E+B   | 10^14 | 10^7
CMBpol     | 2020+      | All-sky, high-res, T+E+B   | 10^15 | 10^10

  • 1000x increase in data volume over both the past and the coming 15 years
    – need linear analysis algorithms that scale through 10 + 10 M(oore)-foldings!

CMB Data Analysis Evolution

Data volume & computational capability dictate analysis approach.

Date        | Data                 | System             | Map                                                 | Power Spectrum
1997 - 2000 | B98                  | Cray T3E x 700     | Explicit Maximum Likelihood (Matrix Invert - Np^3)  | Explicit Maximum Likelihood (Matrix Cholesky + Tri-solve - Np^3)
2000 - 2003 | B2K2                 | IBM SP3 x 3,000    | Explicit Maximum Likelihood (Matrix Invert - Np^3)  | Explicit Maximum Likelihood (Matrix Invert + Multiply - Np^3)
2003 - 2007 | Planck SF            | IBM SP3 x 6,000    | PCG Maximum Likelihood (band-limited FFT - few Nt)  | Monte Carlo (Sim + Map - many Nt)
2007 - 2010 | Planck AF, EBEX      | Cray XT4 x 40,000  | PCG Maximum Likelihood (band-limited FFT - few Nt)  | Monte Carlo (SimMap - many Nt)
2010 - 2014 | Planck MC, PolarBear | Cray XE6 x 150,000 | PCG Maximum Likelihood (band-limited FFT - few Nt)  | Monte Carlo (Hybrid SimMap - many Nt)

The early evolution (explicit maximum likelihood → PCG/Monte Carlo) was in the algorithms; the later evolution (Sim + Map → SimMap → Hybrid SimMap) was in their implementations.

Scaling In Practice

  • 2000: BOOMERanG T-map
    – 10^8 samples => 10^5 pixels
    – 128 Cray T3E processors
  • 2005: Planck T-map
    – 10^10 samples => 10^8 pixels
    – 6,000 IBM SP3 processors
  • 2008: EBEX T/P-maps
    – 10^11 samples, 10^6 pixels
    – 15,360 Cray XT4 cores
  • 2010: Planck Monte Carlo – 1,000 noise T-maps
    – 10^14 samples => 10^11 pixels
    – 32,000 Cray XT4 cores

Simulation & Mapping: Calculations

Given the instrument noise statistics & beams, a scanning strategy, and a sky:

1) SIMULATION: d_t = n_t + s_t = n_t + P_tp s_p
   – A realization of the piecewise-stationary noise time-stream:
     • pseudo-random number generation & FFT
   – A signal time-stream scanned & beam-smoothed from the sky map:
     • SHT

2) MAPPING: (P^T N^-1 P) d_p = P^T N^-1 d_t   (i.e. A x = b)
   – Build the RHS:
     • FFT & sparse matrix-vector multiply
   – Solve for the map:
     • PCG over FFT & sparse matrix-vector multiply (see the sketch below)
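A minimal sketch of this Sim + Map cycle, assuming uniform white noise (so N^-1 reduces to a scalar weight rather than the band-limited FFT filter used in practice) and a toy 1-D pointing stream; the sizes and variable names are illustrative, not the production pipeline's.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n_pix, n_samp = 1_000, 100_000

# Toy scan: which pixel each time sample hits. This stands in for the sparse
# pointing matrix P (one nonzero per row).
hit = rng.integers(0, n_pix, size=n_samp)
P = csr_matrix((np.ones(n_samp), (np.arange(n_samp), hit)), shape=(n_samp, n_pix))
sky = rng.standard_normal(n_pix)                    # input map s_p

# 1) SIMULATION: d_t = n_t + P_tp s_p (white noise stands in for the
#    FFT-filtered piecewise-stationary noise realization).
sigma = 0.5
d = P @ sky + sigma * rng.standard_normal(n_samp)

# 2) MAPPING: solve (P^T N^-1 P) m = P^T N^-1 d by (P)CG, with every
#    matrix-vector product built from sparse multiplies by P and P^T.
w = 1.0 / sigma**2                                  # scalar stand-in for N^-1
rhs = P.T @ (w * d)
A = LinearOperator((n_pix, n_pix), matvec=lambda m: P.T @ (w * (P @ m)),
                   dtype=np.float64)
m, info = cg(A, rhs)

print("converged:", info == 0, " map rms error:", float(np.std(m - sky)))
```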

Simulation & Mapping: Scaling

  • In theory such analyses should scale
    – perfectly to arbitrary numbers of cores (strong – within an experiment).
    – linearly with the number of observations (weak – between experiments).
  • In practice this does not happen because of
    – IO (reading pointing, writing time-streams; reading time-streams, writing maps)
    – Communication (gathering maps from all processes)
    – Calculation inefficiency (linear operations only, minimal data re-use)
  • Ultimately it all comes down to delivering data to the cores fast enough.
  • Code development has been an ongoing history of addressing this challenge anew with each new CMB data volume and HPC system/scale.

IO - Before

For each MC realization
    For each detector
        Read detector pointing
        Sim
        Write detector timestream
    For all detectors
        Read detector timestream & pointing
        Map
    Write map

⇒ Read:  56 x Realizations x Detectors x Observations bytes
  Write:  8 x Realizations x (Detectors x Observations + Pixels) bytes

E.g. for Planck, read 500 PB & write 70 PB.

IO - Optimizations

  • Read sparse telescope pointing instead of dense detector pointing
    – Calculate individual detector pointing on the fly.
  • Remove the redundant write/read of time-streams between simulation & mapping
    – Generate simulations on the fly, only when the map-maker requests data.
  • Put the MC loop inside the map-maker
    – Amortize common data reads over all realizations.

IO – After

Read telescope pointing
For each detector
    Calculate detector pointing
For each MC realization
    SimMap:
        For all detectors
            Simulate time-stream
    Write map

⇒ Read:  24 x Sparse Observations bytes
  Write:  8 x Realizations x Pixels bytes

E.g. for Planck, read 2 GB & write 70 TB (10^8 read & 10^3 write compression).
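A toy sketch of this restructured loop: common pointing is read once, detector pointing and simulated time-streams are generated on the fly, and only maps are written. Everything here is an illustrative stand-in (1-D pointing, white noise, naive binned maps); none of the function names come from the real codes.

```python
import numpy as np

rng = np.random.default_rng(1)
N_PIX, N_SAMP, N_DET, N_MC = 256, 50_000, 4, 10

def read_telescope_pointing():
    # Stand-in for the single sparse pointing read (~24 bytes per observation).
    return rng.integers(0, N_PIX, size=N_SAMP)

def expand_detector_pointing(boresight, det):
    # Stand-in for computing per-detector pointing on the fly (here: a fixed offset).
    return (boresight + det) % N_PIX

def simulate_timestream(pix, realization):
    # Stand-in for on-the-fly signal + noise simulation; nothing touches disk.
    return np.cos(2 * np.pi * pix / N_PIX) + rng.standard_normal(pix.size)

boresight = read_telescope_pointing()                           # read once
pointing = {d: expand_detector_pointing(boresight, d) for d in range(N_DET)}

for mc in range(N_MC):                                          # MC loop inside the map-maker
    signal = np.zeros(N_PIX)
    hits = np.zeros(N_PIX)
    for d, pix in pointing.items():                             # simulate, never write, the time-streams
        tod = simulate_timestream(pix, mc)
        signal += np.bincount(pix, weights=tod, minlength=N_PIX)
        hits += np.bincount(pix, minlength=N_PIX)
    np.save(f"map_{mc:03d}.npy", signal / np.maximum(hits, 1))  # only ~8 bytes per pixel written
```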

Communication Details

  • The time-ordered data from all the detectors are distributed over the processes subject to:
    – load-balance
    – common telescope pointing
  • Each process therefore holds
    – some of the observations
    – for some of the pixels.
  • In each PCG iteration, each process solves with its observations.
  • At the end of each iteration, each process needs to gather the total result for all of the pixels it observed.

Communication Implementation

  • Initialize a process & MPI task on every core
  • Distribute the time-stream data & hence the pixels
  • After each PCG iteration
    – Each process creates a full map vector by zero-padding
    – Call MPI_Allreduce(map, world)
    – Each process extracts the pixels of interest to it & discards the rest
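A minimal mpi4py sketch of this per-iteration gather, assuming each process already holds partial sums for the pixels it observes; the map size and random pixel sets are toy values.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

N_PIX = 1 << 20                                   # full (toy) map size
rng = np.random.default_rng(rank)

# Pixels this process observes, and its locally-accumulated values for them.
my_pixels = np.unique(rng.integers(0, N_PIX, size=100_000))
my_values = rng.standard_normal(my_pixels.size)

# Zero-pad the local contribution into a full map vector ...
full_map = np.zeros(N_PIX)
full_map[my_pixels] = my_values

# ... sum it across the world communicator ...
comm.Allreduce(MPI.IN_PLACE, full_map, op=MPI.SUM)

# ... and keep only the pixels of interest; the rest is discarded.
my_totals = full_map[my_pixels]
```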

Communication Challenge

Communication Optimizations

  • Reduce the number of MPI tasks (hybridize)
    – Use MPI only for off-node communication
    – Use threads for on-node communication
  • Minimize the total message volume
    – Determine the pixel overlap for every process pair
      • one-time initialization cost
    – If the total overlap data volume is smaller than a full map, use Alltoallv in place of Allreduce (see the sketch below)
      • typically Alltoallv for all-sky surveys, Allreduce for part-sky surveys
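A sketch of the message-volume optimization in mpi4py, again with toy pixel sets: pairwise overlaps are computed once up front, and each iteration then exchanges only the overlapping pixels with Alltoallv when that is cheaper than reducing a full map (the Allreduce fallback is the previous sketch). The decision rule and the packing scheme below are illustrative, not the production implementation.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N_PIX = 1 << 20
rng = np.random.default_rng(rank)
my_pixels = np.unique(rng.integers(0, N_PIX, size=100_000))   # sorted & unique
my_values = rng.standard_normal(my_pixels.size)               # local partial sums

# --- One-time initialization: pixel overlap for every process pair ---------
pixel_sets = comm.allgather(my_pixels)
overlap = [np.intersect1d(my_pixels, p, assume_unique=True) for p in pixel_sets]
counts = np.array([o.size for o in overlap], dtype=np.int64)
displ = np.concatenate(([0], np.cumsum(counts)[:-1]))

# Every process must make the same choice, so compare the global overlap
# volume against one full map per process (an illustrative decision rule).
total_overlap = comm.allreduce(int(counts.sum()), op=MPI.SUM)
use_alltoallv = total_overlap < size * N_PIX

if use_alltoallv:
    # Pack, for each destination, my values on our shared pixels (overlaps are
    # symmetric, so per-pair send and receive counts are identical).
    send = np.concatenate([my_values[np.searchsorted(my_pixels, o)] for o in overlap])
    recv = np.empty(counts.sum(), dtype=np.float64)
    comm.Alltoallv([send, counts, displ, MPI.DOUBLE],
                   [recv, counts, displ, MPI.DOUBLE])
    # Accumulate every contribution (including my own echo) onto my pixels.
    totals = np.zeros(my_pixels.size)
    for o, c, d in zip(overlap, counts, displ):
        totals[np.searchsorted(my_pixels, o)] += recv[d:d + c]
else:
    # Fall back to the zero-padded full-map Allreduce of the previous sketch.
    full_map = np.zeros(N_PIX)
    full_map[my_pixels] = my_values
    comm.Allreduce(MPI.IN_PLACE, full_map, op=MPI.SUM)
    totals = full_map[my_pixels]
```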

Communication Improvement

Communication Evaluation

  • Which enhancement is more important?
  • Is this system-dependent?
  • Compare
    – unthreaded/threaded
    – allreduce/alltoallv
  • On Cray XT4, XT5 & XE6
  • On 200 – 16,000 cores:
    – Alltoallv wins at the low end
    – Threads win at the high end

Petascale Communication (I)

HPC System Evolution

  • Clock speed is no longer able to maintain Moore’s Law.
  • Multi-core CPU and GPGPU are the two major approaches.
  • Both of these will require performance modeling, experiment & auto-tuning.
  • E.g. NERSC’s new XE6 system, Hopper:
    – 6,384 nodes
    – 2 Magny-Cours processors per node
    – 2 NUMA nodes per processor
    – 6 cores per NUMA node
  • What is the best way to run hybrid code on such a system?
    – “wisdom” says 4 processes x 6 threads per node to avoid NUMA effects.
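One way to sanity-check a hybrid layout such as the suggested 4 processes x 6 threads per node is to have every MPI rank report where it landed and how many threads it was given. This mpi4py sketch only inspects OMP_NUM_THREADS and the host name; the actual launch flags and thread-affinity settings are system-specific and are not shown here.

```python
import os
import socket
from collections import Counter

from mpi4py import MPI

comm = MPI.COMM_WORLD
node = socket.gethostname()
threads = int(os.environ.get("OMP_NUM_THREADS", "1"))   # threads this rank would spawn

# Gather (node, threads) from every rank and summarize the placement on rank 0.
placement = comm.gather((node, threads), root=0)
if comm.Get_rank() == 0:
    per_node = Counter(n for n, _ in placement)
    for name, nranks in sorted(per_node.items()):
        print(f"{name}: {nranks} MPI ranks, {threads} threads each")  # expect 4 x 6 per node
```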

Petascale Communication (II)

Current Status

  • Calculations scale with #observations.
  • IO & communication scale with #pixels.
  • Observations/pixel ~ S/N: the science goals will help scaling!
    – Planck: O(10^3) observations per pixel
    – PolarBear: O(10^6) observations per pixel
  • For each experiment, fixed data volume => strong scaling.
  • Across experiments, growing data volume => weak scaling.

Petascale Communication (III)

Conclusions

  • The CMB provides a unique window onto the early Universe
    – a probe of fundamental cosmology & physics.
  • CMB data analysis is a long-standing, computationally challenging problem requiring state-of-the-art HPC capabilities.
  • The science we can extract from CMB data sets is determined by
    – the absolute limits on our computational capability, and
    – our practical ability to exploit this.
  • Both the CMB data sets we gather and the HPC systems we analyze them on are evolving
    – analysis algorithms & implementations are a process, not a state.