3DAPAS – June 8th, 2011
Big Bang, Big Iron: CMB Data Analysis at the Petascale and Beyond
Julian Borrill
Computational Cosmology Center, LBL & Space Sciences Laboratory, UCB
with Christopher Cantalupo, Theodore Kisner, Radek Stompor et al. and the BOOMERanG, …
The Cosmic Microwave Background
About 400,000 years after the Big Bang, the expanding Universe cools through the ionization temperature of hydrogen: p+ + e- => H. Without free electrons to scatter off, the photons free-stream to us today.
- COSMIC – filling all of space.
- MICROWAVE – redshifted by the expansion of the Universe from 3000 K to 3 K.
- BACKGROUND – primordial photons coming from “behind” all astrophysical sources.
CMB Science
- Primordial photons give the earliest possible image of the Universe.
- The existence of the CMB supports a Big Bang over a Steady State cosmology (NP1).
- Tiny fluctuations in the CMB temperature (NP2) and polarization encode the fundamentals of
  – Cosmology: geometry, topology, composition, history, …
  – Highest-energy physics: grand unified theories, the dark sector, inflation, …
- Current goals:
  – a definitive T measurement provides complementary constraints for all dark energy experiments,
  – detection of the cosmological B-mode gives the energy scale of inflation from primordial gravity waves (NP3).
The Concordance Cosmology
- High-z Supernova & Supernova Cosmology (1998): Cosmic Dynamics (ΩΛ − Ωm)
- BOOMERanG & MAXIMA (2000): Cosmic Geometry (ΩΛ + Ωm)
- 70% Dark Energy + 25% Dark Matter + 5% Baryons = 95% Ignorance
- What (and why) is the Dark Universe?
Observing the CMB
- With very sensitive, very cold detectors – coloured noise.
- Scanning all of the sky from space, or some of the sky from the stratosphere or high, dry ground.
Analysing The CMB
CMB Satellite Evolution
Evolving science goals require (i) higher resolution & (ii) polarization sensitivity.
CMB Data Analysis
- In principle very simple
  – Assume Gaussianity and maximize the likelihood
    1. of maps given the observations and their noise statistics (analytic),
    2. of power spectra given maps and their noise statistics (iterative),
  as written out below.
- In practice very complex
  – Foregrounds, glitches, asymmetric beams, non-Gaussian noise, etc.
  – Algorithm & implementation scaling with the evolution of
    - CMB data-set size
    - HPC architecture
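Written out in the standard notation (the same symbols reappear on the Simulation & Mapping slide below: d for the time-ordered data, P for the pointing matrix, N for the time-domain noise covariance, s for the map), the two maximization steps are, schematically:

```latex
% Standard maximum-likelihood formulation; a reference sketch rather than
% a transcription of the slide.
\begin{align}
  d_t &= P_{tp}\, s_p + n_t, \qquad \langle n_t\, n_{t'} \rangle = N_{tt'} \\
  \hat{s}_p &= \left(P^T N^{-1} P\right)^{-1} P^T N^{-1} d_t
      && \text{1. map step (analytic)} \\
  -2\ln\mathcal{L}\left(C_\ell \mid \hat{s}\right)
      &= \hat{s}^T \left(S(C_\ell) + N_p\right)^{-1}\hat{s}
       + \ln\det\left(S(C_\ell) + N_p\right) + \text{const}
      && \text{2. spectrum step (iterative)}
\end{align}
% where N_p = (P^T N^{-1} P)^{-1} is the pixel-domain noise covariance and
% S(C_l) is the signal covariance implied by the angular power spectrum C_l.
```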
The CMB Data Challenge
- Extracting fainter signals (polarization, high resolution) from the data requires:
  – larger data volumes to provide higher signal-to-noise,
  – more complex analyses to control fainter systematic effects.

  Experiment    Start date   Goals                        Nt      Np
  COBE          1989         All-sky, low-res, T          10^9    10^4
  BOOMERanG     1997         Cut-sky, high-res, T         10^9    10^6
  WMAP          2001         All-sky, mid-res, T+E        10^10   10^7
  Planck        2009         All-sky, high-res, T+E(+B)   10^12   10^9
  PolarBear     2012         Cut-sky, high-res, T+E+B     10^13   10^7
  QUIET-II      2015         Cut-sky, high-res, T+E+B     10^14   10^7
  CMBpol        2020+        All-sky, high-res, T+E+B     10^15   10^10

- 1000x increase in data volume over the past & future 15 years
  – need linear analysis algorithms to scale through 10 + 10 M-foldings!
CMB Data Analysis Evolution
Data volume & computational capability dictate analysis approach.
- 1997 – 2000: B98 on a Cray T3E × 700
  – Map: Explicit Maximum Likelihood (matrix invert – Np^3)
  – Power spectrum: Explicit Maximum Likelihood (matrix Cholesky + tri-solve – Np^3)
- 2000 – 2003: B2K2 on an IBM SP3 × 3,000
  – Map: Explicit Maximum Likelihood (matrix invert – Np^3)
  – Power spectrum: Explicit Maximum Likelihood (matrix invert + multiply – Np^3)
- 2003 – 2007: Planck SF on an IBM SP3 × 6,000
  – Map: PCG Maximum Likelihood (band-limited FFT – few Nt)
  – Power spectrum: Monte Carlo (Sim + Map – many Nt)
- 2007 – 2010: Planck AF & EBEX on a Cray XT4 × 40,000
  – Map: PCG Maximum Likelihood (band-limited FFT – few Nt)
  – Power spectrum: Monte Carlo (SimMap – many Nt)
- 2010 – 2014: Planck MC & PolarBear on a Cray XE6 × 150,000
  – Map: PCG Maximum Likelihood (band-limited FFT – few Nt)
  – Power spectrum: Monte Carlo (hybrid SimMap – many Nt)

(The original slide groups these rows under the side labels “Algorithms” and “Implementations”: the early gains came from new algorithms, the later ones from re-implementing the same algorithms for each new system.)
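A rough operation count shows why the shift from explicit maximum likelihood to PCG plus Monte Carlo was unavoidable. The sketch below uses illustrative, assumed values (Planck-scale Np and Nt from the tables above, a guessed iteration count and FFT segment length), not measured figures.

```python
# Back-of-envelope flop comparison of the two map-making approaches.
# All inputs are assumed, order-of-magnitude values.

N_p = 1e9       # pixels (Planck-scale map)
N_t = 1e12      # time samples (Planck-scale time-ordered data)
n_iter = 100    # assumed number of PCG iterations to convergence
log_seg = 20    # assumed log2 of a band-limited FFT segment length

explicit_ml = N_p ** 3                # dense invert / Cholesky: O(Np^3)
pcg_ml = n_iter * N_t * log_seg       # a few FFT-weighted passes over the data per iteration

print(f"explicit ML ~ {explicit_ml:.0e} flops")   # ~1e27: hopeless at any scale
print(f"PCG ML      ~ {pcg_ml:.0e} flops")        # ~2e15: tractable on current systems
```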
Scaling In Practice
- 2000: BOOMERanG T-map
  – 10^8 samples => 10^5 pixels
  – 128 Cray T3E processors
- 2005: Planck T-map
  – 10^10 samples => 10^8 pixels
  – 6,000 IBM SP3 processors
- 2008: EBEX T/P-maps
  – 10^11 samples => 10^6 pixels
  – 15,360 Cray XT4 cores
- 2010: Planck Monte Carlo, 1,000 noise T-maps
  – 10^14 samples => 10^11 pixels
  – 32,000 Cray XT4 cores
Simulation & Mapping: Calculations
Given the instrument noise statistics & beams, a scanning strategy, and a sky:

1) SIMULATION: d_t = n_t + s_t = n_t + P_tp s_p
   – A realization of the piecewise-stationary noise time-stream: pseudo-random number generation & FFT.
   – A signal time-stream scanned & beam-smoothed from the sky map: SHT.

2) MAPPING: (P^T N^-1 P) d_p = P^T N^-1 d_t   (A x = b)
   – Build the RHS: FFT & sparse matrix-vector multiply.
   – Solve for the map: PCG over FFT & sparse matrix-vector multiply.
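As a concrete, drastically simplified illustration of these operations, here is a toy Python sketch for a single stationary chunk of data: the noise is simulated with a PRNG plus FFTs, the signal is scanned from a toy map through a sparse pointing matrix, and the map is recovered by conjugate gradients over FFT and sparse matrix-vector multiplies. The sizes, the 1/f noise spectrum, the hit-count preconditioner and the omission of the beam-smoothing SHT are all simplifying assumptions; this is not the production code.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)

n_t, n_p = 2 ** 16, 128          # time samples and sky pixels (toy sizes)

# Assumed 1/f + white noise power spectral density (arbitrary units).
freq = np.fft.rfftfreq(n_t, d=1.0)
psd = np.ones_like(freq)
psd[1:] += 0.01 / freq[1:]

def noise_realization():
    """Stationary noise time-stream: white noise shaped by sqrt(PSD) in Fourier space."""
    white = np.fft.rfft(rng.standard_normal(n_t))
    return np.fft.irfft(white * np.sqrt(psd), n=n_t)

def apply_Ninv(x):
    """Apply N^-1 to a time-stream via FFTs (N is circulant for stationary noise)."""
    return np.fft.irfft(np.fft.rfft(x) / psd, n=n_t)

# Sparse pointing matrix P: one observed pixel per time sample (simple repeated scan).
pix = np.arange(n_t) % n_p
P = sp.csr_matrix((np.ones(n_t), (np.arange(n_t), pix)), shape=(n_t, n_p))

# 1) SIMULATION: d_t = n_t + P_tp s_p (beam-smoothing SHT omitted in this toy).
s_true = rng.standard_normal(n_p)            # toy input sky map
d = P @ s_true + noise_realization()

# 2) MAPPING: solve (P^T N^-1 P) m = P^T N^-1 d_t by preconditioned CG,
#    using only FFTs and sparse matrix-vector multiplies.
A = LinearOperator((n_p, n_p), matvec=lambda m: P.T @ apply_Ninv(P @ m))
b = P.T @ apply_Ninv(d)
hits = np.asarray(P.sum(axis=0)).ravel()     # hit-count (diagonal) preconditioner
M = LinearOperator((n_p, n_p), matvec=lambda m: m / hits)

m_hat, info = cg(A, b, M=M)
print("CG converged:", info == 0, "| map rms error:", np.std(m_hat - s_true))
```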
Simulation & Mapping: Scaling
- In theory such analyses should scale
  – perfectly to arbitrary numbers of cores (strong scaling – within an experiment),
  – linearly with the number of observations (weak scaling – between experiments).
- In practice this does not happen, because of
  – IO (reading pointing, writing time-streams; reading time-streams, writing maps),
  – communication (gathering maps from all processes),
  – calculation inefficiency (linear operations only, minimal data re-use).
- Ultimately it all comes down to delivering data to the cores fast enough.
- Code development has been an ongoing history of addressing this challenge anew with each new CMB data volume and HPC system/scale.
IO - Before
For each MC realization
    For each detector
        Read detector pointing
        Sim
        Write detector time-stream
    For all detectors
        Read detector time-streams & pointing
        Map
        Write map

=> Read:  56 × Realizations × Detectors × Observations bytes
   Write: 8 × Realizations × (Detectors × Observations + Pixels) bytes

E.g. for Planck, read ~500 PB & write ~70 PB.
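Plugging rough, assumed Planck-scale numbers into these formulae reproduces the quoted totals; the realization, sample and pixel counts below are illustrative guesses chosen to be consistent with the slide, not official mission figures.

```python
# Assumed Planck-scale parameters (order of magnitude only).
realizations = 1e4    # Monte Carlo realizations
samples      = 1e12   # detectors x observations (total time samples)
pixels       = 1e9    # sky pixels

read_bytes  = 56 * realizations * samples              # pointing + time-stream reads
write_bytes = 8 * realizations * (samples + pixels)    # time-stream + map writes

print(f"read  ~ {read_bytes / 1e15:.0f} PB")    # ~560 PB (slide: ~500 PB)
print(f"write ~ {write_bytes / 1e15:.0f} PB")   # ~80 PB  (slide: ~70 PB)
```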
IO - Optimizations
- Read sparse telescope pointing instead of dense detector pointing
  – Calculate individual detector pointing on the fly.
- Remove the redundant write/read of time-streams between simulation & mapping
  – Generate simulations on the fly, only when the map-maker requests data.
- Put the MC loop inside the map-maker
  – Amortize common data reads over all realizations.
IO – After
Read telescope pointing
For each detector
    Calculate detector pointing
For each MC realization
    SimMap
        For all detectors
            Simulate time-stream
    Write map

=> Read:  24 × Sparse Observations bytes
   Write: 8 × Realizations × Pixels bytes

E.g. for Planck, read ~2 GB & write ~70 TB (10^8 read & 10^3 write compression).
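With the same assumed parameters, the restructured flow reads only the sparse telescope pointing and writes only maps; the sparse-pointing sample count below is likewise an assumption chosen to match the slide's ~2 GB figure.

```python
# Same assumed Planck-scale parameters as in the previous sketch.
realizations   = 1e4    # Monte Carlo realizations
samples        = 1e12   # total detector samples
pixels         = 1e9    # sky pixels
sparse_samples = 1e8    # telescope (not detector) pointing samples -- assumed

read_after  = 24 * sparse_samples          # ~2.4 GB  (slide: ~2 GB)
write_after = 8 * realizations * pixels    # ~80 TB   (slide: ~70 TB)

read_before  = 56 * realizations * samples
write_before = 8 * realizations * (samples + pixels)

print(f"read compression  ~ {read_before / read_after:.0e}")    # ~1e8
print(f"write compression ~ {write_before / write_after:.0e}")  # ~1e3
```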
Communication Details
- The time-ordered data from all the detectors are distributed over the processes, subject to
  – load balance,
  – common telescope pointing.
- Each process therefore holds some of the observations for some of the pixels.
- In each PCG iteration, each process solves with its observations.
- At the end of each iteration, each process needs to gather the total result for all of the pixels it observed.
Communication Implementation
- Initialize a process & MPI task on every core.
- Distribute the time-stream data & hence the pixels.
- After each PCG iteration
  – each process creates a full map vector by zero-padding,
  – call MPI_Allreduce(map, world),
  – each process extracts the pixels of interest to it & discards the rest.
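In mpi4py terms the per-iteration gather looks roughly like the sketch below. The variable names (npix_full, my_pixels, my_local_map) are placeholders, and the production codes are not Python; treat this only as an illustration of the zero-pad / Allreduce / extract pattern described above.

```python
# Sketch of the per-PCG-iteration map gather (original Allreduce scheme).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def gather_observed_pixels(my_pixels, my_local_map, npix_full):
    """my_pixels: global indices this process observes; my_local_map: its partial sums."""
    # Zero-pad the local contribution into a full map vector...
    full = np.zeros(npix_full, dtype=np.float64)
    full[my_pixels] = my_local_map
    # ...sum the full vector across all processes (every process handles every pixel)...
    comm.Allreduce(MPI.IN_PLACE, full, op=MPI.SUM)
    # ...then keep only the pixels this process actually observed, discarding the rest.
    return full[my_pixels]
```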
Communication Challenge
Communication Optimizations
- Reduce the number of MPI tasks (hybridize)
  – use MPI only for off-node communication,
  – use threads for on-node communication.
- Minimize the total message volume
  – determine the pixel overlap for every process pair (a one-time initialization cost),
  – if the total overlap data volume is smaller than a full map, use Alltoallv in place of Allreduce,
  – typically Alltoallv for all-sky surveys, Allreduce for part-sky surveys.
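A sketch of the Alltoallv alternative follows: each process sends every peer only the values of the pixels they share, using overlap lists assumed to have been built once during the initialization described above. The bookkeeping names and the assumption that both ends of a pair order their shared pixels identically (e.g. by global pixel index) are illustrative, not a description of the real implementation.

```python
# Sketch of the pairwise-overlap exchange (Alltoallv scheme).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()

def exchange_overlaps(my_local_map, send_idx):
    """send_idx[r]: local-map positions whose pixels rank r also observes
    (empty for our own rank), ordered consistently on both sides of each pair."""
    sendcounts = np.array([len(send_idx[r]) for r in range(nprocs)])
    sdispls = np.insert(np.cumsum(sendcounts), 0, 0)[:-1]
    sendbuf = np.concatenate([my_local_map[send_idx[r]] for r in range(nprocs)])

    # Overlaps are symmetric: we receive exactly as much as we send to each peer.
    recvcounts, rdispls = sendcounts, sdispls
    recvbuf = np.empty(int(sendcounts.sum()), dtype=np.float64)

    comm.Alltoallv([sendbuf, sendcounts, sdispls, MPI.DOUBLE],
                   [recvbuf, recvcounts, rdispls, MPI.DOUBLE])

    # Accumulate every peer's contribution onto our shared pixels.
    total = my_local_map.copy()
    for r in range(nprocs):
        np.add.at(total, send_idx[r],
                  recvbuf[rdispls[r]:rdispls[r] + recvcounts[r]])
    return total
```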
Communication Improvement
Communication Evaluation
- Which enhancement is more important? Is this system-dependent?
- Compare unthreaded vs. threaded and Allreduce vs. Alltoallv on Cray XT4, XT5 & XE6, over 200 – 16,000 cores:
  – Alltoallv at the low end,
  – threads at the high end.
Petascale Communication (I)
HPC System Evolution
- Clock speed is no longer able to maintain Moore’s Law.
- Multi-core CPU and GPGPU are two major approaches.
- Both of these will require performance modeling, experimentation & auto-tuning.
- E.g. NERSC's new XE6 system, Hopper:
  – 6,384 nodes
  – 2 Magny-Cours processors per node
  – 2 NUMA nodes per processor
  – 6 cores per NUMA node
- What is the best way to run hybrid code on such a system?
  – “wisdom” says 4 processes × 6 threads, to avoid NUMA effects.
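One way to make the "performance modeling, experimentation & auto-tuning" point concrete is simply to enumerate the candidate process-by-thread layouts for a node and benchmark each. The sketch below encodes only the node geometry quoted above (24 cores as 4 NUMA domains of 6); which layout actually wins is an empirical question.

```python
# Candidate hybrid (MPI processes x threads) layouts for one Hopper node:
# 24 cores arranged as 4 NUMA domains of 6 cores each (from the slide above).
cores_per_node = 24
cores_per_numa = 6

for nprocs in (1, 2, 3, 4, 6, 8, 12, 24):
    nthreads = cores_per_node // nprocs
    # A process stays inside one NUMA domain if its threads fit there evenly.
    numa_safe = nthreads <= cores_per_numa and cores_per_numa % nthreads == 0
    note = "fits within a NUMA domain" if numa_safe else "spans NUMA domains"
    print(f"{nprocs:2d} processes x {nthreads:2d} threads  ({note})")
```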
Petascale Communication (II)
Current Status
- Calculations scale with the number of observations.
- IO & communication scale with the number of pixels.
- Observations per pixel ~ S/N: science goals will help scaling!
  – Planck: O(10^3) observations per pixel
  – PolarBear: O(10^6) observations per pixel
- For each experiment, fixed data volume => strong scaling.
- Across experiments, growing data volume => weak scaling.
Petascale Communication (III)
Conclusions
- The CMB provides a unique window onto the early Universe
  – a probe of fundamental cosmology & physics.
- CMB data analysis is a long-standing, computationally challenging problem requiring state-of-the-art HPC capabilities.
- The science we can extract from CMB data sets is determined by
  – the absolute limits on our computational capability, and
  – our practical ability to exploit this.
- Both the CMB data sets we gather and the HPC systems we analyze them on are evolving.