The Sloan Digital Sky Survey: From Big Data to Big Database to Big Compute
Heidi Newberg, Rensselaer Polytechnic Institute
Summary
- History of the data deluge from a personal perspective.
- The transformation of astronomy with the Sloan Digital Sky Survey.
- The discovery of density substructure in the Milky Way stellar spheroid.
- Using MilkyWay@home to fit more complex models to the data.
The new 1024x1024 CCD camera required a new computer to store the data from just one night of observing (2 megabytes every five minutes). We also needed to write to Exabyte tape drives rather than magnetic tapes, so the data would be easier to carry home on the airplane.
The beginning of the data deluge (1990’s)
- New CCD cameras produced enough data that we could no longer look at each astronomical object individually. Automated algorithms were needed.
- Magnetic tapes held 100 Mbytes each, about 2 hours of observing time per tape (requiring a large backpack to transport home). Exabyte tapes made data transport easier.
- I still own all of these tapes, but it is likely that they are no longer readable. All astronomical data from that era is lost forever.
The Sloan Digital Sky Survey (SDSS) is a joint project of The University of Chicago, Fermilab, the Institute for Advanced Study, the Japan Participation Group, The Johns Hopkins University, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, Princeton University, the U.S. Naval Observatory, and the University of Washington (11 institutions). Funding for the project has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Aeronautics and Space Administration, the National Science Foundation, the U.S. Department of Energy, the Japanese Monbukagakusho, and the Max Planck Society.
The Data
- Images of 14,000 square degrees of sky in 5 passbands (raw data 20 TB).
- A catalog of a billion objects detected in those images (20 TB SQL database), ~400 parameters per object.
- Other data products (DAS – 34 TB).
- 1.5 million spectra of galaxies, stars, and quasars (3.3 TB).
- Spectral parameters (450 Gbytes).
Data reduction??
I have already discussed the data processing. Alex Szalay and his group at Johns Hopkins took on the enormous task of putting all of this data into a database, preserving as much provenance as possible, and making the data as accessible as possible. There are serious issues with speed in a database of this size, so his group needed to think hard about how the data would be accessed, and thus how it should be organized.
The 20 Queries
Q1: Find all galaxies without unsaturated pixels within 1' of a given point (ra=75.327, dec=21.023).
Q2: Find all galaxies with blue surface brightness between 23 and 25 mag per square arcsecond, -10 < supergalactic latitude (sgb) < 10, and declination less than zero.
Q3: Find all galaxies brighter than magnitude 22, where the local extinction is > 0.75.
Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity > 0.5, and with the major axis of the ellipse having a declination of between 30" and 60" arc seconds.
Q5: Find all galaxies with a de Vaucouleurs profile (r^(1/4) falloff of intensity on the disk) and photometric colors consistent with an elliptical galaxy.
Q6: Find galaxies that are blended with a star, and output the deblended galaxy magnitudes.
Q7: Provide a list of star-like objects that are 1% rare.
Q8: Find all objects with unclassified spectra.
Q9: Find quasars with a line width > 2000 km/s and 2.5 < redshift < 2.7.
Q10: Find galaxies with spectra that have an equivalent width in Hα > 40 Å (Hα is the main hydrogen spectral line).
Q11: Find all elliptical galaxies with spectra that have an anomalous emission line.
Q12: Create a gridded count of galaxies with u-g > 1 and r < 21.5 over 60 < declination < 70 and 200 < right ascension < 210, on a grid of 2', and create a map of masks over the same grid.
Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u - 0.5g - 0.2i < 1.25 && r < 21.75; output it in a form adequate for visualization.
Q14: Find stars with multiple measurements that have magnitude variations > 0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes.
Q15: Provide a list of moving objects consistent with an asteroid.
Q16: Find all objects similar to the colors of a quasar at 5.5 < redshift < 6.5.
Q17: Find binary stars where at least one of them has the colors of a white dwarf.
Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is, where the colors u-g, g-r, and r-i each differ by less than 0.05 mag.
Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies.
Q20: For each galaxy in the BCG (brightest cluster galaxy) data set, in 160 < right ascension < 170 and -25 < declination < 35, count the galaxies within 30" of it that have a photoz within 0.05 of that galaxy.
From talk by Jim Gray (2001)
Scientists were asked for example scientific queries, so the database could be optimized.
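To give a flavor of what one of these looks like in practice, here is a minimal sketch of Q9 as a SkyServer-style SQL string (wrapped in Python). The table and column names are illustrative assumptions rather than the exact SDSS schema; in particular, lineWidth is a hypothetical column.

```python
# A minimal sketch of a "20 Queries"-style request (Q9: quasars with
# line width > 2000 km/s and 2.5 < redshift < 2.7) phrased as SQL.
# Table and column names are illustrative assumptions, not the exact
# SDSS schema; lineWidth in particular is hypothetical.

QUERY_Q9 = """
SELECT specObjID, ra, dec, z
FROM SpecObj
WHERE class = 'QSO'          -- spectroscopically classified as a quasar
  AND z BETWEEN 2.5 AND 2.7  -- redshift cut from Q9
  AND lineWidth > 2000       -- hypothetical line-width column, km/s
"""

if __name__ == "__main__":
    print(QUERY_Q9)
```

Queries like this are why the database design mattered: each of the 20 queries stresses a different access pattern (cone searches, color cuts, spatial joins), and the indexing had to serve all of them.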
The sky survey “Navigate” tool lets you browse through the images.
Over a billion hits to the SDSS site, leveling off at 150 million per year. Over 2,000,000 SQL queries per month on the database.
Computational Science
- Traditional Empirical Science
– Scientist gathers data by direct observation
– Scientist analyzes data
- Computational Science
– Data captured by instruments, or data generated by simulator
– Processed by software
– Placed in a database
– Scientist analyzes database
From talk by Jim Gray 10/10/2001
What’s needed?
(not drawn to scale)
- Science data & questions: Scientists
- Database to store data and execute queries: Plumbers
- Data mining algorithms: Miners
- Question & answer visualization: Tools
Slide from talk by Jim Gray 4/10/2002
Astronomy Information Age
- Astronomical data is processed without anyone looking at the individual images/spectra. Astronomers used to classify galaxies by eye; sometimes a graduate student would classify thousands of galaxies from a computer screen. At three per minute, this might take hours, days, or even weeks. The SDSS found 10^8 galaxies; at three per minute (over 33 million minutes), classification would take 63 years at 24 hours per day, seven days per week. The “Galaxy Zoo” project allows private citizens to look at data by eye and contribute classifications to scientists.
- More data is obtained than anyone can analyze alone (drinking from a fire hose). Projects like the SDSS SkyServer, the Virtual Observatory, Google Sky, and WikiSky are all aimed at letting people better access the data from SDSS.
- New surveys, including Pan-STARRS, LSST, Guo Shou Jing (LAMOST), DES, RAVE, SEGUE, HERMES, and WFMOS, are planned or in progress, patterned on the success of the Sloan Digital Sky Survey.
The spheroid density model: ρ(r) ∝ r^(-3.5), where r² = x² + y² + (z/q)² and q is the halo flattening.
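A minimal sketch of this density law in code; the flattening q = 0.7 and the normalization rho0 are illustrative values, not fitted results.

```python
import math

def squashed_halo_density(x, y, z, q=0.7, alpha=3.5, rho0=1.0):
    """Power-law spheroid: rho ~ r^-alpha with flattening q,
    where r^2 = x^2 + y^2 + (z/q)^2 (Galactocentric coordinates).
    q = 0.7 and rho0 are illustrative values, not fitted results."""
    r = math.sqrt(x**2 + y**2 + (z / q) ** 2)
    return rho0 * r ** (-alpha)

# Example: density falls by 2^3.5 (about 11.3x) when r doubles.
print(squashed_halo_density(8.0, 0.0, 0.0) / squashed_halo_density(16.0, 0.0, 0.0))
```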
The SDSS survey was funded as an extragalactic project, but Galactic stars could not be completely avoided.
Statistical Photometric Parallax
The use of statistical knowledge of the absolute magnitudes of stellar populations to determine the density distributions of stars.
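In miniature, the idea looks like the sketch below: the distance modulus turns each apparent magnitude into a distance, and the known scatter of absolute magnitudes within a population is what makes the inference statistical rather than star-by-star. The value M_g = 4.2 for F turnoff stars is illustrative.

```python
# Statistical photometric parallax in miniature: given the absolute
# magnitude distribution of a tracer population, each apparent magnitude
# implies a distance via the distance modulus m - M = 5*log10(d_pc) - 5.
# M_g = 4.2 for F turnoff stars is an illustrative value; the real
# population has scatter, which the statistical method accounts for.

def distance_kpc(apparent_mag, absolute_mag=4.2):
    """Distance implied by the distance modulus, in kiloparsecs."""
    d_pc = 10 ** ((apparent_mag - absolute_mag + 5.0) / 5.0)
    return d_pc / 1000.0

print(distance_kpc(20.0))  # a g=20 F turnoff star lies at ~14.5 kpc
```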
[Figure (Newberg et al. 2002): labeled substructures include the Vivas overdensity (Virgo Stellar Stream); the Sagittarius dwarf tidal stream; the smooth stellar spheroid; and the Monoceros stream, also called the Stream in the Galactic Plane, Galactic Anticenter Stellar Stream, Canis Major Stream, or Argo Navis Stream.]
[Figure: model halo shapes (squashed halo, spherical halo, prolate halo, each with an exponential disk). Credit: Kathryn Johnston, David Law.]
A map of stars in the outer regions of the Milky Way Galaxy, derived from the SDSS images of the northern sky, shown in a Mercator-like projection. The color indicates the distance of the stars, while the intensity indicates the density of stars on the sky. Structures visible in this map include streams of stars torn from the Sagittarius dwarf galaxy, a smaller 'orphan' stream crossing the Sagittarius streams, the 'Monoceros Ring' that encircles the Milky Way disk, trails of stars being stripped from the globular cluster Palomar 5, and excesses of stars found towards the constellations Virgo and Hercules. Circles enclose new Milky Way companions discovered by the SDSS; two of these are faint globular star clusters, while the others are faint dwarf galaxies.
Credit: V. Belokurov and the Sloan Digital Sky Survey.
Why is this important?
- Small dwarf galaxies are merging with the Milky Way at the present time.
- The Milky Way itself was created by a long history of merging smaller galaxies to make larger ones.
- The tidal streams are an archeological record of the merger history that created our galaxy.
- The tidal streams encode the gravitational potential through which the dwarf galaxy traveled, and can therefore tell us about the distribution of dark matter in the Milky Way.
Fitting model parameters
Previous astronomers fit 3 parameters to the entire stellar halo. We want to fit 20 parameters to each of eighteen 2.5-degree-wide stripes, for 360 parameters in all. The time to compute the likelihood increases with the number of stars and with the required accuracy of the calculation. At four hours per evaluation, 50 likelihood calculations per iteration of a conjugate gradient descent method, and 50 iterations, 10,000 hours are required to optimize one stripe. This would take more than 400 days on a single processor.
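The arithmetic behind those numbers, as a quick sanity check:

```python
# Back-of-envelope check of the optimization cost quoted above.
hours_per_likelihood = 4   # one likelihood evaluation on one CPU
evals_per_iteration = 50   # likelihood calls per conjugate-gradient step
iterations = 50            # gradient-descent iterations per stripe

hours_per_stripe = hours_per_likelihood * evals_per_iteration * iterations
print(hours_per_stripe)        # 10,000 hours
print(hours_per_stripe / 24)   # ~417 days on a single processor
```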
Began: November 9, 2007
Computing power: 0.5 PetaFLOPS (peak over 2 PetaFLOPS)
Volunteers (total people): 146,863
Computers volunteered (total): 291,944
Active volunteers: 25,670
Active computers being volunteered: 35,686
Number of volunteers as of 10/4/2012
206 countries
(of which 193 are UN members)
Volunteer Computing with MilkyWay@home
150,000 volunteers:
- Let us use their CPUs for scientific calculations
- Continuously upgrade their hardware
- Populate extensive forum discussions on science, technical support, and, well, anything
- Monitor the health of our system (especially our volunteer moderator)
- Wrote the first GPU version of our software
- Donate money and hardware
Volunteer Computing with MilkyWay@home
150,000 volunteers also:
- Compete with each other for BOINC “credits”
- Become angry if another person or team is getting an unfair number of credits
- Return garbage results (which require zero computation) so they can earn credit faster
- Insult each other on public forum boards
- Link anti-Semitic websites to ours
Astronomy students write algorithms
MilkyWay@home server sends out jobs to volunteers and collects results
Algorithms are adapted to run in an asynchronous, heterogeneous, parallel computing environment. The code is compiled and tested on 16 platforms, including CPUs and GPUs, and attached to the server. Mechanisms are created to start and end “runs.” The MySQL database is maintained.
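As a toy illustration of the dispatch-and-collect pattern described above (the real server is built on BOINC with a MySQL database; every name below is invented for illustration):

```python
import queue

# Toy sketch of the MilkyWay@home dispatch/collect pattern. The real
# server is built on BOINC; all names here are invented for illustration.

work_units = queue.Queue()
results = []

def enqueue_run(parameter_sets):
    """A 'run' is a batch of model parameter sets to evaluate."""
    for params in parameter_sets:
        work_units.put(params)

def volunteer_fetch_and_compute(likelihood):
    """Each (asynchronous, heterogeneous) volunteer grabs a job,
    computes a likelihood, and reports the result back."""
    params = work_units.get()
    results.append((params, likelihood(params)))

enqueue_run([{"q": 0.6}, {"q": 0.7}, {"q": 0.8}])
while not work_units.empty():
    volunteer_fetch_and_compute(lambda p: -(p["q"] - 0.7) ** 2)
print(max(results, key=lambda r: r[1]))  # best parameters so far
```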
This was originally accomplished with a $750,000 grant shared between astronomy and computer science faculty. But there is no model for maintaining it, since it is no longer an interesting computer science problem, and it is very expensive for an individual astronomy grant. We need lighter tools.
Data from one stripe: Stream 1 (6 parameters) + Stream 2 (6 parameters) + Stream 3 (6 parameters) + Smooth spheroid (3 parameters).
We can fit 20 parameters to each 2.5-degree-wide stripe of data. We recently analyzed 18 stripes of data from DR7 (300-400 parameters).
Law & Majewski (2010) Newby et al., submitted
We can compare the position of the stream in the sky (left) with n-body simulations of Sgr dwarf galaxy disruption (right). The stream positions in the left panel are calculated for each 2.5-degree-wide stripe.
1.9 million F turnoff stars; 160,000 stars with Sgr density; 1.7 million non-Sgr stars. Polar plots of SDSS F turnoff stars in the north Galactic cap (top). Using our density model, we place each star in either the Sgr panel (lower left) or the non-Sgr panel (lower right), with the probability given by the model. The stars in the Sgr panel are not guaranteed to be from the stream, but they collectively have the spatial properties of the Sgr stream.
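A sketch of that probabilistic separation, with stand-in density functions; in the real analysis, the stream and smooth-spheroid densities come from the fitted model.

```python
import random

# Sketch of the probabilistic separation described above: each star is
# assigned to the Sgr panel with probability equal to the stream's share
# of the total model density at the star's position. The density
# functions here are stand-ins for the fitted model components.

def separate(stars, stream_density, smooth_density, seed=42):
    rng = random.Random(seed)
    sgr, non_sgr = [], []
    for star in stars:
        p_stream = stream_density(star) / (stream_density(star) + smooth_density(star))
        (sgr if rng.random() < p_stream else non_sgr).append(star)
    return sgr, non_sgr

# With toy constant densities, roughly 1/11 of stars land in the Sgr panel.
stars = list(range(100000))
sgr, non_sgr = separate(stars, lambda s: 1.0, lambda s: 10.0)
print(len(sgr), len(non_sgr))
```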
Determining the total mass, lumpiness, and flattening of the Galaxy’s dark matter halo
We now want to fit parameters of the Milky Way galaxy and of the dwarf galaxies that fell in, by running n-body simulations of the merging and comparing them to the density parameters we measured in the data.
(1) We would like to fit N-body simulations (100,000 particles in the dwarf) instead of orbits (1 particle).
(2) We would like to fit multiple streams at the same time.
(3) We would like to fit distances, velocities, positions, and densities of the streams, and simultaneously fit measurements of the Milky Way’s rotation curve.
(4) We need to consider internal properties of the dwarfs.
Since modeling one dwarf requires ~30 minutes on a CPU, this requires substantial computational power. But then, we have MilkyWay@home.
Sample 100,000-particle (sub-sampled above) semi-analytic N-body simulations of the tidal disruption of the Orphan Stream. We fit only the Plummer sphere parameters for the dwarf galaxy. Right now we have a version of the Barnes and Hut (1986) code that works across CPU platforms on MilkyWay@home with checkpointing, and we hope it will be running on GPUs sometime within the coming year.
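For reference, the Plummer sphere being fit has just two parameters, a total mass M and a scale radius a, which is what makes it an attractive first model for the disrupting dwarf. Below is a sketch of the standard Plummer density and potential; the mass and scale-radius values are illustrative, not fitted results.

```python
import math

# Standard Plummer sphere forms; M and a values are illustrative only.
# Units: M in solar masses, a and r in kpc, G in kpc (km/s)^2 / Msun.

def plummer_density(r, M=1e7, a=0.2):
    """rho(r) = (3M / 4 pi a^3) * (1 + r^2/a^2)^(-5/2)"""
    return (3.0 * M / (4.0 * math.pi * a**3)) * (1.0 + (r / a) ** 2) ** -2.5

def plummer_potential(r, M=1e7, a=0.2, G=4.30091e-6):
    """phi(r) = -G M / sqrt(r^2 + a^2)"""
    return -G * M / math.sqrt(r**2 + a**2)

print(plummer_density(0.0), plummer_potential(0.0))
```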