SLIDE 1

PRACTICAL ANALYTICS

Tamás Budavári / The Johns Hopkins University

7/19/2012

SLIDE 2

Statistics

7/19/2012 ISSAC at HIPACC

• Of numbers
• Of vectors
• Of functions
• Of trees

SLIDE 3

Statistics

• Description, modeling, inference, machine learning
• Bayesian / Frequentist / Pragmatist?

              Supervised        Unsupervised
Discrete      Classification    Clustering
Continuous    Regression        Dimensional Reduction

SLIDE 4

What’s Large?

• VOLUME
  • Say >100 TB today, but tomorrow? A moving target…
• COMPLEXITY
  • The raw datasets are simple, unlike their derivatives
• DEFINITION?
  • Large when you cannot apply the “usual” tools

SLIDE 5

LARGE Data!!

SLIDE 7

Large?

• Sample size

SLIDE 9

Large?

• Dimensions
  • Ratio of surface/volume grows
  • All points are lonely in high dimensions
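
The surface/volume remark can be made concrete with a few lines of arithmetic. The sketch below is an illustration only: it assumes a ball of unit radius, and the function names `shell_fraction` and `surface_to_volume` are hypothetical helpers introduced here, not part of the talk. It shows that the fraction of the volume lying in a thin outer shell approaches 1 as the dimension grows, which is one way to see why points end up far from each other.

```python
def shell_fraction(dim, eps):
    """Fraction of a dim-dimensional ball's volume within a relative
    distance eps of its surface: 1 - (1 - eps)**dim."""
    return 1.0 - (1.0 - eps) ** dim


def surface_to_volume(dim, radius=1.0):
    """Surface-to-volume ratio of a dim-dimensional ball: dim / radius."""
    return dim / radius


if __name__ == "__main__":
    # Print how quickly the outer 10% shell dominates the volume.
    for dim in (1, 2, 3, 10, 100, 1000):
        print(dim, surface_to_volume(dim), shell_fraction(dim, 0.1))
```

In three dimensions the outer 10% shell holds about 27% of the volume; by a few dozen dimensions it holds nearly all of it.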

SLIDE 11

Keeping Up?

• Image processing
• Catalog extraction
  • O(n)
• What is difficult?
  • O(n log n)
  • O(n²), …

Worse with Moore’s law

SLIDE 12

Fundamental Challenges

• Cross-identification of sources
  • To assemble multicolor catalogs
• Drop-outs from sky coverage
  • To constrain fluxes not detected
• Constraining physical properties
  • To interpret the data

SLIDE 13

Cross-Identification

From long-tail science to the largest experiments

SLIDE 14

Recording Observations

• Astronomers drew it…
• Now kids do it on SkyServer

#1 by Haley

SLIDE 15

Multicolor Universe

SLIDE 16

Eventful Universe

SLIDE 17

Cross-Identification

One of the most fundamental analysis steps

SLIDE 18

What is the Right Question?

• Cross-identification is a hard problem
  • Computationally, scientifically & statistically
  • Need symmetric n-way solution
  • Need reliable quality measure
• Same or not?
  • Distance threshold? Maximum likelihood?

SLIDE 19

Tabletop Astronomy

• Imagine the observed sky has only 6 pixels
  One object: one die
  Observing: rolling a die
  Locality: die is loaded
  Sky: a bag of dice

SLIDE 20

Model Comparison: Same or Not?

• Crossmatch: draw two dice with replacement
  • Same or not?
• Bayes factor is the ratio of the
  • Likelihood of “Same”
  • Likelihood of “Not”
• Likelihood of a hypothesis?
  • Sum over all possibilities

SLIDE 22

Model Comparison: Same or Not?

• Model for loaded dice is a matrix of probabilities
  • E.g., loaded toward l = 1
  • Etc. for l = 2…6
• 2-way case
  • Same:
  • Not:
• n-way: same

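
The dice picture can be turned into a small calculation. The slide equations did not survive extraction, so the following is a sketch under stated assumptions rather than the talk's own formulas: the bag holds equal numbers of six die types, a die "loaded toward l" shows face l with probability `heavy` and every other face with equal probability, and the function names are hypothetical. "Same" sums over which single die produced all rolls; "Not" treats each roll as coming from an independently drawn die.

```python
def loaded_die(l, heavy=0.5, faces=6):
    """Face probabilities of a die loaded toward face l (1-based).
    Assumed model: face l has probability `heavy`, the rest share 1 - heavy."""
    rest = (1.0 - heavy) / (faces - 1)
    return [heavy if x == l else rest for x in range(1, faces + 1)]


def bayes_factor(rolls, heavy=0.5, faces=6):
    """Likelihood of 'Same' divided by likelihood of 'Not' for observed rolls."""
    dice = [loaded_die(l, heavy, faces) for l in range(1, faces + 1)]
    prior = 1.0 / faces  # assumed uniform mix of die types in the bag

    # Same: one die made every roll; sum over which die it was.
    same = 0.0
    for die in dice:
        prod = 1.0
        for x in rolls:
            prod *= die[x - 1]
        same += prior * prod

    # Not: every roll comes from its own independently drawn die.
    not_same = 1.0
    for x in rolls:
        not_same *= sum(prior * die[x - 1] for die in dice)

    return same / not_same


if __name__ == "__main__":
    print(bayes_factor([3, 3]))  # two rolls landing on the same face
    print(bayes_factor([3, 4]))  # two rolls landing on different faces
```

Because `rolls` is a list of any length, the same function covers the n-way case. With fair dice (`heavy=1/6`) the Bayes factor is 1 for any rolls, since the observations then carry no information about identity.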
SLIDE 25

Celestial Sphere

• Continuous functions
• General formalism
• Accuracy is a density function on the sky

SLIDE 26

Modeling the Astrometry

• Astrometric precision: a simple function
• Where on the sky? Anywhere really…

SLIDE 30

Same or Not?

• The Bayes factor
  • H: all observations are of the same object at m
  • K: the observations might be from separate objects at {m_i}

[Figure labels: SAME, OR, NOT; On the sky, Astrometry]

SLIDE 32

Analytic Results

• Normal distribution
  • Flat and spherical
  • Gauss and Fisher
• 2-way results

SLIDE 33

Normal Distribution

• Astrometric precision:
• Fisher distribution:
• Analytic results:
• For high accuracies:

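
The equations on this slide were lost in extraction. As a hedged illustration, the sketch below assumes the two-way, high-accuracy form B = 2 / (σ₁² + σ₂²) · exp(−ψ² / (2(σ₁² + σ₂²))), with the separation ψ and the astrometric uncertainties σ₁, σ₂ in radians, as reported in the Budavári & Szalay (2008) paper the talk cites. The function name and the arcsecond interface are choices made here for convenience.

```python
import math

ARCSEC = math.pi / (180.0 * 3600.0)  # one arcsecond in radians


def bayes_factor_2way(psi_arcsec, sigma1_arcsec, sigma2_arcsec):
    """Two-way Bayes factor in the high-accuracy limit (assumed form).

    psi_arcsec: angular separation of the two detections, in arcseconds.
    sigma*_arcsec: astrometric uncertainty of each catalog, in arcseconds.
    """
    psi = psi_arcsec * ARCSEC
    s2 = (sigma1_arcsec * ARCSEC) ** 2 + (sigma2_arcsec * ARCSEC) ** 2
    return 2.0 / s2 * math.exp(-psi * psi / (2.0 * s2))


if __name__ == "__main__":
    for sep in (0.0, 0.2, 0.5, 1.0, 5.0):
        print(sep, bayes_factor_2way(sep, 0.1, 0.1))
```

Values far above 1 favor "same", values far below 1 favor "not". The huge values at small separations reflect how unlikely a chance alignment is on the whole sky.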
SLIDE 34

Wikipedia: Interpretation

SLIDE 35

Probability of a Match

Same or not?

SLIDE 36

From Priors to Posteriors

• Bayes factor is the connection

SLIDE 37

From Priors to Posteriors

• Posterior probability from prior & Bayes factor
• Prior probability of a match
  • Like dice in a bag: 1/N and N^(1-n)
  • In general?

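
The link between prior, Bayes factor and posterior follows from Bayes' rule for two complementary hypotheses. The sketch below assumes that standard form, P(H|D) = 1 / (1 + (1 − P(H)) / (B · P(H))), and uses the dice-in-a-bag prior N^(1-n) for n draws from a bag of N dice. Function names are hypothetical.

```python
def prior_same(n_objects, n_way=2):
    """Prior that n_way random draws (with replacement) from a bag of
    n_objects all picked the same object: N**(1 - n)."""
    return float(n_objects) ** (1 - n_way)


def posterior(bayes_factor, prior):
    """Posterior probability of 'same' from the prior and the Bayes factor."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if bayes_factor <= 0.0:
        return 0.0
    return 1.0 / (1.0 + (1.0 - prior) / (bayes_factor * prior))


if __name__ == "__main__":
    p0 = prior_same(1_000_000)  # a million objects, two-way match
    for b in (1.0, 1e3, 1e6, 1e9):
        print(b, posterior(b, p0))
```

A Bayes factor of 1 returns the prior unchanged; the evidence must outweigh the small prior before the posterior approaches 1.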
SLIDE 38

From Priors to Posteriors

• Different selections
  • Nearby / Distant
  • Red / Blue
• But only 1 number

SLIDE 39

Self-Consistent Estimates

• Prior has an unknown fudge factor
  • Educated guess
  • Or solve for it:

TB & Szalay (2008)

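
One way to "solve for it" is a fixed-point iteration. The sketch below is an assumption-laden illustration, not the published procedure verbatim: it takes the prior to be N★ / (N1 · N2), where N★ is the unknown number of objects common to both catalogs, and requires that N★ equal the sum of the posteriors over all candidate pairs. The function name and the toy Bayes factors are made up for the example.

```python
def self_consistent_overlap(bayes_factors, n1, n2, tol=1e-9, max_iter=1000):
    """Iterate N_star = sum of posteriors, with prior = N_star / (n1 * n2).

    bayes_factors: Bayes factor of every candidate pair.
    Returns (n_star, posteriors).
    """
    n_star = float(min(n1, n2))  # starting guess: the largest possible overlap
    posteriors = []
    for _ in range(max_iter):
        prior = n_star / (n1 * n2)
        posteriors = [
            0.0 if b <= 0.0 else 1.0 / (1.0 + (1.0 - prior) / (b * prior))
            for b in bayes_factors
        ]
        updated = sum(posteriors)
        if abs(updated - n_star) <= tol:
            return updated, posteriors
        n_star = updated
    return n_star, posteriors


if __name__ == "__main__":
    # Toy input: 5 convincing pairs and 20 chance alignments.
    factors = [1e9] * 5 + [1e-3] * 20
    n_star, post = self_consistent_overlap(factors, 1000, 1000)
    print(n_star)
```

In the toy input the estimate settles just below 5, the number of convincing pairs, because the chance alignments contribute almost nothing to the sum.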
SLIDE 40

Simulations

• Mock objects
  • With correct clustering
  • U01 values as properties
• Simulated sources
  • Subsets: N1, N2
  • Overlap: N★

SLIDE 42

Simulations

• Quality
• Multiple matches

Explained by simple model

  • f point sources!

Heinis, TB, Szalay (2009)

SLIDE 44

Proper Motion

• Same hypotheses but different parameters
• Just need a prior to integrate

Kerekes, TB+ (2010)
Sources from SDSS

SLIDE 45

Matching Events

[Figure labels: (1), (2), (x)]

• Streams of events in time and space
  • E.g., thresholded peaks in signal-to-noise

SLIDE 46

Dropouts from Sky Coverage

SLIDE 47

Drawing with Equations

[Figure labels: r = 0.6, r = 0.5]

TB, Szalay & Fekete (2010)

SLIDE 48

Matching in Practice

SLIDE 49

Open SkyQuery

• Following our 1st prototype
  • Successful
  • Not Bayesian
  • Limitations

SLIDE 50

SkyQuery – The 3rd Generation

• Dynamic federation of astronomy databases
  • Query the collection as if it were one database
• The 3rd-generation tool is coming this summer
  • Cluster of machines running partitioned jobs
  • Proper probabilistic execution with variable errors

SLIDE 54

SkyQuery

• Almost pure standard SQL
• Added XMATCH
  • Verifiable
  • Flexible

SLIDE 56

HST Crossmatch Catalog

• SQL pipeline
• Astrometric correction
• Subpixel precision

TB & Lubow (2012)

RELEASE AT AAS

SLIDE 57

HST Crossmatch Catalog

• FoF groups
  • Possible chains
• Bayesian model selection
  • Chainbreaker

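
Friends-of-friends (FoF) grouping and its chaining problem can be shown on a toy example. The sketch below is a generic union-find implementation on flat 2-D coordinates with a hypothetical linking length; it is not the catalog's SQL pipeline, and the brute-force pair loop is only suitable for small inputs.

```python
import math


def friends_of_friends(points, link):
    """Group points so that any two closer than `link` share a group,
    directly or through a chain of intermediate friends.
    Returns a sorted list of groups, each a list of point indices."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= link:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.9, 0.0), (1.8, 0.0), (10.0, 10.0)]
    print(friends_of_friends(pts, link=1.0))
```

In the example, points 0 and 2 are farther apart than the linking length yet land in the same group through point 1. Such chains are the cases a later model-selection step would have to break up.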
SLIDE 58

HST Crossmatch Catalog

• Lots of matching sources during HST’s long life

TB & Lubow (2012)

SLIDE 60

Zone Algorithm

• Constant declination zones
  • Sort by R.A. within
• Fast SQL code
  • SDSS-GALEX in 1 hour
  • CPU limited!

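
The zone idea can be sketched outside SQL. The Python below is a simplified illustration under several assumptions: coordinates are (R.A., Dec.) in degrees, the zone height equals the search radius, the R.A. window uses the approximation radius / cos(|Dec.| + radius), and R.A. wraparound at 0°/360° and the poles are ignored. All function names are hypothetical.

```python
import bisect
import math
from collections import defaultdict


def zone_id(dec, height):
    """Index of the constant-declination zone containing dec."""
    return int(math.floor((dec + 90.0) / height))


def build_zones(catalog, height):
    """Bucket (ra, dec) rows by zone and sort each bucket by R.A."""
    zones = defaultdict(list)
    for idx, (ra, dec) in enumerate(catalog):
        zones[zone_id(dec, height)].append((ra, dec, idx))
    for rows in zones.values():
        rows.sort()
    return zones


def separation(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((d2 - d1) / 2.0) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def zone_match(cat_a, cat_b, radius):
    """Index pairs (i, j) with cat_a[i] within radius of cat_b[j]."""
    zones = build_zones(cat_b, radius)
    pairs = []
    for i, (ra, dec) in enumerate(cat_a):
        # Widen the R.A. window away from the equator (approximation).
        cos_dec = math.cos(math.radians(min(abs(dec) + radius, 89.9)))
        dra = radius / cos_dec
        z = zone_id(dec, radius)
        for zz in (z - 1, z, z + 1):  # only neighboring zones can match
            rows = zones.get(zz)
            if not rows:
                continue
            lo = bisect.bisect_left(rows, (ra - dra,))
            hi = bisect.bisect_right(rows, (ra + dra, float("inf")))
            for rb, db, j in rows[lo:hi]:
                if separation(ra, dec, rb, db) <= radius:
                    pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    a = [(10.0, 20.0), (200.0, -5.0)]
    b = [(10.0003, 20.0002), (10.5, 20.0), (200.0, -5.0004), (50.0, 50.0)]
    print(zone_match(a, b, radius=5.0 / 3600.0))
```

Sorting by R.A. inside each zone lets a binary search replace the all-pairs comparison, which is the same role an index on (zone, R.A.) would play in a database.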
SLIDE 61

Parallel on GPUs

• Recent GitHub release
  • Multi-GPU implementation
  • Search within 5″: great performance!
• NVIDIA GTX 480 1.5GB
  • 29M×29M in 11 seconds
• C2050 Teslas
  • 400M×150M in 3 minutes