Commentary on Techniques for Massive-Data Machine Learning in Astronomy

SLIDE 1

Commentary on “Techniques for Massive-Data Machine Learning in Astronomy”

Nick Ball Herzberg Institute of Astrophysics Victoria, Canada

SCMA V, Penn State, Jun 14th 2011 1 of 24

SLIDE 2

The Problem

Astronomy faces enormous datasets
Their size, dimensionality, and complexity require intelligent, automated investigation
Exponential increase in data size: algorithms cannot scale worse than O(N log N)
Most data mining algorithms naïvely scale as N² or worse
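
To make the scaling argument concrete, the sketch below compares rough operation counts for N² and N log N growth. The sample sizes are illustrative assumptions, not figures from the talk, and constant factors are ignored.

```python
import math

def operation_counts(n):
    """Return rough operation counts for N^2 and N log2 N scaling."""
    quadratic = n * n
    loglinear = n * math.log2(n)
    return quadratic, loglinear

# Illustrative dataset sizes (assumed, not taken from the slides).
for n in (10**4, 10**6, 10**8):
    quadratic, loglinear = operation_counts(n)
    print(f"N = {n:.0e}: N^2 = {quadratic:.1e}, "
          f"N log2 N = {loglinear:.1e}, "
          f"ratio = {quadratic / loglinear:.1e}")
```

The ratio is N / log2 N, so at N = 10^8 the quadratic count is larger by a factor of a few million, which is why a naïve algorithm becomes impractical long before a log-linear one does.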

SLIDE 3

The Solution

Make data mining algorithms that scale as N log N (or better)!
May have to compromise accuracy slightly
Deploy them so that astronomers are willing and able to use them

They must work on real astronomical data

SLIDE 4

Collaboration is Vital

Successful use of astrostatistics and data mining requires expertise in computer science, statistics, and astronomy
Collaboration enables novelty that would not arise from a single group
So, computer scientists supplying algorithms in this way is excellent

SLIDE 5

But

... expertise in computer science, statistics, and astronomy
Successful collaborations have involved astronomers who are experts in computing/statistics, or who are working closely and over time with these experts

SLIDE 6

And

Astronomy data are messy:
Large, complex, increasingly high-dimensional, time-domain

Missing data: non-observation or non-detection
Heteroscedastic, non-Gaussian, underestimated errors
Outliers, artifacts, false detections, systematic effects
Correlated inputs
Etc.

SLIDE 7

How do you apply astrostatistics and fast algorithms to this?

An Example

SLIDE 8

SLIDE 9

The Next Generation Virgo Cluster Survey

10σ point source limiting magnitude g = 25.7 (faint!)
Photometric (few spectra), ~100 deg², 5 bands (ugriz, like Sloan)
10⁷+ galaxies, 2.6 terabytes data
40 people at 23 institutions in Canada, France, etc. (PI Laura Ferrarese @ HIA)
2009-2012

SLIDE 10

Virgo is an actual cluster of galaxies, the nearest large one to us

SLIDE 11

NGVS Statistical Challenges

Object detection and classification
Photometric redshifts (photo-z)
Virgo cluster membership / background
Missing data
Field-to-field variation
Multi-wavelength data
Completeness (mag, SB, etc.)

SLIDE 12

Object detection: low surface brightness galaxies

SLIDE 13

Cluster membership: photometric redshift using k nearest neighbours
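
As an illustration of the idea on this slide, here is a minimal sketch of k-nearest-neighbour photometric redshift estimation with a kd-tree, in pure Python. It is not the NGVS pipeline: the training set is synthetic, the linear colour-redshift relation is an assumption made only so the example runs, and the function names are hypothetical.

```python
import heapq
import random

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree from (features, redshift) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_search(node, target, k, heap=None):
    """Collect the k nearest training points in a max-heap keyed on -distance."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    features, redshift = node["point"]
    d = sq_dist(features, target)
    if len(heap) < k:
        heapq.heappush(heap, (-d, redshift))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, redshift))
    diff = target[node["axis"]] - features[node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    knn_search(near, target, k, heap)
    # Only cross the splitting plane if it could hide a closer neighbour.
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn_search(far, target, k, heap)
    return heap

def knn_photo_z(tree, colours, k=5):
    """Photo-z estimate: mean redshift of the k nearest training objects."""
    neighbours = knn_search(tree, colours, k)
    return sum(z for _, z in neighbours) / len(neighbours)

random.seed(0)
# Synthetic training set: four colours that vary linearly with redshift (assumed).
train = []
for _ in range(2000):
    z = random.uniform(0.0, 1.0)
    colours = tuple(z * slope + random.gauss(0.0, 0.02)
                    for slope in (1.0, 0.8, 0.5, 0.3))
    train.append((colours, z))

tree = build_kdtree(train)
query = (0.5, 0.4, 0.25, 0.15)  # colours of a hypothetical galaxy near z = 0.5
print(knn_photo_z(tree, query, k=5))
```

Building the tree costs roughly N log N, and each query visits only a small part of it when the dimensionality is low, which is what keeps the method within the scaling budget set out earlier in the talk.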

SLIDE 14

Missing data: NGVS fields (not final) don’t all contain all 5 bands ugriz

SLIDE 15

Multi-wavelength data

SLIDE 16

Canadian Astronomy Data Centre

CADC is one of the world’s largest astronomy data centres
~500 terabytes of data (will grow to petabytes)
Uses Virtual Observatory standards
Staffed by astronomers and computer specialists, but not statisticians

SLIDE 17

CANFAR

Canadian Advanced Network for Astronomical Research, at CADC
Combines cluster job scheduling with cloud computing resources

Users manage their own virtual machines

SLIDE 18

So

Put fast data mining tools on the CANFAR infrastructure

... but early days, not much to say yet

SLIDE 19

Guide to Data Mining in Astronomy

Virtual Observatory KDD-IG guide: http://www.ivoa.net/cgi-bin/twiki/bin/view/IVOA/IvoaKDDguide
Emphasizes data mining, which is part of astroinformatics
But this overlaps with astrostatistics
-> potential outreach channel to wider community

SLIDE 20

kNN Quasar Photometric Redshifts

Use kd-tree for fast kNN assignment of photo-zs to Sloan Digital Sky Survey quasars
Single neighbour, perturb input features to make a PDF in redshift
Removing multi-peaked PDFs removes almost all catastrophic outliers
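
The perturbation approach above can be sketched as follows. This is a toy illustration under stated assumptions, not the method as implemented for the SDSS quasars: the nearest neighbour is found by brute force rather than with a kd-tree (for brevity), the perturbations are assumed Gaussian, a "peak" is assumed to mean a contiguous run of histogram bins above a threshold, and the training set is synthetic with a deliberately degenerate feature-redshift relation. All function names are hypothetical.

```python
import random

def nearest_redshift(train, features):
    """Brute-force single nearest neighbour (a kd-tree would replace this at scale)."""
    best = min(train,
               key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], features)))
    return best[1]

def photo_z_pdf(train, features, errors, n_draws=200, n_bins=20,
                z_max=1.0, rng=random):
    """Histogram the nearest-neighbour redshifts of Gaussian-perturbed inputs."""
    counts = [0] * n_bins
    for _ in range(n_draws):
        perturbed = [rng.gauss(f, e) for f, e in zip(features, errors)]
        z = nearest_redshift(train, perturbed)
        counts[min(int(z / z_max * n_bins), n_bins - 1)] += 1
    return [c / n_draws for c in counts]

def count_peaks(pdf, threshold=0.05):
    """Count contiguous runs of bins above the threshold (assumed peak definition)."""
    peaks, inside = 0, False
    for p in pdf:
        if p > threshold and not inside:
            peaks += 1
        inside = p > threshold
    return peaks

rng = random.Random(1)
# Synthetic training set: the single feature |z - 0.5| cannot tell z from 1 - z.
train = []
for _ in range(1000):
    z = rng.uniform(0.0, 1.0)
    train.append(((abs(z - 0.5),), z))

# A feature of 0.3 matches both z ~ 0.2 and z ~ 0.8; a feature of 0.0 matches only z ~ 0.5.
pdf_degenerate = photo_z_pdf(train, (0.3,), (0.01,), rng=rng)
pdf_single = photo_z_pdf(train, (0.0,), (0.01,), rng=rng)

for name, pdf in (("degenerate", pdf_degenerate), ("single", pdf_single)):
    verdict = "keep" if count_peaks(pdf) == 1 else "reject as multi-peaked"
    print(name, count_peaks(pdf), verdict)
```

Rejecting objects whose PDF has more than one peak discards exactly those cases where the input features are compatible with two well-separated redshifts, which is where catastrophic photo-z failures would be expected to come from.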

SLIDE 21

[Figure: plot versus zspec; annotation as extracted: "zmean = 0.34397"]

kNN Quasar Photometric Redshifts

SLIDE 22

[Figure: plot versus zspec; annotation as extracted: "zone peak = 0.11096"]

kNN Quasar Photometric Redshifts

SLIDE 23

Questions

Can we overcome the problems of real data?
Will there be data of high intrinsic dimension?
Will astronomers be able to deploy the algorithms?
Where do GPUs fit? (GPU+brute force may be just as fast?)

SLIDE 24

Conclusions

Provided the data can be suitably prepared, and the science-driven usage of the algorithm intelligently motivated, the fast algorithms presented here have excellent potential for advancing astronomical research