
SLIDE 1

Commentary on “Techniques for Massive- Data Machine Learning in Astronomy”

Nick Ball Herzberg Institute of Astrophysics Victoria, Canada

SCMA V, Penn State, Jun 14th 2011 1 of 24

SLIDE 2

The Problem

  • Astronomy faces enormous datasets
  • Their size, dimensionality, and complexity require intelligent, automated investigation
  • Exponential increase in data size: algorithms must scale no worse than O(N log N)
  • Most data mining algorithms naïvely scale as N² or worse

SLIDE 3

The Solution

  • Make data mining algorithms that scale as N log N (or better)!
  • May have to compromise accuracy slightly
  • Deploy them so that astronomers are willing and able to use them

  • They must work on real astronomical data
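
The gap between N² and N log N scaling can be illustrated with nearest-neighbour search. What follows is a minimal pure-Python sketch under stated assumptions, not the software discussed in the talk: `build_kdtree`, `nearest`, and `brute_force_nearest` are hypothetical names, and uniformly random 2-D points stand in for catalogue features. Brute force compares each query with every point, whereas the kd-tree prunes whole branches and needs roughly log N work per query on low-dimensional data.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Recursively build a kd-tree; each node is (point, left, right).

    Sorting at every level keeps the sketch short (about N log^2 N to build).
    """
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def nearest(node, query, depth=0, best=None):
    """Return the tree point closest to query, skipping branches that cannot win."""
    if node is None:
        return best
    point, left, right = node
    if best is None or sq_dist(query, point) < sq_dist(query, best):
        best = point
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, depth + 1, best)
    return best


def brute_force_nearest(points, query):
    """Reference answer: compare the query with every point."""
    return min(points, key=lambda p: sq_dist(p, query))


random.seed(0)
catalogue = [(random.random(), random.random()) for _ in range(2000)]
queries = [(random.random(), random.random()) for _ in range(200)]
tree = build_kdtree(catalogue)

# Both methods should find a neighbour at the same distance for every query.
agree = all(
    sq_dist(nearest(tree, q), q) == sq_dist(brute_force_nearest(catalogue, q), q)
    for q in queries
)
print("kd-tree matches brute force:", agree)
```

The pruning step is where the saving comes from. It becomes less effective as the dimensionality of the data grows.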
SLIDE 4

Collaboration is Vital

  • Successful use of astrostatistics and data mining requires expertise in computer science, statistics, and astronomy
  • Collaboration enables novelty that would not arise from a single group
  • So, computer scientists supplying algorithms in this way is excellent

SLIDE 5

But

  • ... expertise in computer science, statistics, and astronomy
  • Successful collaborations have involved astronomers who are experts in computing/statistics, or who are working closely and over time with these experts

SLIDE 6

And

  • Astronomy data are messy:
    • Large, complex, increasingly high-dimensional, time-domain
    • Missing data: non-observation or non-detection
    • Heteroscedastic, non-Gaussian, underestimated errors
    • Outliers, artifacts, false detections, systematic effects
    • Correlated inputs
    • Etc.

SLIDE 7

An Example

  • How do you apply astrostatistics and fast algorithms to this?

SLIDE 8

SLIDE 9

The Next Generation Virgo Cluster Survey

  • 10σ point source limiting magnitude g = 25.7 (faint!)
  • Photometric (few spectra), ~100 deg², 5 bands (ugriz, like Sloan)
  • 10⁷+ galaxies, 2.6 terabytes of data
  • 40 people at 23 institutions in Canada, France, etc. (PI Laura Ferrarese @ HIA)
  • 2009-2012
SLIDE 10

Virgo is an actual cluster of galaxies, the nearest large one to us

SLIDE 11

NGVS Statistical Challenges

  • Object detection and classification
  • Photometric redshifts (photo-z)
  • Virgo cluster membership / background
  • Missing data
  • Field-to-field variation
  • Multi-wavelength data
  • Completeness as a function of magnitude, surface brightness, etc.

SLIDE 12

Object detection: low surface brightness galaxies

SLIDE 13


Cluster membership: photometric redshift using k nearest neighbours
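
As a rough illustration of a kNN photometric redshift, the sketch below averages the spectroscopic redshifts of the k training objects closest in colour space. Everything specific here is an assumption made for the example: the function name `knn_photo_z`, the choice k = 3, the use of four colours as features, and the made-up training values. It uses a brute-force search for brevity and is not the NGVS pipeline.

```python
import heapq
import math


def knn_photo_z(training, colours, k=3):
    """Mean spectroscopic redshift of the k training objects nearest in colour space."""
    neighbours = heapq.nsmallest(
        k, training, key=lambda row: math.dist(row[0], colours)
    )
    return sum(z for _, z in neighbours) / len(neighbours)


# Hypothetical training set: ((u-g, g-r, r-i, i-z), spectroscopic redshift).
training = [
    ((1.2, 0.6, 0.3, 0.2), 0.05),
    ((1.3, 0.7, 0.3, 0.2), 0.06),
    ((1.1, 0.6, 0.4, 0.2), 0.04),
    ((1.8, 1.3, 0.6, 0.4), 0.45),
    ((1.9, 1.4, 0.6, 0.4), 0.50),
    ((1.7, 1.4, 0.7, 0.4), 0.48),
]

z_near = knn_photo_z(training, (1.2, 0.65, 0.3, 0.2))
z_far = knn_photo_z(training, (1.8, 1.35, 0.6, 0.4))
print(f"photo-z estimates: {z_near:.3f}, {z_far:.3f}")
```

A membership decision would then need a redshift cut separating the cluster from the background; the cut itself is a science choice and is not shown.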


SLIDE 14


Missing data: NGVS fields (not final) don’t all contain all 5 bands ugriz
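
One possible way to compare objects when some bands are missing is to measure distance only over the bands both objects have. This is an illustrative assumption, not a method stated in the talk; `shared_band_distance` and the magnitudes below are hypothetical, with `None` marking an unobserved band.

```python
import math


def shared_band_distance(a, b):
    """RMS magnitude difference over the bands observed in both objects.

    Magnitudes are given in (u, g, r, i, z) order; None marks a missing band.
    """
    diffs = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    if not diffs:
        return math.inf  # no band in common, so the objects cannot be compared
    return math.sqrt(sum(diffs) / len(diffs))


full = (22.1, 21.0, 20.4, 20.1, 19.9)
no_r = (22.0, 21.1, None, 20.0, 19.9)   # field without r-band coverage
only_u = (22.0, None, None, None, None)
only_g = (None, 21.0, None, None, None)

d = shared_band_distance(full, no_r)
print(f"distance over shared bands: {d:.4f}")
print("no shared bands:", shared_band_distance(only_u, only_g))
```

A masked distance of this kind need not satisfy the triangle inequality, so it would not drop straight into a kd-tree. That is one reason missing data complicate the fast algorithms.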


SLIDE 15

Multi-wavelength data

SLIDE 16

Canadian Astronomy Data Centre

  • CADC is one of the world’s largest astronomy data centres
  • ~500 terabytes of data (will grow to petabytes)
  • Uses Virtual Observatory standards
  • Staffed by astronomers and computer specialists, but not statisticians

SLIDE 17

CANFAR

  • Canadian Advanced Network for Astronomical Research, at CADC
  • Combines cluster job scheduling with cloud computing resources
  • Users manage their own virtual machines

SLIDE 18

So

  • Put fast data mining tools on the CANFAR infrastructure

  • ... but early days, not much to say yet
SLIDE 19

Guide to Data Mining in Astronomy

  • Virtual Observatory KDD-IG guide: http://www.ivoa.net/cgi-bin/twiki/bin/view/IVOA/IvoaKDDguide
  • Emphasizes data mining, which is part of astroinformatics
  • But this overlaps with astrostatistics
  • -> potential outreach channel to wider community

SLIDE 20

kNN Quasar Photometric Redshifts

  • Use kd-tree for fast kNN assignment of photo-zs to Sloan Digital Sky Survey quasars
  • Single neighbour, perturb input features to make a PDF in redshift
  • Removing multi-peaked PDFs removes almost all catastrophic outliers
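
The perturbation idea can be sketched as follows. This is a minimal illustration under stated assumptions rather than the published method: the perturbations are taken to be Gaussian with the quoted feature errors, a brute-force search stands in for the kd-tree, peaks are counted as runs of adjacent occupied histogram bins, and all names (`photo_z_pdf`, `count_peaks`) and data values are hypothetical.

```python
import random
from collections import Counter


def nearest_redshift(training, colours):
    """Redshift of the single nearest training object (brute force, no kd-tree)."""
    return min(
        training,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], colours)),
    )[1]


def photo_z_pdf(training, colours, errors, n_draws=200, bin_width=0.1, rng=None):
    """Histogram of nearest-neighbour redshifts over perturbed copies of the input.

    Returns {bin index: probability}; bin i covers [i * bin_width, (i + 1) * bin_width).
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    counts = Counter()
    for _ in range(n_draws):
        perturbed = [rng.gauss(c, e) for c, e in zip(colours, errors)]
        counts[int(nearest_redshift(training, perturbed) / bin_width)] += 1
    return {b: n / n_draws for b, n in sorted(counts.items())}


def count_peaks(pdf, min_prob=0.05):
    """Count runs of adjacent bins whose probability is at least min_prob."""
    bins = sorted(b for b, p in pdf.items() if p >= min_prob)
    return sum(1 for i, b in enumerate(bins) if i == 0 or b != bins[i - 1] + 1)


# Hypothetical training quasars: two colour-space groups at different redshifts.
training = [
    ((0.20, 0.10), 1.52), ((0.22, 0.12), 1.55), ((0.18, 0.09), 1.58),
    ((0.80, 0.50), 3.02), ((0.82, 0.52), 3.05), ((0.78, 0.49), 3.08),
]

clean_pdf = photo_z_pdf(training, (0.20, 0.10), (0.01, 0.01))
ambiguous_pdf = photo_z_pdf(training, (0.50, 0.30), (0.20, 0.20))

print("peaks, well-measured object:", count_peaks(clean_pdf))
print("peaks, ambiguous object:", count_peaks(ambiguous_pdf))
```

An object whose perturbed copies land in more than one redshift group yields a multi-peaked PDF; dropping such objects is the filtering step described in the last bullet.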

SLIDE 21

kNN Quasar Photometric Redshifts

[Figure: zmean against zspec, axes from 1 to 6; annotated value 0.34397]

SLIDE 22

kNN Quasar Photometric Redshifts

[Figure: z (one peak) against zspec, axes from 1 to 6; annotated value 0.11096]

SLIDE 23

Questions

  • Can we overcome the problems of real data?
  • Will there be data of high intrinsic dimension?
  • Will astronomers be able to deploy the algorithms?
  • Where do GPUs fit? (GPU + brute force may be just as fast?)

SLIDE 24

Conclusions

  • Provided the data can be suitably prepared, and the science-driven usage of the algorithm intelligently motivated, the fast algorithms presented here have excellent potential for advancing astronomical research