

slide-1
SLIDE 1

Big Data Era Challenges and Opportunities in Astronomy: How SOM/LVQ and Related Learning Methods Can Contribute?

  • Prof. Pablo Estévez, Ph.D.

Department of Electrical Engineering, Universidad de Chile & Millennium Institute of Astrophysics, Chile

Houston, TX, January 8, 2016

WSOM 2016

slide-2
SLIDE 2

Contents

  • Astronomy Context
  • Large Synoptic Survey Telescope
  • Millennium Institute of Astrophysics (MAS)
  • Big Data
  • SOM/LVQ
  • Two Examples
  • Conclusions

slide-3
SLIDE 3

Mirrors of the largest telescopes

SALT

By 2024, Chile will concentrate 70% of the global observing capacity.

We have the responsibility to capitalize on this opportunity.

slide-4
SLIDE 4

Large Synoptic Survey Telescope (LSST) Cerro Pachón, Chile, 2022

slide-5
SLIDE 5

  • 3 x 3 degrees field of view
  • Covers the whole southern hemisphere sky in 3 days
  • Over 10 years, LSST will produce a 3D video of the Universe
  • Cosmic cinematography: exploration of the time domain
  • In one year it will collect more data than all previous telescopes combined (15 PB/year)
  • Real-time data management
  • 100,000 transients per night

slide-6
SLIDE 6

Challenges for Chile

  • Access to 100% of the data
  • But LSST will not do the analysis
  • Avalanche of data
  • Huge challenges in computational intelligence and data analysis
  • New way of doing science: data-driven science

slide-7
SLIDE 7

LSST: Big Data Challenges

  • Mining in real time a massive data stream of ~2 terabytes per hour for 10 years
  • Classifying more than 50 billion objects and following up many of these events in real time
  • Spectrograph to separate light into its frequency spectrum
  • Extracting knowledge in real time for ~2 million events per night
  • Discovering the unknown unknowns (serendipity): the things that we do not even know that we don't know!

slide-8
SLIDE 8

Big Data: The Four V's

slide-9
SLIDE 9
  • Credits: ALMA, Maccarena Gonzalez:
slide-10
SLIDE 10

Big Data Analytics

  • How did the Milky Way form?
  • Challenge: Can data mining discover new patterns and correlations?
  • What vs. why
  • Challenge: Providing good visualization tools for doing science
  • High-performance computing, GPGPUs

slide-11
SLIDE 11

Pragmatic Approach

  • Knowing what, not why, is good enough
  • This is done by finding valuable correlations (including non-linear relationships)
  • Correlation allows us to analyze a phenomenon by identifying a good proxy for it
  • The grand challenge is the problem of inference: turning data into knowledge through models
  • A data-driven approach is used instead of a hypothesis-driven one

slide-12
SLIDE 12

SOM/LVQ Methods

What is the best classifier in the world?

Source: Fernández-Delgado et al., JMLR, 2014. “Including all the relevant classifiers available today”. Comparison of 179 classifiers on 121 data sets.

Ranking  Classifier
1        Parallel Random Forest
2        Random Forest
3        SVM-C
77       LVQ
119      Supervised SOM

Why are the GLVQ* classifiers not in the list?

  • Implementation in R or Python
  • Easy interface
  • Automatic parameter tuning
slide-13
SLIDE 13

SOM Journal Papers

Figure: 385 SOM papers published in ISI journals (2010-2015)

slide-14
SLIDE 14

Semi-supervised variable star clustering

A new paradigm in the field of ML/CI is semi-supervised learning:

  • (Unsupervised) Find structures, patterns or clusters by measuring similarities between samples
  • (Supervised) Incorporate labels if available; this guides the unsupervised half (label propagation)

Possibility of detecting something novel (patterns not in the training set) while still discriminating the known classes.

Example: Clustering 10,000 periodic variable stars from EROS-2 (Sammon visualization). Only 10% of the data is labeled. Purple: EB, blue: CEPH, yellow: RRL, green: LPV, red: unknown.
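The label-propagation idea can be illustrated with a minimal sketch. It is not the method used for the EROS-2 example: the function name `propagate_labels`, the Gaussian affinity, the width `sigma` and the synthetic two-cluster data are all assumptions made for illustration, and it does not cover the novelty-detection part (the "unknown" class).

```python
import numpy as np

def propagate_labels(X, y, sigma=1.0, n_iter=50):
    """Spread known labels (y >= 0) to unlabeled samples (y == -1).

    Hypothetical helper: labels diffuse over a Gaussian similarity graph,
    and the labeled samples are clamped to their known class at every step.
    """
    classes = np.unique(y[y >= 0])
    # Pairwise squared Euclidean distances between all samples.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Row-normalized Gaussian affinities (no self-similarity).
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    W /= W.sum(axis=1, keepdims=True)
    # One-hot class scores for labeled samples, zeros for unlabeled ones.
    F0 = (y[:, None] == classes[None, :]).astype(float)
    labeled = y >= 0
    F = F0.copy()
    for _ in range(n_iter):
        F = W @ F                 # each sample averages its neighbors' scores
        F[labeled] = F0[labeled]  # clamp the known labels
    return classes[F.argmax(axis=1)]

# Synthetic stand-in data: two well-separated clusters, 10% labeled.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(5.0, 0.5, (100, 2))])
truth = np.repeat([0, 1], 100)
y = np.full(200, -1)
y[::10] = truth[::10]
pred = propagate_labels(X, y)
accuracy = (pred == truth).mean()
```

The unsupervised half is the similarity graph; the supervised half is the clamping step, which is what lets a small labeled fraction steer the result.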

slide-15
SLIDE 15

Active learning with human in the loop

Active learning: The machine can query the expert for labels. In practice, the number of labels needed is much smaller than in the supervised case.

Query strategy: (1) ask for labels of the most uncertain samples (near class boundaries), (2) minimize expected error, (3) minimize output variance, etc.

Example: AL query interface for variable star classification: the expert is shown a pair of samples and chooses whether they belong to the same class.
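Query strategy (1) can be sketched as margin-based uncertainty sampling. This is one common way to measure uncertainty, not necessarily the one used in the interface described above; the function name `most_uncertain` and the probability table are made up for illustration.

```python
import numpy as np

def most_uncertain(proba, n_queries=1):
    """Indices of the samples whose two most likely classes are closest.

    proba: array of shape (n_samples, n_classes) with class probabilities
    from any classifier (hypothetical input for this sketch).
    """
    ordered = np.sort(proba, axis=1)
    margin = ordered[:, -1] - ordered[:, -2]  # small margin = near a boundary
    return np.argsort(margin)[:n_queries]

proba = np.array([
    [0.98, 0.01, 0.01],   # confident
    [0.40, 0.35, 0.25],   # uncertain
    [0.10, 0.80, 0.10],   # fairly confident
    [0.34, 0.33, 0.33],   # most uncertain
])
query = most_uncertain(proba, n_queries=2)
```

Here the expert would be asked about samples 3 and 1 first, since their margins (0.01 and 0.05) are the smallest.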

slide-16
SLIDE 16

Millennium Institute of Astrophysics (MAS) Started in January 2014

Passion for the exploration of the natural world

slide-17
SLIDE 17

Millennium Institute of Astrophysics (MAS) Started in January 2014

  • Milky Way
  • Astroinformatics, Astrostatistics
  • Exoplanets
  • Transients
  • Supernovae

slide-18
SLIDE 18

Astronomical Time Series: Light Curves. “LOS PABLOS” Work

  • Light curve: stellar brightness (magnitude or flux) versus time.
  • Variable stars: stars whose luminosity varies over time (3% of the stars in the universe are variables, and 1% are periodic variable stars).
  • Light curve analysis: useful for period detection, event detection, stellar classification, extrasolar planet discovery, measuring distance to Earth, etc.

slide-19
SLIDE 19

An Example of a Light Curve

slide-20
SLIDE 20

Variable stars

  • Eclipsing binary stars
  • Pulsating star

slide-21
SLIDE 21

Folded Light Curves

  • The transformation “t modulus T” plots successive cycles atop one another, where T is the period.
  • Usually all periods within a range are tried to find the one that maximizes a criterion (sweep).

Figure panels: Folding a light curve; Estimating the period
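Folding and the period sweep can be sketched as follows. The criterion used here (negative within-bin variance of the folded curve, in the spirit of phase dispersion minimization) is a stand-in chosen for brevity and is not the correntropy-based criterion of the talk; the function names and the synthetic sinusoidal light curve are assumptions.

```python
import numpy as np

def fold(t, period):
    """Map observation times to phases in [0, 1) via t modulus T."""
    return np.mod(t, period) / period

def tightness(t, mag, period, n_bins=10):
    """Stand-in criterion: negative mean within-bin variance of the folded curve."""
    bins = np.minimum((fold(t, period) * n_bins).astype(int), n_bins - 1)
    variances = [mag[bins == b].var() for b in range(n_bins) if np.sum(bins == b) > 1]
    return -float(np.mean(variances))

def sweep(t, mag, trial_periods):
    """Try every trial period and keep the one that maximizes the criterion."""
    scores = np.array([tightness(t, mag, p) for p in trial_periods])
    return trial_periods[int(np.argmax(scores))]

# Synthetic, irregularly sampled light curve with a 2.5-day period.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 100.0, 400))
mag = np.sin(2 * np.pi * t / 2.5) + 0.1 * rng.normal(size=t.size)
trial_periods = np.linspace(1.5, 4.0, 2501)
best = sweep(t, mag, trial_periods)
```

When the trial period matches the true one, successive cycles line up and the folded curve is tight; at a wrong period the cycles smear across phases and the within-bin variance grows.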

slide-22
SLIDE 22

Automated period detection

  • Correntropy (generalized correlation) is used to compute similarities between samples
  • Goes beyond second-order statistics, taking into account higher-order moments
  • Robustness to outliers and noise
  • Spectral decomposition of correntropy using advanced signal processing techniques
  • Gaussian basis functions are used instead of sinusoids
  • Goes beyond the Fourier representation to get super-resolution and more localized, sparser spectra
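As a rough illustration of correntropy, the sketch below computes the sample autocorrentropy of an evenly sampled series with a Gaussian kernel, V[m] = mean over n of G_sigma(x[n] - x[n-m]). This is only the basic estimator: it assumes even sampling, and it includes neither the slotted variant for irregular sampling nor the spectral decomposition mentioned above. The kernel width `sigma` and the test signal are assumptions.

```python
import numpy as np

def autocorrentropy(x, max_lag, sigma=1.0):
    """Sample autocorrentropy V[m] for lags m = 0..max_lag (even sampling assumed)."""
    x = np.asarray(x, dtype=float)
    norm = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    v = np.empty(max_lag + 1)
    for m in range(max_lag + 1):
        diff = x[m:] - x[:x.size - m]
        # The Gaussian kernel saturates for large differences, so a few
        # outliers contribute almost nothing instead of dominating the sum.
        v[m] = norm * np.mean(np.exp(-diff ** 2 / (2.0 * sigma ** 2)))
    return v

n = np.arange(400)
x = np.sin(2 * np.pi * n / 25.0)  # period of 25 samples
x[::37] += 8.0                    # inject a few large outliers
v = autocorrentropy(x, max_lag=40, sigma=0.1)
peak_lag = 1 + int(np.argmax(v[1:]))  # strongest similarity at a non-zero lag
```

Because the kernel expands into all even powers of the difference, the measure depends on higher-order moments and not only on the second-order correlation.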

slide-23
SLIDE 23

Example

slide-24
SLIDE 24

EROS-2 Survey

  • Survey of the Magellanic Clouds and the Galactic bulge
  • Data taken at the ESO observatory in La Silla, Chile
  • 38.2 million light curves with two channels each (blue and red)
  • EROS dataset processed automatically in 18 hours using a GPGPU cluster
  • We found 120,000 periodic variables

Near future (within MAS):

  • Be able to process a billion light curves per day

slide-25
SLIDE 25

DEMO Period_finding_demo_Python

slide-26
SLIDE 26

Real-time Transient Detection Pipeline (PANCHO'S Work)

  • Quest to complete the luminosity-time diagram for low luminosities and short cadences
  • Discovery of new transient phenomena
  • New instruments like DECam and LSST will allow us to detect, for example, the explosion of a supernova in real time
  • A custom real-time transient pipeline has been developed

slide-27
SLIDE 27
slide-28
SLIDE 28

High Cadence Transient Survey (F. Förster et al.)

  • HiTS scientific objective: find evidence of shock breakouts (SBO).
  • SBO: event that occurs instants after the explosion of a supernova.
  • Supernova: explosion at the end of the life cycle of massive stars.
  • Dark Energy Camera (DECam)

  1. Formation of neutron star (~secs)
  2. Shock emergence (~hrs)
  3. Glowing ejecta (~days/weeks)
  4. Remnant diffusion (~kyrs)

Figures: nasa.gov

slide-29
SLIDE 29

HiTS Image Reduction Pipeline

  • At this point, candidates are dominated by artifacts (1:10K)
  • ML to find the needles in the haystack

Pipeline stages: Data Capture → Preprocessing → Alignment → PSF Matching → Subtraction → Candidate Selection → Candidate Filtering → Visual Inspection
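The subtraction and candidate-selection stages can be pictured with a toy sketch. It assumes the two images are already aligned and PSF-matched and that the pixel noise level is known; the function name `find_candidates`, the SNR threshold of 5 and the synthetic stamps are illustrative assumptions, not the HiTS pipeline itself.

```python
import numpy as np

def find_candidates(current, reference, noise_sigma, snr_threshold=5.0):
    """Subtract the reference image and flag pixels with a significant excess."""
    diff = current - reference
    snr = diff / noise_sigma
    ys, xs = np.nonzero(snr > snr_threshold)
    return diff, list(zip(ys.tolist(), xs.tolist()))

# Two synthetic 21x21 stamps of the same sky patch with unit Gaussian noise.
rng = np.random.default_rng(3)
reference = 100.0 + rng.normal(0.0, 1.0, (21, 21))
current = 100.0 + rng.normal(0.0, 1.0, (21, 21))
current[10, 10] += 50.0  # a new source appears in the current image

# The difference of two unit-noise images has noise sqrt(2).
diff, candidates = find_candidates(current, reference, noise_sigma=np.sqrt(2.0))
```

On real data most pixels that pass such a threshold are artifacts (bad subtractions, cosmic rays), which is why a learned filtering stage follows.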

slide-30
SLIDE 30

Non-negative Matrix Factorization

slide-31
SLIDE 31

Feature Extraction using NMF

In NMF, we aim to decompose V into non-negative factors W and H (V ≈ WH) by solving a constrained optimization problem, where the non-negativity constraints are element-wise.

  • Non-negativity: only additive combinations are allowed.
  • Sparse and part-based decompositions.
  • The NMF problem is non-convex in W and H jointly (ill-posed). Regularization can alleviate this.
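A minimal NMF sketch, assuming the squared Frobenius reconstruction error as the objective and the multiplicative update rules of Lee and Seung. The exact objective and regularization used in the talk are not shown in the slide, so this sketch has no regularization term; the function names and the random test matrix are assumptions.

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) as W (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: factors are rescaled by non-negative
        # ratios, so W and H can never become negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def relative_error(V, W, H):
    return np.linalg.norm(V - W @ H) / np.linalg.norm(V)

rng = np.random.default_rng(42)
V = rng.random((30, 4)) @ rng.random((4, 20))  # non-negative, rank 4
W4, H4 = nmf(V, k=4)
W1, H1 = nmf(V, k=1)
err4 = relative_error(V, W4, H4)
err1 = relative_error(V, W1, H1)
```

Because the problem is non-convex, different seeds can give different factors; the rows of H play the role of additive "parts" and the rows of W are the per-sample coefficients.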

slide-32
SLIDE 32

Principal Component Analysis

Lee and Seung, Nature 1999

slide-33
SLIDE 33

Non-negative Matrix Factorization

Lee and Seung, Nature 1999

slide-34
SLIDE 34

Cosmic rays and noise

Figure panels (two examples): Current, Reference, Difference

slide-35
SLIDE 35

Stars!

Figure panels (two examples): Current, Reference, Difference

slide-36
SLIDE 36

2D Visualization of the astronomical images using Self Organizing Maps (SOM)

  • The SOM is an unsupervised neural network used for visualization...
  • The database contains ~1000 images of 21x21 pixels of objects labeled as variable by the CMM pipeline.
  • We use non-negative matrix factorization (NMF) to capture the different behaviors in the stamps and reduce dimensionality (441 → 16).
  • We train the SOM with the NMF coefficients and obtain a U-matrix visualization.
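The last step can be sketched with a small online SOM and its U-matrix. Random 16-dimensional vectors stand in for the NMF coefficients of the stamps; the map size, learning-rate and neighborhood schedules, and function names are assumptions for illustration, not the settings used in the talk.

```python
import numpy as np

def train_som(X, rows=6, cols=6, n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Online SOM training on a rows x cols grid; returns the prototype vectors."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, X.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for step in range(n_iter):
        frac = step / n_iter
        lr = lr0 * (1.0 - frac)                 # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5     # shrinking neighborhood
        x = X[rng.integers(len(X))]
        # Best-matching unit: the node whose prototype is closest to the sample.
        dist = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dist), dist.shape)
        # Gaussian neighborhood on the map grid, centered at the BMU.
        grid_d2 = ((grid - np.array(bmu)) ** 2).sum(axis=2)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        weights += lr * h[:, :, None] * (x - weights)
    return weights

def u_matrix(weights):
    """Mean distance from each node's prototype to its 4-connected grid neighbors."""
    rows, cols, _ = weights.shape
    u = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            neigh = [(i + di, j + dj) for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= i + di < rows and 0 <= j + dj < cols]
            u[i, j] = np.mean([np.linalg.norm(weights[i, j] - weights[a, b]) for a, b in neigh])
    return u

rng = np.random.default_rng(0)
coeffs = rng.random((1000, 16))  # stand-in for ~1000 stamps x 16 NMF coefficients
som = train_som(coeffs)
umat = u_matrix(som)
```

High values in the U-matrix mark borders between groups of similar prototypes, which is what makes it useful as a 2D view of the data.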

slide-37
SLIDE 37

The whole picture

slide-38
SLIDE 38

Proposed method: Offline training phase

Flow: Training data (N x 3 x 441) → Slicing & Scaling → Im1, Im2, ..., Ims (each N x 441) → NMF → (W1, H1), (W2, H2), ..., (Ws, Hs), with each W of size N x K and each H of size K x 441 → Train RF model

  • Training set: 500,000 labeled HiTS candidates
  • NMF parameters: K and Lambda

slide-39
SLIDE 39

Results: Classification Performance

NMF outperforms PCA and the raw model:

  • (a) SNR > 6: at FPR 0.1%, FNR is 4% (NMF), 10% (PCA), 15% (raw)
  • (b) 5 < SNR < 6: at FPR 0.1%, FNR is 15% (NMF), 30% (PCA), 40% (raw)

NMF is less affected when SNR decreases.
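Numbers of this kind (FNR at a fixed FPR) can be computed from classifier scores as sketched below. The convention that positives are real transients and negatives are artifacts, the function name `fnr_at_fpr` and the synthetic Gaussian scores are assumptions for illustration; the figures reported above come from the talk, not from this code.

```python
import numpy as np

def fnr_at_fpr(scores, labels, target_fpr=0.001):
    """False-negative rate when the threshold is set to meet a target false-positive rate."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    negatives = np.sort(scores[~labels])
    # Allow at most target_fpr of the negatives to score above the threshold.
    n_allowed = int(np.floor(target_fpr * negatives.size))
    threshold = negatives[negatives.size - n_allowed - 1]
    predicted = scores > threshold
    fpr = float(np.mean(predicted[~labels]))
    fnr = float(np.mean(~predicted[labels]))
    return fpr, fnr, threshold

# Synthetic scores: 10,000 negatives and 1,000 positives.
rng = np.random.default_rng(7)
scores = np.concatenate([rng.normal(0.0, 1.0, 10000), rng.normal(4.0, 1.0, 1000)])
labels = np.concatenate([np.zeros(10000, dtype=bool), np.ones(1000, dtype=bool)])
fpr, fnr, threshold = fnr_at_fpr(scores, labels, target_fpr=0.001)
```

Fixing the FPR at a very low value makes sense when artifacts outnumber real objects by orders of magnitude: the comparison between feature sets then reduces to how many real objects each one misses.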

slide-40
SLIDE 40

Traditional Pattern Recognition Model

slide-41
SLIDE 41

Inspired by the movie Inception

PANCHO, WE NEED TO GO DEEPER

slide-42
SLIDE 42

Deep Learning

  • For many years shallow neural network architectures were used
  • Classical multilayer perceptron trained by error backpropagation (gradient descent)
  • Problems for deeper architectures: vanishing gradients, number of examples, computational time
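The vanishing-gradient problem can be made concrete with a small calculation. The sketch assumes sigmoid activations and ignores the weight matrices, so it only bounds the factor contributed by the activation derivatives during backpropagation; the function names are illustrative.

```python
import math

def sigmoid_derivative(z):
    """Derivative of the logistic sigmoid; its maximum value is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_scale_bound(n_layers):
    """Upper bound on the product of sigmoid derivatives across n_layers layers."""
    return sigmoid_derivative(0.0) ** n_layers

bounds = {n: gradient_scale_bound(n) for n in (2, 5, 10, 20)}
```

Even in the best case each sigmoid layer scales the backpropagated error by at most 0.25, so after 10 layers the factor is below one in a million, and the early layers barely learn.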

slide-43
SLIDE 43

Convolutional Neural Nets Applied to HiTS

Figure: ConvNet architecture diagram (convolutional and pooling layers)

Guillermo Cabrera, Ignacio Reyes, Pablo Estevez, Francisco Förster, Juan-Carlos Maureira

slide-44
SLIDE 44

ConvNet Applied to HiTS

Detection error tradeoff curve

Data set of 455,393 samples:

  • 350,393 training set
  • 5,000 validation set
  • 100,000 test set

It takes ~12 min using Theano on a Tesla K20 GPU (graphics processing unit)

slide-45
SLIDE 45

Movies!

http://www.das.uchile.cl/~fforster/ATEL/summary_das.html

slide-46
SLIDE 46

Movies!

http://www.das.uchile.cl/~fforster/ATEL/summary_das.html

slide-47
SLIDE 47
slide-48
SLIDE 48
slide-49
SLIDE 49

Movies!

  • 20 young supernovae were detected in real time
  • Plus 70 other supernovae
  • In 6 nights of observation with DECam

slide-50
SLIDE 50

Conclusions

  • Big Data is here to stay
  • Deep learning is a computational intelligence/machine learning technique that allows us to extract features automatically
  • The importance of multidisciplinary teams
  • To-do list: larger and more heterogeneous data sets, combining features (NMF, PCA, DL, engineered features), active learning

slide-51
SLIDE 51

References

  • P. Huijse, P. Estevez, P. Protopapas, J. C. Principe, P. Zegers, “Computational Intelligence Challenges and Applications on Large-Scale Astronomical Time Series Databases”, IEEE Computational Intelligence Magazine, August 2014.
  • P. Protopapas, P. Huijse, P. Estevez, P. Zegers, J. C. Principe, J. B. Marquette, “A novel, fully automated pipeline for period estimation in the EROS 2 data set”, Astrophysical Journal Supplement Series, 216:25, February 2015.
  • P. Huijse, P. Estevez, P. Protopapas, P. Zegers, J. Principe, “An Information Theoretic Algorithm for Finding Periodicities in Stellar Light Curves”, IEEE Transactions on Signal Processing, Vol. 60, No. 10, pp. 5135-5145, 2012.
  • P. Huijse, P. Estevez, P. Zegers, P. Protopapas, J. Principe, “Period Estimation in Astronomical Time Series using Slotted Correntropy”, IEEE Signal Processing Letters, Vol. 18, No. 6, pp. 371-374, 2011.