Knowledge Discovery Workflows in the Exploration of Complex Astronomical Datasets

SLIDE 1

Knowledge Discovery Workflows in the Exploration of Complex Astronomical Datasets

Raffaele D’Abrusco

Harvard-Smithsonian Center for Astrophysics

Raffaele D’Abrusco (CfA) KD workflows in Astronomy IAU 2012 - August 29th 2012 1 / 22

SLIDE 2

Galilean experimental method

SLIDE 3

Setting the stage

Knowledge Discovery - KD - is the “automatic processing of large amount of data to extract patterns that can represent knowledge about the data”.

SLIDE 4

KD in the real world

Outside our Real and Virtual Domes, KD methodology has already shaped how data are processed and knowledge is extracted in several (expected and unexpected) fields:

  • Social sciences: advertisement placement, social networks...
  • Finance: market analysis tools, derivatives trading...
  • Life sciences: genetics, epidemiology, drug testing...
  • Security: face recognition, behavior tracking...
  • Google and the like...

And for most of these fields, KD is the only way to make sense of the overwhelming amount of data gathered.

SLIDE 5

The opportunity in Astronomy

Advances in astronomical technology (hardware and software) allow us to go larger, deeper and to higher resolution, both spatially and spectrally, changing the nature of astronomical data.

[Figure: data complexity; axes: data volume (bytes), 10^9 to 10^18, and # sources, 10^3 to 10^6]

Facilities like LSST, SKA, ALMA, Euclid, etc., together with the access to and federation of archival data provided by the VOs, will boost this change by making large multivariate datasets (also spanning the time axis) easily available.

SLIDE 6

Not just a needle in the haystack

A KD workflow is a sequence of analysis steps accomplished through KD techniques to extract the most knowledge out of (usually) large amounts of (complex) data.

Goals:

Discovery

  • Find new complex correlations;
  • Expand known correlations to more dimensions;
  • Find new simple correlations, so far overlooked.

Using the discovery

  • Insight into astrophysics;
  • Classification, regression, new ways to look at things...

While high-dimensional regions of the observable parameter space are still completely unexplored, not all low-dimensional feature spaces have been investigated yet, since in principle we look only in places where we expect to find something. A systematic way to search for “something” is necessary because it does not depend on our biases, our prioritization, or the limited availability of time and resources.

SLIDE 8

A first try

Extraction of optical candidate quasars from the SDSS photometric dataset using a spectroscopic base of knowledge.

A combination of two unsupervised clustering (UC) techniques and the a priori knowledge available for a subset of confirmed SDSS quasars was used to extract optical candidate quasars from photometric data.

[Flowchart nodes: Start; spectroscopic data; PPS; NEC; selection of best clusterization; successful cluster? (No / Yes); characterization in parameter space; photometric data; selection of photometric objects; candidate quasars; End]
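The "successful cluster?" decision and the final selection of photometric objects can be sketched as follows. This is only an illustration under stated assumptions: the clustering itself is taken as given, and the function names, the label strings and the 0.8 threshold are hypothetical, not values from the talk.

```python
from collections import defaultdict

def successful_clusters(cluster_ids, spec_labels, min_fraction=0.8):
    """Return the clusters dominated by spectroscopically confirmed quasars.

    cluster_ids: cluster index of every object.
    spec_labels: "qso", "other", or None (no spectroscopy) for every object.
    min_fraction: hypothetical success threshold, not taken from the talk.
    """
    counts = defaultdict(lambda: [0, 0])  # cluster -> [n_qso, n_with_spectra]
    for cid, label in zip(cluster_ids, spec_labels):
        if label is None:
            continue
        counts[cid][1] += 1
        if label == "qso":
            counts[cid][0] += 1
    return {cid for cid, (n_qso, n_spec) in counts.items()
            if n_qso / n_spec >= min_fraction}

def candidate_quasars(cluster_ids, spec_labels, min_fraction=0.8):
    """Indices of photometric-only objects that fall in successful clusters."""
    good = successful_clusters(cluster_ids, spec_labels, min_fraction)
    return [i for i, (cid, label) in enumerate(zip(cluster_ids, spec_labels))
            if label is None and cid in good]

# Made-up toy data: three clusters, some objects without spectroscopy.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2]
spec_labels = ["qso", "qso", None, None, "other", "qso", None, "qso", None]
print(candidate_quasars(cluster_ids, spec_labels))
```

Objects without spectroscopy inherit the "candidate" tag only from clusters whose spectroscopic members are mostly confirmed quasars, which is the role the base of knowledge plays in the flowchart.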

SLIDE 9

The Weak Gated Expert

The Weak Gated Expert (WGE) is a KD procedure for the determination of zphot for galaxies and quasars, based on clustering in the color space and the training of an ensemble of neural networks for regression.

[Diagram: fuzzy k-means clusters (Cluster 1 to Cluster 8); experts with outputs z(1)phot, z(2)phot, ..., z(N)phot; gating network with output z(Best)phot]

  • The UC algorithm splits the feature space into more homogeneous chunks to prevent under- or over-fitting of the experts;
  • Multiple distinct experts (neural networks) are trained on different regions of the feature space;
  • The gate combines the outputs of the single experts in order to maximize the accuracy of the reconstruction and minimize biases.
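A toy version of the gating idea, for illustration only. It assumes the standard fuzzy k-means membership formula and replaces the trained gating network with a plain membership-weighted average; the experts are constant stand-ins for trained neural networks, and all names are hypothetical.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy k-means membership of `point` in each cluster (fuzzifier m > 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point sitting exactly on a center belongs fully to that cluster.
    for k, d in enumerate(dists):
        if d == 0.0:
            return [1.0 if j == k else 0.0 for j in range(len(centers))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_k / d_j) ** power for d_j in dists) for d_k in dists]

def gated_prediction(point, centers, experts, m=2.0):
    """Membership-weighted average of the experts' outputs (simplified gate)."""
    weights = fuzzy_memberships(point, centers, m)
    return sum(w * expert(point) for w, expert in zip(weights, experts))

# Toy setup: two clusters in a two-color space, one stand-in expert per cluster.
centers = [(0.0, 0.0), (2.0, 0.0)]
experts = [lambda p: 0.5, lambda p: 1.5]  # placeholders for trained networks

print(fuzzy_memberships((0.5, 0.0), centers))
print(gated_prediction((1.0, 0.0), centers, experts))
```

The memberships always sum to one, so a source near a cluster boundary gets a blend of the neighboring experts rather than a hard switch between them.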

SLIDE 10

A more general question

What if the goal is not to improve the accuracy of a quantity obtained by regression (zphot) or of a binary classification of sources (stars vs quasars)? What if the goal is to find out whether any pattern happens to occur in any feature space, using clustering techniques?

The tenet

Spontaneous aggregations of sources in their observable space, the clusters, reflect similarities, i.e. common traits shared by these sources. Anisotropies in the distribution of cluster populations relative to other observables reflect the existence of significant patterns.

SLIDE 11

The CLaSPS method

Clustering-Labels-Scores Patterns Spotter (CLaSPS)

1. A UC algorithm is used to produce clusterings in the parameter space generated by any subset of the observables (the features);

2. Other observables not employed for the clustering (the labels) are used as tags to identify interesting sets of clusters using the scores;

3. The patterns in the selected sets of clusters are extracted and studied.
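The three steps can be read as a scan over feature subsets and labels. The skeleton below is a sketch, not the CLaSPS code: `clasps_scan`, `cluster_fn` and `score_fn` are hypothetical names, the clustering and scoring functions are supplied by the caller, and the toy functions at the bottom exist only to make the example run.

```python
from itertools import combinations

def clasps_scan(observables, labels, cluster_fn, score_fn, min_features=2):
    """Rank (feature subset, label) pairs by score.

    observables: names of the observables usable as features.
    labels: names of the observables reserved as labels.
    cluster_fn(features) -> list of cluster indices, one per source.
    score_fn(cluster_ids, label) -> float, larger means more interesting.
    """
    results = []
    for size in range(min_features, len(observables) + 1):
        for features in combinations(observables, size):
            cluster_ids = cluster_fn(features)        # step 1: clustering
            for label in labels:
                score = score_fn(cluster_ids, label)  # step 2: scoring
                results.append((score, features, label))
    # Step 3 starts from the highest-scoring clusterings.
    return sorted(results, key=lambda r: r[0], reverse=True)

# Toy stand-ins, invented for illustration only.
def toy_cluster(features):
    return [0, 0, 1, 1] if "w1-w2" in features else [0, 1, 0, 1]

def toy_score(cluster_ids, label):
    classes = {"blazar_type": ["a", "a", "b", "b"]}[label]
    agree = sum((c == 0) == (k == "a") for c, k in zip(cluster_ids, classes))
    return agree / len(classes)

ranking = clasps_scan(["u-g", "g-r", "w1-w2"], ["blazar_type"],
                      toy_cluster, toy_score)
print(ranking[0])
```

Keeping the clustering and scoring functions as arguments mirrors the fact that the method is agnostic about which UC algorithm is used.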

SLIDE 12

The choice of the clustering(s)

Sets of clusters (or single clusters) are picked according to the degree of correlation between the distribution of cluster members in the feature space and their distribution in the label space.

$$S_{tot} = \frac{1}{N_{clust}} \cdot \sum_{i=1}^{N_{clust}} S_i = \frac{1}{N_{clust}} \sum_{i=1}^{N_{clust}} \sum_{j=1}^{M^{(j)}-1} \left| f_{ij} - f_{i(j+1)} \right|$$

where $f_{ij}$ is the fraction of members of the i-th cluster with values of the label in the j-th class.

[Figure: scores of the single clusters and total scores for clusterings with 3 to 8 clusters; side panels: Stot and S'tot versus Nclust for lab. 1, for the K-means, SOM and HC algorithms]
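A minimal sketch of the score, assuming that the difference between the fractions in adjacent label classes is taken in absolute value and that each cluster is described by its member counts per label class. The function names and the counts are made up for illustration.

```python
def cluster_score(class_counts):
    """S_i: sum over adjacent label classes of |f_ij - f_i(j+1)|."""
    total = sum(class_counts)
    fractions = [n / total for n in class_counts]
    return sum(abs(fractions[j] - fractions[j + 1])
               for j in range(len(fractions) - 1))

def total_score(counts_per_cluster):
    """S_tot: mean of S_i over the clusters of one clustering."""
    scores = [cluster_score(c) for c in counts_per_cluster]
    return sum(scores) / len(scores)

# Made-up clustering with three clusters and three label classes.
counts = [[8, 2, 0], [3, 3, 4], [0, 0, 10]]
for i, c in enumerate(counts, start=1):
    print(f"cluster {i}: S_i = {cluster_score(c):.3f}")
print(f"S_tot = {total_score(counts):.3f}")
```

A cluster whose members are spread evenly over the label classes scores close to zero, while a cluster concentrated in one class scores high, which is why large scores flag clusters worth a closer look.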

SLIDE 13

An interesting finding

CLaSPS has been applied to a sample of AGNs with multi-wavelength observations spanning from radio to γ-rays (features and labels) to characterize their SEDs in the color feature space.

Dataset → AGNs catalog
Features → UV (GALEX) + Optical (SDSS) + NIR (UKIDSS) + IR (WISE)
Labels → AGN classification, Blazar spectral classification, γ-ray emission

Three clusters composed of Blazars stood out with large values of the scores when the spectral classification was used as label. Further experiments using the γ-ray detection and the FSRQ-BL Lac classification as labels showed that such patterns of Blazars depend on WISE mid-infrared colors.

SLIDE 15

The WISE Blazars strip

This pattern in the IR WISE color space of Blazars would have been apparent even in this low dimensional projection of the multi-λ feature space that we studied with CLaSPS, but it had been overlooked so far.

[Figure: WISE color-color diagram; axes: c12 [mag] and c23 [mag]; legend: WISE Gamma-ray strip, BZBs, BZQs, Obs. AGNs, QSRs, ULIRGs, Liners, Starbursts, LIRGs, Seyferts, Spirals, Ellipticals, Stars]
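For reference, the two colors on the axes can be computed from catalog magnitudes as below. The sketch assumes the usual convention c12 = W1 − W2 and c23 = W2 − W3, with W1, W2, W3 the 3.4, 4.6 and 12 μm WISE magnitudes; the function name and the magnitudes are made up for illustration.

```python
def wise_colors(w1, w2, w3):
    """Return (c12, c23) in magnitudes from the W1, W2, W3 magnitudes."""
    return w1 - w2, w2 - w3

# Invented magnitudes for two hypothetical sources.
sources = {"src_a": (14.0, 13.0, 10.5), "src_b": (12.0, 12.0, 11.5)}
for name, mags in sources.items():
    c12, c23 = wise_colors(*mags)
    print(f"{name}: c12 = {c12:.2f} mag, c23 = {c23:.2f} mag")
```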

SLIDE 16

Another step in the workflow

The WISE Blazars locus can be used as a supervised classifier.

[Figure: axes: PC1 and Fraction]

The WISE Blazars locus is modeled in the Principal Component space generated by the WISE color space as three distinct subregions dominated by different spectral subclasses of sources (BL Lac-dominated, FSRQ-dominated and mixed).

  • Discrete protoscore: $ps_{disc} = 1/n_{extr}$, where $n_{extr}$ is the number of extremal points inside the region (for each region of the locus).
  • Normalized continuous protoscore: $ps_{cont} = \frac{1}{6^{n}} \cdot ps_{disc}^{n}$, where $n$ is an index used to tweak the efficiency and completeness of the association process.
  • Final score: $s = ps_{cont} \cdot w_V$, where $w_V = \| V_{err.ellips.} - V_{reg} \| / V_{reg}$ weights according to the volume of the error ellipsoid of the source.

SLIDE 17

Some more applications

  • Mixing mid-IR and high-energy variability;
  • Classification of unassociated Fermi sources;
  • Extraction of new WISE candidate blazars, with validation using archival multi-wavelength data.

More science with CLaSPS:

1. The characterization of the globular clusters-LMXBs connection in different galaxies;

2. Application to a sample of X-ray selected AGNs with wide-band multi-λ photometry, in which CLaSPS recovers already known correlations.

[Figure: axes: log[Lopt(2500 Å)] [erg s−1 Hz−1] and αox; color scale: nuv−u]

SLIDE 18

CLaSPS and Legacy COSMOS

Here comes the Super Chandra-COSMOS! 2.8 Ms of Chandra exposure time have just been awarded (P.I. F. Civano) to observe 2 deg^2 containing the original Chandra-COSMOS field. It is expected to detect 4500 X-ray sources down to Flim ∼ 2·10^−16 cgs in the [0.5, 2] keV energy band.

COSMOS multi-wavelength coverage is unparalleled: 47 wide and narrow bands spanning the whole spectrum. Perfect to characterize the SEDs of AGNs and constrain the dependence of SMBHs on their environment, as a function of the host galaxies' properties. A treasure for astronomical data miners!

SLIDE 19

Improvements

Handling upper limits and NaNs (regardless of their origins) becomes crucial with observationally rich, complex samples.

1. Observations or upper limits in a band can be translated into binary labels and used to characterize the clustering in the feature space...

2. ...but still, discarding the sources of the sample with unmeasured features can drastically reduce the size and richness of the dataset.

3. Significant comparison with results on datasets with similar features is needed to check robustness, assess variance, etc.

Feature-Distributed Clustering (FDC) methods can be used to address points 1 and 2, while simulations and Object-Distributed Clustering (ODC) techniques are useful for point 3.
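Points 1 and 2 can be illustrated with a few lines of code: a band with detections and upper limits becomes a binary label, while requiring every feature to be measured shrinks the sample. This is a sketch on invented data; the column names, the `(value, is_upper_limit)` convention and the helper name are assumptions, and it does not implement FDC or ODC.

```python
import math

def is_measured(value, is_upper_limit=False):
    """True only for an actual detection (not None, not NaN, not an upper limit)."""
    return value is not None and not math.isnan(value) and not is_upper_limit

# Invented sample: two color features plus an X-ray band stored as
# (value, is_upper_limit).
sources = [
    {"u-g": 0.3, "w1-w2": 1.0, "xray_flux": (2e-15, False)},
    {"u-g": 0.1, "w1-w2": float("nan"), "xray_flux": (5e-16, True)},
    {"u-g": None, "w1-w2": 0.7, "xray_flux": (None, False)},
    {"u-g": 0.4, "w1-w2": 0.9, "xray_flux": (8e-16, True)},
]
features = ["u-g", "w1-w2"]

# Point 1: detection vs. upper limit (or no data) becomes a binary label.
xray_label = [int(is_measured(*s["xray_flux"])) for s in sources]

# Point 2: keeping only sources with every feature measured shrinks the sample.
complete = [s for s in sources if all(is_measured(s[f]) for f in features)]

print(xray_label)
print(f"{len(complete)} of {len(sources)} sources have all features measured")
```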

SLIDE 21

Stuff that helps

  • The core CLaSPS functionalities (KD algorithms, statistics and visualization) were originally implemented in R.
  • The connective tissue of the workflow (retrieval of archival data, pre-processing, post-processing) is Python.
  • Specific data-related tasks are carried out by the passepartout for the realm of tables: STILTS.
  • All experiments run on my laptop or on the desktop in my office (OK for small datasets).

SLIDE 22

Handy stuff that would help

What’s missing?

  • A high-level description of KD workflows in astronomy (to compare and improve methods across different applications/use cases/domains);
  • A repository for code, workflows and template datasets;
  • A scalable platform for KD workflows to tackle massive and complex datasets! (My computers won’t cope with the data for much longer...);
  • Widespread adoption of versatile data access protocols (TAP interface, CasJobs-like access points, etc.) by data centers;
  • Astronomers should learn SQL, SQL, SQL, machine learning, statistics,...
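As an illustration of why TAP and SQL go together, the sketch below builds the URL of a synchronous TAP query from an ADQL statement. The `REQUEST`, `LANG` and `QUERY` parameters are those of the TAP standard; the service URL, the table name and the column names are placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode

def tap_sync_url(base_url, adql):
    """URL of a synchronous TAP query for the given ADQL statement."""
    params = {"REQUEST": "doQuery", "LANG": "ADQL", "QUERY": adql}
    return f"{base_url.rstrip('/')}/sync?{urlencode(params)}"

# Hypothetical table and columns; replace with those of a real service.
adql = (
    "SELECT TOP 10 ra, dec, w1mag, w2mag "
    "FROM my_catalog "
    "WHERE w1mag - w2mag > 0.5"
)
url = tap_sync_url("https://example.org/tap", adql)
print(url)
```

The same ADQL statement can be sent unchanged to any data center that exposes a TAP endpoint, which is the practical appeal of a shared access protocol.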

SLIDE 24

The future

The future of astronomy will give me (us?) something to cheer about:

  • Astronomy is becoming a data-intensive discipline
  • Exciting science ahead for the brave and lucky ones
  • KD experts acquire transferable skills and expertise valued outside academia
  • (Average) astronomers’ awareness of KD usefulness is (somewhat) growing
  • KD know-how is starting to percolate into the astronomical community

Interesting scientific results will boost KD adoption!

SLIDE 27

Acknowledgements

  • G. Fabbiano (CfA)
  • O. Laurino (CfA)
  • G. Longo (Univ. of Naples)
  • F. Massaro (SLAC)

UC & Classification/Regression → [D’Abrusco, R. et al. 2009, MNRAS, 396, 223], [Laurino, O., D’Abrusco, R. et al. 2011, MNRAS, 418, 4]
CLaSPS → [D’Abrusco, R. et al. 2012, ApJ, 755, 2, 92]
WISE Blazars → [D’Abrusco, R. et al. 2012, ApJ, 748, 68], [Massaro, F., D’Abrusco, R. et al. 2012, ApJ, 752, 61]

SLIDE 28

Thank you!
