Knowledge Discovery Workflows in the Exploration of Complex Astronomical Datasets
Raffaele D’Abrusco
Harvard-Smithsonian Center for Astrophysics
Raffaele D’Abrusco (CfA) KD workflows in Astronomy IAU 2012 - August 29th 2012 1 / 22
Advances in astronomical technology (hardware and software) allow observations to go larger, deeper and to higher resolution, both spatially and spectrally, changing the nature of astronomical data.
[Figure: data complexity; axis labels "(bytes)" with ticks $10^{9}$ to $10^{18}$, and "# sources" with ticks $10^{3}$ to $10^{6}$]
Facilities like LSST, SKA, ALMA and Euclid, together with the access to and federation of archival data provided by the VOs, will accelerate this change by making large multivariate datasets (also spanning the time axis) easily available.
While high-dimensional regions of the observable parameter space are still completely unexplored, not all low-dimensionality feature spaces have been investigated either, because we tend to look where we expect to find something. A systematic way to search for “something” is necessary because it does not depend on our biases, our prioritization, or our limited time and resources.
A combination of two unsupervised clustering (UC) techniques and the a priori knowledge available for a subset of confirmed SDSS quasars was used to extract photometric candidate quasars.
[Flowchart; box labels: Start; Photometric data; spectroscopic data; Selection of best clusterization; Successful cluster? (No / Yes); Characterization in parameter space; Selection of photometric candidate quasars; End]
[Diagram: fuzzy k-means clusters (Cluster 1 to Cluster 8), Experts with outputs $z_{phot}^{(1)}, z_{phot}^{(2)}, \dots, z_{phot}^{(N)}$, and a Gating Network producing $z_{phot}^{(Best)}$]
The UC algorithm splits the feature space into more homogeneous chunks to prevent under- or over-fitting.
Multiple distinct experts (neural networks) are trained.
The gate combines the outputs of the individual experts in … and minimizes biases.
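The three steps above can be illustrated with a minimal sketch. It is not the code used for the work presented here and rests on loud assumptions: weighted linear fits stand in for the neural-network experts, the gate is taken to be a membership-weighted average of the expert outputs, and the toy data, the deterministic center initialization, and all function names are hypothetical.

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Fuzzy k-means membership of every sample in every cluster (rows sum to 1)."""
    # Pairwise distances, shape (n_samples, n_clusters); epsilon avoids division by zero.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fuzzy_kmeans(X, n_clusters, m=2.0, n_iter=50):
    """Plain fuzzy k-means: alternate membership and center updates."""
    # ASSUMPTION: simple deterministic start from evenly spaced rows of X.
    start = np.linspace(0, len(X) - 1, n_clusters).astype(int)
    centers = X[start]
    for _ in range(n_iter):
        u = fuzzy_memberships(X, centers, m) ** m
        centers = (u.T @ X) / u.sum(axis=0)[:, None]
    return centers

def with_bias(X):
    """Append a constant column so that each linear expert has an intercept."""
    return np.hstack([X, np.ones((len(X), 1))])

def fit_experts(X, y, gate):
    """One expert per cluster. ASSUMPTION: weighted linear fits replace neural networks."""
    A = with_bias(X)
    coefs = []
    for k in range(gate.shape[1]):
        w = np.sqrt(gate[:, k])  # sqrt so the squared residuals are weighted by membership
        coef, *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)
        coefs.append(coef)
    return np.array(coefs)  # shape (n_clusters, n_features + 1)

def predict(X, centers, experts, m=2.0):
    """ASSUMPTION: the gate is a membership-weighted average of the expert outputs."""
    gate = fuzzy_memberships(X, centers, m)
    return (gate * (with_bias(X) @ experts.T)).sum(axis=1)

def rmse(z, y):
    return float(np.sqrt(np.mean((z - y) ** 2)))

# Hypothetical toy problem: a V-shaped relation that one global linear model cannot fit.
X = np.linspace(0.0, 4.0, 200)[:, None]
y = np.abs(X[:, 0] - 2.0)

centers = fuzzy_kmeans(X, n_clusters=2)
gate = fuzzy_memberships(X, centers)
experts = fit_experts(X, y, gate)
z_mix = predict(X, centers, experts)

global_coef, *_ = np.linalg.lstsq(with_bias(X), y, rcond=None)
z_global = with_bias(X) @ global_coef

print(f"single global fit RMSE: {rmse(z_global, y):.3f}")
print(f"gated experts RMSE:     {rmse(z_mix, y):.3f}")
```

On this toy problem the gated combination should track the two branches of the relation much better than the single global fit, which is the motivation for splitting the feature space before training.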
[Figure: scores of the individual clusters (scale 0.0 to 1.0) and totals for clusterings with 3 to 8 clusters; side panels show $S_{tot}$ and $S'_{tot}$ against $N_{clust}$]
Dataset → AGN catalog
Features → UV (GALEX) + Optical (SDSS) + NIR (UKIDSS) + IR (WISE)
Labels → AGN classification, blazar spectral classification, γ-ray emission
Three clusters composed of blazars stood out with large values of the scores when using the spectral classification as label. Further experiments using the γ-ray detection and the FSRQ/BL Lac classification as labels showed that such patterns …
[Figure: color-color plot with axes c12 [mag] and c23 [mag]; legend: BZBs, BZQs, QSRs, ULIRGs, LINERs, Starbursts, LIRGs, Seyferts, Spirals, Ellipticals, Stars]
[Figure: axes PC1 and Fraction]
The WISE blazar locus is modeled in the Principal Component space generated by the WISE color space as three distinct subregions dominated by different spectral subclasses of sources (BL Lacs, FSRQ-dominated and mixed).
Discrete protoscore: $ps_{disc} = 1/n_{extr}$, where $n_{extr}$ is the number of extremal points inside the region (for each region of the locus).
Normalized continuous protoscore: $ps_{cont} = \frac{1}{6^{n}} \cdot ps_{disc}^{n}$, where $n$ is an index used to tweak the efficiency and completeness of the association process.
Final score: $s = ps_{cont} \cdot w_V$, where $w_V = \lVert V_{err.ellips.} - V_{reg} \rVert / V_{reg}$ weights according to the volume of the error ellipsoid of the source.
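As a worked example, the score can be evaluated step by step. The sketch below assumes the formulas read as $ps_{disc} = 1/n_{extr}$, $ps_{cont} = ps_{disc}^{n}/6^{n}$ and $w_V = |V_{err.ellips.} - V_{reg}|/V_{reg}$; the function names, the input values, and the choice of a zero score when no extremal point falls inside the region are all hypothetical.

```python
import math

def discrete_protoscore(n_extr):
    """ps_disc = 1 / n_extr, as read from the slide."""
    if n_extr <= 0:
        # ASSUMPTION: a source with no extremal point inside the region scores zero.
        return 0.0
    return 1.0 / n_extr

def continuous_protoscore(ps_disc, n):
    """ps_cont = ps_disc**n / 6**n, with n the tuning index."""
    return ps_disc ** n / 6 ** n

def volume_weight(v_ellipsoid, v_region):
    """w_V = |V_ellipsoid - V_region| / V_region."""
    return abs(v_ellipsoid - v_region) / v_region

def final_score(n_extr, n, v_ellipsoid, v_region):
    """s = ps_cont * w_V for one source and one region of the locus."""
    ps_cont = continuous_protoscore(discrete_protoscore(n_extr), n)
    return ps_cont * volume_weight(v_ellipsoid, v_region)

# Hypothetical inputs: 2 extremal points inside the region, n = 1,
# error-ellipsoid volume 0.5 and region volume 2.0 (arbitrary units).
s = final_score(n_extr=2, n=1, v_ellipsoid=0.5, v_region=2.0)
print(f"score = {s:.4f}")  # (1/2) / 6 * (1.5 / 2.0) = 0.0625
```

Raising $n$ shrinks the continuous protoscore of every source with $ps_{disc} < 6$, which is how the index trades efficiency against completeness in this reading.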
Mixing mid-IR and high-energy variability; Classification for Unassociated Fermi sources; Extraction of new WISE candidate blazars with validations using archival multi-wavelength data
[Figure: axes $\log[L_{opt}(2500\,\text{Å})]$ [erg s$^{-1}$ Hz$^{-1}$] and $\alpha_{ox}$, with an nuv−u scale]
Here comes the Super Chandra-COSMOS! 2.8 Ms of exposure time on Chandra were just awarded (P.I. F. Civano) to observe 2 deg$^2$ containing the original Chandra-COSMOS field. The survey is expected to detect 4500 X-ray sources down to $F_{lim} \sim 2 \cdot 10^{-16}$ cgs in the [0.5, 2] keV energy band.
1. Observations or upper limits in a band can be translated into binary labels and used to characterize the clustering in the feature space...
2. ...but still, discarding sources with unmeasured features can drastically reduce the size and richness of the sample.
3. Meaningful comparison with results on datasets with similar features, to check robustness, assess variance, etc.
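The first two points can be made concrete with a small sketch. The catalog, the band names, and the cluster assignments below are hypothetical toy values, and NaN is assumed to mark a band in which only an upper limit is available.

```python
import numpy as np

# Hypothetical toy catalog: rows are sources, columns are bands.
# ASSUMPTION: NaN marks a band with an upper limit only (no detection).
bands = ["band_a", "band_b", "band_c"]
fluxes = np.array([
    [1.2, 0.8, np.nan],
    [0.9, np.nan, np.nan],
    [1.5, 1.1, 0.4],
    [np.nan, 0.7, 0.3],
    [1.1, 0.9, 0.5],
])

# Point 1: detection (1) versus upper limit (0) in a band becomes a binary label.
detected = ~np.isnan(fluxes)
labels_band_c = detected[:, bands.index("band_c")].astype(int)

# The label can then characterize a clustering of the feature space,
# here through the detected fraction in each (hypothetical) cluster.
clusters = np.array([0, 0, 1, 1, 1])
detected_fraction = {
    int(k): float(labels_band_c[clusters == k].mean()) for k in np.unique(clusters)
}

# Point 2: keeping only sources measured in every band shrinks the sample.
complete = detected.all(axis=1)
n_total, n_complete = len(fluxes), int(complete.sum())

print("band_c labels:", labels_band_c.tolist())
print("detected fraction per cluster:", detected_fraction)
print(f"complete sources: {n_complete} of {n_total}")
```

Even in this five-source toy catalog, requiring a measurement in every band discards more than half of the sample, while the binary label keeps all sources usable for characterizing the clusters.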
UC & Classification/Regression → [D’Abrusco, R. et al. 2009, MNRAS, 396, 223], [Laurino, O., D’Abrusco, R. et al. 2011, MNRAS, 418, 4]
CLaSPS → [D’Abrusco, R. et al. 2012, ApJ, 755, 2, 92]
WISE Blazars → [D’Abrusco, R. et al. 2012, ApJ, 748, 68], [Massaro, F., D’Abrusco, R. et al. 2012, ApJ, 752, 61]