Analyzing astronomical data with Apache Spark

Analyzing astronomical data with Apache Spark
Julien Peloton, in collaboration with Christian Arnault & Stéphane Plaszczynski
Laboratoire de l'Accélérateur Linéaire
Statistical challenges for large-scale structure in the era of LSST

Motivation
On the one hand, future telescopes will collect huge amounts of data (O(1) TB/day), which is unprecedented in the field of astronomy. On the other hand, Big Data communities have dealt with such data volumes (and even larger ones) for many years. An efficient framework for tackling Big Data problems is Apache Spark.

Apache Spark
Apache Spark is a cluster-computing framework. It started as a research project at UC Berkeley in 2009, is open source (Apache 2.0 license), and is used by more than 1000 companies around the world.

Apache Spark, Hadoop & HDFS
Credo: "It is cheaper to move computation rather than data". Spark was initially developed to overcome limitations of the MapReduce paradigm (Read → Map → Reduce). To work, Spark needs a cluster manager and a distributed storage system (e.g. HDFS).
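The Read → Map → Reduce pattern can be illustrated on a single machine. The following is a minimal sketch in plain Python, not Spark code: the partition contents and the function names are made up for illustration, and each inner list stands in for one block of a distributed file.

```python
from functools import reduce

# Hypothetical input: three "partitions" of values, as if read from three file blocks.
partitions = [[21.5, 19.0, 23.1], [18.2, 22.4], [20.0]]

def map_partition(values):
    """Map step: each partition independently emits a partial (count, sum)."""
    return (len(values), sum(values))

def combine(a, b):
    """Reduce step: merge two partial (count, sum) results into one."""
    return (a[0] + b[0], a[1] + b[1])

# Only the small partial results travel to the reducer, not the raw data.
partials = [map_partition(p) for p in partitions]
count, total = reduce(combine, partials)
mean = total / count
print(count, mean)
```

The map step runs where each partition lives and only the small (count, sum) pairs are combined, which is the sense in which computation is moved rather than data.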

Data sources
Spark has built-in data sources, but mostly for simply structured formats. A popular file format for storing and manipulating data in astronomy is FITS. The FITS format has a complex, heterogeneous structure (image/table HDU, header, data block). Structured data source formats require a specific implementation to distribute the data.
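To make the header/data block structure concrete, here is a minimal sketch in plain Python that builds a toy FITS-style header in memory and parses it back. It relies only on two facts from the FITS standard: header cards are 80 ASCII characters long, and headers and data are padded to multiples of 2880 bytes. The helper names are hypothetical, the parser handles only simple "keyword = value" cards, and this is not how spark-fits itself reads headers.

```python
CARD = 80     # bytes per header card (FITS standard)
BLOCK = 2880  # bytes per FITS block, i.e. 36 cards

def make_header(cards):
    """Build a padded FITS-style header from (keyword, integer value) pairs."""
    lines = [f"{key:<8}= {value:>20}".ljust(CARD) for key, value in cards]
    lines.append("END".ljust(CARD))
    raw = "".join(lines)
    # Pad with blanks up to a whole number of 2880-byte blocks.
    padded_len = -(-len(raw) // BLOCK) * BLOCK
    return raw.ljust(padded_len).encode("ascii")

def parse_header(data):
    """Read cards until END; return (keyword -> value string, data start offset)."""
    header = {}
    for offset in range(0, len(data), CARD):
        card = data[offset:offset + CARD].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            # The data block starts at the next multiple of 2880 bytes.
            data_start = -(-(offset + CARD) // BLOCK) * BLOCK
            return header, data_start
        if card[8:10] == "= ":
            # Simplification: drop any trailing "/ comment" and blanks.
            header[keyword] = card[10:].split("/")[0].strip()
    raise ValueError("no END card found")

raw = make_header([("BITPIX", 8), ("NAXIS", 2), ("NAXIS1", 32), ("NAXIS2", 1000)])
header, data_start = parse_header(raw)
print(header, data_start)
```

In a table HDU, NAXIS1 is the number of bytes per row and NAXIS2 the number of rows, so the header alone is enough to locate any row in the data block.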

Handling FITS data in Spark: several challenges
How do we read a FITS file in Scala? How do we access the data of a FITS file that is distributed across machines?
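One way to think about the second question is to cut the data block into byte ranges that never split a table row, so that each machine can decode its range on its own. The sketch below is an illustration under that assumption, not a description of the spark-fits implementation; the function name and the numbers in the example are hypothetical.

```python
def row_aligned_splits(data_start, row_bytes, n_rows, n_partitions):
    """Split a fixed-width table into byte ranges that never cut a row in half."""
    base, extra = divmod(n_rows, n_partitions)
    splits = []
    first_row = 0
    for i in range(n_partitions):
        # Spread the remainder over the first `extra` partitions.
        rows_here = base + (1 if i < extra else 0)
        if rows_here == 0:
            continue  # more partitions than rows: skip empty ranges
        start = data_start + first_row * row_bytes
        stop = start + rows_here * row_bytes
        splits.append((start, stop))
        first_row += rows_here
    return splits

# Hypothetical table: data block at byte 5760, 32 bytes per row, 10 rows, 3 partitions.
splits = row_aligned_splits(data_start=5760, row_bytes=32, n_rows=10, n_partitions=3)
print(splits)
```

Each (start, stop) pair is a half-open byte range whose length is a multiple of the row size, so a reader that seeks to `start` always begins at the first byte of a row.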

spark-fits
spark-fits integrates seamlessly with Apache Spark: a simple "drag-and-drop" of a FITS file into e.g. HDFS gives you full access to the whole framework. Catalogs are supported (images are in preparation). There are APIs for Scala, Python, R, and Java.
https://github.com/JulienPeloton/spark-fits

spark-fits API: quick start
In Scala

    // Define a DataFrame with the data from the first HDU
    val df = spark.read.format("com.sparkfits")
      .option("hdu", 1)
      .load("hdfs://...")

    // The DataFrame schema is inferred from the header
    df.show(4)

    +----------+---------+----------+-----+-----+
    |    target|       RA|       Dec|Index|RunId|
    +----------+---------+----------+-----+-----+
    |NGC0000000| 3.448297| -0.338748|    0|    1|
    |NGC0000001| 4.493667| -1.441499|    1|    1|
    |NGC0000002| 3.787274|  1.329837|    2|    1|
    |NGC0000003| 3.423602| -0.294571|    3|    1|
    +----------+---------+----------+-----+-----+

spark-fits API: quick start
In Python

    # Define a DataFrame with the data from the first HDU
    df = spark.read.format("com.sparkfits")\
        .option("hdu", 1)\
        .load("hdfs://...")

    # The DataFrame schema is inferred from the header
    df.show(4)

    +----------+---------+----------+-----+-----+
    |    target|       RA|       Dec|Index|RunId|
    +----------+---------+----------+-----+-----+
    |NGC0000000| 3.448297| -0.338748|    0|    1|
    |NGC0000001| 4.493667| -1.441499|    1|    1|
    |NGC0000002| 3.787274|  1.329837|    2|    1|
    |NGC0000003| 3.423602| -0.294571|    3|    1|
    +----------+---------+----------+-----+-----+

I/O performance
VirtualData @ Université Paris-Sud: 1 driver + 9 executors.
9 × 17 Intel Core processors (Haswell) @ 2 GB RAM @ 2.6 GHz.

Scala, Python, Java, and R API
spark-fits is written in Scala, but takes advantage of Spark's interfacing mechanisms. As far as I/O is concerned, performance is similar across the different APIs.
[Figure: bar chart of iteration time (s) for the Scala API and the Python API, first iteration vs later iterations. Values printed on the plot: 96.0, 84.0, 2.7, 2.4.]

First application: LSST-like catalogs
Problem: distribute galaxy catalog data (> 10^10 objects) and manipulate it efficiently using spark-fits.
Data set: LSST-like data from CoLoRe, 110 GB (10 years target).
a) Ref: just count the data (I/O reference).
b) Shells: split the data into redshift shells with Δz = 0.1, project into Healpix maps, and reduce the data.
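The "shells" operation can be sketched in plain Python as a key-by-shell-and-pixel count. This is an illustration only: the records are made up, and the pixel id is given directly instead of being computed by a Healpix library from sky coordinates, which is what a real pipeline would do.

```python
import math
from collections import Counter

DZ = 0.1  # shell width in redshift, as on the slide

def shell_index(z, dz=DZ):
    """Index k of the redshift shell containing z, where shell k starts at k * dz.

    Values lying exactly on a shell edge may fall on either side because of
    floating-point rounding; this sketch does not handle that case specially.
    """
    return math.floor(z / dz)

# Hypothetical records: (redshift, pixel id).
galaxies = [(0.15, 7), (0.17, 7), (0.25, 3), (0.42, 7), (0.12, 2)]

# Map each galaxy to a (shell, pixel) key, then reduce by key by counting.
counts = Counter((shell_index(z), pix) for z, pix in galaxies)
print(sorted(counts.items()))
```

Each shell then yields one map of counts per pixel, which is the reduced data product; in Spark the same grouping would be expressed on the distributed DataFrame rather than on a local list.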

First application: LSST-like catalogs (shells)
[Figure: four map panels titled "LSST 1Y - redshift 0.1-0.2", "LSST 1Y - redshift 0.2-0.3", "LSST 1Y - redshift 0.3-0.4", and "LSST 1Y - redshift 0.4-0.5", each with a color scale from -1 to 1.]

spark-fits at NERSC
Although not ideal, Spark can be used to process data stored in HPC-style shared file systems. Here, 1.2 TB is processed on Cori (NERSC) with spark-fits, using 40 Haswell nodes (1280 cores in total).
[Figure: bar chart of iteration time (s) with 1 OST and with 8 OSTs, first iteration vs later iterations. Values printed on the plot: 69.5, 57.1, 2.4, 2.1.]

Conclusion & perspectives
A set of new tools for astronomy to enter the Big Data era:
  Library to manipulate FITS in Scala.
  Spark connector to distribute FITS data across machines, and perform efficient data exploration.
Proof of concept:
  Demonstrated robustness over a wide range of data set sizes.
  Distributed up to 1.2 TB of FITS data across machines, and explored the data interactively.
Perspectives:
  Keep extending the tools, e.g. to manipulate image HDUs.
  Develop science cases, and integrate them into current efforts.
Your collaboration is welcome!

Backup: First application: LSST-like catalogs
Problem: distribute galaxy catalog data (> 10^10 objects) and manipulate it efficiently using spark-fits.
Data set: LSST-like data from CoLoRe, 110 GB (10 years target).
a) Ref: just count the data (I/O reference).
b) Shells: split the data into redshift shells with Δz = 0.1, project into Healpix maps, and reduce the data.
c) Neighbours: find all galaxies in the catalogs contained in a circle of center (x0, y0) and radius 1 deg, and reduce the data.
d) Cross-match: find common objects between two sets of catalogs with sizes 110 GB and 3.5 GB, respectively, and reduce the data.
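The "neighbours" selection amounts to a cone search. The sketch below assumes that the circle center (x0, y0) is given as (RA, Dec) in degrees and uses the haversine formula for the angular separation; the catalog entries and function names are hypothetical, and the slides do not say which formula was actually used.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) points in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, which stays accurate for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def in_cone(catalog, ra0, dec0, radius_deg=1.0):
    """Names of catalog objects within radius_deg of the center (ra0, dec0)."""
    return [name for name, ra, dec in catalog
            if angular_separation_deg(ra, dec, ra0, dec0) <= radius_deg]

# Hypothetical catalog rows: (name, RA in degrees, Dec in degrees).
catalog = [("A", 10.0, 0.0), ("B", 10.5, 0.5), ("C", 12.0, 0.0), ("D", 10.0, -0.9)]
neighbours = in_cone(catalog, ra0=10.0, dec0=0.0)
print(neighbours)
```

On a distributed catalog the same predicate would be applied as a filter on each partition, followed by a count or collect as the reduce step.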

Backup: Distribution(s) of FITS
Problem: how to distribute and manipulate the data of a FITS file?

Backup: Spark, Scala, and Python
In the Scala world, there is no real equivalent to the numpy, matplotlib, and scipy packages. Using the Jep package, we can interface Scala with the Python world.
Installation:
Pip install:
    pip install jep --user
Build with sbt:
    unmanagedBase := file("/path/to/python3.5/site-packages/jep")

Backup: HDF5
https://github.com/valiantljk/h5spark
https://www.hdfgroup.org/downloads/spark-connector
