SLIDE 1

IPython Notebook as a Unified Data Science Interface for Hadoop

Casey Stella Spring, 2015

Casey Stella (Hortonworks) IPython Notebook as a Unified Data Science Interface for Hadoop Spring, 2015

SLIDE 2

Table of Contents

Preliminaries Data Science in Hadoop Unified Environment Demo Questions

SLIDE 3

Introduction

  • I’m a Principal Architect at Hortonworks
  • I work primarily doing Data Science in the Hadoop Ecosystem
  • Prior to this, I’ve spent my time and had a lot of fun
      • Doing data mining on medical data at Explorys using the Hadoop ecosystem
      • Doing signal processing on seismic data at Ion Geophysical using MapReduce
      • Being a graduate student in the Math department at Texas A&M in algorithmic complexity theory

SLIDE 8

Data Science in Hadoop

Hadoop is a great environment for data transformation, but as a data science environment it poses challenges.

  • Expressing both data transformation and data science algorithms naturally in a single system is a challenging line to toe.
  • The popular data science languages with mature external libraries are not the JVM languages.
  • Systems for presenting the output of data science and analysis (summary analysis and visualizations) are often either limited in capability or require extensive custom coding.

A unified environment for data science is elusive, but we do have a great start with the Python bindings of Spark and IPython Notebook.

SLIDE 17

Unified Data Science Environment

What are the components of a unified data science environment?

  • A single environment supporting mixed-mode local and distributed processing: Apache Spark
  • The ability to “reach out” to languages with heavy data science algorithm support: PySpark
  • Strong, seamless SQL integration: Spark SQL
  • Ability to visualize and report summary data: IPython Notebook

SLIDE 18

Apache Spark

Apache Spark is an alternative computing system which can run on YARN and provides

  • An elegant, rich and usable core API
  • An expansive set of ecosystem libraries built around the core API
  • Hive compatibility via Spark SQL
  • Mature Python support for both the core APIs and the Spark ecosystem projects

SLIDE 19

Spark: Core Ideas

The core API facilitates expressing algorithms in terms of transformations of distributed datasets.

  • Datasets are distributed and resilient (hence the name RDD, Resilient Distributed Dataset)
  • Datasets are automatically rebuilt on failure
  • Datasets have configurable persistence
  • Transformations are parallel (e.g. map, reduceByKey, filter)
  • Transformations support some relational primitives (e.g. join, cartesian product)
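
The transformations named on this slide can be illustrated without a cluster. What follows is a minimal plain-Python sketch of what map, filter, reduceByKey, join and cartesian product compute on small in-memory lists. It does not use Spark: the helper names `reduce_by_key` and `join` are hypothetical, and the records are invented for illustration. In Spark the equivalent operations would run in parallel over the partitions of an RDD.

```python
from collections import defaultdict
from functools import reduce
from itertools import product

# Toy (key, value) records, invented for this sketch.
records = [("spark", 1), ("hadoop", 1), ("spark", 1), ("hive", 1), ("spark", 1)]


def reduce_by_key(pairs, func):
    """Combine all values that share a key using func (local stand-in for reduceByKey)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


def join(left, right):
    """Inner join of two lists of (key, value) pairs on the key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]


# filter: drop the "hive" records; map: upper-case every key.
filtered = [(word, n) for (word, n) in records if word != "hive"]
mapped = [(word.upper(), n) for (word, n) in filtered]

# reduceByKey: sum the counts per key.
counts = reduce_by_key(mapped, lambda a, b: a + b)

# join: attach a (made-up) category to each key.
categories = [("SPARK", "compute"), ("HADOOP", "storage")]
joined = join(sorted(counts.items()), categories)

# cartesian product: every pairing of two small collections.
pairs = list(product(["a", "b"], [1, 2]))
```

Each step takes a collection and returns a new one without modifying its input, which is the property that lets a distributed engine recompute a lost piece of a dataset from the transformations that produced it.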

SLIDE 20

PySpark: Python Bindings

In addition to Java and Scala, Spark has solid integration with Python:

  • Supports the standard CPython interpreter
  • There is Python support for the Spark core APIs and most ecosystem APIs, such as MLlib.

  • IPython Notebook support comes out of the box

SLIDE 21

Spark: SQL Integration

Spark SQL is the Spark component that lets you query structured data in Spark using SQL. It

  • Has integrated APIs in Python, Scala and Java
  • Allows you to integrate Spark Core APIs with SQL
  • Provides Hive metastore integration so that data managed in Hive can be seamlessly processed via Spark
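
The idea of mixing SQL with ordinary code can be sketched locally. The example below is only a stand-in: it uses Python's built-in `sqlite3` module rather than Spark SQL, and the table name, column names and rows are all invented for illustration. The pattern it shows is the one described above: SQL expresses the relational step, and regular Python code continues from the query result.

```python
import sqlite3

# Invented toy rows: (physician, payment_type, amount_usd).
rows = [
    ("dr_a", "meals", 25.0),
    ("dr_a", "travel", 400.0),
    ("dr_b", "meals", 15.0),
    ("dr_b", "speaking fees", 1000.0),
]

# An in-memory database stands in for a table managed elsewhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (physician TEXT, payment_type TEXT, amount_usd REAL)"
)
conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)

# Step 1: SQL handles grouping and aggregation.
totals = conn.execute(
    "SELECT payment_type, SUM(amount_usd) FROM payments "
    "GROUP BY payment_type ORDER BY payment_type"
).fetchall()
conn.close()

# Step 2: plain Python post-processes the query result.
grand_total = sum(amount for _, amount in totals)
shares = {ptype: round(amount / grand_total, 3) for ptype, amount in totals}
```

With Spark SQL the query result would itself be a distributed dataset, so the second step could use the core API transformations instead of local Python.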

SLIDE 22

Open Payments Data

Sometimes, doctors and hospitals have financial relationships with health care manufacturing companies. These relationships can include money for research activities, gifts, speaking fees, meals, or travel. The Social Security Act requires CMS to collect information from applicable manufacturers and group purchasing organizations (GPOs) in order to report information about their financial relationships with physicians and hospitals.

Let’s use Python and Spark via IPython Notebook to explore this dataset on Hadoop.

SLIDE 23

Questions

Thanks for your attention! Questions?

  • Code & scripts for this talk are available on my GitHub presentations page: http://github.com/cestella/presentations/
  • Find me at http://caseystella.com
  • Twitter handle: @casey_stella
  • Email address: cstella@hortonworks.com