11/16/2015

Spark: A Coding Joyride

Doug Bateman

Director of Training, NewCircle

Objectives

  • Show Spark's ability to rapidly process Big Data
  • Extract information with RDDs
  • Query data using DataFrames
  • Visualize and plot data
  • Create a machine-learning pipeline with Spark ML and MLlib
  • Discuss the internals that make Spark 10-100 times faster than Hadoop MapReduce and Hive

About Me

Engineer, Architect & Instructor

  • Developing with Java since 1995 (Java 1.0)
  • 15+ years as software developer, architect, and consultant
  • Director of Training at NewCircle
  • Curriculum Lead at NewCircle

About Me: For Fun

  • Sailing
  • Rock climbing
  • Snowboarding
  • Chess

Who are you?

  0) I am new to Spark.
  1) I have used Spark hands-on before.
  2) I have more than 1 year of hands-on experience with Spark.

Goal: a unified engine across data sources, workloads, and environments.


[Diagram: the Spark stack. Spark Core (RDD API) at the base, with Spark SQL (DataFrames API), Spark Streaming, MLlib, and GraphX on top; data sources such as {JSON} feed in from below; deployment environments include YARN.]

Spark: 100% open source and mature

Used in production by over 500 organizations, from Fortune 100 companies to small innovators.

Apache Spark: Large user community

[Chart: commits in the past year, comparing Spark with MapReduce, YARN, HDFS, and Storm on a scale of 1000-4000 commits; Spark leads.]

Large-Scale Usage

  • Largest cluster: 8,000 nodes
  • Largest single job: 1 petabyte
  • Top streaming intake: 1 TB/hour
  • 2014 on-disk 100 TB sort record

On-Disk Sort Record

Time to sort 100 TB:

  2013 record (Hadoop): 2,100 machines, 72 minutes
  2014 record (Spark): 207 machines, 23 minutes

Source: Daytona GraySort benchmark, sortbenchmark.org

Spark Physical Cluster

[Diagram: a Spark Driver JVM coordinating four Executor JVMs, each Executor providing two task slots.]


Spark Physical Cluster

[Diagram: the same cluster at runtime; tasks occupy some of the executor slots while others remain free.]
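The slot counts in the diagrams above map directly onto launch configuration: each executor is a JVM, and its cores are its task slots. A hedged sketch of a matching launch command (the flag values and script name are illustrative, not from the talk):

```shell
# Request 4 executor JVMs, each with 2 task slots (cores),
# mirroring the four-executor / two-slot diagram above.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4G \
  my_app.py
```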

Power Plant Demo

Use Case: predict power output given a set of readings from various sensors in a gas-fired power generation plant.

Schema Definition:

  AT = Atmospheric Temperature (C)
  V  = Exhaust Vacuum Speed
  AP = Atmospheric Pressure
  RH = Relative Humidity
  PE = Power Output (the value we are trying to predict)

Steps:

  1. ETL
  2. Explore + Visualize Data
  3. Apply Machine Learning

About Databricks

Data science made easy

  • Cloud-based integrated workspace for Apache Spark
  • From the original Spark team at UC Berkeley
  • The Databricks team contributed more than 75% of the code added to Spark in the past year

About NewCircle

Software Development Training for the Enterprise

  • Courses tailored for your team
  • Custom learning pathways & training programs
  • Global delivery


A few of our courses

  • Spark Developer Bootcamp
  • Android Internals
  • Android Testing
  • Core AngularJS
  • Advanced Python
  • Fast Track to Java 8
  • Spring & Hibernate Bootcamp
  • Apache HTTPD & Tomcat Administration Bootcamp

“In all honesty, this is one of the best technical classes I’ve ever taken (and I’ve been doing this a very long time).”

Paul, Salesforce

Learn more at: https://databricks.com/spark/training

Thanks!

30-Day Free Trial of Databricks
Visit: bit.ly/spark-bootcamp

15% off Spark Developer Bootcamp Training
Visit: https://newcircle.com/spark (Promo Code: QCON15)

Thank you.


Spark Fundamentals

Professor Anthony D. Joseph, UC Berkeley

Strata NYC September 2015

http://training.databricks.com/sparkcamp.zip

Transforming RDDs

[Diagram: logLinesRDD (the input/base RDD) holds a mix of Error, Warn, and Info log lines; applying .filter(f(x)) produces a new RDD, errorsRDD, containing only the Error lines.]

Transformations → Actions

[Diagram: errorsRDD is repartitioned with .coalesce(2) into cleanedRDD; the action .collect() then returns the results to the Driver. Execute DAG!]

Lifecycle

sc.textFile("hdfs://log")
  .filter(f(x))
  .coalesce(2)
  .collect()

Lifecycle

[Diagram: the full lineage logLinesRDD → errorsRDD (.filter(f(x))) → cleanedRDD (.coalesce(2)) → Driver (.collect()).]

Lifecycle

[Diagram: the lineage can branch. A further .filter(f(x)) on errorsRDD produces errorMsg1RDD (only the msg1 error lines), and actions such as .count(), .collect(), and .saveAsTextFile() can be run on any RDD in the graph.]

Lifecycle

[Diagram: the same branching lineage, now with .cache() on an intermediate RDD so its contents are kept in memory and reused across multiple actions.]

Partition → Task → Partition

[Diagram: each partition of logLinesRDD (a HadoopRDD) is processed by its own task (Task-1 through Task-4), and .filter(f(x)) writes the corresponding partition of errorsRDD (a filteredRDD).]

Lifecycle of a Spark Program

  • Create input RDDs from external data
  • … or parallelize a collection in your driver program
  • Use transformations to lazily transform them and create new RDDs
  • … using transformations like filter() or map()
  • Ask Spark to cache() any intermediate RDDs that will be reused
  • Execute actions to kick off a parallel computation
  • … such as count() and collect()
  • Optimized and executed by Spark

End of Spark Fundamentals Module