Spark & Spark SQL: High-Speed In-Memory Analytics over Hadoop (PowerPoint PPT Presentation)

SLIDE 1

Slides adapted from Matei Zaharia (MIT) and Oliver Vagner (TGI Fridays)

Spark & Spark SQL

High-Speed In-Memory Analytics over Hadoop and Hive Data

CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech

Instructor: Duen Horng (Polo) Chau

SLIDE 2

What is Spark?

Not a modified version of Hadoop; a separate, fast, MapReduce-like engine

» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop

Compatible with Hadoop’s storage APIs

» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

http://spark.apache.org

SLIDE 3

What is Spark SQL?

(Formerly called Shark)

Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x

SLIDE 4

Project History [latest: v1.1]

Spark project started in 2009 at UC Berkeley AMP lab, open sourced in 2010

Became Apache Top-Level Project in Feb 2014
Shark/Spark SQL started summer 2011
Built by 250+ developers and people from 50 companies
Scales to 1000+ nodes in production
In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, …


http://en.wikipedia.org/wiki/Apache_Spark

SLIDES 5-6

Why a New Programming Model?

MapReduce greatly simplified big data analysis, but as soon as it got popular, users wanted more:

» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries

These require faster data sharing across parallel jobs

SLIDE 7

Is MapReduce dead?

Up for debate… as of 10/7/2014

http://www.reddit.com/r/compsci/comments/296aqr/on_the_death_of_mapreduce_at_google/

http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/


SLIDES 8-9

Data Sharing in MapReduce

[Diagram: an iterative job reads Input, then each of iter. 1, iter. 2, … performs an HDFS read and an HDFS write; for interactive use, each of query 1, query 2, query 3 re-reads Input from HDFS to produce result 1, result 2, result 3]

Slow due to replication, serialization, and disk IO

SLIDES 10-11

Data Sharing in Spark

[Diagram: one-time processing loads Input into distributed memory; iter. 1, iter. 2, … and query 1, query 2, query 3 then share data through memory instead of HDFS]

10-100× faster than network and disk

SLIDE 12

Spark Programming Model

Key idea: resilient distributed datasets (RDDs)

» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure

Interface

» Clean language-integrated API in Scala
» Can be used interactively from the Scala or Python console
» Supported languages: Java, Scala, Python, R
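A minimal sketch of the model in action, as one might type it into the Scala Spark shell (where sc, a SparkContext, is already provided; the data here is made up):

// An RDD: a distributed collection built from a local range
val nums = sc.parallelize(1 to 100)

// Transformations are lazy; cache() marks the result to be kept in memory
val evenSquares = nums.map(n => n * n).filter(_ % 2 == 0).cache()

// Actions trigger the computation; the second action reuses the cached data
println(evenSquares.count())
println(evenSquares.take(5).mkString(", "))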

SLIDE 13

http://www.scala-lang.org/old/faq/4
Functional programming in D3: http://sleptons.blogspot.com/2015/01/functional-programming-d3js-good-example.html
Scala vs Java 8: http://kukuruku.co/hub/scala/java-8-vs-scala-the-difference-in-approaches-and-mutual-innovations

SLIDES 14-32

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")            // base RDD
errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

[Diagram: a Driver program coordinates three Workers; each Worker reads one of Block 1 / Block 2 / Block 3 of the input from HDFS and keeps its partition of cachedMsgs in Cache 1 / Cache 2 / Cache 3; the Driver sends tasks to the Workers and collects results]

cachedMsgs.filter(_.contains("foo")).count      // action
cachedMsgs.filter(_.contains("bar")).count
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

http://www.slideshare.net/normation/scala-dreaded

SLIDE 33

Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) and use it to recompute lost data

E.g.: messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))

[Lineage graph: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]
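One way to inspect lineage directly is RDD.toDebugString, which prints the chain of parent RDDs Spark would replay to rebuild lost partitions. A quick Spark shell sketch (the path is a placeholder, as on the slide):

val messages = sc.textFile("hdfs://...")      // placeholder path
  .filter(_.contains("error"))
  .map(_.split('\t')(2))

// Prints the lineage from the mapped and filtered RDDs back to the
// HadoopRDD, which is what Spark recomputes from after a failure
println(messages.toDebugString)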


SLIDES 34-37

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once
var w = Vector.random(D)                                // initial parameter vector
for (i <- 1 to ITERATIONS) {                            // repeated MapReduce steps
  val gradient = data.map(p =>                          // to do gradient descent
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
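The slide's code relies on helpers (readPoint, Vector.random) from the original Spark examples. Here is a self-contained sketch of the same algorithm using plain Scala arrays; the input path, line format, D, and ITERATIONS are all assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.exp

object LogisticRegressionSketch {
  case class Point(x: Array[Double], y: Double)

  val D = 10           // assumed feature dimension
  val ITERATIONS = 20  // assumed iteration count

  // Assumed line format: label followed by D features, space-separated
  def readPoint(line: String): Point = {
    val t = line.split(' ').map(_.toDouble)
    Point(t.tail, t.head)
  }

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lr-sketch").setMaster("local[*]"))

    // Parse the points once and keep them cached in memory
    val data = sc.textFile("/tmp/points.txt").map(readPoint).cache() // hypothetical path

    var w = Array.fill(D)(math.random) // initial parameter vector
    for (_ <- 1 to ITERATIONS) {
      // One MapReduce-style pass per iteration over the cached RDD
      val gradient = data.map { p =>
        val s = (1 / (1 + exp(-p.y * dot(w, p.x))) - 1) * p.y
        p.x.map(_ * s)
      }.reduce((g1, g2) => g1.zip(g2).map { case (u, v) => u + v })
      w = w.zip(gradient).map { case (u, g) => u - g } // w -= gradient
    }
    println("Final w: " + w.mkString(", "))
  }
}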

SLIDE 38

Logistic Regression Performance

[Chart: running time (s) vs number of iterations (1, 5, 10, 20, 30), Hadoop vs Spark]

Hadoop: 127 s / iteration
Spark: 174 s for the first iteration, 6 s for further iterations

SLIDE 39

Supported Operators

map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
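A few of these operators in combination (a Spark shell sketch with made-up data):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

pairs.reduceByKey(_ + _).collect()                     // Array((a,4), (b,2))
pairs.groupByKey().mapValues(_.sum).collect()          // same result, larger shuffle
pairs.join(sc.parallelize(Seq(("a", "x")))).collect()  // Array((a,(1,x)), (a,(3,x)))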

SLIDE 40

Spark Users

SLIDE 41

Spark SQL: Hive on Spark

SLIDE 42

Motivation

Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes

Scala is good for programmers, but many data users only know SQL

Can we extend Hive to run on Spark?

SLIDE 43

Hive Architecture

[Diagram: Client (CLI, JDBC) submits to the Driver (SQL Parser → Query Optimizer → Physical Plan → Execution), which runs MapReduce jobs; a Metastore and HDFS sit alongside]

SLIDE 44

Spark SQL Architecture

[Diagram: same stack, but Execution runs on Spark instead of MapReduce, and the Driver gains a Cache Manager alongside the Query Optimizer]

[Engle et al, SIGMOD 2012]


SLIDES 45-46

Efficient In-Memory Storage

Simply caching Hive records as Java objects is inefficient due to high per-object overhead

Instead, Spark SQL employs column-oriented storage using arrays of primitive types

Row Storage:            Column Storage:
1 john  4.1             1    2    3
2 mike  3.5             john mike sally
3 sally 6.4             4.1  3.5  6.4

Benefit: similarly compact size to serialized data, but >5x faster to access
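The idea in miniature (a sketch in plain Scala, not Spark SQL's actual classes):

// Row storage: one object per record; numeric fields end up boxed in each tuple
val rows = Array((1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4))

// Column storage: one array per field; ints and doubles stay primitive and compact
val ids    = Array(1, 2, 3)
val names  = Array("john", "mike", "sally")
val scores = Array(4.1, 3.5, 6.4)

// Scanning one column walks a contiguous primitive array
// instead of dereferencing (and unboxing) every record object
val total = scores.sum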

SLIDE 47

Using Spark SQL

CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs

» A few esoteric features are not yet supported

Can also call from Scala to mix with Spark
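Mixing the two from Scala might look like this (a sketch against the Spark 1.x HiveContext API; the table and column names are invented):

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)   // sc: an existing SparkContext

// Standard HiveQL, including a CREATE TABLE ... AS SELECT like the one above
hc.sql("CREATE TABLE mydata_cached AS SELECT * FROM mydata WHERE year = 2014")

// Query results behave like RDDs of rows, so normal Spark operations apply
val names = hc.sql("SELECT name FROM mydata_cached")
names.map(row => row.getString(0)).take(10).foreach(println)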

SLIDE 48

Benchmark Query 1

SELECT * FROM grep WHERE field LIKE '%XYZ%';

SLIDE 49

Benchmark Query 2

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;

SLIDE 50

What’s Next?

Recall that Spark’s model was motivated by two emerging uses (interactive and multi-stage apps)

Another emerging use case that needs fast data sharing is stream processing:

» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.


SLIDES 51-53

Streaming Spark

Extends Spark to perform streaming computations

Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs

Intermixes seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLowerCase.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Diagram: micro-batches at T=1, T=2, … flow through map and then reduceByWindow]

Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency

[Zaharia et al, HotCloud 2012]

SLIDE 54

Spark Streaming

Create and operate on RDDs from live data streams at set intervals

Data is divided into batches for processing

Streams may be combined as a part of processing or analyzed with higher-level transforms
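A minimal DStream sketch against the Spark 1.x streaming API (the socket source, host, and port are illustrative; local mode needs at least two cores, one for the receiver):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))   // 1-second micro-batches

// Each batch interval yields an RDD of the lines received on the socket
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.toLowerCase.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(5))       // sliding 5-second window

counts.print()          // show a few results of each batch
ssc.start()
ssc.awaitTermination()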

SLIDE 55

Behavior with Not Enough RAM

[Chart: iteration time (s) vs % of working set in memory]

Cache disabled: 68.8 s
25% in memory:  58.1 s
50% in memory:  40.7 s
75% in memory:  29.7 s
Fully cached:   11.5 s

SLIDE 56

SPARK PLATFORM

[Diagram: the Spark platform stack]

Libraries: Spark SQL / Shark, Spark Streaming, MLlib, GraphX
Execution: RDDs with Scala/Python/Java APIs
Resource Management: YARN / Spark Standalone / Mesos
Data Storage: Standard FS / HDFS / CFS / S3

SLIDE 57

SLIDE 58

MLlib

Scalable machine learning library

Interoperates with NumPy

Available algorithms in 1.0 (a usage sketch follows the list):

» Linear Support Vector Machine (SVM)
» Logistic Regression
» Linear Least Squares
» Decision Trees
» Naïve Bayes
» Collaborative Filtering with ALS
» K-means
» Singular Value Decomposition
» Principal Component Analysis
» Gradient Descent
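For instance, training K-means via the 1.x MLlib API might look like this (a Spark shell sketch; the path and parameters are illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Assumed input: one space-separated feature vector per line (hypothetical path)
val data = sc.textFile("/tmp/features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = KMeans.train(data, 3, 20)   // k = 3 clusters, 20 iterations
model.clusterCenters.foreach(println)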

SLIDE 59

GraphX

Parallel graph processing

Extends RDD -> Resilient Distributed Property Graph

» Directed multigraph with properties attached to each vertex and edge

Limited algorithms:

» PageRank
» Connected Components
» Triangle Counts

Alpha component
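A small GraphX sketch (Spark shell; the toy graph is made up):

import org.apache.spark.graphx.{Edge, Graph}

// A directed multigraph: vertex property = name, edge property = relationship
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(vertices, edges)

// PageRank, iterated until scores move less than the tolerance
graph.pageRank(0.001).vertices.collect().foreach(println)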

SLIDE 60

Commercial Support

Databricks

» Not to be confused with DataStax
» Founded by members of the AMPLab
» Offering:

  • Certification
  • Training
  • Support
  • Databricks Cloud
