Spark & Spark SQL: High-Speed In-Memory Analytics over Hadoop and Hive Data

CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech
Spark & Spark SQL
High-Speed In-Memory Analytics over Hadoop and Hive Data
Instructor: Duen Horng (Polo) Chau
Slides adopted from Matei Zaharia (MIT) and Oliver Vagner (Manheim, GT)

What is Spark?
http://spark.apache.org
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop
Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

What is Spark SQL? (Formerly called Shark)
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x
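To make the idea of running SQL queries on Spark concrete, here is a minimal sketch. It assumes the SparkSession API from Spark 2.x and later, which is newer than the v1.1 release this deck describes. The `logs` view, its columns, and the sample rows are hypothetical stand-ins for an existing Hive table; with a real metastore one would typically enable Hive support on the session instead of registering a temporary view.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  // Counts rows per log level with a SQL query over an in-memory temp view.
  def countByLevel(spark: SparkSession): Map[String, Long] = {
    import spark.implicits._

    // Hypothetical sample data standing in for an existing Hive table.
    val logs = Seq(
      ("ERROR", "disk full"),
      ("INFO", "job started"),
      ("ERROR", "timeout")
    ).toDF("level", "message")
    logs.createOrReplaceTempView("logs")

    // The query is planned and executed by Spark, not by MapReduce.
    spark.sql("SELECT level, COUNT(*) AS n FROM logs GROUP BY level")
      .collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption for the sketch; a cluster would use another master.
    val spark = SparkSession.builder()
      .appName("SparkSqlSketch")
      .master("local[*]")
      .getOrCreate()
    try countByLevel(spark).foreach(println)
    finally spark.stop()
  }
}
```

Running this needs the spark-sql artifact on the classpath.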

Project History [latest: v1.1]
Spark project started in 2009 at UC Berkeley AMP lab, open sourced 2010
Became Apache Top-Level Project in Feb 2014
Shark/Spark SQL started summer 2011
Built by 250+ developers and people from 50 companies
Scale to 1000+ nodes in production
In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, …
http://en.wikipedia.org/wiki/Apache_Spark

Why a New Programming Model?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries
Require faster data sharing across parallel jobs

Up for debate … as of 10/7/2014
Is MapReduce dead?
http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/
http://www.reddit.com/r/compsci/comments/296aqr/on_the_death_of_mapreduce_at_google/

Data Sharing in MapReduce
[Figure: iterative jobs read their input from HDFS and write each iteration's output back to HDFS (Input, iter. 1, iter. 2, …); interactive queries (query 1, query 2, query 3, …) each read the input from HDFS to produce result 1, result 2, result 3, …]
Slow due to replication, serialization, and disk IO

Data Sharing in Spark
[Figure: the input goes through one-time processing into distributed memory; iterations (iter. 1, iter. 2, …) and queries (query 1, query 2, query 3, …) then share the data in memory]
10-100× faster than network and disk

Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure
Interface
» Clean language-integrated API in Scala
» Can be used interactively from Scala, Python console
» Supported languages: Java, Scala, Python
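The three RDD properties listed above can be sketched in a few lines of Scala. This is an illustrative example, not code from the original slides: the object name, the helper `evenSquareStats`, and the local master setting are assumptions made for the sketch. It builds a distributed collection, applies parallel operators, caches the result, and prints the lineage that Spark would use to rebuild lost partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  // Returns (number of even squares, their sum) for the integers 1..n.
  def evenSquareStats(sc: SparkContext, n: Int): (Long, Int) = {
    val numbers = sc.parallelize(1 to n)            // distributed collection
    val squares = numbers.map(x => x * x)           // parallel operator, evaluated lazily
    val evens = squares.filter(_ % 2 == 0).cache()  // mark for in-memory caching

    // The recorded lineage is what lets Spark recompute partitions after a failure.
    println(evens.toDebugString)

    val count = evens.count()      // first action: computes and caches the data
    val sum = evens.reduce(_ + _)  // second action: reuses the cached data
    (count, sum)
  }

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption for the sketch; a cluster would use another master.
    val conf = new SparkConf().setAppName("RddSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(evenSquareStats(sc, 10))
    finally sc.stop()
  }
}
```

Nothing is computed until `count()` runs; `map`, `filter`, and `cache()` only describe the dataset.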

http://www.scala-lang.org/old/faq/4
Java vs Scala: http://www.toptal.com/scala/why-should-i-learn-scala

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

    val lines = spark.textFile("hdfs://...")             // base RDD
    val errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD
    val messages = errors.map(_.split('\t')(2))          // keep the third tab-separated field
    val cachedMsgs = messages.cache()                    // keep messages in memory once computed

    cachedMsgs.filter(_.contains("foo")).count           // action: triggers the computation

[Figure: the driver sends tasks to three workers; each worker holds one block of the input (Block 1, Block 2, Block 3)]
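The log-mining example can be tried without an HDFS cluster. The following sketch is an assumption-laden variant, not the slide's code: it replaces the `hdfs://...` input with a small hypothetical in-memory log (`sampleLog`), runs in local mode, and wraps the pipeline in a helper named `countMatches` so that several patterns can be searched against the same cached messages.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogMiningSketch {
  // Hypothetical tab-separated log lines: level, date, message.
  val sampleLog: Seq[String] = Seq(
    "ERROR\t2014-10-07\tfoo failed to start",
    "INFO\t2014-10-07\tfoo started",
    "ERROR\t2014-10-07\tbar timed out",
    "ERROR\t2014-10-07\tfoo crashed"
  )

  // Counts, for each pattern, the error messages that contain it.
  def countMatches(sc: SparkContext, log: Seq[String], patterns: Seq[String]): Map[String, Long] = {
    val lines = sc.parallelize(log)                   // base RDD (stands in for textFile)
    val errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
    val messages = errors.map(_.split('\t')(2))       // third field is the message
    val cachedMsgs = messages.cache()                 // cached on the first action

    // Each count() is an action; queries after the first read from the cache.
    patterns.map(p => p -> cachedMsgs.filter(_.contains(p)).count()).toMap
  }

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption for the sketch; a cluster would use another master.
    val conf = new SparkConf().setAppName("LogMiningSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(countMatches(sc, sampleLog, Seq("foo", "bar")))
    finally sc.stop()
  }
}
```

Only the first query pays the cost of filtering and splitting the log; later queries scan the cached messages, which is the interactive pattern the slide describes.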
