CS555: Distributed Systems [Fall 2019]
Dept. of Computer Science, Colorado State University

CS 555: Distributed Systems [Spark]
Professor: Shrideep Pallickara
Computer Science, Colorado State University
October 3, 2019
Slides created by: Shrideep Pallickara

Frequently asked questions from the previous class survey
- Custom partitioners
  - How often are they used, and can you give an example?
- How does Hadoop decide whether or not to call the combiner?
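To make the first question concrete, here is a small conceptual sketch in plain Python (not Hadoop code) of what a partitioner decides: which reducer receives each key. Hadoop's default HashPartitioner effectively takes a non-negative hash of the key modulo the number of reduce tasks; a custom partitioner replaces that rule. The `year_partition` rule below is a hypothetical example, not from the lecture.

```python
# Conceptual sketch (plain Python, not Hadoop code) of partitioning:
# which reducer receives a given key.

def default_partition(key: str, num_reducers: int) -> int:
    # Mimics Hadoop's HashPartitioner: non-negative hash modulo reducer count
    return (hash(key) & 0x7FFFFFFF) % num_reducers

def year_partition(key: str, num_reducers: int) -> int:
    # Hypothetical custom rule: keys like "2019-10-03" are routed by year,
    # so every record for one year lands on the same reducer
    year = int(key.split("-")[0])
    return year % num_reducers

# All 2018 dates go to one reducer; 2019 dates go to a different one
assert year_partition("2018-01-15", 4) == year_partition("2018-12-31", 4)
assert year_partition("2019-10-03", 4) != year_partition("2018-10-03", 4)
```

A custom partitioner like this is useful whenever correctness or efficiency depends on related keys being grouped on the same reducer, which the default hash rule cannot guarantee.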
Topics covered in this lecture
- Spark
  - Software stack
  - Interactive shells in Spark
  - Core Spark concepts

Apache Spark
Spark: What is it?
- A cluster computing platform
  - Designed to be fast and general purpose
- Speed
  - Often considered a design alternative to Hadoop MapReduce
  - Extends the MapReduce model to support more types of computations
    - Interactive queries, iterative algorithms, and stream processing
- Why is speed important?
  - It is the difference between waiting hours for a result and exploring data interactively

Spark: Influences and innovations
- Spark inherited parts of its API, design, and supported formats from existing computational frameworks
  - Particularly DryadLINQ
- Spark's internals, especially how it handles failures, differ from many traditional systems
- Spark's ability to combine lazy evaluation with in-memory computation makes it particularly distinctive
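The lazy evaluation mentioned above can be illustrated with plain Python generators (this is a conceptual sketch, not Spark's API): each "transformation" only wraps the previous stage, and no element is actually processed until a result is requested.

```python
# Conceptual sketch of lazy evaluation using Python generators
# (not Spark's API): transformations build a pipeline, nothing runs yet.

def lazy_map(fn, source):
    return (fn(x) for x in source)          # describes a step; computes nothing

def lazy_filter(pred, source):
    return (x for x in source if pred(x))   # also just describes a step

data = range(1_000_000)                     # nothing materialized yet
pipeline = lazy_map(lambda x: x * x,
                    lazy_filter(lambda x: x % 2 == 0, data))

# Only when a result is requested do elements flow through the pipeline;
# deferring work this way is what lets an engine plan and fuse stages.
first_five = [next(pipeline) for _ in range(5)]
assert first_five == [0, 4, 16, 36, 64]
```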
Where does Spark fit in the analytics ecosystem?
- Spark provides generalizable methods to process data in parallel
- On its own, Spark is not a data storage solution
  - It performs computations in Spark JVMs that last only for the duration of a Spark application
- Spark is used in tandem with:
  - A distributed storage system (e.g., HDFS, Cassandra, or S3) to house the data processed with Spark
  - A cluster manager to orchestrate the distribution of Spark applications across the cluster

Key enabling idea in Spark
- Memory-resident data
- Spark loads data into the memory of worker nodes
  - Processing is performed on memory-resident data
A look at the memory hierarchy

  Item              Time                       Scaled time in human terms (2 billion times slower)
  Processor cycle   0.5 ns (2 GHz)             1 second
  Cache access      1 ns (1 GHz)               2 seconds
  Memory access     70 ns                      140 seconds
  Context switch    5,000 ns (5 µs)            167 minutes
  Disk access       7,000,000 ns (7 ms)        162 days
  Quantum           100,000,000 ns (100 ms)    6.3 years

Source: Kay Robbins & Steve Robbins, Unix Systems Programming, 2nd edition, Prentice Hall.

Spark covers a wide range of workloads
- Batch applications
- Iterative algorithms
- Queries
- Stream processing
- Previously, this range required multiple, independent tools
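The "human terms" column is just each latency multiplied by 2 billion; the arithmetic can be checked directly (values taken from the table above):

```python
# Reproduce the "scaled time" column: each latency times 2 billion
import math

SCALE = 2_000_000_000

def scaled_seconds(latency_seconds: float) -> float:
    return latency_seconds * SCALE

assert math.isclose(scaled_seconds(0.5e-9), 1.0)     # processor cycle -> 1 second
assert math.isclose(scaled_seconds(70e-9), 140.0)    # memory access -> 140 seconds
assert round(scaled_seconds(7e-3) / 86_400) == 162   # disk access -> ~162 days
assert round(scaled_seconds(100e-3) / (365 * 86_400), 1) == 6.3  # quantum -> ~6.3 years
```

Seen at this scale, a disk access costs half a year of "human" time while a memory access costs minutes, which is exactly why keeping data memory-resident matters so much to Spark.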
Running Spark
- You can use Spark from Python, Java, Scala, R, or SQL
- Spark itself is written in Scala and runs on the Java Virtual Machine (JVM)
  - You can run Spark either on your laptop or on a cluster; all you need is an installation of Java
- If you want to use the Python API, you will also need a Python interpreter (version 2.7 or later)
- If you want to use R, you will need a version of R on your machine

Spark integrates well with other tools
- It can run in Hadoop clusters
- It can access Hadoop data sources, including Cassandra
At its core, Spark is a computational engine
- Spark is responsible for several aspects of applications that comprise many tasks across many machines (compute clusters)
- Its responsibilities include:
  1. Scheduling
  2. Distribution
  3. Monitoring

Spark execution
- The cluster of machines that Spark uses to execute tasks is managed by a cluster manager
  - Spark's standalone cluster manager, YARN, or Mesos
- We submit Spark applications to these cluster managers, which grant resources to the application so it can complete its work
Spark applications
- Spark applications consist of:
  - A driver process
    - The driver process is essential: it is the heart of a Spark application and maintains all relevant information during the lifetime of the application
  - A set of executor processes

The driver
- The driver process runs your main() function and sits on a node in the cluster
- The driver is responsible for three things:
  - Maintaining information about the Spark application
  - Responding to a user's program or input
  - Analyzing, distributing, and scheduling work across the executors
The executors
- The executors are responsible for actually carrying out the work that the driver assigns them
- Each executor is responsible for only two things:
  - Executing the code assigned to it by the driver, and
  - Reporting the state of the computation on that executor back to the driver node

Architecture of a Spark application
[Figure: the driver process (containing the SparkSession and user code) communicates with a set of executors; a cluster manager mediates resource allocation]
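The driver/executor split can be sketched as a toy illustration in plain Python (threads standing in for executor processes; this is not Spark code): the driver partitions the work and schedules it, the "executors" run the code assigned to them, and the driver combines what they report back.

```python
# Toy driver/executor split (plain Python, not Spark): the driver
# partitions and schedules the work; executors run assigned code and
# report partial results back to the driver.
from concurrent.futures import ThreadPoolExecutor

def executor_task(partition):
    # The code the driver "ships" to an executor: here, a partial sum
    return sum(partition)

def driver(data, num_executors=4):
    # Driver responsibilities: split the work, distribute it, collect results
    chunk = len(data) // num_executors
    partitions = [data[i * chunk:(i + 1) * chunk]
                  for i in range(num_executors - 1)]
    partitions.append(data[(num_executors - 1) * chunk:])
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(executor_task, partitions))
    return sum(partial_results)   # combine what the executors reported back

assert driver(list(range(101))) == 5050
```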
How Spark runs Python or R
- You write Python or R code that Spark translates into code it can then run on the executor JVMs
[Figure: a Python or R process communicates with the driver's JVM process through the SparkSession, which relays work to the executors]

SparkSession
- We need a way to send user commands and data to a Spark application
  - We do that by first creating a SparkSession
- The SparkSession instance is the way Spark executes user-defined manipulations across the cluster
  - There is a one-to-one correspondence between a SparkSession and a Spark application
  - In Scala and Python, the variable is available as spark when you start the console
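In PySpark the session is obtained with `SparkSession.builder.getOrCreate()`; the one-to-one correspondence above follows from the getOrCreate pattern, which this minimal plain-Python sketch illustrates (the `Session` class is illustrative, not Spark's implementation): repeated calls within one application hand back the same object.

```python
# Minimal sketch (plain Python, not Spark's implementation) of the
# getOrCreate() pattern behind SparkSession: repeated requests within one
# application return the same session, preserving the one-to-one
# correspondence between a session and an application.

class Session:
    _active = None          # at most one active session per application

    @classmethod
    def get_or_create(cls):
        if cls._active is None:
            cls._active = cls()
        return cls._active

spark1 = Session.get_or_create()
spark2 = Session.get_or_create()
assert spark1 is spark2     # the same session, however often it is requested
```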