CS 555: DISTRIBUTED SYSTEMS [SPARK]
Shrideep Pallickara
Dept. of Computer Science, Colorado State University
CS555: Distributed Systems [Fall 2019], October 8, 2019, L13
Professor: Shrideep Pallickara
Slides created by: Shrideep Pallickara

Frequently asked questions from the previous class survey
- Why use Hadoop if Spark is so much faster?

Topics covered in this lecture
- Orchestration Plans
- Transformations and Dependencies
- Spark Resilient Distributed Datasets

A simple Scala word count example

    import org.apache.spark.rdd.RDD

    def simpleWordCount(rdd: RDD[String]): RDD[(String, Int)] = {
      // Split each line into words
      val words = rdd.flatMap(_.split(" "))
      // Pair each word with an initial count of 1
      val wordPairs = words.map((_, 1))
      // Sum the counts for each distinct word
      val wordCounts = wordPairs.reduceByKey(_ + _)
      wordCounts
    }
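
To see what each step of the word count produces without a cluster, here is a minimal sketch that mirrors the same three steps on an in-memory Seq. The object name LocalWordCount is hypothetical, and groupBy followed by a sum is only a local stand-in for reduceByKey; it illustrates the data flow, not how Spark distributes the work.

```scala
object LocalWordCount {
  // Mirrors simpleWordCount, but on an in-memory Seq instead of an RDD.
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    // One element per word
    val words = lines.flatMap(_.split(" "))
    // (word, 1) pairs
    val wordPairs = words.map((_, 1))
    // Local stand-in for reduceByKey(_ + _): group the pairs by word, then sum the counts
    wordPairs.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }
  }
}
```

For example, LocalWordCount.wordCount(Seq("a b a", "b c")) counts "a" twice, "b" twice, and "c" once.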

ORCHESTRATION PLANS

Executing Spark code in clusters: Overview
- Write DataFrame/Dataset/SQL code
- If the code is valid, Spark converts it to a Logical Plan
- Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the way
- Spark then executes this Physical Plan (RDD manipulations) on the cluster

Once you have the code ready
- Code is submitted either through the console or via a submitted job
- This code passes through the Catalyst Optimizer
  - Decides how the code should be executed
  - Lays out a plan for doing so before, finally, the code is run
    - And the result is returned to the user

The Catalyst Optimizer
[Figure: SQL, DataFrames, and Datasets feed into the Catalyst Optimizer, which produces a Physical Plan]

Logical Planning
- The logical plan only represents a set of abstract transformations
  - Does not refer to executors or drivers
  - Simply converts the user's set of expressions into the most optimized version
- The first step converts the user's code into an unresolved logical plan
  - This plan is unresolved because, although your code might be valid, the tables or columns that it refers to might or might not exist

How are columns and tables resolved?
- Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer
- The analyzer might reject the unresolved logical plan if the required table or column name does not exist in the catalog
- If the analyzer can resolve it, the result is passed through the Catalyst Optimizer
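
The resolution step can be sketched as a lookup against a catalog. This is a toy model in plain Scala, not Spark's analyzer: the names ToyAnalyzer, Catalog, UnresolvedQuery, and ResolvedQuery are all hypothetical, and a real catalog tracks far more than column names.

```scala
object ToyAnalyzer {
  // Hypothetical catalog: table name -> names of the columns in that table
  type Catalog = Map[String, Set[String]]

  // What the user wrote: names only, not yet checked against the catalog
  final case class UnresolvedQuery(table: String, columns: Seq[String])
  // The same query after every name has been found in the catalog
  final case class ResolvedQuery(table: String, columns: Seq[String])

  // Returns Right(resolved plan), or Left(error) if a table or column is missing
  def analyze(q: UnresolvedQuery, catalog: Catalog): Either[String, ResolvedQuery] =
    catalog.get(q.table) match {
      case None => Left(s"Table not found: ${q.table}")
      case Some(knownColumns) =>
        q.columns.find(c => !knownColumns.contains(c)) match {
          case Some(missing) => Left(s"Column not found: $missing")
          case None          => Right(ResolvedQuery(q.table, q.columns))
        }
    }
}
```

A query that is syntactically valid but names an unknown column comes back as a Left, which corresponds to the analyzer rejecting the unresolved logical plan.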

The Structured API Logical Planning Process
[Figure: User code -> Unresolved logical plan -> Analysis (using the Catalog) -> Resolved logical plan -> Logical optimization -> Optimized logical plan]

Catalyst Optimizer
- A collection of rules that attempt to optimize the logical plan by pushing down predicates or selections
- Catalyst is extensible
  - Users can include their own rules for domain-specific optimizations
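
To make "pushing down predicates" concrete, here is a minimal sketch of a single rewrite rule over a toy logical plan. The plan node types (Scan, Project, Filter) and the function pushDownFilters are hypothetical and far simpler than Catalyst's own classes. Predicates are opaque strings, and the sketch assumes every predicate only references columns that exist in the scanned table, so moving it below a projection is always safe here.

```scala
object ToyOptimizer {
  sealed trait Plan
  // A table scan, optionally with filters evaluated while reading the data
  final case class Scan(table: String, pushedFilters: List[String] = Nil) extends Plan
  final case class Project(columns: List[String], child: Plan) extends Plan
  final case class Filter(predicate: String, child: Plan) extends Plan

  // Rule: move each Filter as close to the Scan as possible, then fold it into the Scan
  def pushDownFilters(plan: Plan): Plan = plan match {
    // Swap a Filter with the Project beneath it, then keep pushing
    case Filter(p, Project(cols, child)) => Project(cols, pushDownFilters(Filter(p, child)))
    // A Filter sitting directly on a Scan becomes part of the Scan
    case Filter(p, Scan(table, filters)) => Scan(table, filters :+ p)
    // Stacked Filters: push the inner one first, then retry the outer one
    case Filter(p, inner: Filter)        => pushDownFilters(Filter(p, pushDownFilters(inner)))
    case Project(cols, child)            => Project(cols, pushDownFilters(child))
    case s: Scan                         => s
  }
}
```

Applied to Filter("count > 10", Project(List("origin", "count"), Scan("flights"))), the rule yields a Project over a Scan that carries the filter, so rows are discarded before the projection runs.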

Physical Planning [1/2]
- The physical plan specifies how the logical plan will execute on the cluster
- Involves generating different physical execution strategies and comparing them through a cost model
- An example of the cost comparison might be choosing how to perform a given join by looking at the physical attributes of a given table
  - How big the table is, or
  - How big its partitions are

Physical Planning [2/2]
- Physical planning results in a series of RDDs and transformations
- This is why Spark is also referred to as a compiler
  - Takes queries in DataFrames, Datasets, and SQL and compiles them into RDD transformations
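
The join example under Physical Planning [1/2] can be sketched as picking the cheapest of several candidate strategies. Everything here is an illustrative assumption: the names ToyJoinPlanner and TableStats, the two candidate strategies, and especially the cost formula (bytes moved over the network), which is not Spark's actual cost model.

```scala
object ToyJoinPlanner {
  sealed trait JoinStrategy
  case object BroadcastHashJoin extends JoinStrategy
  case object SortMergeJoin extends JoinStrategy

  // The physical attribute the planner looks at: how big the table is
  final case class TableStats(name: String, sizeInBytes: Long)

  // Hypothetical cost: approximate bytes moved over the network
  def cost(strategy: JoinStrategy, left: TableStats, right: TableStats, numExecutors: Int): Long =
    strategy match {
      // Ship a full copy of the smaller table to every executor
      case BroadcastHashJoin => math.min(left.sizeInBytes, right.sizeInBytes) * numExecutors
      // Shuffle both tables once so that matching keys land together
      case SortMergeJoin     => left.sizeInBytes + right.sizeInBytes
    }

  // Generate the candidate strategies and keep the one with the lowest cost
  def choose(left: TableStats, right: TableStats, numExecutors: Int): JoinStrategy =
    Seq[JoinStrategy](BroadcastHashJoin, SortMergeJoin).minBy(cost(_, left, right, numExecutors))
}
```

Under this toy cost, joining a very large table with a small one favors broadcasting the small table, while joining two large tables favors shuffling both.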

The Physical Planning Process
[Figure: Optimized logical plan -> Physical plans -> Cost model -> Best physical plan -> Executed on the cluster]

Execution
- Spark performs further optimizations at runtime
- Generates native Java bytecode that can remove entire tasks or stages during execution
- Finally, the result is returned to the user

WIDE AND NARROW TRANSFORMATIONS

Transformations and Dependencies
- Two categories of dependencies
  - Narrow
    - Each parent partition is used by at most one child partition
  - Wide
    - Multiple child partitions may depend on a single parent partition
- The narrow versus wide distinction has significant implications for the way Spark evaluates a transformation and, consequently, for its performance
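
The two definitions can be checked mechanically once the dependencies are written down. In this minimal sketch, the object DependencyKind and the encoding of dependencies as a Seq[Set[Int]] (for each child partition, the indices of the parent partitions it reads) are assumptions made for illustration; Spark represents dependencies with its own classes.

```scala
object DependencyKind {
  // deps(i) = indices of the parent partitions that child partition i reads.
  // Narrow: each parent partition is used by at most one child partition.
  def isNarrow(deps: Seq[Set[Int]]): Boolean = {
    // One entry per (child, parent) edge, keeping only the parent index
    val parentUses = deps.flatMap(_.toSeq)
    // If any parent index repeats, that parent feeds more than one child
    parentUses.distinct.size == parentUses.size
  }

  // Wide: some parent partition is used by more than one child partition
  def isWide(deps: Seq[Set[Int]]): Boolean = !isNarrow(deps)
}
```

A one-to-one mapping such as Seq(Set(0), Set(1), Set(2)) is narrow, while Seq(Set(0, 1), Set(0, 1)), where every child reads every parent, is wide.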

Narrow Transformations
- Narrow transformations are those in which each input partition contributes to only one output partition
- Can be determined at design time, irrespective of the values of the records in the parent partitions
- Partitions in narrow transformations can either depend on:
  - One parent (such as in the map operator), or
  - A unique subset of the parent partitions that is known at design time (coalesce)
- Narrow transformations can be executed on an arbitrary subset of the data without any information about the other partitions

Dependencies between partitions for narrow transformations
[Figure: dependencies between PARENT and CHILD partitions]
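
Both kinds of narrow dependency can be modeled with plain collections. This is a sketch under stated assumptions: partitions are represented as Vectors, the names NarrowOps and mapEachPartition are hypothetical, and the coalesce here simply merges adjacent parents by index, which is a simplification of how Spark actually groups partitions.

```scala
object NarrowOps {
  type Partition[A] = Vector[A]

  // map: each child partition is computed from exactly one parent partition,
  // so any partition can be processed without looking at the others
  def mapEachPartition[A, B](parents: Vector[Partition[A]])(f: A => B): Vector[Partition[B]] =
    parents.map(_.map(f))

  // coalesce: each child merges a fixed, disjoint group of adjacent parents.
  // The grouping depends only on partition indices, never on record values.
  // Produces at most numChildren partitions.
  def coalesce[A](parents: Vector[Partition[A]], numChildren: Int): Vector[Partition[A]] = {
    require(numChildren > 0, "numChildren must be positive")
    val groupSize = math.max(1, math.ceil(parents.size.toDouble / numChildren).toInt)
    parents.grouped(groupSize).map(_.flatten).toVector
  }
}
```

With four parent partitions and numChildren = 2, parents 0 and 1 feed the first child and parents 2 and 3 feed the second, so no parent is shared between children.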
