SLIDE 1

CS5412 / Lecture 25 Apache Spark and RDDs

Kishore Pusukuri, Spring 2019

HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2018SP 1

SLIDE 2

Recap

2

MapReduce

  • For easily writing applications that process vast amounts of data in parallel on large clusters in a reliable, fault-tolerant manner
  • Takes care of scheduling tasks, monitoring them, and re-executing failed tasks

  • HDFS & MapReduce: run on the same set of nodes → compute nodes and storage nodes are the same (keeping data close to the computation) → very high throughput
  • YARN & MapReduce: a single master resource manager, one slave node manager per node, and one AppMaster per application

SLIDE 3

Today’s Topics

3

  • Motivation
  • Spark Basics
  • Spark Programming
SLIDE 4

History of Hadoop and Spark

4

SLIDE 5

Apache Spark

5

[Stack diagram: the Hadoop and Spark ecosystems side by side]

  • Processing: Spark Core, Spark SQL, Spark Streaming, Spark ML, and other applications
  • Resource manager: YARN (Yet Another Resource Negotiator), Mesos, or Spark Core's standalone scheduler
  • Data storage: Hadoop Distributed File System (HDFS), Hadoop NoSQL database (HBase), S3, Cassandra, and other storage systems
  • Data ingestion systems: e.g., Apache Kafka, Flume, etc.

** Spark can connect to several types of cluster managers (either Spark's own standalone cluster manager, Mesos, or YARN)

SLIDE 6

Apache Hadoop Lacks a Unified Vision

6

  • Sparse Modules
  • Diversity of APIs
  • Higher Operational Costs
SLIDE 7

Spark Ecosystem: A Unified Pipeline

7

Note: Spark is not designed for real-time IoT. The streaming layer is used for continuous input streams like financial data from stock markets, where events occur steadily and must be processed as they occur. But there is no sense of direct I/O from sensors/actuators. For IoT use cases, Spark would not be suitable.

SLIDE 8

Key ideas

In Hadoop, each developer tends to invent his or her own style of work. With Spark, there is a serious effort to standardize around the idea that people are writing parallel code that often runs for many "cycles" or "iterations" in which a lot of information is reused. Spark centers on Resilient Distributed Datasets (RDDs) that capture the information being reused.

8

SLIDE 9

How this works

You express your application as a graph of RDDs. The graph is only evaluated as needed, and Spark only computes the RDDs actually required for the output you have requested. Spark can then be told to cache the reusable information either in memory, in SSD storage, or even on disk, based on when it will be needed again, how big it is, and how costly it would be to recreate. You write the RDD logic and control all of this via hints (a minimal sketch follows).
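A minimal sketch of what such a caching hint looks like in PySpark, assuming a SparkContext named sc and an illustrative HDFS path (both are assumptions, not part of the lecture):

from pyspark import StorageLevel

lines = sc.textFile("hdfs://namenode:9000/logs/app.log")    # assumed input path
errors = lines.filter(lambda s: s.startswith("ERROR"))

# Hint: keep this RDD in memory, spilling to disk if it does not fit
errors.persist(StorageLevel.MEMORY_AND_DISK)

errors.count()                                     # first action computes and caches the RDD
errors.filter(lambda s: "timeout" in s).count()    # reuses the cached copy instead of re-reading the file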

9

SLIDE 10

Motivation (1)

10

MapReduce: The original scalable, general-purpose processing engine of the Hadoop ecosystem

  • Disk-based data processing framework (HDFS files)
  • Persists intermediate results to disk
  • Data is reloaded from disk with every query → costly I/O
  • Best for ETL-like workloads (batch processing)
  • Costly I/O → not appropriate for iterative or stream processing workloads

SLIDE 11

Motivation (2)

11

Spark: General-purpose computational framework that substantially improves performance of MapReduce, but retains the basic model

  • Memory-based data processing framework → avoids costly I/O by keeping intermediate results in memory
  • Leverages distributed memory
  • Remembers the operations applied to a dataset
  • Data-locality-based computation → high performance
  • Best for both iterative (or stream processing) and batch workloads

SLIDE 12

Motivation - Summary

12

Software engineering point of view

  • Hadoop code base is huge
  • Contributions/Extensions to Hadoop are cumbersome
  • Java-only hinders wide adoption, but Java support is fundamental

System/Framework point of view

  • Unified pipeline
  • Simplified data flow
  • Faster processing speed

Data abstraction point of view

  • New fundamental abstraction RDD
  • Easy to extend with new operators
  • More descriptive computing model
SLIDE 13

Today’s Topics

13

  • Motivation
  • Spark Basics
  • Spark Programming
SLIDE 14

Spark Basics (1)

14

Spark: Flexible, in-memory data processing framework written in Scala

Goals:

  • Simplicity (easier to use): rich APIs for Scala, Java, and Python
  • Generality: APIs for different types of workloads (batch, streaming, machine learning, graph)
  • Low latency (performance): in-memory processing and caching
  • Fault tolerance: faults shouldn't be a special case
SLIDE 15

Spark Basics (2)

15

There are two ways to manipulate data in Spark:

  • Spark Shell: interactive, for learning or data exploration; Python or Scala
  • Spark Applications: for large-scale data processing; Python, Scala, or Java
SLIDE 16

Spark Core: Code Base (2012)

16

SLIDE 17

Spark Shell

17

The Spark Shell provides interactive data exploration (REPL)

REPL: Read/Evaluate/Print Loop
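As an illustration (assuming Spark is installed and its bin/ directory is on the PATH), the shell is started from the command line:

$ pyspark        # Python shell; starts with a preconfigured SparkContext named "sc"
$ spark-shell    # Scala shell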

SLIDE 18

Spark Fundamentals

18

  • Spark Context
  • Resilient Distributed Dataset (RDD)
  • Transformations
  • Actions

Example of an application:

SLIDE 19

Spark Context (1)

19

  • Every Spark application requires a Spark context: the main entry point to the Spark API (a minimal standalone example follows this list)
  • The Spark Shell provides a preconfigured Spark context called "sc"
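A minimal sketch of a standalone application creating its own Spark context; the application name and local master URL are illustrative assumptions:

from pyspark import SparkConf, SparkContext

# Configuration: application name and which cluster manager to connect to
conf = SparkConf().setAppName("MyApp").setMaster("local[*]")   # local mode, for illustration
sc = SparkContext(conf=conf)

data = sc.parallelize([1, 2, 3, 4])
print(data.count())   # => 4

sc.stop()   # release resources when the application is done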
SLIDE 20

Spark Context (2)

20

  • Standalone applications → driver code → Spark Context
  • Spark Context holds configuration information and represents a connection to a Spark cluster

Standalone Application (Drives Computation)

SLIDE 21

Spark Context (3)

21

Spark context works as a client and represents a connection to a Spark cluster

SLIDE 22

Spark Fundamentals

22

  • Spark Context
  • Resilient Distributed Dataset (RDD)
  • Transformations
  • Actions

Example of an application:

SLIDE 23

Resilient Distributed Dataset

23

RDD (Resilient Distributed Dataset) is the fundamental unit of data in Spark: an immutable collection of objects (or records, or elements) that can be operated on "in parallel" (spread across a cluster)

Resilient -- if data in memory is lost, it can be recreated

  • Recover from node failures
  • An RDD keeps its lineage information → it can be recreated from parent RDDs (a small lineage-inspection sketch follows)

Distributed -- processed across the cluster

  • Each RDD is composed of one or more partitions (more partitions → more parallelism)

Dataset -- initial data can come from a file or be created in memory
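A minimal sketch of inspecting an RDD's lineage, assuming a SparkContext named sc:

nums = sc.parallelize([1, 2, 3, 4])     # a dataset created from data in memory
doubled = nums.map(lambda x: x * 2)     # a child RDD derived from its parent

# toDebugString() shows the lineage: the chain of parent RDDs Spark would
# replay to recreate "doubled" if a partition were lost
print(doubled.toDebugString())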

SLIDE 24

RDDs

24

Key Idea: Write applications in terms of transformations on distributed datasets. One RDD per transformation.

  • Organize the RDDs into a DAG showing how data flows.
  • An RDD can be saved and reused or recomputed. Spark can save it to disk if the dataset does not fit in memory.
  • Built through parallel transformations (map, filter, group-by, join, etc). Automatically rebuilt on failure.
  • Controllable persistence (e.g. caching in RAM)
SLIDE 25

RDDs are designed to be "immutable"

25

  • Create once, then reuse without changes. Spark knows the lineage → an RDD can be recreated at any time → fault tolerance
  • Avoids data inconsistency problems (no simultaneous updates) → correctness
  • Easily live in memory as on disk → caching → safe to share across processes/tasks → improves performance
  • Tradeoff: (fault tolerance & correctness) vs (disk memory & CPU)

SLIDE 26

Creating an RDD

26

Three ways to create an RDD (sketched below):

  • From a file or set of files
  • From data in memory
  • From another RDD
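A minimal sketch of the three approaches, assuming a SparkContext named sc; the file path is an illustrative assumption:

# 1. From a file or set of files
lines = sc.textFile("hdfs://namenode:9000/data/*.txt")

# 2. From data in memory
nums = sc.parallelize([1, 2, 3, 4, 5])

# 3. From another RDD, by applying a transformation
squares = nums.map(lambda x: x * x)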
SLIDE 27

Example: A File-based RDD

27

SLIDE 28

Spark Fundamentals

28

  • Spark Context
  • Resilient Distributed Dataset (RDD)
  • Transformations
  • Actions

Example of an application:

SLIDE 29

RDD Operations

29

Two types of operations:

  • Transformations: define a new RDD based on current RDD(s)
  • Actions: return values

SLIDE 30

RDD Transformations

30

  • Set of operations on an RDD that define how it should be transformed
  • As in relational algebra, applying a transformation to an RDD yields a new RDD (because RDDs are immutable)
  • Transformations are lazily evaluated, which allows optimizations to take place before execution
  • Examples: map(), filter(), groupByKey(), sortByKey(), etc.

SLIDE 31

Example: map and filter Transformations

31
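The original slide showed this example as a figure; a minimal sketch with illustrative values, assuming a SparkContext named sc:

nums = sc.parallelize([1, 2, 3, 4])

# map: pass every element through a function
squares = nums.map(lambda x: x * x)            # => {1, 4, 9, 16}

# filter: keep only the elements that satisfy a predicate
even = squares.filter(lambda x: x % 2 == 0)    # => {4, 16}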

SLIDE 32

RDD Actions

32

  • Apply transformation chains on RDDs, eventually performing some additional operations (e.g., counting)
  • Some actions only store data to an external data source (e.g. HDFS); others fetch data from the RDD (and its transformation chain) upon which the action is applied, and convey it to the driver
  • Some common actions (a short sketch follows the list):
  • count() – return the number of elements
  • take(n) – return an array of the first n elements
  • collect() – return an array of all elements
  • saveAsTextFile(file) – save to text file(s)
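A short sketch of these actions on a small in-memory RDD (the values and output directory are illustrative assumptions):

nums = sc.parallelize([5, 3, 1, 2])

nums.count()                      # => 4
nums.take(2)                      # => [5, 3]
nums.collect()                    # => [5, 3, 1, 2]
nums.saveAsTextFile("out_dir")    # writes one text file per partition; fails if the directory already exists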
SLIDE 33

Lazy Execution of RDDs (1)

33

Data in RDDs is not processed until an action is performed

SLIDE 34

Lazy Execution of RDDs (2)

34

Data in RDDs is not processed until an action is performed

SLIDE 35

Lazy Execution of RDDs (3)

35

Data in RDDs is not processed until an action is performed

SLIDE 36

Lazy Execution of RDDs (4)

36

Data in RDDs is not processed until an action is performed

SLIDE 37

Lazy Execution of RDDs (5)

37

Data in RDDs is not processed until an action is performed

Output: an action "triggers" the computation (pull model)
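A minimal sketch of this behavior, assuming a SparkContext named sc and an assumed log-file path:

lines = sc.textFile("hdfs://namenode:9000/logs/app.log")    # nothing is read yet
errors = lines.filter(lambda s: s.startswith("ERROR"))      # still nothing is processed
first5 = errors.take(5)                                     # the action triggers reading and filtering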

SLIDE 38

Example: Mine error logs

38

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                       # HadoopRDD
errors = lines.filter(lambda s: s.startswith("ERROR"))     # FilteredRDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()

Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data)

SLIDE 39

Key Idea: Elastic parallelism

RDD operations are designed to offer embarrassing parallelism. Spark will spread the task over the nodes where data resides, offering a highly concurrent execution that minimizes delays. Term: "partitioned computation". If some component crashes or even is just slow, Spark simply kills that task and launches a substitute.

39

SLIDE 40

RDD and Partitions (Parallelism example)

40
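The original slide showed this as a figure; a minimal sketch of how partitioning controls parallelism, with an illustrative partition count and a SparkContext named sc:

nums = sc.parallelize(range(1000), 8)    # ask for 8 partitions explicitly
nums.getNumPartitions()                  # => 8; each partition can be processed by a separate task
nums.map(lambda x: x * x).count()        # the map runs as 8 parallel tasks, one per partition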

SLIDE 41

RDD Graph: Data Set vs Partition Views

41

Much like in Hadoop MapReduce, each RDD is associated with (input) partitions

SLIDE 42

RDDs: Data Locality

42

  • Data Locality Principle
  • Keep high-value RDDs precomputed, in cache or on SSD
  • Run tasks that need the specific RDD with those same inputs on the node where the cached copy resides
  • This can maximize in-memory computational performance

Requires cooperation between your hints to Spark when you build the RDD, the Spark runtime and optimization planner, and the underlying YARN resource manager.

SLIDE 43

RDDs -- Summary

43

RDDs are partitioned, locality-aware, distributed collections

  • RDDs are immutable

RDDs are data structures that:

  • Either point to a direct data source (e.g. HDFS)
  • Or apply some transformations to their parent RDD(s) to generate new data elements

Computations on RDDs

  • Represented by lazily evaluated lineage DAGs composed of chained RDDs

SLIDE 44

Lifetime of a Job in Spark

44

SLIDE 45

Anatomy of a Spark Application

45

Cluster Manager (YARN/Mesos)

SLIDE 46

Typical RDD pattern of use

Instead of doing a lot of work in each RDD, developers split tasks into lots of small RDDs. These are then organized into a DAG. The developer anticipates which ones will be costly to recompute and hints to Spark that it should cache those.

46

SLIDE 47

Why is this a good strategy?

Spark tries to run tasks that will need the same intermediary data on the same nodes. If MapReduce jobs were arbitrary programs, this wouldn't help because reuse would be very rare. But in fact the MapReduce model is very repetitious and iterative, and often applies the same transformations again and again to the same input files.

  • Those particular RDDs become great candidates for caching.
  • The MapReduce programmer may not know how many iterations will occur, but Spark itself is smart enough to evict RDDs if they don't actually get reused.

47

SLIDE 48

Iterative Algorithms: Spark vs MapReduce

48

SLIDE 49

Today’s Topics

49

  • Motivation
  • Spark Basics
  • Spark Programming
SLIDE 50

Spark Programming (1)

50

Creating RDDs

# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

# Use existing Hadoop InputFormat (Java/Scala only)
sc.hadoopFile(keyClass, valClass, inputFmt, conf)

SLIDE 51

Spark Programming (2)

51

Basic Transformations

nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squares = nums.map(lambda x: x * x)            # => {1, 4, 9}

# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)    # => {4}

SLIDE 52

Spark Programming (3)

52

Basic Actions

nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
nums.collect()    # => [1, 2, 3]

# Return first K elements
nums.take(2)      # => [1, 2]

# Count number of elements
nums.count()      # => 3

# Merge elements with an associative function
nums.reduce(lambda x, y: x + y)    # => 6

SLIDE 53

Spark Programming (4)

53

Working with Key-Value Pairs

Spark's "distributed reduce" transformations operate on RDDs of key-value pairs

Python:
pair = (a, b)
pair[0]    # => a
pair[1]    # => b

Scala:
val pair = (a, b)
pair._1    // => a
pair._2    // => b

Java:
Tuple2 pair = new Tuple2(a, b);
pair._1    // => a
pair._2    // => b

SLIDE 54

Spark Programming (5)

54

Some Key-Value Operations

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y)    # => {(cat, 3), (dog, 1)}
pets.groupByKey()                       # => {(cat, [1, 2]), (dog, [1])}
pets.sortByKey()                        # => {(cat, 1), (cat, 2), (dog, 1)}

SLIDE 55

Example: Word Count

55

lines = sc.textFile("hamlet.txt")
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)

SLIDE 56

Example: Spark Streaming

56

Represents streams as a series of RDDs over time (typically sub-second intervals, but it is configurable)

val spammers = sc.sequenceFile("hdfs://spammers.seq")

sc.twitterStream(...)
  .filter(t => t.text.contains("Santa Clara University"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
  .print()

SLIDE 57

Spark: Combining Libraries (Unified Pipeline)

HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2018SP 57

# Load data using Spark SQL
points = spark.sql("select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)

SLIDE 58

Spark: Setting the Level of Parallelism

58

All the pair RDD operations take an optional second parameter for the number of tasks

words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)

SLIDE 59

Summary

Spark is a powerful "manager" for big data computing. It centers on a job scheduler for Hadoop (MapReduce) that is smart about where to run each task: co-locate the task with its data. The data objects are "RDDs": a kind of recipe for generating a file from an underlying data collection. RDD caching allows Spark to run mostly from memory-mapped data, for speed.

59

  • Online tutorials: spark.apache.org/docs/latest