Data-Intensive Distributed Computing CS 431/461 451/651 (Fall 2019)

Part 2: From MapReduce to Spark (2/2)
Ali Abedi
These slides are available at http://roegiest.com/bigdata-2019w/
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

YARN
Hadoop’s (original) limitations:
• Can only run MapReduce
• What if we want to run other distributed frameworks?
YARN = Yet-Another-Resource-Negotiator
• Provides an API to develop any generic distributed application
• Handles scheduling and resource requests
• MapReduce (MR2) is one such application in YARN

Hadoop MapReduce Architecture (Hadoop v1.0)
[Figure: the namenode (NN) runs the namenode daemon and the jobtracker (JT) runs the jobtracker daemon; each worker node runs a tasktracker daemon and a datanode daemon on top of the Linux file system.]

Hadoop v1.0

Hadoop v2.0

Spark Architecture

Algorithm Design

Closure
Takes type X and returns type X
• 3 + 4 = 7 (int + int = int)
• 5 / 2 = 2.5 (int / int = float, so division is not closed over ints)

Identity
“concept of nothing”
• 5 + 0 = 5
• 5 * 1 = 5
• {3, 11, 9} + {} = {3, 11, 9}
• Initializing a counter to zero

Associativity
Add parentheses anywhere
• 1 + 2 + 3 = (1 + 2) + 3 = 1 + (2 + 3)
• (10 / 2) / 5 != 10 / (2 / 5)
• Huge jobs can become many small jobs

Commutativity
Reordering
• 1 + 2 + 3 = 2 + 3 + 1
• 10 / 2 != 2 / 10

Monoid
• Closure (int + int = int)
• Identity (1 + 0 = 1)
• Associativity (1 + 2 + 3 = (1 + 2) + 3)
Commutative Monoid
• A monoid whose operation is also commutative (1 + 2 = 2 + 1)
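
The laws on this slide can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the names `Monoid`, `is_associative`, `is_commutative`, `int_sum`, `set_union`, and `divide` are all hypothetical and chosen only for illustration. It mirrors the slide examples: integer addition and set union behave as commutative monoids, while division fails both associativity and commutativity.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Any, Callable


@dataclass(frozen=True)
class Monoid:
    """A binary operation together with its identity element."""
    combine: Callable[[Any, Any], Any]
    identity: Any


def is_associative(op, a, b, c):
    # (a op b) op c must equal a op (b op c)
    return op(op(a, b), c) == op(a, op(b, c))


def is_commutative(op, a, b):
    # a op b must equal b op a
    return op(a, b) == op(b, a)


# Integer addition with identity 0, and set union with the empty set
int_sum = Monoid(combine=lambda x, y: x + y, identity=0)
set_union = Monoid(combine=lambda x, y: x | y, identity=frozenset())


def divide(x, y):
    # Exact division, used only as a counterexample
    return Fraction(x) / Fraction(y)
```

Checking a law on a few sample values can only refute it, never prove it, so the helpers are best read as counterexample finders.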

Commutative Monoid and MapReduce
(1 + 1 + 1) + (1 + 1 + 1 + 1 + 1 + 1 + 1) + (1 + 1 + 1 + 1)
= 3 + 7 + 4
= 7 + 4 + 3
= 14
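
The same grouping and reordering can be imitated in a few lines of Python. This is an illustrative sketch only: `partial_sums` is a hypothetical helper, the chunk sizes 3, 7, and 4 are taken from the slide, and the shuffle stands in for partial results arriving in an unpredictable order.

```python
import random
from functools import reduce


def partial_sums(values, chunk_sizes):
    """Split values into consecutive chunks and sum each chunk separately."""
    sums, start = [], 0
    for size in chunk_sizes:
        sums.append(sum(values[start:start + size]))
        start += size
    return sums


ones = [1] * 14
partials = partial_sums(ones, [3, 7, 4])  # one partial sum per "task"

# Partial results may arrive in any order; shuffle to imitate that.
arrived = partials[:]
random.shuffle(arrived)

# 0 is the identity, so an empty input would also be handled.
total = reduce(lambda x, y: x + y, arrived, 0)
```

Associativity justifies summing each chunk on its own, and commutativity justifies ignoring the arrival order.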

Two superpowers: Associativity Commutativity (sorting)

Implications for distributed processing?
• You don’t know when the tasks begin
• You don’t know when the tasks end
• You don’t know when the tasks interrupt each other
• You don’t know when intermediate data arrive
• …

Word Count: Baseline
class Mapper {
  def map(key: Long, value: String) = {
    for (word <- tokenize(value)) {
      emit(word, 1)
    }
  }
}
class Reducer {
  def reduce(key: String, values: Iterable[Int]) = {
    var sum = 0  // running total for this key
    for (value <- values) {
      sum += value
    }
    emit(key, sum)
  }
}
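
To see the baseline run end to end, the job can be imitated in a single Python process. This is a hedged sketch rather than the Hadoop API: `tokenize` is assumed to split on whitespace after lowercasing, and `run_job` is a hypothetical stand-in for the framework's map, shuffle, and reduce phases.

```python
from collections import defaultdict


def tokenize(line):
    # Assumption: whitespace tokenization, lowercased
    return line.lower().split()


def map_word_count(offset, line):
    # Emit (word, 1) for every token, as in the slide's Mapper
    for word in tokenize(line):
        yield word, 1


def reduce_word_count(word, counts):
    # Sum all counts that share a key, as in the slide's Reducer
    yield word, sum(counts)


def run_job(records, mapper, reducer):
    """Imitate map -> shuffle (group by key) -> reduce in one process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = {}
    for key in sorted(groups):  # the framework delivers keys in sorted order
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


lines = [(0, "a rose is a rose"), (17, "is a rose")]
counts = run_job(lines, map_word_count, reduce_word_count)
```

Because addition over counts is a commutative monoid, the same reduce function could also serve as a combiner without changing the result.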

Computing the Mean: Version 1
class Mapper {
  def map(key: String, value: Int) = {
    emit(key, value)
  }
}
class Reducer {
  def reduce(key: String, values: Iterable[Int]) = {
    var sum = 0  // running total for this key
    var cnt = 0  // number of values seen
    for (value <- values) {
      sum += value
      cnt += 1
    }
    emit(key, sum / cnt)
  }
}

Computing the Mean: Version 3
class Mapper {
  def map(key: String, value: Int) =
    emit(key, (value, 1))  // each value becomes a (sum, count) pair
}
class Combiner {
  def reduce(key: String, values: Iterable[Pair]) = {
    var sum = 0
    var cnt = 0
    for ((s, c) <- values) {
      sum += s
      cnt += c
    }
    emit(key, (sum, cnt))  // same type in and out, so it can run zero or more times
  }
}
class Reducer {
  def reduce(key: String, values: Iterable[Pair]) = {
    var sum = 0
    var cnt = 0
    for ((s, c) <- values) {
      sum += s
      cnt += c
    }
    emit(key, sum / cnt)
  }
}
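
Why carry (sum, count) pairs instead of partial means? A small Python sketch makes the difference concrete. It is illustrative only: `to_pair`, `combine`, and `mean` are hypothetical names, and the two "tasks" are simply slices of one list. Pairs under component-wise addition form a commutative monoid with identity (0, 0), whereas averaging partial means gives a different answer whenever the tasks hold unequal amounts of data.

```python
from functools import reduce


def to_pair(value):
    # Mapper output: every value becomes (sum, count)
    return (value, 1)


def combine(p, q):
    # Component-wise addition: associative and commutative, identity (0, 0)
    return (p[0] + q[0], p[1] + q[1])


def mean(pair):
    # Final step, done once in the reducer
    s, c = pair
    return s / c


values = [1, 2, 3, 4, 5]

# Two hypothetical map tasks with unequal amounts of data
task_a = reduce(combine, map(to_pair, values[:2]), (0, 0))
task_b = reduce(combine, map(to_pair, values[2:]), (0, 0))

correct = mean(combine(task_a, task_b))
mean_of_means = (mean(task_a) + mean(task_b)) / 2
```

Here `correct` is the true mean of the five values, while `mean_of_means` weights the two-element task as heavily as the three-element one and so comes out lower.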

Co-occurrence Matrix: Stripes
class Mapper {
  def map(key: Long, value: String) = {
    for (u <- tokenize(value)) {
      val map = new Map()  // stripe for u: neighbor -> count, missing keys count as 0
      for (v <- neighbors(u)) {
        map(v) += 1
      }
      emit(u, map)
    }
  }
}
class Reducer {
  def reduce(key: String, values: Iterable[Map]) = {
    val map = new Map()
    for (value <- values) {
      map += value  // element-wise sum of stripes
    }
    emit(key, map)
  }
}
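
The stripes idea maps naturally onto Python's `collections.Counter`, whose `+=` performs the element-wise sum the reducer needs. The sketch below is an assumption-laden illustration: the slides do not define `neighbors`, so here every other token on the same line counts as a neighbor, and `cooccurrence` is a hypothetical single-process stand-in for the shuffle.

```python
from collections import Counter, defaultdict


def neighbors(tokens, i):
    # Assumption: every other token on the same line counts as a neighbor
    return tokens[:i] + tokens[i + 1:]


def map_stripes(line):
    # Emit one stripe (neighbor -> count) per token occurrence
    tokens = line.split()
    for i, u in enumerate(tokens):
        yield u, Counter(neighbors(tokens, i))


def reduce_stripes(stripes):
    # Element-wise sum of all stripes that share a key
    total = Counter()
    for stripe in stripes:
        total += stripe
    return total


def cooccurrence(lines):
    # Group stripes by key, then reduce each group
    grouped = defaultdict(list)
    for line in lines:
        for u, stripe in map_stripes(line):
            grouped[u].append(stripe)
    return {u: reduce_stripes(s) for u, s in grouped.items()}


matrix = cooccurrence(["a b c", "a b"])
```

Stripes under element-wise addition form a commutative monoid with the empty map as identity, which is why the reducer logic can double as a combiner.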

Synchronization: Pairs vs. Stripes
Approach 1: turn synchronization into an ordering problem
• Sort keys into correct order of computation
• Partition key space so each reducer receives appropriate set of partial results
• Hold state in reducer across multiple key-value pairs to perform computation
• Illustrated by the “pairs” approach
Approach 2: data structures that bring partial results together
• Each reducer receives all the data it needs to complete the computation
• Illustrated by the “stripes” approach

Because you can’t avoid this…
But commutative monoids help

Synchronization: Pairs vs. Stripes
Approach 1: turn synchronization into an ordering problem
• Sort keys into correct order of computation
• Partition key space so each reducer receives appropriate set of partial results
• Hold state in reducer across multiple key-value pairs to perform computation
• Illustrated by the “pairs” approach
Approach 2: data structures that bring partial results together
• Each reducer receives all the data it needs to complete the computation
• Illustrated by the “stripes” approach

f(B|A): “Pairs”
(a, *) → 32       Reducer holds this value in memory
(a, b1) → 3       (a, b1) → 3 / 32
(a, b2) → 12      (a, b2) → 12 / 32
(a, b3) → 7       (a, b3) → 7 / 32
(a, b4) → 1       (a, b4) → 1 / 32
…                 …
For this to work:
• Emit extra (a, *) for every bn in mapper
• Make sure all a’s get sent to same reducer (use partitioner)
• Make sure (a, *) comes first (define sort order)
• Hold state in reducer across different key-value pairs
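
The requirements listed above can be walked through in a small Python sketch. It is a single-process illustration with made-up input, not the slide's data: `MARGINAL`, `map_pairs`, `sort_key`, and `reduce_relative_frequencies` are hypothetical names, and the partitioner is not modeled because everything runs in one simulated reducer.

```python
from collections import defaultdict

MARGINAL = "*"  # special right-hand element standing for "any b"


def map_pairs(cooccurring):
    """For each observed (a, b), emit the pair itself and the marginal (a, *)."""
    for a, b in cooccurring:
        yield (a, b), 1
        yield (a, MARGINAL), 1


def sort_key(key):
    a, b = key
    # Group by a, and within each a put (a, *) ahead of every (a, b)
    return (a, b != MARGINAL, b)


def reduce_relative_frequencies(sorted_items):
    """Consume keys in sorted order, holding the current marginal in memory."""
    marginal = None
    result = {}
    for (a, b), count in sorted_items:
        if b == MARGINAL:
            marginal = count  # state kept across key-value pairs
        else:
            result[(a, b)] = count / marginal
    return result


def relative_frequencies(cooccurring):
    counts = defaultdict(int)
    for key, value in map_pairs(cooccurring):
        counts[key] += value  # stands in for shuffle plus summing
    ordered = sorted(counts.items(), key=lambda kv: sort_key(kv[0]))
    return reduce_relative_frequencies(ordered)


observed = [("a", "b1"), ("a", "b2"), ("a", "b2"), ("a", "b1"), ("c", "b1")]
freqs = relative_frequencies(observed)
```

The sort order does the synchronization: because (a, *) is guaranteed to arrive before any (a, b), the reducer never needs to buffer the individual pairs while waiting for the denominator.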

Two superpowers: Associativity Commutativity (sorting)

When you can’t “monoidify”…
Sequence your computations by sorting

Algorithm design in a nutshell…
• Exploit associativity and commutativity via commutative monoids (if you can)
• Exploit framework-based sorting to sequence computations (if you can’t)
Source: Wikipedia (Walnut)
