Parallel applications in the cloud, Diana Naranjo Pomalaya, Computação Paralela e Distribuida


SLIDE 1

Parallel applications in the cloud

Diana Naranjo Pomalaya Computação Paralela e Distribuida

SLIDE 2

Agenda

  • Introduction
  • MapReduce
  • Solutions

    ○ Haloop
    ○ iMapReduce
    ○ Pig

SLIDE 3

Global Data Center Traffic

Source: Cisco Global Cloud Index, 2013–2018

SLIDE 4

Data-intensive applications

  • Sciences
    ○ Massive-scale simulation data analysis
    ○ Sensor deployments
    ○ High-throughput lab equipment
  • Industry
    ○ Web-data analysis
    ○ Click-stream analysis
    ○ Network-monitoring log analysis

SLIDE 5

MapReduce

[Diagram: MapReduce dataflow. On each mapper: Map, Local Sort, Combine. Then Shuffle. On each reducer: Combine/Merge, Reduce.]
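
The stages named in the diagram can be sketched on a single machine as a word count. This is only an illustration under stated assumptions, not Hadoop code: the function names (`map_phase`, `combine`, `partition`, `shuffle`, `reduce_phase`), the toy partitioner and the two-reducer setup are all made up for the example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for line in split for word in line.split()]


def combine(pairs):
    """Local sort + combine: pre-aggregate counts on the mapper side."""
    pairs = sorted(pairs, key=itemgetter(0))  # local sort by key
    return [(key, sum(value for _, value in group))
            for key, group in groupby(pairs, key=itemgetter(0))]


def partition(key, num_reducers):
    """Deterministic toy partitioner (hypothetical): sum of character codes."""
    return sum(map(ord, key)) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Shuffle: route every pair to the reducer that owns its key."""
    buckets = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)].append((key, value))
    return buckets


def reduce_phase(bucket):
    """Merge + reduce: group one reducer's pairs by key and sum them."""
    totals = defaultdict(int)
    for key, value in bucket:
        totals[key] += value
    return dict(totals)


def word_count(splits, num_reducers=2):
    mapper_outputs = [combine(map_phase(split)) for split in splits]
    result = {}
    for bucket in shuffle(mapper_outputs, num_reducers):
        result.update(reduce_phase(bucket))
    return result


# Two input splits, one per (simulated) map task.
splits = [["a b a"], ["b c"]]
counts = word_count(splits)
```

The user only supplies the logic inside `map_phase` and `reduce_phase`; sorting, combining and shuffling are what the framework does in between.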

SLIDE 6

MapReduce

○ Easy-to-use programming model (2 functions)
○ Scalability
○ Fault-tolerance
○ Load balancing
○ Data locality-based optimization
○ Designed for batch-oriented computations (N-step dataflows)
○ Low-level abstraction (combined data sets, primitive operations)

SLIDE 7

Solutions

  • Haloop
    ○ Loop-aware scheduler
    ○ Caching mechanisms
  • iMapReduce
    ○ Persistent tasks
    ○ Input data loaded once
    ○ Asynchronous execution
  • Pig
    ○ High-level data manipulation
    ○ Hadoop execution

SLIDE 8

Haloop

  • Hadoop-based framework
  • Supports iterative programs
  • Loop-aware scheduler and caching mechanisms

SLIDE 9

Hadoop

[Diagram: the Client submits jobs to the Master Node. The Task Scheduler on the Master Node schedules tasks on the Slave Nodes. The Task Tracker on each Slave Node creates tasks and manages the tasks' execution.]

SLIDE 10

Haloop

[Diagram: same architecture as Hadoop (Client, Master Node with Task Scheduler, Slave Nodes with Task Tracker), with two additions.]

  • Loop control: initiates map-reduce steps until the termination condition is met
  • Data locality: by means of caching and indexing

SLIDE 11

Haloop - Loop control

  • Goal: place map/reduce tasks that occur in different iterations but access the same data on the same physical machine
  • How:
    ○ Keep track of the data partitions processed by each task on each physical machine
    ○ Map new tasks to slave nodes that have already processed that data partition
    ○ If the node is full, re-assign to another node
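
The placement rule above can be sketched as a toy scheduler. This is a minimal sketch under assumptions, not Haloop's implementation: the class name `LoopAwareScheduler`, the slot model and the fallback rule (pick the node with the most free slots) are hypothetical.

```python
class LoopAwareScheduler:
    """Toy scheduler: prefer the node that processed a partition before."""

    def __init__(self, nodes, slots_per_node):
        self.free_slots = {node: slots_per_node for node in nodes}
        self.last_node = {}  # partition id -> node that processed it last

    def assign(self, partition):
        preferred = self.last_node.get(partition)
        if preferred is not None and self.free_slots[preferred] > 0:
            node = preferred  # data from the previous iteration is local
        else:
            # Unknown partition or preferred node full: fall back to the
            # node with the most free slots (assumed policy).
            node = max(self.free_slots, key=self.free_slots.get)
            if self.free_slots[node] == 0:
                raise RuntimeError("no free task slots in the cluster")
        self.free_slots[node] -= 1
        self.last_node[partition] = node
        return node

    def end_iteration(self, slots_per_node):
        """Release all slots before the next iteration starts."""
        for node in self.free_slots:
            self.free_slots[node] = slots_per_node


scheduler = LoopAwareScheduler(["n1", "n2"], slots_per_node=2)
partitions = ["p0", "p1", "p2", "p3"]
first = {p: scheduler.assign(p) for p in partitions}
scheduler.end_iteration(slots_per_node=2)
second = {p: scheduler.assign(p) for p in partitions}
```

With enough slots, every partition lands on the same node in both iterations; a partition is moved only when its preferred node has no free slot.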

SLIDE 12

Haloop - Caching and indexing

  • Reducer input cache: useful for repeated joins against large invariant data (less time wasted in shuffling)
  • Reducer output cache: reduces the cost of evaluating the fixpoint termination condition
  • Mapper input cache: useful in k-means-like applications (input data does not vary)
  • Cache reloading: if a node is full, copy all required data to the newly assigned node
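
The roles of invariant data and of the previous reducer output in a fixpoint test can be illustrated with a small in-memory loop. This is a sketch under assumptions: it does not model Haloop's actual caches, and the function name `iterate_until_fixpoint` and the shortest-path example are chosen only for illustration.

```python
def iterate_until_fixpoint(edges, source, max_iterations=100):
    """Single-source shortest paths as repeated map/reduce steps.

    `edges` is the loop-invariant data: it is built once and reused by
    every iteration (the role of the input caches). `previous` holds the
    last reducer output (the role of the reducer output cache) and is
    compared with the new output to detect the fixpoint.
    """
    previous = {source: 0}
    for iteration in range(1, max_iterations + 1):
        # Map: every known node keeps its distance and proposes one
        # to each of its neighbours.
        proposals = [(node, dist) for node, dist in previous.items()]
        for node, dist in previous.items():
            for neighbour, weight in edges.get(node, []):
                proposals.append((neighbour, dist + weight))
        # Reduce: keep the smallest proposed distance per node.
        current = {}
        for node, dist in proposals:
            current[node] = min(dist, current.get(node, dist))
        # Fixpoint test against the output of the previous iteration.
        if current == previous:
            return current, iteration
        previous = current
    return previous, max_iterations


edges = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
distances, iterations = iterate_until_fixpoint(edges, "a")
```

Without a cached copy of the previous output, the comparison would need an extra job to reload it in every iteration, which is the cost the reducer output cache avoids.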

SLIDE 13

iMapReduce

  • Based on Hadoop
  • Framework for iterative algorithms
  • Introduces persistent tasks, loads input data into them once, and facilitates asynchronous execution of tasks within an iteration

SLIDE 14

iMapReduce - Restrictions

  • Map and reduce operations use the same key (one-to-one mapping)
  • Each iteration contains only one MapReduce job

Graph-based iterative algorithms

SLIDE 15

iMapReduce - Persistent tasks

  • Tasks stay alive during the whole iteration process (dormant once their input data has been parsed/processed)
  • Depends on available task slots (load-balancing problem: straggler/leader nodes)

[Diagram: a chain of MapReduce jobs (Job 1, Job 2, . . .), each with Map and Reduce stages, exchanging data through the DFS.]

SLIDE 16

iMapReduce - Data management

  • Input data is split into static data (invariant) and state data (variant)
  • State data is passed from reduce to map tasks through socket connections
  • Static data is partitioned with the same hash function used to shuffle state data
  • Map and reduce tasks (one-to-one due to the key restriction) are scheduled to the same worker
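
Why a shared hash function matters can be shown with a short sketch: if static and state data are partitioned by the same function, matching keys end up on the same worker and can be joined locally. The names (`partition_of`, `partition_data`), the use of `crc32` and the sample data are assumptions for the example, not iMapReduce's code.

```python
import zlib


def partition_of(key, num_workers):
    """Stable hash partitioner (crc32 is an arbitrary choice here)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def partition_data(records, num_workers):
    """Split a {key: value} mapping into one dict per worker."""
    parts = [{} for _ in range(num_workers)]
    for key, value in records.items():
        parts[partition_of(key, num_workers)][key] = value
    return parts


# Invariant part: graph links. Variant part: one value per node.
static_data = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
state_data = {"a": 1.0, "b": 1.0, "c": 1.0}

num_workers = 2
static_parts = partition_data(static_data, num_workers)
state_parts = partition_data(state_data, num_workers)

# Both sides used partition_of, so each worker holds matching keys.
co_located = all(static_parts[w].keys() == state_parts[w].keys()
                 for w in range(num_workers))
```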

SLIDE 17

iMapReduce - Asynchronous execution

  • A map task can start executing as soon as its state data arrives
  • No need to wait for other map tasks
  • Fault-tolerance problem: use a buffer to save results from reduce tasks (allows returning to the last iteration)

SLIDE 18

Pig

  • Provides constructs that allow high-level data manipulation
  • Allows the use of user-provided executables
  • Compiles data-flow programs (Pig Latin) into sets of MapReduce jobs and coordinates their execution (Hadoop)

SLIDE 19

Pig - Compilation and execution stages

  • Parser: type check, schema inference; produces the Logical Plan (DAG)
  • Logical Optimizer: e.g. filter pushdown (minimize amount of data scanned and processed)
  • MapReduce Compiler: mapping from logical to physical; produces the Physical Plan (DAG)
  • MapReduce Optimizer: maps distributive/algebraic operations to map, combine and reduce steps
  • Hadoop Job Manager: monitor
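
Splitting an algebraic operation across map, combine and reduce steps can be illustrated with an average. This is a sketch under assumptions: the function names `initial`, `intermediate` and `final` are illustrative only and are not presented as Pig's API.

```python
def initial(value):
    """Map step: turn one value into a partial (sum, count)."""
    return (value, 1)


def intermediate(partials):
    """Combine step: merge partial (sum, count) pairs; may run repeatedly."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total, count)


def final(partials):
    """Reduce step: merge the remaining partials and produce the average."""
    total, count = intermediate(partials)
    return total / count


# Two hypothetical map tasks, each with its own input split.
split_1 = [1, 2, 3]
split_2 = [4, 5]

combined_1 = intermediate([initial(v) for v in split_1])  # (6, 3)
combined_2 = intermediate([initial(v) for v in split_2])  # (9, 2)
average = final([combined_1, combined_2])                 # 15 / 5
```

Only one small pair per map task crosses the network, instead of every input value.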

SLIDE 20

Pig - Memory Management

  • Pig is implemented in Java
  • Memory overflow situations when large bags of tuples are materialized between and inside operators
  • Solution: list of bags ordered in descending order (estimated size); spill bags when threshold is reached
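
The spilling policy can be sketched as follows. This is a minimal sketch, assuming that a bag's size is estimated by its tuple count and that spilling means pickling to a temporary file; the names `SpillableBag` and `spill_until_below` are hypothetical and Pig's Java implementation differs.

```python
import os
import pickle
import tempfile


class SpillableBag:
    """A bag of tuples that can move its contents to a temporary file."""

    def __init__(self, tuples):
        self.tuples = list(tuples)
        self.spill_path = None

    def estimated_size(self):
        # Crude estimate (assumption): number of tuples held in memory.
        return len(self.tuples)

    def spill(self):
        with tempfile.NamedTemporaryFile(delete=False) as handle:
            pickle.dump(self.tuples, handle)
            self.spill_path = handle.name
        self.tuples = []

    def load(self):
        """Return all tuples, reading them back from disk if spilled."""
        if self.spill_path is None:
            return list(self.tuples)
        with open(self.spill_path, "rb") as handle:
            return pickle.load(handle)


def spill_until_below(bags, threshold):
    """Spill the largest bags first until in-memory tuples <= threshold."""
    spilled = []
    for bag in sorted(bags, key=lambda b: b.estimated_size(), reverse=True):
        if sum(b.estimated_size() for b in bags) <= threshold:
            break
        bag.spill()
        spilled.append(bag)
    return spilled


bags = [SpillableBag([(i,) for i in range(n)]) for n in (5, 50, 20)]
spilled = spill_until_below(bags, threshold=30)
in_memory = sum(b.estimated_size() for b in bags)
restored = spilled[0].load()

# Remove the temporary spill files created by the example.
for bag in spilled:
    os.remove(bag.spill_path)
```

Spilling in descending size order frees the most memory with the fewest disk writes: here only the 50-tuple bag is written out.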

SLIDE 21

Pig - Streaming

  • User-defined functions are supported in Java and are synchronous
  • Streaming executables allow other languages to be used (scripts/compiled binaries)
  • Streaming executables are asynchronous (queues)
SLIDE 22

Pig - Performance

Source: India Hadoop Summit - Feb, 2011

17 December, 2010
6 June, 2015: release 0.15.0 available

SLIDE 23

PigPen

  • MapReduce language that looks and behaves like clojure.core
  • Supports unit tests and iterative development
  • Used at Netflix
SLIDE 24

References

  • Cisco and/or its affiliates. Cisco Global Cloud Index: Forecast and methodology, 2013–2018. White paper, 2014.
  • Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. HaLoop: Efficient iterative data processing on large clusters.
  • Yanfeng Zhang, Qixin Gao, Lixin Gao, and Cuirong Wang. iMapReduce: A distributed computing framework for iterative computation. In Proceedings of the 1st International Workshop on Data Intensive Computing in the Clouds (DataCloud), page 1112, 2011.
  • Thilina Gunarathne, Bingjing Zhang, Tak-Lon Wu, and Judy Qiu. Portable parallel programming on cloud and HPC: Scientific applications of Twister4Azure, 2011.
  • Jaliya Ekanayake, Xiaohong Qiu, Thilina Gunarathne, Scott Beason, and Geoffrey Fox. High performance parallel computing with cloud and cloud technologies.

SLIDE 25

Questions?