SLIDE 1

DEUTSCH-FRANZÖSISCHE SOMMERUNIVERSITÄT FÜR NACHWUCHSWISSENSCHAFTLER 2011 UNIVERSITÉ D’ÉTÉ FRANCO-ALLEMANDE POUR JEUNES CHERCHEURS 2011

CLOUD COMPUTING : DÉFIS ET OPPORTUNITÉS CLOUD COMPUTING : HERAUSFORDERUNGEN UND MÖGLICHKEITEN

17.7. – 22.7. 2011

Distributed Computing

Wolf-Tilo Balke & Pierre Senellart IFIS, Technische Universität Braunschweig IC2, Telecom ParisTech

SLIDE 2

Basics in Distributed Computing

– Context, Motivation & Applications
– Basics, Amdahl’s Law & Response Time Models
– Short introduction to GFS

2

Introduction

slide-3
SLIDE 3
  • Basic Definition

– A distributed system consists of multiple autonomous computers (nodes) that communicate through a computer network via message passing – The computers interact with each other in order to achieve a common goal – A computer program that runs in a distributed system is called a distributed program

3

Distributed Computing

slide-4
SLIDE 4
  • What are economic and technical drivers for

having distributed systems?

– Costs: better price/performance as long as commodity hardware is used for the component computers
– Performance: by using the combined processing and storage capacity of many nodes, performance levels can be reached that are out of the scope of centralized machines
– Scalability/Elasticity: resources such as processing and storage capacity can be increased incrementally
– Availability: by having redundant components, the impact of hardware and software faults on users can be reduced

4

Distributed Computing

slide-5
SLIDE 5
  • Hardware costs of a data center

– Usually run by big companies with dedicated data centers – Usually resides on extremely expensive blade servers

  • DELL PowerEdge M910 (Apr 2010)

– 4x XEON L7555, 1.86 GHz, 4C – 128 GB RAM – 1.2 TB RAID HD – 28.000 €

  • Building a data center with such Blades is

very expensive… (1 Rack, 32 Blades)

– ~0,9 Million € for 512 cores, 4 TB RAM, 38,4 TB HD – Additional costs for support, housing, etc…

– Analogy: data lives in high class condos

5

Costs of Data Centers

slide-6
SLIDE 6
  • Hardware costs of a Cloud / P2P system

– Software usually resides on very cheap low-end hardware

  • DELL Inspiron 570 (Apr 2010)

– 1x Athlon II X4 630, 2.8 GHz, 4C – 4 GB RAM – 1 TB HD – 500 €

  • Performance comes cheap (1,800 machines)

– ~ 0,9 Million € for 7200 cores, 7,2 TB RAM, 1,8 PB HD – Blade: ~0,9 Million € for 512 cores, 4 TB RAM, 38,4 TB HD

– Analogy: data lives in the slums

6

Costs of Data Centers

slide-7
SLIDE 7
  • … or how to build one of the most powerful data

centers out of crappy hardware

– Google has jealously guarded the design of its data centers for a long time

  • In 2007 & 2009 some details

have been revealed

  • The Google Servers

– Google only uses custom-built servers – Google is the world’s 4th largest server producer

  • They don’t even sell servers…
  • In 2007, it was estimated that Google operates over 1.000.000

servers over 34 major and many more minor data centers

7

Google Servers

slide-8
SLIDE 8

– Data centers are connected to each other and major internet hubs via massive fiber lines

  • ~7% of all internet traffic is generated by Google
  • ~60% of that traffic connects directly to consumer

networks without connecting to global backbone

– If Google was an ISP , it would be the 3rd largest global carrier

8

Google Servers

slide-9
SLIDE 9
  • Some Google Datacenter facts & rumors

– In 2007, four new data centers were constructed for 600 million dollars – Annual operation costs in 2007 are reported to be 2.4 billion dollars – An average data center uses 50 megawatts of electricity

  • The largest center in Oregon has an estimated use of over

110 megawatts

  • The whole region of Braunschweig is estimated to use up

roughly 225 megawatts

9

Google Servers

slide-10
SLIDE 10
  • Each server rack holds 40 to 80 commodity-class

x86 PC servers with custom Linux

– Each server runs slightly outdated hardware – Each system has its own 12V battery to counter unstable power supplies – No cases used, racks are setup in standard shipping containers and are just wired together

  • More info: http://www.youtube.com/watch?v=Ho1GEyftpmQ

10

Google Servers

slide-11
SLIDE 11
  • Google servers are very unstable

– … but also very cheap – High “bang-for-buck” ratio

  • Typical first year for a new cluster (several racks):

– ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
– ~1 PDU (power distribution unit) failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
– ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
– ~1 network rewiring (rolling ~5% of machines down over a 2-day span)

11

Google Servers

slide-12
SLIDE 12

– ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
– ~5 racks go wonky (40-80 machines see 50% packet loss)
– ~8 network maintenances (might cause ~30-minute random connectivity losses)
– ~12 router reloads (takes out DNS and external VIPs for a couple minutes)
– ~3 router failures (traffic immediately pulled for an hour)
– ~dozens of minor 30-second DNS blips
– ~1000 individual machine failures
– ~thousands of hard drive failures
– Countless slow disks, bad memory, misconfigured machines, flaky machines, etc.

12

Google Servers

slide-13
SLIDE 13
  • Challenges to the data center software

– Deal with all these hardware failures while avoiding any data loss and maintaining ~100% global uptime – Decrease maintenance costs to a minimum – Allow flexible extension of data centers – Solution:

  • Use cloud technologies
  • GFS (Google File System) and

Google Big Table Data System

13

Google Servers

slide-14
SLIDE 14
  • Telecommunication networks

– Telephone and cellular networks – Computer networks and the Internet – Wireless sensor networks

  • Network applications

– World Wide Web and Peer-to-Peer networks – Massively multiplayer online games

14

Applications

slide-15
SLIDE 15
  • Distributed databases and distributed information

processing systems such as

– Banking systems – Airline reservation systems – Real-time process control like aircraft control systems.

  • Parallel computation

– Scientific computing, including cluster and grid computing

15

Applications

slide-16
SLIDE 16
  • Example: distributed data systems are important

in astronomy

– No site can hold all information

  • Telescope image archives are already in the multi-TB range
  • Promise to quickly grow larger with the increasing size of digital detectors and the advent of new all-sky surveys

16

Distributed Systems in Science

slide-17
SLIDE 17
  • Much of the astronomical information is dynamic

– Static catalogs and indexes quickly become obsolete

  • Astronomers use multiple types of data

– images, spectra, time series, catalogs, journal articles,... – All should be easily located and easily accessed with query terms and syntax natural to the discipline

  • Astronomers need to know the provenance of the

data they are using and all details about it

– No one data center is able to have expertise in the wide range of astronomical instrumentation and data sets

17

Distributed Systems in Science

slide-18
SLIDE 18
  • Sample distributed datasets at NASA

18

Distributed Systems in Science

Solar System Exploration – Lunar and planetary science data and mission information
Heliophysics – Space and solar physics data and mission information
Universe Exploration – Astrophysics data and mission information
http://nssdc.gsfc.nasa.gov/

slide-19
SLIDE 19
  • Naval command systems

– Collate information such as:

  • Sensor data (RADAR)
  • Geographic data (Maps)
  • Technical information (Ship types)
  • Air, land, surface and underwater data
  • ...

– Highly interactive

  • Operator may annotate and extend

any given data

– Many operators at a time – Each operator should see all annotations in real time

19

Distributed Military Systems

slide-20
SLIDE 20
  • Hard requirements for the system:

– Consistent, up-to-date view on the situation – Distributed environment – A lot of storage needed (sensor data) – High fault-safety – Real-time requirements

20

Distributed Military Systems

slide-21
SLIDE 21
  • BAE Systems

– British defense, security and aerospace company – Creates electronic systems and software for e.g. “Eurofighter Typhoon” or “Queen Elizabeth class aircraft carriers” – This includes development of naval command systems

21

Distributed Military Systems

slide-22
SLIDE 22
  • The key to success

– Divide a problem into many tasks: each is solved by one or more computers

  • Problems

– The system has to tolerate failures in individual nodes – The structure of the system (network topology, latency, number of nodes) is not known in advance – The system may consist of different kinds of computers and network links – The system may change during the execution of a distributed program – Each node has only a limited, incomplete view of the system

22

How to do it?

slide-23
SLIDE 23
  • Uniprocessor

– Single processor – Direct memory access

23

Hardware Architecture

[Diagram: a single processor with direct access to memory.]

slide-24
SLIDE 24
  • Multiprocessor

– Multiple processors with direct memory access – Uniform memory access (e.g., SMP, multicore) – Nonuniform memory access (e.g., NUMA)

24

Hardware Architecture

[Diagram: multiple processors with direct access to shared memory module(s).]

slide-25
SLIDE 25
  • Multicomputer

– Multiple computers linked via network – No direct memory access – Homogeneous vs. Heterogeneous

25

Hardware Architecture

[Diagram: multiple computers, each with its own processor and memory, connected via a network.]

slide-26
SLIDE 26
  • Similar for uniprocessors and multiprocessors

– But for multiprocessors: the kernel is designed to handle multiple CPUs and the number of CPUs is transparent

26

Software Architecture

Applications

Operating System Services

Kernel

slide-27
SLIDE 27
  • For multicomputers, there are several possibilities

– Network OS – Middleware – Distributed OS

27

Software Architecture

[Diagrams: three software stacks for multicomputers – a network OS (distributed applications over independent per-node OS services and kernels), a distributed OS (distributed applications over shared distributed OS services on top of per-node kernels), and middleware (distributed applications over a middleware services layer on top of per-node OS services and kernels).]

slide-28
SLIDE 28
  • Not about architectural issues

– A lot of open discussions that would fill our time slot completely…

  • Our main focus: scalability and time

28

This Course

slide-29
SLIDE 29
  • “Classic” cost models focus on

total resource consumption of a task

– Leads to good results for heavy computational load and slow network connections

  • If execution plan saves resources, many threads can be executed

in parallel on different machines

– However, algorithms can also be optimized for short response times

  • “Waste” some resources to get first results earlier
  • Take advantage of lightly loaded machines and fast connections
  • Utilize intra-thread parallelism

– Parallelize one thread instead of running multiple concurrent threads

29

Response Time Models

slide-30
SLIDE 30
  • Response time models are needed!

– “When does the first piece of the result arrive?”

  • Important for Web search, query processing,

– “When has the final result arrived?”

30

Response Time Models

slide-31
SLIDE 31
  • Example

– Assume relations or fragments A, B, C, and D – All relations/fragments are available on all nodes

  • Full replication

– Compute A ⋈ B ⋈ C ⋈ D – Assumptions

  • Each join costs 20 time units (TU)
  • Transferring an intermediate result costs 10 TU
  • Accessing relations is free
  • Each node has one computation thread

31

Distributed Query Processing

slide-32
SLIDE 32
  • Two plans:

– Plan 1: Execute all operations on one node

  • Total costs: 60

– Plan 2: Join on different nodes, ship results

  • Total costs: 80

32

Distributed Query Processing

[Plan diagrams: in Plan 1, node 1 computes all three joins of A, B, C, and D locally; in Plan 2, the joins A ⋈ B and C ⋈ D are computed on different nodes in parallel, the intermediate results are shipped, and a final join combines them.]

slide-33
SLIDE 33
  • With respect to total costs, plan 1 is better
  • Example (cont.)

– Plan 2 is better w.r.t. response time as operations can be carried out in parallel

33

Distributed Query Processing

[Timeline: Plan 1 executes A ⋈ B, C ⋈ D, and the final join sequentially on one node, finishing after 60 TU. Plan 2 runs A ⋈ B and C ⋈ D in parallel (0–20 TU), transfers the intermediate results (20–30 TU), and finishes the final join after 50 TU.]
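Reading the timeline as a small worked check (added here for clarity, and assuming the two intermediate-result transfers overlap):

$$RT(\text{Plan 1}) = 20 + 20 + 20 = 60\ \text{TU}, \qquad RT(\text{Plan 2}) = \max(20, 20) + 10 + 20 = 50\ \text{TU}.$$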

slide-34
SLIDE 34
  • Response Time

– Two types of response times

  • First Tuple & Full Result Response Time

  • Computing response times

– Sequential execution parts

  • Full response time is the sum of the computation times of all used operations

– Multiple parallel threads

  • Maximal costs of all parallel sequences

34

Distributed Query Processing

slide-35
SLIDE 35
  • Considerations:

– How much speedup is possible due to parallelism?

  • Or: does “kill it with iron” work for parallel problems?

– Performance speed-up of algorithms is limited by Amdahl’s Law

  • Gene Amdahl, 1967
  • Algorithms are composed of parallel and sequential parts
  • Sequential code fragments

severely limit potential speedup of parallelism!

35

Response Time Models

slide-36
SLIDE 36

– Possible maximal speed-up:

  • maxspeedup ≤ q / (1 + t · (q − 1))

  • q is the number of parallel threads
  • t is the fraction of single-threaded (sequential) code

– e.g., if 10% of an algorithm is sequential, the maximum speed-up regardless of parallelism is 10×
– For maximally efficient parallel systems, all sequential bottlenecks have to be identified and eliminated!
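As a quick plausibility check (the figures below are chosen purely for illustration and are not from the slides): with t = 0.1, even the 512-core blade rack from Slide 5 cannot get past a ten-fold speed-up,

$$\text{maxspeedup}(q = 512,\ t = 0.1) = \frac{512}{1 + 0.1 \cdot 511} \approx 9.8, \qquad \lim_{q \to \infty} \frac{q}{1 + t(q - 1)} = \frac{1}{t} = 10.$$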

36

Response Time Models

slide-37
SLIDE 37

37

Response Time Models

slide-38
SLIDE 38
  • Good first-item response benefits from operations executed in a pipelined fashion

– Not pipelined:

  • Each operation is fully completed and an intermediate result is created

  • The next operation reads the intermediate result and is then fully completed

– Reading and writing of intermediate results costs resources!

– Pipelined

  • Operations do not create intermediate results
  • Each finished tuple is fed directly into the next operation
  • Tuples “flow” through the operations

38

Response Time Models

slide-39
SLIDE 39
  • Usually, the result flow is controlled by iterator

interfaces implemented by each operation

– “Next” command – If the execution speeds of operations in the pipeline differ, results are either cached or the pipeline blocks (a small iterator sketch follows after the list below)

  • Some operations are more suitable than others

for pipelining

– Good: selections, filtering, unions, … – Tricky: joining, intersecting, … – Very Hard: sorting
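A minimal sketch of such an iterator (“next”) interface, chaining a table scan, a selection, and a projection so that tuples flow one at a time. This is my own Java illustration; the class and method names are assumptions, not code from the lecture.

import java.util.Iterator;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Volcano-style operators: each operator pulls tuples from its input on demand
// via next() and returns null when it is exhausted.
interface TupleIterator {
    String[] next();
}

class TableScan implements TupleIterator {
    private final Iterator<String[]> rows;
    TableScan(List<String[]> table) { this.rows = table.iterator(); }
    public String[] next() { return rows.hasNext() ? rows.next() : null; }
}

class Selection implements TupleIterator {
    private final TupleIterator input;
    private final Predicate<String[]> pred;
    Selection(TupleIterator input, Predicate<String[]> pred) { this.input = input; this.pred = pred; }
    public String[] next() {
        String[] t;
        while ((t = input.next()) != null)
            if (pred.test(t)) return t;        // skip tuples failing the predicate
        return null;
    }
}

class Projection implements TupleIterator {
    private final TupleIterator input;
    private final Function<String[], String[]> proj;
    Projection(TupleIterator input, Function<String[], String[]> proj) { this.input = input; this.proj = proj; }
    public String[] next() {
        String[] t = input.next();
        return t == null ? null : proj.apply(t);   // no intermediate results are materialized
    }
}

// Usage: repeatedly call next() on
//   new Projection(new Selection(new TableScan(rows), t -> ...), t -> ...)
// until it returns null; each tuple "flows" through all operators individually.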

39

Response Time Models

slide-40
SLIDE 40
  • Simple pipeline example:

– Tablescan, Selection, Projection

  • 1000 tuples are scanned, selectivity is 0.1

– Costs:

  • Accessing one tuple during tablescan: 2 TU (time unit)
  • Selecting (testing) one tuple: 1 TU
  • Projecting one tuple: 1 TU

40

Pipelined Query Processing

[Diagrams: non-pipelined plan (Table Scan → IR1 → Selection → IR2 → Projection → Final) vs. pipelined plan (Table Scan → Selection → Projection → Final, with no intermediate results).]

Non-Pipelined:
time    event
2       First tuple in IR1
2000    All tuples in IR1
2001    First tuple in IR2
3000    All tuples in IR2
3001    First tuple in Final
3100    All tuples in Final

Pipelined:
time    event
2       First tuple finished tablescan
3       First tuple finished selection (if selected…)
4       First tuple in Final
3098    Last tuple finished tablescan
3099    Last tuple finished selection
3100    All tuples in Final

slide-41
SLIDE 41
  • Consider following example:

– Joining two table subsets

  • Non-pipelined BNL join
  • Both pipelines work in parallel

– Costs:

  • 1.000 tuples are scanned in each pipeline, selectivity 0.1
  • Joining 100 ⋈100 tuples: 10,000 TU (1 TU per tuple combination)

– Response time

  • The first tuple can arrive at the end of any pipeline after 4 TU

– Stored in intermediate result

  • All tuples have arrived at the end of the pipelines after 3,100 TU
  • Final result will be available after 13,100 TU

– No benefit from pipelining w.r.t. response time – The first result tuple arrives at some time t with 3,100 < t ≤ 13,100

41

Pipelined Query Processing

[Diagram: two pipelines (Table Scan → Selection → Projection) feeding a non-pipelined BNL join.]

slide-42
SLIDE 42
  • The suboptimal result of the previous example is

due to the unpipelined join

– Most traditional join algorithms are unsuitable for pipelining

  • Pipelining is not a necessary feature in a strict single

thread environment

– Join is fed by two input pipelines – Only one pipeline can be executed at a time – Thus, at least one intermediate result has to be created – Join may be performed single / semi-pipelined

  • In parallel / distributed DBs, fully pipelined joins are

beneficial

42

Pipelined Query Processing

slide-43
SLIDE 43
  • Single-Pipelined Hash Join

– One of the “classic” join algorithms – Base idea: A ⋈ B

  • One input relation is read from an intermediate result (B), the other is pipelined through the join operation (A)
  • All tuples of B are stored in a hash table

– The hash function is applied to the join attribute – i.e. all tuples showing the same value for the join attribute are in one bucket » Careful: hash collisions! Tuples with different join attribute values might end up in the same bucket!

  • Every incoming tuple a (via pipeline) of A is hashed by its join attribute

  • Compare a to each tuple in the respective B bucket

– Return those tuples which show matching join attributes
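A compact illustrative sketch of this build-and-probe scheme (my own Java code, not from the lecture; tuples are String arrays and the join attribute is given as a column index). Keying the map directly by the join value gives one bucket per distinct value, so the collision caveat above does not arise in this simplified form.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SinglePipelinedHashJoin {
    // hash table over the already materialized relation B, keyed by its join attribute value
    private final Map<String, List<String[]>> bTable = new HashMap<>();
    private final int aJoinCol;

    // Build phase: B comes from an intermediate result and is hashed once.
    SinglePipelinedHashJoin(List<String[]> b, int bJoinCol, int aJoinCol) {
        this.aJoinCol = aJoinCol;
        for (String[] t : b)
            bTable.computeIfAbsent(t[bJoinCol], k -> new ArrayList<>()).add(t);
    }

    // Probe phase: called once per A tuple streaming in via the pipeline;
    // returns all join results this tuple produces.
    List<String[]> probe(String[] a) {
        List<String[]> results = new ArrayList<>();
        for (String[] b : bTable.getOrDefault(a[aJoinCol], List.of())) {
            String[] joined = new String[a.length + b.length];
            System.arraycopy(a, 0, joined, 0, a.length);
            System.arraycopy(b, 0, joined, a.length, b.length);
            results.add(joined);
        }
        return results;
    }
}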

43

Pipelined Query Processing

slide-44
SLIDE 44
  • Double-Pipelined-Hash-Join

– Dynamically build hash tables for the A and B tuples

  • Memory intensive!

– Process tuples on arrival

  • Cache tuples if necessary
  • Balance between A and B tuples for

better performance

  • Rely on statistics for a good A:B

ratio

– If a new A tuple a arrives

  • Insert a into the A-table
  • Check in the B table if there are join

partners for a

  • If yes, return all combined AB tuples

– If a new B tuple arrives, process it analogously
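Continuing the same illustrative conventions (again my own sketch, not lecture code), the double-pipelined variant keeps one hash table per input; every arriving tuple is first inserted into its own table and then probed against the other table.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DoublePipelinedHashJoin {
    private final Map<String, List<String[]>> aTable = new HashMap<>();
    private final Map<String, List<String[]>> bTable = new HashMap<>();
    private final int aJoinCol, bJoinCol;

    DoublePipelinedHashJoin(int aJoinCol, int bJoinCol) {
        this.aJoinCol = aJoinCol;
        this.bJoinCol = bJoinCol;
    }

    // A new A tuple arrives: insert into the A table, then look for B join partners.
    List<String[]> onATuple(String[] a) {
        aTable.computeIfAbsent(a[aJoinCol], k -> new ArrayList<>()).add(a);
        return combine(a, bTable.getOrDefault(a[aJoinCol], List.of()));
    }

    // A new B tuple arrives: processed analogously.
    List<String[]> onBTuple(String[] b) {
        bTable.computeIfAbsent(b[bJoinCol], k -> new ArrayList<>()).add(b);
        return combine(b, aTable.getOrDefault(b[bJoinCol], List.of()));
    }

    private List<String[]> combine(String[] t, List<String[]> partners) {
        List<String[]> out = new ArrayList<>();
        for (String[] p : partners) {
            String[] joined = new String[t.length + p.length];
            System.arraycopy(t, 0, joined, 0, t.length);
            System.arraycopy(p, 0, joined, t.length, p.length);
            out.add(joined);
        }
        return out;
    }
}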

44

Pipelined Query Processing

[Figure: A ⋈ B with one hash table per input. A hash: bucket 17 → {A1, A2}, bucket 31 → {A3}; B hash: bucket 29 → {B1}; A and B input feeds, AB output feed; a new tuple B(31, B2) is arriving.]

slide-45
SLIDE 45

45

Pipelined Query Processing

[Figure: the same hash tables after B(31, B2) has been inserted – the B hash now contains bucket 29 → {B1} and bucket 31 → {B2}.]

  • B(31, B2) arrives
  • Insert into the B hash
  • Find matching A tuples
  • Find A3
  • Assume that A3 matches B2…
  • Put AB(A3, B2) into the output feed
slide-46
SLIDE 46
  • In pipelines, tuples just “flow” through the operations

– No problem with that in one processing unit… – How do tuples flow to other nodes?

  • Sending each tuple individually may be very inefficient

– Communication costs:

  • Setting up transfer & opening communication channel
  • Composing message
  • Transmitting message: header information & payload

– Most protocols impose a minimum message size & larger headers – Tuplesize ≪ Minimal Message Size

  • Receiving & decoding message
  • Closing channel

46

Pipelined Query Processing

slide-47
SLIDE 47
  • Idea: Minimize Communication Overhead by

Tuple Blocking

– Do not send single tuples, but larger blocks containing multiple tuples

  • “Burst-Transmission”
  • Pipeline-Iterators have to be able to cache packets
  • Block size should be at least the packet size of the

underlying network protocol

– Often, larger packets are more beneficial – ….more cost factors for the model
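A small sketch of the idea (my own illustration; the class and the transmit step are placeholders, not an API from the lecture): the sender-side iterator buffers tuples and ships them as one block once the block size is reached.

import java.util.ArrayList;
import java.util.List;

class TupleBlockSender {
    private final int blockSize;                     // chosen >= the network protocol’s packet size
    private final List<String[]> buffer = new ArrayList<>();

    TupleBlockSender(int blockSize) { this.blockSize = blockSize; }

    void send(String[] tuple) {
        buffer.add(tuple);                           // cache tuples instead of sending them one by one
        if (buffer.size() >= blockSize) flush();
    }

    void flush() {                                   // also called once the pipeline is exhausted
        if (buffer.isEmpty()) return;
        transmit(new ArrayList<>(buffer));           // one message carries a whole block of tuples
        buffer.clear();
    }

    private void transmit(List<String[]> block) {
        // placeholder for the actual network transfer of the block
        System.out.println("sending a block of " + block.size() + " tuples");
    }
}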

47

Pipelined Query Processing

slide-48
SLIDE 48
  • GFS (Google File System) is the distributed

file system used by most Google services

– Driver in development was managing the Google Web search index – Applications may use GFS directly

  • The database Bigtable is an application that was especially

designed to run on-top of GFS

– GFS itself runs on-top of standard POSIX-compliant Linux file systems

  • Hadoop’s file system (HDFS) was inspired by the GFS papers, but is open source

48

Google File System

slide-49
SLIDE 49
  • Design constraints and considerations

– Run on potentially unreliable commodity hardware – Files are large (usually ranging from 100 MB to multiple GBs of size)

  • e.g. satellite imagery, or a Bigtable file

– Billions of files need to be stored – Most write operations are appends

  • Random writes or updates are rare
  • Most files are write-once, read-many (WORM)
  • Appends are much more resilient in distributed environments

than random updates

  • Most Google applications rely on Map and Reduce which

naturally results in file appends

49

GFS

slide-50
SLIDE 50

– Two common types of read operations

  • Sequential streams of large data quantities

– e.g. streaming video, transferring a web index chunk, etc. – Frequent streaming renders caching useless

  • Random reads of small data quantities

– However, random reads are usually “always forward”, e.g. similar to a sequential read skipping large portions of the file

– Focus of GFS is on high overall bandwidth, not latency

  • In contrast to system like e.g. Amazon Dynamo

– File system API must be simple and expandable

  • Flat file name space suffices

– File path is treated as string » No directory listing possible – Qualifying file names consist of namespace and file name

  • No POSIX compatibility needed
  • Additional support for file appends and snapshot operations

50

GFS

slide-51
SLIDE 51
  • A GFS cluster represents a single file

system for a certain set of applications

  • Each cluster consists of

– A single master server

  • The single master is one of the key features of GFS!

– Multiple chunk servers per master

  • Accessed by multiple clients

– Running on commodity Linux machines

  • Files are split into fixed-sized chunks

– Similar to file system blocks – Each labeled with a 64-bit unique global ID – Stored at a chunk server – Usually, each chunk is three times replicated across chunk servers

51

GFS

slide-52
SLIDE 52
  • Application requests are initially handled by a

master server

– Further, chunk-related communication is performed directly between application and chunk server

52

GFS

slide-53
SLIDE 53
  • Master server

– Maintains all metadata

  • Name space, access control, file-to-chunk mappings, garbage collection,

chunk migration

– Queries for chunks are handled by the master server

  • Master returns only chunk locations
  • A client typically asks for multiple chunk locations in a single request
  • The master also optimistically provides chunk locations immediately

following those requested

  • GFS clients

– Consult master for metadata – Request data directly from chunk servers

  • No caching at clients and chunk servers due to the frequent streaming
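To make this division of labour concrete, here is a purely hypothetical sketch of a client read following this pattern (GFS’s real interfaces are not public, so every class and method name below is an assumption): the master is consulted once for metadata, and the data itself comes straight from a chunk server.

import java.util.List;

// Purely illustrative pseudo-API for the GFS read path described above.
class GfsReadSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB chunks

    interface ChunkServer { byte[] readChunk(long chunkHandle, long offsetInChunk, int length); }
    interface ChunkInfo   { long handle(); List<ChunkServer> replicaLocations(); }
    interface Master      { ChunkInfo findChunk(String fileName, long chunkIndex); }

    static byte[] read(Master master, String fileName, long fileOffset, int length) {
        long chunkIndex = fileOffset / CHUNK_SIZE;                 // client computes the chunk index itself
        ChunkInfo info = master.findChunk(fileName, chunkIndex);   // master returns metadata only
        ChunkServer replica = info.replicaLocations().get(0);      // pick one replica (e.g., the closest)
        // the data is served directly by the chunk server, without involving the master
        return replica.readChunk(info.handle(), fileOffset % CHUNK_SIZE, length);
    }
}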

53

GFS

slide-54
SLIDE 54
  • Files (cont.)

– Each file consists of multiple chunks – For each file, there is a meta-data entry

File namespace

File to chunk mappings

Chunk location information

– Including replicas!

Access control information

Chunk version numbers

54

GFS

slide-55
SLIDE 55
  • Chunks are rather large (usually 64MB)

– Advantages

  • Less chunk location requests
  • Less overhead when accessing large amounts of data
  • Less overhead for storing meta data
  • Easy caching of chunk metadata

– Disadvantages

  • Increases risk for fragmentation

within chunks

  • Certain chunks may become hot spots

55

GFS

slide-56
SLIDE 56
  • Meta-Data is kept in main-memory of master

server

– Fast, easy and efficient to periodically scan through meta data

Re-replication in the presence of chunk server failure

Chunk migration for load balancing

Garbage collection

– Usually, there are 64 bytes of metadata per 64 MB chunk

  • Maximum capacity of a GFS cluster is limited by the available main memory of the master

– In practice, the query load on the master server is low enough that it never becomes a bottleneck
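A rough back-of-the-envelope illustration (the 64 GB figure is an assumption for the sake of the example, and namespace metadata is ignored): a master dedicating 64 GB of RAM to chunk metadata could track about

$$\frac{64\ \text{GB}}{64\ \text{B/chunk}} = 10^9\ \text{chunks} \quad\Rightarrow\quad 10^9 \times 64\ \text{MB} = 64\ \text{PB of file data}.$$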

56

GFS

slide-57
SLIDE 57
  • Master server relies on soft-states

– Regularly sends heart-beat messages to chunk servers

  • Is chunk server down?
  • Which chunks does chunk server store?

– Including replicas

  • Are there any disk failures at a chunk server?
  • Are any replicas corrupted?

– Test by comparing checksums

– Master can send instructions to chunk servers

  • Delete existing chunks
  • Create new empty chunk

57

GFS

slide-58
SLIDE 58
  • All modifications to meta-data are logged into an operation log to safeguard against GFS master failures

– Meta-data updates are not that frequent – The operation log contains a historical record of critical metadata changes, replicated on multiple remote machines – Checkpoints for fast recovery

  • Operation log can also serve to reconstruct a timeline of changes

– Files and chunks, as well as their versions are all uniquely and eternally identified by the logical times at which they were created – In case of failure, the master recovers its file system state by replaying the operation log

  • Usually, a shadow master is on hot-standby to take over during

recovery

58

GFS

slide-59
SLIDE 59
  • Guarantees of GFS

– Namespace mutations are always atomic

  • Handled by the master with locks
  • e.g. creating new files or chunks
  • Operation is only treated as

successful when operation is performed and all log replicas are flushed to disk

59

GFS

slide-60
SLIDE 60

– Data mutations follow a relaxed consistency model

  • A chunk is consistent, if all clients see the same data,

independently of the queried replica

  • A chunk is defined, if all its modifications are visible

– i.e. writes have been atomic – GFS can recognize defined and undefined chunks

  • In most cases, all chunks should be consistent and defined

– …but not always. – Only using append operations for data mutations minimizes probability for undefined or inconsistent chunks

60

GFS

slide-61
SLIDE 61
  • Mutation operations

– To encourage consistency among replicas, the master grants a lease for each chunk to a chunk server

  • The server owning the lease is responsible for that chunk

– i.e., it has the primary replica and is responsible for mutation operations

  • Leases are granted for a limited time (e.g. 1 minute)

– Granting leases can be piggybacked onto heartbeat messages
– A chunk server may request a lease extension if it currently mutates the chunk
– If a chunk server fails, a new lease can be handed out after the original one has expired

» No inconsistencies in case of partitions

61

GFS

slide-62
SLIDE 62
  • Mutation operations have separate data flow and control flow

– Idea: maximize bandwidth utilization and overall system throughput – The primary replica chunk server is responsible for the control flow

62

GFS

slide-63
SLIDE 63
  • Mutation workflow overview

63

GFS

[Figure: mutation workflow between Client, Master, Primary Replica and Secondary Replicas A/B, with numbered steps 1–7 as described on the following slides; control flow runs via the primary, the data flow is pipelined across all replicas.]

slide-64
SLIDE 64
  • Application originates mutation request
  • 1. GFS client translates request from (filename,

data) to (filename, chunk index), and sends it to master

– Client “knows” which chunk to modify

  • Does not know where the chunk and its replicas are located
  • 2. Master responds with chunk handle and (primary

+ secondary) replica locations

64

GFS

Client Master 1 2

slide-65
SLIDE 65
  • 3. Client pushes write data to all replicas

– Client selects the “best” replica chunk server and transfers all new data

  • e. g. closest in the network, or with highest known

bandwidth

  • Not necessarily the server holding the lease
  • New data: the new data and the address range it is

supposed to replace

– Exception: appends

– Data is stored in chunk servers’ internal buffers

  • New data is stored as fragments in buffer

– New data is pipelined forward to next chunk server

  • … and then the next
  • Serially pipelined transfer of the data
  • Try to optimize bandwidth usage

65

GFS

Client Secondary Replica A Primary Replica Secondary Replica B 3 3 3

slide-66
SLIDE 66
  • 4. After all replicas received the data, the client

sends a write request to the primary chunk server

– Primary determines serial order for new data fragments stored in its buffer and writes the fragments in that order to the chunk

  • Write of fragments is thus atomic

– No additional write request are served during write operation

  • Possibly multiple fragments from one or multiple clients

66

GFS

Client Primary Replica 4

slide-67
SLIDE 67
  • 5. After the primary server successfully finished

writing the chunk, it orders the replicas to write

– The same serial order is used!

  • Also, the same timestamps are used

– Replicas are inconsistent for a short time

  • 6. After the replicas completed,

the primary server is notified

67

GFS

Secondary Replica A Primary Replica Secondary Replica B 3 3 5 5 6 6

slide-68
SLIDE 68
  • 7. The primary notifies the client

– Also, all errors are reported to the client

  • Usually, errors are resolved by retrying some parts of the

workflow

– Some replicas may contain the same datum multiple times due to retries – Only guarantee of GFS: data will be written at least once atomically

  • Failures may render chunks inconsistent

68

GFS

Client Primary Replica 7

slide-69
SLIDE 69
  • Google aims at using append operations for most mutations

– For random updates, clients need to provide the exact range for the new data within the file

  • Easy to have collisions with other clients

– i.e. client A writes to range 1, client B overwrites range 1 because it assumed it was empty – Usually, locks would solve the problem

– Appends can be easily performed in parallel

  • Just transfer new data to chunk server

– Clients can transfer new data in parallel – Chunks server buffers data

  • Chunk server will find a correct position at the end of the chunk

– Additional logic necessary for creating new chunks if current chunk cannot hold new data

– Typical use case

  • Multiple producers append to the same file while simultaneously multiple

consumer read from it

– e.g. the web crawler and the feature extraction engine

69

GFS

slide-70
SLIDE 70
  • Master takes care of chunk creation and

distribution

– New empty chunk creation, re-replication, rebalancing

  • The master server notices if a chunk has too few replicas and can re-replicate it

– Master decides on chunk location. Heuristics:

  • Place new replicas on chunk servers with below-average disk

space utilization. Over time this will equalize disk utilization across chunk servers

  • Limit the number of “recent” creations on each chunk server

– Chunks should have different age to spread chunk correlation

  • Spread replicas of a chunk across racks

70

GFS

slide-71
SLIDE 71
  • After a file is deleted, GFS does not immediately

reclaim the available physical storage

– Just delete meta-data entry from the master server – File or chunks become stale

  • Chunks or files may also become stale if a chunk

server misses an update to a chunk

– Updated chunk has a different Id than old chunk – Master server holds only links to new chunks

  • Master knows the current chunks of a file
  • Heartbeat messages with unknown (e.g. old) chunks are ignored
  • During regular garbage collection, stale chunks are

physically deleted

71

GFS

slide-72
SLIDE 72
  • Experiences with GFS

– Chunk server workload

  • Bimodal distribution of small and large files
  • Ratio of write to append operations: 4:1 to 8:1
  • Virtually no overwrites

– Master workload

  • Most request for chunk locations and open files

– Reads achieve 75% of the network limit – Writes achieve 50% of the network limit

72

GFS

slide-73
SLIDE 73
  • Summary and notable features GFS

– GFS is a distributed file system

  • Optimized for file append operations
  • Optimized for large files

– Files are split into rather large 64 MB chunks, distributed, and replicated – Uses a single master server for file and chunk management

  • All meta-data in master server in main memory

– Uses flat namespaces

73

GFS

slide-74
SLIDE 74

July 19, 2011

Distributed Computing

Web Data Management http://webdam.inria.fr/Jorge/

  • S. Abiteboul, I. Manolescu, P. Rigaux, M.-C. Rousset, P. Senellart

slide-75
SLIDE 75

2 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Outline

MapReduce Introduction The MapReduce Computing Model MapReduce Optimization Application: PageRank MapReduce in Hadoop Toward Easier Programming Interfaces: Pig Conclusions

slide-76
SLIDE 76

3 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Outline

MapReduce Introduction The MapReduce Computing Model MapReduce Optimization Application: PageRank MapReduce in Hadoop Toward Easier Programming Interfaces: Pig Conclusions

slide-77
SLIDE 77

4 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Data analysis at a large scale

Very large data collections (TB to PB) stored on distributed filesystems:

Query logs Search engine indexes Sensor data

Need efficient ways for analyzing, reformatting, and processing them. In particular, we want:

Parallelization of computation (benefiting from the processing power of all nodes in a cluster)
Resilience to failure

slide-78
SLIDE 78

5 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Centralized computing with distributed data storage

Run the program at client node, get data from the distributed system.

[Figure: the client node runs the program; input data flows from the disks/memory of the storage nodes to the client, output data flows back.]

Downsides: large data flows, no use of the cluster computing resources.

slide-79
SLIDE 79

6 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Pushing the program near the data

[Figure: the program is pushed near the data – each storage node runs program() on its local data in a separate process, and a coordinator() on the client node collects the results.]

MapReduce: A programming model (inspired by standard functional programming operators) to facilitate the development and execution of distributed tasks. Published by Google Labs in 2004 at OSDI [Dean and Ghemawat, 2004]. Widely used since then, open-source implementation in Hadoop.

slide-80
SLIDE 80

7 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

MapReduce in Brief

The programmer defines the program logic as two functions: Map transforms the input into key-value pairs to process Reduce aggregates the list of values for each key The MapReduce environment takes in charge distribution aspects A complex program can be decomposed as a succession of Map and Reduce tasks Higher-level languages (Pig, Hive, etc.) help with writing distributed applications

slide-81
SLIDE 81

8 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Outline

MapReduce Introduction The MapReduce Computing Model MapReduce Optimization Application: PageRank MapReduce in Hadoop Toward Easier Programming Interfaces: Pig Conclusions

slide-82
SLIDE 82

9 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Three operations on key-value pairs

  • 1. User-defined: map : (K, V) → list(K′, V′)

Example

function map(uri, document)
  foreach distinct term in document
    output(term, count(term, document))

  • 2. Fixed behavior: shuffle : list(K′, V′) → list(K′, list(V′)) regroups all intermediate pairs on the key

  • 3. User-defined: reduce : (K′, list(V′)) → list(K′′, V′′)

Example

function reduce(term, counts)
  output(term, sum(counts))
slide-83
SLIDE 83

10 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Job workflow in MapReduce

Important: each pair, at each phase, is processed independently from the other pairs.

[Figure: job workflow – the input, a list of (key, value) pairs …, (kn, vn), …, (k2, v2), (k1, v1), is fed pair by pair to the map operator; the emitted pairs (k′1, v′1), (k′2, v′2), …, (k′1, v′p), (k′1, v′q), … are collected in an intermediate structure that groups values by key; each group, e.g. (k′1, <v′1, v′p, v′q, …>), is then passed to the reduce operator, which emits the output values.]

Network and distribution are transparently managed by the MapReduce environment.

slide-84
SLIDE 84

11 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Example: term count in MapReduce (input)

URL  Document
u1   the jaguar is a new world mammal of the felidae family.
u2   for jaguar, atari was keen to use a 68k family device.
u3   mac os x jaguar is available at a price of us $199 for apple’s new “family pack”.
u4   one such ruling family to incorporate the jaguar into their name is jaguar paw.
u5   it is a big cat.

slide-85
SLIDE 85

12 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Example: term count in MapReduce

map output / shuffle input:
(jaguar, 1), (mammal, 1), (family, 1), (jaguar, 1), (available, 1), (jaguar, 1), (family, 1), (family, 1), (jaguar, 2), . . .

slide-86
SLIDE 86

12 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Example: term count in MapReduce

map output / shuffle input:
(jaguar, 1), (mammal, 1), (family, 1), (jaguar, 1), (available, 1), (jaguar, 1), (family, 1), (family, 1), (jaguar, 2), . . .

shuffle output / reduce input:
(jaguar, <1, 1, 1, 2>), (mammal, <1>), (family, <1, 1, 1>), (available, <1>), . . .

slide-87
SLIDE 87

12 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Example: term count in MapReduce

map output / shuffle input:
(jaguar, 1), (mammal, 1), (family, 1), (jaguar, 1), (available, 1), (jaguar, 1), (family, 1), (family, 1), (jaguar, 2), . . .

shuffle output / reduce input:
(jaguar, <1, 1, 1, 2>), (mammal, <1>), (family, <1, 1, 1>), (available, <1>), . . .

final output:
(jaguar, 5), (mammal, 1), (family, 3), (available, 1), . . .

slide-88
SLIDE 88

13 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Example: simplification of the map

function map(uri, document)
  foreach distinct term in document
    output(term, count(term, document))

can actually be further simplified:

function map(uri, document)
  foreach term in document
    output(term, 1)

since all counts are aggregated. Might be less efficient though (we may need a combiner, see further)

slide-89
SLIDE 89

14 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

A MapReduce cluster

Nodes inside a MapReduce cluster are decomposed as follows: A jobtracker acts as a master node; MapReduce jobs are submitted to it Several tasktrackers run the computation itself, i.e., map and reduce tasks A given tasktracker may run several tasks in parallel Tasktrackers usually also act as data nodes of a distributed filesystem (e.g., GFS, HDFS) + a client node where the application is launched.

slide-90
SLIDE 90

15 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Processing a MapReduce job

A MapReduce job takes care of the distribution, synchronization and failure handling. Specifically:

the input is split into M groups; each group is assigned to a mapper (assignment is based on the data locality principle)
each mapper processes a group and stores the intermediate pairs locally
grouped instances are assigned to reducers thanks to a hash function (shuffle)
intermediate pairs are sorted on their key by the reducer; one obtains grouped instances, submitted to the reduce function

Remark: data locality no longer holds for the reduce phase, since it reads from the mappers.

slide-91
SLIDE 91

16 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Assignment to reducer and mappers

Each mapper task processes a fixed amount of data (a split), usually set to the distributed filesystem block size (e.g., 64 MB)
The number of mapper nodes is a function of the number of mapper tasks and the number of available nodes in the cluster: each mapper node can process (in parallel and sequentially) several mapper tasks
Assignment to mappers tries to optimize data locality: the mapper node in charge of a split is, if possible, one that stores a replica of this split (or, if not possible, a node of the same rack)
The number of reducer tasks is set by the user
Assignment to reducers is done through a hashing of the key, usually uniformly at random; no data locality possible
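For example, with R reducer tasks, the hash-based assignment sketched above amounts to (R and h are notational assumptions consistent with the description, not symbols from the slides):

$$\text{reducer}(k') = h(k') \bmod R.$$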

slide-92
SLIDE 92

17 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Distributed execution of a MapReduce job.

slide-93
SLIDE 93

18 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Processing the term count example

Let the input consist of documents, say, one billion 100-term documents of approximately 1 KB each. The split operation distributes these documents in groups of 64 MB: each group consists of 64,000 documents. Therefore M = ⌈1,000,000,000/64,000⌉ ≈ 16,000 groups. If there are 1,000 mapper nodes, each node processes on average 16 splits. If there are 1,000 reducers, each reducer ri processes all key-value pairs for terms t such that hash(t) = i (1 ≤ i ≤ 1,000)

slide-94
SLIDE 94

19 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Processing the term count example (2)

Assume that hash(’call’) = hash(’mine’) = hash(’blog’) = i = 100. We focus on three mappers mp, mq and mr:

1. Gp_i = <. . . , (’mine’, 1), . . . , (’call’, 1), . . . , (’mine’, 1), . . . , (’blog’, 1), . . . >

2. Gq_i = <. . . , (’call’, 1), . . . , (’blog’, 1), . . . >

3. Gr_i = <. . . , (’blog’, 1), . . . , (’mine’, 1), . . . , (’blog’, 1), . . . >

ri reads Gp_i, Gq_i and Gr_i from the three mappers, sorts their unioned content, and groups the pairs with a common key:

. . . , (’blog’, <1, 1, 1, 1>), . . . , (’call’, <1, 1>), . . . , (’mine’, <1, 1, 1>)

Our reduce function is then applied by ri to each element of this list. The output is (’blog’, 4), (’call’, 2) and (’mine’, 3).

slide-95
SLIDE 95

20 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Failure management

In case of failure, because the tasks are distributed over hundreds or thousands of machines, the chances that a problem occurs somewhere are much larger; starting the job from the beginning is not a valid option. The Master periodically checks the availability and reachability of the tasktrackers (heartbeats) and whether map or reduce jobs make any progress

  • 1. if a reducer fails, its task is reassigned to another tasktracker; this usually requires restarting mapper tasks as well (to produce intermediate groups)

  • 2. if a mapper fails, its task is reassigned to another tasktracker
  • 3. if the jobtracker fails, the whole job should be re-initiated
slide-96
SLIDE 96

21 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Joins in MapReduce

Two datasets, A and B, that we need to join for a MapReduce task
If one of the datasets is small, it can be sent over fully to each tasktracker and exploited inside the map (and possibly reduce) functions
Otherwise, each dataset should be grouped according to the join key, and the result of the join can be computed in the reduce function
Not very convenient to express in MapReduce. Much easier using Pig.

slide-97
SLIDE 97

22 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Using MapReduce for solving a problem

Prefer:

Simple map and reduce functions Mapper tasks processing large data chunks (at least the size of distributed filesystem blocks)

A given application may have:

A chain of map functions (input processing, filtering, extraction. . . ) A sequence of several map-reduce jobs No reduce task when everything can be expressed in the map (zero reducers, or the identity reducer function)

Not the right tool for everything(see further)

slide-98
SLIDE 98

23 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Outline

MapReduce Introduction The MapReduce Computing Model MapReduce Optimization Application: PageRank MapReduce in Hadoop Toward Easier Programming Interfaces: Pig Conclusions

slide-99
SLIDE 99

24 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Combiners

A mapper task can produce a large number of pairs with the same key They need to be sent over the network to the reducer: costly It is often possible to combine these pairs into a single key-value pair

Example

(jaguar, 1), (jaguar, 1), (jaguar, 1), (jaguar, 2) → (jaguar, 5)

combiner : list(V′) → V′ is a function executed (possibly several times) to combine the values for a given key, on a mapper node
No guarantee that the combiner is called
Easy case: the combiner is the same as the reduce function. Possible when the aggregate function α computed by reduce is distributive: α(k1, α(k2, k3)) = α(k1, k2, k3)

slide-100
SLIDE 100

25 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Compression

Data transfers over the network:

From datanodes to mapper nodes (usually reduced using data locality) From mappers to reducers From reducers to datanodes to store the final output

Each of these can benefit from data compression Tradeoff between volume of data transfer and (de)compression time Usually, compressing map outputs using a fast compressor increases efficiency

slide-101
SLIDE 101

26 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Optimizing the shuffle operation

Sorting of pairs on each reducer, to compute the groups: a costly operation

Sorting is much more efficient in memory than on disk
Increasing the amount of memory available for shuffle operations can greatly increase the performance
. . . at the downside of less memory available for map and reduce tasks (but usually not much needed)

slide-102
SLIDE 102

27 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Speculative execution

The MapReduce jobtracker tries detecting tasks that take longer than usual (e.g., because of hardware problems) When detected, such a task is speculatively executed on another tasktracker, without killing the existing task Eventually, when one of the attempts succeeds, the other one is killed

slide-103
SLIDE 103

28 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Outline

MapReduce Introduction The MapReduce Computing Model MapReduce Optimization Application: PageRank MapReduce in Hadoop Toward Easier Programming Interfaces: Pig Conclusions

slide-104
SLIDE 104

29 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

PageRank (Google’s Ranking [Brin and Page, 1998])

Idea: Important pages are pages pointed to by important pages.

g_ij = 0 if there is no link between page i and j;
g_ij = 1/n_i otherwise, with n_i the number of outgoing links of page i.

Definition (Tentative)

Probability that the surfer following the random walk in G has arrived on page i at some distant given point in the future:

pr(i) = ( lim_{k→+∞} (G^T)^k v )_i

where v is some initial column vector.

SLIDE 105–118

Illustrating PageRank Computation

[Figure series: PageRank scores on a 10-node example graph. Starting from the uniform vector (0.100 for every node), the scores converge over successive iterations to approximately (0.050, 0.234, 0.091, 0.149, 0.058, 0.065, 0.095, 0.142, 0.097, 0.019).]

slide-119
SLIDE 119

31 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

PageRank with Damping

May not always converge, or convergence may not be unique. To fix this, the random surfer can at each step randomly jump to any page of the Web with some probability d (1 − d: damping factor).

pr(i) = ( lim_{k→+∞} ((1 − d) G^T + d U)^k v )_i

where U is the matrix with all values equal to 1/N, with N the number of vertices.

slide-120
SLIDE 120

32 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

PageRank computation

PageRank: importance score for nodes in a graph, used for ranking query results of Web search engines. Fixpoint computation, as follows:

  • 1. Compute G. Make sure lines sum to 1.
  • 2. Let u be the uniform vector of sum 1, v = u.
  • 3. Repeat N times: set v := (1 − d) G^T v + d u (say, d = 1/4).

Exercise

Express PageRank computation as a MapReduce problem. Main program? map and reduce functions? combiner function? Illustrate on this graph. [Figure: example graph over nodes 1, 2, 3, 4.]
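One possible way to attack the exercise (a simplified, in-memory sketch in Java; the graph encoding, helper types and constants are my own assumptions, not the official solution): map is called with a node, its current score and its outgoing links and emits one contribution per link; reduce sums the contributions received by a node and adds the random-jump term d/N.

import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.stream.Stream;

// One MapReduce-style PageRank iteration, implementing v := (1 − d) G^T v + d u.
class PageRankIterationSketch {
    static final double D = 0.25;   // damping probability d
    static final int N = 4;         // number of nodes in the example graph

    // map: one input record per node -> one contribution per outgoing link
    static Stream<SimpleEntry<Integer, Double>> map(int node, double score, List<Integer> outLinks) {
        if (outLinks.isEmpty()) return Stream.empty();     // dangling nodes contribute nothing in this sketch
        double contribution = (1 - D) * score / outLinks.size();
        return outLinks.stream().map(target -> new SimpleEntry<>(target, contribution));
    }

    // reduce: sum the contributions received by a node and add the random-jump term d/N
    static SimpleEntry<Integer, Double> reduce(int node, List<Double> contributions) {
        double sum = contributions.stream().mapToDouble(Double::doubleValue).sum();
        return new SimpleEntry<>(node, D / N + sum);
    }
}

Since addition is distributive, a combiner can simply pre-sum the contributions emitted on each mapper node; the main program iterates this job a fixed number of times, feeding each iteration’s output scores back in as the next iteration’s input.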

slide-121
SLIDE 121

33 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Outline

MapReduce Introduction The MapReduce Computing Model MapReduce Optimization Application: PageRank MapReduce in Hadoop Toward Easier Programming Interfaces: Pig Conclusions

slide-122
SLIDE 122

34 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Hadoop

Open-source software, Java-based, managed by the Apache foundation, for large-scale distributed storage and computing Originally developed for Apache Nutch (open-source Web search engine), a part of Apache Lucene (text indexing platform) Open-source implementation of GFS and Google’s MapReduce Yahoo!: a main contributor of the development of Hadoop Hadoop components:

Hadoop filesystem (HDFS) MapReduce Pig (data exploration), Hive (data warehousing): higher-level languages for describing MapReduce applications HBase: column-oriented distributed DBMS ZooKeeper: coordination service for distributed applications

slide-123
SLIDE 123

35 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Hadoop programming interfaces

Different APIs to write Hadoop programs:

A rich Java API (main way to write Hadoop programs)
A Streaming API that can be used to write map and reduce functions in any programming language (using standard inputs and outputs)
A C++ API (Hadoop Pipes)
With a higher-level language (e.g., Pig, Hive)

Advanced features only available in the Java API Two different Java APIs depending on the Hadoop version; presenting the “old” one

slide-124
SLIDE 124

36 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Java map for the term count example

public class TermCountMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, IntWritable> {
  public void map(Text uri, Text document,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    Pattern p = Pattern.compile("[\\p{L}]+");
    Matcher m = p.matcher(document.toString());
    while (m.find()) {
      String term = m.group();
      output.collect(new Text(term), new IntWritable(1));
    }
  }
}

slide-125
SLIDE 125

37 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Java reduce for the term count example

public class TermCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text term, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    output.collect(term, new IntWritable(sum));
  }
}

slide-126
SLIDE 126

38 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Java driver for the term count example

public class TermCount {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(TermCount.class);
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // In a real application, we would have a custom input
    // format to fetch URI-document pairs
    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setMapperClass(TermCountMapper.class);
    conf.setCombinerClass(TermCountReducer.class);
    conf.setReducerClass(TermCountReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}

slide-127
SLIDE 127

39 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Testing and executing a Hadoop job

Required environment:

JDK on client JRE on all Hadoop nodes Hadoop distribution (HDFS + MapReduce) on client and all Hadoop nodes SSH servers on each tasktracker, SSH client on jobtracker (used to control the execution of tasktrackers) An IDE (e.g., Eclipse + plugin) on client

Three different execution modes: local One mapper, one reducer, run locally from the same JVM as the client pseudo-distributed mappers and reducers are launched on a single machine, but communicate over the network distributed over a cluster for real runs

slide-128
SLIDE 128

40 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Debugging MapReduce

Easiest: debugging in local mode Web interface with status information about the job Standard output and error channels saved on each node, accessible through the Web interface Counters can be used to track side information across a MapReduce job (e.g., number of invalid input records) Remote debugging possible but complicated to set up (impossible to know in advance where a map or reduce task will be executed) IsolationRunner allows to run in isolation part of the MapReduce job

slide-129
SLIDE 129

41 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Task JVM reuse

By default, each map and reduce task (of a given split) is run in a separate JVM When there is a lot of initialization to be done, or when splits are small, might be useful to reuse JVMs for subsequent tasks Of course, only works for tasks run on the same node

slide-130
SLIDE 130

42 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Hadoop in the cloud

Possibly to set up one’s own Hadoop cluster But often easier to use clusters in the cloud that support MapReduce:

Amazon EC2 Cloudera etc.

Not always easy to know the cluster’s configuration (in terms of racks, etc.) when on the cloud, which hurts data locality in MapReduce

slide-131
SLIDE 131

43 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Outline

MapReduce Toward Easier Programming Interfaces: Pig Basics Pig operators From Pig to MapReduce Conclusions

slide-132
SLIDE 132

44 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Outline

MapReduce Toward Easier Programming Interfaces: Pig Basics Pig operators From Pig to MapReduce Conclusions

slide-133
SLIDE 133

45 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Pig Latin

Motivation: define high-level languages that use MapReduce as an underlying data processor. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. Pig Latin statements are generally organized in the following manner:

1. A LOAD statement reads data from the file system as a relation (list of tuples).
2. A series of “transformation” statements process the data.
3. A STORE statement writes output to the file system; or, a DUMP statement displays output to the screen.

Statements are executed as a composition of MapReduce jobs.

slide-134
SLIDE 134

46 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Using Pig

Part of Hadoop, downloadable from the Hadoop Web site
Interactive interface (Grunt) and batch mode
Two execution modes:
– local: data is read from disk, operations are directly executed, no MapReduce
– MapReduce: on top of a MapReduce cluster (pipeline of MapReduce jobs)
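Besides Grunt and batch mode, Pig also exposes a Java API. A minimal sketch using PigServer, assuming the journal.txt file introduced on the following slides; ExecType.LOCAL avoids MapReduce entirely, while ExecType.MAPREDUCE would submit a pipeline of jobs to the cluster.

import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigDriver {
  public static void main(String[] args) throws IOException {
    // Switch to ExecType.MAPREDUCE to run on a cluster
    PigServer pig = new PigServer(ExecType.LOCAL);
    pig.registerQuery("articles = load 'journal.txt' as (year: chararray, journal: chararray, title: chararray);");
    pig.registerQuery("sr_articles = filter articles by journal == 'SIGMOD Record';");
    pig.store("sr_articles", "sr_articles_out");   // materialize the result
  }
}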

slide-135
SLIDE 135

47 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Example input data

A flat file, tab-separated, extracted from DBLP.

2005  VLDB J.  Model-based approximate querying in sensor networks.
1997  VLDB J.  Dictionary-Based Order-Preserving String Compression.
2003  SIGMOD Record  Time management for new faculty.
2001  VLDB J.  E-Services - Guest editorial.
2003  SIGMOD Record  Exposing undergraduate students to database system internals.
1998  VLDB J.  Integrating Reliable Memory in Databases.
1996  VLDB J.  Query Processing and Optimization in Oracle Rdb
1996  VLDB J.  A Complete Temporal Relational Algebra.
1994  SIGMOD Record  Data Modelling in the Large.
2002  SIGMOD Record  Data Mining: Concepts and Techniques - Book Review.

slide-136
SLIDE 136

48 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Computing average number of publications per year

-- Load records from the file
articles = load 'journal.txt'
    as (year: chararray, journal: chararray, title: chararray);
sr_articles = filter articles by journal == 'SIGMOD Record';
year_groups = group sr_articles by year;
avg_nb = foreach year_groups generate group, COUNT(sr_articles.title);
dump avg_nb;

slide-137
SLIDE 137

49 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

The data model

The model allows nesting of bags and tuples. Example: the year_groups temporary bag.

group: 1990
sr_articles: {
  (1990, SIGMOD Record, SQL For Networks of Relations.),
  (1990, SIGMOD Record, New Hope on Data Models and Types.)
}

Unlimited nesting, but no references, no constraint of any kind (for parallelization purposes).

slide-138
SLIDE 138

50 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Flexible representation

Pig allows the representation of heterogeneous data, in the spirit of semi-structured data models (e.g., XML). The following is a bag with heterogeneous tuples.

{
  (2005, {'SIGMOD Record', 'VLDB J.'}, {'article1', 'article2'}),
  (2003, 'SIGMOD Record', {'article1', 'article2'}, {'author1', 'author2'})
}

slide-139
SLIDE 139

51 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Outline

MapReduce
Toward Easier Programming Interfaces: Pig
– Basics
– Pig operators
– From Pig to MapReduce
Conclusions

slide-140
SLIDE 140

52 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Main Pig operators

Operator  Description
foreach   Apply one or several expression(s) to each of the input tuples
filter    Filter the input tuples with some criteria
distinct  Remove duplicates from an input
join      Join of two inputs
group     Regrouping of data
cogroup   Associate two related groups from distinct inputs
cross     Cross product of two inputs
order     Order an input
limit     Keep only a fixed number of elements
union     Union of two inputs (note: no need to agree on a same schema, as in SQL)
split     Split a relation based on a condition

slide-141
SLIDE 141

53 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Example dataset

A simple flat file with tab-separated fields.

1995 Foundations of Databases Abiteboul 1995 Foundations of Databases Hull 1995 Foundations of Databases Vianu 2010 Web Data Management Abiteboul 2010 Web Data Management Manolescu 2010 Web Data Management Rigaux 2010 Web Data Management Rousset 2010 Web Data Management Senellart

NB: Pig accepts inputs from user-defined functions, written in Java, which makes it possible to extract data from any source; a sketch follows below.
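A sketch of a Pig user-defined function in Java (an eval function, for brevity; custom input functions extend LoadFunc instead, but are written and registered in the same way). The class name ToUpper is illustrative only.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ToUpper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // Return null on empty input, as is customary for Pig UDFs
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return input.get(0).toString().toUpperCase();
  }
}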

slide-142
SLIDE 142

54 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

The group operator

The “program”:

books = load 'webdam-books.txt'
    as (year: int, title: chararray, author: chararray);
group_auth = group books by title;
authors = foreach group_auth generate group, books.author;
dump authors;

and the result:

(Foundations of Databases, {(Abiteboul),(Hull),(Vianu)})
(Web Data Management, {(Abiteboul),(Manolescu),(Rigaux),(Rousset),(Senellart)})

slide-143
SLIDE 143

55 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Unnesting with flatten

Flatten serves to unnest a nested field.

-- Take the 'authors' bag and flatten the nested set
flattened = foreach authors generate group, flatten(author);

Applied to the previous authors bags, one obtains:

(Foundations of Databases,Abiteboul)
(Foundations of Databases,Hull)
(Foundations of Databases,Vianu)
(Web Data Management,Abiteboul)
(Web Data Management,Manolescu)
(Web Data Management,Rigaux)
(Web Data Management,Rousset)
(Web Data Management,Senellart)

slide-144
SLIDE 144

56 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

The cogroup operator

Allows two data sources to be gathered into nested fields. Example: a file with publishers:

Fundations of Databases  Addison-Wesley  USA
Fundations of Databases  Vuibert  France
Web Data Management  Cambridge University Press  USA

The program:

publishers = load 'webdam-publishers.txt'
    as (title: chararray, publisher: chararray);
cogrouped = cogroup flattened by group, publishers by title;

slide-145
SLIDE 145

57 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

The result

For each grouped field value, two nested sets, coming from both sources.

(Foundations of Databases,
  { (Foundations of Databases,Abiteboul),
    (Foundations of Databases,Hull),
    (Foundations of Databases,Vianu) },
  { (Foundations of Databases,Addison-Wesley),
    (Foundations of Databases,Vuibert) }
)

A kind of join? Yes, at least a preliminary step.

slide-146
SLIDE 146

58 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Joins

Same as before, but produces a flat output (cross product of the inner nested bags). The nested model is usually more elegant and easier to deal with.

-- Take the 'flattened' bag, join with 'publishers'
joined = join flattened by group, publishers by title;

Result (truncated):

(Foundations of Databases,Abiteboul,Fundations of Databases,Addison-Wesley)
(Foundations of Databases,Abiteboul,Fundations of Databases,Vuibert)
(Foundations of Databases,Hull,Fundations of Databases,Addison-Wesley)
(Foundations of Databases,Hull, ...

slide-147
SLIDE 147

59 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Outline

MapReduce
Toward Easier Programming Interfaces: Pig
– Basics
– Pig operators
– From Pig to MapReduce
Conclusions

slide-148
SLIDE 148

60 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Plans

A Pig program describes a logical data flow. This is implemented with a physical plan, in terms of grouping and nesting operations. The physical plan is in turn (for MapReduce execution) implemented as a sequence of map and reduce steps.

slide-149
SLIDE 149

61 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Physical operators

Local Rearrange: group tuples with the same key, on a local machine
Global Rearrange: group tuples with the same key, globally on a cluster
Package: construct a nested tuple from tuples that have been grouped
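A hedged, hand-written equivalent of the "group books by title" program from the earlier slide, to illustrate the translation: the map phase plays the role of Local/Global Rearrange (emit each tuple under its grouping key), and the reduce phase plays the role of Package (rebuild a nested bag per key). Class names are illustrative; Pig generates a plan of this shape automatically.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class GroupByTitle {

  public static class RearrangeMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      String[] fields = line.toString().split("\t");   // year, title, author
      output.collect(new Text(fields[1]), new Text(fields[2]));  // key = title
    }
  }

  public static class PackageReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text title, Iterator<Text> authors,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // Rebuild the nested bag {(author1),(author2),...} for this title
      StringBuilder bag = new StringBuilder("{");
      while (authors.hasNext()) {
        bag.append('(').append(authors.next()).append(')');
        if (authors.hasNext()) bag.append(',');
      }
      output.collect(title, new Text(bag.append('}').toString()));
    }
  }
}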

slide-150
SLIDE 150

62 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Translation of a simple Pig program

slide-151
SLIDE 151

63 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

A more complex join-group program

-- Load books, but keep only books from Victor Vianu
books = load 'webdam-books.txt'
    as (year: int, title: chararray, author: chararray);
vianu = filter books by author == 'Vianu';
publishers = load 'webdam-publishers.txt'
    as (title: chararray, publisher: chararray);

-- Join on the book title
joined = join vianu by title, publishers by title;

-- Now, group on the author name
grouped = group joined by vianu::author;

-- Finally, count the publishers
-- (nb: we should remove duplicates!)
count = foreach grouped generate group, COUNT(joined.publisher);

slide-152
SLIDE 152

64 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Translation of a join-group program

slide-153
SLIDE 153

65 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Outline

MapReduce
Toward Easier Programming Interfaces: Pig
Conclusions

slide-154
SLIDE 154

66 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

MapReduce limitations (1/2)

– High latency: launching a MapReduce job has a high overhead, and reduce functions are called only after all map functions have succeeded; not suitable for applications that need a quick result.
– Batch processing only: MapReduce excels at processing a large collection, not at retrieving individual items from a collection.
– Write-once, read-many mode: no real possibility of updating a dataset using MapReduce; it has to be regenerated from scratch.
– No transactions: no concurrency control at all, completely unsuitable for transactional applications [Pavlo et al., 2009].

slide-155
SLIDE 155

67 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

MapReduce limitations (2/2)

– Relatively low-level: ongoing efforts toward more high-level languages: Scope [Chaiken et al., 2008], Pig [Olston et al., 2008, Gates et al., 2009], Hive [Thusoo et al., 2009], Cascading (http://www.cascading.org/).
– No structure: implies lack of indexing, difficult to optimize, etc. [DeWitt and Stonebraker, 2008].
– Hard to tune: number of reducers? Compression? Memory available at each node? etc. (see the sketch below)
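A sketch of typical tuning knobs, assuming the TermCount driver and Hadoop 0.20/1.x names; good values depend on the data, the cluster, and the job, which is precisely why tuning is hard.

import org.apache.hadoop.mapred.JobConf;

public class TuningExample {
  public static JobConf configure() {
    JobConf conf = new JobConf(TermCount.class);
    conf.setNumReduceTasks(8);                       // how many reducers?
    conf.setCompressMapOutput(true);                 // compress intermediate data?
    conf.set("mapred.child.java.opts", "-Xmx512m");  // memory available to each task JVM
    return conf;
  }
}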

slide-156
SLIDE 156

68 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Hybrid systems

Best of both worlds?

– DBMSs are good at transactions, point queries, structured data
– MapReduce is good at scalability, batch processing, key-value data

HadoopDB [Abouzeid et al., 2009]: a standard relational DBMS at each node of a cluster, with MapReduce handling communication between nodes. It is also possible to use DBMS inputs natively in Hadoop, but with no control over data locality.

slide-157
SLIDE 157

69 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Job Scheduling

Multiple jobs are concurrently submitted to the MapReduce jobtracker; fair scheduling is required:

– each submitted job should have some share of the cluster
– prioritization of jobs
– long-standing jobs should not block quick jobs
– fairness with respect to users

Standard Hadoop scheduler: priority queue (see the sketch below).
Hadoop Fair Scheduler: ensures cluster resources are shared among users; preemption (= killing running tasks) is possible in case the sharing becomes unbalanced.
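A sketch, assuming the TermCount job and the Hadoop 0.20/1.x API: with the standard priority-queue scheduler, a job can request a higher priority. The Fair Scheduler, by contrast, is configured on the jobtracker side (e.g., via the mapred.jobtracker.taskScheduler property) and assigns jobs to pools.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class PriorityExample {
  public static JobConf configure() {
    JobConf conf = new JobConf(TermCount.class);
    conf.setJobPriority(JobPriority.HIGH);   // VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW
    return conf;
  }
}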

slide-158
SLIDE 158

70 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

What you should remember on distributed computing

MapReduce is a simple model for batch processing of very large collections.
⇒ good for data analytics; not good for point queries (high latency).
The system brings robustness against failure of a component, as well as transparent distribution and scalability.
⇒ more expressive languages are required on top of it (Pig).

slide-159
SLIDE 159

71 / 71

Distributed Computing Abiteboul, Manolescu, Rigaux, Rousset, Senellart

Resources

– Original description of the MapReduce framework [Dean and Ghemawat, 2004]
– Hadoop distribution and documentation available at http://hadoop.apache.org/
– Documentation for Pig available at http://wiki.apache.org/pig/
– Excellent textbook on Hadoop [White, 2009]
– The material in these slides is from [Abiteboul et al., 2011], freely available at http://webdam.inria.fr/Jorge

slide-160
SLIDE 160

References I

Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, and Pierre Senellart. Web Data Management. Cambridge University Press, New York, NY, USA, 2011.

Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, and Avi Silberschatz. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Proceedings of the VLDB Endowment (PVLDB), 2(1):922–933, 2009.

Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):107–117, April 1998.

Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. SCOPE: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment (PVLDB), 1(2):1265–1276, 2008.

slide-161
SLIDE 161

References II

Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Intl. Symp. on Operating System Design and Implementation (OSDI), pages 137–150, 2004.

D. DeWitt and M. Stonebraker. MapReduce: A major step backwards. DatabaseColumn blog, 2008. http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/.

Alan Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan Narayanam, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava. Building a High-Level Dataflow System on top of MapReduce: The Pig Experience. Proceedings of the VLDB Endowment (PVLDB), 2(2):1414–1425, 2009.

Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: a not-so-foreign language for data processing. In Proc. ACM Intl. Conf. on the Management of Data (SIGMOD), pages 1099–1110, 2008.

slide-162
SLIDE 162

References III

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proc. ACM Intl. Conf. on the Management of Data (SIGMOD), pages 165–178, 2009.

Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive – A Warehousing Solution Over a Map-Reduce Framework. Proceedings of the VLDB Endowment (PVLDB), 2(2):1626–1629, 2009.

Tom White. Hadoop: The Definitive Guide. O'Reilly, Sebastopol, CA, USA, 2009.