CS345a: Data Mining
Jure Leskovec, Stanford University
[Diagram: the classical setup for machine learning, statistics, and "classical" data mining — a single machine with CPU, memory, and disk; the data fits in memory.]
20+ billion web pages x 20KB = 400+ TB
1 computer reads 30-35 MB/sec from disk
- ~4 months to read the web
~1,000 hard drives to store the web
Even more to do something with the data
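A quick back-of-envelope check of that figure (a sketch of mine; the 33 MB/sec rate is an arbitrary pick from the 30-35 MB/sec range above):

total_bytes = 400e12               # 400+ TB of web pages
read_rate = 33e6                   # ~30-35 MB/sec from one disk
seconds = total_bytes / read_rate  # about 12 million seconds
days = seconds / (60 * 60 * 24)
print(f"{days:.0f} days, roughly {days / 30:.1f} months")   # ~140 days, i.e. ~4-5 months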
Web data sets can be very large
- Tens to hundreds of terabytes
Cannot mine on a single server
Standard architecture emerging:
- Cluster of commodity Linux nodes
- Gigabit ethernet interconnect
How to organize computations on this architecture?
- Mask issues such as hardware failure
Traditional big-iron box (circa 2003)
- 8 2GHz Xeons
- 64GB RAM
- 8TB disk
- 758,000 USD
Prototypical Google rack (circa 2003)
- 176 2GHz Xeons
- 176GB RAM
- ~7TB disk
- 278,000 USD
In Aug 2006 Google had ~450,000 machines
[Diagram: commodity cluster architecture — each rack contains 16-64 nodes (CPU, memory, disk) connected by a rack switch at 1 Gbps between any pair of nodes; a 2-10 Gbps backbone of switches connects the racks.]
Large-scale computing for data mining problems on commodity hardware
- PCs connected in a network
- Need to process huge datasets on large clusters of computers
Challenges:
- How do you distribute computation?
- Distributed programming is hard
- Machines fail
Map-reduce addresses all of the above
- Google's computational/data manipulation model
- Elegant way to work with big data
Y h ’ ll b ti ith d i
Yahoo’s collaboration with academia
- Foster open research
- Focus on large‐scale, highly parallel
Focus on large scale, highly parallel computing
Seed Facility: M45
y
- Datacenter in a Box (DiB)
- 1000 nodes, 4000 cores, 3TB RAM,
1 5PB disk 1.5PB disk
- High bandwidth connection to Internet
- Located on Yahoo! corporate campus
p p
- World’s top 50 supercomputer
Implications of such a computing environment:
- Single-machine performance does not matter
- Add more machines
- Machines break
- One server may stay up 3 years (1,000 days)
- If you have 1,000 servers, expect to lose one per day
How can we make it easy to write distributed programs?
Idea:
- Bring computation close to the data
- Store files multiple times for reliability
Need:
- Programming model
- Map‐Reduce
- Infrastructure – File system
- Google: GFS
- Hadoop: HDFS
First-order problem: if nodes can fail, how can we store data persistently?
Answer: Distributed File System
- Provides global file namespace
- Google GFS; Hadoop HDFS; Kosmix KFS
Typical usage pattern
- Huge files (100s of GB to TB)
- Data is rarely updated in place
- Reads and appends are common
Reliable distributed file system for petabyte scale
Data kept in 64-megabyte "chunks" spread across thousands of machines
Each chunk replicated, usually 3 times, on different machines
- Seamless recovery from disk or machine failure
[Diagram: chunks C0, C1, C2, C5, D0, D1 replicated across Chunk server 1 through Chunk server N.]
Bring computation directly to the data!
Chunk servers
- File is split into contiguous chunks
- Typically each chunk is 16-64MB
- Each chunk replicated (usually 2x or 3x)
- Try to keep replicas in different racks
Master node
- a.k.a. Name Node in HDFS
- Stores metadata
- Might be replicated
Client library for file access
- Talks to master to find chunk servers
- Connects directly to chunk servers to access data
We have a large file of words:
- one word per line
Count the number of times each distinct word appears in the file
Sample application:
- analyze web server logs to find popular URLs
Case 1: Entire file fits in memory
Case 2: File too large for memory, but all <word, count> pairs fit in memory (sketched below)
Case 3: File on disk, too many distinct words to fit in memory
- sort datafile | uniq -c
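For Case 2, a minimal single-machine sketch (mine, not from the slides) is to stream the file once and keep only the <word, count> table in memory:

from collections import Counter

def count_words(path):
    # Case 2: the file is streamed line by line; only the <word, count>
    # table has to fit in memory, not the file itself
    counts = Counter()
    with open(path) as f:
        for line in f:          # one word per line, as in the problem statement
            word = line.strip()
            if word:
                counts[word] += 1
    return counts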
To make it slightly harder, suppose we have a large corpus of documents
Count the number of times each distinct word occurs in the corpus
- words(docs/*) | sort | uniq -c
- where words takes a file and outputs the words in it, one to a line
The above captures the essence of MapReduce
- Great thing is it is naturally parallelizable
Read a lot of data
Map
- Extract something you care about
Shuffle and Sort
Reduce
- Aggregate, summarize, filter or transform
Write the data
Outline stays the same, map and reduce change to fit the problem
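A minimal single-process sketch of that outline (my own illustration, not from the slides): the driver below stands in for the shuffle-and-sort step with an in-memory sort, so only the map and reduce functions change from problem to problem.

from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, map_fn, reduce_fn):
    # Map: extract the (key, value) pairs you care about from each input record
    pairs = [kv for record in inputs for kv in map_fn(record)]
    # Shuffle and sort: bring all values for the same key together
    pairs.sort(key=itemgetter(0))
    # Reduce: aggregate, summarize, filter or transform each key's values
    return [out
            for key, group in groupby(pairs, key=itemgetter(0))
            for out in reduce_fn(key, [v for _, v in group])]

Here map_fn and reduce_fn are generators that yield key-value pairs; for word count, map_fn would yield (word, 1) for every word and reduce_fn would yield (word, total). The real systems differ mainly in that each phase runs as many distributed tasks rather than one loop.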
Program specifies two primary methods:
- Map(k, v) → <k', v'>*
- Reduce(k', <v'>*) → <k', v''>*
All v' with same k' are reduced together and processed in v' order
Worked example (Map and Reduce are provided by the programmer; the system makes only sequential reads of the data):
Big document: "The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need to do to build any work station or habitat structure on the moon or Mars,' said Allard Beutel."
MAP: reads input and produces a set of key-value pairs
- (the, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) ...
Group by key: collect all pairs with the same key
- (crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) ...
Reduce: collect all values belonging to the key and output
- (crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) ...
map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(result)
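The same word count as runnable Python in the style of Hadoop Streaming (a sketch of mine, not part of the slides): the mapper and reducer read from stdin and write tab-separated key-value pairs, and the sort between the two commands plays the role of the shuffle.

import sys
from itertools import groupby

def mapper(lines):
    # Map: emit (word, 1) for every word in the input
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Reduce: input is sorted by key, so all counts for a word are adjacent
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Usage: cat docs/* | python wc.py map | sort | python wc.py reduce
    mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)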
Map-Reduce environment takes care of:
- Partitioning the input data
- Scheduling the program's execution across a set of machines
- Handling machine failures
- Managing required inter-machine communication
Allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed cluster
Big document
MAP: reads input and produces a set of key-value pairs
Group by key: collect all pairs with the same key
Reduce: collect all values belonging to the key and output
Programmer specifies
- Map and Reduce and input files
Workflow
- Read inputs as a set of key-value pairs
- Map transforms input kv-pairs into a new set of k'v'-pairs
- Sorts & shuffles the k'v'-pairs to output nodes
- All k'v'-pairs with a given k' are sent to the same reduce
- Reduce processes all k'v'-pairs grouped by key into new k''v''-pairs
- Write the resulting pairs to files
All phases are distributed, with many tasks doing the work
[Diagram: Input 0, Input 1, Input 2 → Map 0, Map 1, Map 2 → Shuffle → Reduce 0, Reduce 1 → Out 0, Out 1]
Input and final output are stored on a distributed file system
- Scheduler tries to schedule map tasks "close" to the physical storage location of input data
Intermediate results are stored on the local FS of map and reduce workers
Output is often the input to another map-reduce task
Master data structures
- Task status: (idle, in‐progress, completed)
- Idle tasks get scheduled as workers become available
- When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
- Master pushes this info to reducers
Master pings workers periodically to detect failures
Map worker failure
- Map tasks completed or in-progress at worker are reset to idle
- Reduce workers are notified when task is rescheduled on another worker
Reduce worker failure
- Only in-progress tasks are reset to idle
Master failure
- MapReduce task is aborted and client is notified
Fine granularity tasks: map tasks >> machines
- Minimizes time for fault recovery
- Can pipeline shuffling with map execution
- Better dynamic load balancing
Often use 200,000 map & 5,000 reduce tasks, running on 2,000 machines
Slow workers significantly slow the completion time:
- Other jobs on the machine
- Bad disks
- Weird things
Solution:
- Near end of phase, spawn backup copies of tasks
- Whichever one finishes first “wins”
Effect:
- Dramatically shortens job completion time
Backup tasks reduce job time
System deals with failures
Often a map task will produce many pairs of the form (k,v1), (k,v2), ... for the same key k
- E.g., popular words in Word Count
Can save network time by pre-aggregating at the mapper
- combine(k1, list(v1)) → v2
- Usually same as reduce function
Works only if reduce function is commutative and associative
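A small illustration (my own sketch, not from the slides): in word count the combiner is the same summation the reducer performs, applied to each map task's local output before anything is sent over the network.

from collections import defaultdict

def combine(map_output):
    # Pre-aggregate (word, 1) pairs locally before the shuffle.
    # Safe here because addition is commutative and associative.
    local = defaultdict(int)
    for word, count in map_output:      # e.g. [("the", 1), ("the", 1), ("crew", 1)]
        local[word] += count
    return list(local.items())          # e.g. [("the", 2), ("crew", 1)]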
Inputs to map tasks are created by contiguous splits of the input file
For reduce, we need to ensure that records with the same intermediate key end up at the same worker
System uses a default partition function, e.g., hash(key) mod R
Sometimes useful to override
- E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
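A sketch of the two partition functions (mine, not from the slides; hash and hostname parsing are just Python built-ins standing in for the framework's versions):

from urllib.parse import urlparse

def default_partition(key, R):
    # Default: hash(key) mod R spreads keys evenly over the R reducers
    return hash(key) % R

def host_partition(url, R):
    # Override: hash(hostname(URL)) mod R sends every URL from the same
    # host to the same reducer, hence into the same output file
    return hash(urlparse(url).hostname) % R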
Input does not have to be big
E.g., want to simulate disease spreading in a (small) social network
Input:
- Each line: node id, virus parameters (death, birth rate)
Map:
- Reads a line of input and simulates the virus
- Output: triplets (node id, virus id, hit time)
Reduce:
- Collect the node IDs and see which nodes are most vulnerable
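A rough sketch of what such a job could look like (entirely my illustration; the line format, the toy simulate_virus helper, and the three virus settings are assumptions, not something the slides specify):

import random

def simulate_virus(node_id, death_rate, birth_rate, seed=0):
    # Toy stand-in for a real epidemic simulation: draw a made-up hit time
    rng = random.Random(hash((node_id, seed)))
    return rng.expovariate(birth_rate) / max(death_rate, 1e-9)

def map_fn(line):
    # line: "node_id death_rate birth_rate" (assumed format)
    node_id, death_rate, birth_rate = line.split()
    for virus_id in range(3):           # a few hypothetical virus settings
        hit_time = simulate_virus(node_id, float(death_rate), float(birth_rate), seed=virus_id)
        yield node_id, (virus_id, hit_time)

def reduce_fn(node_id, values):
    # values: all (virus_id, hit_time) pairs produced for this node
    yield node_id, min(t for _, t in values)    # earliest hit time ~ most vulnerable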
Suppose we have a large web corpus
Let's look at the metadata file
- Lines of the form (URL, size, date, ...)
For each host, find the total number of bytes
- i.e., the sum of the page sizes for all URLs from that host
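A sketch of this job (mine; the metadata line format is taken from the slide, the helpers are Python built-ins):

from urllib.parse import urlparse

def map_fn(line):
    # line: "URL size date ..."  ->  emit (host, size)
    url, size, *rest = line.split()
    yield urlparse(url).hostname, int(size)

def reduce_fn(host, sizes):
    # sizes: the page sizes of all URLs from this host
    yield host, sum(sizes)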
Statistical machine translation:
- Need to count the number of times every 5-word sequence occurs in a large corpus of documents
Easy with MapReduce:
- Map: extract (5-word sequence, count) from document
- Reduce: combine counts
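A sketch of that pair of functions (mine, not from the slides), in the same shape as the word-count example:

def map_fn(document_text):
    # Emit every 5-word sequence in the document with a count of 1
    words = document_text.split()
    for i in range(len(words) - 4):
        yield " ".join(words[i:i + 5]), 1

def reduce_fn(five_gram, counts):
    # Combine the per-document counts for this 5-word sequence
    yield five_gram, sum(counts)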
Find all occurrences of the given pattern in a very large set of files
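This is the classic distributed grep: map emits a line when it matches the pattern, and reduce is essentially the identity. A sketch (mine; the pattern and the (filename, line) record shape are assumptions):

import re

PATTERN = re.compile(r"error")              # hypothetical pattern

def map_fn(record):
    filename, line = record                 # each input record: (filename, line)
    if PATTERN.search(line):
        yield filename, line                # emit matching lines, keyed by file

def reduce_fn(filename, lines):
    for line in lines:                      # identity reduce: pass matches through
        yield filename, line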
Given a directed graph as an adjacency list:
- src1: dest11, dest12, ...
- src2: dest21, dest22, ...
Construct the graph in which all the links are reversed
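A sketch of the reversal (mine, not from the slides): map emits (dest, src) for every edge, and reduce gathers the sources of each destination into the reversed adjacency list.

def map_fn(adjacency_line):
    # adjacency_line: "src: dest1, dest2, ..."
    src, dests = adjacency_line.split(":")
    for dest in dests.split(","):
        yield dest.strip(), src.strip()     # reverse each edge

def reduce_fn(dest, srcs):
    # Everything that linked to `dest` becomes its out-links in the reversed graph
    yield f"{dest}: {', '.join(sorted(srcs))}"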
Google
- Not available outside Google
Hadoop
- An open‐source implementation in Java
- Uses HDFS for stable storage
- Download: http://lucene.apache.org/hadoop/
Aster Data
- Cluster-optimized SQL database that also implements MapReduce
- Made available free of charge for this class
Ability to rent computing by the hour
- Additional services, e.g., persistent storage
We will be using Amazon's "Elastic Compute Cloud" (EC2)
Aster Data and Hadoop can both be run on EC2
In discussions with Amazon to provide access free of charge for the class
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/mapreduce.html
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System. http://labs.google.com/papers/gfs.html
Hadoop Wiki
- Introduction: http://wiki.apache.org/lucene-hadoop/
- Getting Started: http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
- Map/Reduce Overview: http://wiki.apache.org/lucene-hadoop/HadoopMapReduce and http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses
- Eclipse Environment: http://wiki.apache.org/lucene-hadoop/EclipseEnvironment
Javadoc
- http://lucene.apache.org/hadoop/docs/api/
Releases from Apache download mirrors
- http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
Nightly builds of source
- http://people.apache.org/dist/lucene/hadoop/nightly/
Source code from subversion
- http://lucene.apache.org/hadoop/version_control.html
Programming model inspired by functional language primitives
Partitioning/shuffling similar to many large-scale sorting systems
- NOW‐Sort ['97]
Re‐execution for fault tolerance
- BAD-FS ['04] and TACC ['97]
Locality optimization has parallels with Active Disks/Diamond work
- Active Disks ['01], Diamond ['04]
Backup tasks similar to Eager Scheduling in the Charlotte system
- Charlotte ['96]
Dynamic load balancing solves a similar problem to River's distributed queues
- River ['99]