

Large-scale Data Mining: MapReduce and beyond

Part 1: Basics

Spiros Papadimitriou, Google
Jimeng Sun, IBM Research
Rong Yan, Facebook

Monday, August 23, 2010


Data everywhere

- Flickr (3 billion photos)
- YouTube (83M videos, 15 hrs uploaded/min)
- Web (10B videos watched / mo.)
- Digital photos (500 billion / year)
- All broadcast (70,000 TB / year)
- Yahoo! Webmap (3 trillion links, 300 TB compressed, 5 PB disk)
- Human genome (2-30 TB uncompressed)

So what?

more is: more …
more is: different!

Data everywhere

- Opportunities
  - Real-time access to content
  - Richer context from users and hyperlinks
  - Abundant training examples
  - “Brute-force” methods may suffice
- Challenges
  - “Dirtier” data
  - Efficient algorithms
  - Scalability (with reasonable cost)



“The Google Way”

“All models are wrong, but some are useful” – George Box
“All models are wrong, and increasingly you can succeed without them.” – Peter Norvig

- Google PageRank
- Shotgun gene sequencing
- Language translation
- …

Chris Anderson, Wired (July 2008)


Getting over the marketing hype…

Cloud Computing = Internet + Commoditization / standardization

‘It’s what I and many others have worked towards our entire careers. It’s just happening now.’ – Eric Schmidt


This tutorial

- Is not about cloud computing
- But about large-scale data processing

Data + Algorithms

Tutorial overview

- Part 1 (Spiros): Basic concepts & tools
  - MapReduce & distributed storage
  - Hadoop / HBase / Pig / Cascading / Hive
- Part 2 (Jimeng): Algorithms
  - Information retrieval
  - Graph algorithms
  - Clustering (k-means)
  - Classification (k-NN, naïve Bayes)
- Part 3 (Rong): Applications
  - Text processing
  - Data warehousing
  - Machine learning

Outline

- Introduction
- MapReduce & distributed storage
- Hadoop
  - HBase
  - Pig
  - Cascading
  - Hive
- Summary


What is MapReduce?

- Programming model?
- Execution environment?
- Software package?

It’s all of those things, depending on who you ask…

“MapReduce” (this talk) == distributed computation + distributed storage + scheduling / fault tolerance


Example – Programming model

employees.txt (tab-separated):

LAST     FIRST   SALARY
Smith    John    $90,000
Brown    David   $70,000
Johnson  George  $95,000
Yates    John    $80,000
Miller   Bill    $65,000
Moore    Jack    $85,000
Taylor   Fred    $75,000
Smith    David   $80,000
Harris   John    $90,000
...      ...     ...

Q: “What is the frequency of each first name?”

# mapper
def getName(line):
    return line.split('\t')[1]

# reducer
def addCounts(hist, name):
    hist[name] = hist.get(name, 0) + 1
    return hist

input = open('employees.txt', 'r')
intermediate = map(getName, input)
result = reduce(addCounts, intermediate, {})  # Python 2: reduce is a builtin


Example – Programming model (key-value iterators)

Q: “What is the frequency of each first name?”

# mapper: emit (key, value) pairs
def getName(line):
    return (line.split('\t')[1], 1)

# reducer: combine per-key counts
def addCounts(hist, (name, c)):  # Python 2 tuple-parameter syntax
    hist[name] = hist.get(name, 0) + c
    return hist

input = open('employees.txt', 'r')
intermediate = map(getName, input)
result = reduce(addCounts, intermediate, {})
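The MapReduce runtime does more than the script above suggests: between the map and reduce phases it groups all values sharing a key, and the reducer sees one group at a time. A minimal Python 3 simulation of that map → sort → group → reduce flow (`run_job`, `map_name` and `reduce_counts` are illustrative names, not from the slides):

```python
from itertools import groupby
from operator import itemgetter

def map_name(line):
    """Mapper: emit (first name, 1) for one TSV record LAST<TAB>FIRST<TAB>SALARY."""
    return (line.split('\t')[1], 1)

def reduce_counts(name, counts):
    """Reducer: sum all counts observed for one name."""
    return (name, sum(counts))

def run_job(lines):
    # Map phase: one key-value pair per input record.
    intermediate = [map_name(line) for line in lines]
    # Shuffle phase: sort by key, then group all values for the same key.
    intermediate.sort(key=itemgetter(0))
    return dict(reduce_counts(key, (c for _, c in group))
                for key, group in groupby(intermediate, key=itemgetter(0)))

records = ["Smith\tJohn\t$90,000", "Brown\tDavid\t$70,000", "Yates\tJohn\t$80,000"]
print(run_job(records))  # {'David': 1, 'John': 2}
```

The sort-then-group step is exactly what the framework's shuffle does at scale, which is why the reducer can assume it receives every value for a given key together.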


Example – Programming model (Hadoop / Java)

public class HistogramJob extends Configured implements Tool {

  public static class FieldMapper extends MapReduceBase
      implements Mapper<LongWritable,Text,Text,LongWritable> {  // typed keys/values

    private static LongWritable ONE = new LongWritable(1);
    private static Text firstname = new Text();

    @Override
    public void map (LongWritable key, Text value,
                     OutputCollector<Text,LongWritable> out, Reporter r) {
      firstname.set(value.toString().split("\t")[1]);  // non-boilerplate
      out.collect(firstname, ONE);                     // non-boilerplate
    }
  } // class FieldMapper

Example – Programming model (Hadoop / Java)

  public static class LongSumReducer extends MapReduceBase
      implements Reducer<Text,LongWritable,Text,LongWritable> {

    private static LongWritable sum = new LongWritable();

    @Override
    public void reduce (Text key, Iterator<LongWritable> vals,
                        OutputCollector<Text,LongWritable> out, Reporter r) {
      long s = 0;
      while (vals.hasNext())
        s += vals.next().get();
      sum.set(s);
      out.collect(key, sum);
    }
  } // class LongSumReducer



Example – Programming model (Hadoop / Java)

  public int run (String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), HistogramJob.class);
    job.setJobName("Histogram");
    FileInputFormat.setInputPaths(job, args[0]);
    job.setMapperClass(FieldMapper.class);
    job.setCombinerClass(LongSumReducer.class);
    job.setReducerClass(LongSumReducer.class);
    // ...
    JobClient.runJob(job);
    return 0;
  } // run()

  public static void main (String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new HistogramJob(), args);
  } // main()
} // class HistogramJob

~30 lines = 25 boilerplate (Eclipse) + 5 lines of actual code

MapReduce for…

- Distributed clusters
  - Google’s original
  - Hadoop (Apache Software Foundation)
- Hardware
  - SMP/CMP: Phoenix (Stanford)
  - Cell BE
- Other
  - Skynet (in Ruby/DRb)
  - QtConcurrent
  - BashReduce
  - …many more



Recap

Quick-n-dirty script (single machine, local drive) vs. Hadoop (~5 lines of non-boilerplate code; up to thousands of machines and drives)

What is hidden to achieve this:
- Data partitioning, placement and replication
- Computation placement (and replication)
- Number of nodes (mappers / reducers)
- …

As a programmer, you don’t need to know what I’m about to show you next…

Execution model: Flow

[Diagram: the input file is divided into SPLIT 0-3; each split is read by a MAPPER with a sequential scan. Mapper output (key/value iterators) is routed to the REDUCERs by an all-to-all, hash-partitioned shuffle; each reducer sort-merges its partition and writes one part (PART 0 / PART 1) of the output file. For the name-count job: input lines “Smith John $90,000”, “Yates John $80,000” map to (John, 1), (John, 1), which sort-merge to (John, 2) at the reducer.]
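The “all-to-all, hash partitioning” step above decides which reducer receives each intermediate pair: by default the key is hashed modulo the number of reducers, so every pair with the same key lands on the same reducer. A small sketch of that routing (`hash_partition` and `NUM_REDUCERS` are illustrative names, not framework API):

```python
NUM_REDUCERS = 2

def hash_partition(key, num_reducers=NUM_REDUCERS):
    # Default-partitioner behavior: hash(key) mod #reducers.
    # Within one process, equal keys always hash equally,
    # so all values for a key reach the same reducer.
    return hash(key) % num_reducers

pairs = [("John", 1), ("David", 1), ("John", 1), ("Bill", 1)]
buckets = {r: [] for r in range(NUM_REDUCERS)}
for key, value in pairs:
    buckets[hash_partition(key)].append((key, value))

# Both ("John", 1) pairs are guaranteed to be in the same bucket.
john_buckets = {hash_partition(k) for k, _ in pairs if k == "John"}
assert len(john_buckets) == 1
```

Skewed keys therefore produce skewed reducers, which is one reason the combiner stage shown later matters in practice.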

Execution model: Placement

[Diagram: five input splits (SPLIT 0-4), each stored as three replicas spread over HOSTS 0-3; e.g., SPLIT 0 has replicas on hosts 0, 1 and 2. A MAPPER runs on a host that holds a replica of its split, a combiner (C) runs after each mapper, and the REDUCERs run on other hosts (HOSTS 4-6).]

Computation is co-located with data (as much as possible); placement is rack/network-aware.


MapReduce Summary

- Simple programming model
- Scalable, fault-tolerant
- Ideal for (pre-)processing large volumes of data

‘However, if the data center is the computer, it leads to the even more intriguing question “What is the equivalent of the ADD instruction for a data center?” […] If MapReduce is the first instruction of the “data center computer”, I can’t wait to see the rest of the instruction set, as well as the data center programming language, the data center operating system, the data center storage systems, and more.’ – David Patterson, “The Data Center Is The Computer”, CACM, Jan. 2008

Outline

- Introduction
- MapReduce & distributed storage
- Hadoop
  - HBase
  - Pig
  - Cascading
  - Hive
- Summary


Hadoop

[Diagram: the Hadoop subprojects: Core, Avro, MapReduce, HDFS, ZooKeeper, HBase, Hive, Pig, Chukwa.]

Hadoop’s stated mission (Doug Cutting interview): commoditize infrastructure for web-scale, data-intensive applications

Who uses Hadoop?

- Yahoo!
- Facebook
- Last.fm
- Rackspace
- Digg
- Apache Nutch
- … more in part 3


Hadoop: Core

Filesystems and I/O:
- Abstraction APIs
- RPC / persistence


Hadoop: Avro

Cross-language serialization:
- RPC / persistence
- ~ Google ProtoBuf / FB Thrift


Hadoop: MapReduce

Distributed execution (batch):
- Programming model
- Scalability / fault-tolerance


Hadoop: HDFS

Distributed storage (read-optimized):
- Replication / scalability
- ~ Google File System (GFS)


Hadoop: ZooKeeper

Coordination service:
- Locking / configuration
- ~ Google Chubby


Hadoop: HBase

Column-oriented, sparse store:
- Batch & random access
- ~ Google Bigtable


Hadoop: Pig

Data flow language:
- Procedural, SQL-inspired language
- Execution environment


Hadoop: Hive

Distributed data warehouse:
- SQL-like query language
- Data management / query execution

Hadoop

[Diagram repeated: Core, Avro, MapReduce, HDFS, ZooKeeper, HBase, Hive, Pig, Chukwa … and more]

MapReduce

- Mapper: (k1, v1) → (k2, v2)[]
  - E.g., (void, textline : string) → (first : string, count : int)
- Reducer: (k2, v2[]) → (k3, v3)[]
  - E.g., (first : string, counts : int[]) → (first : string, total : int)
- Combiner: (k2, v2[]) → (k2, v2)[]
- Partition: (k2, v2) → int
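These four signatures can be written down as plain typed functions for the name-count job. The sketch below is illustrative (the type hints and the `num_reducers` default are assumptions, not part of any framework API); note that the combiner and reducer deliberately share a shape, which is why the Java example can reuse LongSumReducer for both:

```python
from typing import Iterable, Iterator, Tuple

def mapper(key: None, textline: str) -> Iterator[Tuple[str, int]]:
    # (k1, v1) -> (k2, v2)[]: emit one (first name, 1) pair per line.
    yield textline.split('\t')[1], 1

def combiner(first: str, counts: Iterable[int]) -> Iterator[Tuple[str, int]]:
    # (k2, v2[]) -> (k2, v2)[]: pre-aggregate on the mapper side.
    yield first, sum(counts)

def reducer(first: str, counts: Iterable[int]) -> Iterator[Tuple[str, int]]:
    # (k2, v2[]) -> (k3, v3)[]: final total per name.
    yield first, sum(counts)

def partition(first: str, count: int, num_reducers: int = 4) -> int:
    # (k2, v2) -> int: index of the reducer that receives this pair.
    return hash(first) % num_reducers

assert next(mapper(None, "Smith\tJohn\t$90,000")) == ("John", 1)
assert next(reducer("John", [1, 1, 2])) == ("John", 4)
```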

Mapper interface

interface Mapper<K1, V1, K2, V2> {
  void configure (JobConf conf);
  void map (K1 key, V1 value,
            OutputCollector<K2, V2> out, Reporter reporter);
  void close();
}

- Initialize in configure()
- Clean up in close()
- Emit via out.collect(key, val) any time

Reducer interface

interface Reducer<K2, V2, K3, V3> {
  void configure (JobConf conf);
  void reduce (K2 key, Iterator<V2> values,
               OutputCollector<K3, V3> out, Reporter reporter);
  void close();
}

- Initialize in configure()
- Clean up in close()
- Emit via out.collect(key, val) any time

Some canonical examples

- Histogram-type jobs:
  - Graph construction (bucket = edge)
  - K-means et al. (bucket = cluster center)
- Inverted index:
  - Text indices
  - Matrix transpose
- Sorting
- Equi-join
- More details in part 2
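For the inverted-index entry above, the mapper turns (document, text) into (term, document) pairs and the reducer collects each term's posting list. A minimal single-process sketch (all names are illustrative; the dict stands in for the shuffle):

```python
from collections import defaultdict

def map_invert(doc_id, text):
    # Mapper: emit (term, doc_id) once per distinct term in the document.
    for term in set(text.split()):
        yield term, doc_id

def run_inverted_index(docs):
    postings = defaultdict(list)  # simulated shuffle: group doc ids by term
    for doc_id, text in docs.items():
        for term, d in map_invert(doc_id, text):
            postings[term].append(d)
    # Reducer: a sorted posting list per term.
    return {term: sorted(ds) for term, ds in postings.items()}

docs = {1: "big data mining", 2: "data mining tools"}
index = run_inverted_index(docs)
print(index["data"])   # [1, 2]
print(index["tools"])  # [2]
```

The same pattern transposes a sparse matrix: map (row, (col, value)) to (col, (row, value)) and let the shuffle regroup by column.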

Equi-joins (“reduce-side”)

Inputs:
- Employees (name, dept): (Smith, 7) (Jones, 7) (Brown, 7) (Davis, 3) (Dukes, 5) (Black, 3) (Gruhl, 7)
- Departments (name, dept): (Sales, 3) (Devel, 7) (Acct., 5)

MAP: key each tuple by the join attribute and tag it with its source relation (tags shown here as EMP / DEPT; alternatively, the tag can be folded into a composite key, e.g. (7, EMP): (Smith)):
  7: (EMP, (Smith))  7: (EMP, (Jones))  7: (EMP, (Brown))  7: (EMP, (Gruhl))  7: (DEPT, (Devel))  …

SHUFFLE: group by join key:
  7: { (EMP, (Smith)), (EMP, (Jones)), (EMP, (Brown)), (EMP, (Gruhl)), (DEPT, (Devel)) }

REDUCE: emit the cross-product of the two tagged groups:
  (Smith, Devel), (Jones, Devel), (Brown, Devel), (Gruhl, Devel)

39

HDFS & MapReduce processes

HOST 0

SPLIT 0

Replica 1/3

MAPPER SPLIT 1

Replica 2/3

SPLIT 3

Replica 2/3

HOST 1

SPLIT 0

Replica 2/3

SPLIT 4

Replica 1/3

SPLIT 3

Replica 1/3

HOST 2

SPLIT 3

Replica 3/3

MAPPER SPLIT 2

Replica 2/3

SPLIT 0

Replica 3/3

HOST 3

SPLIT 2

Replica 3/3

MAPPER SPLIT 1

Replica 1/3

SPLIT 4

Replica 2/3

MAPPER

HOST 4 HOST 5 HOST 6

REDUCER

C C C C

HOST 0 HOST 1 HOST 2 HOST 3

Monday, August 23, 2010

slide-99
SLIDE 99

39

HDFS & MapReduce processes

HOST 0

SPLIT 0

Replica 1/3

MAPPER SPLIT 1

Replica 2/3

SPLIT 3

Replica 2/3

HOST 1

SPLIT 0

Replica 2/3

SPLIT 4

Replica 1/3

SPLIT 3

Replica 1/3

HOST 2

SPLIT 3

Replica 3/3

MAPPER SPLIT 2

Replica 2/3

SPLIT 0

Replica 3/3

HOST 3

SPLIT 2

Replica 3/3

MAPPER SPLIT 1

Replica 1/3

SPLIT 4

Replica 2/3

MAPPER

HOST 4 HOST 5 HOST 6

REDUCER

C C C C

DN DN

HOST 0 HOST 1 HOST 2 HOST 3

DATA NODE DATA NODE DATA NODE DN DATA NODE

Monday, August 23, 2010

slide-100
SLIDE 100

39

HDFS & MapReduce processes

HOST 0

SPLIT 0

Replica 1/3

MAPPER SPLIT 1

Replica 2/3

SPLIT 3

Replica 2/3

HOST 1

SPLIT 0

Replica 2/3

SPLIT 4

Replica 1/3

SPLIT 3

Replica 1/3

HOST 2

SPLIT 3

Replica 3/3

MAPPER SPLIT 2

Replica 2/3

SPLIT 0

Replica 3/3

HOST 3

SPLIT 2

Replica 3/3

MAPPER SPLIT 1

Replica 1/3

SPLIT 4

Replica 2/3

MAPPER

HOST 4 HOST 5 HOST 6

REDUCER

C C C C

DN DN

HOST 0 HOST 1 HOST 2 HOST 3

DATA NODE DATA NODE DATA NODE DN DATA NODE NAME NODE

Monday, August 23, 2010

slide-101
SLIDE 101

39

HDFS & MapReduce processes

HOST 0

SPLIT 0

Replica 1/3

MAPPER SPLIT 1

Replica 2/3

SPLIT 3

Replica 2/3

HOST 1

SPLIT 0

Replica 2/3

SPLIT 4

Replica 1/3

SPLIT 3

Replica 1/3

HOST 2

SPLIT 3

Replica 3/3

MAPPER SPLIT 2

Replica 2/3

SPLIT 0

Replica 3/3

HOST 3

SPLIT 2

Replica 3/3

MAPPER SPLIT 1

Replica 1/3

SPLIT 4

Replica 2/3

MAPPER

HOST 4 HOST 5 HOST 6

REDUCER

C C C C

TT TT TT

DN DN

HOST 0 HOST 1 HOST 2 HOST 3

DATA NODE DATA NODE DATA NODE TASK TRACKER TASK TRACKER TASK TRACKER DN DATA NODE NAME NODE TASK TRACKER

Monday, August 23, 2010

slide-102
SLIDE 102

39

HDFS & MapReduce processes

[Diagram: five input splits, each replicated 3× across HOSTs 0–3; MAPPER tasks run on hosts holding a replica of their split, a REDUCER on a separate host (HOSTs 4–6); every storage host runs a DataNode (DN) with a TaskTracker (TT) alongside it, while a single NameNode manages the DataNodes and a single JobTracker coordinates the TaskTrackers]
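The placement shown in the diagram, with mappers scheduled on hosts that already hold a replica of their input split, is the heart of Hadoop's data locality. A toy sketch of that idea (not Hadoop's actual scheduler; `schedule` and its inputs are made up for illustration):

```python
# Toy locality-aware scheduler: assign each map task to a host holding a
# replica of its input split, breaking ties by current load.
from collections import defaultdict

def schedule(replicas, hosts):
    """replicas: {split_id: [hosts holding a copy]}; hosts: set of all hosts."""
    load = defaultdict(int)          # running map tasks per host
    assignment = {}
    for split, locations in sorted(replicas.items()):
        # Prefer a data-local host; otherwise fall back to any host,
        # always picking the least-loaded candidate.
        local = [h for h in locations if h in hosts]
        candidates = local if local else sorted(hosts)
        host = min(candidates, key=lambda h: (load[h], h))
        assignment[split] = host
        load[host] += 1
    return assignment
```

With five splits replicated over HOSTs 0–3 as in the diagram, every split ends up with a data-local mapper.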

slide-103
SLIDE 103

40

Hadoop Streaming & Pipes

 Don’t have to use Java for MapReduce
 Hadoop Streaming:
   Use stdin/stdout & text format
   Any language (C/C++, Perl, Python, shell, etc.)
 Hadoop Pipes:
   Use sockets & binary format (more efficient)
   C++ library required
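Streaming's contract is just "read lines on stdin, write tab-separated key/value lines on stdout". A minimal word-count mapper/reducer pair in Python could look like this (a sketch; the exact `hadoop jar hadoop-streaming.jar -mapper … -reducer …` invocation varies by Hadoop version):

```python
# Hadoop Streaming word count, sketched as plain stdin/stdout filters.
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming delivers reducer input sorted by key; sum each word's counts."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same script serves as both stages, selected by a command-line argument, which is a common Streaming idiom.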

slide-104
SLIDE 104

41

Outline

 Introduction
 MapReduce & distributed storage
 Hadoop
   HBase
   Pig
   Cascading
   Hive
 Summary

slide-105
SLIDE 105

42

HBase introduction

 MapReduce canonical example:
   Inverted index (more in Part 2)
 Batch computations on large datasets:
   Build static index on crawl snapshot
 However, in reality crawled pages are:
   Updated by crawler
   Augmented by other parsers/analytics
   Retrieved by cache search
   Etc…

slide-106
SLIDE 106

43

HBase introduction

 MapReduce & HDFS:
   Distributed storage + computation
   Good for batch processing
   But: no facilities for accessing or updating individual items
 HBase:
   Adds random-access read / write operations
   Originally developed at Powerset
   Based on Google’s Bigtable


slide-110
SLIDE 110

44

HBase data model

Row key (billions; sorted) → Column family (hundreds; fixed) → Column (millions) → cell value
Partitioned over many nodes (thousands)
Keys and cell values are arbitrary byte arrays
Can use any underlying data store (local, HDFS, S3, etc.)


slide-113
SLIDE 113

46

Data model example

Row key: empId (always access via primary key)

"profile" family:   profile:last = Smith   profile:first = John   profile:salary = $90,000
"bm" (bookmarks) family:   bm:url1 … bm:urlN
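The layout above can be mimicked with a toy in-memory structure (an illustration of the data model only, not the HBase client API; `ToyTable` is made up):

```python
# Toy model of HBase's layout: sorted row keys mapping to
# "family:qualifier" -> value cells, all arbitrary byte strings.
class ToyTable:
    def __init__(self):
        self.rows = {}                       # row key -> {column -> value}

    def put(self, row, column, value):
        self.rows.setdefault(row, {})[column] = value

    def get(self, row, column=None):
        # All access goes through the row (primary) key.
        cells = self.rows.get(row, {})
        return cells if column is None else cells.get(column)

    def scan(self):
        # Rows come back in sorted key order, as in HBase/Bigtable.
        for row in sorted(self.rows):
            yield row, self.rows[row]
```

For example, `put(b"emp42", "profile:last", b"Smith")` stores one cell of the employee row shown on the slide.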

slide-114
SLIDE 114

47

HBase vs. RDBMS

 Different solution, similar problems
 RDBMSes:
   Row-oriented
   Fixed-schema
   ACID
 HBase et al.:
   Designed from the ground up to scale out by adding commodity machines
   Simple consistency scheme: atomic row writes
   Fault tolerance
   Batch processing
   No (real) indexes

slide-115
SLIDE 115

48

Outline

 Introduction
 MapReduce & distributed storage
 Hadoop
   HBase
   Pig
   Cascading
   Hive
 Summary

slide-116
SLIDE 116

49

Pig introduction

 Writing a single MapReduce job requires significant gruntwork:
   Boilerplate (mapper/reducer, job creation, etc.)
   Input / output formats
   “~5 lines of non-boilerplate code”
 Many tasks require more than one MapReduce job

slide-117
SLIDE 117

50

Pig main features

 Data structures (multi-valued, nested)
 Pig Latin: data flow language
   SQL-inspired, but imperative (not declarative)

slide-118
SLIDE 118

51

Pig example

records = LOAD filename
          AS (last:chararray, first:chararray, salary:int);
grouped = GROUP records BY first;
counts  = FOREACH grouped GENERATE group, COUNT(records.first);
DUMP counts;

employees.txt:
# LAST     FIRST    SALARY
Smith      John     $90,000
Brown      David    $70,000
Johnson    George   $95,000
Yates      John     $80,000
Miller     Bill     $65,000
Moore      Jack     $85,000
Taylor     Fred     $75,000
Smith      David    $80,000
Harris     John     $90,000
...        ...      ...

Q: “What is the frequency of each first name?”
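What the Pig statements compute, restated in plain Python for reference (`first_name_counts` is a made-up helper, not part of Pig):

```python
# Plain-Python restatement of the Pig script:
# GROUP records BY first, then COUNT each group.
from collections import Counter

def first_name_counts(records):
    """records: iterable of (last, first, salary) tuples."""
    return Counter(first for _last, first, _salary in records)
```

On the employees table above, "John" would map to 3 and "David" to 2 among the rows shown.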

slide-119
SLIDE 119

52

Pig schemas

 Schema = tuple data type
 Schemas are optional!
   Data-loading step is not required
   “Unknown” schema: similar to AWK ($0, $1, …)
 Support for most common data types
 Support for nesting

slide-120
SLIDE 120

53

Pig Latin feature summary

 Data loading / storing

 LOAD / STORE / DUMP

 Filtering

 FILTER / DISTINCT / FOREACH / STREAM

 Group-by

 GROUP

 Join & co-group

 JOIN / COGROUP / CROSS

 Sorting

 ORDER / LIMIT

 Combining / splitting

 UNION / SPLIT


slide-121
SLIDE 121

54

Outline

 Introduction
 MapReduce & distributed storage
 Hadoop
   HBase
   Pig
   Cascading
   Hive
 Summary


slide-123
SLIDE 123

55

Cascading introduction

 Provides higher-level abstractions:
   Fields, Tuples
   Pipes
   Operations
   Taps, Schemes, Flows
 Eases composition of multi-job flows
 Library, not a new language

slide-124
SLIDE 124

56

Cascading example

Scheme srcScheme = new TextLine();
Tap source = new Hfs(srcScheme, filename);
Scheme dstScheme = new TextLine();
Tap sink = new Hfs(dstScheme, filename, REPLACE);

Pipe assembly = new Pipe("lastnames");
Function splitter = new RegexSplitter(
    new Fields("last", "first", "salary"), "\t");
assembly = new Each(assembly, new Fields("line"), splitter);
assembly = new GroupBy(assembly, new Fields("first"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);

FlowConnector flowConnector = new FlowConnector();
Flow flow = flowConnector.connect(
    "last-names", source, sink, assembly);
flow.complete();

employees.txt:
# LAST   FIRST   SALARY
Smith    John    $90,000
Brown    David   $70,000
...      ...     ...

Q: “What is the frequency of each first name?”


slide-126
SLIDE 126

57

Cascading feature summary

 Pipes: transform streams of tuples
   Each
   GroupBy / CoGroup
   Every
   SubAssembly
 Operations: what is done to tuples
   Function
   Filter
   Aggregator / Buffer

slide-127
SLIDE 127

58

Outline

 Introduction
 MapReduce & distributed storage
 Hadoop
   HBase
   Pig
   Cascading
   Hive
 Summary

slide-128
SLIDE 128

59

Hive introduction

 Originally developed at Facebook
   Now a Hadoop sub-project
 Data warehouse infrastructure
   Execution: MapReduce
   Storage: HDFS files
 Large datasets, e.g. Facebook daily logs:
   30GB (Jan ’08), 200GB (Mar ’08), 15+TB (2009)
 Hive QL: SQL-like query language

slide-129
SLIDE 129

60

Hive example

CREATE EXTERNAL TABLE records
    (last STRING, first STRING, salary INT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION filename;

SELECT records.first, COUNT(1)
FROM records
GROUP BY records.first;

employees.txt:
# LAST     FIRST    SALARY
Smith      John     $90,000
Brown      David    $70,000
Johnson    George   $95,000
Yates      John     $80,000
Miller     Bill     $65,000
Moore      Jack     $85,000
Taylor     Fred     $75,000
Smith      David    $80,000
Harris     John     $90,000
...        ...      ...

Q: “What is the frequency of each first name?”

slide-130
SLIDE 130

61

Hive schemas

 Data should belong to tables
   But can also use pre-existing data
   Data loading optional (like Pig) but encouraged
 Partitioning columns:
   Mapped to HDFS directories
   E.g., (date, time) → datadir/2009-03-12/18_30_00
 Data columns (the rest):
   Stored in HDFS files
 Support for most common data types
 Support for pluggable serialization
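The (date, time) → directory mapping can be sketched as below (`partition_path` is a made-up helper; Hive lays out partition directories itself, and the `datadir/2009-03-12/18_30_00` layout simply follows the slide's example):

```python
# Sketch: map partition-column values to an HDFS-style directory path,
# mirroring the slide's example datadir/2009-03-12/18_30_00.
import posixpath

def partition_path(base, date, time):
    # ':' is replaced because it is awkward in path components.
    return posixpath.join(base, date, time.replace(":", "_"))
```

Because partitions are directories, a query filtered on the partitioning columns only has to read the matching subdirectories.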

slide-131
SLIDE 131

62

Hive QL feature summary

 Basic SQL
   FROM subqueries
   JOIN (only equi-joins)
   Multi GROUP BY
   Multi-table insert
   Sampling
 Extensibility
   Pluggable MapReduce scripts
   User Defined Functions
   User Defined Types
   SerDe (serializer / deserializer)

slide-132
SLIDE 132

63

Outline

 Introduction
 MapReduce & distributed storage
 Hadoop
   HBase
   Pig
   Cascading
   Hive
 Summary


slide-138
SLIDE 138

64

Recap

 Scalable: all
 High(-er) level: all except MR
 Existing language: MR, Cascading
 “Schemas”: HBase, Pig, Hive, (Casc.)
   Pluggable data types: all
 Easy transition: Hive, (Pig)

slide-139
SLIDE 139

65

Related projects

Higher level—computation:
   Dryad & DryadLINQ (Microsoft) [EuroSys 2007]
   Sawzall (Google) [Sci Prog Journal 2005]

Higher level—storage:
   Bigtable [OSDI 2006] / Hypertable

Lower level:
   Kosmos Filesystem (Kosmix)
   VSN (Parascale)
   EC2 / S3 (Amazon)
   Ceph / Lustre / PanFS
   Sector / Sphere (http://sector.sf.net/)
   …

slide-140
SLIDE 140

66

Summary

MapReduce:
   Simplified parallel programming model

Hadoop:
   Built from the ground up for:
     Scalability
     Fault-tolerance
     Clusters of commodity hardware
   Growing collection of components and extensions (HBase, Pig, Hive, etc.)

slide-141
SLIDE 141

67

Tutorial overview

 Part 1 (Spiros): Basic concepts & tools
   MapReduce & distributed storage
   Hadoop / HBase / Pig / Cascading / Hive
 Part 2 (Jimeng): Algorithms
   Information retrieval
   Graph algorithms
   Clustering (k-means)
   Classification (k-NN, naïve Bayes)
 Part 3 (Rong): Applications
   Text processing
   Data warehousing
   Machine learning

NEXT:

slide-142
SLIDE 142

Large-scale Data Mining: MapReduce and beyond

Part 1: Basics

Spiros Papadimitriou, Google Jimeng Sun, IBM Research Rong Yan, Facebook
