SLIDE 1

Large-Scale Data Analysis: Bridging the Gap

Alekh Jindal, Yagiz Kargin, Sarath Kumar, Vinay Setty

SLIDE 2

Outline

  • Motivation: Parallel DBMS vs Map/Reduce
  • Schema & Benchmarks Overview
  • Original (Pavlo) Map/Reduce Plans
  • Improved (SAVY) Design & Implementation
  • Improving Hadoop

○ Indexing
○ Co-Partitioning

  • Experiments
  • Conclusion

SLIDE 3

Motivation

  • Ever-growing data

○ About 20 TB per Google crawl!

  • Computing Solutions

○ High-end server: 1625.60€/core, 97.66€/GB
○ Shared-nothing nodes: 299.50€/core, 166.33€/GB

  • Two Paradigms

○ Parallel DBMS
○ Map/Reduce

SLIDE 4

Parallel DBMS

[Diagram: data partitions on each node are scanned and sorted in parallel; the per-node results are then merged to answer the query. After DeWitt, D. and Gray, J. 1992.]

SLIDE 5

Parallel DBMS: Advantages

  • Can be column-based

○ Example: Vertica

  • Local joins possible

○ Partition based on the join key

  • Can work on compressed data

○ Reduced data transfer

  • Flexible query plans
  • Supports declarative languages like SQL

SLIDE 6

Parallel DBMS: Shortcomings

  • Not free of cost
  • Not open source
  • Cannot scale to thousands of nodes: why?

○ Less fault tolerant
○ Assumes homogeneous nodes

  • Not so easy to achieve high performance

○ Needs a highly skilled DBA
○ Needs high maintenance

SLIDE 7

Map/Reduce (Hadoop): Advantages

  • Free of cost
  • Open source
  • Fault tolerant
  • Scales well to thousands of nodes
  • Less maintenance
  • Flexible query framework

SLIDE 8

Map/Reduce (Hadoop): Shortcomings

  • Lack of built-in indexing
  • Cannot guarantee local joins
  • Performance degradation for SQL-like queries

○ Multiple MR phases
○ Each MR phase adds extra cost

  • No flexible query plans
  • Data transfer not optimized

SLIDE 9

Benchmarks and Schema

SLIDE 10

Schema

CREATE TABLE Documents (
    url VARCHAR(100) PRIMARY KEY,
    contents TEXT
);

CREATE TABLE Rankings (
    pageURL VARCHAR(100) PRIMARY KEY,
    pageRank INT,
    avgDuration INT
);

SLIDE 11

Schema

CREATE TABLE UserVisits (
    sourceIP VARCHAR(16),
    destURL VARCHAR(100),
    visitDate DATE,
    adRevenue FLOAT,
    userAgent VARCHAR(64),
    countryCode VARCHAR(3),
    languageCode VARCHAR(6),
    searchWord VARCHAR(32),
    duration INT
);

SLIDE 12

Benchmarks 1&2

  • Selection task (Benchmark 1)

○ SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;

  • Aggregation task (Benchmark 2)

○ SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
○ SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);

SLIDE 13

Benchmark 3: Join Task

  • SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank,
        SUM(adRevenue) AS totalRevenue
    FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL
        AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
    GROUP BY UV.sourceIP;

  • SELECT sourceIP, totalRevenue, avgPageRank
    FROM Temp
    ORDER BY totalRevenue DESC LIMIT 1;

[Query annotations: selection (date predicate), join (on URL), projection & aggregation.]

SLIDE 14

Original (Pavlo) MR Plans

SLIDE 15

Benchmark 1

[Diagram, Phase 1: mappers scan the Rankings data from HDFS and apply the predicate pageRank > 10, each writing its own result file.]

[Diagram, Phase 2: an extra MR job with identity mappers and a single reducer merges the per-mapper results.]

SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10;
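
For concreteness, a minimal sketch of what this Phase-1 mapper could look like in the old Hadoop 0.19 API used in the experiments; this is an illustration, not the authors' code, and the tab-delimited Rankings layout is an assumption.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hedged sketch of the Phase-1 selection mapper. Assumes text records
// laid out as "pageURL\tpageRank\tavgDuration".
public class SelectionMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable offset, Text record,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
        String[] fields = record.toString().split("\t");
        int pageRank = Integer.parseInt(fields[1]);
        if (pageRank > 10) {                      // the selection predicate
            out.collect(new Text(fields[0]), new IntWritable(pageRank));
        }
    }
}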

SLIDE 16

Benchmark 2: Phase 1

[Diagram, Phase 1: mappers split each UserVisits record read from HDFS, combiners pre-sum adRevenue, and reducers complete the per-sourceIP aggregation, writing intermediate results.]
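
As a rough sketch of the summing logic: because addition is associative, one class can serve as both the combiner and the reducer. The class name and field handling are illustrative, not taken from the paper.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hedged sketch: sums adRevenue per sourceIP; usable as combiner and reducer.
public class RevenueSumReducer extends MapReduceBase
        implements Reducer<Text, FloatWritable, Text, FloatWritable> {

    public void reduce(Text sourceIP, Iterator<FloatWritable> revenues,
                       OutputCollector<Text, FloatWritable> out, Reporter reporter)
            throws IOException {
        float sum = 0f;
        while (revenues.hasNext()) {
            sum += revenues.next().get();
        }
        out.collect(sourceIP, new FloatWritable(sum));
    }
}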

SLIDE 17

Benchmark 2: Phase 2

[Diagram, Phase 2: an extra MR job with identity mappers and a single reducer merges the Phase-1 partial results into one result.]

SLIDE 18

Benchmark 3 – Phase 1

[Diagram, Phase 1: mappers read Rankings and UserVisits from HDFS, apply the visitDate predicate, and classify each record by its source table; reducers then join on URL, emitting <sourceIP, URL, pageRank, adRevenue> records.]
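
The join side of Phase 1 can be sketched as a standard repartition join; the "R|"/"U|" tagging convention below is an assumption for illustration, not the paper's actual encoding.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hedged sketch of the Phase-1 join reducer. The mapper is assumed to have
// tagged each value "R|pageRank" (Rankings) or "U|sourceIP\tadRevenue"
// (UserVisits rows that passed the date predicate).
public class JoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text url, Iterator<Text> tagged,
                       OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        String pageRank = null;
        List<String> visits = new ArrayList<String>();
        while (tagged.hasNext()) {
            String v = tagged.next().toString();
            if (v.startsWith("R|")) pageRank = v.substring(2);
            else visits.add(v.substring(2));      // "sourceIP\tadRevenue"
        }
        if (pageRank == null) return;             // URL has no ranking: no match
        for (String visit : visits) {
            String[] f = visit.split("\t");
            // emit <sourceIP, URL, pageRank, adRevenue>
            out.collect(new Text(f[0]),
                        new Text(url + "\t" + pageRank + "\t" + f[1]));
        }
    }
}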

SLIDE 19

Benchmark 3 – Phase 2

[Diagram, Phase 2: identity mappers forward the joined <sourceIP, URL, pageRank, adRevenue> records; reducers compute Avg(pageRank) and Sum(adRevenue) per sourceIP.]

SLIDE 20

Benchmark 3 – Phase 3

[Diagram, Phase 3: identity mappers forward the per-sourceIP aggregates; a single reducer selects the record with Max(Sum(adRevenue)), producing the final result <sourceIP, Avg(pageRank), Sum(adRevenue)>.]

SLIDE 21

Improved (Savy) MR Plans

SLIDE 22

Binary Data

  • Eliminates delimiters
  • Avoids splitting
  • Makes tuples fixed-length
  • Helps with indexing (a layout sketch follows)
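
A minimal sketch of such a fixed-length layout for Rankings; the 100-byte URL width is a hypothetical choice. With every tuple exactly RECORD_LEN bytes, record i starts at offset i * RECORD_LEN, and any field can be read without scanning for delimiters.

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hedged sketch of a fixed-length binary codec for Rankings tuples.
public class RankingsCodec {
    static final int URL_LEN = 100;                      // assumed field width
    static final int RECORD_LEN = URL_LEN + 4 + 4;       // url + pageRank + avgDuration

    static byte[] encode(String pageURL, int pageRank, int avgDuration) {
        ByteBuffer buf = ByteBuffer.allocate(RECORD_LEN);
        byte[] url = pageURL.getBytes(Charset.forName("UTF-8"));
        buf.put(Arrays.copyOf(url, URL_LEN));            // pad/truncate to 100 bytes
        buf.putInt(pageRank);
        buf.putInt(avgDuration);
        return buf.array();
    }

    static int decodePageRank(byte[] record) {
        return ByteBuffer.wrap(record).getInt(URL_LEN);  // jump straight to the field
    }
}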

SLIDE 23

Benchmark 1

[Diagram, Phase 1: record readers deliver fixed-length binary Rankings tuples from HDFS, and mappers apply the pageRank > 10 predicate directly, with no string parsing.]

[Diagram, Phase 2: an extra MR job with a single reducer merges the per-mapper results.]

SLIDE 24

Benchmark 2

[Diagram, Phase 1: mappers split the binary UserVisits records, combiners pre-sum adRevenue, and reducers aggregate per sourceIP, writing partial results.]

[Diagram, Phase 2: an extra MR job with a single reducer merges and finishes the aggregation into one result.]

SLIDE 25

Benchmark 3 (Design I) – Phase 1

[Diagram, Phase 1: record readers apply the visitDate predicate to the binary Rankings and UserVisits data; identity mappers forward the tuples, and reducers join on URL, emitting <sourceIP, URL, pageRank, adRevenue> records. With fixed-length binary tuples, the two record types are easy to classify: just look at the record size.]

SLIDE 26

Benchmark 3 (Design I) – Phase 2

[Diagram, Phase 2: identity mappers forward the joined records; combiners compute Avg(pageRank) and Sum(adRevenue) per sourceIP, and the reducers select Max(Sum(adRevenue)) directly, producing the final result <sourceIP, Avg(pageRank), Sum(adRevenue)>. No Phase 3 is needed.]

SLIDE 27

Benchmark 3 (Design II) – Phase 1

[Diagram, Phase 1: only UserVisits is read; record readers apply the visitDate predicate, and reducers aggregate to <sourceIP, Sum(adRevenue), <destURLs>> records, keeping only the top R records by Sum(adRevenue), so the output is very small.]

SLIDE 28

Benchmark 3 (Design II) – Phase 2

[Diagram, Phase 2: a single mapper reads Rankings through record readers together with the tiny Phase-1 output, computes Max(Sum(adRevenue)), and performs the join, producing the final result <sourceIP, Avg(pageRank), Sum(adRevenue)>.]

SLIDE 29

Improving Hadoop

SLIDE 30

Improving Hadoop

  • Improve Selection (Indexing)
  • Improve Join (Co-partitioning)

SLIDE 31

Indexing

  • Data Loading

○ Index and load data into the DFS

  • Query Execution

○ Index look-up and selection

  • Implementation on Hadoop

SLIDE 32

Data Loading

  • Partitioning
  • Sorting
  • Bulk Loading
  • HID Splits

SLIDE 33

Data Loading

SLIDES 34–37

Partitioning

Split input data at tuple boundaries

SLIDES 38–39

Sorting

Sort each split on the index key

SLIDE 40

Bulk Loading

Bulk-load the CSS-tree index for each sorted split

SLIDES 41–44

HID Split

Construct the Header-Index-Data (HID) split. The header stores:
  • Index end offset
  • Data end offset
  • Start index key
  • End index key
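
A hedged sketch of reading this header: the field order follows the slide, while the concrete widths (longs for the offsets, ints for integer keys) are assumptions.

import java.io.DataInputStream;
import java.io.IOException;

// Hedged sketch of an HID-split header, read before touching index or data.
public class HidHeader {
    long indexEnd;   // byte offset where the index region ends
    long dataEnd;    // byte offset where the data region ends
    int startKey;    // smallest index key stored in this split
    int endKey;      // largest index key stored in this split

    static HidHeader read(DataInputStream in) throws IOException {
        HidHeader h = new HidHeader();
        h.indexEnd = in.readLong();
        h.dataEnd = in.readLong();
        h.startKey = in.readInt();
        h.endKey = in.readInt();
        return h;
    }
}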

SLIDE 45

Query Execution

  • Partitioning
  • Split selection
  • Index lookup
  • Extractor

SLIDE 46

Query Execution

SLIDES 47–50

Partitioning

Read the header to get the HID split boundaries

SLIDE 51

Split Selection

Discard splits whose index key range falls outside the query range (see the overlap test below)
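
Using the hypothetical HidHeader sketch from the HID Split section, split selection reduces to a range-overlap test:

// Hedged sketch: process a split only if its header key range
// [startKey, endKey] overlaps the query range [low, high].
static boolean splitSelected(HidHeader h, int low, int high) {
    return high >= h.startKey && low <= h.endKey;
}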

SLIDES 52–61

Index Lookup

Find the data offsets corresponding to the LOW and HIGH keys
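
The deck bulk-loads a CSS-tree for this lookup. As a simpler stand-in that yields the same offsets, a binary search over the sorted, fixed-length data region finds the first record with key >= LOW; the HIGH bound is symmetric. The KeyReader accessor is hypothetical.

import java.io.IOException;

// Hedged sketch: binary search over a split's sorted data region. The
// CSS-tree in the deck answers the same question with fewer byte reads.
public class IndexLookup {
    interface KeyReader { int keyAt(long recordIndex) throws IOException; }

    // Byte offset of the first record whose key is >= low.
    static long lowerBoundOffset(int low, long numRecords, int recordLen,
                                 KeyReader keys) throws IOException {
        long lo = 0, hi = numRecords;             // search window [lo, hi)
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            if (keys.keyAt(mid) < low) lo = mid + 1;
            else hi = mid;
        }
        return lo * recordLen;
    }
    // The HIGH offset is found the same way with the test keyAt(mid) <= high.
}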

SLIDE 62

Extractor

Perform selection on data

SLIDE 63

Extractor

Pass sub-split to Record Reader for processing

SLIDE 64

Implementation on Hadoop

Loading

  • CSS-tree index
  • Indirect index
  • Four key types supported: Int, Float, Date, String
  • Index stored as a byte array
  • Reducer used to reduce the number of output files
  • Integral number of HID splits per reducer output

Querying

  • Discover HID split boundaries from respective headers
  • Read only the selected data from HDFS

SLIDE 65

Co-Partitioning

  • Data loading
  • Query execution
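
To make the intended benefit concrete: if both tables are partitioned (and kept sorted) on the join key at load time, each node can join its local partition pair without any shuffle, much as a parallel DBMS does. Below is a hedged sketch of that local merge join; the field layouts and the emit() stub are illustrative assumptions.

import java.util.Iterator;

// Hedged sketch of a local merge join over co-partitioned, key-sorted splits.
// rankings rows: {pageURL, pageRank}; visits rows: {destURL, sourceIP, adRevenue};
// both iterators are sorted ascending on field 0, the URL join key.
public class LocalMergeJoin {
    static void mergeJoin(Iterator<String[]> rankings, Iterator<String[]> visits) {
        String[] r = rankings.hasNext() ? rankings.next() : null;
        String[] v = visits.hasNext() ? visits.next() : null;
        while (r != null && v != null) {
            int cmp = r[0].compareTo(v[0]);
            if (cmp < 0)      r = rankings.hasNext() ? rankings.next() : null;
            else if (cmp > 0) v = visits.hasNext() ? visits.next() : null;
            else {
                emit(r, v);                                  // joined on the local node
                v = visits.hasNext() ? visits.next() : null; // pageURL is unique
            }
        }
    }

    static void emit(String[] r, String[] v) {    // stand-in for the MR output call
        System.out.println(v[1] + "\t" + r[0] + "\t" + r[1] + "\t" + v[2]);
    }
}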

SLIDES 66–70

Data Loading

[Animation builds of the co-partitioned data-loading diagram; no recoverable text.]

SLIDES 71–75

Query Execution

[Animation builds of the co-partitioned query-execution diagram; no recoverable text.]

SLIDES 76–82

Indexing on top of Co-partitioning

[Animation builds of the combined indexing and co-partitioning diagram; no recoverable text.]

SLIDE 83

Experiments

SLIDE 84

Experimental Setup

  • Hadoop 0.19.1
  • 5 nodes
  • Speed?
  • RAM?
  • Gigabit Ethernet
  • Data size

○ UserVisits: 20GB
○ Rankings: 32MB

SLIDES 85–89

Results

[Result charts only; no recoverable text.]

SLIDE 90

Roadblocks Faced

  • Data generation:

○ 20GB UserVisits, 338MB Rankings in HDFS
○ Took 16 hours to generate
○ Too many OS/library dependencies
○ Poor documentation

  • Number of nodes:

○ Allocated 6 nodes
○ Effectively (up and running) 4 nodes
○ Map/Reduce parallelism not exploited
○ Per-split indexing is ideally suited to highly parallel execution

SLIDE 91

Roadblocks Faced

  • Data normalization

○ Schema uses VARCHAR data types
○ Input data normalized to fixed-length binary tuples
○ The byte-oriented processing speedup is negated by the increased input size
○ However, it facilitates indexing and co-partitioning

  • Low selectivity

○ The selection task has selectivity close to 1
○ Indexing benefits are undermined

  • Incorrect base result

○ The reported join-task result was not correct

SLIDE 92

Roadblocks Faced

  • Implementation deviates from the paper

○ The composite key is not actually used in the join task

SLIDE 93

Discussion: Loopholes

  • Benchmarks are well suited to (biased toward) databases
  • Huge difference in data loading time
  • Queries make heavy use of indexed and sorted data
  • Query optimization not done for Map/Reduce
  • Fault tolerance not compared

SLIDE 94

Discussion: We can do better!

  • Map/Reduce plans can be optimized
  • Normalized binary input data can help
  • Indexing is feasible and performs well
  • Co-partitioning is feasible and looks promising

SLIDE 95

Conclusions

SLIDE 96

References

  • Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., and Stonebraker, M. 2009. A comparison of approaches to large-scale data analysis. SIGMOD '09.
  • DeWitt, D. and Gray, J. 1992. Parallel database systems: the future of high performance database systems. Commun. ACM 35, 6 (June 1992).