Incoop: MapReduce for Incremental Computations by Bhatotia et al.



slide-1
SLIDE 1

Incoop: MapReduce for Incremental Computations

by Bhatotia et al

slide-2
SLIDE 2

What is Incoop?

  • Hadoop-based framework
  • Designed to improve the efficiency of incremental programs
  • Developed at the Max Planck Institute by Bhatotia et al.

slide-3
SLIDE 3

Why Incoop?

slide-4
SLIDE 4
slide-5
SLIDE 5

Why run incremental computation in Incoop?

  • Lots of applications are incremental

○ Machine learning, word count over a range of documents, etc.

  • Easy to write: the input is an ordinary Hadoop program
  • Great speedups
slide-6
SLIDE 6

How does Incoop differ from Hadoop?

  • Incremental HDFS
  • Incremental map and incremental reduce through a contraction phase

  • Memoization-aware scheduler
slide-7
SLIDE 7

HDFS recap

  • Large, fixed-size chunks (64 MB)
  • Append-only file system
  • Serial reads and writes
slide-8
SLIDE 8

What’s bad about HDFS?

  • Even small changes to the input data result in unstable partitioning!
  • This makes it difficult to reuse results from earlier runs
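
The instability can be illustrated with a small sketch. It is not HDFS code: `fixed_chunks`, `chunk_ids`, and `reusable` are hypothetical helpers, the 64-byte chunk size stands in for 64 MB, and chunks are identified by a SHA-256 digest purely for the demonstration.

```python
import hashlib
import random

def fixed_chunks(data, size):
    """Split data into fixed-size pieces, a stand-in for fixed-size HDFS chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def chunk_ids(chunks):
    """Identify each chunk by the SHA-256 digest of its content."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}

def reusable(old, new, size):
    """Count distinct chunks of `new` whose content already appeared in `old`."""
    return len(chunk_ids(fixed_chunks(old, size)) & chunk_ids(fixed_chunks(new, size)))

rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(4096))  # sample input: 64 chunks of 64 bytes

appended = old + b"tail"   # change at the end: earlier chunk boundaries stay put
inserted = b"X" + old      # one byte at the front: every boundary shifts by one

print(reusable(old, appended, 64))  # 64
print(reusable(old, inserted, 64))  # 0
```

A single inserted byte leaves no chunk with the same content as before, so any result keyed on chunk content would have to be recomputed.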
slide-9
SLIDE 9

The problem with HDFS

[Figure: partitioning diagram; labels: three input files, HDFS, three mappers]

slide-12
SLIDE 12

Incremental HDFS

  • Splits input data based on content
  • Variable-length chunks
  • Done at the input creation phase
  • Follows the HDFS API
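
Content-based splitting can be sketched as follows. This is a minimal illustration, not the Inc-HDFS implementation: the `WINDOW` and `DIVISOR` values, the helper names, and the use of `zlib.crc32` over a sliding window as the boundary test are all assumptions made for the example.

```python
import hashlib
import random
import zlib

WINDOW = 16     # bytes of context that decide a boundary (assumed value)
DIVISOR = 64    # gives chunks of roughly 64 bytes on average (assumed value)

def content_chunks(data):
    """Cut after every position whose preceding WINDOW bytes hash to a marker value."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])   # trailing bytes after the last marker
    return chunks

def chunk_ids(chunks):
    """Identify each chunk by the SHA-256 digest of its content."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}

rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(4096))
new = b"X" + old                      # one byte inserted at the front

old_ids = chunk_ids(content_chunks(old))
new_ids = chunk_ids(content_chunks(new))
lost = len(old_ids - new_ids)         # old chunks that cannot be reused
print(len(old_ids), "old chunks,", lost, "not reusable")
```

Because a boundary depends only on the `WINDOW` bytes before it, an insertion at the front can disturb at most the first chunk in this sketch; every later chunk keeps its content. Practical chunkers usually also enforce minimum and maximum chunk sizes, which are omitted here.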
slide-13
SLIDE 13

Solution with incremental HDFS

[Figure: labels: three input files, INC-HDFS, three mappers]

slide-16
SLIDE 16

How does Incoop differ from Hadoop?

  • Incremental HDFS
  • Incremental map/reduce and contraction phase

  • Memoization-aware scheduler
slide-17
SLIDE 17

Incremental Map Phase

  • Persistently stores map task results between iterations
  • Creates a reference to each result in the memoization server (via hashing)
  • Later iterations fetch the results referenced by the memoization server
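
The three steps above can be sketched as a lookup table keyed by a hash of each input chunk. Everything here is a hypothetical stand-in: `MemoServer`, `run_map_phase`, the `result-...` location names, and the choice of SHA-256 are assumptions, and word count is used only as a sample map function.

```python
import hashlib

class MemoServer:
    """Hypothetical in-memory stand-in for the memoization server."""

    def __init__(self):
        self._index = {}   # chunk digest -> location of the stored result
        self._store = {}   # location -> persisted map output

    def lookup(self, digest):
        loc = self._index.get(digest)
        return None if loc is None else self._store[loc]

    def publish(self, digest, result):
        loc = "result-" + digest[:12]   # made-up location name
        self._store[loc] = result
        self._index[digest] = loc

def word_count_map(chunk):
    """Sample map function: count the words in one chunk."""
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def run_map_phase(chunks, memo):
    """Run the map function only for chunks with no memoized result."""
    outputs, executed = [], 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode()).hexdigest()
        result = memo.lookup(digest)
        if result is None:
            result = word_count_map(chunk)
            memo.publish(digest, result)
            executed += 1
        outputs.append(result)
    return outputs, executed

memo = MemoServer()
_, first = run_map_phase(["a b a", "c d", "e f"], memo)
_, second = run_map_phase(["a b a", "c d CHANGED", "e f"], memo)
print(first, second)   # 3 1
```

Only the chunk whose content changed triggers a new map execution in the second run; the other two results are fetched through the memoization index.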

slide-18
SLIDE 18

Incremental Map Phase

slide-19
SLIDE 19

Incremental Reduce Phase

  • More challenging than the map phase
  • Coarse-grained memoization

○ Reducers copy map output only if the result has not already been computed

  • Fine-grained memoization

○ Combiners

slide-20
SLIDE 20

What are combiners?

  • A step between mappers and reducers
  • Traditionally used to reduce the bandwidth between mappers and reducers
  • Used in Incoop to split reduce tasks and allow for better memoization
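
One way to picture how combiners enable finer-grained memoization is a tree of combiner calls in which each intermediate result is cached. The sketch below is an assumption-laden illustration rather than Incoop's actual contraction phase: the balanced pairwise tree, the `contract` helper, the digest-based keys, and integer addition as the combiner are all chosen for the example.

```python
import hashlib

def digest(value):
    """Content-based key for a value or for a pair of child keys."""
    return hashlib.sha256(repr(value).encode()).hexdigest()

def combine(a, b):
    """Associative combiner; integer addition, as in word count."""
    return a + b

def contract(values, memo, stats):
    """Reduce a non-empty list level by level, reusing memoized combiner results."""
    level = [(digest(v), v) for v in values]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (ha, a), (hb, b) = level[i], level[i + 1]
            key = digest((ha, hb))
            if key not in memo:            # combiner runs only on a miss
                memo[key] = combine(a, b)
                stats["combines"] += 1
            nxt.append((key, memo[key]))
        if len(level) % 2:                 # odd element moves up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0][1]

memo = {}
run1 = {"combines": 0}
total1 = contract([3, 1, 4, 1, 5, 9, 2, 6], memo, run1)
run2 = {"combines": 0}
total2 = contract([3, 1, 4, 1, 5, 7, 2, 6], memo, run2)   # one input changed
print(total1, run1["combines"])   # 31 7
print(total2, run2["combines"])   # 29 3
```

With eight inputs the first run performs seven combiner calls; after one input changes, only the three calls on the path from that input to the root are repeated.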

slide-21
SLIDE 21

Incremental Reduce Phase

slide-22
SLIDE 22

How does Incoop differ from Hadoop?

  • Incremental HDFS
  • Incremental map/reduce and contraction phase

  • Memoization-aware scheduler
slide-23
SLIDE 23

Memoization Scheduling

  • Built using memcached
  • Per-node work queues to make good use of data locality and memoized results

  • Work stealing
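
A rough sketch of per-node queues with work stealing follows. The placement and stealing policies shown (prefer the node that holds a task's memoized data, steal from the longest queue) are assumptions for illustration, as are the `schedule` and `next_task` helpers and the `memo_location` mapping; the slides do not specify the exact policy.

```python
from collections import deque

def schedule(tasks, memo_location, nodes):
    """Queue each task on the node holding its memoized data, else round-robin."""
    queues = {n: deque() for n in nodes}
    for i, task in enumerate(tasks):
        preferred = memo_location.get(task, nodes[i % len(nodes)])
        queues[preferred].append(task)
    return queues

def next_task(node, queues):
    """Take local work first; an idle node steals from the longest queue."""
    if queues[node]:
        return queues[node].popleft(), node
    victim = max(queues, key=lambda n: len(queues[n]))
    if queues[victim]:
        return queues[victim].pop(), victim   # steal from the tail
    return None, None

nodes = ["n1", "n2"]
memo_location = {"t1": "n1", "t2": "n1", "t3": "n1"}   # all memoized data on n1
queues = schedule(["t1", "t2", "t3"], memo_location, nodes)

print(next_task("n1", queues))   # ('t1', 'n1')
print(next_task("n2", queues))   # ('t3', 'n1')
```

Placing tasks by memo location favors reuse, while stealing keeps an idle node busy at the cost of fetching memoized data remotely.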
slide-24
SLIDE 24

Results - incremental runs

slide-25
SLIDE 25

Results - Scheduler

slide-26
SLIDE 26

Results - Overheads

slide-27
SLIDE 27

Results - Overheads

slide-28
SLIDE 28

Criticisms

  • Lack of comparison against other frameworks
  • How were the percentage-based incremental changes generated?
  • Garbage collection is pretty naïve: odd-even runtime workloads see no memoization.
  • How realistic are the incremental results for real-world workloads with respect to Inc-HDFS?

slide-29
SLIDE 29

Questions?