SpongeFiles:
Mitigating Data Skew in MapReduce Using Distributed Memory
Khaled Elmeleegy Turn Inc. kelmeleegy@turn.com
Christopher Olston Google Inc. olston@google.com
Benjamin Reed Facebook Inc. br33d@fb.com
Background
processing web & social networking data sets
capacity
the task
[Figure: map-side sort buffer, 100 MB by default, with records laid out as (p, k, v) entries: kvoffset[] 1.25%, kvindices[] 3.75%, kvbuffer[] 95%]
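The percentages in the figure can be turned into byte counts with a little arithmetic. The sketch below is only illustrative: the function name `split_sort_buffer` and the use of basis points are assumptions made for this example, not Hadoop code.

```python
# Sketch: split a sort buffer into the three regions named in the figure.
# Shares are given in basis points (1/100 of a percent) so the arithmetic
# stays in integers. The function name and layout are hypothetical.

def split_sort_buffer(total_bytes, kvoffset_bp=125, kvindices_bp=375):
    kvoffset_bytes = total_bytes * kvoffset_bp // 10_000    # 1.25%
    kvindices_bytes = total_bytes * kvindices_bp // 10_000  # 3.75%
    # Whatever is left holds the serialized records (95% with the defaults).
    kvbuffer_bytes = total_bytes - kvoffset_bytes - kvindices_bytes
    return {
        "kvoffset": kvoffset_bytes,
        "kvindices": kvindices_bytes,
        "kvbuffer": kvbuffer_bytes,
    }

# 100 MB default buffer from the figure.
sizes = split_sort_buffer(100 * 1024 * 1024)
for region, size in sizes.items():
    print(f"{region}: {size} bytes")
```

With the 100 MB default, only the 95% record region is available for data, so a single large record group can exceed the buffer and force a spill.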
[Figure: reduce-side shuffle. MapOutputCopier threads (copiers) fetch map output into a memory buffer, which spills to local disk. In memory: InMemFSMergerThread merges segments. On disk: LocalFSMerger merges segments.]
Share memory in the same node
Share memory between peers
write )
Effect: Share memory between tasks in the same node
Steps:
data)
Effect: Share memory between tasks among peers
Effect: It is the last resort, similar to spill on disk
Steps:
distributed file systems
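The three effects above (sharing within a node, sharing among peers, and a last resort similar to spilling on disk) suggest a fallback order for placing spilled chunks. The following is a minimal sketch under that assumption. The `Store` class, the tier names, and the capacities are hypothetical and are not the SpongeFiles API.

```python
# Sketch: place each spilled chunk in the first tier that has room.
# Tier order is an assumption drawn from the slides:
# same-node memory, then peer memory, then disk / distributed file system.

class Store:
    """Hypothetical fixed-capacity chunk store standing in for one tier."""

    def __init__(self, name, capacity_chunks):
        self.name = name
        self.capacity = capacity_chunks
        self.chunks = []

    def try_put(self, chunk):
        if len(self.chunks) >= self.capacity:
            return False  # tier is full, caller should fall through
        self.chunks.append(chunk)
        return True


def spill_chunk(chunk, tiers):
    """Return the name of the tier that accepted the chunk."""
    for tier in tiers:
        if tier.try_put(chunk):
            return tier.name
    raise RuntimeError("no tier could accept the chunk")


tiers = [
    Store("local shared memory", capacity_chunks=2),
    Store("remote memory", capacity_chunks=1),
    Store("disk or distributed file system", capacity_chunks=10**6),
]
placements = [spill_chunk(b"x" * 1024, tiers) for _ in range(5)]
print(placements)
```

In this toy run the first two chunks land in local shared memory, the third in remote memory, and the rest fall through to the last-resort tier.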
Live tasks: delete their SpongeFiles before they exit
Failed tasks: sponge servers perform periodic garbage collection
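The two cleanup paths can be sketched as follows. This is an illustration only: the `SpongeServer` class, its method names, and the way liveness is passed in as a set are assumptions, since the slides do not describe the actual interfaces.

```python
# Sketch: a task deletes its own files on clean exit; a periodic sweep
# reclaims files whose owning task is no longer alive.
# All names here are hypothetical.

class SpongeServer:
    def __init__(self):
        self.files = {}  # file id -> owning task id

    def create(self, file_id, task_id):
        self.files[file_id] = task_id

    def delete_owned_by(self, task_id):
        """Called by a live task just before it exits."""
        for file_id in [f for f, t in self.files.items() if t == task_id]:
            del self.files[file_id]

    def collect_garbage(self, live_tasks):
        """Periodic sweep: drop files whose owner is not in live_tasks."""
        orphans = sorted(f for f, t in self.files.items() if t not in live_tasks)
        for file_id in orphans:
            del self.files[file_id]
        return orphans


server = SpongeServer()
server.create("f1", "task-1")
server.create("f2", "task-2")
server.create("f3", "task-2")

server.delete_owned_by("task-1")                      # task-1 exits cleanly
removed = server.collect_garbage(live_tasks=set())    # task-2 failed earlier
print(removed)
```

How a real server learns which tasks are alive is left open here; the sketch simply takes the live set as an argument.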
= 1% per month
In Memory                               Time (ms)    On Disk                                        Time (ms)
Local shared memory                     1            Disk                                           25
Local memory (through sponge server)    7            Disk with background IO                        174
Remote memory (over the network)        9            Disk with background IO and memory pressure    499
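To make the gap concrete, the latency figures above can be compared directly. The snippet below only divides the reported numbers; the dictionary keys are shorthand labels chosen for this example.

```python
# Sketch: how many times slower each disk case is than remote memory,
# using only the times (in ms) reported in the table.

latency_ms = {
    "local shared memory": 1,
    "local memory via sponge server": 7,
    "remote memory": 9,
    "disk": 25,
    "disk + background IO": 174,
    "disk + background IO + memory pressure": 499,
}

disk_cases = [
    "disk",
    "disk + background IO",
    "disk + background IO + memory pressure",
]

ratios = {
    case: round(latency_ms[case] / latency_ms["remote memory"], 1)
    for case in disk_cases
}
print(ratios)
# {'disk': 2.8, 'disk + background IO': 19.3,
#  'disk + background IO + memory pressure': 55.4}
```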
Spill a 1 MB buffer 10,000 times to disk and to memory
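A microbenchmark of this shape can be approximated on a single machine. The sketch below is not the authors' harness: it assumes an in-process `io.BytesIO` as the memory target and an fsync'd temporary file as the disk target, and it does not model sponge servers, background IO, or memory pressure.

```python
# Sketch: time repeated spills of one buffer to memory and to disk.
# Function name, targets, and defaults are assumptions for illustration.
import io
import os
import tempfile
import time


def time_spills(buffer_size=1 << 20, rounds=10_000):
    buf = bytes(buffer_size)

    # Memory target: copy the buffer into a fresh in-process byte stream.
    start = time.perf_counter()
    for _ in range(rounds):
        mem = io.BytesIO()
        mem.write(buf)
    memory_s = time.perf_counter() - start

    # Disk target: rewrite a temporary file and force it to the device.
    start = time.perf_counter()
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "spill.bin")
        for _ in range(rounds):
            with open(path, "wb") as f:
                f.write(buf)
                f.flush()
                os.fsync(f.fileno())
    disk_s = time.perf_counter() - start

    return {
        "memory_ms_per_spill": memory_s * 1000 / rounds,
        "disk_ms_per_spill": disk_s * 1000 / rounds,
    }


# Small run so the example finishes quickly; use the defaults for the
# 1 MB x 10,000 configuration named on the slide.
result = time_spills(buffer_size=4096, rounds=5)
print(result)
```

Absolute numbers from such a run depend heavily on the storage device and the operating system's page cache, so they are not comparable to the figures reported in the slides.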
processing and multiple message exchanges)
memory
spilling to disk
data spilled and the time difference between when the data is spilled and when it is read back
disk contention and memory pressure (Similar behavior is seen for the spam quantiles)
with disk contention, spilling to disk performs slightly better than spilling to SpongeFiles.
disk contention and by up to 85%