Pfimbi: Accelerating Big Data Jobs Through Flow-Controlled Data Replication - PowerPoint PPT Presentation

SLIDE 1

Pfimbi: Accelerating Big Data Jobs Through Flow-Controlled Data Replication

Simbarashe Dzinamarira*, Florin Dinu▵, T. S. Eugene Ng*

*Rice University, ▵EPFL

SLIDE 2

DFSs have a critical role in the Big Data landscape

[Figure: the Hadoop ecosystem - Management & Monitoring (Ambari), Coordination (ZooKeeper), Workflow & Scheduling (Oozie), Scripting (Pig), Machine Learning (Mahout), Query (Hive), Distributed Processing (MapReduce), Distributed Storage (HDFS), NoSQL Database (HBase), Data Integration (Sqoop/REST/ODBC)]

  • Rich ecosystem of distributed systems around Hadoop and Spark
  • Predominantly use HDFS for persistent storage
  • A performant HDFS benefits all these systems

Image reproduced from https://www.mssqltips.com/sqlservertip/3262/big-data-basics--part-6--related-apache-projects-in-hadoop-ecosystem/

SLIDE 3

Synchronous data replication in HDFS and its shortcomings

[Figure: the HDFS write pipeline - the client streams data (3. DATA) through a chain of DataNodes, and acknowledgements (4. ACKNOWLEDGEMENTS) flow back along the same chain]

  • Contention between primary writes and replication
  • Bottlenecks affect the whole pipeline (sketched below)
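To make the pipeline bottleneck concrete, here is a minimal Java sketch of one synchronous pipeline stage; the class and its structure are illustrative assumptions, not HDFS's actual implementation:

```java
// Illustrative sketch (not HDFS code): one stage of a synchronous
// replication pipeline. A packet is acknowledged upstream only after
// the local write and the downstream ack both complete, so the slowest
// disk or link anywhere in the chain paces the client's primary write.
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

class PipelineStage {
    private final OutputStream localDisk; // this node's storage
    private final Socket downstream;      // next DataNode, or null at the tail

    PipelineStage(OutputStream localDisk, Socket downstream) {
        this.localDisk = localDisk;
        this.downstream = downstream;
    }

    /** Handle one packet: write locally, forward, then wait for the ack. */
    void handlePacket(byte[] packet) throws IOException {
        localDisk.write(packet);                          // local write competes with replication IO
        if (downstream != null) {
            downstream.getOutputStream().write(packet);   // 3. DATA: forward downstream
            int ack = downstream.getInputStream().read(); // 4. ACK: block until acked
            if (ack == -1) throw new IOException("pipeline broken");
        }
        // Only now can this stage ack upstream.
    }
}
```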


SLIDE 4

Synchronous replication seldom helps boost application performance

  • In a study by Fetterly et al., only about 2% of data was read within 5 minutes of being written [TidyFS: USENIX ATC 2011]
  • Fast networks reduce the cost of non-local reads
  • There can be data locality without replication

SLIDE 5

Synchronous replication impedes industry efforts to improve HDFS

  • Heterogeneous storage
  • Memory as a storage medium

[Figure: storage devices attached to DataNodes - HDDs, SSDs, and RAM]

SSD image from: http://www.storagereview.com/intel_ssd_525_msata_review

SLIDE 6

Asynchronous replication relieves the effects of pipeline bottlenecks

[Figure: the primary write completes at the first DataNode; the data (3. DATA) is forwarded to the next DataNode in the background]

SLIDE 7

Besides asynchronous replication, we need flow control to manage contention

[Figure: bandwidth share over time across three DataNodes, without flow control vs. with flow control]

SLIDE 8

Pfimbi effectively supports flow-controlled asynchronous replication

  • Allows diverse flow control policies
  • Cleanly separates mechanisms from policies
  • Isolates primary writes from replication
  • Avoids IO underutilization

SLIDE 9

Pfimbi Overview

  • Inter-node flow control
  • Intra-node flow control

[Figure: inter-node flow control acts between DataNodes; intra-node flow control acts within a single DataNode]

SSD image from http://www.storagereview.com/intel_ssd_525_msata_review; magnifier image from https://commons.wikimedia.org/wiki/File:Magnifying_glass_icon.svg

SLIDE 10

Inter-node flow control

  • Timely transfer of replicas to ensure high utilization
  • Flexible policies for sharing bandwidth

[Figure: the client's write to the first DataNode is synchronous; replication downstream is asynchronous - Pfimbi sends a block notification, and the block itself is sent from a kernel-space block buffer when flow control permits]

  • Client API: (# of replicas, # of synchronous replicas) - a sketch follows
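As an illustration of that API shape, here is a hypothetical client-side sketch; the interface and method names are assumptions for this example, not Pfimbi's actual code:

```java
// Hypothetical sketch of a create call that exposes Pfimbi's knob:
// the caller picks the total replica count and how many replicas are
// written synchronously before the write is acknowledged.
import java.io.IOException;
import java.io.OutputStream;

public interface PfimbiClient {
    /**
     * Create a file with totalReplicas copies. Only the first
     * syncReplicas copies sit on the synchronous write path; the rest
     * are replicated asynchronously under Pfimbi's flow control.
     */
    OutputStream create(String path, int totalReplicas, int syncReplicas)
            throws IOException;
}

// Example: three replicas in total, only the primary write synchronous.
// OutputStream out = client.create("/logs/part-0", 3, 1);
```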


SLIDE 11

Hierarchical flow control allows Pfimbi to implement many IO policies

[Figure: a weight hierarchy over replication traffic - weights 100:1 between replica positions 1 and 2, and equal weights 1:1:1 among Jobs 1-3 under each position]

  • Example 1: prioritize replicas earlier in the pipeline
  • Example 2: fair sharing of bandwidth between jobs (see the sketch below)
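To make the hierarchy concrete, here is a minimal lottery-style Java sketch of hierarchical weighted sharing; the classes and the random-grant mechanism are assumptions for illustration, not Pfimbi's actual scheduler:

```java
// Illustrative sketch: flows sit at the leaves of a weight tree, and
// each block-send "grant" walks the tree, choosing a child with
// probability proportional to its weight.
import java.util.*;

class FlowNode {
    final String name;
    final int weight;
    final List<FlowNode> children = new ArrayList<>();

    FlowNode(String name, int weight) { this.name = name; this.weight = weight; }

    FlowNode addChild(FlowNode c) { children.add(c); return this; }

    /** Walk down the hierarchy, making one weighted choice per level. */
    FlowNode pickLeaf(Random rng) {
        if (children.isEmpty()) return this;
        int total = children.stream().mapToInt(c -> c.weight).sum();
        int r = rng.nextInt(total);
        for (FlowNode c : children) {
            if ((r -= c.weight) < 0) return c.pickLeaf(rng);
        }
        throw new AssertionError("unreachable");
    }
}

class Demo {
    public static void main(String[] args) {
        FlowNode root = new FlowNode("replication", 1);
        FlowNode pos1 = new FlowNode("position-1", 100); // favored position
        FlowNode pos2 = new FlowNode("position-2", 1);
        for (String job : new String[] {"job1", "job2", "job3"}) {
            pos1.addChild(new FlowNode(job + "@1", 1));  // equal weights: fair
            pos2.addChild(new FlowNode(job + "@2", 1));  // sharing among jobs
        }
        root.addChild(pos1).addChild(pos2);
        // Each iteration grants one block transfer to the chosen flow.
        Random rng = new Random(42);
        for (int i = 0; i < 5; i++) System.out.println(root.pickLeaf(rng).name);
    }
}
```

With the weights above, position-1 traffic receives roughly 100 of every 101 grants, while the three jobs under each position split that position's share evenly.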


SLIDE 12

Intra-node flow control

[Figure: incoming data splits at the block buffer - synchronous data goes straight to storage, while a tap meters asynchronous data into the buffer cache based on monitored disk activity]

  • Isolate synchronous data from asynchronous data
  • Avoid IO underutilization

Tap image from https://image.freepik.com/free-icon/bathroom-tap-silhouette_318-63404.png


SLIDE 13

Intra-node flow control: Pfimbi's strategy

  • OS threshold for flushing buffered data: T
  • Threshold for asynchronous replication: T + δ

[Figure: the buffer cache, with dirty data held between the flush threshold T and the replication threshold T + δ]

  • Keep the disk fully utilized
  • Limit the amount of replication data in the buffer cache

Typical values: T = 10% of RAM (~13 GB); δ = 500 MB; buffer cache = 20% of RAM (~26 GB). A sketch of this admission rule follows.
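Here is a minimal sketch of the admission rule implied by these thresholds, assuming a monitor that can report the current amount of dirty data (e.g. the Dirty field of /proc/meminfo on Linux); it is illustrative, not Pfimbi's code:

```java
// Illustrative admission rule: let asynchronous replication data into
// the buffer cache only while dirty data is below T + delta. The OS
// begins flushing at T, so holding dirty data in the band [T, T + delta]
// keeps the disk fully utilized while capping how much replication
// data can accumulate in the buffer cache.
class ReplicationAdmission {
    private final long flushThresholdT; // OS writeback threshold, e.g. 10% of RAM
    private final long delta;           // slack above T, e.g. 500 MB

    ReplicationAdmission(long flushThresholdT, long delta) {
        this.flushThresholdT = flushThresholdT;
        this.delta = delta;
    }

    /** @param dirtyBytes current dirty data in the buffer cache. */
    boolean admitAsyncBlock(long dirtyBytes) {
        return dirtyBytes < flushThresholdT + delta;
    }
}
```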


SLIDE 14

Additional topics that are discussed in detail in the paper

  • Other activity metrics and their shortcomings
  • Consistency: we maintain read and write consistency
  • Failure handling: same mechanism as in HDFS to recover from failures
  • Scalability: Pfimbi's flow control is distributed

SLIDE 15

Evaluation

  • 30 worker nodes: NodeManagers collocated with DataNodes
  • 1 master node: ResourceManager collocated with NameNode
  • Storage: 2 TB HDD, 200 GB SSD, 128 GB DRAM

SLIDE 16

Pfimbi improves job runtime and exploits SSDs well

[Figure: completion time of replicas (100-1000 s) for DFSIO on HDFS vs. DFSIO on Pfimbi, under the HDD->HDD->HDD and SSD->HDD->HDD configurations; bars break down into primary write, 1st replica, 2nd replica, and syncing dirty data]

SLIDE 17

Necessity of flow control when doing asynchronous replication

[Figure: completion times (100-1000 s) of two DFSIO jobs, without vs. with flow control; bars show Job 1, Job 2, and the remaining replication work]

SLIDE 18

Pfimbi performs well for a mix of different jobs: SWIM workload

  • 18% improvement in average job runtime

SLIDE 19

Policy Example: Pfimbi can flexibly divide bandwidth between replica positions

[Figure: number of block completions (0-70) over a timeline of DFSIO writes (0-1000 s) for 1st, 2nd, and 3rd replicas; left panel uses equal weights, right panel uses weights in the ratio 100:10:1]

SLIDE 20

Related Work

  • Sinbad [SIGCOMM 2013]
    • Flexible endpoints to reduce network congestion
    • Synchronous replication
  • TidyFS [USENIX ATC 2011]
    • Asynchronous replication
    • No flow control leads to arbitrary contention
  • Retro [NSDI 2015]
    • Fairness and prioritization using rate control
    • Does not eliminate contention within nodes

SLIDE 21

Conclusion

  • Pfimbi effectively supports flow-controlled asynchronous replication
  • Successfully balances managing contention and maintaining high utilization
  • Expressive and backward compatible with HDFS