Improving I/O Performance Through Colocating Interrelated Input Data and Near-Optimal Load Balancing
Felix Seibert, Mathias Peters, and Florian Schintke 4th HPBDC Workshop 2018, Vancouver seibert@zib.de
Overview
◮ Background and Motivation
◮ Placement Strategy
◮ Experimental Evaluation
◮ GeoMultiSens: Analyze the Earth’s surface based on remote sensing images
[GeoMultiSens pipeline: Data Sources → Homogenization → Evaluation → Visual Exploration]
◮ The more images, the better the results (several PB available)
◮ Implementation as a MapReduce job in Flink
◮ Goal: distribute the files in the distributed file system (XtreemFS) such that the computation is efficient
◮ Data-parallel application
◮ Parallelization along geographical regions (UTM grid)
[Dataflow: Image 1 … Image k are read from XtreemFS (via a Flink DataSource) and processed by Flink Map operators into a Composite and a Classification]
◮ Composites have a large memory footprint
◮ (De)serialization is expensive (Python ↔ Java)
◮ ⇒ composites should not be moved
◮ ⇒ the analysis of one region (group) should run on one node
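Since each region should be processed on a single node, the input must first be grouped by UTM region. A minimal Python sketch of this grouping step, assuming (as in the placement examples later in the talk) that the UTM tile name is the parent directory of each image file; the function name is illustrative:

```python
# Group image paths by UTM tile so that each group can be read and
# processed on a single node. Assumes paths like "32TQU/file_1".
from collections import defaultdict
from pathlib import PurePosixPath

def group_by_utm_tile(paths):
    groups = defaultdict(list)
    for path in paths:
        tile = PurePosixPath(path).parent.name  # e.g. "32TQU"
        groups[tile].append(path)
    return dict(groups)

paths = ["32TQU/file_1", "32TKS/file_1", "32TQU/file_2"]
print(group_by_utm_tile(paths))
# {'32TQU': ['32TQU/file_1', '32TQU/file_2'], '32TKS': ['32TKS/file_1']}
```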
◮ State of the art: more or less random distribution ⇒ no data-local processing possible
[Figure: random placement — each node is assigned one region (32TQU, 32TKS, 32TPT), but the files of every region are scattered across the hard drives of all three nodes]
◮ Network traffic
◮ Disk scheduling
◮ Goal: colocated UTM regions (groups) for local access
[Figure: grouped placement — each node stores all files of its assigned region(s) (32TQU, 32TKS, 32TPT) on its local hard drive]
◮ Network traffic
◮ Disk scheduling
◮ ⇒ both are addressed by local grouping
◮ ⇒ but local grouping creates load balancing issues (Europe: 1400 groups between 3 MB and 25 GB)
◮ Place all files of the same group in the same, tagged folder
◮ The distributed file system places all files of the same group on the same server
◮ Load balancing (the Storage Server Assignment Problem) is NP-hard
◮ Thus, we use an approximation algorithm
[Figure: file groups (regions) of varying sizes and storage machines (OSDs) m1 and m2]
Goal: assign the file groups to machines such that the most loaded machine is loaded as little as possible.
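Written as a formula (a hedged formalization of the slide's wording; the assignment map $A$ is our notation, the group sizes $s(f)$ follow the definitions two slides ahead):

```latex
% Objective: minimize the load of the most loaded machine (the makespan).
\min_{A \colon F \to \mathcal{S}} \;
\max_{S \in \mathcal{S}} \sum_{\substack{f \in F \\ A(f) = S}} s(f)
```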
[Figure: the same file groups under two different assignments to machines m1 and m2]
◮ Our assignment problem is equivalent to multiprocessor scheduling
◮ Approximation algorithm: Largest Processing Time first (LPT), which becomes "largest group size first" in our scenario
◮ S: set of storage servers (OSDs) with capacities c : S → ℕ
◮ F: set of file groups with sizes s : F → ℕ
◮ Sort F = {f1, . . . , fn} such that s(fi) ≥ s(fj) for all 1 ≤ i < j ≤ n
◮ ℓ(S): current load of server S, i.e., the total size of the groups assigned to it so far
◮ Si denotes the storage server assigned to group fi; for i = 1, . . . , n, Si is given by
$$S_i = \operatorname*{arg\,min}_{S \in \mathcal{S}} \frac{\ell(S) + s(f_i)}{c(S)}$$
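A minimal Python sketch of this rule (the function and variable names are illustrative; the actual implementation is the XtreemFS client tool linked at the end):

```python
# Capacity-aware LPT: assign the largest remaining group to the server
# that minimizes (load + group size) / capacity.
def lpt_assign(group_sizes, capacities):
    load = {server: 0 for server in capacities}  # l(S) for every server
    assignment = {}
    # Largest group size first (drop the sorting for the online variant).
    for group, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        best = min(capacities,
                   key=lambda s: (load[s] + size) / capacities[s])
        assignment[group] = best
        load[best] += size
    return assignment

groups = {"32TQU": 25_000, "32TKS": 9_000, "33UUU": 8_000, "32TPT": 3_000}
servers = {"osd1": 1, "osd2": 1, "osd3": 1}  # equal capacities
print(lpt_assign(groups, servers))
# {'32TQU': 'osd1', '32TKS': 'osd2', '33UUU': 'osd3', '32TPT': 'osd3'}
```

For the online problem, the sorting step is simply omitted and groups are assigned in the order they arrive.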
[Figures: step-by-step LPT example — the sorted file groups are assigned one after another to machines m1, m2, m3, each time to the machine with the least resulting load]
◮ Simple and fast algorithm
◮ Suitable for offline and online problems
◮ Good theoretical performance (for identical machines, LPT is a classic 4/3 − 1/(3m) approximation)
◮ Practical evaluation: deviates less than 1 % (offline) / less than 5 % (online) from the optimal solution
[Figure: architecture — application, file system mount, LPT implementation, metadata and replica catalogue (MRC), storage devices]
New code:
◮ client tool
◮ OSD selection policy for the MRC
[Figure: file system access via the POSIX interface — the client tool adds file groups (tags folders) and configures metadata operations; the OSD selection policy assigns files on read/write]
Example: add_folders(/path/to/xtreemfs_mount/some/subdirs/32TQU) ⇒ the MRC adds the mapping entry some/subdirs/32TQU → OSD 17
When a file such as some/subdirs/32TQU/file.tif is later created:
⇒ the MRC finds the match for prefix some/subdirs/32TQU
⇒ file.tif is stored on OSD 17
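A minimal sketch of such a prefix-based selection (the mapping table and function are illustrative assumptions standing in for the MRC's internal policy, not XtreemFS's actual API):

```python
# Illustrative prefix-based OSD selection: the mapping stands in for
# the folder-to-OSD table that the client tool registers at the MRC.
mapping = {
    "some/subdirs/32TQU": "OSD 17",
    "some/subdirs/32TKS": "OSD 3",
}

def select_osd(file_path, mapping, default=None):
    # Longest matching tagged-folder prefix wins.
    for folder in sorted(mapping, key=len, reverse=True):
        if file_path.startswith(folder + "/"):
            return mapping[folder]
    return default  # untagged files fall back to the normal policy

print(select_osd("some/subdirs/32TQU/file.tif", mapping))  # OSD 17
```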
◮ Input data: 3.3 TB of satellite images in 355 groups
◮ Hardware: one master and 29 worker/storage nodes, each with 16 CPU cores and a 10 Gbit/s network
◮ Job: read and decompress all data, in the same way as for land cover classification
◮ Tested file distributions:
  ◮ Random File (state-of-the-art default)
  ◮ Random File Group (e.g., CoHadoop)
  ◮ LPT File Group (our strategy)
◮ Each tested with HDDs and SSDs
◮ 10 repetitions for each setup
◮ measure total (incoming) network traffic of the whole job
[Bar chart: total received network traffic (Total Rx, in GB) for the LPT File Group, Random File Group, and Random File placements]
◮ 95 % decrease compared to Random File
◮ 68 % decrease compared to Random File Group
[Bar chart: normalized runtime and total CPU wait time for the HDD and SSD setups under Random File, Random File Group, and LPT File Group placement]
◮ Baseline: Random File takes 40 min with HDDs
◮ 39 % runtime and 50 % CPU wait time reduction compared to Random File
◮ 65 % runtime and 47 % CPU wait time reduction compared to Random File Group
◮ Baseline: Random File takes 16 min with SSDs
◮ File placement has no significant impact in SSD setups
◮ Low network usage seems to have little impact ⇒ the HDD speedup is mostly due to better disk scheduling
◮ Lightweight file placement mechanism that combines:
◮ colocation of related input files for local performance
◮ nearly optimal storage server selection for global performance (load balancing)
◮ Empirically verified benefits of colocated LPT placement:
◮ network traffic reduced by around 95 % compared to Random File placement
◮ time to read the input reduced by 39 % / 65 % compared to Random File / Random File Group placement
◮ load imbalance less than 5 % above the optimal solution
⇒ XtreemFS is ready for efficient large-scale analysis of the Earth’s surface
◮ XtreemFS: www.xtreemfs.org, https://github.com/xtreemfs/xtreemfs
◮ Client tool: https://github.com/felse/xtreemfs_client
◮ Application (GeoMultiSens): http://www.geomultisens.de/
◮ Many thanks to the GeoMultiSens team!
◮ Felix Seibert: https://www.zib.de/members/seibert
◮ Funding: GeoMultiSens (grants 01IS14010C and 01IS14010B) and the Berlin Big Data Center (BBDC) (grant 01IS14013B).