SLIDE 1

Improving I/O Performance Through Colocating Interrelated Input Data and Near-Optimal Load Balancing

Felix Seibert, Mathias Peters, and Florian Schintke
4th HPBDC Workshop 2018, Vancouver
seibert@zib.de

SLIDE 2

Overview

◮ Background and Motivation
◮ Placement Strategy
◮ Experimental Evaluation

SLIDE 3

Background: Application

◮ GeoMultiSens: Analyze the Earth’s surface based on remote sensing images

[Figure: processing pipeline — Data Sources → Homogenization → Evaluation → Visual Exploration]

◮ The more images, the better the results (several PB available)
◮ Implementation as MapReduce job in Flink
◮ Goal: distribute files in the distributed file system (XtreemFS) such that the computation is efficient and performant

SLIDE 4

Background: Application Details (1)

◮ Data parallel application
◮ Parallelization along geographical regions (UTM grid)

[Figure: images 1..k are read from XtreemFS via a Flink DataSource, combined by Flink Map tasks into a composite, which is then classified]

SLIDE 5

Background: Application Details (2)

[Figure: same dataflow as before — images read from XtreemFS via a Flink DataSource, mapped into a composite, then classified]

◮ composites have a large memory footprint
◮ (de)serialization is expensive (Python ↔ Java)
◮ ⇒ composites should not be moved
◮ ⇒ analysis of one region (group) should happen on one node

SLIDE 6

File Placement Issues (1)

◮ State of the art: more or less random distribution ⇒ no data-local processing possible

[Figure: random placement — files of regions 32TQU, 32TKS, and 32TPT are scattered across the local hard drives of nodes 1-3, regardless of which node is assigned which region]

◮ Network traffic
◮ Disk scheduling

SLIDE 7

File Placement Issues (2)

◮ Goal: colocated UTM regions (groups) for local access

[Figure: grouped placement — all files of a region (32TQU, 32TKS, 32TPT) are stored on the local hard drive of the node assigned that region]

◮ Network traffic
◮ Disk scheduling
◮ ⇒ local grouping
◮ ⇒ load balancing issues (Europe: 1400 groups between 3 MB and 25 GB)

SLIDE 8

Hybrid placement optimization strategy

◮ place all files of the same group in the same, tagged folder
◮ the distributed file system places all files of the same group on the same server
◮ load balancing (Storage Server Assignment Problem) is NP-hard
◮ thus, use an approximation algorithm

SLIDE 9

Storage Server Assignment Problem (1)

[Figure: file groups (regions) of varying sizes have to be assigned to storage machines (OSDs) m1 and m2]

Goal: Assign file groups to machines such that the most loaded machine is loaded as little as possible

SLIDE 10

Storage Server Assignment Problem (2)

[Figure: the same file groups under two different assignments to machines m1 and m2]

◮ our assignment problem is equivalent to multi-processor scheduling
◮ Approximation algorithm: Largest Processing Time first (LPT) becomes largest group size first in our scenario
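To make the equivalence concrete, consider a small hypothetical instance (the numbers are illustrative, not taken from the original slides): five groups of sizes 5, 4, 3, 3, 2 on two equally fast machines. Assigning them in arrival order could yield loads (5+3+2, 4+3) = (10, 7), i.e. a makespan of 10. LPT sorts the sizes in decreasing order and always picks the currently least loaded machine: 5→m1, 4→m2, 3→m2 (7), 3→m1 (8), 2→m2 (9), giving loads (8, 9) and makespan 9, which is optimal here since the total size 17 forces every schedule to at least ⌈17/2⌉ = 9.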

SLIDE 11

LPT - Formal Description

◮ S: set of storage servers (OSDs) with capacities c : S → N
◮ F: set of file groups with sizes s : F → N
◮ sort F = {f1, . . . , fn} such that s(fi) ≥ s(fj) for all 1 ≤ i < j ≤ n
◮ Si denotes the storage server assigned to group fi
◮ for i = 1, . . . , n, Si is given by Si = arg min_{S ∈ S} (ℓ(S) + s(fi)) / c(S), where ℓ(S) denotes the current load of server S

SLIDE 12

LPT - Step by Step Example (1)

Sorted file groups

[Figure: the sorted file groups before any assignment to machines m1, m2, m3]

SLIDE 13

LPT - Step by Step Example (2)

Remaining file groups

[Figure: next LPT step — the largest remaining group is placed on the least loaded of m1, m2, m3]

SLIDE 14

LPT - Step by Step Example (3)

Remaining file groups

[Figure: next LPT step — the largest remaining group is placed on the least loaded of m1, m2, m3]

SLIDE 15

LPT - Step by Step Example (4)

Remaining file groups

[Figure: next LPT step — the largest remaining group is placed on the least loaded of m1, m2, m3]

SLIDE 16

LPT - Step by Step Example (5)

Remaining file groups

[Figure: next LPT step — the largest remaining group is placed on the least loaded of m1, m2, m3]

SLIDE 17

LPT - Step by Step Example (6)

Remaining file groups

[Figure: next LPT step — the largest remaining group is placed on the least loaded of m1, m2, m3]

SLIDE 18

LPT - Step by Step Example (7)

Remaining file groups

[Figure: final LPT step — all file groups have been assigned to machines m1, m2, m3]

SLIDE 19

LPT: Key Properties

◮ simple and fast algorithm
◮ suitable for offline and online problems (online variant sketched below)
◮ good theoretical performance
◮ practical evaluation: differs by less than 1 % (offline) / less than 5 % (online) from the optimal solution
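In the online case, groups arrive one by one and cannot be sorted; a plausible sketch (assuming the same relative-load rule as in the offline code above, names again illustrative) simply drops the sorting step:

    def lpt_assign_online(group_size, servers, load):
        """Assign one newly arriving group to the relatively least loaded server."""
        best = min(servers, key=lambda srv: (load[srv] + group_size) / servers[srv])
        load[best] += group_size
        return best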

SLIDE 20

Implementation: Architecture

[Figure: architecture — the GMS client (application) mounts the file system; the client tool contains the LPT implementation; the MRC is the metadata and replica catalogue; the OSDs are the object storage devices]

new code:
◮ client tool
◮ OSD selection policy for MRC

SLIDE 21

Implementation: Add Group(s)

[Figure: the GMS client accesses the file system via the POSIX interface; the client tool adds file groups (tags folders) and configures the MRC's OSD selection policy; the MRC assigns files to OSDs, which handle reads/writes]

add folders(/path/to/xtreemfs mount/some/subdirs/32TQU)
⇒ MRC adds mapping entry: some/subdirs/32TQU → OSD 17
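Conceptually, adding a group folder therefore means running one LPT step and recording a folder-prefix-to-OSD mapping in the MRC. A hypothetical sketch of that bookkeeping (not the actual xtreemfs_client code):

    def add_group_folder(folder, group_size, osd_capacity, osd_load, mapping):
        """Choose an OSD for a new group folder (one LPT step) and record the prefix mapping."""
        best = min(osd_capacity,
                   key=lambda osd: (osd_load[osd] + group_size) / osd_capacity[osd])
        osd_load[best] += group_size
        mapping[folder] = best  # e.g. "some/subdirs/32TQU" -> "OSD 17"
        return best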

SLIDE 22

Implementation: Add File(s)

[Figure: same interaction as before — the client accesses files via POSIX, and the MRC's OSD selection policy assigns new files to OSDs]

open(/path/to/xtreemfs mount/some/subdirs/32TQU/LC8/file.tif)

⇒ MRC finds a match for the prefix some/subdirs/32TQU
⇒ file.tif is stored on OSD 17
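The selection policy can be thought of as a prefix lookup over the tagged folders; a hedged sketch (the longest-prefix rule and the random fallback are assumptions for illustration, not necessarily what the XtreemFS policy does):

    import random

    def select_osd(file_path, mapping, default_osds):
        """Return the OSD for a new file by prefix match against tagged group folders."""
        matches = [prefix for prefix in mapping if file_path.startswith(prefix)]
        if matches:
            return mapping[max(matches, key=len)]  # most specific (longest) prefix wins
        return random.choice(default_osds)  # no tagged folder matches: default placement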

SLIDE 23

Experimental setup

◮ input data: 3.3 TB of satellite images in 355 groups
◮ hardware: one master and 29 worker/storage nodes, each with 16 CPU cores and a 10 Gbit network connection
◮ job: read and decompress all data, in the same way as for land cover classification
◮ tested file distributions:
  ◮ Random File (state-of-the-art default)
  ◮ Random File Group (e.g., CoHadoop)
  ◮ LPT File Group (our strategy)
◮ each tested with HDDs and SSDs
◮ 10 repetitions for each setup

SLIDE 24

Network traffic

◮ measure total (incoming) network traffic of the whole job

[Figure: total incoming network traffic (Total Rx, in GB) for LPT File Group, Random File Group, and Random File placement]

◮ 95 % decrease compared to Random File
◮ 68 % decrease compared to Random File Group

SLIDE 25

Running times and CPU wait times (relative values)

[Figure: relative runtimes and total CPU wait times for HDD and SSD setups under Random File, Random File Group, and LPT File Group placement]

◮ baseline: Random File takes 40 min with HDDs
◮ 39 % running time and 50 % CPU wait time reduction compared to Random File
◮ 65 % running time and 47 % CPU wait time reduction compared to Random File Group

SLIDE 26

Running times and CPU wait times (relative values)

[Figure: same plot as before — relative runtimes and total CPU wait times per placement strategy]

◮ baseline: Random File takes 16 min with SSDs
◮ file placement has no significant impact with SSD setups
◮ low network usage alone seems to have little impact ⇒ the HDD speedup is mostly due to better disk scheduling

SLIDE 27

Conclusions/Summary

◮ Lightweight file placement mechanism that combines:
  ◮ colocation of related input files for local performance
  ◮ nearly optimal storage server selection for global performance (load balancing)
◮ Empirically verified benefits of colocated LPT placement:
  ◮ network traffic reduced by around 95 % compared to Random File placement
  ◮ time to read the input reduced by 39 % / 65 % compared to Random File / Random Group placement
  ◮ load deviates from the optimal solution by less than 5 %
⇒ XtreemFS is ready for efficient large-scale analysis of the Earth's surface

SLIDE 28

References

◮ XtreemFS: www.xtreemfs.org, https://github.com/xtreemfs/xtreemfs
◮ client tool: https://github.com/felse/xtreemfs_client
◮ Application (GeoMultiSens): http://www.geomultisens.de/
◮ Many thanks to the GeoMultiSens team!
◮ Felix Seibert: https://www.zib.de/members/seibert
◮ Funding: GeoMultiSens (grants 01IS14010C and 01IS14010B) and the Berlin Big Data Center (BBDC) (grant 01IS14013B).
