

SLIDE 1

waveSZ: A Hardware-Algorithm Co-Design of Efficient Lossy Compression for Scientific Data

Jiannan Tian (The University of Alabama) · Sheng Di (Argonne National Laboratory) · Chengming Zhang (The University of Alabama) · Xin Liang (University of California, Riverside) · Sian Jin (The University of Alabama) · Dazhao Cheng (University of North Carolina at Charlotte) · Dingwen Tao (The University of Alabama) · Franck Cappello (Argonne National Laboratory)

February 24, 2020 · PPoPP ’20 · San Diego, California, USA

SLIDE 2

Background · Introduction · Proposed Design of waveSZ · Experimental Evaluation · Conclusion and Future Work

Trend of Supercomputing Systems

Storage capacity and bandwidth are developing more slowly than computational capability.

supercomputer    | year | class      | PF          | MS       | SB        | MS/SB | PF/SB
Cray Jaguar      | 2008 | 1 PFLOPS   | 1.75 PFLOPS | 360 TB   | 240 GB/s  | 1.5k  | 7.3k
Cray Blue Waters | 2012 | 10 PFLOPS  | 13.3 PFLOPS | 1.5 PB   | 1.1 TB/s  | 1.3k  | 13k
Cray CORI        | 2017 | 10 PFLOPS  | 30 PFLOPS   | 1.4 PB   | 1.7 TB/s⋆ | 0.8k  | 17k
IBM Summit       | 2018 | 100 PFLOPS | 200 PFLOPS  | >10 PB⋆⋆ | 2.5 TB/s  | >4k   | 80k

PF: peak FLOPS · MS: memory size · SB: storage bandwidth

⋆ when using burst buffer ⋆⋆ counting only DDR4

Source: F. Cappello (ANL)

Table 1: Three classes of supercomputers showing their performance, MS and SB.

  • Feb. 24, 2020 · PPoPP ’20, San Diego, California, USA · waveSZ · 2 / 17
SLIDE 3


Current Status of Scientific Applications

Today’s scientific research is data-driven at large scale (simulations and instruments), with petabytes to process and analyze — PB-scale datasets are coming. Data reduction is in demand.

[Figure: Argonne Leadership Computing Facility; mouse-brain connectome under the X-ray Photon Source Upgrade — storage (×100 specimens) and analysis, 150 TB/specimen.]

◮ The cosmology simulation HACC (a) generates 1 20 PB of data per one-trillion-particle (10^12) simulation, 2 exhausting the file system (b) and 3 taking long to store (c). 4 A reduction rate of ~10 is needed.
◮ The climate simulation CESM generates 1 1 TB of data per compute day, 2 raising storage’s share of NCAR’s hardware budget from 20% (2013) to 50% (2017). 3 A reduction rate of 10+ is needed [A. Baker et al., HPDC ’16].
◮ The APS-U project (high-energy X-ray beam experiments), brain initiatives: 1 multi-hundred-PB storage; 2 data analysis performed off-site on ANL Mira, connected at 100 GB/s (d, e); 3 a reduction rate of ~100 is needed.

(a) Hardware/Hybrid Accelerated Cosmology Code. (b) Mira at ANL has a 26 PB file system; 20 PB / 26 PB ≈ 80%. (c) On NSF Blue Waters (1 TB/s I/O bandwidth), 5h30m to store the data. (d) It would take ~115 days to transfer the data. (e) There is no 100 PB buffer at the APS. :(

SLIDE 4


(Error-Bounded) Lossy Compression Matters

◮ Scientific datasets lossless-compressed at rate 2:1 [Son et al. 2014]

◮ represented in floating-point ◮ We need 10:1 or even higher!

◮ Industry lossy compressors offer much higher reduction rate.

◮ designed/optimized considering human perception ◮ not suitable for supercomputer applications

◮ Strict error control toward scientific discovery and accurate postanalysis

◮ data analysis with lossy datasets (after or during simulation) ◮ execution restarting from failures ◮ calculation from lossy data in memory

◮ Need diverse compression modes

◮ absolute error bound (L∞ norm error) ◮ pointwise relative error bound ◮ RMSE error bound (L2 norm error) ◮ fixed bitrate

◮ SZ [Di and Cappello 2016; Tao et al. 2017; Xin et al. 2018]

◮ prediction-based lossy compressor framework for scientific data ◮ strictly control the global upper bound of compression error

[Figure (from Peter Lindstrom, LLNL): JPEG with decreasing reduction rate — hence increasing quality — left to right, and lossy compression for scientific data at reduction rates from 10:1 to 250:1.]

SLIDE 5


How SZ Works

The SZ pipeline, from input to lossy output with strict error control:

initial data + parameters → prediction (decorrelation; linear 1D, or multidimensional) → linear-scaling quantization of the prediction errors (approximation) → variable-length coding (Huffman; the code stream has low entropy) → lossless compression → lossily compressed data

◮ The Lorenzo predictor allows arbitrary-dimensional prediction. The n-layer, d-dimensional form is

  ℓ(D_{x_1,…,x_d}) = \sum_{0 ≤ k_1,…,k_d ≤ n, (k_1,…,k_d) ≠ (0,…,0)} (−1)^{k_1+⋯+k_d+1} \prod_{j=1}^{d} \binom{n}{k_j} · D_{x_1−k_1,…,x_d−k_d}

◮ The single-layer Lorenzo predictor generally works best [Tao et al. 2017]. Its 2D form is

  ℓ(D_{0,0}) = D_{0,−1} + D_{−1,0} − D_{−1,−1},

  i.e., the elementwise dot product of the stencil [[−1, 1], [1, 0]] with the neighborhood [[D_{−1,−1}, D_{0,−1}], [D_{−1,0}, D_{0,0}]].
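The prediction and quantization stages can be sketched as follows. This is a simplified, hypothetical model of SZ — the function name, the outlier handling, and the bin radius are illustrative, not SZ’s actual code:

```python
import numpy as np

def sz_predict_quantize(data, eb, radius=32768):
    """Single-layer 2D Lorenzo prediction + linear-scaling quantization.

    Simplified sketch of SZ: predicts each point from already
    *reconstructed* neighbors so the pointwise error bound eb holds
    end to end after decompression.
    """
    rows, cols = data.shape
    quant = np.zeros((rows, cols), dtype=np.int64)
    decomp = np.zeros_like(data)
    for i in range(rows):
        for j in range(cols):
            # west + north - northwest; out-of-range neighbors read as 0
            w = decomp[i, j - 1] if j > 0 else 0.0
            n = decomp[i - 1, j] if i > 0 else 0.0
            nw = decomp[i - 1, j - 1] if i > 0 and j > 0 else 0.0
            pred = w + n - nw
            # offset from the prediction in bins of width 2*eb
            code = int(round((data[i, j] - pred) / (2 * eb)))
            if abs(code) < radius:            # predictable: keep quant code
                quant[i, j] = code
                decomp[i, j] = pred + code * 2 * eb
            else:                             # outlier: stored losslessly
                quant[i, j] = radius
                decomp[i, j] = data[i, j]
    return quant, decomp

rng = np.random.default_rng(0)
field = np.cumsum(rng.normal(size=(64, 64)), axis=1)  # smooth-ish 2D field
eb = 1e-2
quant, decomp = sz_predict_quantize(field, eb)
assert np.max(np.abs(field - decomp)) <= eb + 1e-12   # strict error bound
```

Predicting from the reconstructed neighbors rather than the raw ones is what keeps the pointwise error within eb after decompression.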

◮ Customized Huffman encoding

◮ sizeof(T)-byte symbols are mapped to Huffman codes ◮ high prediction quality (quantization codes aggregated at the center bin) makes the Huffman-coded bitstream more amenable to further gzip compression

[Figure: single-layer Lorenzo (ℓ) prediction over dim0/dim1 (the − + + stencil; 1st/2nd/3rd layers marked processed/processing/unprocessed) and linear-scaling quantization — the prediction error between the true and predicted values is encoded as a bin offset, with bin width tied to the error bound eb.]

SLIDE 6


Issues with SZ and Its Current FPGA Implementation

◮ Low throughput of SZ

◮ lack of parallelism: SIMD and SIMT cannot apply

◮ Limitations in FPGA GhostSZ

◮ totally performance-driven design ◮ 3 predictors in use, need extra bits to encode ◮ more “workflow pipelines” (more resource) ◮ low compression ratio

waveSZ, SZ-1.4: prediction error encoded in quantized form (16 bits)
GhostSZ: predictor selector (2 bits) + prediction error in quantized form (14 bits)

◮ New use scenarios of adopting FPGA

◮ real-time processing; “inline processing” (Intel, 2018) ◮ ExaNet—an FPGA-based direct network architecture of the European exascale systems [Ammendola et al. 2018]

[Figure 1: Loop-carried dependencies due to writeback — iterations m, m+1, m+2 over rows i−1, i and columns j…j+3.]

[Figure 2: General distribution pattern of quantization codes — codes within the capacity radius are “easy” to encode; outliers beyond it are “hard”.]

[Figure 3: Distribution of prediction errors (#points over errors in [−0.01, 0.01]) for SZ-1.0, SZ-1.4, and GhostSZ on CESM-ATM CLDLOW.]

SLIDE 7


Memory Access Pattern and Dependency

[Figure 4: SZ-1.4 and GhostSZ: memory access pattern and data dependency in Manhattan distance. (a) SZ-1.4 memory access pattern; (b) GhostSZ memory access pattern; (c) SZ-1.4 dependency in Manhattan distance; (d) GhostSZ dependency in Manhattan distance.]

◮ Dependencies are denoted with the Manhattan distance from the • zero point.
◮ SZ-1.4
  ◮ iterates against the dependencies, see Fig. 4(c)
  ◮ a read-after-write at the last cycle makes it impossible to extract parallelism
◮ GhostSZ
  ◮ overlooks multidimensional smoothness ◮ slices data of any dimensionality into 1D ◮ hence multiple • zero points ◮ no dependency “vertically”

SLIDE 8


Memory Access Pattern and Dependency (cont’d)

[Figure 5: SZ-1.4 and waveSZ: memory access pattern and data dependency in Manhattan distance. (a) SZ-1.4 memory access pattern; (b) waveSZ memory access pattern (over dim0/dim1); (c) SZ-1.4 dependency in Manhattan distance; (d) waveSZ dependency in Manhattan distance.]

◮ Dependencies are denoted with the Manhattan distance from the • zero point.
◮ waveSZ
  ◮ iterates along the aligned dependency-free points ◮ exploits the parallelism by pipelining
◮ Pipelining
  ◮ changes the algorithm as little as possible ◮ expects platform-supported pipelining control

SLIDE 9


Explicit Pipeline

[Pipeline demo: #task=5, II=1, depth=10 → total latency 14; #task=5, II=2, depth=10 → total latency 18.]

◮ How a pipeline works

◮ depth: #cycles to complete one iteration ◮ initiation interval (II): #cycles to wait before the next iteration can start ◮ total latency = (#task − 1) × II + depth ◮ speedup = (depth × #task) / total latency ◮ reducing II from 2 to 1 raises the speedup from 2.78× to 3.57× (28% better) in this demo ◮ As #task → ∞,

  speedup = (depth × #task) / ((#task − 1) × II + depth) → depth / II,

hence, II matters.

◮ Why II = 1 is not always achievable

◮ loop-carried data dependencies ◮ resources in use (e.g., memory ports)
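The latency and speedup formulas above, checked against the slide’s demo numbers:

```python
def pipeline_latency(n_task, ii, depth):
    """Total cycles to run n_task iterations through the pipeline."""
    return (n_task - 1) * ii + depth

def pipeline_speedup(n_task, ii, depth):
    """Speedup over serial execution (depth cycles per iteration)."""
    return depth * n_task / pipeline_latency(n_task, ii, depth)

# the demo on this slide: 5 tasks, depth 10
assert pipeline_latency(5, 1, 10) == 14
assert pipeline_latency(5, 2, 10) == 18
assert round(pipeline_speedup(5, 1, 10), 2) == 3.57
assert round(pipeline_speedup(5, 2, 10), 2) == 2.78
# as n_task grows, speedup approaches depth / II
assert abs(pipeline_speedup(10**6, 1, 10) - 10) < 1e-3
```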

SLIDE 10


Temporal-Spatial Mapping

[Figure: temporal-spatial mapping over the idx0 × idx1 grid. Starting (i, j+1) takes 1 cycle after (i, j); starting (i+1, j) takes Λ cycles after (i, j). Point (r, c) starts at cycle c·Λ + r and ends at cycle (c+1)·Λ + (r − 1); the iteration direction is orthogonal to the dependency direction. Over dim0 × dim1, the head spans Λ points, the body spans dim0 − Λ − 1, and the tail spans Λ.]

◮ FPGA + wavefront memory layout = more pipelining control
◮ Ideally, supposing prediction + quantization finish in Λ cycles, there is no stall if 1 II = 1, and 2 iteration proceeds (vertically) over Λ points from (r, c) to (r, c+1)
◮ the body (a “perfect loop”) is unrolled with factor Λ (= the vertical dimension) at II = 1
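A toy model of this schedule — point (r, c) issues at cycle c·Λ + r and finishes Λ cycles later, so every dependency (previous column, row index ≤ r) completes before its consumer issues. A simplified sketch of the idea, not the actual HLS schedule:

```python
LAMBDA = 4   # Λ: cycles to finish prediction + quantization for one point

def start(r, c):
    """Issue cycle of wavefront point (r, c): c*Λ + r (the slide's formula)."""
    return c * LAMBDA + r

def end(r, c):
    """Finish cycle of (r, c): (c+1)*Λ + (r-1), i.e. start + Λ - 1."""
    return start(r, c) + LAMBDA - 1

# In the skewed (wavefront) layout, every dependency of (r, c) sits in the
# previous column at a row index <= r, so it finishes before (r, c) issues.
for c in range(1, 8):
    for r in range(LAMBDA):
        for dep_r in range(r + 1):               # rows 0..r of column c-1
            assert end(dep_r, c - 1) < start(r, c)   # no pipeline stall
```

Because end(r', c−1) = c·Λ + r' − 1 < c·Λ + r = start(r, c) whenever r' ≤ r, the column stride of Λ is exactly what lets II = 1 hold with no stalls.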

SLIDE 11


Performance

◮ Platform

◮ target board: Xilinx ZC706 ◮ programming: C/C++ and high-level synthesis ◮ HLS: C/C++ semantics compiled to HDL ◮ unrolling with II=1 turns loops into pipelined hardware

◮ Datasets

◮ Scientific Data Reduction Benchmarks (SDRB) suite from https://sdrbench.github.io ◮ 3 representative datasets, a diversity of fields

◮ Synthesis report

◮ the body, the “perfect loop,” successfully unrolled with II=1 ◮ waveSZ uses fewer resources

dataset   | # fields | type    | dimensions  | example fields
CESM-ATM  | 79       | float32 | 1800×3600   | CLDHGH, CLDLOW
Hurricane | 20       | float32 | 100×500×500 | CLOUDf48, Uf48
NYX       | 6        | float32 | 512×512×512 | baryon_density

Figure 6: Representative datasets.

resource | total   | waveSZ | (%)  | GhostSZ | (%)
BRAM_18K | 1090    | 9      | 0.84 | 162     | 14.86
DSP48E   | 900     | 0      | 0.00 | 63      | 7.00
FF       | 437,200 | 4473   | 1.02 | 19,470  | 4.45
LUT      | 218,600 | 8208   | 3.75 | 27,030  | 12.37

Table 2: Resource utilization from synthesis.

SLIDE 12


Performance (cont’d)

◮ Baseline: CPU SZ-1.4, and GhostSZ ◮ Multilane waveSZ on FPGA vs OpenMP SZ ◮ Compressor configuration

◮ error bound set to 10−3 relative to the value range ◮ 16-bit quantization codes for waveSZ and ompSZ ◮ 14-bit quantization codes plus a 2-bit predictor code for GhostSZ

◮ Performance, in MB/s

dataset   | waveSZ | GhostSZ | SZ-1.4
CESM-ATM  | 995    | 130     | 114
Hurricane | 838    | 101     | 122
NYX       | 986    | 110     | 125

◮ Scaling up

◮ OpenMP parallelizes sublinearly (59% efficiency at 32 cores) ◮ the OpenMP version supports 3D only ◮ FPGA implementations saturate at the PCIe bandwidth

[Figure 7: Throughput (MB/s) versus degree of parallelism (1–32) for SZ-1.4 (OpenMP), waveSZ, and GhostSZ on Hurricane and NYX, against PCIe gen2 ×4 (peak for ZC706) and PCIe gen3 ×4 (reference) bandwidths.]

SLIDE 13


Statistics and Postanalysis

Compression Ratio (CR)
◮ What affects CR
  ◮ the distribution of quantization codes
  ◮ the amount of unpredictable data (“outliers”)
  ◮ how the codes are losslessly encoded
◮ GhostSZ reserves 2 bits to encode the predictor in use, splitting the code distribution into 3 peaks at prefixes 0b00…, 0b01…, 0b11…
◮ Techniques in use
  ◮ G⋆ stands for gzip only; H⋆G⋆ stands for Huffman + gzip ◮ with G⋆, waveSZ shows a higher CR than GhostSZ ◮ with simulated H⋆G⋆, waveSZ’s CR ≈ SZ-1.4’s CR
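How the code distribution drives CR can be seen with a back-of-the-envelope entropy estimate — a hypothetical model for intuition, not the paper’s measurement:

```python
import numpy as np

def estimated_cr(quant_codes, bits_per_value=32):
    """Estimate compression ratio from the Shannon entropy of the
    quantization codes -- roughly what an entropy coder such as
    Huffman (+ gzip) can approach. Illustrative helper, not SZ's encoder."""
    _, counts = np.unique(quant_codes, return_counts=True)
    p = counts / counts.sum()
    entropy = -(p * np.log2(p)).sum()          # bits per code
    return bits_per_value / max(entropy, 1e-9)

# a sharply centered code distribution (good prediction) compresses well;
# a flat one (poor prediction, or bits spent on predictor selectors) does not
sharp = np.random.default_rng(1).choice([0, 1, -1], p=[0.9, 0.05, 0.05],
                                        size=100_000)
flat = np.random.default_rng(1).integers(-128, 128, size=100_000)
assert estimated_cr(sharp) > estimated_cr(flat)
```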

[Figure 8: Impact of the error bound (10−3, 10−4, 10−5) on the quantization-code density; the center denotes zero offset from the original value.]

Figure 8: Error-bound changes and their impact on CR.

             | CESM-ATM | Hurricane | NYX
GhostSZ      | 7.9      | 6.2       | 6.6
waveSZ, G⋆   | 12.3     | 13.2      | 18.3
waveSZ, H⋆G⋆ | 29.4     | 20.3      | 34.8
SZ-1.4       | 31.2     | 21.4      | 33.8

Table 3: Compression ratio.

SLIDE 14


Statistics and Postanalysis (cont’d)

[Figure: distribution of compression errors in [−0.001, 0.001] for GhostSZ and waveSZ, with per-point absolute-error maps on CESM-ATM CLDLOW.]

◮ eb = 10−3 relative to the value range ◮ GhostSZ has slightly higher PSNR

◮ curve fitting is more intuitive here ◮ Lorenzo (multidimensionally linear) has a lower chance of high prediction accuracy in similar-value areas ◮ in the case of CESM-ATM

◮ Tradeoff between the two predictors

◮ multidimensionality (Lorenzo) ◮ higher PSNR (curve fitting) ◮ less resource use (Lorenzo) ◮ higher CR (Lorenzo)

          | GhostSZ | waveSZ | SZ-1.4
CESM-ATM  | 73.9    | 65.1   | 64.9
Hurricane | 70.6    | 66.0   | 65.0
NYX       | 74.5    | 66.5   | 65.2

Table 4: PSNR.

SLIDE 15


Conclusion and Future Work

Conclusion
◮ We adopt a wavefront memory layout to alleviate the dependencies of SZ-1.4 with its arbitrary-dimensional predictor.
◮ We propose a co-design framework for SZ lossy compression, waveSZ, and implement it in HLS.
◮ We propose hardware-algorithm co-optimizations (e.g., via HLS directives and base-two algorithmic operations).
◮ We evaluate on three real-world datasets from the SDRB suite, showing 2.1× compression ratio and 5.8× throughput on average over the current FPGA implementation.

Future Work
◮ Integrate open-source production-level gzip ◮ Integrate Huffman encoding

Thoughts on Future Systems
◮ Co-acceleration

◮ FPGA is not a replacement for manycore accelerators ◮ manycore + FPGA (for availability)

◮ What’s added

◮ feature: low latency (and high throughput) ◮ real-time processing in big-data analytics

SLIDE 16


Acknowledgement

This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’s exascale computing imperative. The material was also supported by the National Science Foundation under Grants No. 1305624, No. 1513201, and No. 1619253.

SLIDE 17

Thank You

Any questions?

SLIDE 18

BackUp (Rasterization)

Due to rasterization (an 1800×3600 field is visualized within several inches per dimension), the compression error of the Lorenzo predictor appears significantly worse than that of Order-{0,1,2}. (top: original; bottom left: GhostSZ error; bottom right: waveSZ error)

[Figure: 90×90 crops at (900, 0), (1800, 0), (900, 900), and (1800, 900) — original, GhostSZ error, and waveSZ error.]

SLIDE 19

BackUp (FPGA and GPU)

◮ Prediction + quantization

◮ tight dependencies in the original SZ ◮ alleviated with the wavefront layout, leaving a dependency in only one direction ◮ expensive synchronizations across iterations

◮ Lossless stage

◮ open-source gzip is available ◮ but it involves too many if-branches and random accesses

[Figure: wavefront execution with six threads (thread0–thread5) separated by 14 barriers — one barrier per anti-diagonal step.]

SLIDE 20

BackUp (PSNR and CR)

waveSZ mainly uses the same predictor as SZ-1.4 but has higher PSNR because ◮ waveSZ does not apply any bit truncation to unpredictable data. ◮ Interestingly, on the NYX dataset, waveSZ has a slightly higher CR (H⋆G⋆). waveSZ goes along the “y-direction” (the outer loop) to overlap the prediction and quantization latency, then moves to the next point in the “x-direction” (as shown in Slide 10).

SLIDE 21

BackUp (ℓ-Predictor)

◮ Gaussian-like weights, with the sign alternating by Manhattan distance to the (polarized) current point:

G_{5×5} =
⎡ 1  4  6  4  1 ⎤
⎢ 4 16 24 16  4 ⎥
⎢ 6 24 36 24  6 ⎥
⎢ 4 16 24 16  4 ⎥
⎣ 1  4  6  4  1 ⎦

ℓ_{5×5} =
⎡ −1   4  −6   4 −1 ⎤
⎢  4 −16  24 −16  4 ⎥
⎢ −6  24 −36  24 −6 ⎥
⎢  4 −16  24 −16  4 ⎥
⎣ −1   4  −6   4 −1 ⎦

◮ Works for arbitrary dimension: from line to cube, to hypercube…

2D stencil: ℓ(D_{x,y}) = D_{x−1,y} + D_{x,y−1} − D_{x−1,y−1}.

3D stencil: ℓ(D_{x,y,z}) = D_{x−1,y,z} + D_{x,y−1,z} + D_{x,y,z−1} − D_{x−1,y−1,z} − D_{x−1,y,z−1} − D_{x,y−1,z−1} + D_{x−1,y−1,z−1}.