

SLIDE 1

waveSZ: A Hardware-Algorithm Co-Design of Efficient Lossy Compression for Scientific Data

Jiannan Tian (The University of Alabama) · Sheng Di (Argonne National Laboratory) · Chengming Zhang (The University of Alabama) · Xin Liang (University of California, Riverside) · Sian Jin (The University of Alabama) · Dazhao Cheng (University of North Carolina at Charlotte) · Dingwen Tao (The University of Alabama) · Franck Cappello (Argonne National Laboratory)

February 24, 2020 · PPoPP ’20 · San Diego, California, USA

SLIDE 2

Background · Introduction · Proposed Design of waveSZ · Experimental Evaluation · Conclusion and Future Work

Trend of Supercomputing Systems

Storage capacity and bandwidth are developing more slowly than computational capability.

supercomputer    | year | class      | PF          | MS       | SB        | MS/SB | PF/SB
Cray Jaguar      | 2008 | 1 PFLOPS   | 1.75 PFLOPS | 360 TB   | 240 GB/s  | 1.5k  | 7.3k
Cray Blue Waters | 2012 | 10 PFLOPS  | 13.3 PFLOPS | 1.5 PB   | 1.1 TB/s  | 1.3k  | 13k
Cray CORI        | 2017 | 10 PFLOPS  | 30 PFLOPS   | 1.4 PB   | 1.7 TB/s⋆ | 0.8k  | 17k
IBM Summit       | 2018 | 100 PFLOPS | 200 PFLOPS  | >10 PB⋆⋆ | 2.5 TB/s  | >4k   | 80k

PF: peak FLOPS · MS: memory size · SB: storage bandwidth

⋆ when using burst buffer ⋆⋆ counting only DDR4

Source: F. Cappello (ANL)

Table 1: Three classes of supercomputers showing their performance, MS and SB.

  • Feb. 24, 2020 · PPoPP ’20, San Diego, California, USA · waveSZ · 2 / 17
SLIDE 3


Current Status of Scientific Applications

Today’s scientific research is data-driven at large scale (simulations and instruments), with petabytes to process and analyze — PB-scale datasets are coming. Data reduction is in demand.

[Figure: Argonne Leadership Computing Facility; mouse-brain connectome under the X-ray Photon Source Upgrade — storage (×100 specimens) and analysis, 150 TB/specimen.]

◮ The cosmology simulation HACC (a) generates 1 20 PB of data per one-trillion-particle (10^12) simulation, 2 exhausting the file system (b) and 3 taking long to store (c). 4 A reduction rate of ~10 is needed.
◮ The climate simulation CESM generates 1 1 TB of data per compute day, 2 raising storage’s share of NCAR’s hardware budget from 20% (2013) to 50% (2017). 3 A reduction rate of 10+ is needed [A. Baker et al., HPDC ’16].
◮ The APS-U project (high-energy X-ray beam experiments), brain initiatives: 1 multi-hundred-PB storage; 2 data analysis performed off-site on ANL Mira, connected at 100 GB/s (d, e); 3 a reduction rate of ~100 is needed.

(a) Hardware/Hybrid Accelerated Cosmology Code. (b) Mira at ANL has a 26 PB file system; 20 PB / 26 PB ≈ 80%. (c) On NSF Blue Waters (1 TB/s I/O bandwidth), 5h30m to store the data. (d) It would take ~115 days to transfer the data. (e) There is no 100 PB buffer at the APS. :(

SLIDE 4


(Error-Bounded) Lossy Compression Matters

◮ Scientific datasets lossless-compressed at rate 2:1 [Son et al. 2014]

◮ represented in floating-point ◮ We need 10:1 or even higher!

◮ Industry lossy compressors offer much higher reduction rate.

◮ designed/optimized considering human perception ◮ not suitable for supercomputer applications

◮ Strict error control toward scientific discovery and accurate postanalysis

◮ data analysis with lossy datasets (after or during simulation) ◮ execution restarting from failures ◮ calculation from lossy data in memory

◮ Need diverse compression modes

◮ absolute error bound (L∞ norm error) ◮ pointwise relative error bound ◮ RMSE error bound (L2 norm error) ◮ fixed bitrate

◮ SZ [Di and Cappello 2016; Tao et al. 2017; Xin et al. 2018]

◮ prediction-based lossy compressor framework for scientific data ◮ strictly control the global upper bound of compression error

[Figure (from Peter Lindstrom, LLNL): JPEG with decreasing reduction rate — hence increasing quality — left to right, and lossy compression for scientific data at reduction rates from 10:1 to 250:1.]

SLIDE 5


How SZ Works

The SZ pipeline, from input to lossy output with strict error control:

initial data + parameters → prediction (decorrelation; linear 1D, or multidimensional) → linear-scaling quantization of the prediction errors (approximation) → variable-length coding (Huffman; the code stream has low entropy) → lossless compression → lossily compressed data

◮ The Lorenzo predictor allows arbitrary-dimensional prediction. The n-layer, d-dimensional form is

  ℓ(D_{x_1,…,x_d}) = \sum_{0 ≤ k_1,…,k_d ≤ n, (k_1,…,k_d) ≠ (0,…,0)} (−1)^{k_1+⋯+k_d+1} \prod_{j=1}^{d} \binom{n}{k_j} · D_{x_1−k_1,…,x_d−k_d}

◮ The single-layer Lorenzo predictor generally works best [Tao et al. 2017]. Its 2D form is

  ℓ(D_{0,0}) = D_{0,−1} + D_{−1,0} − D_{−1,−1},

  i.e., the elementwise dot product of the stencil [[−1, 1], [1, 0]] with the neighborhood [[D_{−1,−1}, D_{0,−1}], [D_{−1,0}, D_{0,0}]].
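The prediction and quantization stages can be sketched as follows. This is a simplified, hypothetical model of SZ — the function name, the outlier handling, and the bin radius are illustrative, not SZ’s actual code:

```python
import numpy as np

def sz_predict_quantize(data, eb, radius=32768):
    """Single-layer 2D Lorenzo prediction + linear-scaling quantization.

    Simplified sketch of SZ: predicts each point from already
    *reconstructed* neighbors so the pointwise error bound eb holds
    end to end after decompression.
    """
    rows, cols = data.shape
    quant = np.zeros((rows, cols), dtype=np.int64)
    decomp = np.zeros_like(data)
    for i in range(rows):
        for j in range(cols):
            # west + north - northwest; out-of-range neighbors read as 0
            w = decomp[i, j - 1] if j > 0 else 0.0
            n = decomp[i - 1, j] if i > 0 else 0.0
            nw = decomp[i - 1, j - 1] if i > 0 and j > 0 else 0.0
            pred = w + n - nw
            # offset from the prediction in bins of width 2*eb
            code = int(round((data[i, j] - pred) / (2 * eb)))
            if abs(code) < radius:            # predictable: keep quant code
                quant[i, j] = code
                decomp[i, j] = pred + code * 2 * eb
            else:                             # outlier: stored losslessly
                quant[i, j] = radius
                decomp[i, j] = data[i, j]
    return quant, decomp

rng = np.random.default_rng(0)
field = np.cumsum(rng.normal(size=(64, 64)), axis=1)  # smooth-ish 2D field
eb = 1e-2
quant, decomp = sz_predict_quantize(field, eb)
assert np.max(np.abs(field - decomp)) <= eb + 1e-12   # strict error bound
```

Predicting from the reconstructed neighbors rather than the raw ones is what keeps the pointwise error within eb after decompression.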

◮ Customized Huffman encoding

◮ sizeof(T)-byte symbols are mapped to Huffman codes ◮ high prediction quality (quantization codes aggregated at the center bin) makes the Huffman-coded bitstream more amenable to further gzip compression

[Figure: single-layer Lorenzo (ℓ) prediction over dim0/dim1 (the − + + stencil; 1st/2nd/3rd layers marked processed/processing/unprocessed) and linear-scaling quantization — the prediction error between the true and predicted values is encoded as a bin offset, with bin width tied to the error bound eb.]

SLIDE 6


Issues with SZ and Its Current FPGA Implementation

◮ Low throughput of SZ

◮ lack of parallelism: SIMD and SIMT cannot apply

◮ Limitations in FPGA GhostSZ

◮ totally performance-driven design ◮ 3 predictors in use, need extra bits to encode ◮ more “workflow pipelines” (more resource) ◮ low compression ratio

waveSZ, SZ-1.4: prediction error encoded in quantized form (16 bits)
GhostSZ: predictor selector (2 bits) + prediction error in quantized form (14 bits)

◮ New use scenarios of adopting FPGA

◮ real-time processing; “inline processing” (Intel, 2018) ◮ ExaNet—an FPGA-based direct network architecture of the European exascale systems [Ammendola et al. 2018]

[Figure 1: Loop-carried dependencies due to writeback — iterations m, m+1, m+2 over rows i−1, i and columns j…j+3.]

[Figure 2: General distribution pattern of quantization codes — codes within the capacity radius are “easy” to encode; outliers beyond it are “hard”.]

[Figure 3: Distribution of prediction errors (#points over errors in [−0.01, 0.01]) for SZ-1.0, SZ-1.4, and GhostSZ on CESM-ATM CLDLOW.]

SLIDE 7


Memory Access Pattern and Dependency

[Figure 4: SZ-1.4 and GhostSZ: memory access pattern and data dependency in Manhattan distance. (a) SZ-1.4 memory access pattern; (b) GhostSZ memory access pattern; (c) SZ-1.4 dependency in Manhattan distance; (d) GhostSZ dependency in Manhattan distance.]

◮ Dependencies are denoted with the Manhattan distance from the • zero point.
◮ SZ-1.4
  ◮ iterates against the dependencies, see Fig. 4(c)
  ◮ a read-after-write at the last cycle makes it impossible to extract parallelism
◮ GhostSZ
  ◮ overlooks multidimensional smoothness ◮ slices data of any dimensionality into 1D ◮ hence multiple • zero points ◮ no dependency “vertically”

SLIDE 8


Memory Access Pattern and Dependency (cont’d)

[Figure 5: SZ-1.4 and waveSZ: memory access pattern and data dependency in Manhattan distance. (a) SZ-1.4 memory access pattern; (b) waveSZ memory access pattern (over dim0/dim1); (c) SZ-1.4 dependency in Manhattan distance; (d) waveSZ dependency in Manhattan distance.]

◮ Dependencies are denoted with the Manhattan distance from the • zero point.
◮ waveSZ
  ◮ iterates along the aligned dependency-free points ◮ exploits the parallelism by pipelining
◮ Pipelining
  ◮ changes the algorithm as little as possible ◮ expects platform-supported pipelining control

SLIDE 9


Explicit Pipeline

[Pipeline demo: #task=5, II=1, depth=10 → total latency 14; #task=5, II=2, depth=10 → total latency 18.]

◮ How a pipeline works

◮ depth: #cycles to complete one iteration ◮ initiation interval (II): #cycles to wait before the next iteration can start ◮ total latency = (#task − 1) × II + depth ◮ speedup = (depth × #task) / total latency ◮ reducing II from 2 to 1 raises the speedup from 2.78× to 3.57× (28% better) in this demo ◮ As #task → ∞,

  speedup = (depth × #task) / ((#task − 1) × II + depth) → depth / II,

hence, II matters.

◮ Why II = 1 is not always achievable

◮ loop-carried data dependencies ◮ resources in use (e.g., memory ports)
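The latency and speedup formulas above, checked against the slide’s demo numbers:

```python
def pipeline_latency(n_task, ii, depth):
    """Total cycles to run n_task iterations through the pipeline."""
    return (n_task - 1) * ii + depth

def pipeline_speedup(n_task, ii, depth):
    """Speedup over serial execution (depth cycles per iteration)."""
    return depth * n_task / pipeline_latency(n_task, ii, depth)

# the demo on this slide: 5 tasks, depth 10
assert pipeline_latency(5, 1, 10) == 14
assert pipeline_latency(5, 2, 10) == 18
assert round(pipeline_speedup(5, 1, 10), 2) == 3.57
assert round(pipeline_speedup(5, 2, 10), 2) == 2.78
# as n_task grows, speedup approaches depth / II
assert abs(pipeline_speedup(10**6, 1, 10) - 10) < 1e-3
```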

SLIDE 10


Temporal-Spatial Mapping

[Figure: temporal-spatial mapping over the idx0 × idx1 grid. Starting (i, j+1) takes 1 cycle after (i, j); starting (i+1, j) takes Λ cycles after (i, j). Point (r, c) starts at cycle c·Λ + r and ends at cycle (c+1)·Λ + (r − 1); the iteration direction is orthogonal to the dependency direction. Over dim0 × dim1, the head spans Λ points, the body spans dim0 − Λ − 1, and the tail spans Λ.]

◮ FPGA + wavefront memory layout = more pipelining control
◮ Ideally, supposing prediction + quantization finish in Λ cycles, there is no stall if 1 II = 1, and 2 iteration proceeds (vertically) over Λ points from (r, c) to (r, c+1)
◮ the body (a “perfect loop”) is unrolled with factor Λ (= the vertical dimension) at II = 1
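A toy model of this schedule — point (r, c) issues at cycle c·Λ + r and finishes Λ cycles later, so every dependency (previous column, row index ≤ r) completes before its consumer issues. A simplified sketch of the idea, not the actual HLS schedule:

```python
LAMBDA = 4   # Λ: cycles to finish prediction + quantization for one point

def start(r, c):
    """Issue cycle of wavefront point (r, c): c*Λ + r (the slide's formula)."""
    return c * LAMBDA + r

def end(r, c):
    """Finish cycle of (r, c): (c+1)*Λ + (r-1), i.e. start + Λ - 1."""
    return start(r, c) + LAMBDA - 1

# In the skewed (wavefront) layout, every dependency of (r, c) sits in the
# previous column at a row index <= r, so it finishes before (r, c) issues.
for c in range(1, 8):
    for r in range(LAMBDA):
        for dep_r in range(r + 1):               # rows 0..r of column c-1
            assert end(dep_r, c - 1) < start(r, c)   # no pipeline stall
```

Because end(r', c−1) = c·Λ + r' − 1 < c·Λ + r = start(r, c) whenever r' ≤ r, the column stride of Λ is exactly what lets II = 1 hold with no stalls.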

SLIDE 11


Performance

◮ Platform

◮ target board: Xilinx ZC706 ◮ programming: C/C++ and high-level synthesis ◮ HLS: C/C++ semantics compiled to HDL ◮ unrolling with II=1 turns loops into pipelined hardware

◮ Datasets

◮ Scientific Data Reduction Benchmarks (SDRB) suite from https://sdrbench.github.io ◮ 3 representative datasets, a diversity of fields

◮ Synthesis report

◮ the body, the “perfect loop,” successfully unrolled with II=1 ◮ waveSZ uses fewer resources

dataset   | # fields | type    | dimensions  | example fields
CESM-ATM  | 79       | float32 | 1800×3600   | CLDHGH, CLDLOW
Hurricane | 20       | float32 | 100×500×500 | CLOUDf48, Uf48
NYX       | 6        | float32 | 512×512×512 | baryon_density

Figure 6: Representative datasets.

resource | total   | waveSZ | (%)  | GhostSZ | (%)
BRAM_18K | 1090    | 9      | 0.84 | 162     | 14.86
DSP48E   | 900     | 0      | 0.00 | 63      | 7.00
FF       | 437,200 | 4473   | 1.02 | 19,470  | 4.45
LUT      | 218,600 | 8208   | 3.75 | 27,030  | 12.37

Table 2: Resource utilization from synthesis.

SLIDE 12


Performance (cont’d)

◮ Baseline: CPU SZ-1.4, and GhostSZ ◮ Multilane waveSZ on FPGA vs OpenMP SZ ◮ Compressor configuration

◮ error bound set to 10−3 relative to the value range ◮ 16-bit quantization codes for waveSZ and ompSZ ◮ 14-bit quantization codes plus a 2-bit predictor code for GhostSZ

◮ Performance, in MB/s

dataset   | waveSZ | GhostSZ | SZ-1.4
CESM-ATM  | 995    | 130     | 114
Hurricane | 838    | 101     | 122
NYX       | 986    | 110     | 125

◮ Scaling up

◮ OpenMP parallelizes sublinearly (59% efficiency at 32 cores) ◮ the OpenMP version supports 3D only ◮ FPGA implementations saturate at the PCIe bandwidth

[Figure 7: Throughput (MB/s) versus degree of parallelism (1–32) for SZ-1.4 (OpenMP), waveSZ, and GhostSZ on Hurricane and NYX, against PCIe gen2 ×4 (peak for ZC706) and PCIe gen3 ×4 (reference) bandwidths.]

SLIDE 13


Statistics and Postanalysis

Compression Ratio (CR)
◮ What affects CR
  ◮ the distribution of quantization codes
  ◮ the amount of unpredictable data (“outliers”)
  ◮ how the codes are losslessly encoded
◮ GhostSZ reserves 2 bits to encode the predictor in use, splitting the code distribution into 3 peaks at prefixes 0b00…, 0b01…, 0b11…
◮ Techniques in use
  ◮ G⋆ stands for gzip only; H⋆G⋆ stands for Huffman + gzip ◮ with G⋆, waveSZ shows a higher CR than GhostSZ ◮ with simulated H⋆G⋆, waveSZ’s CR ≈ SZ-1.4’s CR
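How the code distribution drives CR can be seen with a back-of-the-envelope entropy estimate — a hypothetical model for intuition, not the paper’s measurement:

```python
import numpy as np

def estimated_cr(quant_codes, bits_per_value=32):
    """Estimate compression ratio from the Shannon entropy of the
    quantization codes -- roughly what an entropy coder such as
    Huffman (+ gzip) can approach. Illustrative helper, not SZ's encoder."""
    _, counts = np.unique(quant_codes, return_counts=True)
    p = counts / counts.sum()
    entropy = -(p * np.log2(p)).sum()          # bits per code
    return bits_per_value / max(entropy, 1e-9)

# a sharply centered code distribution (good prediction) compresses well;
# a flat one (poor prediction, or bits spent on predictor selectors) does not
sharp = np.random.default_rng(1).choice([0, 1, -1], p=[0.9, 0.05, 0.05],
                                        size=100_000)
flat = np.random.default_rng(1).integers(-128, 128, size=100_000)
assert estimated_cr(sharp) > estimated_cr(flat)
```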

[Figure 8: Impact of the error bound (10−3, 10−4, 10−5) on the quantization-code density; the center denotes zero offset from the original value.]

Figure 8: Error-bound changes and their impact on CR.

             | CESM-ATM | Hurricane | NYX
GhostSZ      | 7.9      | 6.2       | 6.6
waveSZ, G⋆   | 12.3     | 13.2      | 18.3
waveSZ, H⋆G⋆ | 29.4     | 20.3      | 34.8
SZ-1.4       | 31.2     | 21.4      | 33.8

Table 3: Compression ratio.

SLIDE 14


Statistics and Postanalysis (cont’d)

[Figure: distribution of compression errors in [−0.001, 0.001] for GhostSZ and waveSZ, with per-point absolute-error maps on CESM-ATM CLDLOW.]

◮ eb = 10−3 relative to the value range ◮ GhostSZ has slightly higher PSNR

◮ curve fitting is more intuitive here ◮ Lorenzo (multidimensionally linear) has a lower chance of high prediction accuracy in similar-value areas ◮ in the case of CESM-ATM

◮ Tradeoff between the two predictors

◮ multidimensionality (Lorenzo) ◮ higher PSNR (curve fitting) ◮ less resource use (Lorenzo) ◮ higher CR (Lorenzo)

          | GhostSZ | waveSZ | SZ-1.4
CESM-ATM  | 73.9    | 65.1   | 64.9
Hurricane | 70.6    | 66.0   | 65.0
NYX       | 74.5    | 66.5   | 65.2

Table 4: PSNR.

SLIDE 15


Conclusion and Future Work

Conclusion
◮ We adopt a wavefront memory layout to alleviate the dependencies of SZ-1.4 with its arbitrary-dimensional predictor.
◮ We propose a co-design framework for SZ lossy compression, waveSZ, and implement it in HLS.
◮ We propose hardware-algorithm co-optimizations (e.g., via HLS directives and base-two algorithmic operations).
◮ We evaluate on three real-world datasets from the SDRB suite, showing 2.1× compression ratio and 5.8× throughput on average over the current FPGA implementation.

Future Work
◮ Integrate open-source production-level gzip ◮ Integrate Huffman encoding

Thoughts on Future Systems
◮ Co-acceleration

◮ FPGA is not a replacement for manycore accelerators ◮ manycore + FPGA (for availability)

◮ What’s added

◮ feature: low latency (and high throughput) ◮ real-time processing in big-data analytics

SLIDE 16


Acknowledgement

This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’s exascale computing imperative. The material was also supported by the National Science Foundation under Grants No. 1305624, No. 1513201, and No. 1619253.

SLIDE 17

Thank You

Any questions?

SLIDE 18

BackUp (Rasterization)

Due to rasterization (an 1800×3600 field is visualized within several inches per dimension), the compression error of the Lorenzo predictor appears significantly worse than that of Order-{0,1,2}. (top: original; bottom left: GhostSZ error; bottom right: waveSZ error)

[Figure: 90×90 crops at (900, 0), (1800, 0), (900, 900), and (1800, 900) — original, GhostSZ error, and waveSZ error.]

SLIDE 19

BackUp (FPGA and GPU)

◮ Prediction + quantization

◮ tight dependencies in the original SZ ◮ alleviated with the wavefront layout, leaving a dependency in only one direction ◮ expensive synchronizations across iterations

◮ Lossless stage

◮ open-source gzip is available ◮ but it involves too many if-branches and random accesses

[Figure: wavefront execution with six threads (thread0–thread5) separated by 14 barriers — one barrier per anti-diagonal step.]

SLIDE 20

BackUp (PSNR and CR)

waveSZ mainly uses the same predictor as SZ-1.4 but has higher PSNR because ◮ waveSZ does not apply any bit truncation to unpredictable data. ◮ Interestingly, on the NYX dataset, waveSZ has a slightly higher CR (H⋆G⋆). waveSZ goes along the “y-direction” (the outer loop) to overlap the prediction and quantization latency, then moves to the next point in the “x-direction” (as shown in Slide 10).

SLIDE 21

BackUp (ℓ-Predictor)

◮ Gaussian-like weights, with the sign alternating by Manhattan distance to the (polarized) current point:

G_{5×5} =
⎡ 1  4  6  4  1 ⎤
⎢ 4 16 24 16  4 ⎥
⎢ 6 24 36 24  6 ⎥
⎢ 4 16 24 16  4 ⎥
⎣ 1  4  6  4  1 ⎦

ℓ_{5×5} =
⎡ −1   4  −6   4 −1 ⎤
⎢  4 −16  24 −16  4 ⎥
⎢ −6  24 −36  24 −6 ⎥
⎢  4 −16  24 −16  4 ⎥
⎣ −1   4  −6   4 −1 ⎦

◮ Works for arbitrary dimension: from line to cube, to hypercube…

2D stencil: ℓ(D_{x,y}) = D_{x−1,y} + D_{x,y−1} − D_{x−1,y−1}.

3D stencil: ℓ(D_{x,y,z}) = D_{x−1,y,z} + D_{x,y−1,z} + D_{x,y,z−1} − D_{x−1,y−1,z} − D_{x−1,y,z−1} − D_{x,y−1,z−1} + D_{x−1,y−1,z−1}.