CUSZ: A High-Performance GPU-Based Lossy Compression Framework for Scientific Data. Sian Jin, October 5, 2020

CUSZ: A High-Performance GPU-Based Lossy Compression Framework for Scientific Data

Jiannan Tian (Washington State University)
Sheng Di (Argonne National Laboratory)
Kai Zhao (University of California, Riverside)
Cody Rivera (The University of Alabama)
Megan Hickman Fulp (Clemson University)
Robert Underwood (Clemson University)
Sian Jin (Washington State University)
Xin Liang (Oak Ridge National Laboratory)
Jon Calhoun (Clemson University)
Dingwen Tao (Washington State University)
Franck Cappello (Argonne National Laboratory)

October 5, 2020 · PACT ’20, Virtual Event

Trend of Supercomputing Systems

Gap Between Compute and I/O: The compute capability keeps growing, while storage capacity and bandwidth develop more slowly and do not match the pace.

supercomputer     year  class       PF           MS          SB          PF/SB  MS/SB
Cray Jaguar       2008  1 PFLOPS    1.75 PFLOPS  360 TB      240 GB/s    7.3k   1.5k
Cray Blue Waters  2012  10 PFLOPS   13.3 PFLOPS  1.5 PB      1.1 TB/s    13k    1.3k
Cray CORI         2017  10 PFLOPS   30 PFLOPS    1.4 PB      1.7 TB/s •  17k    0.8k
IBM Summit        2018  100 PFLOPS  200 PFLOPS   > 10 PB ••  2.5 TB/s    80k    > 4k

• when using burst buffer   •• counting only DDR4
PF: peak FLOPS   MS: memory size   SB: storage bandwidth

Table 1: Three classes of supercomputers showing their performance, MS and SB. Source: F. Cappello (ANL)

October 5, 2020 · PACT ’20, Virtual Event · CUSZ · 2 / 20
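The two ratio columns of Table 1 follow from PF, MS and SB by plain division. The Python sketch below recomputes them; decimal unit prefixes and the variable names are assumptions made for illustration, and the table itself rounds its values.

```python
# Decimal prefixes are an assumption; the slide does not state them.
TERA, PETA = 1e12, 1e15

# (peak FLOPS, memory size in bytes, storage bandwidth in bytes/s), read off Table 1.
systems = {
    "Cray Jaguar": (1.75 * PETA, 360 * TERA, 240e9),
    "Cray Blue Waters": (13.3 * PETA, 1.5 * PETA, 1.1 * TERA),
    "Cray CORI": (30 * PETA, 1.4 * PETA, 1.7 * TERA),
    "IBM Summit": (200 * PETA, 10 * PETA, 2.5 * TERA),  # MS is a lower bound
}

def ratios(pf, ms, sb):
    """Return (PF/SB, MS/SB): FLOPs per stored byte, and seconds to write all memory."""
    return pf / sb, ms / sb

for name, (pf, ms, sb) in systems.items():
    pf_sb, ms_sb = ratios(pf, ms, sb)
    print(f"{name:18s} PF/SB = {pf_sb / 1e3:5.1f}k   MS/SB = {ms_sb / 1e3:4.1f}k")
```

A growing PF/SB means more computation is performed for every byte the storage system can absorb, which is the gap the slide points at.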

Current Status of Scientific Applications: Big Data

application: data scale; to reduce
▶ HACC (cosmology simulation): 20 PB per one-trillion-particle simulation; use up FS (26 PB for Mira@ANL); 5h30m to store (NSF Blue Waters, I/O at 1 TBps); to reduce: 10x in need
▶ CESM (climate simulation): 20% vs 50% of h/w budget for storage, 2013 vs 2017; to reduce: 10x in need
▶ APS-U (High-Energy X-Ray Beams Experiments): brain initiatives, hundreds of PB; 100-PB buffer or connection at 100 GBps; to reduce: 100x in need

passive solution (?)

Error-Bounded Lossy Compression Matters

▶ Lossless compression of scientific datasets: reduction ratio of 2:1 (FP-type), while 10:1 or higher is in need.
▶ Industry lossy compressors (e.g., JPEG, MPEG): distinct in design goals; despite high reduction rate, not suitable for HPC.
▶ Need diverse compression modes:
  1 absolute error bound (L∞ norm)
  2 pointwise relative error bound
  3 RMSE error bound (L2 norm)
  4 fixed bitrate

SZ: Lossy compression for scientific data [Di and Cappello 2016; Tao et al. 2017; Xin et al. 2018], github.com/szcompressor/SZ
▶ prediction-based lossy compressor framework for scientific data
▶ strictly control the global upper bound of compression error

[Figure from Peter Lindstrom (LLNL): at varying reduction rate (10:1 to 250:1, left to right)]
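The first three error-bound modes on this slide correspond to three ways of measuring the difference between original and reconstructed data. The Python sketch below shows those measurements on a tiny made-up array; the function names and sample values are illustrative assumptions, not part of SZ. The fourth mode, fixed bitrate, constrains output size rather than error and is not covered here.

```python
import math

def max_abs_error(orig, recon):
    """L-infinity norm of the error: the largest absolute pointwise difference."""
    return max(abs(a - b) for a, b in zip(orig, recon))

def max_pointwise_rel_error(orig, recon):
    """Largest |a - b| / |a| over all points whose original value is nonzero."""
    return max(abs(a - b) / abs(a) for a, b in zip(orig, recon) if a != 0)

def rmse(orig, recon):
    """Root-mean-square error: the L2 norm of the error scaled by 1/sqrt(n)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig))

# Made-up sample data for illustration only.
orig = [1.00, 2.00, 4.00, 8.00]
recon = [1.01, 1.98, 4.00, 8.04]
eb = 0.05

print("within absolute bound:", max_abs_error(orig, recon) <= eb)
print("max pointwise relative error:", max_pointwise_rel_error(orig, recon))
print("RMSE:", rmse(orig, recon))
```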

SZ Framework (Error-Bound Workflow)

input: initial data + parameters
▶ DECORRELATION: prediction, linear (1D) or multidimensional
▶ APPROXIMATION (lossy): linear-scaling quantization of prediction errors, with strict error control
▶ CODING (lossless): variable-length coding (Huffman code) of the low-entropy quantization output
output: lossily compressed data
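The workflow relies on the quantized prediction errors having low entropy, which is what makes a variable-length code worthwhile. A minimal Python sketch of that measurement follows; the sample codes are hypothetical, and the 1024-bin figure is taken from the 1024-symbol codebook mentioned later in the deck.

```python
import math
from collections import Counter

def shannon_entropy_bits(symbols):
    """Average information per symbol, in bits, of the empirical distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical quantization codes: most prediction errors land in the central bin.
codes = [512] * 12 + [511] * 2 + [513] * 2
h = shannon_entropy_bits(codes)
print(f"entropy = {h:.3f} bits/symbol, versus 10 bits for a fixed-length code over 1024 bins")
```

The closer the distribution is to a single dominant bin, the further the entropy drops below the fixed-length cost, and the more a Huffman code can save.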

Motivation, Challenge, Research Objective and Contribution

▶ CUSZ is THE FIRST STRICTLY ERROR-BOUNDED LOSSY COMPRESSOR ON GPU FOR SCIENTIFIC DATA.

Challenge
▶ Tight data dependency (loop-carried RAW) hinders parallelization.
  solution: Eliminate the dependency and parallelize it. DUAL-QUANTIZATION: {PRE, POST}QUANTIZATION
▶ Host-device communication arises when tasks are placed considering only CPU/GPU suitability.
  solution: All tasks are done on GPU. Histogramming [Gómez-Luna et al.]; customized Huffman codec (coarse-grained)

[Figure: loop iterations i − 1 and i − 0 over indices j + 0 ... j + 3 and m + 0 ... m + 2]

System Workflow: Diagram of CUSZ

[Figure: CUSZ system workflow, with codebook entries for symbols 508 to 515 and the Huffman tree (root, t0 ... tn) drawn graphically]
▶ original data (floating-point representation)
▶ DUAL-QUANTIZATION AND PREDICTION
  PREQUANTIZATION (no RAW): in units of eb
  ℓ-prediction (no RAW): results in unit-weight prediction
  POSTQUANTIZATION (no RAW): in units of eb (unchanged), yields the quant. code
▶ histogramming
▶ build and canonize Huffman codebook
▶ CUSTOMIZED HUFFMAN ENCODING: fixed-length representation holding the Huffman code (MSB to LSB) and its bitwidth, remaining bits UNUSED; memcpy
▶ deflating Huffman codes: concatenating to dense format, giving the DEFLATED Huffman code

Histogram shown in the figure:
range      freq.
442-512    76%
512-582    24%
582-652    0.14%
652-722    0.073%
722-793    0.026%
793-863    0.0095%
863-933    0.0021%
933-1024   0.00014%
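The "deflating" step in the diagram takes Huffman codes stored in fixed-length slots (code plus bitwidth) and concatenates them into a dense bit stream. A sequential Python sketch for a single chunk is given below; the function name, the sample codes, and the zero padding of the final byte are assumptions for illustration, and the GPU version in CUSZ is not reproduced here.

```python
def deflate(codes):
    """Concatenate (code, bitwidth) pairs MSB-first into one dense byte string.

    Assumes each code fits in its stated bitwidth. Returns (bytes, number of valid bits).
    """
    bits = "".join(format(code, f"0{width}b") for code, width in codes)
    if not bits:
        return b"", 0
    pad = (-len(bits)) % 8  # zero-pad the tail up to a whole byte
    dense = int(bits + "0" * pad, 2).to_bytes((len(bits) + pad) // 8, "big")
    return dense, len(bits)

# Hypothetical chunk of Huffman codes as (value, bitwidth) pairs.
chunk = [(0b0, 1), (0b10, 2), (0b110, 3), (0b0, 1), (0b111, 3)]
dense, nbits = deflate(chunk)
print(dense.hex(), nbits)
```

Here five codes that would each occupy a full fixed-length slot shrink to 10 valid bits in two bytes; the valid-bit count has to be kept so a decoder knows where the padding starts.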

Loop-Carried Read-After-Write (P+Q) Procedure in SZ

▶ Lossless compression and decompression (codec) are mutually reversed procedures.
▶ Similarly, SZ makes the to-be-decompressed (reconstructed) data show up during compression and keeps it under error control.
▶ Error control is conducted during quantization and reconstruction: $\left| \operatorname{round}\left( e^{\circ} / (2 \cdot eb) \right) \times (2 \cdot eb) - e^{\circ} \right| \le eb$.
▶ This introduces a loop-carried read-after-write dependency.

SZ COMPRESSION (prediction, quantization, reconstruction; w/ loop-carried RAW):
$d_{k-2} - p^{\circ}_{k-2} = e^{\circ}_{k-2} \to q^{\circ}_{k-2} \to e^{\circ\star}_{k-2} \to d^{\circ\star}_{k-2}$
$d_{k-1} - p^{\circ}_{k-1} = e^{\circ}_{k-1} \to q^{\circ}_{k-1} \to e^{\circ\star}_{k-1} \to d^{\circ\star}_{k-1}$
$d_{k} - p^{\circ}_{k} = e^{\circ}_{k} \to q^{\circ}_{k} \to e^{\circ\star}_{k} \to d^{\circ\star}_{k}$

DECOMPRESSION:
$q^{\bullet}_{k} \to e^{\bullet}_{k} \to d^{\bullet}_{k}$, where $q^{\circ}_{k} \equiv q^{\bullet}_{k}$, $e^{\circ\star}_{k} \equiv e^{\bullet}_{k}$, $d^{\circ\star}_{k} \equiv d^{\bullet}_{k}$.
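The loop-carried dependency can be seen in a few lines of sequential code. The Python sketch below is an SZ-like procedure under simplifying assumptions: a 1D previous-value predictor, a starting prediction of 0, and unbounded integer codes. It is not the SZ implementation, and the function names are hypothetical.

```python
def sz_like_compress(data, eb):
    """Predict from the previous RECONSTRUCTED value, then quantize the error."""
    codes, recon = [], []
    prev = 0.0                       # reconstructed value of the previous point
    for d in data:
        p = prev                     # prediction READS the last reconstruction
        e = d - p                    # prediction error
        q = round(e / (2 * eb))      # linear-scaling quantization
        prev = p + q * (2 * eb)      # reconstruction WRITTEN here, read next iteration (RAW)
        codes.append(q)
        recon.append(prev)
    return codes, recon

def sz_like_decompress(codes, eb):
    """Replay the same reconstruction from the quantization codes alone."""
    out, prev = [], 0.0
    for q in codes:
        prev = prev + q * (2 * eb)
        out.append(prev)
    return out

data = [0.0, 0.12, 0.31, 0.27, 1.5]  # made-up sample values
eb = 0.05
codes, recon = sz_like_compress(data, eb)
print(codes)
print(max(abs(a - b) for a, b in zip(data, recon)) <= eb + 1e-12)
```

Iteration k cannot start until iteration k − 1 has written its reconstructed value, so the loop cannot be split across threads as written.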

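Fully Parallelized (P+Q) Procedure in CUSZ

▶ Prioritize error control.
▶ Error control happens at the very beginning, in prequantization: $\left| \operatorname{round}\left( d^{\circ} / (2 \cdot eb) \right) \times (2 \cdot eb) - d^{\circ} \right| \le eb$,
▶ and postquantization corresponds to quantization in SZ.

CUSZ COMPRESSION (PREQUANT, prediction, POSTQUANT; the final reconstruction is marked unnecessary):
$d_{k-2} \to d^{\circ}_{k-2} - p^{\circ}_{k-2} = \delta^{\circ}_{k-2} \equiv q^{\circ}_{k-2} \equiv \delta^{\circ\star}_{k-2} \to d^{\circ\star}_{k-2}$
$d_{k-1} \to d^{\circ}_{k-1} - p^{\circ}_{k-1} = \delta^{\circ}_{k-1} \equiv q^{\circ}_{k-1} \equiv \delta^{\circ\star}_{k-1} \to d^{\circ\star}_{k-1}$
$d_{k} \to d^{\circ}_{k} - p^{\circ}_{k} = \delta^{\circ}_{k} \equiv q^{\circ}_{k} \equiv \delta^{\circ\star}_{k} \to d^{\circ\star}_{k}$

DECOMPRESSION:
$q^{\bullet}_{k} \equiv \delta^{\bullet}_{k} \to d^{\bullet}_{k}$, where $q^{\circ}_{k} \equiv q^{\bullet}_{k}$, $\delta^{\circ\star}_{k} \equiv \delta^{\bullet}_{k}$, $d^{\circ\star}_{k} \equiv d^{\bullet}_{k}$.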
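Moving the error control to the front removes the dependency: once every value has been prequantized on its own, prediction and postquantization only read prequantized inputs. The Python sketch below illustrates the idea under assumptions that are mine rather than the slide's: a 1D previous-value predictor, prequantization to integer multiples of 2·eb, and a sequential decompression loop. It is not the CUSZ kernel code.

```python
def cusz_like_compress(data, eb):
    # PREQUANTIZATION: each point independently becomes an integer in units of 2*eb.
    pre = [round(d / (2 * eb)) for d in data]
    # Prediction reads only prequantized neighbours, and POSTQUANTIZATION is the
    # integer difference itself. No reconstructed value is read, so every k is
    # independent and could be handled by its own thread.
    return [pre[k] - (pre[k - 1] if k > 0 else 0) for k in range(len(pre))]

def cusz_like_decompress(codes, eb):
    out, acc = [], 0
    for q in codes:
        acc += q                     # integer running sum restores the prequantized value exactly
        out.append(acc * 2 * eb)
    return out

data = [0.0, 0.12, 0.31, 0.27, 1.5]  # made-up sample values
eb = 0.05
codes = cusz_like_compress(data, eb)
restored = cusz_like_decompress(codes, eb)
print(codes)
print(max(abs(a - b) for a, b in zip(data, restored)) <= eb + 1e-12)
```

Because the differences are taken between integers, postquantization loses nothing further; the only error is the one introduced, and bounded by eb, in prequantization.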

Original SZ (Loop-Carried RAW) vs Fully Parallelized CUSZ

SZ COMPRESSION (prediction, quantization, reconstruction; w/ loop-carried RAW), for each index $k-2$, $k-1$, $k$:
$d_{k} - p^{\circ}_{k} = e^{\circ}_{k} \to q^{\circ}_{k} \to e^{\circ\star}_{k} \to d^{\circ\star}_{k}$
SZ DECOMPRESSION:
$q^{\bullet}_{k} \to e^{\bullet}_{k} \to d^{\bullet}_{k}$

CUSZ COMPRESSION (PREQUANT, prediction, POSTQUANT; the final reconstruction is marked unnecessary), for each index $k-2$, $k-1$, $k$:
$d_{k} \to d^{\circ}_{k} - p^{\circ}_{k} = \delta^{\circ}_{k} \equiv q^{\circ}_{k} \equiv \delta^{\circ\star}_{k} \to d^{\circ\star}_{k}$
CUSZ DECOMPRESSION:
$q^{\bullet}_{k} \equiv \delta^{\bullet}_{k} \to d^{\bullet}_{k}$

Canonical Codebook and Huffman Encoding

ca·non·i·cal adj. [Schwartz and Kallick 1964]
▶ The codebook is transformed into a compact form.
▶ No tree is needed in decoding.
▶ Tree build time: 4-7 ms (for now).
▶ Canonization takes 200 µs (1024 symbols).

▶ Encoding/decoding is done in a coarse-grained manner.
▶ A GPU thread is assigned to a data chunk.
▶ Tune the degree of parallelism to keep every thread busy.
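A canonical codebook can be rebuilt from code lengths alone, which is why decoding needs no tree. The Python sketch below shows the standard canonical assignment and a tree-free decoder; the symbol values and lengths are hypothetical, and the dictionary lookup stands in for the compact per-length tables a real decoder would use. It is not the CUSZ codec.

```python
def canonical_codebook(lengths):
    """Map symbol -> (code, bitwidth) given each symbol's Huffman code length."""
    code, prev_len, book = 0, 0, {}
    # Shorter codes first; ties broken by symbol value.
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len   # extend the running code to the new length
        book[sym] = (code, length)
        code += 1                    # next code of the same length
        prev_len = length
    return book

def decode(bits, book):
    """Decode a string of '0'/'1' characters without walking a tree."""
    inverse = {cw: sym for sym, cw in book.items()}
    out, code, width = [], 0, 0
    for b in bits:
        code, width = (code << 1) | int(b), width + 1
        if (code, width) in inverse:  # a complete codeword has been read
            out.append(inverse[(code, width)])
            code, width = 0, 0
    return out

# Hypothetical code lengths for four quantization codes around the centre bin.
book = canonical_codebook({511: 2, 512: 1, 513: 3, 514: 3})
print(book)
print(decode("0101100111", book))
```

In the coarse-grained scheme on this slide, each GPU thread would run such an encode or decode loop over its own data chunk; the sketch handles a single chunk sequentially.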
