S5455, GTC 2015
High Capability Multidimensional Data Compression on GPUs
Sergio E. Zarantonello
szarantonello@scu.edu
Ed Karrels
ed.karrels@gmail.com
School of Engineering, Santa Clara University
Part 1: Theory and Applications
scientific simulations, monitoring devices, and high-end imaging applications.
hardware and software, to transmit, store, and analyze this data.
requiring several iterations of the compress/decompress cycle.
subject to error tolerances and error metrics specified by the user
parallelism at various levels.
bases, adaptive dictionaries, compressive sensing, sparse representations, etc.)
and Non-Destructive Testing of Materials.
both spatial and frequency domains.
representations of data.
singularities.
Multidimensional wavelets take advantage of data correlation along all coordinate axes.
algorithms.
coordinate axis.
compression possible subject to error constraint(s).
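A minimal sketch of a transform applied along each coordinate axis in turn. It assumes an unnormalized Haar (average/difference) step in place of the longer filters used later in the deck, a single decomposition level, and a 4 x 4 array. The names `haar_step`, `haar_2d` and `EDGE` are hypothetical and are not taken from the presenters' code.

```c
#include <stddef.h>

#define EDGE 4  /* illustrative edge length (assumption); must be even */

/* One Haar level on n samples (n <= EDGE) spaced `stride` apart:
   averages go to the first half, differences to the second half. */
void haar_step(float *data, size_t n, size_t stride)
{
    float tmp[EDGE];
    size_t half = n / 2;
    for (size_t i = 0; i < half; i++) {
        float a = data[(2 * i) * stride];
        float b = data[(2 * i + 1) * stride];
        tmp[i] = (a + b) * 0.5f;         /* smooth part */
        tmp[half + i] = (a - b) * 0.5f;  /* detail part */
    }
    for (size_t i = 0; i < n; i++)
        data[i * stride] = tmp[i];
}

/* img holds EDGE*EDGE samples in row-major order. */
void haar_2d(float *img)
{
    for (size_t y = 0; y < EDGE; y++)
        haar_step(img + y * EDGE, EDGE, 1);     /* along the x axis */
    for (size_t x = 0; x < EDGE; x++)
        haar_step(img + x, EDGE, EDGE);         /* along the y axis */
}
```

A 3d version would add a third pass with stride EDGE*EDGE. On smooth data most detail values come out near zero, which is what the later thresholding and quantization stages rely on.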
X-Ray CT scan: 10 steps of the cardiac cycle, 512 x 512 x 96 cube (http://www.osirix-viewer.com)
2d procedure vs. 3d procedure at the same error rate: PSNR=46
2d procedure: Compression Ratio=6.6, Cutoff=88%, Bins=1106, Max Error=9.45
3d procedure: Compression Ratio=10.2, Cutoff=92%, Bins=850, Max Error=5.68
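The quantities in this comparison can be sketched as follows. The sketch assumes that "Max Error" means the largest absolute sample difference and that the compression ratio is original size divided by compressed size; the deck does not define either. PSNR is usually computed from the mean squared error as 10*log10(peak^2 / MSE); the logarithm is left out here to keep the sketch free of math-library dependencies. All function names are hypothetical.

```c
#include <stddef.h>

/* Largest absolute difference between original and reconstructed samples. */
double max_abs_error(const float *orig, const float *recon, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)orig[i] - (double)recon[i];
        if (d < 0.0) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}

/* Mean squared error over n samples (n must be nonzero). */
double mean_squared_error(const float *orig, const float *recon, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)orig[i] - (double)recon[i];
        sum += d * d;
    }
    return sum / (double)n;
}

/* Original size divided by compressed size (compressed_bytes must be nonzero). */
double compression_ratio(size_t original_bytes, size_t compressed_bytes)
{
    return (double)original_bytes / (double)compressed_bytes;
}
```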
1. Calculate wavelet coefficients
Error Check
compression ratio subject to error tolerance
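One reading of the "Cutoff" percentage is the share of smallest-magnitude wavelet coefficients that are set to zero before quantization. The sketch below works under that assumption, which the deck does not spell out. `apply_cutoff` is a hypothetical name, and ties at the threshold are all zeroed, so slightly more than the requested share may be dropped.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for ascending floats. */
static int cmp_float(const void *a, const void *b)
{
    float x = *(const float *)a, y = *(const float *)b;
    return (x > y) - (x < y);
}

/* Zero every coefficient whose magnitude is at or below the magnitude
   found at the cutoff percentile. Returns the number of coefficients
   zeroed, or (size_t)-1 if the scratch buffer cannot be allocated. */
size_t apply_cutoff(float *coef, size_t n, double cutoff_percent)
{
    size_t n_drop = (size_t)((double)n * cutoff_percent / 100.0);
    if (n_drop > n) n_drop = n;
    if (n_drop == 0) return 0;

    float *mag = malloc(n * sizeof *mag);
    if (!mag) return (size_t)-1;
    for (size_t i = 0; i < n; i++)
        mag[i] = coef[i] < 0.0f ? -coef[i] : coef[i];
    qsort(mag, n, sizeof *mag, cmp_float);
    float threshold = mag[n_drop - 1];  /* largest magnitude to drop */
    free(mag);

    size_t zeroed = 0;
    for (size_t i = 0; i < n; i++) {
        float m = coef[i] < 0.0f ? -coef[i] : coef[i];
        if (m <= threshold) {
            coef[i] = 0.0f;
            zeroed++;
        }
    }
    return zeroed;
}
```

An outer search would then vary the cutoff and the number of quantization bins, run the error check after each trial, and keep the setting with the highest compression ratio that still meets the tolerance.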
[Figure: plots against % Cutoff]
        Cutoff  Bins  PSNR  Ratio
Begin   76 %    1850  52.4  3.6 X
End     92 %    850   46.2  10.2 X
Objective: efficient transfer over the internet of high-resolution 3d images of retina for diagnosis.
Dataset courtesy of Quinglin Wang, Carl Zeiss Meditec Inc.
3D CT X‐Ray Inspection
North Star Imaging
Part 2: Implementation
[Figure: array contents Before and After, with "evens" labeled]
[Figure: grid of thread blocks: 0,0  1,0  2,0  3,0  0,1  0,2]
[Figure: X, Y, Z axes]
[Figure: data movement between global memory and shared memory; numbers are thread indices]
Noncontiguous read from shared memory
Contiguous write to global memory
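The access pattern named in these labels resembles a tiled transpose: data is read contiguously into a small tile, the tile is read back with a stride, and the result is written contiguously. The sketch below is a host-side C illustration of that index pattern only, not the presenters' kernel. The loops stand in for threads, the local `tile` array stands in for shared memory, and `TILE` and `transpose_tiled` are hypothetical names. A tile edge of 4 is assumed because the figure's index order (4 8 12 1 5 9 13 ...) matches a stride of 4.

```c
#include <stddef.h>

#define TILE 4  /* tile edge (assumption, matching the 16 indices in the figure) */

/* Transpose a rows x cols row-major matrix into out (cols x rows).
   rows and cols must be multiples of TILE. */
void transpose_tiled(const float *in, float *out, size_t rows, size_t cols)
{
    float tile[TILE][TILE];  /* stands in for per-block shared memory */
    for (size_t by = 0; by < rows; by += TILE) {
        for (size_t bx = 0; bx < cols; bx += TILE) {
            /* Contiguous read from "global" memory into the tile. */
            for (size_t ty = 0; ty < TILE; ty++)
                for (size_t tx = 0; tx < TILE; tx++)
                    tile[ty][tx] = in[(by + ty) * cols + (bx + tx)];
            /* Strided (noncontiguous) read from the tile,
               contiguous write back to "global" memory. */
            for (size_t ty = 0; ty < TILE; ty++)
                for (size_t tx = 0; tx < TILE; tx++)
                    out[(bx + ty) * rows + (by + tx)] = tile[tx][ty];
        }
    }
}
```

The point of the pattern is that both accesses to the large array run along consecutive addresses; only the small tile is read with a stride.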
#define FILTER_0 .852
#define FILTER_1 .377
#define FILTER_2 -.110
(860ms → 8.2ms for 256x256x256 cubelet, 8 levels of transforms along each axis)
#define FILTER_0 .852f
#define FILTER_1 .377f
#define FILTER_2 -.110f
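The only difference between the two sets of definitions is the `f` suffix. In C, an unsuffixed floating constant has type double, so multiplying a float sample by it promotes the whole expression to double. The sketch below makes that visible with `sizeof`. The macro and function names are hypothetical, and the deck does not say how much of the quoted timing change comes from this detail.

```c
#include <stddef.h>

#define FILTER_0_DOUBLE .852   /* no suffix: type double */
#define FILTER_0_FLOAT  .852f  /* f suffix: type float */

/* Size in bytes of (float sample) * (unsuffixed constant). */
size_t product_size_unsuffixed(void)
{
    float sample = 1.0f;
    return sizeof(sample * FILTER_0_DOUBLE);  /* expression type is double */
}

/* Size in bytes of (float sample) * (suffixed constant). */
size_t product_size_suffixed(void)
{
    float sample = 1.0f;
    return sizeof(sample * FILTER_0_FLOAT);   /* expression type stays float */
}
```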
[-5 .. +5]
(7.0 toolkit is 35% faster than 6.5)
Log quantization, pSNR 38.269
GPU speedup: 97× (thrust::transform())
Lloyd quantization, pSNR 45.974
GPU speedup: create 13×, apply 48×
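A minimal host-side sketch of Lloyd quantization, assuming the textbook iteration: assign each sample to its nearest level, then move each level to the mean of its samples. `lloyd_refine` loosely corresponds to the "create" step and `nearest_level` to the "apply" step. Both names, the `MAX_LEVELS` limit and the fixed iteration count are illustrative assumptions rather than the presenters' implementation.

```c
#include <stddef.h>

#define MAX_LEVELS 4096  /* illustrative upper bound on codebook size */

/* Index of the codebook entry nearest to v ("apply"). */
size_t nearest_level(const float *codebook, size_t levels, float v)
{
    size_t best = 0;
    float best_d = v - codebook[0];
    if (best_d < 0.0f) best_d = -best_d;
    for (size_t k = 1; k < levels; k++) {
        float d = v - codebook[k];
        if (d < 0.0f) d = -d;
        if (d < best_d) {
            best_d = d;
            best = k;
        }
    }
    return best;
}

/* Refine an initial codebook in place ("create"). */
void lloyd_refine(const float *data, size_t n,
                  float *codebook, size_t levels, int iterations)
{
    double sum[MAX_LEVELS];
    size_t count[MAX_LEVELS];
    if (levels == 0 || levels > MAX_LEVELS) return;

    for (int it = 0; it < iterations; it++) {
        for (size_t k = 0; k < levels; k++) {
            sum[k] = 0.0;
            count[k] = 0;
        }
        /* Assignment step: nearest level for every sample. */
        for (size_t i = 0; i < n; i++) {
            size_t k = nearest_level(codebook, levels, data[i]);
            sum[k] += data[i];
            count[k]++;
        }
        /* Update step: each used level moves to the mean of its samples. */
        for (size_t k = 0; k < levels; k++)
            if (count[k] > 0)
                codebook[k] = (float)(sum[k] / (double)count[k]);
    }
}
```

Unlike a fixed logarithmic spacing, the levels here adapt to the data, which is consistent with the higher pSNR reported for Lloyd quantization.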
Value  Count     Encoding
9      16609445  1
8      46198     011
10     42896     001
11     32594     000
7      30831     0101
12     6942      01000
6      5388      010011
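The rows shown can be checked mechanically. The sketch below stores them as listed, confirms that no code is a prefix of another, and totals the bits needed to encode them. The struct and function names are hypothetical, and only the rows visible on the slide are included, so the totals cover those rows alone.

```c
#include <stddef.h>
#include <string.h>

struct code_entry {
    int value;
    unsigned long long count;
    const char *bits;  /* code word written as a string of '0'/'1' */
};

/* Rows as they appear on the slide. */
static const struct code_entry SLIDE_TABLE[] = {
    { 9, 16609445ULL, "1"},
    { 8,    46198ULL, "011"},
    {10,    42896ULL, "001"},
    {11,    32594ULL, "000"},
    { 7,    30831ULL, "0101"},
    {12,     6942ULL, "01000"},
    { 6,     5388ULL, "010011"},
};
#define SLIDE_TABLE_LEN (sizeof SLIDE_TABLE / sizeof SLIDE_TABLE[0])

/* Total bits: sum of count * code length. */
unsigned long long encoded_bits(const struct code_entry *t, size_t n)
{
    unsigned long long total = 0;
    for (size_t i = 0; i < n; i++)
        total += t[i].count * (unsigned long long)strlen(t[i].bits);
    return total;
}

/* Total number of encoded values. */
unsigned long long total_count(const struct code_entry *t, size_t n)
{
    unsigned long long total = 0;
    for (size_t i = 0; i < n; i++)
        total += t[i].count;
    return total;
}

/* Returns 1 if no code word is a prefix of another, else 0. */
int is_prefix_free(const struct code_entry *t, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (i != j &&
                strncmp(t[i].bits, t[j].bits, strlen(t[i].bits)) == 0)
                return 0;
    return 1;
}
```

For these rows the total is 17164871 bits for 16774294 values, a little over one bit per value, because the dominant value 9 gets a one-bit code.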
[Figure: CPU and GPU time in milliseconds by stage: Wavelet transform, Sort, Quantize, Histogram]
GPU: GTX 980
CPU: Intel Core i7 3.5GHz
               MATLAB  CPU   GPU
Compress       43000   2300  21
Error control  39000   1400  18
time to process 256x256x256 cubelet, in milliseconds
Sergio Zarantonello, Santa Clara University, szarantonello@scu.edu
Ed Karrels, Santa Clara University, ed.karrels@gmail.com
Drazen Fabris, Santa Clara University, dfabris@scu.edu
David Concha, Universidad Rey Juan Carlos, Spain, david.concha@urjc.es
Anupam Goyal, Algorithmica LLC, anupam@rithmica.com
Bonnie Smithson, Santa Clara University, Bonnie@DejaThoris.com