A new Direct Connected Component Labeling and Analysis Algorithm for GPUs
Arthur Hennequin (1,2), Lionel Lacassagne (1)
(1) LIP6, Sorbonne University, CNRS, France
(2) LHCb experiment, CERN, Switzerland
GTC 2019, March 21st
1 / 29
Figure: gray-level image, binary image (segmentation by motion detection), connected component labeling, connected component analysis.
2 / 29
◮ compute the local positive min over a 3 × 3 neighborhood
◮ until stabilization: the number of iterations depends on the data
◮ not predictable, nor suited for embedded systems
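The iterative scheme above can be sketched in a few lines of Python. This is a minimal sequential illustration, not the GPU implementation discussed in the talk: the function name `iterative_ccl`, the initial labeling (raster index + 1) and the in-place sweep order are all assumptions made for the example.

```python
def iterative_ccl(image):
    """Label a binary image by repeated local positive-min propagation.

    Assumption: 8-connectivity (full 3 x 3 neighborhood), background = 0.
    Returns the label image and the number of sweeps performed.
    """
    h, w = len(image), len(image[0])
    # Every foreground pixel starts with a unique positive label.
    labels = [[i * w + j + 1 if image[i][j] else 0 for j in range(w)]
              for i in range(h)]
    sweeps = 0
    changed = True
    while changed:  # until stabilization
        changed = False
        sweeps += 1
        for i in range(h):
            for j in range(w):
                if labels[i][j] == 0:
                    continue
                best = labels[i][j]
                # Positive min over the 3 x 3 neighborhood (zeros are ignored).
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < h and 0 <= nj < w and 0 < labels[ni][nj] < best:
                            best = labels[ni][nj]
                if best != labels[i][j]:
                    labels[i][j] = best
                    changed = True
    return labels, sweeps
```

The returned sweep count depends on the shape of the components, which is the unpredictability the bullets point out.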
◮ first pass = temporary label creation and equivalence building
◮ need an equivalence table to memorize the connectivity between labels
◮ then transitive closure of the tree associated to the equivalence table
◮ second pass = label relabeling
3 / 29
First pass (temporary label creation and equivalence building):
for i = 0 : h − 1 do
    for j = 0 : w − 1 do
        if I[i][j] ≠ 0 then
            e1 ← E[i − 1][j]
            e2 ← E[i][j − 1]
            if (e1 = 0 and e2 = 0) then
                ne ← ne + 1
                ex ← ne
            else
                r1 ← Find(e1, T)
                r2 ← Find(e2, T)
                ex ← min+(r1, r2) // positive min: smallest non-zero value
                if (r1 ≠ 0 and r1 ≠ ex) then T[r1] ← ex
                if (r2 ≠ 0 and r2 ≠ ex) then T[r2] ← ex
        else
            ex ← 0
        E[i][j] ← ex

Find(e, T):
    while T[e] ≠ e do
        e ← T[e]
    return e // the root of the tree

Union(e1, e2, T):
    r1 ← Find(e1, T)
    r2 ← Find(e2, T)
    if (r1 < r2) then T[r2] ← r1
    else T[r1] ← r2

Transitive closure:
for e = 0 : ne do
    T[e] ← T[T[e]]
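The two-pass scheme can be put together as a short runnable sketch. The version below assumes 4-connectivity with the top and left predecessors, as in the pseudocode; the function names, the out-of-image handling (treated as background) and the growth of `T` by appending are choices made for this example only.

```python
def find(e, T):
    """Follow parent links until the root of the tree containing e."""
    while T[e] != e:
        e = T[e]
    return e


def two_pass_ccl(image):
    """Two-pass labeling with an equivalence table T (4-connectivity)."""
    h, w = len(image), len(image[0])
    E = [[0] * w for _ in range(h)]
    T = [0]  # T[0] = 0 is the background; labels start at 1
    # First pass: temporary labels and equivalences.
    for i in range(h):
        for j in range(w):
            if image[i][j] == 0:
                continue
            e1 = E[i - 1][j] if i > 0 else 0  # top predecessor
            e2 = E[i][j - 1] if j > 0 else 0  # left predecessor
            if e1 == 0 and e2 == 0:
                T.append(len(T))  # new label, initially its own root
                ex = len(T) - 1
            else:
                roots = [find(e, T) for e in (e1, e2) if e != 0]
                ex = min(roots)  # positive min of the roots
                for r in roots:
                    T[r] = ex
            E[i][j] = ex
    # Transitive closure: parents are never larger than children, so a
    # single ascending pass makes every entry point directly to its root.
    for e in range(1, len(T)):
        T[e] = T[T[e]]
    # Second pass: relabeling.
    for i in range(h):
        for j in range(w):
            E[i][j] = T[E[i][j]]
    return E
```

On a U-shaped component the two vertical branches first receive labels 1 and 2, and the equivalence recorded at the bottom-right pixel merges them during relabeling. Final labels are not renumbered consecutively in this sketch.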
4 / 29
Figure: predecessor pixels (p1, p2) and current pixel (px) in the image of pixels; predecessor labels (e1, e2) and current label (ex) in the image of labels; equivalence table T[e] and equivalence trees; binary image of pixels, image of labels, image of labels after relabeling; pattern generators: stair, concavity.
5 / 29
◮ parallel algorithm for CPU
◮ based on RLE (Run Length Encoding) to speed up processing and saves
◮ current fastest CCA algorithm on CPU
◮ direct CCL algorithm for GPU
◮ direct CCL algorithm for GPU (2D and 3D versions)
◮ based on the analysis of local pixels configuration to avoid unnecessary and
6 / 29
◮ 4 can't be equal to both 1 and 2
◮ ⇒ 4 has to point to 1, and 2 has to point to 1 too...
7 / 29
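The constraint on labels 1, 2 and 4 (a label holds a single parent, so recording 4 ≡ 1 and then 4 ≡ 2 must end with both 4 and 2 pointing to 1) is what a root-to-root union provides. The sketch below is a sequential Python illustration under stated assumptions: `T` is a parent table whose roots satisfy `T[e] == e`, and `atomic_min` is a hypothetical stand-in for a GPU atomic minimum such as CUDA's `atomicMin`. The retry loop is one common lock-free formulation, not necessarily the one used in the talk.

```python
def atomic_min(T, idx, value):
    """Sequential stand-in (assumption) for a GPU atomic min; returns the old value."""
    old = T[idx]
    if value < old:
        T[idx] = value
    return old


def merge(T, l1, l2):
    """Union of the trees containing l1 and l2; the smaller root wins."""
    # Climb both labels to their current roots.
    while l1 != l2 and l1 != T[l1]:
        l1 = T[l1]
    while l1 != l2 and l2 != T[l2]:
        l2 = T[l2]
    while l1 != l2:
        if l1 < l2:
            l1, l2 = l2, l1  # keep l1 as the larger root
        old = atomic_min(T, l1, l2)
        if old == l1:
            l1 = l2   # the link was written: both labels now share a root
        else:
            l1 = old  # another writer got there first: retry from its parent
```

Starting from `T = [0, 1, 2, 3, 4]`, calling `merge(T, 4, 1)` and then `merge(T, 4, 2)` leaves `T[4] == 1` and `T[2] == 1`, matching the outcome described in the bullets.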
8 / 29
9 / 29
10 / 29
(a) Initialization
11 / 29
(b) Strip labeling
(c) Strip labeled
12 / 29
(d) Border merging
(e) Border merged
13 / 29
◮ $S = x_1 - x_0$,
◮ $S_x = \frac{1}{2}\left[x_1(x_1 - 1) - x_0(x_0 - 1)\right]$
FindRoot
Relabeling
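The closed-form segment features can be checked against a direct summation. The sketch below assumes a horizontal segment covering columns x0 to x1 − 1 (x1 exclusive), which is the reading consistent with S = x1 − x0; the function names are hypothetical.

```python
def segment_features(x0, x1):
    """Closed-form features of a segment covering columns x0 .. x1 - 1."""
    S = x1 - x0                                 # number of pixels
    Sx = (x1 * (x1 - 1) - x0 * (x0 - 1)) // 2   # sum of the x coordinates
    return S, Sx


def segment_features_brute_force(x0, x1):
    """Same features computed pixel by pixel, for comparison."""
    xs = range(x0, x1)
    return len(xs), sum(xs)
```

For the segment covering columns 3 to 6, both functions return `(4, 18)`, since 3 + 4 + 5 + 6 = 18. The closed form lets one thread accumulate a whole segment with a constant number of operations instead of one update per pixel.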
14 / 29
15 / 29
16 / 29
Figure: pixels, start_distance, end_distance
17 / 29
Figure: tx, pixels_y, shared memory, pixels_{y-1}
18 / 29
Figure: tx, pixels_y, shared memory, pixels_{y-1}
19 / 29
20 / 29
◮ 8C: density = 45%
◮ 4C: density = 64%
21 / 29
(a) Playne (b) Cabaret (c) HA432(ccl) (d) HA464(ccl)
22 / 29
(a) Playne (b) Cabaret (a) HA432 (b) HA464
23 / 29
(a) HA432 Jetson AGX (b) HA464 Jetson AGX (c) HA432 V100 (d) HA464 V100
24 / 29
(a) HA432 Jetson AGX (b) HA464 Jetson AGX (c) HA432 V100 (d) HA464 V100
25 / 29
(a) HA464(cca) V100 (S, Sx, Sy, xmin, ymin, xmax, ymax) (b) HA464(cca) V100 (S, Sx, Sy, Sx2, Sy2)
26 / 29
◮ CCL 2× faster than State-of-the-Art
◮ CCA new on GPU
◮ CCL throughput: 4.6 Gpix/s on AGX (1920×1080: 2208 fps) or
◮ CCA throughput: 3.4 Gpix/s on AGX (1920×1080: 1615 fps)
◮ Design 8-connectivity versions on GPUs
◮ Improve CCA by implementing different merging strategies
27 / 29
28 / 29
29 / 29