Real-time covariance tracking algorithm for embedded systems
- A. Roméro, L. Lacassagne, A. Zahraee, M. Gouiffès
www.lri.fr/~lacas LRI , LIMSI & IEF - University Paris-Sud
Context & goal
Covariance matching techniques are ...
representation
(Intel mobile proc, ARM Cortex A9)
Outline: algorithm presentation, algorithm optimization, benchmarks, video examples
manipulations)
[Figure: from image to covariance: features F (nF = 7 pixel features), product of features P (nP = 28), integral of features IF, integral of products IP]
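The chain of features, products and integrals is what makes the covariance of any rectangular RoI cheap to evaluate. The following is a minimal pure-Python sketch, assuming the usual integral-image formulation of region covariance, C = (S_P - S_F S_F^T / n) / (n - 1), where S_F and S_P are region sums read from IF and IP with four lookups each. The function names, the zero-border convention and the list-of-lists layout are illustrative assumptions, not the authors' code.

```python
def integral(img):
    """Integral image with a zero border: I[i+1][j+1] = sum of img[0..i][0..j]."""
    h, w = len(img), len(img[0])
    I = [[0.0] * (w + 1) for _ in range(h + 1)]
    for i in range(h):
        for j in range(w):
            I[i + 1][j + 1] = img[i][j] + I[i][j + 1] + I[i + 1][j] - I[i][j]
    return I


def region_sum(I, i0, j0, i1, j1):
    """Sum over rows i0..i1-1 and columns j0..j1-1, using four lookups."""
    return I[i1][j1] - I[i0][j1] - I[i1][j0] + I[i0][j0]


def region_covariance(F, i0, j0, i1, j1):
    """F[k][i][j] is the k-th feature plane (SoA). Returns an nF x nF matrix.

    The region must contain at least two pixels (division by n - 1).
    """
    nF = len(F)
    h, w = len(F[0]), len(F[0][0])
    IF = [integral(F[k]) for k in range(nF)]
    # Only the upper triangle (k1 <= k2) is needed: nP = nF(nF+1)/2 products.
    IP = {}
    for k1 in range(nF):
        for k2 in range(k1, nF):
            P = [[F[k1][i][j] * F[k2][i][j] for j in range(w)] for i in range(h)]
            IP[(k1, k2)] = integral(P)
    n = (i1 - i0) * (j1 - j0)
    s = [region_sum(IF[k], i0, j0, i1, j1) for k in range(nF)]
    C = [[0.0] * nF for _ in range(nF)]
    for k1 in range(nF):
        for k2 in range(k1, nF):
            sp = region_sum(IP[(k1, k2)], i0, j0, i1, j1)
            C[k1][k2] = C[k2][k1] = (sp - s[k1] * s[k2] / n) / (n - 1)
    return C
```

For clarity the sketch rebuilds the integral images on every call; a tracker would build IF and IP once per frame and then evaluate many candidate RoIs against them.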
(robust to distractors)
[Figure: RoI #0 (x0,y0) and RoI #1 (x1,y1) at timeframe (t); RoI #0' (x0',y0'), RoI #1' (x1',y1') and RoI #2' (x2',y2') at timeframe (t+1)]
about 80% of total time
[Plot, SandyBridge4: cpp as a function of size; curves AOS, SOA+OpenMP 4, AOS+SIMD]
[Plot, Penryn4: cpp as a function of size; curves AOS, SOA+OpenMP 4, AOS+SIMD]
cores, Cortex-A15: 2 cores only)
[Plot, Nehalem8: cpp as a function of size; curves AOS, AOS+SIMD, SOA+OpenMP 8]
(arith/mem)
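AoS scalar version with 3 loops

instructions              MUL   ADD           LOAD         STORE      AI
product of features       nP    -             2nP          nP
integral of features      -     3nF           4nF          nF
integral of products      -     3nP           4nP          nP
total                     nP    3(nP + nF)    6nP + 4nF    2nP + nF
total (arith, mem)        2nF^2 + 5nF         4nF^2 + 9nF
total with nF = 7         133                 259                     0.5

AoS SIMD (with nF = 7) version with 3 loops

instructions              MUL   ADD   LOAD   STORE   AI
product of features       7     -     2      7
integral of products      -     21    28     7
integral of features      -     6     2      2
total (arith, mem)        49          54             0.9
total Neon (+ 48 PERM)    82          54             1.5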
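The scalar totals for the 3-loop version can be re-derived from nF alone. The short sketch below is only a consistency check of the per-pixel counts listed above, under the assumption that nP = nF(nF+1)/2 (the upper triangle of the feature products, which gives nP = 28 for nF = 7). The function name is hypothetical.

```python
def counts_three_loops(nF):
    """Per-pixel instruction counts, scalar AoS version with 3 loops.

    Returns (arithmetic instructions, memory instructions, arithmetic intensity).
    """
    nP = nF * (nF + 1) // 2            # number of distinct feature products
    mul = nP                           # product of features
    add = 3 * nF + 3 * nP              # integral of features + integral of products
    load = 2 * nP + 4 * nF + 4 * nP    # products, then the two integral passes
    store = nP + nF + nP
    arith = mul + add
    mem = load + store
    return arith, mem, arith / mem


arith, mem, ai = counts_three_loops(7)
```

For nF = 7 this gives 133 arithmetic and 259 memory instructions, an intensity of about 0.5, which agrees with the closed forms 2nF^2 + 5nF and 4nF^2 + 9nF.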
, without access (load & store) to Products
[Figure: image, features F, integral of features IF, integral of products IP]
1 loop nest computation
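A minimal sketch of what a single fused loop nest could look like, in pure Python for readability. It assumes the AoS layout F[i][j*nF + k] and a running row sum per feature and per product, which is one way to reach 2 additions and 1 store per integral element. The products are kept in temporaries and never written to a P array. The function name and the row-sum formulation are assumptions, not the authors' implementation.

```python
def fused_integrals(F, h, w, nF):
    """Compute IF and IP (AoS layout) in one pass over the image.

    F[i][j*nF + k] is feature k of pixel (i, j). No product image is stored.
    """
    nP = nF * (nF + 1) // 2
    IF = [[0.0] * (w * nF) for _ in range(h)]
    IP = [[0.0] * (w * nP) for _ in range(h)]
    for i in range(h):
        row_f = [0.0] * nF    # running sums along the current row
        row_p = [0.0] * nP
        for j in range(w):
            f = F[i][j * nF:(j + 1) * nF]     # features of the current pixel
            for k in range(nF):
                row_f[k] += f[k]
                up = IF[i - 1][j * nF + k] if i > 0 else 0.0
                IF[i][j * nF + k] = row_f[k] + up
            k = 0
            for k1 in range(nF):
                for k2 in range(k1, nF):
                    row_p[k] += f[k1] * f[k2]  # product used once, never stored
                    up = IP[i - 1][j * nP + k] if i > 0 else 0.0
                    IP[i][j * nP + k] = row_p[k] + up
                    k += 1
    return IF, IP
```

The last pixel of IF and IP then holds the sum of every feature and every product over the whole image, which is a convenient sanity check on small inputs.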
AoS scalar version + Loop Fusion

instructions              MUL   ADD           LOAD         STORE      AI
integral of features      -     2nF           2nF          nF
integral of products      nP    2nP           nP           nP
total                     nP    2(nP + nF)    nP + 2nF     nP + nF
total (arith, mem)        1.5nF^2 + 3.5nF     nF^2 + 4nF
total with nF = 7         98                  77                      1.3

AoS SIMD (with nF = 7) version + Loop Fusion

instructions              MUL   ADD   LOAD   STORE   AI
integral of features      -     4     4      2
integral of products      7     14    7      7
total (arith, mem)        40          20             2.0
total Neon (+ 48 PERM)    73          20             3.7
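The scalar totals after loop fusion can be checked the same way as for the 3-loop version. The sketch below recomputes them from nF, again assuming nP = nF(nF+1)/2. The function name is hypothetical.

```python
def counts_loop_fusion(nF):
    """Per-pixel instruction counts, scalar AoS version with loop fusion.

    Returns (arithmetic instructions, memory instructions, arithmetic intensity).
    """
    nP = nF * (nF + 1) // 2
    mul = nP                  # products, computed on the fly
    add = 2 * nF + 2 * nP     # integral of features + integral of products
    load = 2 * nF + nP
    store = nF + nP           # the product image is no longer stored
    arith = mul + add
    mem = load + store
    return arith, mem, arith / mem


arith, mem, ai = counts_loop_fusion(7)
```

For nF = 7 the result is 98 arithmetic and 77 memory instructions, an intensity of about 1.3. The arithmetic count drops from 133 to 98, while memory traffic drops from 259 to 77, which is where most of the gain in intensity comes from.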
[Plot, U9300: cpp as a function of size; curves SOA, AOS, AOS+SIMD, AOS+T+SIMD, AOS+T]
[Plot, Cortex-A9: cpp as a function of size; curves SOA, AOS, AOS+SIMD, AOS+T+SIMD, AOS+T]
cpp & execution time (ms) for Intel Penryn ULV U9300

sequence                      panda       panda        pedxing     pedxing
size                          312 x 233   312 x 233    640 x 480   640 x 480
algorithm version             SoA         AoS+SIMD+T   SoA         AoS+SIMD+T
features computation (cpp)    128         150          128         150
kernel computation (cpp)      599         87           618         91
tracking (cpp)                23          23           11          11
total (cpp)                   738         248          769         264
kernel / total ratio          81 %        35 %         80 %        34 %
total speedup                 x 2.9       x 2.9        x 2.9       x 2.9
1-core execution time (ms)    45          15           197         68
2-core execution time (ms)    36          9            158         38
cpp & execution time (ms) for ARM Cortex-A9

sequence                      panda       panda        pedxing     pedxing
size                          312 x 233   312 x 233    640 x 480   640 x 480
algorithm version             SoA         AoS+SIMD+T   SoA         AoS+SIMD+T
features computation (cpp)    461         461          486         486
kernel computation (cpp)      1491        395          1600        415
tracking (cpp)                96          96           19          19
total (cpp)                   2048        952          2106        921
kernel / total ratio          73 %        42 %         73 %        45 %
total speedup                 x 2.2       x 2.2        x 2.2       x 2.2
1-core execution time (ms)    149         69           647         283
2-core execution time (ms)    108         36           492         149
Algorithm 1: product of features - SoA version
k ← 0
foreach k1 ∈ [0..nF − 1] do
    foreach k2 ∈ [k1..nF − 1] do
        foreach i ∈ [0..h − 1] do
            foreach j ∈ [0..w − 1] do
                P[k][i][j] ← F[k1][i][j] × F[k2][i][j]
        k ← k + 1
Algorithm 1: product of features - AoS version
foreach i ∈ [0..h − 1] do
    foreach j ∈ [0..w − 1] do
        k ← 0
        foreach k1 ∈ [0..nF − 1] do
            foreach k2 ∈ [k1..nF − 1] do
                P[i][j × nP + k] ← F[i][j × nF + k1] × F[i][j × nF + k2]
                k ← k + 1
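A runnable pure-Python sketch of the AoS product of features, assuming each image row is a flat list in which pixel j stores its nF features at indices j*nF .. j*nF + nF - 1, and that each product multiplies features k1 and k2 of the same pixel with k1 <= k2. The function name is hypothetical.

```python
def product_of_features_aos(F, h, w, nF):
    """Return P with P[i][j*nP + k] = F[i][j*nF + k1] * F[i][j*nF + k2].

    k enumerates the pairs (k1, k2) with k1 <= k2 in row-major order,
    so each pixel carries nP = nF(nF+1)/2 products.
    """
    nP = nF * (nF + 1) // 2
    P = [[0.0] * (w * nP) for _ in range(h)]
    for i in range(h):
        for j in range(w):
            k = 0
            for k1 in range(nF):
                for k2 in range(k1, nF):
                    P[i][j * nP + k] = F[i][j * nF + k1] * F[i][j * nF + k2]
                    k += 1
    return P


# One row, two pixels, three features per pixel.
P = product_of_features_aos([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]], 1, 2, 3)
```

In this layout all the operands of one pixel are contiguous in memory, so the two inner loops touch a single short span of F and a single short span of P.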
[Figure: SoA versus AoS for nF = 4 features (F0, F1, F2, F3)]
Algorithm 1: integral image - SoA version, n ∈ {nF , nP }
foreach k ∈ [0..n − 1] do
    foreach i ∈ [0..h − 1] do
        foreach j ∈ [0..w − 1] do
            I[k][i][j] ← I[k][i][j] + I[k][i][j − 1] + I[k][i − 1][j] − I[k][i − 1][j − 1]
Algorithm 1: integral image - AoS version, n ∈ {nF , nP }
foreach i ∈ [0..h − 1] do
    foreach j ∈ [0..w − 1] do
        foreach k ∈ [0..n − 1] do
            I[i][j × n + k] ← I[i][j × n + k] + I[i][(j − 1) × n + k] + I[i − 1][j × n + k] − I[i − 1][(j − 1) × n + k]
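The pseudocode reads I at i − 1 and j − 1, which fall outside the image on the first row and first column. A runnable pure-Python sketch of the AoS version follows, assuming out-of-image neighbours count as zero (the slides do not state the border convention). The function name is hypothetical.

```python
def integral_image_aos(I, h, w, n):
    """In-place integral image, AoS layout: I[i][j*n + k] is channel k of pixel (i, j).

    The explicit guards matter in Python, where a negative index would
    silently wrap around to the end of the row instead of failing.
    """
    for i in range(h):
        for j in range(w):
            for k in range(n):
                left = I[i][(j - 1) * n + k] if j > 0 else 0.0
                up = I[i - 1][j * n + k] if i > 0 else 0.0
                diag = I[i - 1][(j - 1) * n + k] if i > 0 and j > 0 else 0.0
                I[i][j * n + k] += left + up - diag
    return I


# Two interleaved channels on a 2 x 2 image.
I = integral_image_aos([[1.0, 10.0, 2.0, 20.0], [3.0, 30.0, 4.0, 40.0]], 2, 2, 2)
```

The same routine serves both IF (n = nF) and IP (n = nP), since only the number of interleaved channels changes.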