Maximizing Face Detection Performance
Paulius Micikevicius Developer Technology Engineer, NVIDIA
GTC 2015
(C) 2015 NVIDIA 1
Outline
– Very brief review of cascaded classifiers
– Parallelization choices
– Reducing the amount of work
– Use weak classifiers to detect object presence at each position; I'll call a position a candidate
– Some number of stages are cascaded
– Each stage contains a number of weak classifiers
– Classifiers are trained on small windows (~20x20 pixels)
– To detect objects of different sizes, do one of:
» Scale the image
» Scale the features
– Early stages quickly reject negatives, reducing work for subsequent stages
– False positives are OK; false negatives are not
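The early-rejection control flow can be sketched as follows (a sketch with illustrative names; the per-stage scoring of weak classifiers is abstracted behind a callback, not taken from the talk's actual code):

```cpp
#include <vector>
#include <functional>

// Sketch: cascade evaluation for one candidate window. Each stage sums
// its weak classifiers; if the sum falls below the stage threshold, the
// candidate is rejected immediately. Only candidates that pass every
// stage are reported as detections.
struct Stage {
    std::function<float(int /*candidate*/)> score;  // sum of weak classifiers
    float threshold;
};

bool isFace(int candidate, const std::vector<Stage>& cascade) {
    for (const Stage& s : cascade) {
        if (s.score(candidate) < s.threshold)
            return false;  // early reject: most candidates exit here
    }
    return true;  // passed all stages
}
```

Because most candidates fail in the first few stages, average cost per candidate is far below the cost of evaluating all stages.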
– 2094 Haar-like features; each weak classifier is a 2-feature tree
– 1747 features contain 2 rectangles; 347 features have 3 rectangles
1920x1080 input image, 5 scales:
– 50-200 pixel faces
– 1.25x scaling factor
Process each candidate:
– Start with 478K candidates
– 254 pass all stages
– GPU kernel code = CPU loop body
– Threads are grouped into threadblocks
– Instructions are executed per warp (SIMT)
– If at least one thread in a warp needs to evaluate a stage, all 32 threads take the same time
– If at least one thread in a threadblock needs to continue evaluating, the resources of all other threads in that block are not released
– Use atomics for updating the index
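The atomic index update can be sketched as follows (a sketch: on the GPU this would be `atomicAdd` on a queue counter; here `std::atomic` stands in, and the function name is illustrative). Surviving candidates are compacted into a dense work queue so later stages launch only as many threads as there is work:

```cpp
#include <atomic>
#include <vector>

// Sketch: compact the candidates that survived a stage into a dense
// work queue. Each surviving candidate atomically claims the next slot
// in the output queue (the GPU equivalent is atomicAdd on the index).
std::vector<int> compact_survivors(const std::vector<bool>& survived) {
    std::atomic<int> queueSize{0};
    std::vector<int> queue(survived.size());
    for (int c = 0; c < (int)survived.size(); ++c) {
        if (survived[c]) {
            int slot = queueSize.fetch_add(1);  // atomicAdd equivalent
            queue[slot] = c;                    // store candidate id
        }
    }
    queue.resize(queueSize.load());
    return queue;
}
```

Compaction keeps warps full: instead of idle threads holding resources for rejected candidates, every thread in the next launch maps to a live candidate.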
[Figure: timeline of scales 0-4 processed sequentially, concurrently, and concurrently in 2 segments; times shown: 9.7 ms, 8.3 ms, and 6.1 ms (2 segments)]
Segments: 5 scales of stages 0-3, then 5 scales of stages 4-7, then stages 8-20
– K40: 7.2 ms, TK1: 78.9 ms
– Very rough rule of thumb: fewer than 512 threads per SM
– Need enough candidates to saturate the GPU
– Shown for scale-0 (58-pixel face)
– A candidate window is masked out (black) if fewer than 50% of its pixels are skin-toned
– 76% of candidates were rejected at this scale
– Classify each input pixel as skin-toned or not
– Compute an integral image of the pixel classes
– Use the integral image to reject candidates when creating the initial work queue for detection
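The steps above can be sketched as follows (a sketch: the 0/1 skin mask and the 50% threshold follow the slides; structure and names are illustrative):

```cpp
#include <vector>

// Sketch: integral image over a 0/1 skin mask, used to reject candidate
// windows whose skin-toned fraction is below 50% in O(1) per window.
struct Integral {
    int w, h;
    std::vector<long long> s;  // (w+1) x (h+1); first row/column are 0
    Integral(const std::vector<int>& skin, int w, int h)
        : w(w), h(h), s((w + 1) * (h + 1), 0) {
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                s[(y+1)*(w+1) + (x+1)] = skin[y*w + x]
                    + s[y*(w+1) + (x+1)] + s[(y+1)*(w+1) + x] - s[y*(w+1) + x];
    }
    // Sum of the mask over window [x, x+win) x [y, y+win): 4 lookups.
    long long sum(int x, int y, int win) const {
        return s[(y+win)*(w+1) + (x+win)] - s[y*(w+1) + (x+win)]
             - s[(y+win)*(w+1) + x]       + s[y*(w+1) + x];
    }
};

// Keep the candidate only if at least 50% of its pixels are skin-toned.
bool acceptCandidate(const Integral& ii, int x, int y, int win) {
    return 2 * ii.sum(x, y, win) >= (long long)win * win;
}
```

Rejection happens while building the initial work queue, so masked-out candidates never consume cascade work at all.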
– TK1: 134.5 ms → 34.9 ms (~4x speedup, as expected) → 34.4 ms → 34.0 ms
– K40: 9.8 ms → 4.3 ms (~2.3x speedup; less than the expected 4x indicates the GPU is idling) → 2.8 ms → 2.1 ms
– Thus less sensitive to bandwidth for this application than TK1
– Thus able to achieve a higher hit-rate than TK1, reducing traffic to DRAM
– Breaking stages into segments: up to 1.3x
– Concurrent processing of scales: 1.2-2x
– Downscaling to the first scale first: 1.3-2.3x
– Masking (ROI): ~3x
– These adjust only the image or launch parameters
– Together they improved cascade time
– Potential further speedup of ~1.5x
– Window-0: the first window (top-left corner, for example)
– Pointer math per vertex: a 64-bit multiply-add
» A dependent sequence of 2+ instructions on the GPU
– 6, 8, or 9 pointers: 48-72 bytes per feature
– 12-18 or more instructions per window
– Several choices when storing each rectangle:
– Windows are usually 20x20 to 32x32 in size
– So coordinates could be stored in as few as 4 B (4 chars), vs 16 B if storing ints
» 4 chars would require bit-extraction instructions
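The 4-byte encoding can be sketched as follows (a sketch with illustrative names: since windows are at most ~32x32, each coordinate fits in 8 bits):

```cpp
#include <cstdint>

// Sketch: pack a rectangle's (x, y, w, h) into one 4-byte word instead
// of 4 ints (16 B). Unpacking costs a few bit-extraction instructions
// (these shifts/masks map to bit-field extract, e.g. BFE, on GPUs).
uint32_t packRect(uint8_t x, uint8_t y, uint8_t w, uint8_t h) {
    return (uint32_t)x
         | (uint32_t)y << 8
         | (uint32_t)w << 16
         | (uint32_t)h << 24;
}

void unpackRect(uint32_t r, int& x, int& y, int& w, int& h) {
    x =  r        & 0xFF;
    y = (r >> 8)  & 0xFF;
    w = (r >> 16) & 0xFF;
    h = (r >> 24) & 0xFF;
}
```

The trade-off is 4x less feature-table bandwidth in exchange for a few extra integer instructions per rectangle.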
– Option 1: have multiple copies of the feature in memory
– Option 2: add a level of indirection, mapping weak classifiers to unique features (the approach implemented in OpenCV)
– Indirection adds 100s-1000s of cycles of latency and adds to bandwidth pressure
– Typically only a very small percentage of features are replicated