A Comprehensive Analysis and Parallelization of an Image Retrieval Algorithm
Zhenman Fang, Donglei Yang, Weihua Zhang, Haibo Chen, Binyu Zang
Parallel Processing Institute, Fudan University
Exploding Multimedia Data
Figure: Cisco VNI Global Consumer Internet Traffic Forecast [Report on American Consumers 09]
E.g., medical imagery, video recommendation
Figure: software performance scaling with # of cores vs. single-core performance (from Michael McCool's (Intel) many-core slides)
Such workloads need to be studied both to optimize them on current architectures and to design future architectures for them.
Image retrieval approaches:
- Global feature based (color features, texture features): ~60% precision [Wan 08]
- Local feature based (shape context, SIFT features, SURF features): accurate but time consuming; SURF is robust & appealing, insensitive to scale and rotation transformation [Mikolajczyk 05, Bauer 07]
SURF algorithm flow: Input Image -> Integral Image -> Scale Space Analysis -> Interest Point Localization -> Orientation Assignment -> Descriptor Vector Construction -> Features
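The Integral Image stage is what makes SURF's box filters cheap: once each entry holds the sum of all pixels above and to its left, any rectangular sum costs four lookups. A minimal NumPy sketch (function names are illustrative, not taken from OpenSURF):

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img over all pixels above and to the left, inclusive
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    # Sum of img[r0:r1+1, c0:c1+1] from four integral-image lookups
    s = ii[r1, c1]
    if r0 > 0:
        s -= ii[r0 - 1, c1]
    if c0 > 0:
        s -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        s += ii[r0 - 1, c0 - 1]
    return s
```

Because every box sum is constant-time, the filter cost is independent of filter size, which is why the scale space can use ever larger filters at no extra expense.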
Hessian matrix: [Dxx Dxy; Dxy Dyy], built from box-filter responses. Scale space analysis makes detection insensitive to scale transformation.
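With the three box-filter responses in hand, the approximated Hessian determinant is Dxx*Dyy - (0.9*Dxy)^2, where the 0.9 weight compensates for the box-filter approximation of the Gaussian second derivatives (per the original SURF paper). A one-line sketch:

```python
def hessian_response(dxx, dyy, dxy, w=0.9):
    # Approximated Hessian determinant; w ~ 0.9 compensates for using
    # box filters in place of Gaussian second-derivative filters
    return dxx * dyy - (w * dxy) ** 2
```

Candidate interest points are the local maxima of this response across space and scale.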
Orientation assignment makes the features insensitive to rotation transformation.
Profiling the sequential algorithm (experiment environment: OpenSURF, 48 input images, 16-core server with 32GB memory):
- Detection, 27% of time: Integral Image 1%, Scale Space Analysis 24%, Interest Point Localization 2%
- Description, 73% of time: Orientation Assignment 20%, Descriptor Vector Construction 53%
Figure: # of Interest Points vs. Block ID, with average line
Parallelization schemes explored:
- Combination of Pipeline and Task Parallelism
- Scale-level Parallelism
- Block-level Parallelism
- Combination of SIMD and Other Parallelism
Combination of Pipeline and Task Parallelism: Detection writes interest points to a buffer; Description reads interest points from the buffer. The pipeline can run as 2 stages (Detection, Description) or as 3 stages (Detection, Orientation Assignment, Descriptor Vector Construction).
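The buffer between the stages is a classic bounded producer/consumer queue. A minimal Python sketch of the 2-stage version, with toy stand-ins for the detector and descriptor (the real stages are of course the SURF computations):

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker

def find_interest_points(img):
    # Toy stand-in detector: treat every even pixel value as an interest point
    return [v for v in img if v % 2 == 0]

def build_descriptor(point):
    # Toy stand-in descriptor
    return point * 10

def detect(images, buf):
    # Detection stage (producer): writes interest points to the buffer
    for img in images:
        for p in find_interest_points(img):
            buf.put(p)
    buf.put(SENTINEL)

def describe(buf, out):
    # Description stage (consumer): reads interest points from the buffer
    while True:
        p = buf.get()
        if p is SENTINEL:
            break
        out.append(build_descriptor(p))

def run_pipeline(images):
    buf = queue.Queue(maxsize=64)  # bounded buffer between the two stages
    out = []
    t1 = threading.Thread(target=detect, args=(images, buf))
    t2 = threading.Thread(target=describe, args=(buf, out))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return out
```

The bounded queue keeps the two stages running concurrently while limiting how far Detection can run ahead of Description.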
Figure: speedup of the 2-stage vs. 3-stage pipeline (speedup axis: 1 to 4)
Scale-level Parallelism: after the Integral Image stage, each scale of Scale Space Analysis and Interest Point Localization is computed concurrently; in Description, each group of interest points is described concurrently.
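Since each scale's response map depends only on the integral image, the scales are independent tasks. A sketch using a thread pool; analyze_scale is a stand-in, though the 9 + 6*scale filter sizes (9, 15, 21, 27) match the usual first-octave SURF progression:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_scale(img, scale):
    # Stand-in for building one scale's response map; filter size
    # grows with the scale index as in SURF's first octave
    filter_size = 9 + 6 * scale
    return [v * filter_size for v in img]

def scale_space(img, n_scales=4):
    # Scale-level parallelism: every scale's response map is independent,
    # so all scales are computed concurrently on the thread pool
    with ThreadPoolExecutor() as ex:
        return list(ex.map(lambda s: analyze_scale(img, s), range(n_scales)))
```

A limitation visible in the talk's results: the number of scales bounds the available parallelism, so this scheme alone cannot use many cores.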
Figure: speedup of scale-level parallelism on 4-core, 8-core, 12-core, and 16-core configurations (speedup axis: 2 to 8)
Block-level Parallelism: the input image is split into image blocks, and each block runs Detection and Description independently. Two variants:
- Block-Sync: synchronize between neighbor blocks
- BlockPar: use additional computation to avoid sync between neighbor blocks
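One way to realize the "additional computation" variant is to extend each block with a halo of rows from its neighbors, so interest points near a boundary can be detected and described without touching another thread's block. A sketch of the partitioning (the names and the row-wise split are assumptions for illustration, not taken from the paper):

```python
def split_with_halo(rows, n_blocks, halo):
    # Partition row indices [0, rows) into n_blocks contiguous blocks,
    # extending each by `halo` rows on both sides (clipped at the edges).
    # The overlapping rows are computed redundantly by two blocks, which
    # trades extra work for the removal of sync between neighbor blocks.
    base = rows // n_blocks
    blocks = []
    for b in range(n_blocks):
        r0 = b * base
        r1 = rows if b == n_blocks - 1 else (b + 1) * base
        blocks.append((max(0, r0 - halo), min(rows, r1 + halo)))
    return blocks
```

Each worker then processes one (start, end) row range end to end; the halo width would be set by the largest filter size used.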
Figure: speedup of BlockPar vs. Block-Sync on 4-core, 8-core, 12-core, and 16-core configurations (speedup axis: 2 to 10)
Figure: speedup of Pipeline vs. ScalePar vs. BlockPar on 4-core, 8-core, 12-core, and 16-core configurations
Figure: speedup of Pipeline, ScalePar, and BlockPar without SIMD, and with SIMD combined (Pipeline+SIMD, ScalePar+SIMD, BlockPar+SIMD), on 4-core, 8-core, 12-core, and 16-core configurations (speedup axis: 2 to 12)
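The SIMD combination applies data parallelism inside each stage; for example, the box-filter sums of a whole response map can be evaluated a batch at a time. A NumPy stand-in for what an SSE version would do four floats per instruction (the zero-padding trick is an implementation detail of this sketch):

```python
import numpy as np

def box_sums(ii, h, w):
    # Vectorized box filter: the sum of every h x w window, for all window
    # positions at once, from the integral image. One array expression plays
    # the role of the SIMD inner loop of an SSE implementation.
    ii = np.pad(ii, ((1, 0), (1, 0)))  # prepend a zero row/col so the
                                       # four-lookup formula is uniform
    return ii[h:, w:] - ii[:-h, w:] - ii[h:, :-w] + ii[:-h, :-w]
```

This batch form composes naturally with the task-level schemes: each thread can run the vectorized filter over its own scale or image block.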
Figure: speedup of BlockPar, Block+Pipe, BlockPar+SIMD, and Block+Pipe+SIMD on 4-core, 8-core, 12-core, and 16-core configurations (speedup axis: 2 to 14)
Figure: speedup of P-SURF vs. our BlockPar vs. our Block+Pipe on 4-core, 8-core, 12-core, and 16-core configurations (speedup axis: 2 to 12)
* Can be downloaded from http://www.mis.tu-darmstadt.de/surf
GPU results:
- Sequential SURF on CPU: Initialization takes 1% of execution time, SURF 99%.
- CPU + GPU with BlockPar (SURF offloaded to the GPU): Initialization grows to 47% of execution time, SURF 53%.
- CPU + GPU Pipeline: pipeline Initialization with the SURF stages …
Figure: GPU speedups: CUDA SURF 30X, our BlockPar 30X, our Block+Pipe 46X (speedup axis: 10 to 50)
Conclusion: we analyzed and exploited Pipeline Parallelism, Task Parallelism at scale-level & block-level, Data Parallelism (i.e., SIMD), and their combinations, achieving a 13X speedup on a 16-core CPU (1.84X faster than P-SURF) and a 46X speedup on a GPU (1.53X faster than CUDA SURF).
Future work: apply the parallel analysis to speech recognition; design energy-efficient architectures, such as …