SLIDE 1

A Comprehensive Analysis and Parallelization of an Image Retrieval Algorithm

Zhenman Fang, Donglei Yang, Weihua Zhang, Haibo Chen, Binyu Zang

Parallel Processing Institute, Fudan University

SLIDE 2

Exploding Multimedia Data

Cisco VNI Global Consumer Internet Traffic Forecast

Figure from [Report on American Consumers 09]

SLIDE 3

Multimedia Retrieval Applications

Important to retrieve useful data

 E.g., medical imagery, video recommendation

Data-intensive and compute-intensive: significant challenges for real-time retrieval

SLIDE 4

Multi-core Era Is Coming

[Chart: # of cores, software performance, and single-core performance over time]

Figure from Michael McCool's (Intel) many-core slides

SLIDE 5

New Opportunities

The multi-core era needs parallelism: we need a comprehensive study of the parallelism characteristics of multimedia retrieval

 To optimize it on current architectures
 To design future architectures for it

SLIDE 6

Image Retrieval

Image retrieval is also the core of video retrieval

SLIDE 7

Image Retrieval

feature extraction

SLIDE 8

Image Retrieval

feature extraction + feature match

SLIDE 9

Image Retrieval (cont.)

Two classes of algorithms

 Global feature based: ~60% precision [Wan 08]

 Color features
 Texture features

 Local feature based: accurate but time consuming; robust and appealing, insensitive to scale and rotation transformations [Mikolajczyk 05, Bauer 07]

 Shape context
 SIFT features
 SURF features

SLIDE 10

SURF Overview

[Pipeline diagram] Input Image → Integral Image → Detection (Scale Space Analysis → Interest Point Localization) → Description (Orientation Assignment → Descriptor Vector Construction) → Features

SLIDE 11

Integral Image

[Diagram] Input image g(x,y) → integral image I(x,y) = Σ_{x'≤x, y'≤y} g(x',y')
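SURF builds the integral image first because it turns every later box-filter evaluation into four table lookups. A minimal NumPy sketch of the idea (function names are illustrative, not from OpenSURF):

```python
import numpy as np

def integral_image(g):
    """I[y, x] = sum of g over the rectangle (0,0)..(y,x), inclusive."""
    return g.cumsum(axis=0).cumsum(axis=1)

def box_sum(I, y0, x0, y1, x1):
    """Sum of the original image over rows y0..y1 and cols x0..x1
    (inclusive) in O(1), using four lookups into the integral image I."""
    s = I[y1, x1]
    if y0 > 0:
        s -= I[y0 - 1, x1]
    if x0 > 0:
        s -= I[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        s += I[y0 - 1, x0 - 1]
    return s

I = integral_image(np.arange(16, dtype=float).reshape(4, 4))
```

Any box filter used in the later scale-space stage reduces to a handful of `box_sum` calls, with cost independent of the filter size.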

SLIDE 12

Scale Space Analysis

Hessian matrix for (x,y) [Bay 06]:

    H = | Dxx  Dxy |
        | Dxy  Dyy |

Det(x,y) = Dxx*Dyy - 0.81*Dxy*Dxy

Insensitive to scale transformation
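The slide's determinant can be evaluated over whole response maps at once; the 0.81 factor is 0.9^2, the weight [Bay 06] uses to balance the box-filter approximation of the Gaussian second derivatives. A small sketch (the name `hessian_response` is illustrative):

```python
import numpy as np

def hessian_response(Dxx, Dyy, Dxy):
    """Approximate det(H) = Dxx*Dyy - (0.9*Dxy)^2 at every pixel.

    Dxx, Dyy, Dxy may be scalars or NumPy arrays of box-filter responses,
    so one call scores an entire scale layer."""
    return Dxx * Dyy - 0.81 * Dxy * Dxy

r = hessian_response(np.full((2, 2), 2.0), np.full((2, 2), 3.0), np.zeros((2, 2)))
```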

SLIDE 13

Interest Point Localization

Interest point: the point with the local maximum det value
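Localization keeps the points whose det value is a local maximum above a threshold. Full SURF compares each candidate against its 26 neighbours across three adjacent scales; this simplified single-scale sketch checks only the 8 in-plane neighbours:

```python
import numpy as np

def local_maxima(det, threshold):
    """Return (y, x) positions where det exceeds `threshold` and is the
    maximum of its 3x3 neighbourhood (simplified, single-scale)."""
    points = []
    h, w = det.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = det[y, x]
            if v > threshold and v == det[y - 1:y + 2, x - 1:x + 2].max():
                points.append((y, x))
    return points
```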

SLIDE 14

Orientation Assignment

Based on Haar wavelet responses [Bay 06]

Insensitive to rotation transformation

SLIDE 15

Descriptor Vector Construction

SLIDE 16

Descriptor Vector Construction

A 4-dimensional vector is calculated per subregion based on Haar wavelet responses, giving a 64-dimensional feature vector
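The 64 dimensions come from a 4x4 grid of subregions, each contributing (Σdx, Σdy, Σ|dx|, Σ|dy|) of the Haar responses. A sketch that assumes the responses are already sampled on a 20x20 grid around the interest point (sampling, rotation to the dominant orientation, and Gaussian weighting omitted):

```python
import numpy as np

def descriptor(dx, dy):
    """Build the 64-D SURF-style vector from 20x20 Haar response grids:
    4x4 subregions x 4 statistics each, then normalize to unit length."""
    vec = []
    for by in range(4):
        for bx in range(4):
            sx = dx[by * 5:(by + 1) * 5, bx * 5:(bx + 1) * 5]
            sy = dy[by * 5:(by + 1) * 5, bx * 5:(bx + 1) * 5]
            vec += [sx.sum(), sy.sum(), np.abs(sx).sum(), np.abs(sy).sum()]
    v = np.asarray(vec)
    return v / np.linalg.norm(v)   # unit norm -> contrast invariance
```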

SLIDE 17

Execution Profile of SURF

Detection: 27% of time (Integral Image 1%, Scale Space Analysis 24%, Interest Point Localization 2%)
Description: 73% of time (Orientation Assignment 20%, Descriptor Vector Construction 53%)

Experiment environment: program OpenSURF; input 48 images; hardware 16-core server, 32 GB memory

SLIDE 18

Interest Points Distribution

Imbalanced distribution across images/blocks

[Chart: # of interest points per block ID, with average line]

SLIDE 19

Parallel Analysis

Pipeline Parallelism

Task Parallelism

 Scale-level Parallelism
 Block-level Parallelism

Combination of Different Parallelism

 Combination of SIMD and Other Parallelism
 Combination of Pipeline and Task Parallelism

SLIDE 20

2-stage Pipeline

[Diagram] Detection → Description, connected by a buffer

Detection writes interest points to the buffer; Description reads interest points from the buffer
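The buffer between the stages is a classic bounded producer/consumer queue. A runnable Python sketch with placeholder kernels standing in for the real Detection and Description work:

```python
import queue
import threading

def find_interest_points(img):
    # placeholder detector: pretend "image" img (an int) yields img points
    return list(range(img))

def build_descriptor(img, pt):
    # placeholder for Descriptor Vector Construction
    return (img, pt)

def detection(images, buf):
    """Stage 1: write interest points into the shared buffer."""
    for img in images:
        for pt in find_interest_points(img):
            buf.put((img, pt))
    buf.put(None)                     # sentinel: detection finished

def description(buf, out):
    """Stage 2: read interest points from the buffer, build descriptors."""
    while True:
        item = buf.get()
        if item is None:
            break
        out.append(build_descriptor(*item))

buf = queue.Queue(maxsize=64)         # bounded buffer between the stages
out = []
t1 = threading.Thread(target=detection, args=([2, 3], buf))
t2 = threading.Thread(target=description, args=(buf, out))
t1.start(); t2.start(); t1.join(); t2.join()
```

The bounded queue also throttles the producer, which matters because the next slides show the two stages are far from balanced.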

slide-21
SLIDE 21

Further divide Description into two stages

Orientation Assignment Descriptor Vector Construction

Detection

Descriptor Vector Construction

… …

3-stage Pipeline

21

SLIDE 22

Results of Pipeline Parallelism

Pipeline parallelism does not scale

[Chart: speedup of the 2-stage vs. 3-stage pipeline]

SLIDE 23

Parallel Analysis

Pipeline Parallelism

Task Parallelism

 Scale-level Parallelism
 Block-level Parallelism

Combination of Different Parallelism

 Combination of SIMD with Other Parallelism
 Combination of Task and Pipeline Parallelism

SLIDE 24

Scale-level Parallelism

[Diagram] Integral Image → Scale Space Analysis → Interest Point Localization → Description

Each scale is computed concurrently; each group of interest points is described concurrently
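Scale-level task parallelism is essentially a parallel map over the filter sizes of an octave (9, 15, 21, 27 are the standard first-octave SURF sizes from [Bay 06]; `analyze_scale` is a placeholder for the real box-filter pass):

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_scale(size):
    # placeholder: the real kernel runs size x size box filters over the
    # integral image and returns that scale's det(H) response map
    return ("det_map", size)

first_octave = [9, 15, 21, 27]
with ThreadPoolExecutor(max_workers=len(first_octave)) as pool:
    # each scale is computed concurrently; map preserves input order
    responses = list(pool.map(analyze_scale, first_octave))
```

Because the per-scale cost differs, this decomposition inherits exactly the imbalance the next slides measure.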

SLIDE 25

Results of Scale-level Parallelism

Does not scale beyond 12 cores

[Chart: speedup on 4/8/12/16 cores]

SLIDE 26

Results of Scale-level Parallelism

Does not scale beyond 12 cores

[Chart: speedup on 4/8/12/16 cores]

 Imbalanced computation
 Non-trivial communication overhead

SLIDE 27

Parallel Analysis

Pipeline Parallelism

Task Parallelism

 Scale-level Parallelism
 Block-level Parallelism

Combination of Different Parallelism

 Combination of SIMD with Other Parallelism
 Combination of Task and Pipeline Parallelism

slide-28
SLIDE 28

Detection Detection Detection

28

Description

Input Image

Image Block Image Block Image Block Description Description

Block-level Parallelism

Sync between neighbor blocks

slide-29
SLIDE 29

Detection Detection Detection

29

Description

Input Image

Image Block Image Block Image Block Description Description

Block-level Parallelism

Sync between neighbor blocks

Block-level parallelism with synchronization (Block-Sync)

slide-30
SLIDE 30

Detection Detection Detection

30

Description

Input Image

Image Block Image Block Image Block Description Description

Block-level Parallelism

Use additional computation to avoid sync

slide-31
SLIDE 31

Detection Detection Detection

31

Description

Input Image

Image Block Image Block Image Block Description Description

Block-level Parallelism Block-level parallelism without synchronization (BlockPar)

Use additional computation to avoid sync
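One way to trade computation for synchronization is to pad each block with a halo of redundant border rows, so filters near a block edge never need a neighbour's results. A partitioning sketch (the halo width would in practice be derived from the largest filter size; the numbers here are illustrative):

```python
import numpy as np

def split_with_halo(img, nblocks, halo):
    """Split an image into `nblocks` horizontal strips, each extended by
    `halo` rows into its neighbours. The overlap is computed redundantly
    by both workers, which removes all inter-block synchronization."""
    h = img.shape[0]
    step = h // nblocks
    blocks = []
    for i in range(nblocks):
        y0 = max(0, i * step - halo)
        y1 = h if i == nblocks - 1 else min(h, (i + 1) * step + halo)
        blocks.append((y0, y1, img[y0:y1]))
    return blocks

blocks = split_with_halo(np.zeros((100, 10)), nblocks=4, halo=8)
```

Each (y0, y1, strip) tuple can then be detected and described by an independent worker with no communication until the final merge of interest points.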

SLIDE 32

Results of Block-level Parallelism

BlockPar scales well

[Chart: speedup of BlockPar vs. Block-Sync on 4/8/12/16 cores]

SLIDE 33

Results of Block-level Parallelism

BlockPar scales well

[Chart: speedup of BlockPar vs. Block-Sync on 4/8/12/16 cores]

Communication overhead between cores is non-trivial, and it can be reduced by additional computation

SLIDE 34

Comparison of Each Parallelism

Block-level parallelism is the most efficient

[Chart: speedup of Pipeline, ScalePar, BlockPar on 4/8/12/16 cores]

SLIDE 35

Parallel Analysis

Pipeline Parallelism

Task Parallelism

 Scale-level Parallelism
 Block-level Parallelism

Combination of Different Parallelism

 Combination of SIMD with Other Parallelism
 Combination of Task and Pipeline Parallelism

SLIDE 36

Combination of SIMD with Other Parallelism

Use ICC to generate SIMD instructions

[Chart: speedup of Pipeline, ScalePar, BlockPar on 4/8/12/16 cores]

SLIDE 37

Combination of SIMD with Other Parallelism

Use ICC to generate SIMD instructions

[Chart: speedup of Pipeline+SIMD, ScalePar+SIMD, BlockPar+SIMD on 4/8/12/16 cores]

11% speedup

SLIDE 38

Parallel Analysis

Pipeline Parallelism

Task Parallelism

 Scale-level Parallelism
 Block-level Parallelism

Combination of Different Parallelism

 Combination of SIMD with Other Parallelism
 Combination of Task and Pipeline Parallelism

SLIDE 39

Combination of Task & Pipeline

BlockPar + Pipeline is the most efficient: 13X speedup

[Chart: speedup of BlockPar, Block+Pipe, BlockPar+SIMD, Block+Pipe+SIMD on 4/8/12/16 cores]

SLIDE 40

Combination of Task & Pipeline

BlockPar + Pipeline is the most efficient: 13X speedup

[Chart: speedup of BlockPar, Block+Pipe, BlockPar+SIMD, Block+Pipe+SIMD on 4/8/12/16 cores]

 Less computation
 Better locality

SLIDE 41

Comparison to Prior Work

Compared to P-SURF [Zhang 10] on a multi-core CPU

[Chart: speedup of P-SURF, our BlockPar, our Block+Pipe on 4/8/12/16 cores]

SLIDE 42

Comparison to Prior Work

Compared to P-SURF [Zhang 10] on a multi-core CPU

[Chart: speedup of P-SURF, our BlockPar, our Block+Pipe on 4/8/12/16 cores]

 1.84X speedup over P-SURF
 Non-trivial communication overhead

SLIDE 43

Comparison to Prior Work (cont.)

Our implementation on GPGPU

Sequential SURF on CPU: Initialization 1%, SURF 99% of execution time

Initialization on CPU, BlockPar on GPGPU; sequential CPU + GPU execution

SLIDE 44

Comparison to Prior Work (cont.)

Our implementation on GPGPU

Initialization on CPU, BlockPar on GPGPU; sequential CPU + GPU execution

After BlockPar on GPGPU: Initialization 47%, SURF 53% of execution time

SLIDE 45

Comparison to Prior Work (cont.)

Our implementation on GPGPU

After BlockPar on GPGPU: Initialization 47%, SURF 53% of execution time

Initialization on CPU and BlockPar on GPGPU overlapped in a CPU + GPU pipeline

SLIDE 46

Comparison to Prior Work (cont.)

Compared to CUDA SURF* on GPGPU (Nvidia GTX 260)

[Chart: speedup of CUDA SURF 30X, our BlockPar 30X, our Block+Pipe 46X]

* Can be downloaded from http://www.mis.tu-darmstadt.de/surf

SLIDE 47

Comparison to Prior Work (cont.)

Compared to CUDA SURF* on GPGPU (Nvidia GTX 260)

[Chart: speedup of CUDA SURF 30X, our BlockPar 30X, our Block+Pipe 46X]

 1.53X speedup over CUDA SURF
 CPU+GPU pipeline not exploited by CUDA SURF

* Can be downloaded from http://www.mis.tu-darmstadt.de/surf

SLIDE 48

Summary

First parallelism analysis of image retrieval

 Pipeline parallelism
 Task parallelism at scale level & block level
 Data parallelism, i.e., SIMD
 Their combinations

BlockPar + Pipeline is the most efficient

 13X speedup on a 16-core CPU, 1.84X faster than P-SURF
 46X speedup on GPU, 1.53X faster than CUDA SURF

SLIDE 49

Conclusion and Future Work

Additional computation to avoid synchronization

Cooperation between CPU & GPU

Future work

 Apply the parallel analysis to speech recognition
 Design energy-efficient architectures, such as FPGAs, to accelerate multimedia retrieval

SLIDE 50

Parallel Processing Institute
http://ppi.fudan.edu.cn

Thanks