SLIDE 1

A Comprehensive Analysis and Parallelization of an Image Retrieval Algorithm

Zhenman Fang, Donglei Yang, Weihua Zhang, Haibo Chen, Binyu Zang

Parallel Processing Institute, Fudan University

SLIDE 2

Exploding Multimedia Data

Cisco VNI Global Consumer Internet Traffic Forecast

Figure from [Report on American Consumers 09]

SLIDE 3

Multimedia Retrieval Applications

Important to retrieve useful data

 E.g., medical imagery, video recommendation

Data-intensive and compute-intensive: significant challenges for real-time retrieval

SLIDE 4

Multi-core Era Is Coming

[Chart: # of cores, software performance, and single-core performance over time]

Figure from Michael McCool's (Intel) many-core slides

SLIDE 5

New Opportunities

The multi-core era needs parallelism: we need a comprehensive study of the parallelism characteristics of multimedia retrieval

 To optimize it on current architectures
 To design future architectures for it

SLIDE 6

Image Retrieval

Image retrieval is also the core of video retrieval

SLIDE 7

Image Retrieval

feature extraction

SLIDE 8

Image Retrieval

feature extraction + feature match

SLIDE 9

Image Retrieval (cont.)

Two classes of algorithms

 Global feature based: ~60% precision [Wan 08]

 Color features
 Texture features

 Local feature based: accurate but time consuming; robust and appealing, insensitive to scale and rotation transformations [Mikolajczyk 05, Bauer 07]

 Shape context
 SIFT features
 SURF features

SLIDE 10

SURF Overview

[Pipeline diagram] Input Image → Integral Image → Detection (Scale Space Analysis → Interest Point Localization) → Description (Orientation Assignment → Descriptor Vector Construction) → Features

SLIDE 11

Integral Image

[Diagram] Input image g(x,y) → integral image I(x,y) = Σ_{x'≤x, y'≤y} g(x',y')
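SURF builds the integral image first because it turns every later box-filter evaluation into four table lookups. A minimal NumPy sketch of the idea (function names are illustrative, not from OpenSURF):

```python
import numpy as np

def integral_image(g):
    """I[y, x] = sum of g over the rectangle (0,0)..(y,x), inclusive."""
    return g.cumsum(axis=0).cumsum(axis=1)

def box_sum(I, y0, x0, y1, x1):
    """Sum of the original image over rows y0..y1 and cols x0..x1
    (inclusive) in O(1), using four lookups into the integral image I."""
    s = I[y1, x1]
    if y0 > 0:
        s -= I[y0 - 1, x1]
    if x0 > 0:
        s -= I[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        s += I[y0 - 1, x0 - 1]
    return s

I = integral_image(np.arange(16, dtype=float).reshape(4, 4))
```

Any box filter used in the later scale-space stage reduces to a handful of `box_sum` calls, with cost independent of the filter size.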

SLIDE 12

Scale Space Analysis

Hessian matrix for (x,y) [Bay 06]:

    H = | Dxx  Dxy |
        | Dxy  Dyy |

Det(x,y) = Dxx*Dyy - 0.81*Dxy*Dxy

Insensitive to scale transformation
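The slide's determinant can be evaluated over whole response maps at once; the 0.81 factor is 0.9^2, the weight [Bay 06] uses to balance the box-filter approximation of the Gaussian second derivatives. A small sketch (the name `hessian_response` is illustrative):

```python
import numpy as np

def hessian_response(Dxx, Dyy, Dxy):
    """Approximate det(H) = Dxx*Dyy - (0.9*Dxy)^2 at every pixel.

    Dxx, Dyy, Dxy may be scalars or NumPy arrays of box-filter responses,
    so one call scores an entire scale layer."""
    return Dxx * Dyy - 0.81 * Dxy * Dxy

r = hessian_response(np.full((2, 2), 2.0), np.full((2, 2), 3.0), np.zeros((2, 2)))
```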

SLIDE 13

Interest Point Localization

Interest point: the point with the local maximum det value
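Localization keeps the points whose det value is a local maximum above a threshold. Full SURF compares each candidate against its 26 neighbours across three adjacent scales; this simplified single-scale sketch checks only the 8 in-plane neighbours:

```python
import numpy as np

def local_maxima(det, threshold):
    """Return (y, x) positions where det exceeds `threshold` and is the
    maximum of its 3x3 neighbourhood (simplified, single-scale)."""
    points = []
    h, w = det.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = det[y, x]
            if v > threshold and v == det[y - 1:y + 2, x - 1:x + 2].max():
                points.append((y, x))
    return points
```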

SLIDE 14

Orientation Assignment

Based on Haar wavelet responses [Bay 06]

Insensitive to rotation transformation

SLIDE 15

Descriptor Vector Construction

SLIDE 16

Descriptor Vector Construction

A 4-dimensional vector is calculated per subregion based on Haar wavelet responses, giving a 64-dimensional feature vector
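The 64 dimensions come from a 4x4 grid of subregions, each contributing (Σdx, Σdy, Σ|dx|, Σ|dy|) of the Haar responses. A sketch that assumes the responses are already sampled on a 20x20 grid around the interest point (sampling, rotation to the dominant orientation, and Gaussian weighting omitted):

```python
import numpy as np

def descriptor(dx, dy):
    """Build the 64-D SURF-style vector from 20x20 Haar response grids:
    4x4 subregions x 4 statistics each, then normalize to unit length."""
    vec = []
    for by in range(4):
        for bx in range(4):
            sx = dx[by * 5:(by + 1) * 5, bx * 5:(bx + 1) * 5]
            sy = dy[by * 5:(by + 1) * 5, bx * 5:(bx + 1) * 5]
            vec += [sx.sum(), sy.sum(), np.abs(sx).sum(), np.abs(sy).sum()]
    v = np.asarray(vec)
    return v / np.linalg.norm(v)   # unit norm -> contrast invariance
```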

SLIDE 17

Execution Profile of SURF

Detection: 27% of time (Integral Image 1%, Scale Space Analysis 24%, Interest Point Localization 2%)
Description: 73% of time (Orientation Assignment 20%, Descriptor Vector Construction 53%)

Experiment environment: program OpenSURF; input 48 images; hardware 16-core server, 32 GB memory

SLIDE 18

Interest Points Distribution

Imbalanced distribution across images/blocks

[Chart: # of interest points per block ID, with average line]

SLIDE 19

Parallel Analysis

Pipeline Parallelism

Task Parallelism

 Scale-level Parallelism
 Block-level Parallelism

Combination of Different Parallelism

 Combination of SIMD and Other Parallelism
 Combination of Pipeline and Task Parallelism

SLIDE 20

2-stage Pipeline

[Diagram] Detection → Description, connected by a buffer

Detection writes interest points to the buffer; Description reads interest points from the buffer
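The buffer between the stages is a classic bounded producer/consumer queue. A runnable Python sketch with placeholder kernels standing in for the real Detection and Description work:

```python
import queue
import threading

def find_interest_points(img):
    # placeholder detector: pretend "image" img (an int) yields img points
    return list(range(img))

def build_descriptor(img, pt):
    # placeholder for Descriptor Vector Construction
    return (img, pt)

def detection(images, buf):
    """Stage 1: write interest points into the shared buffer."""
    for img in images:
        for pt in find_interest_points(img):
            buf.put((img, pt))
    buf.put(None)                     # sentinel: detection finished

def description(buf, out):
    """Stage 2: read interest points from the buffer, build descriptors."""
    while True:
        item = buf.get()
        if item is None:
            break
        out.append(build_descriptor(*item))

buf = queue.Queue(maxsize=64)         # bounded buffer between the stages
out = []
t1 = threading.Thread(target=detection, args=([2, 3], buf))
t2 = threading.Thread(target=description, args=(buf, out))
t1.start(); t2.start(); t1.join(); t2.join()
```

The bounded queue also throttles the producer, which matters because the next slides show the two stages are far from balanced.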

slide-21
SLIDE 21

Further divide Description into two stages

Orientation Assignment Descriptor Vector Construction

Detection

Descriptor Vector Construction

… …

3-stage Pipeline

21

SLIDE 22

Results of Pipeline Parallelism

Pipeline parallelism does not scale

[Chart: speedup of the 2-stage vs. 3-stage pipeline]

SLIDE 23

Parallel Analysis

Pipeline Parallelism

Task Parallelism

 Scale-level Parallelism
 Block-level Parallelism

Combination of Different Parallelism

 Combination of SIMD with Other Parallelism
 Combination of Task and Pipeline Parallelism

SLIDE 24

Scale-level Parallelism

[Diagram] Integral Image → Scale Space Analysis → Interest Point Localization → Description

Each scale is computed concurrently; each group of interest points is described concurrently
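Scale-level task parallelism is essentially a parallel map over the filter sizes of an octave (9, 15, 21, 27 are the standard first-octave SURF sizes from [Bay 06]; `analyze_scale` is a placeholder for the real box-filter pass):

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_scale(size):
    # placeholder: the real kernel runs size x size box filters over the
    # integral image and returns that scale's det(H) response map
    return ("det_map", size)

first_octave = [9, 15, 21, 27]
with ThreadPoolExecutor(max_workers=len(first_octave)) as pool:
    # each scale is computed concurrently; map preserves input order
    responses = list(pool.map(analyze_scale, first_octave))
```

Because the per-scale cost differs, this decomposition inherits exactly the imbalance the next slides measure.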

SLIDE 25

Results of Scale-level Parallelism

Does not scale beyond 12 cores

[Chart: speedup on 4/8/12/16 cores]

SLIDE 26

Results of Scale-level Parallelism

Does not scale beyond 12 cores

[Chart: speedup on 4/8/12/16 cores]

 Imbalanced computation
 Non-trivial communication overhead

SLIDE 27

Parallel Analysis

Pipeline Parallelism

Task Parallelism

 Scale-level Parallelism
 Block-level Parallelism

Combination of Different Parallelism

 Combination of SIMD with Other Parallelism
 Combination of Task and Pipeline Parallelism

slide-28
SLIDE 28

Detection Detection Detection

28

Description

Input Image

Image Block Image Block Image Block Description Description

Block-level Parallelism

Sync between neighbor blocks

slide-29
SLIDE 29

Detection Detection Detection

29

Description

Input Image

Image Block Image Block Image Block Description Description

Block-level Parallelism

Sync between neighbor blocks

Block-level parallelism with synchronization (Block-Sync)

slide-30
SLIDE 30

Detection Detection Detection

30

Description

Input Image

Image Block Image Block Image Block Description Description

Block-level Parallelism

Use additional computation to avoid sync

slide-31
SLIDE 31

Detection Detection Detection

31

Description

Input Image

Image Block Image Block Image Block Description Description

Block-level Parallelism Block-level parallelism without synchronization (BlockPar)

Use additional computation to avoid sync
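One way to trade computation for synchronization is to pad each block with a halo of redundant border rows, so filters near a block edge never need a neighbour's results. A partitioning sketch (the halo width would in practice be derived from the largest filter size; the numbers here are illustrative):

```python
import numpy as np

def split_with_halo(img, nblocks, halo):
    """Split an image into `nblocks` horizontal strips, each extended by
    `halo` rows into its neighbours. The overlap is computed redundantly
    by both workers, which removes all inter-block synchronization."""
    h = img.shape[0]
    step = h // nblocks
    blocks = []
    for i in range(nblocks):
        y0 = max(0, i * step - halo)
        y1 = h if i == nblocks - 1 else min(h, (i + 1) * step + halo)
        blocks.append((y0, y1, img[y0:y1]))
    return blocks

blocks = split_with_halo(np.zeros((100, 10)), nblocks=4, halo=8)
```

Each (y0, y1, strip) tuple can then be detected and described by an independent worker with no communication until the final merge of interest points.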

SLIDE 32

Results of Block-level Parallelism

BlockPar scales well

[Chart: speedup of BlockPar vs. Block-Sync on 4/8/12/16 cores]

SLIDE 33

Results of Block-level Parallelism

BlockPar scales well

[Chart: speedup of BlockPar vs. Block-Sync on 4/8/12/16 cores]

Communication overhead between cores is non-trivial, and it can be reduced by additional computation

SLIDE 34

Comparison of Each Parallelism

Block-level parallelism is the most efficient

[Chart: speedup of Pipeline, ScalePar, BlockPar on 4/8/12/16 cores]

SLIDE 35

Parallel Analysis

Pipeline Parallelism

Task Parallelism

 Scale-level Parallelism
 Block-level Parallelism

Combination of Different Parallelism

 Combination of SIMD with Other Parallelism
 Combination of Task and Pipeline Parallelism

SLIDE 36

Combination of SIMD with Other Parallelism

Use ICC to generate SIMD instructions

[Chart: speedup of Pipeline, ScalePar, BlockPar on 4/8/12/16 cores]

SLIDE 37

Combination of SIMD with Other Parallelism

Use ICC to generate SIMD instructions

[Chart: speedup of Pipeline+SIMD, ScalePar+SIMD, BlockPar+SIMD on 4/8/12/16 cores]

11% speedup

SLIDE 38

Parallel Analysis

Pipeline Parallelism

Task Parallelism

 Scale-level Parallelism
 Block-level Parallelism

Combination of Different Parallelism

 Combination of SIMD with Other Parallelism
 Combination of Task and Pipeline Parallelism

SLIDE 39

Combination of Task & Pipeline

BlockPar + Pipeline is the most efficient: 13X speedup

[Chart: speedup of BlockPar, Block+Pipe, BlockPar+SIMD, Block+Pipe+SIMD on 4/8/12/16 cores]

SLIDE 40

Combination of Task & Pipeline

BlockPar + Pipeline is the most efficient: 13X speedup

[Chart: speedup of BlockPar, Block+Pipe, BlockPar+SIMD, Block+Pipe+SIMD on 4/8/12/16 cores]

 Less computation
 Better locality

SLIDE 41

Comparison to Prior Work

Compared to P-SURF [Zhang 10] on a multi-core CPU

[Chart: speedup of P-SURF, our BlockPar, our Block+Pipe on 4/8/12/16 cores]

SLIDE 42

Comparison to Prior Work

Compared to P-SURF [Zhang 10] on a multi-core CPU

[Chart: speedup of P-SURF, our BlockPar, our Block+Pipe on 4/8/12/16 cores]

 1.84X speedup over P-SURF
 Non-trivial communication overhead

SLIDE 43

Comparison to Prior Work (cont.)

Our implementation on GPGPU

Sequential SURF on CPU: Initialization 1%, SURF 99% of execution time

Initialization on CPU, BlockPar on GPGPU; sequential CPU + GPU execution

SLIDE 44

Comparison to Prior Work (cont.)

Our implementation on GPGPU

Initialization on CPU, BlockPar on GPGPU; sequential CPU + GPU execution

After BlockPar on GPGPU: Initialization 47%, SURF 53% of execution time

SLIDE 45

Comparison to Prior Work (cont.)

Our implementation on GPGPU

After BlockPar on GPGPU: Initialization 47%, SURF 53% of execution time

Initialization on CPU and BlockPar on GPGPU overlapped in a CPU + GPU pipeline

SLIDE 46

Comparison to Prior Work (cont.)

Compared to CUDA SURF* on GPGPU (Nvidia GTX 260)

[Chart: speedup of CUDA SURF 30X, our BlockPar 30X, our Block+Pipe 46X]

* Can be downloaded from http://www.mis.tu-darmstadt.de/surf

SLIDE 47

Comparison to Prior Work (cont.)

Compared to CUDA SURF* on GPGPU (Nvidia GTX 260)

[Chart: speedup of CUDA SURF 30X, our BlockPar 30X, our Block+Pipe 46X]

 1.53X speedup over CUDA SURF
 CPU+GPU pipeline not exploited by CUDA SURF

* Can be downloaded from http://www.mis.tu-darmstadt.de/surf

SLIDE 48

Summary

First parallelism analysis of image retrieval

 Pipeline parallelism
 Task parallelism at scale level & block level
 Data parallelism, i.e., SIMD
 Their combinations

BlockPar + Pipeline is the most efficient

 13X speedup on a 16-core CPU, 1.84X faster than P-SURF
 46X speedup on GPU, 1.53X faster than CUDA SURF

SLIDE 49

Conclusion and Future Work

Additional computation to avoid synchronization

Cooperation between CPU & GPU

Future work

 Apply the parallel analysis to speech recognition
 Design energy-efficient architectures, such as FPGAs, to accelerate multimedia retrieval

SLIDE 50

Parallel Processing Institute
http://ppi.fudan.edu.cn

Thanks