Parallel HMMs: Parallel Implementation of Hidden Markov Models for Wireless Applications (PowerPoint PPT Presentation)


SLIDE 1

Parallel HMMs

Parallel Implementation of Hidden Markov Models for Wireless Applications

Authors Shawn Hymel (Wireless@VT, Virginia Tech) Ihsan Akbar (Harris Corporation) Jeffrey Reed (Wireless@VT, Virginia Tech)

SLIDE 2

Agenda

  • Overview of GPGPU
  • Overview of HMMs
  • Parallelization
  • Results
  • Applications
  • Why Is This Useful?

SLIDE 3

General‐Purpose Processing on GPUs

  • CUDA‐specific
  • Important Terms:

– Threads
– Blocks
– Grid

SLIDE 4

CUDA Code Flow

SLIDE 5

Hidden Markov Model

[Diagram: two-state weather HMM. Start leads to hidden states Rainy and Sunny; each state emits the observations Walk, Shop, and Clean.]

λ = (A, B, π)

A = | 0.7  0.3 |
    | 0.4  0.6 |

B = | 0.1  0.4  0.5 |
    | 0.6  0.3  0.1 |

π = ( 0.6  0.4 )

(Rows of A and B are indexed by state, Rainy then Sunny; columns of B by observation, Walk, Shop, Clean.)

SLIDE 6

HMM Canonical Problems

  • Evaluation: P(O|λ)
    – Forward Algorithm
    – Backward Algorithm
  • Find the most likely state sequence
    – Viterbi Algorithm
  • Training (maximize P(O|λ))
    – Baum‐Welch Algorithm

SLIDE 7

Forward Algorithm

Given a model and an observation sequence, calculate P(O|λ)

  – T = number of observations
  – N = number of states
  – M = number of possible symbols

 

Initialization:
  α_1(i) = π_i · b_i(O_1),  i = 1, 2, ..., N

Induction:
  α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) · a_{ij} ] · b_j(O_{t+1}),  1 ≤ t ≤ T−1,  1 ≤ j ≤ N

Termination:
  P(O|λ) = Σ_{i=1}^{N} α_T(i)

SLIDE 8

Example of Parallelization


     

The induction step α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) · a_{ij} ] · b_j(O_{t+1}) decomposes into two stages:

For all j, a matrix multiplication:
  A (N×N) · α_t (N×1) = α' (N×1)

For all j, an element‐by‐element multiplication:
  α_{t+1} = α' ∘ b(O_{t+1})

We can perform this step in parallel! O(TN²) → O(T log N)

SLIDE 9

Computational Complexity

Algorithm              Serial    Parallel
Forward Algorithm      O(TN²)    O(T log N)
Viterbi Algorithm      O(TN²)    O(T log N)
Baum‐Welch Algorithm   O(TN²)    O(T log N)

(Baum‐Welch re-estimation of the emission matrix also involves an O(TMN) term.)

SLIDE 10

Test Procedures

Component          Specification
CPU                Intel Core 2 Duo U7300 @ 1.30 GHz
GPU                NVIDIA GeForce GT 335M
GPU Core Speed     450 MHz
GPU Shader Speed   1080 MHz
GPU Memory Speed   1066 MHz
CUDA Cores         72

  • Time execution of each algorithm (C vs. CUDA)
    – Vary states
    – Vary symbols
    – Vary sequence length
  • Calculate total energy consumption (C vs. CUDA)
    – PowerTOP software

Test Hardware

SLIDE 11

Speed Results

Forward Algorithm
Number of States   CPU Runtime (s)   GPU Runtime (s)   Speed Increase
4                  0.001             0.1531            0.007×
40                 0.04              0.1393            0.287×
400                4.2816            0.2379            17.99×
4000               534.2028          2.9495            181.12×

Viterbi Algorithm
4                  0.0033            0.1605            0.021×
40                 0.0436            0.1801            0.242×
400                4.2684            1.6595            2.57×
4000               534.5543          116.2531          4.60×

Baum‐Welch Algorithm
4                  0.0021            0.4142            0.005×
40                 0.1946            0.4299            0.453×
400                17.6719           0.7502            23.56×
4000               1834.672          28.1271           65.23×

SLIDE 12

Energy Consumption

Algorithm   Power (W), C   Power (W), CUDA   States to Break Even
Forward     18.5           26.5              ~100
Viterbi     18.5           29.1              ~120
BWA         18.3           26.1              ~70

[Chart: Energy Consumption for Forward Algorithm. Energy consumed (kWh) vs. number of states; CPU vs. GPU]

SLIDE 13

Applications

  • Pattern Recognition
    – Spectrum Sensing
    – Signal Classification
    – Specific Emitter Identification
    – Geolocation
  • Modeling
    – Channel Fading
    – Call Drop Prediction

SLIDE 14

Why Is This Useful?

  • Evolution of GPUs and multi‐core processors
    – Smart phones, tablets, SDR
    – Co‐processor
  • Utilize existing hardware for HMM applications
    – Large number of states
    – 2D/3D HMMs
  • Uses in other fields (speech recognition, computer vision)
  • Extrapolation to other algorithms (pattern recognition)

SLIDE 15

Questions?

Contact Information
Email: hymelsr@vt.edu
Blog: http://sgmustadio.wordpress.com/
Code: http://code.google.com/p/hmm‐cuda/

Other Good Resources
cuHMM: http://code.google.com/p/chmm/
MATLAB: http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
HTK: http://htk.eng.cam.ac.uk/

SLIDE 16

Supporting Slide: Reductions

Reducing an array to a single value (e.g. a sum) goes from O(N) serially to O(log N) in parallel.

MATLAB example:
>> sum(A)

C implementation:
sum = 0;
for (i = 0; i < length; i++) {
    sum = sum + A[i];
}

SLIDE 17

Supporting Slide: Timing Results (Forward)

[Chart: Execution Time for Forward Algorithm. Execution time (s) vs. number of states; CPU vs. GPU]
[Chart: Execution Time for Forward Alg. on GPU. Execution time (s) vs. number of states]
[Chart: Execution Time for Forward Algorithm. Execution time (s) vs. number of observations; CPU vs. GPU]

SLIDE 18

Supporting Slide: Timing Results (Viterbi)

[Chart: Execution Time for Viterbi Algorithm. Execution time (s) vs. number of states; CPU vs. GPU]
[Chart: Execution Time for Viterbi on GPU. Execution time (s) vs. number of states]
[Chart: Execution Time for Viterbi Algorithm. Execution time (s) vs. number of symbols; CPU vs. GPU]

SLIDE 19

Supporting Slide: Timing Results (BWA)

[Chart: Execution Time for Baum‐Welch Algorithm. Execution time (s) vs. number of states; CPU vs. GPU]
[Chart: Execution Time for BWA on GPU. Execution time (s) vs. number of states]
[Chart: Execution Time for Baum‐Welch Algorithm. Execution time (s) vs. number of symbols; CPU vs. GPU]