Parallel HMMs: Parallel Implementation of Hidden Markov Models for Wireless Applications (PowerPoint presentation)
  1. Parallel HMMs: Parallel Implementation of Hidden Markov Models for Wireless Applications
     Authors: Shawn Hymel (Wireless@VT, Virginia Tech), Ihsan Akbar (Harris Corporation), Jeffrey Reed (Wireless@VT, Virginia Tech)

  2. Agenda
     • Overview of GPGPU
     • Overview of HMMs
     • Parallelization
     • Results
     • Applications
     • Why Is This Useful?

  3. General-Purpose Processing on GPUs
     • CUDA-specific
     • Important terms:
       – Threads
       – Blocks
       – Grid

  4. CUDA Code Flow (diagram)

  5. Hidden Markov Model
     λ = (A, B, π)
     States: Rainy, Sunny. Observations: Walk, Shop, Clean.
     Initialization: π = [0.6, 0.4]
     A = | 0.7  0.3 |
         | 0.4  0.6 |
     B = | 0.1  0.4  0.5 |
         | 0.6  0.3  0.1 |
     (Diagram: two-state weather HMM with start probabilities π, transition matrix A, and emission matrix B.)

  6. HMM Canonical Problems
     • Evaluation, P(O|λ):
       – Forward Algorithm
       – Backward Algorithm
     • Finding the most likely state sequence:
       – Viterbi Algorithm
     • Training (maximizing P(O|λ)):
       – Baum-Welch Algorithm

  7. Forward Algorithm
     Given a model and an observation sequence, calculate P(O|λ).
     T = number of observations, N = number of states, M = number of possible symbols
     Initialization: α_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N
     Induction: α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_{ij} ] b_j(O_{t+1})
     Termination: P(O|λ) = Σ_{i=1}^{N} α_T(i)

  8. Example of Parallelization
     α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_{ij} ] b_j(O_{t+1})
     The induction step is a matrix multiplication (the 1×N vector α_t times the N×N matrix A, giving α') followed by, for all j, an element-by-element multiplication of α' with b(O_{t+1}) to give α_{t+1}.
     We can perform this step in parallel! O(TN²) → O(T log N)

  9. Computational Complexity
     Algorithm              Serial              Parallel
     Forward Algorithm      O(TN²)              O(T log N)
     Viterbi Algorithm      O(TN²)              O(T log N)
     Baum-Welch Algorithm   O(TN²) or O(TMN)    O(T log N)

  10. Test Procedures
      • Time execution of each algorithm (C vs. CUDA):
        – Vary states
        – Vary symbols
        – Vary sequence length
      • Calculate total energy consumption (C vs. CUDA):
        – PowerTOP software

      Test Hardware
      Component          Specification
      CPU                Intel Core 2 Duo U7300 @ 1.30 GHz
      GPU                NVIDIA GeForce GT 335M
      GPU Core Speed     450 MHz
      GPU Shader Speed   1080 MHz
      GPU Memory Speed   1066 MHz
      CUDA Cores         72

  11. Speed Results
      Forward Algorithm
      Number of States   CPU Runtime (s)   GPU Runtime (s)   Speed Increase
      4                  0.001             0.1531            0.007x
      40                 0.04              0.1393            0.287x
      400                4.2816            0.2379            17.99x
      4000               534.2028          2.9495            181.12x

      Viterbi Algorithm
      4                  0.0033            0.1605            0.021x
      40                 0.0436            0.1801            0.242x
      400                4.2684            1.6595            2.57x
      4000               534.5543          116.2531          4.60x

      Baum-Welch Algorithm
      4                  0.0021            0.4142            0.005x
      40                 0.1946            0.4299            0.453x
      400                17.6719           0.7502            23.56x
      4000               1834.672          28.1271           65.23x

  12. Energy Consumption
      Algorithm   C Power (W)   CUDA Power (W)   States to Break Even
      Forward     18.5          26.5             ~100
      Viterbi     18.5          29.1             ~120
      BWA         18.3          26.1             ~70
      (Chart: Energy Consumed (kWh) vs. Number of States for the Forward Algorithm, CPU vs. GPU.)

  13. Applications
      • Pattern Recognition:
        – Spectrum Sensing
        – Signal Classification
        – Specific Emitter Identification
        – Geolocation
      • Modeling:
        – Channel Fading
        – Call Drop Prediction

  14. Why Is This Useful?
      • Evolution of GPUs and multi-core processors:
        – Smart phones, tablets, SDR
        – Co-processor
      • Utilize existing hardware for HMM applications:
        – Large number of states
        – 2D/3D HMMs
      • Uses in other fields (speech recognition, computer vision)
      • Extrapolation to other algorithms (pattern recognition)

  15. Questions?
      Contact Information
      Email: hymelsr@vt.edu
      Blog: http://sgmustadio.wordpress.com/
      Code: http://code.google.com/p/hmm-cuda/
      Other Good Resources
      cuHMM: http://code.google.com/p/chmm/
      MATLAB: http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
      HTK: http://htk.eng.cam.ac.uk/

  16. Supporting Slide: Reductions
      MATLAB example:
      >> sum(A)
      C implementation:
      sum = 0;
      for (i = 0; i < length; i++) {
          sum = sum + A[i];
      }
      Parallelization: reducing an array to a single value (e.g., a sum) goes from O(N) to O(log N).

  17. Supporting Slide: Timing Results (Forward)
      (Charts: Execution Time (s) vs. Number of States for the Forward Algorithm, CPU vs. GPU; Execution Time vs. Number of States for the Forward Algorithm on the GPU alone; and, varying symbols, Execution Time (s) vs. Number of Observations for the Forward Algorithm, CPU vs. GPU.)

  18. Supporting Slide: Timing Results (Viterbi)
      (Charts: Execution Time (s) vs. Number of States for the Viterbi Algorithm, CPU vs. GPU; Execution Time vs. Number of States for Viterbi on the GPU alone; and, varying symbols, Execution Time (s) vs. Number of Symbols for the Viterbi Algorithm, CPU vs. GPU.)

  19. Supporting Slide: Timing Results (BWA)
      (Charts: Execution Time (s) vs. Number of States for the Baum-Welch Algorithm, CPU vs. GPU; Execution Time vs. Number of States for BWA on the GPU alone; and, varying symbols, Execution Time (s) vs. Number of Symbols for the Baum-Welch Algorithm, CPU vs. GPU.)
