PocketSphinx: Open-Source Speech Recognition for Hand-held and Embedded Devices

David Huggins-Daines (dhuggins@cs.cmu.edu), Mohit Kumar (mohitkum@cs.cmu.edu), Arthur Chan (archan@cs.cmu.edu), Alan W Black (awb@cs.cmu.edu), Mosur Ravishankar
What is PocketSphinx?
- Based on Sphinx-II
– Open-source code under an MIT-style license
– Widely used at CMU and elsewhere
– Mature and stable API
- Design goals
– Statistical language model support (finite-state grammars also available)
– Medium-to-large vocabulary (1-10k words)
– Make it go faster
Why do we need it?
- Typical desktop/workstation of 2006
– 128-bit memory bus (6-10GB/sec)
– 1.8-3GHz processor (5000 MIPS)
– ATA, SATA, or SCSI storage (100-300MB/sec)
- Typical PDA/SoC/smartphone of 2006
– 16- or 32-bit memory bus (100-400MB/sec)
– 200-600MHz processor (200-700 MIPS)
– SD/MMC or CF storage (1-16MB/sec)
– No FPU or vector unit (sometimes a DSP...)
ASR bottlenecks
- Wait, you say:
– My cell phone is pretty darn fast!
– At least as fast as that DEC we had a real-time 20k-word system on back in 1996!
- However: ASR is system-bandwidth limited
– Sphinx benchmarks favor large caches and high memory bandwidth (Intel)
– Search, LM, and dictionary look-up are highly memory-intensive
– We will have to deal with these bottlenecks
[Chart: Sphinx benchmark results across processors; source: techreport.com]
Scaling: Hand-held vs Desktop
[Chart: decoding speed (xRT, 0.25-2.25) vs. vocabulary size (10, 1000, 5000 words), hand-held vs. desktop]
How to make it go faster
- Low-hanging fruit
– Front-end optimizations (fixed-point, logarithms)
– Speeding up GMM computation
– Old-fashioned beam tuning
- Non-speech-related work
– Memory optimization (+ model compression)
– Machine-level optimization (assembly code)
- What's left?
– Search optimization: dynamic beam tuning
– Language model compression and optimization
Front-End Optimizations
- Fixed-point calculations
– 32-bit, in 16.16 or 18.14 format
– Using a 64-bit multiply (SMULL) on ARM, or a 16.16 multiply-accumulate on a DSP
– MFCCs calculated in the log domain, using a log2 lookup table with conversion to base-1.0001 logarithms (see the sketch after this list)
- Audio downsampling
– Allows a smaller-order FFT and MFCC computation
– Not as useful for large-vocabulary systems
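As a rough illustration of the fixed-point arithmetic described above, here is a minimal C sketch of a Q16.16 multiply using a 64-bit intermediate product (a single SMULL instruction on ARM) and of converting a base-2 log into the base-1.0001 logs used for probabilities. The names and constants are illustrative, not PocketSphinx's actual API.

    /* Sketch of Q16.16 fixed-point arithmetic and log-base conversion.
     * Illustrative only; not PocketSphinx's real API. */
    #include <stdint.h>

    typedef int32_t fixed16; /* Q16.16: 16 integer bits, 16 fraction bits */

    #define FLOAT2FIX(x) ((fixed16)((x) * 65536.0))
    #define FIX2FLOAT(x) ((double)(x) / 65536.0)

    /* Multiply two Q16.16 numbers; the 64-bit intermediate keeps the
     * full product before renormalizing (one SMULL on ARM). */
    static fixed16 fixmul(fixed16 a, fixed16 b)
    {
        return (fixed16)(((int64_t)a * (int64_t)b) >> 16);
    }

    /* ln(2)/ln(1.0001) ~= 6931.8: one bit of log2 equals ~6932 steps in
     * base 1.0001. Such a small base lets log-probabilities be stored
     * as plain integers with negligible quantization error. */
    #define LOG2_TO_LOGBASE FLOAT2FIX(6931.8)

    /* Convert a Q16.16 base-2 log to an integer base-1.0001 log; the
     * 64-bit product carries 32 fraction bits (16 from each operand),
     * so shifting by 32 yields a plain integer. */
    static int32_t log2_to_logbase(fixed16 log2val)
    {
        return (int32_t)(((int64_t)log2val * LOG2_TO_LOGBASE) >> 32);
    }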
GMM Optimizations
- Top-N based Gaussian selection (Mosur 96)
– Use the previous frame's top codewords to select the current frame's; the standard Sphinx-II technique
- Partial frame-based downsampling (Woszczyna 98)
– Only update the top-N list every Mth frame (see the sketch after this list)
– Can significantly affect accuracy
- kd-tree based Gaussian selection (Fritsch 96)
– Approximate nearest-neighbor search in k dimensions using stable partition trees
– 10% speedup, little or no effect on accuracy
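Below is a minimal, self-contained C sketch of top-N Gaussian selection combined with partial frame-based downsampling. The data structures, the toy distance-based "score", and all names are illustrative; PocketSphinx's real GMM code differs.

    /* Sketch: top-N Gaussian selection + partial frame downsampling. */
    #include <stdint.h>
    #include <string.h>

    #define N_DENSITY   256 /* codebook size */
    #define N_DIM        13 /* feature dimension */
    #define TOP_N         4 /* best codewords kept per frame */
    #define DOWNSAMPLE_M  2 /* full re-ranking only every Mth frame */

    typedef struct {
        int32_t mean[N_DENSITY][N_DIM]; /* fixed-point mean vectors */
        int32_t top[TOP_N];             /* indices of current best codewords */
        int32_t frame;                  /* frame counter */
    } gmm_t;

    /* Toy density "score": negative squared distance to the mean.
     * A real system uses full diagonal Gaussians in the log domain. */
    static int32_t density_score(const gmm_t *g, int d, const int32_t *feat)
    {
        int64_t acc = 0;
        int i;
        for (i = 0; i < N_DIM; i++) {
            int64_t diff = (int64_t)feat[i] - g->mean[d][i];
            acc -= diff * diff;
        }
        return (int32_t)(acc / 1024); /* crude scaling to fit 32 bits */
    }

    /* Every Mth frame, evaluate all densities and refresh the top-N
     * list; on the frames in between, rescore only the densities that
     * were best in the last full pass. */
    void gmm_score_frame(gmm_t *g, const int32_t *feat, int32_t *out_scores)
    {
        int d, n;
        if (g->frame++ % DOWNSAMPLE_M == 0) {
            for (n = 0; n < TOP_N; n++)
                g->top[n] = -1;
            for (d = 0; d < N_DENSITY; d++) {
                out_scores[d] = density_score(g, d, feat);
                /* Insert d into the descending-sorted top-N list. */
                for (n = 0; n < TOP_N; n++) {
                    if (g->top[n] < 0 || out_scores[d] > out_scores[g->top[n]]) {
                        memmove(&g->top[n + 1], &g->top[n],
                                (TOP_N - 1 - n) * sizeof(int32_t));
                        g->top[n] = d;
                        break;
                    }
                }
            }
        } else {
            for (n = 0; n < TOP_N; n++)
                out_scores[g->top[n]] = density_score(g, g->top[n], feat);
        }
    }

Since gmm_t starts zeroed, the first frame always takes the full-evaluation path and populates the top-N list before any partial frame uses it.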
Search Optimizations
- Absolute pruning
– Approximations in the front end and GMM increase the effective beam width, paradoxically decreasing performance
– We would like to enforce a hard limit on the number of states or word exits evaluated per frame. How?
- Histogram pruning (Ney 1996)
– Partition the beam width into bins
– Dynamically recompute the beam based on bin occupancy counts (see the sketch after this list)
– 30% speedup with 10% relative degradation in WER
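Here is a minimal C sketch of the histogram pruning idea, assuming integer log-probability scores where larger is better; the bin count, names, and rounding are illustrative, not taken from PocketSphinx.

    /* Sketch of histogram pruning (after Ney 1996): bucket active
     * state scores into bins below the frame's best score, then pick
     * the tightest beam that keeps at most max_active states. */
    #include <stdint.h>
    #include <string.h>

    #define N_BINS 256

    /* score[]: integer log-probabilities (larger is better) of the
     * n_active states; best: max of score[]; beam: nominal beam width.
     * Returns a threshold; states scoring below it are pruned. */
    int32_t histogram_beam(const int32_t *score, int n_active,
                           int32_t best, int32_t beam, int max_active)
    {
        int bins[N_BINS];
        int i, b, count;

        memset(bins, 0, sizeof(bins));

        /* Map the range [best - beam, best] onto N_BINS buckets,
         * with the best scores falling into bin 0. */
        for (i = 0; i < n_active; i++) {
            int64_t d = (int64_t)best - score[i];
            b = (int)(d * N_BINS / ((int64_t)beam + 1));
            if (b >= N_BINS)
                b = N_BINS - 1; /* outside the beam; will be pruned */
            bins[b]++;
        }

        /* Admit bins from the best outward until max_active is hit. */
        count = 0;
        for (b = 0; b < N_BINS; b++) {
            count += bins[b];
            if (count > max_active)
                break;
        }
        /* Threshold at the lower edge of the last admitted bin; if
         * all bins fit, this degenerates to the nominal beam. */
        return best - (int32_t)((int64_t)beam * b / N_BINS);
    }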
Memory Optimizations
- Read-only model files
– mmap(2)-able, shareable between processes
– Leverage OS-level caching (virtual memory); see the sketch after this list
- Precompiled (binary) LM
– Inherited from Sphinx-II
– Adapted for memory-mapping
– 5000+ word vocabulary in <32MB of RAM
- Read-only binary model definition file
– Pre-built radix tree mapping triphones to senones
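The read-only model trick reduces, in essence, to mmap(2) with PROT_READ and MAP_SHARED. Here is a minimal POSIX C sketch; map_model is a hypothetical helper, not a PocketSphinx function, and error handling is abbreviated.

    /* Sketch: memory-mapping a read-only model file. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    const void *map_model(const char *path, size_t *out_len)
    {
        struct stat st;
        void *base;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return NULL;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return NULL;
        }

        /* PROT_READ + MAP_SHARED: pages are read-only and backed
         * directly by the file, so all processes mapping the same
         * model share one copy, and the OS pages it in on demand. */
        base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd); /* the mapping holds its own reference to the file */
        if (base == MAP_FAILED)
            return NULL;

        *out_len = (size_t)st.st_size;
        return base;
    }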
Performance
Task          Vocabulary   Perplexity   xReal-Time   Word Error
TIDIGITS              10        13.86         0.5         0.87%
RM1                  994        46.79         0.71       13.11%
WSJ devel5k         4989        143.5         0.96       18.50%
- Test platform: iPaq 3670
– 206MHz StrongARM running Linux (FPU emulation in the kernel)
- Also running on:
– Other embedded Linux platforms
– Analog Devices Blackfin (uClinux)
– WinCE using the GNU toolchain (untested)
How to get it
- Web site: http://www.speech.cs.cmu.edu/pocketsphinx/
- Compiles with GCC for i386, ARM, PowerPC, and Blackfin
- Cross-compiles for Windows CE using an arm-wince-pe toolchain (available in various Linux distributions)
- Compatible with the Sphinx2 fbs.h interface (see the usage sketch below)
- Good (fast) acoustic models forthcoming
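For reference, a minimal sketch of decoding a raw audio file through the Sphinx2-style fbs.h interface mentioned above. The calls shown (fbs_init, uttproc_begin_utt, uttproc_rawdata, uttproc_end_utt, uttproc_result, fbs_end) are the classic Sphinx-II entry points, but exact signatures should be checked against the shipped fbs.h; the file name and sample format are placeholders.

    /* Minimal decoding loop against the Sphinx2-compatible API.
     * Model/configuration arguments are passed on the command line,
     * as with the original sphinx2 decoder. */
    #include <stdio.h>
    #include "fbs.h"

    int main(int argc, char *argv[])
    {
        int16 buf[2048];
        int32 nframes;
        char *hyp;
        FILE *fh;

        fbs_init(argc, argv);

        fh = fopen("utterance.raw", "rb"); /* 16-bit mono PCM (placeholder) */
        if (fh == NULL)
            return 1;

        uttproc_begin_utt(NULL);
        for (;;) {
            size_t n = fread(buf, sizeof(int16), 2048, fh);
            if (n == 0)
                break;
            uttproc_rawdata(buf, (int32)n, 1 /* block until consumed */);
        }
        uttproc_end_utt();

        if (uttproc_result(&nframes, &hyp, 1 /* block */) >= 0)
            printf("Recognized: %s\n", hyp);

        fclose(fh);
        fbs_end();
        return 0;
    }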
Future work
- Improve accuracy
– Remove Sphinx-II codebook limitations
- Optimize the language model and dictionary
– Statistical profiling of LM access patterns
- Investigate dynamic search strategies
- Remove various pieces of legacy code
- Fast speaker and channel adaptation
Thank you
- Any questions?
This work was supported by DARPA grant NB CH-D-03-0010. The content of the information in this publication does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred.