
05/18/06 Language Technologies Institute, Carnegie Mellon University Slide 1

PocketSphinx:

Open-Source Speech Recognition for Hand-held and Embedded Devices

David Huggins-Daines (dhuggins@cs.cmu.edu) Mohit Kumar (mohitkum@cs.cmu.edu) Arthur Chan (archan@cs.cmu.edu) Alan W Black (awb@cs.cmu.edu) Mosur Ravishankar (rkm@cs.cmu.edu) Alexander I. Rudnicky (air@cs.cmu.edu)


What is PocketSphinx?

  • Based on Sphinx-II
    – Open-source code under MIT-style license
    – Widely used at CMU and elsewhere
    – Mature and stable API
  • Design goals
    – Statistical Language Model support
      • Finite-State Grammars also available
    – Medium-large vocabulary (1-10k words)
    – Make it go faster


Why do we need it?

  • Typical desktop/workstation of 2006
    – 128-bit memory bus (6-10 GB/sec)
    – 1.8-3 GHz processor (5000 MIPS)
    – ATA, SATA, or SCSI storage (100-300 MB/sec)
  • Typical PDA/SoC/smartphone of 2006
    – 16- or 32-bit memory bus (100-400 MB/sec)
    – 200-600 MHz processor (200-700 MIPS)
    – SD/MMC or CF storage (1-16 MB/sec)
    – No FPU or vector unit (sometimes a DSP...)


ASR bottlenecks

  • Wait, you say:
    – My cell phone is pretty darn fast!
    – At least as fast as that DEC we had a real-time 20k system on back in 1996!
  • However: ASR is system bandwidth limited
    – Sphinx benchmarks (shown to the right) favor large caches and high memory bandwidth (Intel)
    – Search, LM, and dictionary look-up are highly memory-intensive
    – We will have to deal with them

(Source: techreport.com)


Scaling: Hand-held vs Desktop

[Chart: decoding speed (xRT, 0.25-2.25) vs. vocabulary size (10, 1000, 5000 words) for hand-held and desktop platforms]


How to make it go faster

  • Low-hanging fruit
    – Front-end optimizations (fixed-point, logarithm)
    – Speeding up GMM computation
    – Old-fashioned beam tuning
  • Non-speech-related work
    – Memory optimization (+ model compression)
    – Machine-level optimization (assembly code)
  • What's left?
    – Search optimization – dynamic beam tuning
    – Language model compression and optimization


Front-End Optimizations

  • Fixed-point calculations
    – 32-bit, 16.16 or 18.14 format
    – Using 64-bit multiply (SMULL) on ARM, 16.16 multiply-accumulate on DSP
    – MFCC calculated in log domain, using a lookup of log2 with conversion to log1.0001
  • Audio downsampling
    – Allows smaller-order FFT and MFCC
    – Not as useful for large-vocabulary systems
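The fixed-point idea above can be sketched as follows. This is an illustrative Q16.16 toy in Python, not the PocketSphinx C code; the helper names (`to_fixed`, `fxmul`) are made up for the example. The 64-bit intermediate in `fxmul` plays the role of ARM's 32x32->64 SMULL instruction.

```python
# Toy Q16.16 fixed-point arithmetic: 16 integer bits, 16 fractional bits.
Q = 16  # fractional bits

def to_fixed(x: float) -> int:
    """Convert a float to Q16.16 representation."""
    return int(round(x * (1 << Q)))

def to_float(x: int) -> float:
    """Convert a Q16.16 value back to a float."""
    return x / (1 << Q)

def fxmul(a: int, b: int) -> int:
    """Multiply two Q16.16 values via a wide intermediate product,
    then shift back down - analogous to SMULL on ARM."""
    return (a * b) >> Q  # Python ints are arbitrary precision

a = to_fixed(1.5)
b = to_fixed(2.25)
print(to_float(fxmul(a, b)))  # 1.5 * 2.25 = 3.375
```

Because 16.16 keeps only 16 fractional bits, products of very small probabilities underflow quickly, which is one reason the front end works in the log domain.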


GMM Optimizations

  • Top-N based Gaussian selection (Mosur 96)
    – Use previous frame's top codewords to select current frame
    – Standard Sphinx-II technique
  • Partial frame-based downsampling (Woszczyna 98)
    – Only update top-N every Mth frame
    – Can significantly affect accuracy
  • kd-tree based Gaussian selection (Fritsch 96)
    – Approximate nearest-neighbor search in k dimensions using stable partition trees
    – 10% speedup, little or no effect on accuracy
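A toy sketch of the top-N idea: score the full codebook once to seed a shortlist, then on later frames score only the previous frame's winners. This is a simplification for illustration (the real Sphinx-II machinery differs in detail), and all names here are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum((xi - mi) ** 2 / vi + math.log(2 * math.pi * vi)
                      for xi, mi, vi in zip(x, mean, var))

def top_n(frame, codebook, candidates, n=2):
    """Score only `candidates` (e.g. the previous frame's winners)
    and return the indices of the n best-scoring codewords."""
    ranked = sorted(candidates,
                    key=lambda i: log_gauss(frame, *codebook[i]),
                    reverse=True)
    return ranked[:n]

# Toy codebook of four 1-D Gaussians: (mean, variance) pairs.
codebook = [([0.0], [1.0]), ([1.0], [1.0]), ([5.0], [1.0]), ([9.0], [1.0])]

# Frame 0: evaluate the whole codebook once to seed the shortlist.
shortlist = top_n([0.9], codebook, range(len(codebook)), n=2)
print(shortlist)  # [1, 0] - the two codewords nearest 0.9

# Frame 1: evaluate only the shortlist, not all four Gaussians.
shortlist = top_n([1.1], codebook, shortlist, n=2)
```

The saving comes from speech changing slowly frame to frame, so the previous frame's best codewords are usually still the best candidates.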


Search Optimizations

  • Absolute pruning
    – Approximations in the front end and GMM increase the effective beam width, paradoxically decreasing performance
    – We would like to enforce a hard limit on the number of states or word exits evaluated per frame - how?
  • Histogram pruning (Ney 1996)
    – Partition the beam width into bins
    – Dynamically recompute beam based on bin occupancy counts
    – 30% speedup with 10% relative degradation in WER
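The histogram-pruning step can be sketched like this: bucket the active path scores into bins spanning the full beam, then tighten the beam to the narrowest width whose bins already hold the allowed number of active paths. Function and parameter names are illustrative, not the PocketSphinx API.

```python
def histogram_beam(scores, full_beam, n_bins=32, max_active=4):
    """Return a (possibly tighter) beam width, measured as an offset
    below the best score, such that roughly `max_active` paths survive."""
    best = max(scores)
    bin_width = full_beam / n_bins
    bins = [0] * n_bins
    for s in scores:
        d = best - s
        if d < full_beam:  # only paths already inside the full beam count
            bins[min(int(d / bin_width), n_bins - 1)] += 1
    total = 0
    for i, count in enumerate(bins):  # widen bin by bin until quota is met
        total += count
        if total >= max_active:
            return (i + 1) * bin_width
    return full_beam

# Six active paths; four are near the best, two are far behind.
scores = [0.0, -1.0, -2.0, -3.0, -50.0, -60.0]
print(histogram_beam(scores, full_beam=100.0, n_bins=10, max_active=4))  # 10.0
```

This turns the fixed beam into a hard cap on per-frame work: no matter how many paths the wider approximations let through, only about `max_active` survive.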


Memory Optimizations

  • Read-only model files
    – mmap(2)able, shareable between processes
    – Leverage OS-level caching (virtual memory)
  • Precompiled (binary) LM
    – Inherited from Sphinx-II
    – Adapted for memory-mapping
    – 5000+ word vocabulary in <32M of RAM
  • Read-only binary model definition file
    – Pre-built radix tree of triphones->senones
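The memory-mapping idea can be sketched in a few lines: map a read-only binary file instead of read()ing it into heap memory, so the OS pages data in on demand and can share one physical copy across processes. The file name and layout below are invented for the example; PocketSphinx defines its own binary formats.

```python
import mmap
import os
import struct
import tempfile

# Write a fake "model" file: a little-endian count, then float32 parameters.
params = [0.25, 0.5, 0.75]
path = os.path.join(tempfile.mkdtemp(), "model.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<I", len(params)))
    f.write(struct.pack("<%df" % len(params), *params))

# Map it read-only; unpack_from reads fields in place, with no heap copy
# of the whole file.
with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (n,) = struct.unpack_from("<I", m, 0)
    values = struct.unpack_from("<%df" % n, m, 4)
    m.close()

print(n, values)  # 3 (0.25, 0.5, 0.75)
```

On a device with slow flash storage, the win is that untouched model pages are never read at all, and clean read-only pages can be evicted without writeback.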


Performance

  Task          Vocabulary   Perplexity   xReal-Time   Word Error
  TIDIGITS              10        13.86         0.5         0.87%
  RM1                  994        46.79         0.71       13.11%
  WSJ devel5k         4989        143.5         0.96       18.50%

  • Test platform: iPaq 3670
    – 206MHz StrongARM running Linux (FPU emulation in kernel)
  • Also running on:
    – Other embedded Linux platforms
    – Analog Devices Blackfin, uClinux
    – WinCE using GNU toolchain (untested)
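For readers unfamiliar with the xReal-Time column above: xRT is decoding time divided by audio duration, so values below 1.0 mean faster than real time. A minimal illustration, using the RM1 figure from the table (the 10-second duration is made up):

```python
# xRT = time spent decoding / duration of the audio decoded.
audio_seconds = 10.0   # hypothetical utterance length
decode_seconds = 7.1   # hypothetical decoding time giving the RM1 figure

xrt = decode_seconds / audio_seconds
print(round(xrt, 2))  # 0.71 - faster than real time, as on the iPaq for RM1
```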


How to get it

  • Web site: http://www.speech.cs.cmu.edu/pocketsphinx/
  • Compiles with GCC for i386, ARM, PowerPC, and Blackfin
  • Cross-compiles for Windows CE using an arm-wince-pe toolchain (available in various Linux distributions)
  • Compatible with Sphinx2 fbs.h interface
  • Good (fast) acoustic models forthcoming

Future work

  • Improve accuracy
    – Remove Sphinx-II codebook limitations
  • Optimize the language model and dictionary
    – Statistical profiling of LM access patterns
  • Investigate dynamic search strategies
  • Remove various legacy code
  • Fast speaker and channel adaptation

Thank you

  • Any questions?

This work was supported by DARPA grant NB CH-D-03-0010. The content of the information in this publication does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred.