Developing Your Own Wake Word Engine Just Like Alexa and OK Google
Xuchen Yao, CEO, KITT.AI
Guoguo Chen, CTO, KITT.AI
What’s a “wake word”?
“Alexa / OK Google / Hey Siri, what’s the weather today?”
- Wake word (hot word): “Alexa”, “OK Google”, “Hey Siri”
  – Offline
  – Code runs on CPU/DSP/MCU
  – 24x7
  – Always listening
- One-shot understanding: “what’s the weather today?”
  – Online
  – Code runs in the cloud
  – On demand
  – Explicit permission
Conversational UI Pipeline
wake up device → speech to text → text understanding → dialogue management → text to speech (text to voice)
A customizable hotword detection engine, a.k.a. a deep neural network in 2 MB of RAM
hotword.io
Who’s using it (released May 2016)
- 10,000+ developers, 7,000+ unique hotwords
- Dominating developer community for hotword detection
Use Cases
#1 Hotword: Smart Mirror
https://github.com/evancohen/smart-mirror (credits to Evan Cohen)
Command & Control: GoPiGo
(credits to Paul Matz)
Project RePL
(credits to Chris Burns)
Conversational UI Pipeline
wake up device → speech to text → text understanding → dialogue management → text to speech (text to voice)
Speech Pipeline
Voice → Microphone Array → Wake Word Detection (local) → Speech Recognition (cloud/local)
- Voice
  – Telephone (8 kHz sampling)
  – Others (16 kHz)
  – Noises: TV, radio, street, café, car, music
  – Pitch: children, adults, seniors
  – Accent: US/UK/Europe/Asian…
- Microphone Array
  – Close talking
  – Far field (3-9 feet)
  – 2, 4, or 6 microphones
  – Linear/circular
  – Voice Activity Detection
  – Auto Gain Control
  – Adaptive Echo Cancellation
  – Beam forming
- Wake Word Detection (local)
  – Fast response (0.1 second)
  – High accuracy
- Speech Recognition (cloud/local)
  – IBM/Microsoft/Nuance/Google
  – Alexa Voice Service
  – Kaldi
  – PocketSphinx
  – HTK
  – Command & Control
  – Language Understanding
Supported Platforms and Wrappers
- Raspberry Pi
- Mac OS X
- iPhone/iPad/iPod
- x86/64bit Ubuntu
- Android
- Pine 64
- Intel Edison
- Samsung Artik
- Allwinner R-series
- Ingenic X1000
- Rockchip
Personal vs. Universal models
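- Voice samples needed: personal 3; universal at least 1,500
- Speaker-independent: personal no; universal yes
- Speaker-specific: personal sort of; universal no
- Robust against noise: personal no; universal yes
- Free: personal yes; universal no
- Time needed: personal immediately; universal 2 weeks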
Customizing a universal model
- define hotword → collect voice → train a model → deliver & evaluate → deploy to beta users → ship & success
- Voice collection: from device, hotword web API
- Iterate & improve
- Desired performance: >90% detection rate, <= 3 false alarms in 24 hours
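The "deliver & evaluate" step can be checked against the desired performance numbers above. The following is a minimal sketch, not KITT.AI's evaluation tooling: the function name, the result type, and the sample counts are all made up for illustration. It normalizes false alarms to a 24-hour period so that test sets of any length can be compared with the target.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    detection_rate: float        # fraction of true hotword utterances detected
    false_alarms_per_24h: float  # false triggers normalized to 24 hours of audio
    meets_target: bool


def evaluate_hotword_model(detected, total_positives, false_alarms, negative_hours):
    """Compare measured performance with the targets stated on the slide."""
    if total_positives <= 0 or negative_hours <= 0:
        raise ValueError("need at least one positive sample and some negative audio")
    detection_rate = detected / total_positives
    false_alarms_per_24h = false_alarms * 24.0 / negative_hours
    # Targets from the slide: >90% detection rate, <= 3 false alarms in 24 hours.
    meets = detection_rate > 0.90 and false_alarms_per_24h <= 3.0
    return EvalResult(detection_rate, false_alarms_per_24h, meets)


# Hypothetical test run: 200 hotword utterances and 48 hours of non-hotword audio.
result = evaluate_hotword_model(detected=188, total_positives=200,
                                false_alarms=5, negative_hours=48)
print(result)
```

If the result misses either target, the loop goes back to collecting voice and retraining.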
Science behind wake word
Challenges
- High detection rate
- Low false alarm rate
- Efficient: detect every 0.1 second
- Small RAM: <2MB
- Too much ambiguity, not much context
Is this “Alexa”? (figure: short window vs. longer window)
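The "detect every 0.1 second" requirement and the short versus longer window trade-off can be pictured as a sliding window over the audio stream. This is only a sketch under stated assumptions: the 16 kHz sample rate is taken from the Speech Pipeline slide, while the 1-second window length and the function name are chosen here for illustration and do not come from the talk.

```python
import numpy as np

SAMPLE_RATE = 16_000    # Hz, the non-telephone rate from the Speech Pipeline slide
HOP_SECONDS = 0.1       # one detection decision every 0.1 second
WINDOW_SECONDS = 1.0    # assumption: a longer window gives the detector more context


def sliding_windows(audio, sample_rate=SAMPLE_RATE,
                    window_s=WINDOW_SECONDS, hop_s=HOP_SECONDS):
    """Yield (start_time_in_seconds, samples) for each detection window."""
    window = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    for start in range(0, len(audio) - window + 1, hop):
        yield start / sample_rate, audio[start:start + window]


audio = np.zeros(3 * SAMPLE_RATE, dtype=np.float32)  # 3 seconds of silence
windows = list(sliding_windows(audio))
print(len(windows))  # 21 windows: starts at 0.0 s, 0.1 s, ..., 2.0 s
```

A shorter window reduces latency and memory but sees less of the utterance, which is where the ambiguity in the last bullet comes from.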
Existing Algorithm
- Advantage:
  – Simplified pipeline
  – Simplified decoder
- Disadvantage:
  – Massive hotword-specific training data
Possible Ways to Improve
- Data augmentation
  – Adding noise
  – Adding reverberation
  – And so on…
(figure: original; add noise; add noise and reverberation)
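The two augmentations named above can be sketched with NumPy. This is an illustrative sketch, not the training pipeline from the talk: the function names, the 10 dB signal-to-noise ratio, and the synthetic exponentially decaying impulse response are all assumptions. A real setup would more likely mix in recorded noise and measured room impulse responses.

```python
import numpy as np


def add_noise(speech, noise, snr_db):
    """Mix noise into speech at the requested signal-to-noise ratio in dB."""
    noise = np.resize(noise, speech.shape)  # repeat or trim noise to the speech length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        return speech.copy()
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def synthetic_impulse_response(sample_rate, decay_s=0.3, seed=0):
    """Assumed stand-in for a room response: noise with an exponential decay."""
    rng = np.random.default_rng(seed)
    n = int(decay_s * sample_rate)
    t = np.arange(n) / sample_rate
    ir = rng.standard_normal(n) * np.exp(-6.9 * t / decay_s)  # about 60 dB decay
    ir[0] = 1.0  # direct path
    return ir


def add_reverb(speech, impulse_response):
    """Convolve with an impulse response and restore the original peak level."""
    wet = np.convolve(speech, impulse_response)[: len(speech)]
    peak = np.max(np.abs(wet))
    return wet / peak * np.max(np.abs(speech)) if peak > 0 else wet


sr = 16_000
rng = np.random.default_rng(1)
speech = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # toy stand-in for a hotword sample
noise = rng.standard_normal(sr // 2)
noisy = add_noise(speech, noise, snr_db=10)
noisy_reverb = add_reverb(noisy, synthetic_impulse_response(sr))
```

Each original recording can be turned into several variants this way, which is one way to stretch a limited set of hotword samples.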
Possible Ways to Improve
- Network models
  – Model selection
    - Feedforward models? Recurrent models?
  – Model compression
    - 32-bit float → 16-bit float → 8-bit integer
    - Parameters with small absolute value
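Both compression ideas can be illustrated on a single weight matrix. The sketch below assumes symmetric linear quantization with one scale per matrix, and it reads "parameters with small absolute value" as magnitude pruning. Neither choice is confirmed by the talk, and the function names and matrix size are hypothetical.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric linear quantization of float weights to int8 plus one scale."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover approximate float32 weights from int8 values."""
    return q.astype(np.float32) * scale


def prune_small(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    k = int(fraction * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights).astype(weights.dtype)


rng = np.random.default_rng(0)
w = rng.standard_normal((256, 128)).astype(np.float32)  # hypothetical layer
q, scale = quantize_int8(w)
pruned = prune_small(w, 0.5)

# Storage for the same layer at each precision, in bytes.
print(w.nbytes, w.astype(np.float16).nbytes, q.nbytes)  # 131072 65536 32768
```

Under the <2MB RAM budget from the Challenges slide, this is the difference between roughly 0.5 million 32-bit parameters and roughly 2 million 8-bit ones, ignoring activations and other buffers.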
Possible Ways to Improve
- Decoder redesigning
  – Modeling smaller units
    - Syllables, phones, etc.
  – False alarm suppression
    - Additional classifier?
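One way to read "additional classifier" is a two-stage cascade: a cheap detector proposes candidates and a second model must confirm them before the device wakes up. The sketch below is an assumption about how such a cascade could look, not the design described in the talk. The function name, the thresholds, and the toy scoring functions are hypothetical.

```python
from typing import Callable, Sequence

Scorer = Callable[[Sequence[float]], float]


def cascade_detect(window: Sequence[float],
                   first_stage: Scorer,
                   second_stage: Scorer,
                   first_threshold: float = 0.5,
                   second_threshold: float = 0.8) -> bool:
    """Fire only if both stages agree; the second stage runs only on candidates."""
    if first_stage(window) < first_threshold:
        return False  # most windows are rejected cheaply here
    return second_stage(window) >= second_threshold


# Toy stand-ins for real models: each maps a window of scores to a confidence.
def toy_first_stage(window):
    return max(window)


def toy_second_stage(window):
    return sum(window) / len(window)


print(cascade_detect([0.9, 0.9, 0.9], toy_first_stage, toy_second_stage))
print(cascade_detect([0.9, 0.1, 0.1], toy_first_stage, toy_second_stage))
```

Because the second stage only runs on windows that pass the first, its extra cost is paid rarely, which matters under the 0.1-second and small-RAM constraints.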
Training with Tesla K20/K80
- Positive data
– 1,500 hotword samples
- Negative data
– Thousands of hours of speech
- Training time
– Half a day with 4 K80 GPUs
Software Architecture
(diagram; labels: Frontend, Backend, KITT.AI Scientific Computing, Deep Learning Cloud, Devices, Production Cloud, Traffic, ELB, Content, WebSocket (audio, msg), HTTPS, Message Queue)
Data → Training → Model → Deploy