Developing Your Own Wake Word Engine Just Like Alexa and OK Google


SLIDE 1

Developing Your Own Wake Word Engine Just Like “Alexa” and “OK Google”

Xuchen Yao, CEO, KITT.AI
Guoguo Chen, CTO, KITT.AI

SLIDE 2

What’s a “wake word”?

Example: “Alexa” / “OK Google” / “Hey Siri” (wake word), followed by “what’s the weather today?” (one-shot understanding)

Wake word (a.k.a. hot word)

  • Offline
  • Code runs on CPU/DSP/MCU
  • 24x7
  • Always listening

One-shot understanding

  • Online
  • Code runs on cloud
  • On demand
  • Explicit permission

SLIDE 3

Conversational UI Pipeline

wake up device → speech → text → text understanding → dialogue management → text → speech

(diagram labels: text, voice)

SLIDE 4

A customizable hotword detection engine, a.k.a. a deep neural network in 2MB of RAM

hotword.io | video | blog

SLIDE 5
SLIDE 6

Who’s using it (released 5/2016)

  • 10,000+ developers, 7,000+ unique hotwords
  • Dominating developer community for hotword detection

SLIDE 7

Use Cases

SLIDE 8

#1 Hotword: Smart Mirror

https://github.com/evancohen/smart-mirror (credits to Evan Cohen) video link

SLIDE 9

Command & Control: GoPiGo

(credits to Paul Matz) video link

SLIDE 10

Project RePL

(credits to Chris Burns) video link

SLIDE 11

Conversational UI Pipeline

wake up device → speech → text → text understanding → dialogue management → text → speech

(diagram labels: text, voice; highlighted portion: Speech Pipeline)

SLIDE 12

Speech Pipeline

Voice → Microphone Array → Wake Word Detection → Speech Recognition

Voice

  • Close talking
  • Far field (3-9 feet)
  • Telephone (8 kHz sampling)
  • Others (16 kHz)
  • Noises: TV, radio, street, café, car, music
  • Pitch: children, adults, senior
  • Accent: US/UK/Europe/Asian…

Microphone Array

  • 2, 4, or 6 microphones
  • Linear/circular
  • Adaptive Echo Cancellation
  • Beam forming

Wake Word Detection (local)

  • Voice Activity Detection
  • Auto Gain Control
  • Fast response (0.1 second)
  • High accuracy

Speech Recognition (cloud/local)

  • IBM/Microsoft/Nuance/Google
  • Alexa Voice Service
  • Kaldi
  • PocketSphinx
  • HTK
  • Command & Control
  • Language Understanding
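
Two of the front-end steps named on this slide, Voice Activity Detection and Auto Gain Control, can be illustrated with a minimal energy-based sketch. This is not the engine's actual front end: the RMS threshold, the target level, the gain cap, and all function names are assumptions chosen for illustration. Only the 16 kHz sampling rate and the 0.1 second response time come from the slides.

```python
import math

SAMPLE_RATE = 16000      # 16 kHz, the non-telephone rate listed on the slide
FRAME_SECONDS = 0.1      # one decision per 0.1 second, as on the slide


def frame_rms(frame):
    """Root-mean-square level of one frame of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def split_frames(samples, frame_len):
    """Cut the signal into consecutive, non-overlapping full frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def voice_activity(samples, threshold=0.02):
    """Energy-based VAD: True for each frame whose RMS exceeds the threshold.

    The threshold value is an assumption, not a figure from the talk.
    """
    frame_len = round(SAMPLE_RATE * FRAME_SECONDS)
    return [frame_rms(f) > threshold for f in split_frames(samples, frame_len)]


def auto_gain(frame, target_rms=0.1, max_gain=10.0):
    """Scale a frame toward a target RMS, capping the gain and clipping to [-1, 1]."""
    level = frame_rms(frame)
    if level == 0.0:
        return list(frame)
    gain = min(target_rms / level, max_gain)
    return [max(-1.0, min(1.0, s * gain)) for s in frame]


# 0.2 s of silence followed by 0.2 s of a quiet 440 Hz tone.
silence = [0.0] * 3200
tone = [0.05 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(3200)]
flags = voice_activity(silence + tone)
boosted = auto_gain(tone[:1600])
```

A production front end would typically use a trained VAD and a smoothed gain rather than per-frame decisions; the sketch only shows where these steps sit before wake word detection.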

SLIDE 13

Supported Platforms and Wrappers

  • Raspberry Pi
  • Mac OS X
  • iPhone/iPad/iPod
  • x86/64bit Ubuntu
  • Android
  • Pine 64
  • Intel Edison
  • Samsung Artik
  • Allwinner R-series
  • Ingenic X1000
  • Rockchip

SLIDE 14

Personal vs. Universal models

                        Personal      Universal
Voice samples needed    3             At least 1,500
Speaker-independent     No            Yes
Speaker-specific        Sort of       No
Robust against noise    No            Yes
Free                    Yes           No
Time needed             Immediately   2 weeks

SLIDE 15

Customizing a universal model

define hotword → collect voice → train a model → deliver & evaluate → deploy to beta users → ship & success

Iterate & improve: collect voice from device, hotword web API

Desired performance:

  • >90% detection rate
  • <= 3 false alarms in 24 hours
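
The desired-performance target above can be checked with simple arithmetic. The sketch below assumes an evaluation set with a known number of hotword utterances and a known duration of hotword-free audio. The function names and the sample counts are hypothetical; only the two thresholds come from the slide.

```python
def detection_rate(true_positives, total_hotword_utterances):
    """Fraction of hotword utterances that triggered a detection."""
    return true_positives / total_hotword_utterances


def false_alarms_per_24h(false_alarms, negative_audio_hours):
    """False alarms normalized to a 24-hour period of hotword-free audio."""
    return false_alarms * 24.0 / negative_audio_hours


def meets_target(true_positives, total_hotword_utterances,
                 false_alarms, negative_audio_hours,
                 min_rate=0.90, max_false_alarms=3.0):
    """True if detection rate > 90% and false alarms <= 3 per 24 hours."""
    rate = detection_rate(true_positives, total_hotword_utterances)
    alarms = false_alarms_per_24h(false_alarms, negative_audio_hours)
    return rate > min_rate and alarms <= max_false_alarms


# Hypothetical evaluation: 93 of 100 hotwords detected,
# 5 false alarms over 48 hours of background audio.
rate = detection_rate(93, 100)
alarms = false_alarms_per_24h(5, 48)
ok = meets_target(93, 100, 5, 48)
```

Both numbers have to be reported together, since raising the detection threshold lowers false alarms at the cost of detection rate.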

SLIDE 16

Science behind wake word

SLIDE 17

Challenges

  • High detection rate
  • Low false alarm rate
  • Efficient: detect every 0.1 second
  • Small RAM: <2MB
  • Too much ambiguity, not much context

[Figure: Is this “Alexa”? Short window vs. longer window]
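
"Detect every 0.1 second" amounts to sliding a window over the audio with a 0.1 second hop. The sketch below shows how a short and a longer window cover the same audio. The 0.5 and 1.5 second window lengths and the 16 kHz rate are assumptions for illustration; the slide gives only the 0.1 second interval.

```python
def sliding_windows(num_samples, sample_rate, window_seconds, hop_seconds=0.1):
    """Return (start, end) sample indices of every full window, one per hop."""
    window = round(window_seconds * sample_rate)
    hop = round(hop_seconds * sample_rate)
    return [(start, start + window)
            for start in range(0, num_samples - window + 1, hop)]


SAMPLE_RATE = 16000
two_seconds = 2 * SAMPLE_RATE

# Short window: less context per decision, more decisions fit in the audio.
short = sliding_windows(two_seconds, SAMPLE_RATE, window_seconds=0.5)

# Longer window: more context per decision, each decision costs more compute.
longer = sliding_windows(two_seconds, SAMPLE_RATE, window_seconds=1.5)
```

Each window is scored independently, which is why a short window can be ambiguous: it may contain only part of the hotword, or a fragment of another word that sounds similar.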

SLIDE 18

Existing Algorithm

SLIDE 19

Existing Algorithm

SLIDE 20

Existing Algorithm

  • Advantages:
    – Simplified pipeline
    – Simplified decoder
  • Disadvantage:
    – Requires massive hotword-specific training data

SLIDE 21

Possible Ways to Improve

  • Data augmentation
    – Adding noise
    – Adding reverberation
    – And so on…

Audio examples: original; add noise; add noise and reverberation
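
The two augmentations named above can be sketched in plain Python. This is a generic illustration, not the presenters' pipeline: mixing at a target signal-to-noise ratio and convolving with a room impulse response are standard techniques, and the 10 dB SNR, the toy impulse response, and the function names are assumptions.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def add_noise(clean, noise, snr_db):
    """Mix noise into clean so the resulting signal-to-noise ratio is snr_db.

    noise must be at least as long as clean.
    """
    scale = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]


def add_reverb(signal, impulse_response):
    """Convolve the signal with a room impulse response (direct convolution)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def measured_snr_db(clean, noisy):
    """SNR in dB, treating the difference noisy - clean as the noise."""
    residual = [y - c for c, y in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(residual))


rng = random.Random(0)
clean = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(800)]
noise = [rng.gauss(0.0, 1.0) for _ in range(800)]

noisy = add_noise(clean, noise, snr_db=10)
# Toy impulse response: direct path plus one echo two samples later.
noisy_reverberant = add_reverb(noisy, [1.0, 0.0, 0.5])
```

In practice the noise would come from recordings such as TV, street, or car audio, and the impulse responses from measured or simulated rooms, so that a small set of hotword recordings yields many training variants.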

SLIDE 22

Possible Ways to Improve

  • Network models
    – Model selection
      • Feedforward models? Recurrent models?
    – Model compression
      • 32-bit float → 16-bit float → 8-bit integer
      • Parameters with small absolute value (pruning)
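
The two compression ideas above can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual scheme: symmetric linear quantization with a single scale is one common way to go from floats to 8-bit integers, and the pruning threshold and the 500,000-parameter model size are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric linear quantization: floats to integers in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


def prune_small(weights, threshold):
    """Zero out parameters whose absolute value is below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def model_bytes(num_parameters, bits_per_parameter):
    """Storage needed for the parameters alone."""
    return num_parameters * bits_per_parameter // 8


weights = [0.5, -1.27, 0.01, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
pruned = prune_small(weights, threshold=0.05)

# Hypothetical 500,000-parameter network at each precision.
sizes = {bits: model_bytes(500_000, bits) for bits in (32, 16, 8)}
```

With these hypothetical numbers, the 32-bit model alone takes 2,000,000 bytes, while the 8-bit version takes 500,000 bytes, which is the kind of headroom a RAM budget under 2MB requires.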

SLIDE 23

Possible Ways to Improve

  • Decoder redesign
    – Modeling smaller units
      • Syllables, phones, etc.
    – False alarm suppression
      • Additional classifier?
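
One way to picture a decoder over smaller units is sketched below. It assumes the network outputs a per-frame posterior for each unit (for example, each syllable of the hotword), smooths each track, and combines the peak scores. The ordering check is a simple rule standing in for the "additional classifier" the slide asks about. The whole scheme, including the function names and the smoothing width, is an assumption and not the method presented in the talk.

```python
def moving_average(values, width):
    """Smooth a per-frame score track with a trailing moving average."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - width + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def hotword_confidence(unit_posteriors, smooth_width=2):
    """Combine per-unit posterior tracks into one confidence score.

    unit_posteriors holds one list of per-frame posteriors per unit,
    in the order the units are spoken. Returns 0.0 when the units peak
    out of order, which suppresses one class of false alarms.
    """
    peaks = []
    for track in unit_posteriors:
        smoothed = moving_average(track, smooth_width)
        best = max(range(len(smoothed)), key=smoothed.__getitem__)
        peaks.append((best, smoothed[best]))

    frames = [frame for frame, _ in peaks]
    if frames != sorted(frames):
        return 0.0

    # Geometric mean of the peak scores, so one weak unit lowers the total.
    product = 1.0
    for _, score in peaks:
        product *= score
    return product ** (1.0 / len(peaks))


# Hypothetical posteriors for three units over six frames.
unit_1 = [0.9, 0.9, 0.1, 0.0, 0.0, 0.0]
unit_2 = [0.0, 0.1, 0.9, 0.9, 0.1, 0.0]
unit_3 = [0.0, 0.0, 0.0, 0.1, 0.9, 0.9]

in_order = hotword_confidence([unit_1, unit_2, unit_3])
out_of_order = hotword_confidence([unit_3, unit_2, unit_1])
```

Because the units are shared across words, this kind of decoder needs less hotword-specific training data than a whole-word model, at the cost of a more involved decoding step.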

SLIDE 24

Training with Tesla K20/K80

  • Positive data
    – 1,500 hotword samples
  • Negative data
    – Thousands of hours of speech
  • Training time
    – Half a day with 4 K80 GPUs

SLIDE 25

Software Architecture

Frontend Backend

SLIDE 26

KITT.AI Scientific Computing

Components: Deep Learning Cloud, Devices, Production Cloud

  • Traffic → ELB
  • Content
  • WebSocket: audio, msg
  • HTTPS
  • Message Queue
  • Data → Training → Model → Deploy

SLIDE 27

Running Your First Snowboy Demo