Revolutionary Voice Enhancement in Real-Time Communications with GPU (Davit Baghdasaryan, CEO, 2Hz; Arto Minasyan, CTO, 2Hz)

slide-1
SLIDE 1

Revolutionary Voice Enhancement in Real-Time Communications with GPU

Davit Baghdasaryan, CEO, 2Hz
Arto Minasyan, CTO, 2Hz

slide-2
SLIDE 2

slide-3
SLIDE 3
slide-4
SLIDE 4

Mute Background Noises

slide-5
SLIDE 5

Voice Quality with Deep Learning

  • Mute Background Noise

  • Mute Everyone Except Me

  • Remove Room Echo

  • High Resolution Voice Everywhere
slide-6
SLIDE 6

Real-Time Noise Suppression with Deep Learning

slide-7
SLIDE 7

Traditional Noise Cancellation

  • Requires 2-4 mics

  • Runs on edge device

  • Cancels only limited noises

  • Outbound only

slide-8
SLIDE 8

Deep Learning powered Noise Cancellation

Train krispNet Deep Neural Network

Inputs: Background Noises, Clean Human Speech

  • No dependency on mics

  • Bi-directional

  • Cancels all noise types

  • Runs everywhere: on device and in the cloud

slide-9
SLIDE 9

How to Measure Voice Quality?

slide-10
SLIDE 10

Industry Standards

  • Academia - PESQ, Subjective

  • Industry - 3QUEST (Speech MOS, Noise MOS, Global MOS)

  • Skype Audio Test and 3GPP TS 26.131 specifications

slide-11
SLIDE 11

Audio Lab

slide-12
SLIDE 12

slide-13
SLIDE 13

krisp.ai

slide-14
SLIDE 14
slide-15
SLIDE 15

Seamlessly Integrates in Conferencing Apps

Supports any Microphone or Headset

slide-16
SLIDE 16
slide-17
SLIDE 17

krisp.ai Best Product in Audio/Voice 2018

slide-18
SLIDE 18

Training and Inference

slide-19
SLIDE 19

Training Process

slide-20
SLIDE 20

Training Data

  • 2K distinct speakers, with a diverse gender and age distribution

  • >10K distinct noises - babble, construction, traffic, cafeteria, office, etc.

  • 2000+ hours
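
The deck does not say how the speech and noise recordings are combined for training. A common approach, assumed here rather than taken from the slides, is to add a noise clip to a clean utterance at a chosen signal-to-noise ratio and use the clean utterance as the target. The function names and the SNR choices below are hypothetical, and plain Python lists stand in for audio buffers.

```python
import math
import random

def rms(samples):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so the mix has the requested SNR in dB."""
    # Loop and trim the noise so it covers the whole utterance.
    reps = -(-len(speech) // len(noise))  # ceiling division
    noise = (noise * reps)[:len(speech)]
    # Pick the gain so that 20*log10(rms(speech) / rms(gain*noise)) == snr_db.
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]

def make_pair(speech, noise, snr_choices=(0, 5, 10, 15)):
    """One (noisy input, clean target) pair at a randomly chosen SNR.

    The SNR values are illustrative assumptions, not figures from the deck.
    """
    snr_db = random.choice(snr_choices)
    return mix_at_snr(speech, noise, snr_db), speech
```

Under this assumption, pairing each of the 2K speakers with many of the 10K+ noises at several SNRs is one way a corpus could reach the 2000+ hours quoted on the slide.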

slide-21
SLIDE 21

Training on GPUs

  • All in Python

  • Distributed TensorFlow

  • Multiple in-house NVIDIA 1080 Ti GPUs; training takes a full week

  • AWS p2.16xlarge: 16x NVIDIA K80

slide-22
SLIDE 22

Inference

  • Supports NVIDIA, Intel and ARM platforms

  • All in C/C++, sometimes assembly

  • Smaller network (5x boost, with some quality penalty)

  • TensorRT gives a ~2x boost

slide-23
SLIDE 23

Moving to the Cloud

slide-24
SLIDE 24

Server-side Noise Cancellation

slide-25
SLIDE 25

Latency Constraints

  • End-to-end latency budget: 200 ms

  • Codecs and other DSP: 10-80 ms

  • Network: varies

  • DNN compute: < 5 ms

  • DNN algorithmic: 15 ms

  • DNN total (compute + algorithmic): < 20 ms
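
The budget on this slide can be checked with simple arithmetic. The sketch below assumes the stages add up serially and treats the slide's figures as upper bounds; the constant and function names are made up for illustration.

```python
END_TO_END_MS = 200        # total budget from the slide
CODEC_DSP_MS = (10, 80)    # codecs and other DSP: low and high end of the range
DNN_COMPUTE_MS = 5         # upper bound ("< 5 ms")
DNN_ALGORITHMIC_MS = 15    # algorithmic delay of the network

def network_budget_ms(codec_dsp_ms):
    """Time left for the network once codec/DSP and DNN stages are paid for."""
    return END_TO_END_MS - codec_dsp_ms - DNN_COMPUTE_MS - DNN_ALGORITHMIC_MS

best = network_budget_ms(CODEC_DSP_MS[0])   # lightest codec/DSP path
worst = network_budget_ms(CODEC_DSP_MS[1])  # heaviest codec/DSP path
print(f"network budget: {worst}-{best} ms")
```

With these numbers the network is left with roughly 100 to 170 ms, which is why the DNN share has to stay under about 20 ms.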

slide-26
SLIDE 26

How do you scale to 100K+ concurrent streams with such latency constraints?

  • Example: Discord processes 2.5M concurrent audio streams
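
To get a feel for the scale involved, here is a back-of-the-envelope sizing sketch. The per-GPU capacity and the headroom percentage are hypothetical placeholders, not figures from the deck; only the stream counts come from the slide.

```python
def gpus_needed(concurrent_streams, streams_per_gpu, headroom_pct=20):
    """GPUs required when headroom_pct percent of each GPU is kept free."""
    usable = streams_per_gpu * (100 - headroom_pct) // 100
    return -(-concurrent_streams // usable)  # ceiling division

# HYPOTHETICAL per-GPU capacity, chosen only to make the arithmetic concrete.
STREAMS_PER_GPU = 2000

for streams in (100_000, 2_500_000):
    print(streams, "streams ->", gpus_needed(streams, STREAMS_PER_GPU), "GPUs")
```

Swapping in a measured per-GPU figure turns this into a first-pass capacity estimate for either stream count.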

slide-27
SLIDE 27

[Chart: CPU Servers vs. GPU Servers; label: 10x-20x less costly]

slide-28
SLIDE 28

Scalability with Batching
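
The slide gives only the title, so the following is a minimal sketch of what server-side batching could look like, not the authors' implementation. Frames from many streams are queued, and one model call processes a whole batch. `denoise_batch` is a stand-in for the real network, and `FRAME_SAMPLES` and the `Batcher` class are assumptions made for illustration.

```python
from collections import deque

FRAME_SAMPLES = 160  # ASSUMED frame size: 10 ms of 16 kHz audio

def denoise_batch(frames):
    """Placeholder for one batched DNN call; it simply halves every sample."""
    return [[0.5 * s for s in frame] for frame in frames]

class Batcher:
    """Collects frames from many streams and processes them in one call."""

    def __init__(self, max_batch):
        self.max_batch = max_batch
        self.pending = deque()  # (stream_id, frame) in arrival order

    def submit(self, stream_id, frame):
        self.pending.append((stream_id, frame))

    def flush(self):
        """Run one model call over up to max_batch pending frames.

        Returns a list of (stream_id, processed_frame) in arrival order.
        """
        n = min(self.max_batch, len(self.pending))
        items = [self.pending.popleft() for _ in range(n)]
        outputs = denoise_batch([frame for _, frame in items])
        return [(sid, out) for (sid, _), out in zip(items, outputs)]
```

In a real server `flush` would likely run on a short timer as well as on a full batch, so that no frame waits longer than the few milliseconds the latency budget allows for DNN compute.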

slide-29
SLIDE 29

Ultimate Quality

[Diagram: Audio Frame -> Remove Noise -> Remove Room Echo -> Expand Voice HD -> Ultimate Quality Audio Frame; the whole chain is bracketed at 5ms]

slide-30
SLIDE 30

Maximum Quality and Scale with NVIDIA Tensor Cores

slide-31
SLIDE 31

TensorRT is pretty awesome

[Chart: TensorFlow Batching vs. TensorRT Batching on P100, V100, K80 and T4; axis ticks 750, 1500, 2250, 3000]

slide-32
SLIDE 32

T4 and V100 are both awesome

[Chart: FP32 vs. FP16 on P100, V100 and T4; axis ticks 1250, 2500, 3750, 5000]

slide-33
SLIDE 33

Key Takeaways

  1. Voice Quality Enhancement is moving to the Cloud

  2. For large scale deployments we need GPUs

  3. T4 and V100 GPUs are most efficient for this

slide-34
SLIDE 34

Thank You!

Booth #247