Vulnerabilities of Voice Assistants at the Edge: From Defeating Hidden Voice Attacks to Audio-based Adversarial Attacks


SLIDE 1

Vulnerabilities of Voice Assistants at the Edge: From Defeating Hidden Voice Attacks to Audio-based Adversarial Attacks

Yingying (Jennifer) Chen

Professor, Electrical and Computer Engineering Department
Associate Director, WINLAB
Director, Data Analysis and Information Security (DAISY) Lab
Rutgers University, New Brunswick, NJ, USA
yingche@scarletmail.rutgers.edu
http://www.winlab.rutgers.edu/~yychen/

IEEE ICNP Workshop AIMCOM2, October 13, 2020

DAISY

Data Analysis and Information SecuritY Lab

SLIDE 2

Wireless Information Network Laboratory (WINLAB)

• Industry-university research center founded in 1989
  - Focus on wireless technology
• Hosting world-class researchers
  - 20 faculty members from different departments
  - 45 PhD students
• Active research directions:
  - Mobile ad hoc networks (MANET) for tactical applications
  - Mesh network protocols
  - Delay tolerant networks (DTN)
  - Software defined networks
  - Mobile content delivery
  - Wireless network security

SLIDE 3

Open-Access Research Testbed for Next-Generation Wireless Networks (ORBIT)

• 400-node USRP open-access research testbed
• Funded by NSF since 2003 with $12M
• Research applications:
  - 5G mmWave
  - Mobile edge cloud and future mobile Internet
  - Healthcare IT and Internet of Things (IoT)
  - Mobile sensing and user behavior recognition
  - Network coding and spectrum management
  - Vehicular networking

[Photos: USRP radio board, control room, ORBIT nodes]

SLIDE 4

Cloud Enhanced Open Software Defined Mobile Wireless Testbed for City-Scale Deployment (COSMOS)

• Funded by NSF PAWR with $22M in 2018 to deploy a 5G network testbed
• Led by Rutgers in collaboration with Columbia University, New York University, and University of Arizona
• Focus on 5G technologies
  - Ultra-high bandwidth and low latency wireless communication
• Tightly coupled with edge cloud computing
  - Deployment in New York City
  - 9 large sites and 40 medium sites
  - 200 small nodes to support edge computing
  - Fiber connection to Rutgers, GENI/I2, NYU
  - Interaction with smart community
• Research applications:
  - Ultra-high bandwidth, low latency, and powerful edge computing
  - Future mobile Internet and mobile edge cloud
  - Healthcare IT and Internet of Things (IoT)
  - AR and VR
  - Vehicular networking

SLIDE 5

Defeating Hidden Audio Channel Attacks on Edge Voice Assistants via Audio-Induced Surface Vibrations

DAISY

Data Analysis and Information SecuritY Lab

SLIDE 6

Motivation

• Widely deployed voice controllable systems (VCS) at the edge
  - Convenient way of interaction
  - Integrated into many platforms
• Fundamental vulnerabilities due to the propagation properties of sound
• Emerging hidden voice commands
  - Recognizable to VCS
  - Incomprehensible to humans

[Examples: mobile phones (e.g., Siri and Google Now), smart appliances, stand-alone assistants]

SLIDE 7

Hidden Voice Command

• Exploits the disparities in voice recognition between humans and machines
• Iteratively shapes the audio features of a command to meet two requirements:
  - Understandable to VCSs
  - Hard to perceive by the users

[Generation pipeline: normal voice command → MFCC feature extraction → adjusting MFCC parameters → inverse MFCC → candidate obfuscated command; a candidate becomes a hidden voice command only if it is recognized by the speech recognition system but not by a human listener. A minimal sketch of this loop follows below.]

• Attack model
  - Internal attack: embedded in media and played by the target device itself
  - External attack: played via a loudspeaker in the proximity (e.g., "browse evil.com", "call 911")
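A minimal Python sketch of this generation loop, assuming hypothetical `asr_recognizes` and `human_recognizes` oracles standing in for the target recognizer and a human listening test. This is an illustration of the idea, not the attack authors' code; the MFCC adjustment is reduced to random perturbation.

```python
# Illustrative sketch of the hidden-voice-command generation loop.
import numpy as np
import librosa

def generate_hidden_command(audio, sr, asr_recognizes, human_recognizes,
                            n_mfcc=13, max_iters=100):
    """Iteratively degrade a normal command until the ASR still accepts it
    but a human listener no longer understands it."""
    candidate = audio
    for _ in range(max_iters):
        # Project into the MFCC domain the recognizer actually "hears".
        mfcc = librosa.feature.mfcc(y=candidate, sr=sr, n_mfcc=n_mfcc)
        # Perturb the MFCC parameters (here: small random noise) and
        # resynthesize a candidate obfuscated command via inverse MFCC.
        mfcc += np.random.normal(scale=0.5, size=mfcc.shape)
        candidate = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)
        if asr_recognizes(candidate) and not human_recognizes(candidate):
            return candidate  # hidden voice command found
    return None  # failed within the iteration budget
```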

SLIDE 8

Related Work

• Defending against acoustic attacks based on audio information
  - Voice authentication models
    - Gaussian Mixture Models
    - i-vector models
  - Speech vocal features
• Speaker liveness detection
  - Articulatory gesture
  - Proximity detection leveraging a second microphone (e.g., on a wearable)
  - Restricted application scenarios: either the microphone must be held close to the mouth, or additional dedicated hardware is required

Relying only on speech audio features is vulnerable to hidden voice commands. A multi-modality authentication framework is highly desirable to provide enhanced security:

Audio sensing modality + vibration sensing modality

SLIDE 9

Basic Idea

• Many VCS devices (e.g., smartphones and voice assistant systems) are already equipped with motion sensors
• Unique audio-induced surface vibrations captured by the motion sensor are hard to forge
• Two playback modes for capturing the noticeable impact of speech on motion sensors

[Front-end playback: motion sensor and speaker reside on the same mobile device or HomePod. Back-end playback: the command is replayed on a device in the cloud service.]

Basic Idea: utilizing the vibration signatures of the voice command to detect hidden voice commands

SLIDE 10

Capturing Voice Using Motion Sensors

• Shared surface between loudspeaker and microphone
• Low-sampling-rate motion sensors (e.g., < 200 Hz), which lead to aliased vibration signals (see the sketch below)
• Nonlinear vibration responses
• Distinct vibration domain

[Figure: played audio "show facebook.com" and its vibration responses; down-sampled microphone data vs. accelerometer data]
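To see why a sub-200 Hz sampling rate matters, here is a minimal sketch (not from the paper) of frequency folding: any speech-band vibration above the Nyquist rate appears at a folded, lower frequency in the accelerometer data.

```python
# Minimal aliasing illustration: a tone above the Nyquist rate folds
# down to a lower apparent frequency in the sampled signal.
def aliased_frequency(f_tone_hz: float, fs_hz: float) -> float:
    """Apparent frequency of a tone sampled at fs_hz."""
    f = f_tone_hz % fs_hz       # wrap into one sampling period
    return min(f, fs_hz - f)    # fold around Nyquist (fs/2)

# A 440 Hz vibration sampled by a 200 Hz accelerometer appears at 40 Hz.
print(aliased_frequency(440.0, 200.0))  # -> 40.0
```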

SLIDE 11

Why Vibration?

• Existing speech/voice recognition methods are based on audio-domain vocal features
• Hidden voice commands are designed to duplicate these audio-domain features by iteratively modifying a voice command
• Audio-induced surface vibrations
  - An additional sensing domain, distinct from audio
  - Hard to forge from audio signals in software
  - Similar audio features result in distinct vibration features
  - The resulting vibration responses are device-dependent (device physical vibrations, motion sensors)

The vibration-domain approach can work in conjunction with the audio-domain approach to detect hidden voice commands more effectively.

SLIDE 12

System Overview

[System pipeline]

• Preprocessing: accelerometer readings → data calibration → vibration noise removal → voice command segmentation
• Vibration feature derivation: time/frequency-domain statistical features and acoustic features (MFCC, chroma vector)
• Vibration feature selection: statistical-analysis-based selection, then feature normalization
• Hidden voice command detection:
  - Supervised learning-based classifiers: Simple Logistic, SMO, Random Forest, Random Tree
  - Unsupervised learning-based classifiers: k-means, k-medoids
• Two playback modes: front-end playback (mobile device or HomePod with on-board motion sensor and speaker) and back-end playback (replay device in the cloud service)

A skeleton of this pipeline is sketched below.
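For instance, the stages can be wired up as a scikit-learn `Pipeline`. This skeleton is illustrative, with feature extraction reduced to a placeholder and only one of the listed classifiers shown; it is not the authors' implementation.

```python
# Skeleton of the detection pipeline above as a scikit-learn Pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

def vibration_features(segments):
    """Placeholder for the real time/frequency + MFCC/chroma features."""
    return np.array([[s.mean(), s.std(), np.abs(np.fft.rfft(s)).max()]
                     for s in segments])

detector = Pipeline([
    ("select", SelectKBest(f_classif, k=2)),  # statistical feature selection
    ("norm", StandardScaler()),               # feature normalization
    ("clf", RandomForestClassifier()),        # one of the supervised options
])

# X = vibration_features(calibrated_segments)  # hypothetical input
# detector.fit(X, y)  # y: 0 = benign command, 1 = hidden voice command
```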

SLIDE 13

Vibration Feature Derivation

• Unique and hard-to-forge vibration features
  - Statistical features in time and frequency domains
  - Acoustic features derived from motion sensor data: MFCC and chroma vectors (illustrated below)

[Figure: "show facebook.com" in the audio domain vs. the vibration domain of a human speaker and of a hidden voice command]

• Nonlinear relationship between audio features and vibration features
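As an illustration of deriving acoustic features from motion sensor data, the accelerometer trace can be treated as a very low-rate audio signal. The parameter values below are assumptions, not the authors' settings.

```python
# Illustrative sketch: MFCC and chroma summaries from an accelerometer trace.
import numpy as np
import librosa

def acoustic_features_from_accel(trace, fs=200):
    """Derive MFCC and chroma features from a raw accelerometer trace,
    treating it as a (very low sampling rate) audio signal."""
    y = trace.astype(np.float32)
    mfcc = librosa.feature.mfcc(y=y, sr=fs, n_mfcc=13, n_mels=20,
                                n_fft=64, hop_length=32)
    chroma = librosa.feature.chroma_stft(y=y, sr=fs, n_fft=64, hop_length=32)
    # Average each coefficient over time so every trace yields a
    # fixed-length vector (13 MFCC + 12 chroma values).
    return np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1)])
```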

SLIDE 14

Vibration Feature Derivation

• Unique and hard-to-forge vibration features
  - Statistical features in time and frequency domains
  - Acoustic features derived from motion sensor data: MFCC and chroma vectors
• Nonlinear relationship between audio features and vibration features
• Feature selection based on statistical analysis

[Figure: "show facebook.com"]

SLIDE 15

Feature Selection Based on Statistical Analysis


SLIDE 16

Hidden Voice Command Detection

• Supervised learning-based methods
  - Simple Logistic
  - Support Vector Machine
  - Random Forest
  - Random Tree
• Unsupervised learning-based methods (sketched below)
  - k-means/k-medoids based clustering
  - Calculate the Euclidean distance of voice command samples to the cluster centroid
  - Do not require much training
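A minimal sketch of the unsupervised detector, under the assumption that benign commands form a tight cluster in vibration-feature space and that the distance threshold is tuned on held-out data:

```python
# Unsupervised detection sketch: distance to the benign-command centroid.
import numpy as np
from sklearn.cluster import KMeans

def fit_benign_centroid(benign_features: np.ndarray) -> np.ndarray:
    """Cluster the benign training samples and return their centroid."""
    return KMeans(n_clusters=1, n_init=10).fit(benign_features).cluster_centers_[0]

def is_hidden_command(features: np.ndarray, centroid: np.ndarray,
                      threshold: float) -> bool:
    """Flag a sample whose Euclidean distance to the benign centroid
    exceeds a threshold picked on held-out data."""
    return np.linalg.norm(features - centroid) > threshold
```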

SLIDE 17

Experimental Setup

• Front-end playback setup
  - 4 different smartphones
  - Placed on a table, held by hand, or placed on a sofa
• Back-end playback setup
  - Imitated cloud service device
  - Prototype on a Raspberry Pi
• 10 voice commands, 5 speakers
• 13,000 vibration data traces
  - 6,500 benign commands
  - 6,500 hidden voice commands

[Photos: front-end playback setup (phone placed on table, placed on sofa, held by hand; on-board speaker and motion sensors) and back-end playback setup (Raspberry Pi with a Logitech S120 loudspeaker)]

SLIDE 18

Performance Evaluation: Unsupervised Learning

Up to 99% accuracy in both the front-end and back-end setups in differentiating normal commands from hidden voice commands.

[Result plots: front-end playback setup and back-end playback setup]

SLIDE 19

Performance Evaluation

• Partial playback to reduce delay
• Various mobile device usage scenarios in the front-end playback setup

[Result plots: front-end playback setup and back-end playback setup]

SLIDE 20

Take-aways

• Demonstrated that hidden voice commands can be detected by their speech features in the vibration domain
• Derived unique vibration features (statistical features in the time and frequency domains, plus speech features) to distinguish hidden voice commands from normal commands
• Developed both supervised and unsupervised learning-based systems to detect hidden voice commands
• Implemented the proposed system in two modes: front-end playback and back-end playback

SLIDE 21


Practical Adversarial Attacks Against Speaker Recognition Systems

DAISY

Data Analysis and Information SecuritY Lab

SLIDE 22

What's Speaker Recognition?

• Speaker Recognition (SR): "Who is this?"

[Figure: an utterance is scored against the enrolled speakers (e.g., scores 95, 40, 60) and the top-scoring speaker is returned as the result]

• Applications
  - Smartphone
  - Telephone banking
  - Access control

SLIDE 23

Attack Chances on Speaker Recognition

• Trend in speaker recognition
  - Adopting Deep Neural Networks (DNNs) for better performance [1]
• DNNs are vulnerable to adversarial examples [2, 3]

[Figure: a benign image plus an imperceptible perturbation is recognized as a gibbon instead of a panda [2]; a physically perturbed stop sign is recognized as a Speed Limit 45 sign [3]]

[1] Mitchell McLaren, Yun Lei, and Luciana Ferrer. Advances in deep neural network approaches to speaker recognition. In IEEE ICASSP 2015.
[2] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv:1412.6572 (2014).
[3] Kevin Eykholt et al. Robust physical-world attacks on deep learning visual classification. In IEEE CVPR 2018.

SLIDE 24


Limitation of Existing Attacks

• Speaker recognition pipeline

[Diagram: microphone → model → classifier]

SLIDE 25


Limitation of Existing Attacks

[Diagram: microphone → model → classifier]

• Conventional attacks
  - Replay attack, synthesis attack, voice conversion attack
  - Pros: injected via the physical channel
  - Cons: can be defended by modern SR models [4, 5]

[4] Hong Yu, Zheng-Hua Tan, Yiming Zhang, Zhanyu Ma, and Jun Guo. DNN filter bank cepstral coefficients for spoofing detection. IEEE Access 5 (2017), 4779–4787.
[5] Zhizheng Wu, Tomi Kinnunen, Eng Siong Chng, Haizhou Li, and Eliathamby Ambikairajah. A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case. In IEEE APSIPA ASC 2012, 1–5.
SLIDE 26


Limitation of Existing Attacks

[Diagram: microphone → model → classifier]

• Adversarial attack
  - Leverages adversarial examples
  - Pros: strong; can fool state-of-the-art models
  - Cons: succeeds in the digital domain but is sensitive to over-the-air distortions

Our goal: design a practical over-the-air adversarial attack against a state-of-the-art speaker recognition system.

SLIDE 27

Contribution

• First practical adversarial attack against a multi-class SR system
• Use the estimated room impulse response to launch an over-the-air attack
• Implement gradient-based algorithms to make the attack unnoticeable
• Evaluate on a public dataset of 109 English speakers

SLIDE 28


Threat Model

[Diagram: a legitimate user speaks to the SR model and is correctly identified among the enrolled speakers]

SLIDE 29


Threat Model

Untargeted Attack

[Diagram: a legitimate user's adversarial audio is misrecognized by the SR model as some other (hidden) speaker]

SLIDE 30


Threat Model

[Diagram: an imposter speaks to the SR model and is not identified as an enrolled speaker]

SLIDE 31


Threat Model

Targeted Attack

[Diagram: an imposter's adversarial audio is recognized by the SR model as a chosen target speaker]

SLIDE 32


Target Model

• X-vector [6]
  - The state-of-the-art DNN-based multi-class speaker recognition model
  - Components:
    - Mel Frequency Cepstral Coefficients (MFCC) feature extraction
    - DNN embedding model: time-delay neural network layers, statistics pooling, and an embedding layer (sketched below)
    - Probabilistic Linear Discriminant Analysis (PLDA) classifier

[Pipeline: input audio → MFCC feature extraction → time-delay neural network layers → statistics pooling → embedding → PLDA classifier scores against enrolled speaker profiles → identified speaker]

[6] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In IEEE ICASSP 2018.
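A minimal PyTorch sketch of the x-vector architecture named above (TDNN layers over MFCC frames, statistics pooling, an embedding layer). Layer sizes are illustrative rather than the exact Kaldi recipe, and the final linear classifier stands in for PLDA scoring.

```python
# Illustrative x-vector-style model, not the pretrained Kaldi recipe.
import torch
import torch.nn as nn

class XVectorSketch(nn.Module):
    def __init__(self, n_mfcc=30, emb_dim=512, n_speakers=109):
        super().__init__()
        self.tdnn = nn.Sequential(  # time-delay layers as dilated 1-D convs
            nn.Conv1d(n_mfcc, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.embed = nn.Linear(2 * 1500, emb_dim)   # after stats pooling
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, mfcc):            # mfcc: (batch, n_mfcc, frames)
        h = self.tdnn(mfcc)
        # Statistics pooling: concatenate per-channel mean and std over time.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.embed(stats)         # the "x-vector"
        return self.classifier(emb)     # stand-in for PLDA scoring
```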

SLIDE 33

Problem Formulation

• Threat model: white-box

[Pipeline annotated with notation: input audio $Y$ → MFCC feature extraction → DNN embedding model $g(\cdot)$ (time-delay layers, statistics pooling, embedding) → PLDA classifier → probability vector $Q$ → identified speaker]

• Notation
  - Embedding model $g: Y \to Q$
  - Input audio $Y$ with original label $z$
  - Probability vector $Q = [q_1, \dots, q_n]$
• Untargeted attack
  - Find the minimal perturbation $\varepsilon$ s.t. $\arg\max(g(Y + \varepsilon)) \neq z$
• Targeted attack
  - Find the minimal perturbation $\varepsilon$ s.t. $\arg\max(g(Y + \varepsilon)) = z_t$

SLIDE 34


Attack Overview

• Untargeted attack: add adversarial noise, derived from the gradient of the loss with respect to the input, to the original audio convolved with the RIR; the resulting adversarial example is played over the air and the speaker recognition system outputs an incorrect speaker.
• Targeted attack: the original audio convolved with the RIR plus the adversarial noise is fed to the speaker recognition system, and the noise is updated via gradient descent until the predicted speaker is the target speaker.

SLIDE 35

Room Impulse Response Estimation

• Room Impulse Response (RIR) $h(u)$
  - Models the transfer function between the played audio $y(u)$ and the received audio $z(u)$:
    $$z(u) = y(u) \otimes h(u)$$
• RIR estimation (sketched in code below)
  - Play an excitation signal $y_s(u)$
  - Record the response $z_s(u)$
  - Estimate the RIR as $h(u) = z_s(u) \otimes g(u)$, where $g(u)$ is the time-reversal of $y_s(u)$
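A sketch of this estimation procedure, assuming an exponential sine sweep as the excitation signal (20 Hz to 20 kHz over 5 s, matching the next slide) and leaving the actual play/record I/O as hypothetical calls:

```python
# RIR estimation sketch via excitation sweep and time-reversed convolution.
import numpy as np
from scipy.signal import chirp, fftconvolve

fs = 48000                                    # assumed sampling rate
t = np.linspace(0, 5.0, 5 * fs, endpoint=False)
y_s = chirp(t, f0=20, f1=20000, t1=5.0, method="logarithmic")  # excitation

def estimate_rir(z_s, y_s):
    """h(u) = z_s(u) convolved with g(u), g(u) = time-reversed y_s(u)."""
    g = y_s[::-1]
    return fftconvolve(z_s, g, mode="full")

# z_s = record_while_playing(y_s)             # hypothetical I/O call
# h = estimate_rir(z_s, y_s)
# predicted = fftconvolve(original_audio, h, mode="same")
```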

SLIDE 36

Room Impulse Response Estimation

• Preliminary experiment
  - Excitation sweep from $f = 20\ \text{Hz}$ to $20\ \text{kHz}$, duration $T = 5\ \text{s}$
  - Measured Mean Square Error (MSE):
    - Recorded vs. predicted (with RIR): 0.112
    - Original vs. recorded: 0.84

[Figure: original signal, predicted signal (with RIR), and recorded signal]

SLIDE 37

Adversarial Example Generation

• Untargeted attack
  - Due to the local linearity of DNN models, a linear perturbation is sufficient for untargeted attacks [7]
  - Digital untargeted adversarial example
  - Practical untargeted adversarial example (see the reconstructed formulas below)

[7] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv:1412.6572 (2014).
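The slide lists the digital and practical variants without their formulas; here is a plausible reconstruction following the fast gradient sign method of [7], under the assumption that the practical variant back-propagates through the estimated RIR $h(u)$:

```latex
% Assumed reconstruction, not the slide's exact formulas.
% J is the classification loss, \alpha a small step size.
\varepsilon = \alpha \cdot \operatorname{sign}\big(\nabla_Y J(g(Y), z)\big),
\qquad Y' = Y + \varepsilon
\quad \text{(digital untargeted example)}

\varepsilon = \alpha \cdot \operatorname{sign}\big(\nabla_Y J(g(Y \otimes h), z)\big)
\quad \text{(practical: gradient taken through the RIR } h\text{)}
```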

SLIDE 38

Adversarial Example Generation

• Targeted attack
  - An adversarial example targeting label $z_t$ can be generated by solving an optimization problem (see the reconstruction below)
  - Lagrangian relaxation, then gradient descent to find the optimal $\varepsilon^*$
  - Digital targeted adversarial example: $Y' = Y + \varepsilon^*$
  - Practical targeted adversarial example
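The slide omits the optimization itself; here is a plausible reconstruction, with the loss $L$, target label $z_t$, and trade-off constant $c$ as assumed notation:

```latex
% Assumed reconstruction of the omitted optimization problem.
\min_{\varepsilon} \ \|\varepsilon\|_2
\quad \text{s.t.} \quad \arg\max\big(g(Y + \varepsilon)\big) = z_t
\quad \text{(original problem)}

% Lagrangian relaxation, solved by gradient descent for \varepsilon^*:
\min_{\varepsilon} \ L\big(g(Y + \varepsilon),\, z_t\big) + c\,\|\varepsilon\|_2^2
```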

SLIDE 39

Experimental Methodology

• Dataset
  - CSTR VCTK Corpus
  - 44,217 utterances in total, spoken by 109 English speakers with various accents; training/testing ratio 4:1
• Baseline model
  - 30-dimensional MFCC with a frame length of 25 ms
  - Pretrained x-vector model provided in Kaldi [8]
• Evaluation metrics
  - Speaker recognition accuracy (%)
  - Attack success rate (%)
  - Distortion metric (dB); see the note below

[8] Daniel Povey et al. The Kaldi Speech Recognition Toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.
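The slides do not define the distortion metric; one common convention for audio adversarial examples, consistent with the negative dB value reported later, measures the perturbation's peak level relative to the original audio:

```latex
% Assumed definition (not stated on the slides): perturbation peak level
% relative to the original audio, in dB; more negative = less audible.
D(\varepsilon, Y) = 20 \log_{10} \frac{\max_u |\varepsilon(u)|}{\max_u |Y(u)|}
```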

SLIDE 40

Evaluation of Digital Attacks

• Digital untargeted attack
  - Test set: 8,896 audio files
• Digital targeted attack
  - Tested on all original-target speaker combinations (109 × 108 pairs in total)

SLIDE 41

Evaluation of Practical Attack

• Experimental setup
  - Two realistic scenarios: office and apartment
  - 10 digital/practical targeted adversarial examples tested in each scenario

SLIDE 42

Audio Samples

• Making Speaker #1 recognized as Speaker #20
  - Original audio: recognized as Speaker #1
  - Practical adversarial audio: misrecognized as Speaker #20; measured distortion: −42.35 dB
  - Genuine speech from Speaker #20

SLIDE 43

Take-aways

• We demonstrate a practical and systematic adversarial attack against DNN-based speaker recognition systems
• Apply gradient-based algorithms to launch both untargeted and targeted attacks
• Integrate the estimated RIR into the adversarial example generation for a more practical attack
• Conduct extensive experiments in both digital and real-world settings

SLIDE 44


Future Work: Security Issues of Voice Recognition Systems at the Edge

• An attacker could control your smart home
SLIDE 45


Future Work: Security Issues of Augmented Reality (AR) Systems

• An attacker could control your 'reality'
SLIDE 46

Thanks to my collaborators and students