 
DAISY: Data Analysis and Information Security Lab

Vulnerabilities of Voice Assistants at the Edge: From Defeating Hidden Voice Attacks to Audio-based Adversarial Attacks

Yingying (Jennifer) Chen
Professor, Electrical and Computer Engineering Department
Associate Director, WINLAB
Director, Data Analysis and Information Security (DAISY) Lab
Rutgers University, New Brunswick, NJ, USA
yingche@scarletmail.rutgers.edu
http://www.winlab.rutgers.edu/~yychen/

IEEE ICNP Workshop AIMCOM2, October 13, 2020

Wireless Information Network Laboratory (WINLAB)
- Industry-university research center founded in 1989
  - Focus on wireless technology
- Hosting world-class researchers
  - 20 faculty members from different departments
  - 45 PhD students
- Active research directions:
  - Mobile ad hoc networks (MANET) for tactical applications
  - Mesh network protocols
  - Delay tolerant networks (DTN)
  - Software defined networks
  - Mobile content delivery
  - Wireless network security

Open-Access Research Testbed for Next-Generation Wireless Networks (ORBIT)
[Photos: USRP radio board; ORBIT nodes; control room]
- 400-USRP open-access research testbed
- Funded by NSF since 2003 with $12M
- Research applications:
  - 5G mmWave
  - Mobile edge cloud and future mobile Internet
  - Healthcare IT and Internet of Things (IoT)
  - Mobile sensing and user behavior recognition
  - Network coding and spectrum management
  - Vehicular networking

Cloud Enhanced Open Software Defined Mobile Wireless Testbed for City-Scale Deployment (COSMOS)
- Funded by NSF PAWR with $22M in 2018 to deploy a 5G network testbed
- Led by Rutgers in collaboration with Columbia University, New York University, and University of Arizona
- Focus on 5G technologies
  - Ultra-high bandwidth and low latency wireless communication
- Tightly coupled with edge cloud computing
  - Deployment in New York City
  - Fiber connection to Rutgers, GENI/I2, NYU
  - 9 large sites and 40 medium sites
  - Interaction with smart community
  - 200 small nodes to support edge computing
- Research applications:
  - Ultra-high bandwidth, low latency, and powerful edge computing
  - Future mobile Internet and mobile edge cloud
  - Healthcare IT and Internet of Things (IoT)
  - AR and VR
  - Vehicular networking

Defeating Hidden Audio Channel Attacks on Edge Voice Assistants - via Audio-Induced Surface Vibrations

Motivation
- Widely deployed voice controllable systems (VCS) at the edge
  - Convenient way of interaction
  - Integrated into many platforms
    - Smart appliances
    - Mobile phones (e.g., Siri and Google Now)
    - Stand-alone assistants
- Fundamental vulnerabilities due to the propagation properties of sound
- Emerging hidden voice commands
  - Recognizable to VCS
  - Incomprehensible to humans

Hidden Voice Command
- Exploits the disparities in voice recognition between humans and machines
- Iteratively shapes the audio features of a command to meet two requirements:
  - Understandable to VCSs
  - Hard for users to perceive
- Attack model
  - Internal attack: embedded in media and played by the target device
  - External attack: played via a loudspeaker in the proximity
[Figure: generation loop with the blocks normal voice command, MFCC feature extraction, adjusting MFCC parameters, inverse MFCC, and candidate obfuscated command; each candidate passes two Yes/No checks, "recognized by the speech recognition system" and "recognized by human attacker", before it becomes a hidden voice command. Example commands: "browse evil.com", "call 911"]
10/28/20

Related Work
- Defending against acoustic attacks based on audio information
  - Voice authentication models
    - Gaussian Mixture Models
    - i-vector models
  - Speech vocal features (e.g., )
  - Limitation: relying only on speech audio features is vulnerable to hidden voice commands
- Speaker liveness detection
  - Articulatory gesture
  - Proximity detection leveraging a second microphone (e.g., on a wearable)
  - Limitation: restricted application scenarios, requiring either the microphone to be held close to the mouth or additional dedicated hardware
- A multi-modality authentication framework is highly desirable to provide enhanced security: audio sensing modality + vibration sensing modality

Basic Idea
- Many VCS devices (e.g., smartphones and voice assistant systems) are already equipped with motion sensors
- Unique audio-induced surface vibrations captured by the motion sensor are hard to forge
- Two modes for capturing noticeable speech impact on motion sensors based on playback
  - Front-end playback
  - Back-end playback
- Basic idea: utilize the vibration signatures of the voice command to detect hidden voice commands
[Figure: replay device in cloud service (back-end playback); mobile device or HomePod with motion sensor and speaker (front-end playback)]

Capturing Voice Using Motion Sensors
- Shared surface between loudspeaker and microphone
- Low sampling rate motion sensors (e.g., < 200 Hz)
  - Lead to aliased vibration signals
- Nonlinear vibration responses
- Distinct vibration domain
[Figure: played audio "show facebook.com"; down-sampled mic data; accelerometer data; vibration responses]

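The aliasing point on this slide can be illustrated with a small calculation. The sketch below is not from the talk: it assumes an ideal sampler, a pure tone as input, and an illustrative 200 Hz sampling rate (the slide only states "< 200 Hz"). The function name and the tone frequencies are hypothetical.

```python
import math

def aliased_frequency(f_signal_hz, f_sample_hz):
    """Apparent frequency (Hz) of a pure tone after ideal sampling at f_sample_hz."""
    # Fold the tone against the nearest integer multiple of the sampling rate.
    nearest_multiple = math.floor(f_signal_hz / f_sample_hz + 0.5)
    return abs(f_signal_hz - nearest_multiple * f_sample_hz)

# Assumed 200 Hz accelerometer rate; tones above 100 Hz (Nyquist) fold back.
for tone_hz in (90, 150, 440):
    print(tone_hz, "Hz ->", aliased_frequency(tone_hz, 200), "Hz")
```

Under these assumptions a 440 Hz tone appears at 40 Hz in the accelerometer trace, which is one way to see why speech energy well above the sensor's sampling rate still leaves a signature in the vibration domain.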
Why Vibration?
- Existing speech/voice recognition methods are based on audio domain vocal features
- Hidden voice commands are designed to duplicate these audio domain features by iteratively modifying a voice command
- Audio-induced surface vibrations
  - An additional sensing domain, distinct from audio
  - Hard to forge from audio signals in software
  - Similar audio features result in distinct vibration features
  - Resulting vibration responses are device-dependent (device physical vibrations, motion sensors)
- The vibration domain approach can work in conjunction with the audio domain approach to detect hidden voice commands more effectively

System Overview
[Figure: system architecture; components as labeled in the diagram]
- Playback setups
  - Frontend playback: mobile device or HomePod (motion sensor, speaker)
  - Backend playback: replay device in cloud service
- Accelerometer readings
  - Data calibration
  - Vibration noise removal
  - Voice command segmentation
- Vibration feature derivation
  - Time/frequency domain statistical features
  - Acoustic features (MFCC, chroma vector)
  - Feature normalization
- Vibration feature selection
  - Statistical analysis based selection
- Hidden voice command detection
  - Supervised learning-based classifier: Simple Logistic, SMO, Random Forest, Random Tree
  - Unsupervised learning-based classifier: k-means, k-medoid

Vibration Feature Derivation
- Unique and hard to forge
  - Statistical features in time and frequency domains
  - Deriving acoustic features from motion sensor data
    - MFCC
    - Chroma vectors
- Nonlinear relationship between audio features and vibration features
[Figure: "Show facebook.com" in the audio domain and the vibration domain, comparing the human command with the hidden voice command (hvc)]

Vibration Feature Derivation
- Unique and hard to forge vibration features
  - Statistical features in time and frequency domains
  - Deriving acoustic features from motion sensor data
    - MFCC
    - Chroma vectors
- Nonlinear relationship between audio features and vibration features
- Feature selection based on statistical analysis
[Figure: "Show facebook.com"]

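As a rough illustration of "statistical features in time and frequency domains", the sketch below computes a few such statistics from one accelerometer axis. The slides do not list the exact features, so the choice here (mean, standard deviation, min, max, RMS, dominant frequency, spectral centroid) is an assumption, as are all function names. It uses a naive DFT from the standard library only, which is adequate for short traces.

```python
import cmath
import math
import statistics

def time_features(samples):
    """Basic time-domain statistics of one accelerometer axis (assumed feature set)."""
    return {
        "mean": statistics.fmean(samples),
        "std": statistics.pstdev(samples),
        "min": min(samples),
        "max": max(samples),
        "rms": math.sqrt(sum(x * x for x in samples) / len(samples)),
    }

def magnitude_spectrum(samples):
    """DFT magnitudes for bins 0..n//2, computed with a naive O(n^2) DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                  for i, x in enumerate(samples))
        spectrum.append(abs(acc))
    return spectrum

def freq_features(samples, fs_hz):
    """Frequency-domain statistics (assumed feature set) for a trace sampled at fs_hz."""
    mags = magnitude_spectrum(samples)
    bin_hz = fs_hz / len(samples)          # width of one DFT bin in Hz
    total = sum(mags)
    peak_bin = max(range(len(mags)), key=mags.__getitem__)
    centroid = sum(k * bin_hz * m for k, m in enumerate(mags)) / total if total else 0.0
    return {"dominant_hz": peak_bin * bin_hz, "spectral_centroid_hz": centroid}

# Synthetic stand-in for a vibration trace: a 25 Hz sine sampled at 200 Hz for 1 s.
trace = [math.sin(2 * math.pi * 25 * i / 200) for i in range(200)]
print(time_features(trace))
print(freq_features(trace, fs_hz=200))
```

In a real pipeline the trace would come from the calibrated, noise-filtered, and segmented accelerometer readings, and the resulting feature vectors would be normalized before selection and classification.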
Feature Selection Based on Statistical Analysis

Hidden Voice Command Detection
- Supervised learning-based method
  - Simple Logistic
  - Support Vector Machine
  - Random Forest
  - Random Tree
- Unsupervised learning-based method
  - k-means/k-medoids based methods
  - Calculating the Euclidean distance of the voice command samples to the cluster centroid
  - Does not require much training

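The unsupervised option can be sketched with a plain k-means over vibration feature vectors, assigning a new command to the nearest centroid by Euclidean distance. This is a minimal sketch, not the system from the talk: the two-cluster setup, the deterministic farthest-point initialization, the function names, and the toy 2-D feature vectors are all assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k=2, iterations=20):
    """Lloyd's k-means with a deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda j: euclidean(p, centroids[j])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(mean_vector(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def nearest_cluster(sample, centroids):
    """Index of the centroid closest to the sample."""
    return min(range(len(centroids)), key=lambda j: euclidean(sample, centroids[j]))

# Toy feature vectors (hypothetical values): first three benign, last three hidden.
features = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids, labels = kmeans(features, k=2)
print(labels, nearest_cluster([4.8, 5.3], centroids))
```

Which cluster corresponds to hidden voice commands still has to be decided, for example from a handful of known benign samples, which fits the slide's remark that this route needs little training.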
Experimental Setup
- Front-end playback setup
  - 4 different smartphones
  - Placed on table
  - Held by hand
  - Placed on sofa
- Back-end playback setup
  - Imitated cloud service device
  - Prototype on Raspberry Pi
- 10 voice commands, 5 speakers
- 13,000 vibration data traces
  - 6,500 benign commands
  - 6,500 hidden voice commands
[Photos: front-end playback setup (placed on table, held by hand, placed on sofa); back-end playback setup (Raspberry Pi, on-board speaker, Logitech S120 loudspeaker, on-board motion sensors)]

Performance Evaluation
[Figures: unsupervised-learning results for the front-end playback setup and the back-end playback setup]
- Up to 99% accuracy for both frontend and backend setups in differentiating normal commands from hidden voice commands

Performance Evaluation
- Partial playback to reduce delay
  [Figures: front-end playback setup; back-end playback setup]
- Various mobile device usage scenarios of the frontend playback setup

Take-aways
- Demonstrate that hidden voice commands can be detected by their speech features in the vibration domain
- Derive the unique vibration features (statistical features in the time and frequency domains, and speech features) to distinguish hidden voice commands from normal commands
- Develop both supervised and unsupervised learning-based systems to detect hidden voice commands
- Implement the proposed system in two modes: frontend playback and backend playback

Practical Adversarial Attacks Against Speaker Recognition Systems

What's Speaker Recognition?
- Speaker Recognition (SR)
  [Figure: an unknown voice ("Who is this?") is scored against the enrolled speakers; example scores 95, 40, and 60 lead to the result]
- Applications
  - Smartphone
  - Telephone banking
  - Access control

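The scoring step in the figure can be read as speaker identification: pick the enrolled speaker with the highest score. The sketch below only illustrates that reading. The scores 95, 40, and 60 come from the slide, while the speaker names, the acceptance threshold, and the function itself are hypothetical.

```python
def identify_speaker(scores, threshold):
    """Return the best-scoring enrolled speaker, or None if no score reaches threshold."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Scores from the slide; speaker names and the threshold of 80 are assumptions.
scores = {"speaker_a": 95, "speaker_b": 40, "speaker_c": 60}
print(identify_speaker(scores, threshold=80))
```

The threshold stands in for the decision a deployed system would make about rejecting voices that match no enrolled speaker well enough.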