Learning-based Practical Smartphone Eavesdropping with Built-in Accelerometer


SLIDE 1

Learning-based Practical Smartphone Eavesdropping with Built-in Accelerometer

Authors: Zhongjie Ba, Tianhang Zheng, Xinyu Zhang, Zhan Qin, Baochun Li, Xue Liu, and Kui Ren
Presenter: Shiqing Luo

SLIDE 2

Smartphone Sensors

  • Motion sensors: gyroscope, accelerometer
  • Magnetic sensor: magnetometer
  • Voice sensor: microphone
  • Image sensor: camera

(Figure legend: permission required vs. no permission needed; the accelerometer is called out.)

SLIDE 3

Motion Sensor Threat to Speech Privacy

  • A smartphone gyroscope can pick up surface vibrations induced by an independent loudspeaker placed on the same table (Michalevsky et al., USENIX Security 2014).

  • Gyroscopes are (lousy but still) microphones.
  • Very low signal to noise ratio
  • Low sampling frequency

Speaker             Speaker Identification   Digits Recognition
Mixed female/male   50%                      17%
Female speakers     45%                      26%
Male speakers       65%                      23%

SLIDE 4

Motion Sensor Threat to Speech Privacy

  • Only loudspeaker-rendered speech signals traveling through a solid surface can create noticeable impacts on motion sensors (Anand et al., S&P 2018).

(Figure: gyroscope and accelerometer responses to speech traveling through a shared surface and through air.)

The threat does not go beyond the Loudspeaker-Same-Surface setup studied by Michalevsky et al.

SLIDE 5

Commonly Believed Limitations

  • Can only pick up a narrow band of speech signals
  • Android has a sampling ceiling of 200 Hz
  • iOS has a sampling ceiling of 100 Hz
  • Does not go beyond the Loudspeaker-Same-Surface setup
  • Very low SNR (Signal-to-Noise Ratio)
  • Sensitive to sound angle of arrival

Fundamental frequency range of human speech: 85-180 Hz (adult male) and 165-255 Hz (adult female)

SLIDE 6

Our Observations: Sampling Frequency

Sampling frequencies supported by Android [1]:

Delay option           Delay    Sampling rate
SENSOR_DELAY_NORMAL    200 ms   5 Hz
SENSOR_DELAY_UI        60 ms    16.7 Hz
SENSOR_DELAY_GAME      20 ms    50 Hz
SENSOR_DELAY_FASTEST   0 ms     AFAP (as fast as possible)

Model            Year   Sampling rate
Moto G4          2016   100 Hz
Samsung J3       2016   100 Hz
LG G5            2016   200 Hz
Huawei Mate 9    2016   250 Hz
Samsung S8       2017   420 Hz
Google Pixel 3   2018   410 Hz
Huawei P20 Pro   2018   500 Hz
Huawei Mate 20   2018   500 Hz

  • The actual sampling rates of motion sensors are determined by the performance of the smartphone.
  • Accelerometers on recent smartphones can cover almost the entire fundamental frequency band (85-255 Hz) of adult speech.

The 200 Hz sampling ceiling no longer exists.

[1] "Sensor Overview," https://developer.android.com/guide/topics/sensors/sensors_overview.
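
The coverage claim can be checked with a rough calculation. The sketch below assumes the standard Nyquist limit (a sampler at rate fs captures frequencies up to fs/2 without aliasing), which is textbook sampling theory and not a statement from the slides. The helper names are illustrative only.

```python
# Sampling rates (Hz) copied from the phone table on this slide.
ACCEL_RATES_HZ = {
    "Moto G4": 100, "Samsung J3": 100, "LG G5": 200, "Huawei Mate 9": 250,
    "Samsung S8": 420, "Google Pixel 3": 410,
    "Huawei P20 Pro": 500, "Huawei Mate 20": 500,
}

# Fundamental frequency band of adult speech quoted on the slides.
F0_LOW_HZ, F0_HIGH_HZ = 85.0, 255.0


def nyquist_hz(sampling_rate_hz):
    """Highest frequency representable without aliasing."""
    return sampling_rate_hz / 2.0


def f0_band_coverage(sampling_rate_hz):
    """Fraction of the 85-255 Hz band that lies below the Nyquist limit."""
    top = min(nyquist_hz(sampling_rate_hz), F0_HIGH_HZ)
    return max(0.0, top - F0_LOW_HZ) / (F0_HIGH_HZ - F0_LOW_HZ)


if __name__ == "__main__":
    for model, rate in ACCEL_RATES_HZ.items():
        print(f"{model:15s} {rate:4d} Hz  covers {f0_band_coverage(rate):.0%}")
```

Under this assumption, a 500 Hz accelerometer reaches 250 Hz and therefore covers nearly all of the band, while a 100 Hz one covers none of it.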

SLIDE 7

Our Observations: New Setup

  • Employs a smartphone’s accelerometer to eavesdrop on the speaker in the same smartphone.
  • Much higher SNR
  • Sound always arrives from the same direction

(Figure: accelerometer readings along the x-axis and z-axis.)

SLIDE 8

Our Observations: New Setup

  • A smartphone speaker is more likely to reveal sensitive information than an independent loudspeaker.

SLIDE 9

Threat Model

(Figure: the table setting and the handhold setting.)

SLIDE 10

Accelerometer-based Smartphone Eavesdropping

  • Preprocessing: convert acceleration signals into spectrograms.
  • Speech Recognition: convert spectrograms to text.
  • Speech Reconstruction: reconstruct voice signals from spectrograms.

SLIDE 11

Preprocessing

  • Problems in Raw Acceleration Signals
  • Raw accelerometer measurements are not sampled at a fixed interval.
  • Raw accelerometer measurements can be distorted by human movement.
  • Raw accelerometer measurements capture multiple digits and need to be segmented.

Time (ms)   x-axis (m/s²)   y-axis (m/s²)   z-axis (m/s²)
1           -0.2130         -0.1410         10.0020
2           -0.1870         -0.1440         9.9970
3           -0.2110         -0.1510         9.9970
5           -0.2110         -0.1410         10.0070
8           -0.2080         -0.1340         10.0120
10          -0.2150         -0.1320         10.0070

SLIDE 12

Step 1: Generate Sanitized Single-word Signals

  • Interpolation
  • Upsample accelerometer signals to 1000 Hz using linear interpolation.

SLIDE 13

Step 1: Generate Sanitized Single-word Signals

After interpolation (one sample per millisecond, i.e. 1000 Hz):

Time (ms)   x-axis (m/s²)   y-axis (m/s²)   z-axis (m/s²)
1           -0.2130         -0.1410         10.0020
2           -0.1870         -0.1440         9.9970
3           -0.2110         -0.1510         9.9970
4           -0.2110         -0.1460         10.0020
5           -0.2110         -0.1410         10.0070
6           -0.2100         -0.1387         10.0087
7           -0.2090         -0.1363         10.0103
8           -0.2080         -0.1340         10.0120
9           -0.2115         -0.1330         10.0095
10          -0.2150         -0.1320         10.0070
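
The interpolation step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it resamples the z-axis readings from the slide onto a 1 ms grid, and the function name and `step_ms` parameter are illustrative.

```python
# Irregular timestamps (ms) and z-axis readings (m/s^2) from the slide.
TIMES_MS = [1, 2, 3, 5, 8, 10]
Z_AXIS = [10.0020, 9.9970, 9.9970, 10.0070, 10.0120, 10.0070]


def linear_interpolate(times_ms, values, step_ms=1):
    """Resample irregular samples onto a uniform grid by linear interpolation.

    Expects at least two samples with strictly increasing timestamps.
    """
    grid, resampled = [], []
    seg = 0  # index of the left endpoint of the current segment
    t = times_ms[0]
    while t <= times_ms[-1]:
        # Move to the segment [times_ms[seg], times_ms[seg + 1]] containing t.
        while seg < len(times_ms) - 2 and t > times_ms[seg + 1]:
            seg += 1
        t0, t1 = times_ms[seg], times_ms[seg + 1]
        v0, v1 = values[seg], values[seg + 1]
        frac = (t - t0) / (t1 - t0)
        grid.append(t)
        resampled.append(v0 + frac * (v1 - v0))
        t += step_ms
    return grid, resampled


GRID_MS, Z_RESAMPLED = linear_interpolate(TIMES_MS, Z_AXIS)

if __name__ == "__main__":
    for t, z in zip(GRID_MS, Z_RESAMPLED):
        print(f"{t:2d} ms  {z:.4f}")
```

The values produced at 4, 6, 7 and 9 ms match the z-axis entries that the slide fills in.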

SLIDE 14

Step 1: Generate Sanitized Single-word Signals

Fundamental frequency range of human speech: 85-180 Hz (adult male) and 165-255 Hz (adult female)

  • High-pass filter
  • Convert the acceleration signal along each axis to the frequency domain and eliminate frequency components below 80 Hz.
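
A frequency-domain high-pass filter of this kind can be sketched as follows. This is a minimal illustration using a naive O(n²) DFT from the standard library so that it stays self-contained; a real pipeline would use an FFT library. Function names are illustrative, and zeroing bins outright is an assumption about how the components are "eliminated".

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform (fine for short signals)."""
    n = len(samples)
    return [sum(samples[k] * cmath.exp(-2j * math.pi * b * k / n)
                for k in range(n)) for b in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[b] * cmath.exp(2j * math.pi * b * k / n)
                for b in range(n)).real / n for k in range(n)]


def highpass(samples, fs_hz, cutoff_hz=80.0):
    """Zero every frequency bin below cutoff_hz and transform back."""
    n = len(samples)
    spectrum = dft(samples)
    for b in range(n):
        # Bins above n/2 are the mirrored negative frequencies.
        freq_hz = min(b, n - b) * fs_hz / n
        if freq_hz < cutoff_hz:
            spectrum[b] = 0
    return idft(spectrum)


if __name__ == "__main__":
    fs = 1000
    # Gravity offset + 20 Hz motion + 150 Hz speech-band component.
    x = [9.8 + math.sin(2 * math.pi * 20 * k / fs)
         + 0.5 * math.sin(2 * math.pi * 150 * k / fs) for k in range(200)]
    y = highpass(x, fs)
    print(max(abs(v) for v in y))  # close to 0.5: only the 150 Hz part remains
```

The cutoff also removes the constant gravity component (0 Hz) along with slow human movement.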

SLIDE 15

Step 1: Generate Sanitized Single-word Signals

(Figure: acceleration signals in the table setting and the handhold setting.)

SLIDE 16

Step 1: Generate Sanitized Single-word Signals

  • Segmentation
  • Calculate the magnitude of the acceleration signal and smooth the obtained magnitude sequence with a moving average.
  • Locate all regions with magnitudes higher than a threshold.

(Figure: the table setting and the handhold setting.)
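
The segmentation step can be sketched as below. The window length and threshold are placeholders chosen for the toy data, since the slides do not give the values used; the function names are illustrative.

```python
import math


def magnitude(x, y, z):
    """Per-sample magnitude of the (already high-pass-filtered) 3-axis signal."""
    return [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(x, y, z)]


def moving_average(values, window):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def find_segments(smoothed, threshold):
    """Return (start, end) index pairs where smoothed > threshold (end exclusive)."""
    segments, start = [], None
    for i, v in enumerate(smoothed):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(smoothed)))
    return segments


if __name__ == "__main__":
    # Toy signal: two bursts of activity on the z-axis, silence elsewhere.
    z = [0.0] * 10 + [1.0] * 10 + [0.0] * 10 + [1.0] * 5 + [0.0] * 5
    zeros = [0.0] * len(z)
    smoothed = moving_average(magnitude(zeros, zeros, z), window=3)
    print(find_segments(smoothed, threshold=0.5))
```

Each returned index pair marks one candidate single-word region.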

SLIDE 17

Step 2: Generate Spectrogram Images

  • Signal-to-spectrogram conversion
  • Divide the signal into multiple short segments with a fixed overlap.
  • Window each segment with a Hamming window and calculate its spectrum through STFT (Short-Time Fourier Transform).
  • Three spectrograms (one per acceleration axis) can be obtained for each single-word signal.

(Figure: spectrograms in the table setting and the handhold setting.)
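
The conversion can be sketched as a short STFT routine. This is a minimal illustration: the frame length and hop size are placeholders (the slides do not state them), the output is a magnitude spectrogram by assumption, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(samples, frame_len, hop):
    """Magnitude STFT: returns frames[t][b] for time frame t and frequency bin b."""
    window = hamming(frame_len)
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        frames.append([
            abs(sum(frame[k] * cmath.exp(-2j * math.pi * b * k / frame_len)
                    for k in range(frame_len)))
            for b in range(n_bins)
        ])
    return frames


if __name__ == "__main__":
    fs = 1000
    tone = [math.sin(2 * math.pi * 150 * k / fs) for k in range(256)]
    spec = spectrogram(tone, frame_len=64, hop=32)  # 50% overlap
    peak_bin = max(range(len(spec[0])), key=lambda b: spec[0][b])
    print(len(spec), "frames; peak near", peak_bin * fs / 64, "Hz")
```

Running the same routine on each of the x, y and z signals yields the three spectrograms mentioned above.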

SLIDE 18

Step 2: Generate Spectrogram Images

  • Generate Spectrogram-Images
  • Fit the three m x n spectrograms into one m x n x 3 tensor.
  • Take the square root of all the elements in the tensor and map the obtained values to integers between 0 and 255.
  • Export the m x n x 3 tensor as an image in PNG format.

(Figure: spectrogram images in the table setting and the handhold setting.)
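
The tensor construction can be sketched as follows. The slides do not say how values are mapped to 0-255, so scaling by the global maximum is an assumption here. PNG export is left out because it needs an imaging library.

```python
import math


def to_image_tensor(spec_x, spec_y, spec_z):
    """Stack three m x n spectrograms into an m x n x 3 tensor of 0-255 integers."""
    # Square root compresses the dynamic range of the spectrogram values.
    rooted = [[[math.sqrt(v) for v in row] for row in spec]
              for spec in (spec_x, spec_y, spec_z)]
    peak = max(v for spec in rooted for row in spec for v in row)
    scale = 255.0 / peak if peak > 0 else 0.0  # assumed: global max maps to 255
    m, n = len(spec_x), len(spec_x[0])
    return [[[int(round(rooted[c][i][j] * scale)) for c in range(3)]
             for j in range(n)] for i in range(m)]


if __name__ == "__main__":
    x = [[0.0, 1.0], [4.0, 16.0]]
    y = [[1.0, 1.0], [1.0, 1.0]]
    z = [[0.0, 0.0], [0.0, 0.0]]
    tensor = to_image_tensor(x, y, z)
    print(tensor[1][1])  # the three channel values of one pixel
```

Each pixel then carries the x, y and z spectrogram values in its three color channels, which lets an image classifier consume all three axes at once.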

SLIDE 19

Speech Recognition

  • DenseNet:
  • Direct connections between each layer
  • Fewer nodes and parameters
  • Comparable performance with VGG & ResNet

$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$

Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
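
The connectivity rule $x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$ can be illustrated with a toy dense block. The layer function below is a hypothetical stand-in for the real convolutional layers; only the concatenation pattern is the point. "Growth rate" is the DenseNet term for the number of features each layer adds.

```python
def dense_block(x0, layers):
    """Apply layers so that layer l sees the concatenation of x_0 .. x_{l-1}."""
    features = [list(x0)]
    input_widths = []
    for layer in layers:
        concatenated = [v for feature in features for v in feature]
        input_widths.append(len(concatenated))
        features.append(layer(concatenated))
    return features, input_widths


def make_toy_layer(growth_rate):
    """Hypothetical H_l: emits growth_rate values derived from the input mean."""
    def layer(inputs):
        mean = sum(inputs) / len(inputs)
        return [max(0.0, mean + i) for i in range(growth_rate)]
    return layer


if __name__ == "__main__":
    layers = [make_toy_layer(2) for _ in range(3)]
    features, widths = dense_block([1.0, 2.0, 3.0, 4.0], layers)
    print(widths)  # input width grows by the growth rate at every layer
```

With 4 input features and a growth rate of 2, the layers receive 4, 6 and 8 inputs, so each layer can stay narrow while still seeing all earlier features.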

SLIDE 20

Recognition Results

  • Dataset (80% training data and 20% testing data):
  • Digits: 10k single-digit signals from 20 speakers.
  • Digits + Letters: 36*260 single-word signals from 10 speakers.
  • Recognizing digits and letters (common elements in passwords)
  • Recognizing 20 speakers (connects multiple attack results)

Previous SOTA results (traditional ML + gyroscope + Loudspeaker-Same-Surface): 26% on recognizing digits; 50% on recognizing 10 speakers.

(Figure labels: AccDataRec, Audio Player.)

Digits and letters recognition:

Task     Top-1 Acc   Top-3 Acc   SOTA
Digits   78%         96%         26%
D + L    55%         78%         -

Speaker recognition (20 speakers):

Top-1 Acc   Top-3 Acc   SOTA
70%         88%         50% (10 speakers)
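
For readers unfamiliar with the Top-1 and Top-3 columns, the metric can be sketched as follows. The scores and labels are made-up toy values, not data from the presentation.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    # Toy classifier outputs for four samples over three classes.
    scores = [[0.1, 0.7, 0.2],
              [0.5, 0.3, 0.2],
              [0.2, 0.3, 0.5],
              [0.4, 0.35, 0.25]]
    labels = [1, 1, 0, 2]
    for k in (1, 2, 3):
        print(f"top-{k}: {top_k_accuracy(scores, labels, k):.2f}")
```

Top-3 accuracy matters for an attacker because a short list of candidates per digit still narrows a password search considerably.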

SLIDE 21

Hot Word Search

  • Locate and identify pre-trained hot (sensitive) words from sentences.

Example sentence: "Here is my social security number" (in the figure, each word is marked as an insensitive word or a hot word).

Hot word   TPR    FPR
Password   94%    0.4%
Username   97%    0.4%
Social     100%   0.3%
Security   91%    0.0%
Number     88%    0.1%
Email      88%    1.4%
Credit     88%    0.3%
Card       97%    1.4%
  • Speakers
  • Two males and two females
  • Training data
  • 128*8 hot words
  • 2176 insensitive words (negative samples)
  • Testing data
  • 200 short sentences, each of which contains several insensitive words and one to three hot words.
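
The TPR and FPR columns follow the usual definitions: TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The sketch below assumes each hot word is scored one-vs-rest over word segments, which the slides do not spell out. The data is a toy example.

```python
def tpr_fpr(predicted_hot, actual_hot):
    """Return (true positive rate, false positive rate) for one hot word."""
    pairs = list(zip(predicted_hot, actual_hot))
    tp = sum(1 for p, a in pairs if p and a)
    fn = sum(1 for p, a in pairs if not p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    tn = sum(1 for p, a in pairs if not p and not a)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


if __name__ == "__main__":
    # Ten word segments: the first four really are the hot word.
    actual = [True] * 4 + [False] * 6
    predicted = [True, True, True, False, True, False, False, False, False, False]
    print(tpr_fpr(predicted, actual))
```

A low FPR is important here because sentences contain far more insensitive words than hot words.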

SLIDE 22

Case Study: End-to-End Attack

  • Attack scenario:
  • The victim makes a phone call to a remote caller and requests a password during the conversation.
  • The password is eight digits in length and is preceded by the hot word “password (is)”.
  • Training data
  • 200 “password”s (Hotword search)
  • 2200 other words (Hotword search)
  • 280*10 digits (Digits recognition)
  • Testing data
  • 80 conversations for each setting.
  • Attack process:
  • 1) Hotword search: Locate password.
  • 2) Digits recognition: Recognize eight-digit password.
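
The two-stage logic can be sketched on already-labeled words. This is a simplified illustration: it assumes hypothetical hot-word and digit recognizers have turned each segment into a text label, whereas the attack described above works on spectrograms. All names are illustrative.

```python
DIGIT_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
               "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}


def extract_password(words, length=8):
    """Find the hot word "password" (optionally followed by "is") and read the digits after it."""
    for i, word in enumerate(words):
        if word != "password":
            continue
        j = i + 1
        if j < len(words) and words[j] == "is":
            j += 1
        candidate = words[j:j + length]
        if len(candidate) == length and all(w in DIGIT_WORDS for w in candidate):
            return "".join(DIGIT_WORDS[w] for w in candidate)
    return None


if __name__ == "__main__":
    spoken = ("please note my password is one two three four "
              "five six seven eight thanks").split()
    print(extract_password(spoken))
```

Locating the hot word first means the digit recognizer only has to run on the eight segments that follow it.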

SLIDE 23

Speech Reconstruction

  • Reconstruction network (following the style-transfer network of Johnson et al.):
  • Encoder: encodes spectrograms into features.
  • Residual blocks: refine the encoded features by residual mappings (inspired by ResNet).
  • Decoder: decodes the features into audio spectrograms.
  • Griffin-Lim (GL) algorithm:
  • Recovers the phase from the spectrogram.
  • Recovers audio signals.

$H(x) = F(x, W_i) + x$

Johnson, Justin, et al. "Perceptual losses for real-time style transfer and super-resolution." European conference on computer vision. Springer, Cham, 2016.
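
The residual mapping $H(x) = F(x, W_i) + x$ can be illustrated with a toy block. The function F below is a hypothetical single linear layer with ReLU; the real network uses convolutional layers.

```python
def make_linear_relu(weights):
    """Hypothetical residual function F(x, W): one linear layer followed by ReLU."""
    def f(x):
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
    return f


def residual_block(x, residual_fn):
    """Compute H(x) = F(x) + x; F must preserve the input length."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("residual function must preserve the feature length")
    return [a + b for a, b in zip(fx, x)]


if __name__ == "__main__":
    x = [1.0, -2.0]
    zero_f = make_linear_relu([[0.0, 0.0], [0.0, 0.0]])
    identity_f = make_linear_relu([[1.0, 0.0], [0.0, 1.0]])
    print(residual_block(x, zero_f))      # F = 0, so the block passes x through
    print(residual_block(x, identity_f))  # adds ReLU(x) on top of x
```

When F outputs zero the block reduces to the identity, so each block only has to learn a refinement of the features it receives.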

SLIDE 24

Reconstruction Results

  • Listen to some reconstructed examples

  • "Password is one two three"
  • "Angry bird is my username"
  • "Here is my social security number"

SLIDE 25

Defense

  • Limit the sampling rate of the accelerometer.
  • According to the Android Developer documentation, the recommended sampling rates for the user interface and mobile games are 16.7 Hz and 50 Hz, respectively.
  • Applications requiring sampling rates above 50 Hz should request a permission through <uses-permission>.
  • Notify the user when applications are collecting accelerometer readings in the background.

(Figure: recognition accuracy on the digits dataset.)
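
The rate-limiting defense can be sketched as a filter on sample timestamps. This is a hypothetical illustration of the policy, not an Android API: the function name, the `has_permission` flag and the 50 Hz default are assumptions taken from the bullet points above.

```python
def limit_sampling_rate(timestamps_ms, max_rate_hz=50.0, has_permission=False):
    """Drop samples that arrive faster than max_rate_hz unless permission was granted."""
    if has_permission:
        return list(timestamps_ms)
    min_interval_ms = 1000.0 / max_rate_hz
    kept, last = [], None
    for t in timestamps_ms:
        if last is None or t - last >= min_interval_ms:
            kept.append(t)
            last = t
    return kept


if __name__ == "__main__":
    # 100 ms of readings delivered at 500 Hz (one every 2 ms).
    fast_stream = list(range(0, 100, 2))
    print(len(fast_stream), "->", len(limit_sampling_rate(fast_stream)))
```

At 50 Hz the Nyquist limit is 25 Hz, well below the 85-255 Hz fundamental band, which is why capping the rate is expected to hurt recognition.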

SLIDE 26

Conclusion

  • Sound signals emitted by smartphone speakers can significantly affect the accelerometer on the same smartphone.
  • Accelerometers on recent smartphones cover almost the entire fundamental frequency band of speech.
  • Using deep learning techniques, it is possible to recognize and reconstruct the speech signals from the accelerometer measurements.

Thank you!

Zhongjie Ba, Email: zhongjie.ba@mcgill.ca