Learning-based Practical Smartphone Eavesdropping with Built-in Accelerometer (PowerPoint presentation)

SLIDE 1

Learning-based Practical Smartphone Eavesdropping with Built-in Accelerometer

Authors: Zhongjie Ba, Tianhang Zheng, Xinyu Zhang, Zhan Qin, Baochun Li, Xue Liu and Kui Ren
Presenter: Shiqing Luo

SLIDE 2

Smartphone Sensors

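Sensor type       Sensor                      Permission
Motion sensor     Gyroscope, Accelerometer    No permission needed
Magnetic sensor   Magnetometer                No permission needed
Voice sensor      Microphone                  Permission required
Image sensor      Camera                      Permission required

Sensor studied in this work: Accelerometer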

SLIDE 3

Motion Sensor Threat to Speech Privacy

A smartphone gyroscope can pick up surface vibrations incurred by an independent loudspeaker placed on the same table (Michalevsky et al., Usenix 2014).

Gyroscopes are (lousy but still) microphones.
Very low signal to noise ratio
Low sampling frequency

Speakers              Speaker identification   Digit recognition
Mixed female/male     50%                      17%
Female speakers       45%                      26%
Male speakers         65%                      23%

SLIDE 4

Motion Sensor Threat to Speech Privacy

Only loudspeaker-rendered speech signals traveling through a solid surface can create noticeable impacts on motion sensors (Anand et al., S&P 2018).

[Figure: gyroscope and accelerometer responses to speech arriving through a shared surface and through air.]

The threat does not go beyond the Loudspeaker-Same-Surface setup studied by Michalevsky et al.

SLIDE 5

Commonly Believed Limitations

Can only pick up a narrow band of speech signals
Android has a sampling ceiling of 200 Hz
iOS has a sampling ceiling of 100 Hz
Does not go beyond the Loudspeaker-Same-Surface setup
Very low SNR (Signal-to-Noise Ratio)
Sensitive to sound angle of arrival

Fundamental frequency range of human speech: 85-180 Hz (adult male), 165-255 Hz (adult female)

SLIDE 6

Our Observations: Sampling Frequency

Delay option     Delay     Sampling rate
DELAY_NORMAL     200 ms    5 Hz
DELAY_UI         60 ms     16.7 Hz
DELAY_GAME       20 ms     50 Hz
DELAY_FASTEST    0 ms      As fast as possible (AFAP)
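The delay and sampling-rate columns are related by rate = 1000 / delay (in ms). A minimal sketch of that arithmetic follows; the function name is illustrative, and a 0 ms delay is treated as "as fast as possible" with no fixed rate.

```python
from typing import Optional


def delay_ms_to_rate_hz(delay_ms: float) -> Optional[float]:
    """Return the nominal sampling rate for a sensor delay, or None for AFAP."""
    if delay_ms <= 0:
        # A 0 ms delay means "as fast as possible": the rate depends on the hardware.
        return None
    return 1000.0 / delay_ms


for delay in (200, 60, 20, 0):
    rate = delay_ms_to_rate_hz(delay)
    label = "AFAP" if rate is None else f"{rate:.1f} Hz"
    print(f"{delay:>3} ms -> {label}")
```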

Model             Year   Sampling rate
Moto G4           2016   100 Hz
Samsung J3        2016   100 Hz
LG G5             2016   200 Hz
Huawei Mate 9     2016   250 Hz
Samsung S8        2017   420 Hz
Google Pixel 3    2018   410 Hz
Huawei P20 Pro    2018   500 Hz
Huawei Mate 20    2018   500 Hz

The actual sampling rates of motion sensors are determined by the performance of the smartphone.
Accelerometers on recent smartphones can cover almost the entire fundamental frequency band (85-255 Hz) of adult speech.

The 200 Hz sampling ceiling no longer exists
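One way to see why the higher rates matter: by the Nyquist criterion, a sensor sampled at fs can represent frequencies only up to fs / 2. The sketch below applies that rule to a few of the measured rates. The helper name is an assumption, and treating coverage as simple overlap with the 85-255 Hz band is a simplification that ignores aliasing.

```python
def band_coverage(fs_hz: float, band=(85.0, 255.0)) -> float:
    """Fraction of `band` that lies below the Nyquist frequency fs_hz / 2."""
    low, high = band
    nyquist = fs_hz / 2.0
    covered = max(0.0, min(high, nyquist) - low)
    return covered / (high - low)


# Sampling rates taken from the phone table on this slide.
phones = {"Moto G4": 100, "LG G5": 200, "Samsung S8": 420, "Huawei P20 Pro": 500}
for model, fs in phones.items():
    print(f"{model}: Nyquist {fs / 2:.0f} Hz, covers {band_coverage(fs):.0%} of 85-255 Hz")
```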

[1] “Sensor Overview,” https://developer.android.com/guide/topics/sensors/sensors_overview.

Sampling frequencies supported by Android [1]

SLIDE 7

Our Observations: New Setup

The attack employs a smartphone’s accelerometer to eavesdrop on the loudspeaker of the same smartphone.

Much Higher SNR
Sound always arrives from the same direction

[Figure: accelerometer response along the x-axis and z-axis; labeled values 0.4 and 0.03.]

SLIDE 8

Our Observations: New Setup

The attack employs a smartphone’s accelerometer to eavesdrop on the loudspeaker of the same smartphone.

Much Higher SNR
Sound always arrives from the same direction
A smartphone speaker is more likely to reveal sensitive information than an independent loudspeaker.

SLIDE 9

Threat Model

Table setting
Handhold setting

SLIDE 10

Accelerometer-based Smartphone Eavesdropping

Preprocessing: convert acceleration signals into spectrograms.
Speech Recognition: convert spectrograms to text.
Speech Reconstruction: reconstruct voice signals from spectrograms.

SLIDE 11

Preprocessing

Problems in Raw Acceleration Signals
Raw accelerometer measurements are not sampled at a fixed interval.
Raw accelerometer measurements can be distorted by human movement.
Raw accelerometer measurements capture multiple digits and need to be segmented.

Time (ms)   x-axis (m/s²)   y-axis (m/s²)   z-axis (m/s²)
1           0.2130          0.1410          10.0020
2           0.1870          0.1440          9.9970
3           0.2110          0.1510          9.9970
5           0.2110          0.1410          10.0070
8           0.2080          0.1340          10.0120
10          0.2150          0.1320          10.0070

SLIDE 12

Time (ms)   x-axis (m/s²)   y-axis (m/s²)   z-axis (m/s²)
1           0.2130          0.1410          10.0020
2           0.1870          0.1440          9.9970
3           0.2110          0.1510          9.9970
5           0.2110          0.1410          10.0070
8           0.2080          0.1340          10.0120
10          0.2150          0.1320          10.0070

Step 1: Generate Sanitized Single-word Signals

Interpolation
Upsample accelerometer signals to 1000 Hz using linear interpolation.

SLIDE 13

Step 1: Generate Sanitized Single-word Signals

Time (ms)   x-axis (m/s²)   y-axis (m/s²)   z-axis (m/s²)
1           0.2130          0.1410          10.0020
2           0.1870          0.1440          9.9970
3           0.2110          0.1510          9.9970
4           0.2110          0.1460          10.0020
5           0.2110          0.1410          10.0070
6           0.2100          0.1387          10.0087
7           0.2090          0.1363          10.0103
8           0.2080          0.1340          10.0120
9           0.2115          0.1330          10.0095
10          0.2150          0.1320          10.0070

Interpolation
Upsample accelerometer signals to 1000 Hz using linear interpolation.
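The interpolated rows at 4, 6, 7 and 9 ms can be reproduced with a few lines of linear interpolation. This is a minimal sketch rather than the authors' code: the function name is illustrative, and it assumes integer millisecond timestamps and a 1 ms output grid (1000 Hz).

```python
def interpolate_linear(times_ms, values, step_ms=1):
    """Resample irregular (time, value) samples onto a uniform grid."""
    resampled = []
    j = 0  # index of the segment [times_ms[j], times_ms[j + 1]] containing t
    t = times_ms[0]
    while t <= times_ms[-1]:
        while j < len(times_ms) - 2 and times_ms[j + 1] < t:
            j += 1
        t0, t1 = times_ms[j], times_ms[j + 1]
        v0, v1 = values[j], values[j + 1]
        resampled.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
        t += step_ms
    return resampled


# Raw y-axis readings from the slide, sampled at irregular times.
times = [1, 2, 3, 5, 8, 10]
y_axis = [0.1410, 0.1440, 0.1510, 0.1410, 0.1340, 0.1320]
for t, y in interpolate_linear(times, y_axis):
    print(f"{t:>2} ms  {y:.4f}")
```

Running the same function on the x-axis and z-axis columns fills in their missing rows in the same way.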

SLIDE 14

Step 1: Generate Sanitized Single-word Signals

Fundamental frequency range of human speech: 85-180 Hz (adult male), 165-255 Hz (adult female)

Interpolation
Upsample accelerometer signals to 1000 Hz using linear interpolation.

High-pass filter
Convert the acceleration signal along each axis to the frequency domain and eliminate frequency components below 80 Hz.
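A frequency-domain high-pass filter of this kind can be sketched as: transform, zero every bin below the cutoff, transform back. The version below uses a naive O(n²) DFT from the standard library so it stays self-contained. The function names and the synthetic 5 Hz "movement" and 150 Hz "speech" tones are assumptions for illustration, and a real pipeline would likely use an FFT library.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def highpass(signal, fs_hz, cutoff_hz=80.0):
    """Zero every DFT bin whose frequency is below cutoff_hz, then invert."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        # Bins k and n - k hold the positive and negative copies of one frequency.
        freq = min(k, n - k) * fs_hz / n
        if freq < cutoff_hz:
            spectrum[k] = 0
    return idft(spectrum)


fs = 1000  # Hz, the rate after interpolation
n = 200
t = [i / fs for i in range(n)]
motion = [math.sin(2 * math.pi * 5 * ti) for ti in t]          # slow movement
speech = [0.1 * math.sin(2 * math.pi * 150 * ti) for ti in t]  # speech-band tone
filtered = highpass([m + s for m, s in zip(motion, speech)], fs)
# The residual against the pure speech tone should be close to zero.
print(max(abs(f - s) for f, s in zip(filtered, speech)))
```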

SLIDE 15

Step 1: Generate Sanitized Single-word Signals

[Figure: table setting and handhold setting.]

Interpolation
Upsample accelerometer signals to 1000 Hz using linear interpolation.

High-pass filter
Convert the acceleration signal along each axis to the frequency domain and eliminate frequency components below 80 Hz.

SLIDE 16

Step 1: Generate Sanitized Single-word Signals

Interpolation
Upsample accelerometer signals to 1000 Hz using linear interpolation.

High-pass filter
Convert the acceleration signal along each axis to the frequency domain and eliminate frequency components below 80 Hz.

Segmentation
Calculate the magnitude of the acceleration signal and smooth the obtained magnitude sequence with a moving average.
Locate all regions with magnitudes higher than a threshold.
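The segmentation step can be sketched as follows. The window length, the threshold value, and the function name are assumptions; the slide does not state them. The input is assumed to be high-pass filtered already, so the magnitude is near zero during silence.

```python
import math


def segment(ax, ay, az, window=5, threshold=0.05):
    """Return (start, end) index pairs where the smoothed magnitude exceeds threshold."""
    magnitude = [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(ax, ay, az)]
    half = window // 2
    smoothed = []
    for i in range(len(magnitude)):
        chunk = magnitude[max(0, i - half): i + half + 1]  # centered moving average
        smoothed.append(sum(chunk) / len(chunk))
    regions, start = [], None
    for i, value in enumerate(smoothed):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            regions.append((start, i))  # end index is exclusive
            start = None
    if start is not None:
        regions.append((start, len(smoothed)))
    return regions


# Synthetic signal: two bursts of vibration separated by silence.
quiet = [0.0] * 20
burst = [0.5, -0.5] * 10
x = quiet + burst + quiet + burst + quiet
zeros = [0.0] * len(x)
print(segment(x, zeros, zeros))
```

Each returned region would then be cut out as one single-word signal.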

[Figure: table setting and handhold setting.]

SLIDE 17

Step 2: Generate Spectrogram Images

Signal-to-spectrogram conversion
Divide the signal into multiple short segments with a fixed overlap.
Window each segment with a Hamming window and calculate its spectrum through STFT (Short-Time Fourier Transform).
Three spectrograms can be obtained for each single-word signal.

[Figure: table setting and handhold setting.]

SLIDE 18

Step 2: Generate Spectrogram Images

Signal-to-spectrogram conversion
Divide the signal into multiple short segments with a fixed overlap.
Window each segment with a Hamming window and calculate its spectrum through STFT (Short-Time Fourier Transform).
Three spectrograms can be obtained for each single-word signal.
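The conversion above can be sketched with a plain-Python STFT. The frame length (64 samples) and overlap (48 samples) are assumptions chosen to keep the example fast; the slide does not give the values used. The function names are illustrative, and one spectrogram would be computed per axis.

```python
import cmath
import math


def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def stft_magnitude(signal, frame_len=64, overlap=48):
    """Magnitude spectrogram: one row per frame, one column per non-negative frequency bin."""
    hop = frame_len - overlap
    window = hamming(frame_len)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / frame_len)
                        for i in range(frame_len))
            row.append(abs(coeff))
        rows.append(row)
    return rows


fs = 1000  # Hz
tone = [math.sin(2 * math.pi * 125 * i / fs) for i in range(256)]
spec = stft_magnitude(tone)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
print(len(spec), "frames,", len(spec[0]), "bins, peak bin per frame:", set(peak_bins))
```

With these settings each bin spans fs / frame_len = 15.625 Hz, so the 125 Hz test tone falls in bin 8.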

Generate Spectrogram-Images
Fit the three m x n spectrograms into one m x n x 3 tensor.
Take the square root of all the elements in the tensor and map the obtained values to integers between 0 and 255.
Export the m x n x 3 tensor as an image in PNG format.

[Figure: spectrogram images for the table setting and the handhold setting.]
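The tensor construction can be sketched as below. The min-max scaling across all three channels is an assumption, since the slide only says the values are mapped to integers between 0 and 255. The PNG export is left out because it would need an imaging library.

```python
import math


def to_image_tensor(spec_x, spec_y, spec_z):
    """Stack three m x n spectrograms into an m x n x 3 tensor of integers in 0..255."""
    rooted = [[[math.sqrt(v) for v in row] for row in spec]
              for spec in (spec_x, spec_y, spec_z)]
    lo = min(v for spec in rooted for row in spec for v in row)
    hi = max(v for spec in rooted for row in spec for v in row)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0  # assumed min-max scaling
    m, n = len(spec_x), len(spec_x[0])
    return [[[round((rooted[c][i][j] - lo) * scale) for c in range(3)]
             for j in range(n)] for i in range(m)]


# Toy 2 x 2 spectrograms, one per accelerometer axis.
tensor = to_image_tensor([[0, 1], [4, 9]], [[16, 25], [36, 49]], [[64, 81], [100, 0]])
print(tensor[0][0], tensor[1][0])
```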

SLIDE 19

Speech Recognition

DenseNet:
Direct connections from each layer to every subsequent layer
Fewer nodes and parameters
Performance comparable to VGG and ResNet

$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$

Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
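The dense-connectivity rule, where layer l receives the concatenation of all earlier feature maps, can be illustrated with plain lists. `toy_layer` is a hypothetical stand-in for H_l, which in a real DenseNet is a composite of batch normalization, ReLU and convolution.

```python
def dense_block(x0, layers):
    """Apply layers so that layer l receives the concatenation of x_0 ... x_{l-1}."""
    features = [x0]
    for layer in layers:
        concatenated = [v for feat in features for v in feat]
        features.append(layer(concatenated))
    return features


def toy_layer(inputs):
    # Hypothetical H_l: reduces its (growing) input to a single averaged feature.
    return [sum(inputs) / len(inputs)]


features = dense_block([1.0, 3.0], [toy_layer, toy_layer, toy_layer])
print(features)
```

Note how the input to each successive layer grows by one feature, which is the concatenation `[x_0, ..., x_{l-1}]` in the formula.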

SLIDE 20

Recognition Results

Dataset (80% training data and 20% testing data):
Digits: 10k single-digit signals from 20 speakers
Digits + Letters: 36*260 single-word signals from 10 speakers
Recognizing Digits & Letters (common elements in passwords)
Recognizing 20 Speakers (connect multiple attack results)

Previous SOTA results: 26% on recognizing digits
Previous SOTA results: 50% on recognizing 10 speakers
Previous SOTA setup: traditional ML + gyroscope + Loudspeaker-Same-Surface

[Figure: the AccDataRec and Audio Player apps.]

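Recognizing digits and letters:

Task              Top-1 accuracy   Top-3 accuracy   SOTA
Digits            78%              96%              26%
Digits + Letters  55%              78%              (none listed)

Recognizing 20 speakers:

Top-1 accuracy   Top-3 accuracy   SOTA
70%              88%              50% (10 speakers)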
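Top-1 and top-3 accuracy count a prediction as correct when the true label is among the one or three highest-scoring classes. A minimal sketch of the metric follows; the scores and labels are toy values, not the authors' data.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy classifier outputs for four samples and three classes.
scores = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.35, 0.25], [0.2, 0.5, 0.3]]
labels = [0, 1, 2, 1]
for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(scores, labels, k):.2f}")
```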

SLIDE 21

Hot Word Search

Locate and identify pre-trained hot (sensitive) words from sentences.

Example: “Here is my social security number” (insensitive words: “Here is my”; hot words: “social security number”)

Hot word    TPR     FPR
Password    94%     0.4%
Username    97%     0.4%
Social      100%    0.3%
Security    91%     0.0%
Number      88%     0.1%
Email       88%     1.4%
Credit      88%     0.3%
Card        97%     1.4%

Speakers
Two males and two females
Training data
128*8 hot words
2176 insensitive words (negative samples)
Testing data
200 short sentences, each of which contains several insensitive words and one to three hot words.
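For reference, the TPR and FPR columns are the true-positive rate, TP / (TP + FN), and the false-positive rate, FP / (FP + TN), of a per-word detector. A minimal sketch with toy detections follows; the data below are not the authors' results.

```python
def tpr_fpr(predicted, actual):
    """True-positive and false-positive rates for one hot word (one boolean per word)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    positives = sum(actual)
    negatives = len(actual) - positives
    return tp / positives, fp / negatives


# Toy example: 4 true occurrences of the hot word among 10 spoken words.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]
tpr, fpr = tpr_fpr(predicted, actual)
print(f"TPR {tpr:.0%}, FPR {fpr:.1%}")
```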

SLIDE 22

Case Study: End-to-End Attack

Attack scenario:
The victim makes a phone call to a remote caller and requests a password during the conversation.
The password is eight digits in length and is preceded by the hot word “password (is)”.
Training data
200 instances of “password” (hot word search)
2200 other words (hot word search)
280*10 digits (digit recognition)
Testing data
80 conversations for each setting.
Attack process:
1) Hot word search: locate “password”.
2) Digit recognition: recognize the eight-digit password.
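The two-stage attack can be sketched at the token level. This assumes the hot word search and digit recognition models have already turned the accelerometer signal into a list of recognized words, so both models are stubbed out here. The function name and the example tokens are hypothetical.

```python
DIGITS = set("0123456789")


def extract_password(words, hot_word="password", length=8):
    """Find the hot word, then read the next `length` digit tokens after it."""
    for i, word in enumerate(words):
        if word == hot_word:
            # Skip filler such as "is" and keep only tokens recognized as digits.
            digits = [w for w in words[i + 1:] if w in DIGITS][:length]
            if len(digits) == length:
                return "".join(digits)
    return None


recognized = ["my", "password", "is", "3", "1", "4", "1", "5", "9", "2", "6", "thanks"]
print(extract_password(recognized))
```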

SLIDE 23

Speech Reconstruction

Reconstruction Network (refer to style transfer):
Encoder: encode spectrograms into features
Residual blocks: refine encoded features by residual mappings (inspired by ResNet)
Decoder: decode the features into audio spectrograms
Griffin-Lim (GL) algorithm:
Recover the phase from the spectrogram
Recover audio signals

$H(x) = F(x, W_i) + x$

Johnson, Justin, et al. "Perceptual losses for real-time style transfer and super-resolution." European conference on computer vision. Springer, Cham, 2016.
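The residual mapping H(x) = F(x, W_i) + x adds the block's input back onto its learned transformation. The sketch below shows that skip connection on plain lists. `make_scaling` is a hypothetical stand-in for F, which in the reconstruction network would be a stack of convolutional layers.

```python
def residual_block(x, transform):
    """Compute H(x) = F(x, W_i) + x, where `transform` plays the role of F."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for the skip connection")
    return [f + xi for f, xi in zip(fx, x)]


def make_scaling(weights):
    # Hypothetical F: element-wise scaling by weights W_i.
    return lambda x: [w * xi for w, xi in zip(weights, x)]


x = [1.0, 2.0, 3.0]
print(residual_block(x, make_scaling([0.0, 0.0, 0.0])))   # F = 0 gives the identity
print(residual_block(x, make_scaling([0.5, -1.0, 2.0])))
```

When F outputs zero the block reduces to the identity, which is why residual blocks only need to learn a refinement of their input.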

SLIDE 24

Reconstruction Results

Listen to some reconstructed examples:

Password is one two three
Angry bird is my username
Here is my social security number

SLIDE 25

Defense

Limit the sampling rate of the accelerometer.
According to Android Developer, the recommended sampling rates for the user interface and mobile games are 16.7 Hz and 50 Hz respectively.
Applications requiring sampling rates above 50 Hz should request a permission through <uses-permission>.

Notify the user when some applications are collecting accelerometer readings in the background.
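The proposed rate-limiting policy can be sketched as a simple cap. The function and constant names are hypothetical and do not correspond to any real Android API; the 50 Hz threshold is the one named on this slide.

```python
DEFAULT_CAP_HZ = 50.0  # cap for apps without the extra permission (from the slide)


def granted_rate(requested_hz, has_high_rate_permission):
    """Sampling rate the OS would deliver under the proposed policy."""
    if has_high_rate_permission:
        return requested_hz
    return min(requested_hz, DEFAULT_CAP_HZ)


for requested, allowed in ((500, False), (500, True), (16.7, False)):
    rate = granted_rate(requested, allowed)
    print(f"requested {requested} Hz, permission={allowed}: "
          f"granted {rate} Hz (Nyquist {rate / 2} Hz)")
```

At the 50 Hz cap the Nyquist frequency is 25 Hz, well below the 85 Hz lower edge of the speech fundamental band. This simple bound does not account for aliasing, so it is only a rough indication of what the cap protects.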

[Figure: recognition accuracy on the digits dataset.]

SLIDE 26

Conclusion

Sound signals emitted by smartphone speakers can significantly affect the accelerometer on the same smartphone.

Accelerometers on recent smartphones cover almost the entire fundamental frequency band of speech.

Using deep learning techniques, it is possible to recognize and reconstruct the speech signals from the accelerometer measurements.

Thank you!

Zhongjie Ba, Email: zhongjie.ba@mcgill.ca