Spot me if you can: Uncovering spoken phrases in encrypted VoIP

  1. Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations. C. Wright, L. Ballard, S. Coull, F. Monrose, G. Masson. Talk held by Goran Doychev, Selected Topics in Information Security and Cryptography Seminar. 1 / 30

  2. Overview 1 How does VoIP work? 2 Recognizing previously seen phrases 3 Recognizing phrases without example utterances 4 Evaluation 2 / 30

  3. 1 How does VoIP work? 2 Recognizing previously seen phrases 3 Recognizing phrases without example utterances 4 Evaluation 3 / 30

  4. How does VoIP work? • Control channel: SIP, XMPP, Skype • negotiate IP ports, supported codecs etc. • Voice data: RTP over UDP • Speech codec: GSM, G.728, iSAC, Speex 4 / 30

  5. Operation of a Codec: audio stream → sampling at 8000 or 16000 samples per second (Hz) → n most recent samples compressed to a packet (usually 20 ms). Example: • 16 kHz audio source: n = 320 samples per packet • 8 kHz audio source: n = 160 samples per packet 5 / 30
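
The packet sizes in the example follow directly from the sampling rate and the frame duration. A minimal sketch of that arithmetic, assuming the usual 20 ms frame from the slide (the function name `samples_per_packet` is only illustrative):

```python
def samples_per_packet(sample_rate_hz: int, frame_ms: int = 20) -> int:
    """Number of audio samples that fall into one frame of frame_ms milliseconds."""
    return sample_rate_hz * frame_ms // 1000


for rate in (8000, 16000):
    print(rate, "Hz ->", samples_per_packet(rate), "samples per packet")
```

With the default 20 ms frame this reproduces the two values on the slide.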

  6. Operation of a Codec (2) • brute-force search over entries in a codebook of audio vectors • find the one that most closely reproduces the audio packet. Figure: audio packet → digital representation 01001110 → codebook lookup → output 0111. Codebook (In → Out): 01001010 → 0110; 01001110 → 0111; 01011001 → 1000; 01011010 → 1001; 01011110 → 1010. 6 / 30
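
The brute-force codebook search can be sketched as a nearest-neighbour lookup. This is an illustration only: the four-entry codebook, the frame values, and the squared-error distance are assumptions made for the example, not the codebook or error measure of any real codec.

```python
from typing import Sequence


def nearest_codebook_index(frame: Sequence[float],
                           codebook: Sequence[Sequence[float]]) -> int:
    """Try every codebook entry and return the index with the smallest squared error."""
    best_index, best_error = 0, float("inf")
    for index, entry in enumerate(codebook):
        error = sum((a - b) ** 2 for a, b in zip(frame, entry))
        if error < best_error:
            best_index, best_error = index, error
    return best_index


# Hypothetical codebook of audio vectors (values invented for illustration).
CODEBOOK = [
    (0.0, 0.0, 0.0, 0.0),
    (0.5, 0.5, 0.0, 0.0),
    (0.0, 0.5, 0.5, 0.0),
    (1.0, 0.5, -0.5, -1.0),
]

frame = (0.9, 0.4, -0.6, -0.8)
index = nearest_codebook_index(frame, CODEBOOK)
print(format(index, "02b"))  # only the index is transmitted, not the samples
```

The point of the sketch is that the transmitted value is the short index, so a larger codebook needs more bits per packet.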

  7. Operation of a Codec (3) • Quality of sound depends on the number of entries in the codebook • Classification of coders according to bit rate (category: bit-rate range): High bit-rate: > 15 kbps; Medium bit-rate: 5 to 15 kbps; Low bit-rate: 2 to 5 kbps; Very low bit-rate: < 2 kbps 7 / 30
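
The classification table can be written as a small lookup function. The slide does not say which category a rate of exactly 5 or 2 kbps belongs to; the sketch below assumes the lower bound of each range is inclusive, which is a choice made only for this example.

```python
def bitrate_category(kbps: float) -> str:
    """Map a bit rate in kbps to the category names used in the table."""
    if kbps > 15:
        return "High bit-rate"
    if kbps >= 5:   # assumption: 5 kbps counts as medium
        return "Medium bit-rate"
    if kbps >= 2:   # assumption: 2 kbps counts as low
        return "Low bit-rate"
    return "Very low bit-rate"


for rate in (24.0, 8.0, 3.0, 1.0):
    print(rate, "kbps ->", bitrate_category(rate))
```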

  8. Variable Bit Rate • Variable bit rate (VBR): adaptively choose bit rate for each packet • Balance between audio quality and bandwidth • In a two-way conversation: speaker silent 63% of the time 8 / 30

  9. Variable Bit Rate (2) LEAKAGE: • Bit rate depends on encoded data • e.g., Speex encodes vowel sounds (aa, aw) at a higher bit rate than fricative sounds (f, s) 9 / 30
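
Why a data-dependent bit rate leaks information can be made concrete with a little arithmetic: for a fixed frame duration, the payload size is a direct function of the chosen bit rate. The sketch assumes 20 ms frames and an encryption scheme that does not change the payload length; the two bit rates are illustrative values, not a claim about which codec modes are used for which sounds.

```python
def payload_bytes(bitrate_bps: int, frame_ms: int = 20) -> int:
    """Bytes needed to carry one frame at the given bit rate, rounded up."""
    bits = bitrate_bps * frame_ms // 1000
    return (bits + 7) // 8


# Illustrative rates only: one "high" and one "low" mode of a VBR codec.
high_rate_bytes = payload_bytes(24600)
low_rate_bytes = payload_bytes(5950)
print(high_rate_bytes, low_rate_bytes)
```

An observer who sees only packet lengths can therefore tell the two modes apart without decrypting anything.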

  10. 1 How does VoIP work? 2 Recognizing previously seen phrases 3 Recognizing phrases without example utterances 4 Evaluation 10 / 30

  11. Problem Description Given: • utterances of n phrases (phrase 1, phrase 2, phrase 3) • packet sizes of one of the phrases (5k,7k,3k,8k,12k,2k,1k) Goal: • recognize the phrase: (5k,7k,3k,8k,12k,2k,1k) → “the phrase” 11 / 30

  12. Profile Hidden Markov Model (HMM) • Match states - expected distribution of packet sizes at each position in the sequence • Insert states - emit packets according to some distribution (uniform). Allows “insertion” of additional packets. • Delete states - silent states. Allows “omitting” packets. 12 / 30

  15. Building a Profile HMM Initially: • set Match state emission probabilities to the uniform distribution • transition probabilities: make Match the most likely transition Train the HMM using example utterances: • Apply the Baum-Welch algorithm: iteratively improves the probability of the training sequences • Baum-Welch finds only a locally optimal set of parameters ⇒ apply simulated annealing to reduce the risk of getting stuck in a poor local optimum • Apply Viterbi training to further refine the parameters. 13 / 30
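
The initialization step can be sketched as a plain data structure. This is a simplified illustration, not the authors' implementation: it keeps one outgoing transition distribution per position instead of separate ones for the Match, Insert, and Delete states, and the value `p_match=0.9` is an assumed placeholder for "Match is the most likely transition". Training (Baum-Welch, simulated annealing, Viterbi training) is not shown.

```python
def initial_profile_hmm(length, packet_sizes, p_match=0.9):
    """Build an untrained profile HMM skeleton with `length` positions.

    Emissions start uniform over the observed packet sizes; transitions
    favour the Match state, with the rest split between Insert and Delete.
    """
    uniform = {size: 1.0 / len(packet_sizes) for size in packet_sizes}
    p_other = (1.0 - p_match) / 2
    positions = []
    for _ in range(length):
        positions.append({
            "match_emissions": dict(uniform),
            "insert_emissions": dict(uniform),
            # Delete states are silent, so they have no emission table.
            "transitions": {"match": p_match, "insert": p_other, "delete": p_other},
        })
    return positions


hmm = initial_profile_hmm(length=3, packet_sizes=[1, 2, 3, 5, 7, 8, 12])
print(hmm[0]["transitions"])
```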

  16. Problem Description Given: • utterances of n phrases (phrase 1, phrase 2, phrase 3) • packet sizes of one of the phrases (5k,7k,3k,8k,12k,2k,1k) Goal: • recognize the phrase: (5k,7k,3k,8k,12k,2k,1k) → “the phrase” 14 / 30

  17. Searching for a Phrase. Changes to the model: • Random state - emits packets according to a uniform distribution; matches packets that are not part of the phrase of interest • Profile Start/End states - match the start/end of the phrase • from Profile Start (PS), the transition to the first Match (M) state is the most likely 15 / 30

  19. Searching for a Phrase (2) • Apply the Viterbi algorithm: find the most likely sequence of states to explain the observed packet sizes • A “hit”: a subsequence of states that belong to the profile part of the model • Evaluate the hit’s goodness, where $l_i, \dots, l_j$ are the packet lengths of the phrase of interest: $\mathrm{score}_{i,j} = \log \frac{\Pr[l_i, \dots, l_j \mid \text{Profile}]}{\Pr[l_i, \dots, l_j \mid \text{Random}]}$ • Discard hits below a threshold 16 / 30
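
The log-odds score can be illustrated with a heavily simplified model. The sketch assumes the hit is explained by Match states only (no Insert or Delete states), that the Random model is uniform over the packet-size alphabet, and that the emission tables and the threshold of 0 are made-up values for the example.

```python
import math


def log_odds_score(lengths, match_emissions, alphabet_size):
    """log( Pr[lengths | Profile] / Pr[lengths | Random] ) for a match-only path."""
    log_profile = sum(math.log(emit[size])
                      for size, emit in zip(lengths, match_emissions))
    log_random = len(lengths) * math.log(1.0 / alphabet_size)
    return log_profile - log_random


# Hypothetical trained Match emissions for a 3-packet phrase over 4 packet sizes.
MATCH_EMISSIONS = [
    {5: 0.7, 7: 0.1, 3: 0.1, 8: 0.1},
    {7: 0.7, 5: 0.1, 3: 0.1, 8: 0.1},
    {3: 0.7, 5: 0.1, 7: 0.1, 8: 0.1},
]
THRESHOLD = 0.0  # assumed cut-off; in practice it would be tuned

good_hit = log_odds_score([5, 7, 3], MATCH_EMISSIONS, alphabet_size=4)
bad_hit = log_odds_score([7, 3, 5], MATCH_EMISSIONS, alphabet_size=4)
print(good_hit > THRESHOLD, bad_hit > THRESHOLD)
```

A positive score means the profile explains the packets better than the random model does, so only the first hit would survive the threshold here.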

  20. 1 How does VoIP work? 2 Recognizing previously seen phrases 3 Recognizing phrases without example utterances 4 Evaluation 17 / 30

  22. Phrase Models from Phonemes • Phonemes – sounds like b , ch , t , s , aa , aw (English - 40 to 60 phonemes) • Idea: words built up by concatenated phonemes ⇒ model phonemes instead Advantages: • Flexibility • Cheaper 18 / 30

  23. Problem Description Given: • recordings of all phonemes aa, ae, ah, ao, aw, ay, b, ch, d, dh, eh, er, ey, f, g, hh, etc. • packet sizes of a phrase (5k,7k,3k,8k,12k,2k,1k) Goal: • recognize the phrase (5k,7k,3k,8k,12k,2k,1k) → “ the phrase ” 19 / 30

  25. Phrase Models from Phonemes (2) Straightforward method: 1 build HMMs for phonemes 2 concatenate them, build word HMM 3 concatenate word HMMs to phrase HMM American English: “the phrase” (5k,7k,1k,8k,12k,2k,1k) ↓ (dh,ah),(f,r,ey,z) ↓ (“ the ”),(“ phrase ”) ↓ “ the phrase ” 20 / 30

  26. Phrase Models from Phonemes (2) Straightforward method: 1 build HMMs for phonemes 2 concatenate them, build word HMM 3 concatenate word HMMs to phrase HMM Scottish English: “the phrase” (5k,7k,1k,8k,10k,2k,1k) ↓ (dh,ah),(f,r,eh,z) ↓ (“ the ”),(“ frese ”?) ↓ ? 20 / 30

  28. Problem Description Given: • recordings of all phonemes aa, ae, ah, ao, aw, ay, b, ch, d, dh, eh, er, ey, f, g, hh, etc. • packet sizes of a phrase (5k,7k,3k,8k,12k,2k,1k) • phonetic pronunciation dictionary Goal: • recognize the phrase (5k,7k,3k,8k,12k,2k,1k) → “ the phrase ” 21 / 30

  31. Phrase Models from Phonemes (3) Advanced method: • build initial profile HMM for phrase (as usual) • train it using synthetic training set • search for phrase (as usual) Synthetic training set: • phrase: “the phrase” • split into words: “the” “phrase” • create list of phonemes: “dh ah” “f r ey z” • replace with packet sizes: “9k 20k” “5k 8k 14k 3k” Improved Model: use diphones and triphones instead of words 22 / 30
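
The synthetic-training-set steps can be sketched with the slide's own example. The two dictionaries below are hypothetical stand-ins for a pronunciation dictionary and for per-phoneme packet-size observations; they use one fixed size per phoneme (in the slide's "k" units), whereas a realistic model would draw sizes from a distribution learned from recordings.

```python
# Hypothetical pronunciation dictionary, limited to the example phrase.
PRONUNCIATIONS = {
    "the": ["dh", "ah"],
    "phrase": ["f", "r", "ey", "z"],
}

# Hypothetical packet size per phoneme, taken from the slide's example.
PACKET_SIZE = {"dh": 9, "ah": 20, "f": 5, "r": 8, "ey": 14, "z": 3}


def synthesize(phrase):
    """Phrase -> words -> phonemes -> packet sizes, one list per word."""
    words = phrase.lower().split()
    return [[PACKET_SIZE[p] for p in PRONUNCIATIONS[w]] for w in words]


training_example = synthesize("the phrase")
print(training_example)
```

Moving to diphones or triphones would mean keying the size table on pairs or triples of adjacent phonemes instead of single ones.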

  32. 1 How does VoIP work? 2 Recognizing previously seen phrases 3 Recognizing phrases without example utterances 4 Evaluation 23 / 30

  33. Experimental Setup • Use the TIMIT continuous speech corpus • Concatenate sentences to form a “conversation” • Training of HMM: • TIMIT pronunciation dictionary (“proper” American English) • PRONLEX pronunciation dictionary (more colloquial English) 24 / 30

  34. Evaluation Metrics • recall : Probability that algorithm finds phrase • precision : Probability that reported match is correct 25 / 30
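
Both metrics reduce to ratios of counts. A minimal sketch, where the counts of true positives, false negatives, and false positives are invented numbers used only to exercise the functions:

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual phrase occurrences that the search found."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported matches that are actually correct."""
    return true_positives / (true_positives + false_positives)


# Hypothetical counts: 8 correct hits, 2 missed occurrences, 8 false alarms.
r = recall(true_positives=8, false_negatives=2)
p = precision(true_positives=8, false_positives=8)
print(r, p)
```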

  37. Results of the Experiment: recall 51%, precision 50% • Some phrases were found with high accuracy: “Young children should avoid exposure to contagious diseases.” (recall = 0.99, precision = 1) • Results vary widely between individual speakers 26 / 30

  38. Robustness to Noise Using pink noise: • energy logarithmically distributed across the range of human hearing • harder for noise-removal algorithms to filter. Results (sound / noise: recall, precision): 100% / none: .51, .50; 90% / 10%: .39, .40; 75% / 25%: .23, .22 27 / 30
