Spot me if you can: Uncovering spoken phrases in encrypted VoIP - PowerPoint PPT Presentation




SLIDE 1

Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations

  • C. Wright, L. Ballard, S. Coull, F. Monrose, G. Masson

Talk held by Goran Doychev, Selected Topics in Information Security and Cryptography Seminar

1 / 30

SLIDE 2

Overview

1 How does VoIP work?
2 Recognizing previously seen phrases
3 Recognizing phrases without example utterances
4 Evaluation

2 / 30

SLIDE 3

1 How does VoIP work?
2 Recognizing previously seen phrases
3 Recognizing phrases without example utterances
4 Evaluation

3 / 30

SLIDE 4

How does VoIP work?

  • Control channel: SIP, XMPP, Skype
  • negotiate IP ports, supported codecs, etc.
  • Voice data: RTP over UDP
  • Speech codec: GSM, G.728, iSAC, Speex

4 / 30

SLIDE 5

Operation of a Codec

audio stream → sampled at 8000 or 16000 samples per second (Hz) → the n most recent samples are compressed into one packet (usually 20 ms)

Example

  • 16kHz audio source: n = 320 samples per packet
  • 8kHz audio source: n = 160 samples per packet
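The samples-per-packet arithmetic above can be sketched as follows (the helper name is my own, not from the talk):

```python
# Samples per packet: sample rate times frame length.
# Illustrative helper; the 20 ms default matches the slide.

def samples_per_packet(sample_rate_hz: int, frame_ms: int = 20) -> int:
    """Number of audio samples compressed into one packet."""
    return sample_rate_hz * frame_ms // 1000

print(samples_per_packet(16000))  # 320
print(samples_per_packet(8000))   # 160
```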

5 / 30

SLIDE 6

Operation of a Codec (2)

  • brute-force search over entries in codebook of audio vectors
  • find one that most closely reproduces audio packet

audio packet → digital representation 01001110
↓
codebook:
  In         Out
  01001010   0110
  01001110   0111
  01011001   1000
  01011010   1001
  01011110   1010
↓
output: 0111
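The brute-force search can be sketched like this; the two-sample frames and codebook entries are invented for illustration, whereas real codecs use trained codebooks of audio vectors:

```python
# Brute-force codebook search: return the index of the entry whose
# vector is closest (smallest squared error) to the input frame.

def encode_frame(frame, codebook):
    def sq_err(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_err(frame, codebook[i]))

codebook = [(0.0, 0.1), (0.5, 0.4), (1.0, 0.9)]
print(encode_frame((0.6, 0.5), codebook))  # 1 (the middle entry)
```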

6 / 30

SLIDE 7

Operation of a Codec (3)

  • Quality of sound depends on # entries in codebook
  • Classification of coders according to bit-rate:

Category            Bit-rate range
High bit-rate       > 15 kbps
Medium bit-rate     5 to 15 kbps
Low bit-rate        2 to 5 kbps
Very low bit-rate   < 2 kbps

7 / 30

SLIDE 8

Variable Bit Rate

  • Variable bit rate (VBR): adaptively choose the bit rate for each packet

  • Balance between audio quality and bandwidth
  • In a two-way conversation: speaker silent 63% of the time

8 / 30

SLIDE 9

Variable Bit Rate (2)

LEAKAGE:

  • Bit rate depends on encoded data
  • e.g., Speex encodes vowel sounds (aa, aw) at a higher bit rate than fricative sounds (f, s)

9 / 30

SLIDE 10

1 How does VoIP work?
2 Recognizing previously seen phrases
3 Recognizing phrases without example utterances
4 Evaluation

10 / 30

SLIDE 11

Problem Description

Given:

  • utterances of n phrases

phrase 1 phrase 2 phrase 3

  • packet sizes of one of the phrases

(5k,7k,3k,8k,12k,2k,1k)

Goal:

  • recognize the phrase

(5k,7k,3k,8k,12k,2k,1k) → “the phrase”

11 / 30

SLIDE 12

Profile Hidden Markov Model (HMM)

  • Match states - expected distribution of packet sizes at each position in the sequence
  • Insert states - emit packets according to some (uniform) distribution. Allows “insertion” of additional packets.
  • Delete states - silent states. Allows “omitting” packets.
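The three-state layout can be sketched as a transition map; the state names and topology below follow the standard profile-HMM construction (after Eddy's review, cited at the end), simplified for illustration rather than the authors' exact model:

```python
# Profile-HMM topology: n Match positions, each with an Insert and a
# Delete state. Every state may advance to the next Match or Delete,
# or enter the current position's Insert state.

def profile_topology(n: int) -> dict:
    """Map each state name to its allowed successor states."""
    topo = {}
    for i in range(1, n + 1):
        succ = [f"M{i+1}", f"D{i+1}"] if i < n else ["End"]
        topo[f"M{i}"] = succ + [f"I{i}"]
        topo[f"D{i}"] = succ + [f"I{i}"]   # Delete is silent: emits no packet
        topo[f"I{i}"] = succ + [f"I{i}"]   # Insert may self-loop (extra packets)
    return topo

topo = profile_topology(3)
print(len(topo))    # 9 states for a 3-position profile
print(topo["M1"])   # ['M2', 'D2', 'I1']
```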

12 / 30

SLIDE 15

Building a Profile HMM

Initially:

  • set Match state probabilities to uniform distribution
  • transition probabilities: make Match the most likely transition

Train the HMM using example utterances:

  • Apply the Baum-Welch algorithm: iteratively improves the probability of the training sequences
  • Baum-Welch finds a locally optimal set of parameters ⇒ apply simulated annealing
  • Apply Viterbi training to further refine the parameters.

13 / 30

SLIDE 17

Searching for a Phrase

Changes:

  • Random - emits packets according to a uniform distribution; matches packets not part of the phrase of interest
  • Profile Start/End - match the start/end of the phrase
  • from Profile Start: the transition to the first Match state is the most likely

15 / 30

SLIDE 19

Searching for a Phrase (2)

  • Apply the Viterbi algorithm - find the most likely sequence of states to explain the observed packet sizes
  • A “hit”: subsequence of states that belong to the profile part of the model
  • Evaluate the hit’s goodness:

li, . . . , lj – packet lengths of the phrase of interest

scorei,j = log ( Pr[li, . . . , lj | Profile] / Pr[li, . . . , lj | Random] )

  • Discard hits below a threshold
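Given per-model log-probabilities, the score reduces to a log-odds subtraction; the function names and numbers here are illustrative:

```python
# Log-odds score of a candidate hit, as in the formula above:
# score = log( Pr[lengths | Profile] / Pr[lengths | Random] ),
# computed from log-probabilities to avoid numeric underflow.

def hit_score(log_p_profile: float, log_p_random: float) -> float:
    return log_p_profile - log_p_random

def accept_hit(score: float, threshold: float) -> bool:
    """Keep only hits scoring at or above the threshold."""
    return score >= threshold

print(accept_hit(hit_score(-40.0, -55.0), threshold=10.0))  # True
```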

16 / 30

SLIDE 20

1 How does VoIP work?
2 Recognizing previously seen phrases
3 Recognizing phrases without example utterances
4 Evaluation

17 / 30

SLIDE 22

Phrase Models from Phonemes

  • Phonemes – sounds like b, ch, t, s, aa, aw (English has 40 to 60 phonemes)
  • Idea: words are built up from concatenated phonemes ⇒ model phonemes instead

Advantages:

  • Flexibility
  • Cheaper

18 / 30

SLIDE 23

Problem Description

Given:

  • recordings of all phonemes

aa, ae, ah, ao, aw, ay, b, ch, d, dh, eh, er, ey, f, g, hh, etc.

  • packet sizes of a phrase

(5k,7k,3k,8k,12k,2k,1k)

Goal:

  • recognize the phrase

(5k,7k,3k,8k,12k,2k,1k) → “the phrase”

19 / 30

SLIDE 25

Phrase Models from Phonemes (2)

Straightforward method:

1 build HMMs for phonemes
2 concatenate them to build a word HMM
3 concatenate word HMMs to a phrase HMM

American English: “the phrase”

(5k,7k,1k,8k,12k,2k,1k)
↓
(dh,ah), (f,r,ey,z)
↓
(“the”), (“phrase”)
↓
“the phrase”

20 / 30

SLIDE 26

Phrase Models from Phonemes (2)

Straightforward method:

1 build HMMs for phonemes
2 concatenate them to build a word HMM
3 concatenate word HMMs to a phrase HMM

Scottish English: “the phrase”

(5k,7k,1k,8k,10k,2k,1k)
↓
(dh,ah), (f,r,eh,z)
↓
(“the”), (“frese”?)
↓
?

20 / 30

SLIDE 28

Problem Description

Given:

  • recordings of all phonemes

aa, ae, ah, ao, aw, ay, b, ch, d, dh, eh, er, ey, f, g, hh, etc.

  • packet sizes of a phrase

(5k,7k,3k,8k,12k,2k,1k)

  • phonetic pronunciation dictionary

Goal:

  • recognize the phrase

(5k,7k,3k,8k,12k,2k,1k) → “the phrase”

21 / 30

SLIDE 31

Phrase Models from Phonemes (3)

Advanced method:

  • build initial profile HMM for phrase (as usual)
  • train it using synthetic training set
  • search for phrase (as usual)

Synthetic training set:

  • phrase: “the phrase”
  • split into words: “the” “phrase”
  • create list of phonemes: “dh ah” “f r ey z”
  • replace with packet sizes: “9k 20k” “5k 8k 14k 3k”

Improved Model: use diphones and triphones instead of words
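The synthetic-training-set construction can be sketched as below; the per-phoneme packet sizes are invented single values, whereas the method draws from the sizes observed for each phoneme:

```python
import random

# Map each phoneme of the phrase to a packet size drawn from that
# phoneme's list of plausible sizes (hypothetical values here).

PHONEME_SIZES = {
    "dh": [9], "ah": [20],                     # "the"
    "f": [5], "r": [8], "ey": [14], "z": [3],  # "phrase"
}

def synthesize(phonemes):
    """One synthetic packet-size sequence for a phoneme sequence."""
    return [random.choice(PHONEME_SIZES[p]) for p in phonemes]

print(synthesize(["dh", "ah", "f", "r", "ey", "z"]))  # [9, 20, 5, 8, 14, 3]
```

Repeating this with per-phoneme size distributions yields many training sequences without ever recording the target speaker.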

22 / 30

SLIDE 32

1 How does VoIP work?
2 Recognizing previously seen phrases
3 Recognizing phrases without example utterances
4 Evaluation

23 / 30

SLIDE 33

Experimental Setup

  • Use the TIMIT continuous speech corpus
  • Concatenate sentences into a “conversation”
  • Training of the HMM:
  • TIMIT pronunciation dictionary (“proper” American English)
  • PRONLEX pronunciation dictionary (more colloquial English)

24 / 30

SLIDE 34

Evaluation Metrics

  • recall: Probability that algorithm finds phrase
  • precision: Probability that reported match is correct
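In terms of true positives, false negatives, and false positives, the two metrics are:

```python
# recall    = fraction of actual phrase occurrences the search finds
# precision = fraction of reported matches that are correct

def recall(true_pos: int, false_neg: int) -> float:
    return true_pos / (true_pos + false_neg)

def precision(true_pos: int, false_pos: int) -> float:
    return true_pos / (true_pos + false_pos)

# Counts chosen to reproduce the 51% / 50% figures on the next slide.
print(recall(51, 49), precision(51, 51))  # 0.51 0.5
```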

25 / 30

SLIDE 37

Results of the Experiment

recall   precision
51%      50%

  • Some phrases were found with high accuracy:

“Young children should avoid exposure to contagious diseases.” (recall = 0.99, precision = 1)

  • A high deviation of results for individual speakers

26 / 30

SLIDE 39

Robustness to Noise

Using pink noise:

  • energy logarithmically distributed across the range of human hearing
  • harder for noise-removal algorithms to filter out

sound   noise   recall   precision
100%    0%      .51      .50
90%     10%     .39      .40
75%     25%     .23      .22

⇒ attacker can identify an alarming number of the phrases

27 / 30

SLIDE 40

Mitigation Techniques

Padding packets to a coarser granularity:

granularity             recall   precision   overhead
multiples of 128 bits   0.15     0.16        8.81%
multiples of 256 bits   0.04     0.04        16.5%

  • In these tests: continuous speech
  • In practice: 63% idle time in conversations ⇒ greater overhead
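The padding countermeasure can be sketched as follows; the packet sizes are illustrative values in bits, not measurements from the paper:

```python
# Pad each packet length up to the next multiple of `block` and
# report the resulting bandwidth overhead.

def pad(length: int, block: int) -> int:
    return -(-length // block) * block   # ceiling division

def overhead(lengths, block):
    padded = sum(pad(l, block) for l in lengths)
    return (padded - sum(lengths)) / sum(lengths)

sizes = [137, 250, 301, 96]              # hypothetical packet sizes in bits
print([pad(l, 128) for l in sizes])      # [256, 256, 384, 128]
print(round(overhead(sizes, 128), 3))    # 0.306
```

A coarser block size hides more of the length signal but wastes more bandwidth, which is the trade-off the table above quantifies.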

28 / 30

SLIDE 41

Overview

1 How does VoIP work?
2 Recognizing previously seen phrases
3 Recognizing phrases without example utterances
4 Evaluation

29 / 30

SLIDE 42

References

  • Charles V. Wright, Lucas Ballard, Scott E. Coull, Fabian Monrose, and Gerald M. Masson. Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations. In SP ’08: Proceedings of the 2008 IEEE Symposium on Security and Privacy, pages 35–49, Washington, DC, USA, 2008. IEEE Computer Society.
  • Charles V. Wright, Lucas Ballard, Fabian Monrose, and Gerald M. Masson. Language identification of encrypted VoIP traffic: Alejandra y Roberto or Alice and Bob? In SS ’07: Proceedings of the 16th USENIX Security Symposium, pages 1–12, Berkeley, CA, USA, 2007. USENIX Association.
  • Wai C. Chu. Speech Coding Algorithms: Foundation and Evolution of Standardized Coders. John Wiley & Sons, Inc., New York, NY, USA, 2003.
  • Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, pages 257–286, 1989.
  • S. R. Eddy. Profile hidden Markov models (review). Bioinformatics, 14(9):755–763, 1998.

30 / 30