SLIDE 1

Inferring the Source of Encrypted HTTP Connections

Michael Lin CSE 544

SLIDE 2

Hiding your identity

  • You can wear a mask, but some distinguishing characteristics are visible:
  • Height
  • Weight
  • Hair
  • Clothing
  • Even if everyone looked the same, we can determine some things about people based on their habits
  • People who go to school every day are probably students or teachers
  • If you follow a strict schedule every day (school, coffee shop, gym), you can be identified with some degree of accuracy
  • “There are 10 people who follow this exact schedule every day.”
SLIDE 3

Profiling

  • How would you identify someone in a world of clones?
  • Determine their schedule
  • Determine their habits
  • Profiling allows us to identify something without directly observing what it is
SLIDE 4

Hiding your online identity

  • Encryption will save us from prying eyes. Or will it?
  • We can hide the header and contents of a packet behind encryption
  • But can we still say something about the packet itself?
  • Packet size
  • Packet direction
  • What about traffic patterns?
  • Packet arrival rate/distribution
SLIDE 5

HTTP traffic profiling

  • Using only packet size and direction, create profiles of traces of HTTP traffic for certain websites
  • Instance - <packet size, direction>
  • Class - URL
  • Create sets of instances for each class and use these sets to classify new traces whose destination is unknown (see the sketch below)
  • These sets are surprisingly unique
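
To make the representation concrete, here is a minimal Python sketch (mine, not the authors' code) of turning a trace into <packet size, direction> instances and building a per-URL profile from several training traces; the example packet values are invented for illustration.

from collections import Counter

def trace_to_instances(packets):
    # packets: iterable of (size_in_bytes, direction), where direction is "in" or "out"
    return Counter((size, direction) for size, direction in packets)

def build_profile(traces):
    # A site's profile: the union of instances seen across its training traces
    profile = set()
    for trace in traces:
        profile |= set(trace_to_instances(trace))
    return profile

# Hypothetical training traces for one URL
example_traces = [
    [(512, "out"), (1460, "in"), (1460, "in"), (640, "in")],
    [(512, "out"), (1380, "in"), (640, "in")],
]
profile = build_profile(example_traces)
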
SLIDE 6

Comparing HTTP traces

  • Two relatively simple methods to get a rating of the similarity of two sets
  • Jaccard’s coefficient
  • Intersection of the two sets divided by their union (see the sketch below)
  • Intuitively, the more instances two traces share, the more similar they are
  • Naive Bayes classifier (Idiot’s Bayes)
  • “Naive” because it assumes every feature is independent of the others
  • A surprisingly good indicator of similarity
  • Important: you need something to compare against!
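
Jaccard's coefficient is simple enough to show directly. Below is a minimal sketch over the instance sets built above; the naive Bayes classifier, which works on per-instance counts rather than sets, is left out.

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|: 1.0 for identical instance sets, 0.0 for disjoint sets
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

A score near 1 means the observed trace shares almost all of its <packet size, direction> instances with a stored profile.
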
SLIDE 7

Collecting HTTP traces

  • Gathered 100,000 URLs from DNS server logs
  • Used Firefox to access the top 2000 pages through an SSH tunnel 4 times a day over 2 months
  • Used tcpdump to collect header information from these connections
  • Analyzed the logs to get packet length and direction for connections to each site (see the sketch below)
  • Created a library of profiles for the sites
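
As a rough illustration of that analysis step, the sketch below reads a capture file with scapy (one possible tool; the slides only mention tcpdump) and extracts <length, direction> instances. client_ip is a hypothetical parameter naming the monitored client's address.

from scapy.all import rdpcap, IP, TCP

def pcap_to_instances(pcap_path, client_ip):
    instances = []
    for pkt in rdpcap(pcap_path):
        if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
            continue  # only TCP/IP packets carry the tunneled HTTP connection
        direction = "out" if pkt[IP].src == client_ip else "in"
        instances.append((len(pkt), direction))
    return instances
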
SLIDE 8

This is where the magic happens

  • Now we have two methods for comparing sets and a big library of site profiles
  • Say we intercepted some encrypted HTTP traffic and want to guess where it’s going...
  • Compare it with all sites in the library to find the best match, or the top two, or the top ten (see the sketch below)
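
A minimal sketch of that matching step, assuming a dict called library that maps each URL to its profile set and reusing the jaccard function from earlier:

def top_k_matches(observed, library, k=10):
    # Rank every site in the library by similarity to the observed trace
    scored = [(url, jaccard(observed, profile)) for url, profile in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# e.g. guesses = top_k_matches(set(trace_to_instances(captured_packets)), library, k=10)
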
SLIDE 9

How well does it work?

  • Surprisingly well
  • Lots of variables to play with:
  • Size of “training set” - the data used to create the library profile
  • Size of test set
  • Time between collection of training and test set
  • Desired accuracy (top 1 most likely site or top k, k = 2, 3, 5, 10...)
  • Number of sites in library
  • Jaccard’s coefficient is generally better than naive Bayes
  • Bottom line: for a training set of 4 samples and a test set of 4 samples, they got ~75% accuracy

SLIDE 10

Effect of variables

  • Increasing the size of the training set up to 4 greatly improves accuracy; after 4 there are diminishing returns
  • Increasing k increases accuracy (duh)
  • Time between training set and test set matters, but the difference is less than 10%, even after 4 weeks
  • It doesn’t matter if the training set comes from before or after the test set
  • The fewer total sites there are in the library, the better the accuracy, but the drop in accuracy is relatively slow from 200 to 2000 sites (will this hold true at 40 million?)

SLIDE 11

Is this good enough?

  • This is a philosophical question
  • Given the relatively small amount of data collected for each site, I think this is good enough to be interesting
  • This kind of accuracy requires a training and test set of size 4+
  • How likely are you to get a test set of that size?
  • Even with perfect data, a maximum of ~75% accuracy is limiting

SLIDE 12

How can we make it worse?

  • This analysis is based entirely on packet size
  • Change the packet size, change the results
  • 4 simple packet size padding methods:
  • Linear
  • Exponential
  • Mice & elephants
  • MTU
  • All increase packet sizes in a deterministic manner (see the sketch below)
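
The sketch below shows one plausible reading of these schemes; the specific parameters (128-byte increments, a 1500-byte MTU, a 128-byte mice/elephants threshold) are assumptions based on common formulations, not values taken from the slides.

MTU = 1500  # assumed Ethernet MTU

def pad_linear(size, step=128):
    # Round up to the next multiple of `step`, capped at the MTU
    return min(MTU, -(-size // step) * step)

def pad_exponential(size):
    # Round up to the next power of two, capped at the MTU
    padded = 1
    while padded < size:
        padded *= 2
    return min(MTU, padded)

def pad_mice_elephants(size, threshold=128):
    # Small packets become `threshold` bytes, everything else becomes the MTU
    return threshold if size <= threshold else MTU

def pad_mtu(size):
    # Every packet is padded to the full MTU
    return MTU
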
SLIDE 13

The effectiveness of padding

  • Linear padding cuts accuracy in half
  • Exponential padding makes it useless
  • The increase in total data transmitted remains small for linear and exponential padding
  • Results for top-10 accuracy are much better

  Padding  Accuracy  Relative size
  none     0.721     1
  linear   0.477     1.034
  exp      0.056     1.089
  m & e    0.003     1.478
  MTU      0.001     2.453

SLIDE 14

The not so great...

  • For this to be useful, you need a library of every website
  • Collecting this much data isn’t easy
  • How accurate will this be? With 38 million websites there are going to be a lot of sites that look the same
  • They show that trivial packet padding makes this useless
  • No results for test sets of size < 4
SLIDE 15

Future work

  • The current analysis is vulnerable to packet padding; they are looking to use packet arrival times to overcome this
  • Even for non-padded packets, packet timing can be important (but also hard to use)

  • Padding packets non-deterministically may be even stronger against profiling
  • How reasonable is building a huge library of profiles for the entire Internet?
  • In the end, is 75% accuracy good enough?
SLIDE 16

Takeaway

You can say a lot about a book by its cover.