
SLIDE 1

Website Fingerprinting at Internet Scale

Andriy Panchenko (1), Fabian Lanze (1), Andreas Zinnen (2), Martin Henze (3), Jan Pennekamp (1), Klaus Wehrle (3), Thomas Engel (1)

(1) Interdisciplinary Centre for Security, Reliability and Trust (SnT), Luxembourg
(2) RheinMain University of Applied Sciences, Germany
(3) RWTH Aachen University, Germany

SLIDE 2

Background

Why people use Tor...

  • Privacy has become a general concern
  • Access to the Internet is censored in many countries

SLIDE 4

Website Fingerprinting

[Figure: a client reaches a server through onion routers (OR), passing an entry, a middle, and an exit node]

Tor: The Onion Router

  • Most popular low-latency anonymization network
  • Many users rely on Tor to access unfiltered information

SLIDE 6

Website Fingerprinting

[Figure: client, onion routers (OR) with entry, middle, and exit nodes, and server, annotated with "?"]

What is website fingerprinting?

  • Identify the website a user accesses without breaking the cryptography
  • The attacker is a passive observer
  • Features are based on packet size, direction, ordering, and timing

SLIDE 7

Website Fingerprinting - state of the art

A widely discussed, active topic in anonymity research

State-of-the-art approach: Wang et al. (Usenix Sec’14)

  • k-Nearest Neighbor approach
  • manually selected features (e.g., bursts, unique lengths)
  • about 4,000 features
  • recognition rates > 90%

2 scenarios for evaluation

  • Closed world: the user visits only a fixed number of websites
  • Open world: a set of sites is monitored (the user may also visit unknown sites)

SLIDE 8

Our method

Idea

  • Don’t try to guess which characteristics may be relevant
  • Use a representation that implicitly covers all characteristics

Our feature set: (N_in, N_out, S_in, S_out, C_1, ..., C_n)

  • N_in, N_out, S_in, S_out: basic properties
  • C_1, ..., C_n: cumulative features

[Figure: cumulative sum of packet sizes versus packet number for two traces T1 and T2, showing the curves C(T1) and C(T2) and the points Ci sampled from each]
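The feature set above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' scripts. It assumes that a trace is a list of signed packet sizes (positive for incoming, negative for outgoing), that N and S denote packet counts and byte sums per direction, and that the C_i are taken by linear interpolation at n equidistant positions along the packet-number axis, as the plot suggests. The function name `cumul_features` and the sign convention are hypothetical choices made for this sketch.

```python
from itertools import accumulate


def cumul_features(trace, n=100):
    """Return [N_in, N_out, S_in, S_out, C_1, ..., C_n] for one trace.

    trace: list of signed packet sizes in bytes
           (assumption: positive = incoming, negative = outgoing).
    n:     number of sampled cumulative features (assumed default: 100).
    """
    if not trace:
        raise ValueError("trace must contain at least one packet")
    if n < 2:
        raise ValueError("n must be at least 2")

    # Basic properties: packet counts and byte sums per direction.
    n_in = sum(1 for p in trace if p > 0)
    n_out = sum(1 for p in trace if p < 0)
    s_in = sum(p for p in trace if p > 0)
    s_out = -sum(p for p in trace if p < 0)

    # Cumulative sum of signed packet sizes, starting at 0.
    cumulative = [0] + list(accumulate(trace))
    last = len(cumulative) - 1

    # Sample the piecewise-linear curve at n equidistant positions,
    # so traces of any length yield the same number of features.
    sampled = []
    for i in range(n):
        x = last * i / (n - 1)
        lo = min(int(x), last - 1)
        frac = x - lo
        sampled.append(cumulative[lo] + frac * (cumulative[lo + 1] - cumulative[lo]))

    return [n_in, n_out, s_in, s_out] + sampled
```

With the default n = 100 the vector has 104 entries regardless of trace length, which is what allows traces of different sizes to be compared by a single classifier.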

SLIDE 9

Example

[Figure: feature value in kByte (0 to 200) versus feature index (0 to 100) for about.com and google.de]

  • Fixed number of distinctive characteristics from traces with varying lengths
  • Fingerprints can be visualized
  • Used as input for a Support Vector Machine

SLIDE 10

Layers of data representation

[Figure: three layers of the same traffic: Tor cells (Cell 1 to Cell 5), TLS records (Record 1, Record 2), and TCP packets (Packet 1 to Packet 3)]

  • Information source for feature extraction: Cell vs. TLS vs. TCP
  • Practically negligible effect on the classification accuracy

SLIDE 11

Comparison with state of the art – classification

Closed world

Accuracy [%] for the 100 most popular websites

Method                      90 instances   40 instances
k-NN (3736 features)        90.84          89.19
Our method (104 features)   91.38          92.03

Open world

Foreground: 100 blocked websites, background: 9,000 popular websites

Method       TPR [%]   FPR [%]
k-NN         90.59     2.24
Our method   96.92     1.98

SLIDE 12

Comparison of computational performance

[Figure: average processing time in hours (log scale, 10^-4 to 10^3) versus background set size (10,000 to 50,000) for k-NN, CUMUL, and CUMUL (parallelized)]

Computation time for 100 random monitored pages in open world

SLIDE 13

Website fingerprinting in reality

Critique

Data sets used are not representative!

too small, only popular websites / index pages

Simplified assumptions, wrong metrics for evaluation

RND-WWW: How do people access the world wide web?

  • Twitter
  • Alexa-one-click
  • Googling the trends
  • Googling at random
  • Censored in China

Together: > 120,000 web pages

Tor-Exit: Which pages do users actually access over Tor?

Monitor a Tor Exit node ⇒ 211,148 web pages

SLIDE 14

Webpage fingerprinting at Internet scale

Question: Does the attack scale under realistic assumptions?

Which metric to evaluate?

  • Accuracy: fraction of true results
  • True Positive Rate / Recall: fraction of monitored pages detected
  • False Positive Rate: fraction of false alarms

Problem: misleading interpretation ⇒ base rate fallacy

Precision: probability that the classifier is correct given it has detected a monitored page

Focus of evaluation

  • Precision and recall for increasing background set sizes
  • Random subset as foreground
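The base rate fallacy mentioned on this slide can be made concrete with a small calculation. The sketch below derives precision from a true positive rate, a false positive rate, and the share of page loads that actually target a monitored page. The TPR and FPR in the usage note are the open-world figures reported for the method in this deck (96.92% and 1.98%). The base rates are purely hypothetical values chosen for illustration, and the function name `precision_from_rates` is an assumption of this sketch.

```python
def precision_from_rates(tpr, fpr, base_rate):
    """Precision = P(page is monitored | classifier raised an alarm).

    tpr:       true positive rate (recall), as a fraction in [0, 1]
    fpr:       false positive rate, as a fraction in [0, 1]
    base_rate: assumed fraction of page loads that go to monitored pages
    """
    if not 0.0 <= base_rate <= 1.0:
        raise ValueError("base_rate must lie in [0, 1]")
    true_alarms = tpr * base_rate            # monitored pages that are detected
    false_alarms = fpr * (1.0 - base_rate)   # background pages that are flagged
    if true_alarms + false_alarms == 0:
        raise ValueError("classifier never raises an alarm")
    return true_alarms / (true_alarms + false_alarms)


if __name__ == "__main__":
    # Hypothetical base rates: from a balanced test set to a rare event.
    for base_rate in (0.5, 0.01, 0.0001):
        p = precision_from_rates(0.9692, 0.0198, base_rate)
        print(f"base rate {base_rate:>7}: precision {p:.4f}")
```

Under these assumptions the same TPR and FPR give a precision close to 1 on a balanced set, but only a small fraction of alarms are correct once monitored pages become rare. This is why the evaluation focuses on precision and recall rather than on TPR and FPR alone.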

SLIDE 15

Webpage fingerprinting at Internet scale

Question: Does the attack scale under realistic assumptions?

Results for RND-WWW

[Figure: two panels relating recall and precision (0 to 1) to the fraction of foreground pages [%] for background set sizes b = 1000, 5000, 9000, 20000, 50000, and 111884]

SLIDE 17

Webpage fingerprinting at Internet scale

Question: Does the attack scale under realistic assumptions?

Results for Tor-Exit

[Figure: two panels relating recall and precision (0 to 1) to the fraction of foreground pages [%] for background set sizes b = 1000, 5000, 9000, 20000, 50000, 111884, and 211148]

Answer: No.

SLIDE 19

Webpage fingerprinting at Internet scale

Question: Is it at least possible for certain pages?

Minimum number of mistakenly confused pages

[Figure: fraction of foreground pages [%] and number of webpage confusions (50 to 400) for b = 20,000, b = 50,000, and b = 100,000]

No single page without a confusingly similar page in a realistic universe.

SLIDE 20

How about fingerprinting websites? (1/2)

  • A website is a collection of web pages served under the same domain
  • Is it possible to fingerprint a website when only a subset of its pages is available for training?

Experiment: 20 websites

Websites: ALJAZEERA, AMAZON, BBC, CNN, EBAY, FACEBOOK, IMDB, KICKASS, LOVESHACK, RAKUTEN, REDDIT, RT, SPIEGEL, STACKOVERFLOW, TMZ, TORPROJECT, TWITTER, WIKIPEDIA, XHAMSTER, XNXX

[Figure: two confusion matrices over the 20 websites (color scale 0.0 to 1.0): (a) only index pages, (b) different pages]
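The definition of a website as the set of pages served under the same domain can be sketched as a labelling step that turns page-level URLs into website-level classes. This is an illustrative sketch only: it assumes the label is the hostname with a leading "www." removed, which is a simplification (a real grouping might need a public-suffix list to handle subdomains). The function names and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def website_label(url):
    """Return a website-level label for a page URL.

    Assumption: the label is the lowercased hostname without a leading
    "www."; subdomains are otherwise kept as they are.
    """
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def group_by_website(urls):
    """Group page URLs into websites, keyed by their website label."""
    groups = defaultdict(list)
    for url in urls:
        groups[website_label(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    pages = [
        "https://www.example.org/",
        "https://www.example.org/news/article-1",
        "https://example.net/about",
    ]
    for site, members in group_by_website(pages).items():
        print(site, len(members))
```

With labels of this kind, a classifier can be trained on some pages of a site and tested on others, which is the setting of panel (b).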

SLIDE 21

How about fingerprinting websites? (2/2)

  • Transition of results from the closed-world to the realistic open-world setting is typically not trivial
  • Website fingerprinting scales better than webpage fingerprinting

[Figure: two panels, each showing precision and recall (0.0 to 1.0) versus background set size (20,000 to 120,000)]

SLIDE 22

Summary

  • Our classifier with 104 features outperforms the state of the art
  • Alarming results obtained under simplified assumptions cannot be generalized
  • Webpage fingerprinting does not scale to appropriate universe sizes for any webpage
  • Website fingerprinting is not only more realistic but also significantly more effective
  • Conclusions drawn need to be reconsidered
  • Scripts and RND-WWW dataset: http://lorre.uni.lu/~andriy/zwiebelfreunde/

SLIDE 23

We are hiring!

Our lab within the Interdisciplinary Centre for Security, Reliability and Trust (Uni Luxembourg) is looking for PhD candidates and PostDocs in the area of anonymity and privacy.

More information: http://secan-lab.uni.lu/jobs