SLIDE 1 Website Fingerprinting at Internet Scale
Andriy Panchenko¹, Fabian Lanze¹, Andreas Zinnen², Martin Henze³, Jan Pennekamp¹, Klaus Wehrle³, Thomas Engel¹
¹ Interdisciplinary Centre for Security, Reliability and Trust (SnT), Luxembourg
² RheinMain University of Applied Sciences, Germany
³ RWTH Aachen University, Germany
SLIDE 2
Background
Why people use Tor...
Privacy has become a general concern
Access to the Internet is censored in many countries
SLIDE 3
Website Fingerprinting
[Figure: client connected to a server through a network of onion routers (OR)]
Tor: The Onion Router
Most popular low-latency anonymization network
Many users rely on Tor to access unfiltered information
SLIDE 4
Website Fingerprinting
[Figure: client connected to a server through onion routers (OR), with the entry, middle, and exit routers of the circuit labelled]
Tor: The Onion Router
Most popular low-latency anonymization network
Many users rely on Tor to access unfiltered information
SLIDE 6
Website Fingerprinting
[Figure: Tor circuit from client to server via entry, middle, and exit onion routers (OR), marked with a question mark]
What is website fingerprinting?
Identify website accessed without breaking cryptography
Attacker is a passive observer
Features based on packet size, direction, ordering, timing
SLIDE 7
Website Fingerprinting - state of the art
Widely discussed and hot topic in anonymity research
State-of-the-art approach: Wang et al. (Usenix Sec’14)
k-Nearest Neighbor approach
manually selected features (e.g., bursts, unique lengths)
about 4,000 features
recognition rates > 90%
2 scenarios for evaluation
Closed world: user visits only a fixed number of websites
Open world: monitor set of sites (user may visit unknown sites)
SLIDE 8 Our method
Idea
Don’t try to guess which characteristics may be relevant
Use a representation that implicitly covers all characteristics
Our feature set: $(N_{in}, N_{out}, S_{in}, S_{out}, C_1, \dots, C_n)$
[Figure: cumulative sum of packet sizes plotted against packet number for two traces T1 and T2, showing the curves C(T1) and C(T2) and the points Ci sampled from each]
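The feature set above can be sketched in a few lines. This is a minimal illustration, not the authors' scripts: it assumes that a trace is a list of signed packet sizes (positive for incoming, negative for outgoing), that N and S denote packet counts and byte sums per direction, and that the Ci are taken at n equidistant positions by linear interpolation. The function name `cumul_features` is hypothetical.

```python
from itertools import accumulate


def cumul_features(packets, n=100):
    """Build [N_in, N_out, S_in, S_out, C_1, ..., C_n] from one trace.

    Assumed convention (not stated on the slide): positive sizes are
    incoming packets, negative sizes are outgoing packets.
    """
    if not packets:
        raise ValueError("trace must contain at least one packet")

    incoming = [p for p in packets if p > 0]
    outgoing = [-p for p in packets if p < 0]

    # Cumulative sum of signed packet sizes: the curve C(T) in the figure.
    cumulative = list(accumulate(packets))

    # Sample n equidistant positions along the packet axis and linearly
    # interpolate between the two neighbouring cumulative values.
    last = len(cumulative) - 1
    samples = []
    for i in range(n):
        x = i * last / (n - 1) if n > 1 else 0.0
        lo = int(x)
        hi = min(lo + 1, last)
        frac = x - lo
        samples.append(cumulative[lo] + frac * (cumulative[hi] - cumulative[lo]))

    return [len(incoming), len(outgoing), sum(incoming), sum(outgoing)] + samples
```

With n = 100 this yields 4 + 100 = 104 values per trace regardless of trace length, which matches the feature count quoted for the method.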
SLIDE 9 Example
[Figure: feature value [kByte] plotted against feature index for about.com and google.de]
Fixed number of distinctive characteristics from traces with varying lengths
Fingerprints can be visualized
Used as input for a Support Vector Machine
SLIDE 10
Layers of data representation
[Figure: the same traffic shown at three layers: Tor cells (Cell 1 to Cell 5), TLS records (Record 1, Record 2), and TCP packets (Packet 1 to Packet 3)]
Information source for feature extraction: Cell vs. TLS vs. TCP
Practically negligible effect on the classification accuracy
SLIDE 11
Comparison with state of the art – classification
Closed world
Accuracy [%] for 100 most popular websites

                           90 instances   40 instances
k-NN (3736 features)       90.84          89.19
Our method (104 features)  91.38          92.03
Open world
Foreground: 100 blocked websites, background: 9,000 popular websites

            TPR [%]   FPR [%]
k-NN        90.59     2.24
Our method  96.92     1.98
SLIDE 12 Comparison of computational performance
[Figure: average processing time [h] (logarithmic scale, 10^-4 to 10^3) plotted against background set size (up to 50,000) for k-NN, CUMUL, and CUMUL (parallelized)]
Computation time for 100 random monitored pages in open world
SLIDE 13
Website fingerprinting in reality
Critique
Data sets used are not representative!
too small, only popular websites / index pages
Simplified assumptions, wrong metrics for evaluation
RND-WWW: How do people access the world wide web?
Sources: Twitter, Alexa-one-click, Googling the trends, Googling at random, Censored in China
> 120,000 web pages
Tor-Exit: Which pages do users actually access over Tor?
Monitor a Tor Exit node ⇒ 211,148 web pages
SLIDE 14
Webpage fingerprinting at Internet scale
Question: Does the attack scale under realistic assumptions?
Which metric to evaluate?
Accuracy: fraction of true results
True Positive Rate / Recall: fraction of monitored pages detected
False Positive Rate: fraction of false alarms
Problem: misleading interpretation ⇒ base rate fallacy
Precision: probability that the classifier is correct given it has detected a monitored page
Focus of evaluation
Precision and recall for increasing background set sizes
Random subset as foreground
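The base rate fallacy can be made concrete with a short calculation. The sketch below takes the open-world TPR and FPR reported earlier for the method (96.92 % and 1.98 %) and computes precision for a growing number of background page loads. The count of 1,000 foreground loads is a hypothetical value chosen for illustration, and the rates are held constant across universe sizes, which is a simplifying assumption rather than a measured result.

```python
def precision(tpr, fpr, foreground, background):
    """Precision = TP / (TP + FP) for given rates and page-load counts."""
    true_positives = tpr * foreground
    false_positives = fpr * background
    if true_positives + false_positives == 0:
        raise ValueError("classifier raised no alarms")
    return true_positives / (true_positives + false_positives)


# Rates from the open-world comparison (TPR 96.92 %, FPR 1.98 %).
TPR, FPR = 0.9692, 0.0198

# Hypothetical scenario: 1,000 loads of monitored pages, mixed with an
# increasing number of loads of unmonitored (background) pages.
for background_loads in (1_000, 10_000, 100_000, 1_000_000):
    p = precision(TPR, FPR, foreground=1_000, background=background_loads)
    print(f"background loads: {background_loads:>9,}  precision: {p:.3f}")
```

Even with an unchanged FPR, the share of alarms that are correct shrinks as unmonitored traffic dominates, which is why TPR and FPR alone are misleading.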
SLIDE 15 Webpage fingerprinting at Internet scale
Question: Does the attack scale under realistic assumptions?
Results for RND-WWW
[Figure: recall and precision, each plotted against the fraction of foreground pages [%], for background set sizes b = 1000, 5000, 9000, 20000, 50000, 111884]
SLIDE 16 Webpage fingerprinting at Internet scale
Question: Does the attack scale under realistic assumptions?
Results for Tor-Exit
[Figure: recall and precision, each plotted against the fraction of foreground pages [%], for background set sizes b = 1000, 5000, 9000, 20000, 50000, 111884, 211148]
SLIDE 17 Webpage fingerprinting at Internet scale
Question: Does the attack scale under realistic assumptions?
Results for Tor-Exit
Answer: No.
SLIDE 18
Webpage fingerprinting at Internet scale
Question: Is it at least possible for certain pages?
SLIDE 19 Webpage fingerprinting at Internet scale
Question: Is it at least possible for certain pages?
Minimum number of mistakenly confused pages
[Figure: number of webpage confusions (0 to 400) plotted against the fraction of foreground pages [%], for b = 20,000, b = 50,000, and b = 100,000]
No single page without a confusingly similar page in a realistic universe.
SLIDE 20 How about fingerprinting websites? (1/2)
A website is a collection of web pages served under the same domain
Is it possible to fingerprint a website when only a subset of its pages are available for training?
Experiment: 20 websites
Websites: ALJAZEERA, AMAZON, BBC, CNN, EBAY, FACEBOOK, IMDB, KICKASS, LOVESHACK, RAKUTEN, REDDIT, RT, SPIEGEL, STACKOVERFLOW, TMZ, TORPROJECT, TWITTER, WIKIPEDIA, XHAMSTER, XNXX
[Figure: two confusion matrices over the 20 websites, color scale from 0.0 to 1.0]
(a) only index pages (b) different pages
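The website definition used here (pages served under the same domain) and the "subset of pages for training" setup can be sketched as follows. This is an illustrative assumption about how such a data set could be organized, not the authors' tooling: the hostname with a leading "www." removed stands in for "domain", and both function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_pages_by_website(urls):
    """Map each website (hostname without leading 'www.') to its page URLs."""
    sites = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip strings that carry no hostname
        if host.startswith("www."):
            host = host[4:]
        sites[host].append(url)
    return dict(sites)


def split_train_test(sites, train_per_site):
    """Label every page with its website; hold out the pages beyond the
    first train_per_site of each site so they are unseen during training."""
    train, test = [], []
    for site, pages in sites.items():
        for i, page in enumerate(pages):
            (train if i < train_per_site else test).append((page, site))
    return train, test
```

The class label is the website rather than the individual page, so a classifier trained on the first split is evaluated on pages it has never seen from the same sites.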
SLIDE 21 How about fingerprinting websites? (2/2)
Transition of results from closed-world to the realistic open-world setting is typically not trivial
Website fingerprinting scales better than webpage fingerprinting
[Figure: two plots of precision and recall (0.0 to 1.0) against background set size (up to 120,000)]
SLIDE 22
Summary
Our classifier with 104 features outperforms the state of the art
Alarming results obtained under simplified assumptions can’t be generalized
Webpage fingerprinting does not scale to appropriate universe sizes for any webpage
Website fingerprinting is not only more realistic but also significantly more effective
Conclusions drawn need to be reconsidered
Scripts and RND-WWW dataset: http://lorre.uni.lu/~andriy/zwiebelfreunde/
SLIDE 23 We are hiring!
Our lab within the Interdisciplinary Centre for Security, Reliability and Trust (Uni Luxembourg) is looking for PhD candidates and PostDocs in the area
More information: http://secan-lab.uni.lu/jobs