Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services
Pelayo Vallina, Victor Le Pochat, Álvaro Feal, Marius Paraschiv, Julien Gamba, Tim Burke, Oliver Hohlfeld, Juan Tapiador, Narseo Vallina-Rodriguez

Domain classification services
› Web directories (manually edited)
› Classification engines (automated)
[Yan04, Qi09, Bru20]

Why does the quality of these services matter?
› End users: incorrect categories affect reliability
  Over/underblocking in content filtering
› Academia: domain sample or results depend on them
  2019 top conferences: 24 papers
  Lack of trust → resort to manual classification
[Res04, Ric02, Sch18, LeP19, Ahm20, Zeb20]

Services are opaque on how they operate
› Validation?
› Training set?
› Comprehensiveness?

Outline
› Methodology
› Empirical validation
› Deep dive: human labeling & case studies
› Discussion & conclusion

Service characteristics: Inputs, Outputs, Purpose, Updates, Access
› Aggregator
› Purpose: content filtering, threat assessment, marketing, discovery
› Updates: (mostly) automated, manual

Label gathering
› Top 1M Tranco domains, Sept 1-30, 2019: 4.4M domains
› Aggregate: 4.4M domains
› Direct: 4.4M domains
› Direct (rate-limited): ↓ 10k domains
› 279k cats (904k domains)

Service choice affects which domains are labeled
› Coverage ranges from <1% to 94% and is better for automated classification services (Updates)
› Popular domains have better coverage
› Subdomain coverage ranges from <1% to 99% (Inputs)
› Inconsistent when directly sourced or through VirusTotal (Access)
[Figure: coverage (%) per service, for the 4.4M and 10k domain sets]
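Coverage as used on this slide can be read as the share of queried domains for which a service returns any label. The following is a minimal sketch under that assumption; the function name, the dictionary layout, and the toy domains are hypothetical and not taken from the study.

```python
from typing import Dict, Iterable, Optional


def coverage(domains: Iterable[str], labels: Dict[str, Optional[str]]) -> float:
    """Fraction of `domains` for which the service returned a non-empty label."""
    domains = list(domains)
    if not domains:
        return 0.0
    labeled = sum(1 for d in domains if labels.get(d))
    return labeled / len(domains)


# Hypothetical toy data (assumption): two services queried for the same sample.
sample = ["example.com", "example.org", "example.net", "example.edu"]
service_a = {"example.com": "News", "example.org": "Education", "example.net": None}
service_b = {"example.com": "Shopping"}

coverage_a = coverage(sample, service_a)
coverage_b = coverage(sample, service_b)
print(f"service A: {coverage_a:.0%}, service B: {coverage_b:.0%}")
```

Running the same function on a popular subset and on the full sample is one way to check whether popular domains are covered better.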

Service choice affects the taxonomy granularity (Purpose)
› Security/content filtering: fewer categories
  As low as 12
  Easier setup
› Marketing: more categories
  Up to 7.5k
  Fine-grained targeting

Service choice affects label interpretation
› Inconsistencies between documented and observed labels (Access)
› Multiple labels are uncommon (Outputs)
› Subdomains inherit labels from parent (Inputs)
› 3 out of 9 services updated labels (Updates)
  Mostly for maliciousness

Service choice affects label distribution
› Disagreement on distribution of labels over domains (Purpose, Updates)
  As measured through mutual information
› Uneven distribution of labels over domains (Purpose)
  As measured through label frequency
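The two measures named on this slide can be sketched as follows. This is an illustration only: the slides do not say which estimator or normalization the authors used, so the plain mutual information in bits, the helper names, and the toy label pairs below are all assumptions.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Mutual information (bits) between two services' labels.

    `pairs` holds one (label_from_a, label_from_b) tuple per domain
    that both services labeled.
    """
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) rewritten with raw counts
        mi += p_ab * math.log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


def label_shares(labels):
    """Share of domains per label, most frequent first."""
    total = len(labels)
    return {label: c / total for label, c in Counter(labels).most_common()}


# Hypothetical toy data (assumption): labels differ in name but align perfectly.
aligned = [("News", "news")] * 2 + [("Shop", "shopping")] * 2
# Labels of one service carry no information about the other.
unrelated = [("A", "x"), ("A", "y"), ("B", "x"), ("B", "y")]

mi_aligned = mutual_information(aligned)
mi_unrelated = mutual_information(unrelated)
shares = label_shares(["News", "News", "News", "Shop"])
```

Mutual information does not require the two taxonomies to share label names, which is why it suits comparing services with different category sets; it says nothing about whether either label is correct.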

Dynamics of human labeling may trigger biases
Participation concentrated
› at beginning of project → outdated labels?
› with few users → lack of peer review?
› on unlabeled domains → stale labels?

Disagreement in human labeling may trigger biases
› Label assignment is not completely objective
› Empirically: clusters of correlated labels
› Experimentally: 35.5% disagreement among authors, 71% matches community label
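Figures like the two above could be computed along the following lines. The slides do not define either metric, so this sketch assumes that "disagreement" means annotators did not all pick the same label for a domain, and that "matches community label" compares the annotators' most common label with the community one. All names and data are hypothetical.

```python
from collections import Counter


def disagreement_rate(annotations):
    """Share of domains where annotators did not all choose the same label."""
    disagreeing = sum(1 for labels in annotations.values() if len(set(labels)) > 1)
    return disagreeing / len(annotations)


def community_match_rate(annotations, community):
    """Share of domains whose most common annotator label equals the community label."""
    matches = 0
    for domain, labels in annotations.items():
        majority, _ = Counter(labels).most_common(1)[0]
        if majority == community.get(domain):
            matches += 1
    return matches / len(annotations)


# Hypothetical toy data (assumption): three annotators label four domains.
annotations = {
    "example.com": ["News", "News", "News"],
    "example.org": ["Shopping", "Business", "Shopping"],
    "example.net": ["Games", "Games", "Sports"],
    "example.edu": ["Education", "Education", "Education"],
}
community = {
    "example.com": "News",
    "example.org": "Business",
    "example.net": "Games",
    "example.edu": "Education",
}

disagreement = disagreement_rate(annotations)
match = community_match_rate(annotations, community)
```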

We analyze services on specialized use cases
› Intended usage → requirements → data source selection
› Service selection → characteristics → coverage/accuracy
› Estimate suitability for three case studies
  Obtain a manually curated list as “ground truth”
  Analyze coverage across domains
  Analyze appropriateness of labels

Behavior differs widely for specialized use cases
› Advertising and tracking
  Curated list: EasyList/EasyPrivacy
  Finding: few services label the domains at all, let alone as trackers
› Adult content
  Curated list: [Val19] and gambling regulators
  Finding: 5 services label correctly, 3 others hardly label any
› CDNs/hosting providers
  Curated list: signatures from WebPageTest
  Finding: confusion between service function and content
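The case-study procedure (curated list as ground truth, then coverage, then label appropriateness) might be sketched like this. The set of "expected" category names per use case, the function name, and the toy domains are assumptions for illustration; mapping a service's taxonomy to a use case is a manual judgment the slides do not spell out.

```python
def evaluate_case_study(curated, service_labels, expected):
    """Coverage and label appropriateness of one service on a curated domain list.

    curated: domains known to belong to the use case ("ground truth")
    service_labels: domain -> label returned by the service (absent or None if unlabeled)
    expected: label names considered appropriate for this use case
    """
    labeled = [d for d in curated if service_labels.get(d)]
    appropriate = [d for d in labeled if service_labels[d] in expected]
    return {
        "coverage": len(labeled) / len(curated),
        "appropriate": len(appropriate) / len(curated),
    }


# Hypothetical toy data (assumption): four curated tracker domains.
curated_trackers = ["tracker.example", "ads.example", "pixel.example", "stats.example"]
service_labels = {"tracker.example": "Advertising", "ads.example": "Business"}
expected_labels = {"Advertising", "Tracker"}

result = evaluate_case_study(curated_trackers, service_labels, expected_labels)
```

Reporting both numbers separates the two failure modes found in the case studies: a service that hardly labels the curated domains at all, and one that labels them but with a category that misses the use case.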

Recommendations
› We avoid recommending a specific service
  “Best” service depends on use case and requirements
  We can measure neither semantic agreement nor correctness
› Our recommendations address best practices
[Seb16, Lee13, Wei19]

Recommendations
› Coverage and accuracy may be insufficient
  Very service- and use-case-dependent
  Consider impact of errors
› Purpose and updates may introduce biases
  Consult documentation for taxonomy and label sources
  ... but verify (and report) manually, as inconsistencies exist
› Taxonomies differ in size, scope and semantics
  Sound aggregation is not obvious

