SLIDE 1 Similarity encoding for learning with dirty categorical variables
Gaël Varoquaux⋆
⋆ Scikit-learn project lead
Agenda today: bring to light a problem; show that statistical learning can solve it
SLIDE 2
Machine learning: let X ∈ R^(n×p)
G Varoquaux 2
SLIDE 3
Machine learning: let X ∈ R^(n×p)
The data:

Gender  Date Hired  Employee Position Title
M       09/12/1988  Master Police Officer
F       11/19/1989  Social Worker IV
M       07/16/2007  Police Officer III
F       02/05/2007  Police Aide
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       03/02/2008  Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       11/22/2010  Library Assistant I
G Varoquaux 2
SLIDE 4
A data cleaning problem? A feature engineering problem?
G Varoquaux 2
SLIDE 5
The problem of “dirty categories”
Non-curated categorical entries
Employee Position Title:
Master Police Officer, Social Worker IV, Police Officer III, Police Aide, Electrician I, Bus Operator, Bus Operator, Social Worker III, Library Assistant I, Library Assistant I
Overlapping categories
“Master Police Officer”, “Police Officer III”, “Police Officer II”...
High cardinality: 400 unique entries in 10 000 rows
Rare categories: only 1 “Architect III”
New categories in the test set
G Varoquaux 3
SLIDE 6 Dirty categories in the wild
Employee Salaries: salary information for employees of Montgomery County, Maryland.
Employee Position Title Master Police Officer Social Worker IV ...
G Varoquaux 4
SLIDE 7 Dirty categories in the wild
Open Payments: payments by health care companies to medical doctors or hospitals.
Company name                           Frequency
Pfizer Inc.                               79,073
Pfizer Pharmaceuticals LLC                   486
Pfizer International LLC                     425
Pfizer Limited                                13
Pfizer Corporation Hong Kong Limited           4
Pfizer Pharmaceuticals Korea Limited           3
...
G Varoquaux 4
SLIDE 8 Dirty categories in the wild
Medical charges: patient discharges (utilization, payment, and hospital-specific charges) across 3 000 US hospitals.
...
Nothing on the UCI machine-learning data repository
G Varoquaux 4
SLIDE 9 Dirty categories in the wild
[Figure: number of categories (100 to 10 000) versus number of rows (100 to 1M) for beer reviews, road safety, traffic violations, midwest survey, employee salaries, and medical charges, with guide curves 100√n and 5 log2(n).]
G Varoquaux 5
SLIDE 10
Mechanisms creating dirty categories:
Typos
Open-ended entries
Merging different data sources
G Varoquaux 6
SLIDE 11
Our goal: a statistical view of supervised learning on dirty categories
The statistical question should inform curation:
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
Rest of the talk:
1 Related approaches
2 Similarity encoding
3 Empirical study
G Varoquaux 7
SLIDE 12
1 Related approaches
Database cleaning
Natural language processing
Machine learning
G Varoquaux 8
SLIDE 13
1 A database cleaning point of view
Recognizing / merging entities
Record linkage: matching across different (clean) tables
Deduplication / fuzzy matching: matching in one dirty table
Techniques
[Fellegi and Sunter 1969]
Supervised learning (known matches)
Clustering
Expectation-maximization to learn a metric
Outputs a “clean” database
G Varoquaux 9
SLIDE 14
1 A natural language processing point of view
Stemming / normalization: a set of (handcrafted) rules
Needs to be adapted to new languages / new domains
G Varoquaux 10
SLIDE 15
1 A natural language processing point of view
Semantics: relating different discrete objects
Formal semantics (entity resolution in knowledge bases)
Distributional semantics: “a word is characterized by the company it keeps”
G Varoquaux 10
SLIDE 16
1 A natural language processing point of view
Character-level NLP
For entity resolution
[Klein... 2003]
For semantics
[Bojanowski... 2017]
“London” & “Londres” may carry different information
G Varoquaux 10
SLIDE 17
1 A machine-learning point of view
High-cardinality categorical data: encoding each category blows up the dimension
Target encoding
[Micci-Barreca 2001]
Represent each category by a simple statistical link to the target y, e.g. E[y | Xi = Ck]
A 1D real-number embedding for a categorical column
Brings close together categories with the same link to y
Great for tree-based machine learning [Dorogush...]
G Varoquaux 11
SLIDE 18
1 A machine-learning point of view
Target encoding [Micci-Barreca 2001] fails on unseen categories
G Varoquaux 11
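The scheme above fits in a few lines. A minimal sketch of target encoding, including the smoothing commonly used to stabilize rare categories; the prior_weight value and the toy salary data are made up for illustration, not from the talk:

```python
from collections import defaultdict

def target_encode(categories, y, prior_weight=10.0):
    """Encode each category by a smoothed mean of the target.

    Smoothing shrinks rare categories toward the global mean, which
    tames the variance of E[y | X = c] when c has few rows.
    """
    global_mean = sum(y) / len(y)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, target in zip(categories, y):
        sums[c] += target
        counts[c] += 1
    encoding = {
        c: (sums[c] + prior_weight * global_mean) / (counts[c] + prior_weight)
        for c in counts
    }
    # Unseen categories fall back to the global mean -- the weak spot
    # highlighted on the slide.
    return [encoding.get(c, global_mean) for c in categories], encoding

titles = ["Police Officer", "Social Worker", "Police Officer", "Police Aide"]
salaries = [90.0, 60.0, 95.0, 40.0]
encoded, mapping = target_encode(titles, salaries)
```

A column of strings becomes a single real-valued feature, which is why tree-based learners handle it well; an encoding fitted on the train set simply has no entry for a category first seen at test time.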
SLIDE 19
2 Similarity encoding
[P. Cerda, G. Varoquaux, & B. Kégl, Machine Learning 2018]
G Varoquaux 12
SLIDE 20 2 Similarity encoding
[P. Cerda, G. Varoquaux, & B. Kégl, Machine Learning 2018]
- 1. One-hot encoding maps categories to vector spaces
- 2. String similarities capture information
G Varoquaux 12
SLIDE 21
2 Adding similarities to one-hot encoding
One-hot encoding:
          London  Londres  Paris
Londres        0        1      0
London         1        0      0
Paris          0        0      1
X ∈ R^(n×p)
p grows fast; new categories? link categories?
G Varoquaux 13
SLIDE 22
2 Adding similarities to one-hot encoding
Similarity encoding:
          London  Londres  Paris
Londres      0.3      1.0    0.0
London       1.0      0.3    0.0
Paris        0.0      0.0    1.0
string similarity(Londres, London) = 0.3
G Varoquaux 13
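The encoding illustrated above can be reproduced with a few lines of code, assuming one simple choice of string similarity (a Jaccard index over sets of 3-grams, one of several reasonable variants):

```python
def ngrams(s, n=3):
    s = " " + s.lower() + " "          # pad so word boundaries count
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def similarity_encode(entries, categories):
    """One row per entry, one column per known category."""
    return [[ngram_similarity(e, c) for c in categories] for e in entries]

cats = ["London", "Londres", "Paris"]
X = similarity_encode(["Londres", "London", "Paris"], cats)
# Exact matches give 1.0; "Londres" vs "London" gives 0.3, the value
# shown in the table; unrelated strings give 0.0.
```

A new category at test time still gets a meaningful row: its similarities to the known categories.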
SLIDE 23 2 Some string similarities
Levenshtein: the number of edit operations to turn one string into the other
Jaro: d_jaro(s1, s2) = m/(3|s1|) + m/(3|s2|) + (m − t)/(3m)
  where m is the number of matching characters and t the number of character transpositions
  (Jaro-Winkler adds a bonus for matching prefixes)
n-gram similarity: an n-gram is a group of n consecutive characters;
  similarity = (# n-grams in common) / (# n-grams in total)
G Varoquaux 14
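The Levenshtein distance from the slide fits in a short dynamic program. The ratio form below (distance divided by the longer string's length) is one plausible normalization, an assumption rather than the talk's exact definition:

```python
def levenshtein(a, b):
    """Number of character insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def levenshtein_ratio(a, b):
    """Similarity in [0, 1]: 1 for identical strings."""
    m = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / m if m else 1.0
```

For instance, levenshtein("London", "Londres") is 3: two substitutions and one insertion.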
SLIDE 24
3 Empirical study
G Varoquaux 15
SLIDE 25 3 Datasets with dirty categories
Dataset             # rows   # categories   Least frequent category (count)   Prediction type
medical charges     160k     100            613                               regression
employee salaries   9.2k     385            1                                 regression
open payments       100k     973            1                                 binary clf
midwest survey      2.8k     1009           1                                 multiclass clf
traffic violations  100k     3043           1                                 multiclass clf
road safety         10k      4617           1                                 binary clf
beer reviews        10k      4634           1                                 multiclass clf
7 datasets! All open
G Varoquaux 16
SLIDE 26
3 Experiments
Cross-validation and measurement of prediction accuracy
Focus on prediction rather than in-sample statistics: easier non-parametric evaluation, amenable to high dimension
G Varoquaux 17
SLIDE 27 3 Results: gradient boosted trees
[Figure: prediction scores per dataset (medical charges, employee salaries, payments, midwest survey, traffic violations, road safety, beer reviews) for similarity encoding with 3-gram, Levenshtein-ratio, and Jaro-Winkler similarities, and for target, one-hot, and hash encoding. Average rankings across datasets: 1.6, 2.4, 2.9, 3.7, 4.6, 5.9.]
G Varoquaux 18
SLIDE 32 3 Results: ridge
[Figure: prediction scores per dataset for the same encoders (3-gram, Levenshtein-ratio, and Jaro-Winkler similarity encoding; target, one-hot, and hash encoding) with a ridge learner. Average rankings across datasets: 1.0, 2.9, 3.1, 4.4, 3.6, 6.0.]
Best overall: similarity encoding, with 3-gram similarity
G Varoquaux 19
SLIDE 33 3 Results: different learner
[Figure: prediction scores per dataset with 3-gram similarity encoding for four learners: Random Forest, Gradient Boosting, Ridge CV, Logistic CV. Average rankings across datasets: 2.7, 2.4, 2.3, 2.0.]
G Varoquaux 20
SLIDE 34 3 This is just a string similarity?
What similarity is defined by our encoding? (a kernel)
⟨s_i, s_j⟩_sim = Σ_{l=1}^{k} sim(s_i, s^(l)) sim(s_j, s^(l))
The sum runs over the k reference categories s^(l): the categories in the train set shape the similarity.
G Varoquaux 21
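The identity on this slide can be checked directly in code: the inner product of two similarity-encoded rows is a kernel between strings, shaped by the reference categories. The similarity function below is a toy character-set overlap, a stand-in for the talk's string similarities:

```python
def sim(a, b):
    # toy similarity: overlap of the character sets of the two strings
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def encode(s, references):
    # similarity-encoded row: one column per reference category s^(l)
    return [sim(s, r) for r in references]

def kernel(s1, s2, references):
    # <s1, s2>_sim = sum over l of sim(s1, s^(l)) * sim(s2, s^(l))
    return sum(x * y for x, y in
               zip(encode(s1, references), encode(s2, references)))

refs = ["Police Officer", "Social Worker", "Bus Operator"]
k = kernel("Police Officer III", "Master Police Officer", refs)
```

Because the train-set categories appear inside the sum, two fixed strings can be more or less similar depending on what the train set contained.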
SLIDE 35 3 This is just a string similarity?
[Figure: prediction scores per dataset for similarity encoding (3-gram, Levenshtein ratio, Jaro-Winkler) against a bag-of-3-grams representation. Average rankings across datasets: 1.1, 3.1, 3.4, 4.1.]
Similarity encoding > a feature map capturing string similarities
G Varoquaux 21
SLIDE 36
3 Too high dimensions
X ∈ R^(n×p), but p is large:
statistical problems, computational problems, interpretation problems
G Varoquaux 22
SLIDE 37
3 Too high dimensions
Reducing the dimension:
Random projections: a “cheap PCA”
Keeping only the most frequent categories as prototypes
K-means on strings to select prototypes: similar to deduplication, without hard assignment
G Varoquaux 22
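Of the reductions listed above, keeping only the most frequent categories as prototypes is the simplest to sketch; the 3-gram similarity here is an illustrative Jaccard variant, and the example entries are made up:

```python
from collections import Counter

def ngram_similarity(a, b, n=3):
    def grams(s):
        s = " " + s.lower() + " "   # pad so word boundaries count
        return {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def most_frequent_prototypes(entries, d):
    """Keep the d most common categories as reference prototypes."""
    return [cat for cat, _ in Counter(entries).most_common(d)]

def encode_reduced(entries, prototypes):
    """n rows x d columns instead of one column per unique category."""
    return [[ngram_similarity(e, p) for p in prototypes] for e in entries]

entries = ["Bus Operator", "Bus Operator", "Police Officer III",
           "Master Police Officer", "Bus Operator", "Police Officer III"]
protos = most_frequent_prototypes(entries, d=2)
X = encode_reduced(entries, protos)
```

Rare categories lose their own column but keep informative coordinates: their similarities to the frequent prototypes.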
SLIDE 38 3 Reducing the dimension
[Figure: prediction scores for one-hot and 3-gram similarity encoding reduced to d = 30, 100, 300 dimensions via random projections, most-frequent-category prototypes, k-means prototypes, and deduplication with k-means, against the full encodings, on employee salaries (k=355), payments (k=910), midwest survey (k=644), traffic violations (k=2588), road safety (k=3988), and beer reviews (k=4015).]
Factorizing one-hot encoding is related to Multiple Correspondence Analysis.
G Varoquaux 23
SLIDE 39 3 Reducing the dimension
“Hard deduplication” is a difficult problem with a lengthy literature.
G Varoquaux 23
SLIDE 42 3 Reducing the dimension
Hashing n-grams (for speed and collisions)
G Varoquaux 23
SLIDE 43
@GaelVaroquaux
Learning on dirty categories
Dirty categories: statistical models of non-curated categorical data. Give us your dirty data!
Machine learning can help: similarity encoding is a robust solution (it dominates one-hot encoding) and enables statistical models. More to come.
Dirty-category software: http://dirty-cat.github.io
SLIDE 44 4 References I
- P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5(1):135–146, 2017.
- P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018.
- A. V. Dorogush, V. Ershov, and A. Gulin. CatBoost: gradient boosting with categorical features support.
- I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183, 1969.
SLIDE 45 4 References II
- D. Klein, J. Smarr, H. Nguyen, and C. D. Manning. Named entity recognition with character-level models. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 180–183. Association for Computational Linguistics, 2003.
- D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27–32, 2001.