SLIDE 1 Similarity encoding for learning with dirty categorical variables
Gaël Varoquaux⋆
⋆ Scikit-learn project lead
Agenda today: bring to light a problem; show that statistical learning can solve it
SLIDE 2
Machine learning: let X ∈ R^(n×p)
G Varoquaux 2
SLIDE 3
Machine learning: let X ∈ R^(n×p)
The data:

Gender  Date Hired  Employee Position Title
M       09/12/1988  Master Police Officer
F       11/19/1989  Social Worker IV
M       07/16/2007  Police Officer III
F       02/05/2007  Police Aide
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       03/02/2008  Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       11/22/2010  Library Assistant I
G Varoquaux 2
SLIDE 4
A data cleaning problem? A feature engineering problem?
G Varoquaux 2
SLIDE 5
The problem of “dirty categories”
Non-curated categorical entries
Employee Position Title:
Master Police Officer, Social Worker IV, Police Officer III, Police Aide, Electrician I, Bus Operator, Bus Operator, Social Worker III, Library Assistant I, Library Assistant I
Overlapping categories
“Master Police Officer”, “Police Officer III”, “Police Officer II”...
High cardinality: 400 unique entries in 10 000 rows
Rare categories: only 1 “Architect III”
New categories in the test set
G Varoquaux 3
SLIDE 6 Dirty categories in the wild
Employee Salaries: salary information for employees of Montgomery County, Maryland.
Employee Position Title Master Police Officer Social Worker IV ...
G Varoquaux 4
SLIDE 7 Dirty categories in the wild
Open Payments: payments by health care companies to medical doctors or hospitals.
Company name                           Frequency
Pfizer Inc.                               79,073
Pfizer Pharmaceuticals LLC                   486
Pfizer International LLC                     425
Pfizer Limited                                13
Pfizer Corporation Hong Kong Limited           4
Pfizer Pharmaceuticals Korea Limited           3
...
G Varoquaux 4
SLIDE 8 Dirty categories in the wild
Medical charges: patient discharges (utilization, payment, and hospital-specific charges) across 3 000 US hospitals.
...
Nothing on the UCI machine-learning data repository
G Varoquaux 4
SLIDE 9 Dirty categories in the wild
[Figure: number of categories (100 to 10 000) versus number of rows (100 to 1M) for beer reviews, road safety, traffic violations, midwest survey, employee salaries, and medical charges, with guide curves 100√n and 5 log2(n).]
G Varoquaux 5
SLIDE 10
Mechanisms creating dirty categories:
Typos
Open-ended entries
Merging different data sources
G Varoquaux 6
SLIDE 11
Our goal: a statistical view of supervised learning on dirty categories
The statistical question should inform curation:
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
Rest of the talk:
1 Related approaches
2 Similarity encoding
3 Empirical study
G Varoquaux 7
SLIDE 12
1 Related approaches
Database cleaning
Natural language processing
Machine learning
G Varoquaux 8
SLIDE 13
1 A database cleaning point of view
Recognizing / merging entities
Record linkage: matching across different (clean) tables
Deduplication / fuzzy matching: matching in one dirty table
Techniques
[Fellegi and Sunter 1969]
Supervised learning (known matches)
Clustering
Expectation-maximization to learn a metric
Outputs a “clean” database
G Varoquaux 9
SLIDE 14
1 A natural language processing point of view
Stemming / normalization: a set of (handcrafted) rules
Needs to be adapted to new languages / new domains
G Varoquaux 10
SLIDE 15
1 A natural language processing point of view
Semantics: relating different discrete objects
Formal semantics (entity resolution in knowledge bases)
Distributional semantics: “a word is characterized by the company it keeps”
G Varoquaux 10
SLIDE 16
1 A natural language processing point of view
Character-level NLP
For entity resolution
[Klein... 2003]
For semantics
[Bojanowski... 2017]
“London” & “Londres” may carry different information
G Varoquaux 10
SLIDE 17
1 A machine-learning point of view
High-cardinality categorical data: encoding each category blows up the dimension
Target encoding
[Micci-Barreca 2001]
Represent each category by a simple statistical link to the target y, e.g. E[y | Xi = Ck]
A 1D real-number embedding for a categorical column
Brings close together categories with the same link to y
Great for tree-based machine learning [Dorogush...]
G Varoquaux 11
SLIDE 18
1 A machine-learning point of view
Target encoding [Micci-Barreca 2001] fails on unseen categories
G Varoquaux 11
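The scheme above fits in a few lines. A minimal sketch of target encoding, including the smoothing commonly used to stabilize rare categories; the prior_weight value and the toy salary data are made up for illustration, not from the talk:

```python
from collections import defaultdict

def target_encode(categories, y, prior_weight=10.0):
    """Encode each category by a smoothed mean of the target.

    Smoothing shrinks rare categories toward the global mean, which
    tames the variance of E[y | X = c] when c has few rows.
    """
    global_mean = sum(y) / len(y)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, target in zip(categories, y):
        sums[c] += target
        counts[c] += 1
    encoding = {
        c: (sums[c] + prior_weight * global_mean) / (counts[c] + prior_weight)
        for c in counts
    }
    # Unseen categories fall back to the global mean -- the weak spot
    # highlighted on the slide.
    return [encoding.get(c, global_mean) for c in categories], encoding

titles = ["Police Officer", "Social Worker", "Police Officer", "Police Aide"]
salaries = [90.0, 60.0, 95.0, 40.0]
encoded, mapping = target_encode(titles, salaries)
```

A column of strings becomes a single real-valued feature, which is why tree-based learners handle it well; an encoding fitted on the train set simply has no entry for a category first seen at test time.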
SLIDE 19
2 Similarity encoding
[P. Cerda, G. Varoquaux, & B. Kégl, Machine Learning 2018]
G Varoquaux 12
SLIDE 20 2 Similarity encoding
[P. Cerda, G. Varoquaux, & B. Kégl, Machine Learning 2018]
- 1. One-hot encoding maps categories to vector spaces
- 2. String similarities capture information
G Varoquaux 12
SLIDE 21
2 Adding similarities to one-hot encoding
One-hot encoding:
          London  Londres  Paris
Londres        0        1      0
London         1        0      0
Paris          0        0      1
X ∈ R^(n×p)
p grows fast; new categories? link categories?
G Varoquaux 13
SLIDE 22
2 Adding similarities to one-hot encoding
Similarity encoding:
          London  Londres  Paris
Londres      0.3      1.0    0.0
London       1.0      0.3    0.0
Paris        0.0      0.0    1.0
string similarity(Londres, London) = 0.3
G Varoquaux 13
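The encoding illustrated above can be reproduced with a few lines of code, assuming one simple choice of string similarity (a Jaccard index over sets of 3-grams, one of several reasonable variants):

```python
def ngrams(s, n=3):
    s = " " + s.lower() + " "          # pad so word boundaries count
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def similarity_encode(entries, categories):
    """One row per entry, one column per known category."""
    return [[ngram_similarity(e, c) for c in categories] for e in entries]

cats = ["London", "Londres", "Paris"]
X = similarity_encode(["Londres", "London", "Paris"], cats)
# Exact matches give 1.0; "Londres" vs "London" gives 0.3, the value
# shown in the table; unrelated strings give 0.0.
```

A new category at test time still gets a meaningful row: its similarities to the known categories.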
SLIDE 23 2 Some string similarities
Levenshtein: the number of edit operations to turn one string into the other
Jaro: d_jaro(s1, s2) = m/(3|s1|) + m/(3|s2|) + (m − t)/(3m)
  where m is the number of matching characters and t the number of character transpositions
  (Jaro-Winkler adds a bonus for matching prefixes)
n-gram similarity: an n-gram is a group of n consecutive characters;
  similarity = (# n-grams in common) / (# n-grams in total)
G Varoquaux 14
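The Levenshtein distance from the slide fits in a short dynamic program. The ratio form below (distance divided by the longer string's length) is one plausible normalization, an assumption rather than the talk's exact definition:

```python
def levenshtein(a, b):
    """Number of character insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def levenshtein_ratio(a, b):
    """Similarity in [0, 1]: 1 for identical strings."""
    m = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / m if m else 1.0
```

For instance, levenshtein("London", "Londres") is 3: two substitutions and one insertion.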
SLIDE 24
3 Empirical study
G Varoquaux 15
SLIDE 25 3 Datasets with dirty categories
Dataset             # rows   # categories   Least frequent category (count)   Prediction type
medical charges     160k     100            613                               regression
employee salaries   9.2k     385            1                                 regression
open payments       100k     973            1                                 binary clf
midwest survey      2.8k     1009           1                                 multiclass clf
traffic violations  100k     3043           1                                 multiclass clf
road safety         10k      4617           1                                 binary clf
beer reviews        10k      4634           1                                 multiclass clf
7 datasets! All open
G Varoquaux 16
SLIDE 26
3 Experiments
Cross-validation and measurement of prediction accuracy
Focus on prediction rather than in-sample statistics: easier non-parametric evaluation, amenable to high dimension
G Varoquaux 17
SLIDE 27 3 Results: gradient boosted trees
[Figure: prediction scores per dataset (medical charges, employee salaries, payments, midwest survey, traffic violations, road safety, beer reviews) for similarity encoding with 3-gram, Levenshtein-ratio, and Jaro-Winkler similarities, and for target, one-hot, and hash encoding. Average rankings across datasets: 1.6, 2.4, 2.9, 3.7, 4.6, 5.9.]
G Varoquaux 18
SLIDE 32 3 Results: ridge
[Figure: prediction scores per dataset for the same encoders (3-gram, Levenshtein-ratio, and Jaro-Winkler similarity encoding; target, one-hot, and hash encoding) with a ridge learner. Average rankings across datasets: 1.0, 2.9, 3.1, 4.4, 3.6, 6.0.]
Best overall: similarity encoding, with 3-gram similarity
G Varoquaux 19
SLIDE 33 3 Results: different learner
[Figure: prediction scores per dataset with 3-gram similarity encoding for four learners: Random Forest, Gradient Boosting, Ridge CV, Logistic CV. Average rankings across datasets: 2.7, 2.4, 2.3, 2.0.]
G Varoquaux 20
SLIDE 34 3 This is just a string similarity?
What similarity is defined by our encoding? (a kernel)
⟨s_i, s_j⟩_sim = Σ_{l=1}^{k} sim(s_i, s^(l)) sim(s_j, s^(l))
The sum runs over the k reference categories s^(l): the categories in the train set shape the similarity.
G Varoquaux 21
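The identity on this slide can be checked directly in code: the inner product of two similarity-encoded rows is a kernel between strings, shaped by the reference categories. The similarity function below is a toy character-set overlap, a stand-in for the talk's string similarities:

```python
def sim(a, b):
    # toy similarity: overlap of the character sets of the two strings
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def encode(s, references):
    # similarity-encoded row: one column per reference category s^(l)
    return [sim(s, r) for r in references]

def kernel(s1, s2, references):
    # <s1, s2>_sim = sum over l of sim(s1, s^(l)) * sim(s2, s^(l))
    return sum(x * y for x, y in
               zip(encode(s1, references), encode(s2, references)))

refs = ["Police Officer", "Social Worker", "Bus Operator"]
k = kernel("Police Officer III", "Master Police Officer", refs)
```

Because the train-set categories appear inside the sum, two fixed strings can be more or less similar depending on what the train set contained.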
SLIDE 35 3 This is just a string similarity?
[Figure: prediction scores per dataset for similarity encoding (3-gram, Levenshtein ratio, Jaro-Winkler) against a bag-of-3-grams representation. Average rankings across datasets: 1.1, 3.1, 3.4, 4.1.]
Similarity encoding > a feature map capturing string similarities
G Varoquaux 21
SLIDE 36
3 Too high dimensions
X ∈ R^(n×p), but p is large:
statistical problems, computational problems, interpretation problems
G Varoquaux 22
SLIDE 37
3 Too high dimensions
Reducing the dimension:
Random projections: a “cheap PCA”
Keeping only the most frequent categories as prototypes
K-means on strings to select prototypes: similar to deduplication, without hard assignment
G Varoquaux 22
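Of the reductions listed above, keeping only the most frequent categories as prototypes is the simplest to sketch; the 3-gram similarity here is an illustrative Jaccard variant, and the example entries are made up:

```python
from collections import Counter

def ngram_similarity(a, b, n=3):
    def grams(s):
        s = " " + s.lower() + " "   # pad so word boundaries count
        return {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def most_frequent_prototypes(entries, d):
    """Keep the d most common categories as reference prototypes."""
    return [cat for cat, _ in Counter(entries).most_common(d)]

def encode_reduced(entries, prototypes):
    """n rows x d columns instead of one column per unique category."""
    return [[ngram_similarity(e, p) for p in prototypes] for e in entries]

entries = ["Bus Operator", "Bus Operator", "Police Officer III",
           "Master Police Officer", "Bus Operator", "Police Officer III"]
protos = most_frequent_prototypes(entries, d=2)
X = encode_reduced(entries, protos)
```

Rare categories lose their own column but keep informative coordinates: their similarities to the frequent prototypes.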
SLIDE 38 3 Reducing the dimension
[Figure: prediction scores for one-hot and 3-gram similarity encoding reduced to d = 30, 100, 300 dimensions via random projections, most-frequent-category prototypes, k-means prototypes, and deduplication with k-means, against the full encodings, on employee salaries (k=355), payments (k=910), midwest survey (k=644), traffic violations (k=2588), road safety (k=3988), and beer reviews (k=4015).]
Factorizing one-hot encoding is related to Multiple Correspondence Analysis.
G Varoquaux 23
SLIDE 39 3 Reducing the dimension
“Hard deduplication” is a difficult problem with a lengthy literature.
G Varoquaux 23
SLIDE 42 3 Reducing the dimension
Hashing n-grams (for speed and collisions)
G Varoquaux 23
SLIDE 43
@GaelVaroquaux
Learning on dirty categories
Dirty categories: statistical models of non-curated categorical data. Give us your dirty data!
Machine learning can help: similarity encoding is a robust solution (it dominates one-hot encoding) and enables statistical models. More to come.
Dirty-category software: http://dirty-cat.github.io
SLIDE 44 4 References I
- P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5(1):135–146, 2017.
- P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018.
- A. V. Dorogush, V. Ershov, and A. Gulin. CatBoost: gradient boosting with categorical features support.
- I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183, 1969.
SLIDE 45 4 References II
- D. Klein, J. Smarr, H. Nguyen, and C. D. Manning. Named entity recognition with character-level models. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 180–183. Association for Computational Linguistics, 2003.
- D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27–32, 2001.