Similarity encoding for learning on dirty categorical variables

SLIDE 1

Similarity encoding for learning on dirty categorical variables

Gaël Varoquaux
Scikit-learn project lead

Agenda today:
- Bring to light a problem
- Show that statistical learning can solve it

SLIDE 4

Machine learning: let X ∈ R^{n×p}. The data:

Gender | Date Hired | Employee Position Title
M      | 09/12/1988 | Master Police Officer
F      | 11/19/1989 | Social Worker IV
M      | 07/16/2007 | Police Officer III
F      | 02/05/2007 | Police Aide
M      | 01/13/2014 | Electrician I
M      | 04/28/2002 | Bus Operator
M      | 03/02/2008 | Bus Operator
F      | 06/26/2006 | Social Worker III
F      | 01/26/2000 | Library Assistant I
M      | 11/22/2010 | Library Assistant I

A data cleaning problem? A feature engineering problem?

G Varoquaux 2

SLIDE 5

The problem of "dirty categories": non-curated categorical entries

Employee Position Title: Master Police Officer, Social Worker IV, Police Officer III, Police Aide, Electrician I, Bus Operator, Bus Operator, Social Worker III, Library Assistant I, Library Assistant I

Overlapping categories: "Master Police Officer", "Police Officer III", "Police Officer II"...
High cardinality: 400 unique entries in 10,000 rows
Rare categories: only 1 "Architect III"
New categories in the test set

SLIDE 8

Dirty categories in the wild

Employee Salaries: salary information for employees of Montgomery County, Maryland.
Employee Position Title: Master Police Officer, Social Worker IV, ...

Open Payments: payments by health care companies to medical doctors or hospitals.

Company name                         | Frequency
Pfizer Inc.                          | 79,073
Pfizer Pharmaceuticals LLC           | 486
Pfizer International LLC             | 425
Pfizer Limited                       | 13
Pfizer Corporation Hong Kong Limited | 4
Pfizer Pharmaceuticals Korea Limited | 3
...

Medical charges: patient discharges (utilization, payment, and hospital-specific charges) across 3,000 US hospitals.

... Nothing on the UCI machine-learning data repository.

SLIDE 9

Dirty categories in the wild

[Figure: number of categories (100 to 10,000) vs number of rows (100 to 1M) for the datasets beer reviews, road safety, traffic violations, midwest survey, open payments, employee salaries, and medical charges, with reference curves 100√n and 5 log2(n).]

SLIDE 10

Mechanisms creating dirty categories:
- Typos
- Open-ended entries
- Merging different data sources

SLIDE 11

Our goal: a statistical view of supervised learning on dirty categories. The statistical question should inform curation.

Pfizer Corporation Hong Kong = ? Pfizer Pharmaceuticals Korea

Rest of the talk:
1. Related approaches
2. Similarity encoding
3. Empirical study

SLIDE 12

1 Related approaches

Database cleaning
Natural language processing
Machine learning

SLIDE 13

1 A database cleaning point of view

Recognizing / merging entities [Fellegi and Sunter 1969]:
- Record linkage: matching across different (clean) tables
- Deduplication / fuzzy matching: matching within one dirty table

Techniques:
- Supervised learning (known matches)
- Clustering
- Expectation-Maximization to learn a metric

Outputs a "clean" database.

SLIDE 16

1 A natural language processing point of view

Stemming / normalization: a set of (handcrafted) rules that needs to be adapted to new languages and new domains.

Semantics: relating different discrete objects.
- Formal semantics (entity resolution in knowledge bases)
- Distributional semantics: "a word is characterized by the company it keeps"

Character-level NLP:
- For entity resolution [Klein... 2003]
- For semantics [Bojanowski... 2017]
- "London" & "Londres" may carry different information

SLIDE 18

1 A machine-learning point of view

High-cardinality categorical data: encoding each category blows up the dimension.

Target encoding [Micci-Barreca 2001]: represent each category by a simple statistical link to the target y, e.g. E[y | Xi = Ck].
- A 1D real-number embedding for a categorical column
- Brings close categories with the same link to y
- Great for tree-based machine learning [Dorogush...]
- But fails on unseen categories
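Target encoding as described above can be sketched in a few lines of plain Python (hypothetical data; production implementations such as CatBoost add smoothing and cross-fitting to limit target leakage):

```python
from collections import defaultdict

def target_encode(categories, y):
    """Map each category to E[y | X = category]; unseen categories fall back to the global mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, yi in zip(categories, y):
        sums[c] += yi
        counts[c] += 1
    global_mean = sum(y) / len(y)
    encoding = {c: sums[c] / counts[c] for c in counts}
    encoded = [encoding.get(c, global_mean) for c in categories]
    return encoded, encoding, global_mean

train_X = ["Bus Operator", "Bus Operator", "Police Aide", "Electrician I"]
train_y = [30000, 32000, 25000, 45000]
encoded, mapping, fallback = target_encode(train_X, train_y)

# An unseen category gets the global mean, which discards all category information:
unseen = mapping.get("Architect III", fallback)
```

This makes the failure mode on the slide concrete: every category unseen at train time collapses to the same value.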

SLIDE 20

2 Similarity encoding

[P. Cerda, G. Varoquaux, & B. Kégl, Machine Learning 2018]

1. One-hot encoding maps categories to vector spaces
2. String similarities capture information

SLIDE 22

2 Adding similarities to one-hot encoding

One-hot encoding (X ∈ R^{n×p}):

          London  Londres  Paris
Londres        0        1      0
London         1        0      0
Paris          0        0      1

p grows fast; what about new categories? how to link categories?

Similarity encoding:

          London  Londres  Paris
Londres      0.3      1.0    0.0
London       1.0      0.3    0.0
Paris        0.0      0.0    1.0

0.3 = string similarity(Londres, London)
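A minimal sketch of the encoding in the table above, using a 3-gram similarity (plain Python; the function names are illustrative, not the paper's code):

```python
def ngrams(s, n=3):
    """Set of n consecutive-character groups of s (strings shorter than n give the empty set)."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Shared n-grams over total n-grams (Jaccard-style)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def similarity_encode(values, vocabulary):
    """Encode each string as its similarity to every category seen at train time."""
    return [[ngram_similarity(v, ref) for ref in vocabulary] for v in values]

vocab = ["London", "Londres", "Paris"]
rows = similarity_encode(["London", "Madrid"], vocab)
```

With this toy similarity, sim(London, Londres) = 2/7 ≈ 0.29, close to the 0.3 shown in the table. Note that an unseen category such as "Madrid" still receives a meaningful encoding: its similarity to each train-set category.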

SLIDE 23

2 Some string similarities

Levenshtein: the number of edit operations (insertions, deletions, substitutions) needed to turn one string into the other.

Jaro-Winkler, built on the Jaro similarity:

d_jaro(s1, s2) = m/(3|s1|) + m/(3|s2|) + (m − t)/(3m)

where m is the number of matching characters and t the number of character transpositions.

n-gram similarity: an n-gram is a group of n consecutive characters;

similarity = (# n-grams in common) / (# n-grams in total)
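The Levenshtein distance above can be computed with the classic two-row dynamic program (a plain-Python sketch, not the implementation used in the paper):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution (free on match)
        prev = cur
    return prev[-1]

d = levenshtein("london", "londres")
# "london" and "londres" share the prefix "lond" and then diverge.
```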

SLIDE 24

3 Empirical study

SLIDE 25

3 Datasets with dirty categories

Dataset            | # rows | # categories | Least frequent category (count) | Prediction type
medical charges    | 160k   | 100          | 613                             | regression
employee salaries  | 9.2k   | 385          | 1                               | regression
open payments      | 100k   | 973          | 1                               | binary clf
midwest survey     | 2.8k   | 1009         | 1                               | multiclass clf
traffic violations | 100k   | 3043         | 1                               | multiclass clf
road safety        | 10k    | 4617         | 1                               | binary clf
beer reviews       | 10k    | 4634         | 1                               | multiclass clf

7 datasets! All open.

SLIDE 26

3 Experiments

Cross-validation & measured prediction. Focus on prediction rather than in-sample statistics:
- Easier non-parametric evaluation
- Amenable to high dimension
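The cross-validation protocol can be sketched in plain Python (a hypothetical helper, not the paper's actual experiment code):

```python
import random

def kfold_splits(n, k, seed=0):
    """Shuffle range(n) and return (train_indices, test_indices) pairs for k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    fold_size, rem = divmod(n, k)
    splits, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < rem else 0)
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        splits.append((train, test))
        start += size
    return splits

splits = kfold_splits(10, 3)
# Each sample appears in exactly one test fold; train and test never overlap,
# so the score measures out-of-sample prediction, including on rare categories.
```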

SLIDE 27

3 Results: gradient boosted trees

[Figure: prediction scores on the seven datasets (medical charges, employee salaries, open payments, midwest survey, traffic violations, road safety, beer reviews), comparing similarity encoding (3-gram, Levenshtein ratio, Jaro-Winkler) with target encoding, one-hot encoding, and hash encoding. Average rankings across datasets: 1.6, 2.4, 2.9, 3.7, 4.6, 5.9.]

SLIDE 32

3 Results: ridge

[Figure: prediction scores on the seven datasets, comparing similarity encoding (3-gram, Levenshtein ratio, Jaro-Winkler) with target encoding, one-hot encoding, and hash encoding, using a ridge model. Average rankings across datasets: 1.0, 2.9, 3.1, 4.4, 3.6, 6.0.]

Best: similarity encoding with 3-gram similarity.

SLIDE 33

3 Results: different learners

[Figure: one-hot encoding vs 3-gram similarity encoding across learners (Random Forest, Gradient Boosting, Ridge CV, Logistic CV) on the seven datasets. Average rankings across datasets: 2.7, 2.4, 2.3, 2.0.]

SLIDE 35

3 This is just a string similarity? What similarity (kernel) is defined by our encoding?

⟨s_i, s_j⟩_sim = Σ_{l=1..k} sim(s_i, s^(l)) · sim(s_j, s^(l))

The sum runs over the reference categories s^(l): the categories in the train set shape the similarity.

[Figure: prediction scores comparing similarity encoding (3-gram, Levenshtein ratio, Jaro-Winkler) with a bag of 3-grams. Average rankings across datasets: 1.1, 3.1, 3.4, 4.1.]

Similarity encoding: a feature map capturing string similarities.
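The kernel above falls out of a plain dot product between encoded rows. A sketch, using a 3-gram similarity as the sim function (illustrative names, assumed definitions):

```python
def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def sim(a, b):
    """3-gram similarity: shared n-grams over total n-grams."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)

def encode(s, references):
    """Feature map phi(s): similarity of s to each train-set category s^(l)."""
    return [sim(s, ref) for ref in references]

def kernel(si, sj, references):
    """<s_i, s_j>_sim = sum over l of sim(s_i, s^(l)) * sim(s_j, s^(l))."""
    return sum(a * b for a, b in zip(encode(si, references),
                                     encode(sj, references)))

refs = ["police officer", "bus operator"]
k_self = kernel("police officer", "police officer", refs)
```

The kernel is a dot product of feature vectors, hence symmetric and positive semi-definite, which is what lets the encoding plug into linear models.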

SLIDE 37

3 Too-high dimensions

X ∈ R^{n×p}, but p is large: statistical problems, computational problems, interpretation problems.

Reducing the dimension:
- Random projections: a "cheap PCA"
- Only the most frequent categories as prototypes
- K-means on strings to select prototypes: similar to deduplication, without hard assignment
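The "most frequent categories as prototypes" option is the simplest to sketch (plain Python, hypothetical data): keep only the d most frequent train-set categories as reference columns.

```python
from collections import Counter

def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def sim(a, b):
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)

def encode_reduced(values, train_values, d):
    """Similarity-encode against only the d most frequent train categories."""
    prototypes = [c for c, _ in Counter(train_values).most_common(d)]
    return [[sim(v, p) for p in prototypes] for v in values], prototypes

train = (["bus operator"] * 5 + ["police officer iii"] * 3
         + ["police aide"] * 2 + ["architect iii"])
rows, protos = encode_reduced(["police officer ii"], train, d=2)
# 2 columns instead of 4: one per retained prototype.
```

Rare categories (like the single "architect iii") lose their own column but are still placed by their similarity to the prototypes, so no row becomes all-zero by construction.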

SLIDE 38

3 Reducing the dimension

[Figure: prediction scores at reduced dimension d = 30, 100, 300 vs the full encoding, for one-hot encoding and 3-gram similarity encoding combined with random projections, most frequent categories, K-means, and deduplication with K-means, on six datasets (employee salaries k=355, open payments k=910, midwest survey k=644, traffic violations k=2588, road safety k=3988, beer reviews k=4015). Average rankings across datasets: 7.2, 5.0, 3.7, 10.5, 7.2, 4.6, 10.5, 6.3, 3.3, 2.0, 16.3, 14.5, 14.3, 12.3, 10.8, 9.9, 14.5.]

Notes:
- Factorizing one-hot is related to Multiple Correspondence Analysis.
- "Hard deduplication" is a difficult problem with a lengthy literature.
- Hashing n-grams (for speed and collisions).
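Hashing n-grams can be sketched with the classic hashing trick (plain Python; real implementations would use a fast non-cryptographic string hash):

```python
import hashlib

def hashed_ngram_vector(s, n=3, dim=32):
    """Map a string to a fixed-size count vector by hashing its n-grams into buckets."""
    vec = [0] * dim
    s = s.lower()
    for i in range(len(s) - n + 1):
        gram = s[i:i + n]
        # A stable hash keeps the encoding reproducible across runs and machines.
        bucket = int(hashlib.md5(gram.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1
    return vec

v1 = hashed_ngram_vector("Police Officer III")
v2 = hashed_ngram_vector("Police Officer II")
# Similar strings share most buckets, while the dimension stays fixed (32 here)
# no matter how many categories appear; distinct n-grams may collide.
```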

SLIDE 43

@GaelVaroquaux

Learning on dirty categories:
- Dirty categories: statistical models of non-curated categorical data. Give us your dirty data!
- Machine learning can help.
- Similarity encoding: a robust solution (dominates one-hot) that enables statistical models.
- More to come.

Dirty-category software: http://dirty-cat.github.io

SLIDE 44

4 References

- P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5(1):135–146, 2017.
- P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018.
- A. V. Dorogush, V. Ershov, and A. Gulin. CatBoost: gradient boosting with categorical features support.
- I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183, 1969.
- D. Klein, J. Smarr, H. Nguyen, and C. D. Manning. Named entity recognition with character-level models. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 180–183. Association for Computational Linguistics, 2003.
- D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27–32, 2001.