Fuzzy Matching In PostgreSQL A Story From The Trenches Charles - PowerPoint PPT Presentation

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Fuzzy Matching In PostgreSQL A Story From The Trenches Charles Clavadetscher Swiss PostgreSQL Users Group Swiss PGDay 2016, Rapperswil, 24.06.2016 Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 1/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Outline Introduction 1 Simple Matching 2 Fuzzy Matching 3 Use Case 4 5 Conclusion Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 2/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Charles Clavadetscher Senior DB Engineer at KOF ETH Zurich KOF is the Center of Economic Research of the ETHZ the Swiss Institute of Technology in Zurich, Switzerland Independent economic research on business cycle tendencies for almost all sectors Maintenance of all databases at KOF: PostgreSQL, Oracle, MySQL and MSSQL Server. Focus on migrating to PostgreSQL Support in business process re-engineering Co-founder and treasurer of the SwissPUG, the Swiss PostgreSQL Users Group Member of the OK of the Swiss PGDay Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 3/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Principles of searches 1/2 Searching in a collection of texts written in a natural language is a huge challenge. Semantics and lexicology Meaning (FTS 1 ) Synonyms (FTS) Lexemes: Naked word base meanings Morphology Phonetic (soundex, metaphone, double metaphone) Morphems: Naked word base forms (FTS) Language independent Words (=, LIKE, ILIKE, regexp) Letters (levenshtein) N-gram subsets (trigrams) 1. FTS: Full Text Search Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 4/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Principles of searches 2/2 Methods can be classified based on many criteria. Here is short summary that includes the most important differentiators for our use case: Language dependency : Switzerland is a country with 4 official languages. Exact/Fuzzy : Use exact when possible and fuzzy otherwise. Indexability : Tradeoff between matching probability and execution speed. -----------+------------------------------------+------------------------------------- Language | Exact | Fuzzy dependent | not indexable | indexable | not indexable | indexable ===========|================|===================|==================|================== NO | LIKE | = (equality) | Levenshtein | Trigrams | ILIKE | LIKE (Trigrams) | | | Regexp | ILIKE (Trigrams) | | | | Regexp (Trigrams) | | -----------+----------------+-------------------+------------------+------------------ YES | | | Soundex | Full Text Search | | | Metaphone | | | | Double Metaphone | -----------+----------------+-------------------+------------------+------------------ Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 5/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Outline Introduction 1 Simple Matching 2 Fuzzy Matching 3 Use Case 4 Conclusion 5 Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 6/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Classics = : Exact match. This can easily be indexed using mostly btree. db=# SELECT * FROM people WHERE people_id = 1765; LIKE : Partially exact match. The part that is given must match exactly, all the rest is irrelevant. In combination with trigrams (and some other operator classes if the initial part is fixed) you can use indexes. SIMILAR TO : Equivalent to LIKE. ILIKE : The same as LIKE but case insensitive. db=# SELECT * FROM people WHERE firstname ILIKE ’Johann%’; db=# SELECT * FROM people WHERE firstname ILIKE ’%johann%’; Regexp : With regular expressions you can search for a specific pattern. PostgreSQL has operators for regexp matching case sensitive and insensitive. As with LIKE/ILIKE you can build an index on the base field using trigrams. db=# SELECT * FROM people WHERE lastname ~ ’^[CB][a-z]*(ts|tsch)er$’; db=# SELECT * FROM companies WHERE name ~ ’.*[0-9]+.*’; Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 7/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Outline Introduction 1 Simple Matching 2 Fuzzy Matching 3 Use Case 4 Conclusion 5 Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 8/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Definitions and Extensions Definitions Fuzzy Search : A process that locates entries that are likely to be relevant to a one or more search parameters even if the argument does not exactly correspond to the specific information available. Distance/Difference : A metric indicating how far are two strings apart. Similarity : A metric indicating how similar are two strings. Normalization : A technique to compute the relative strength of a metric (usually a number between 0 and 1) Extensions fuzzystrmatch : Soundex, Metaphone, Double Metaphone, Levenshtein. pg_trgm : Trigrams. Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 9/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Phonetic Methods Soundex db=# select soundex(’halloween’), soundex(’halouin’), db-# soundex(’les mots’), soundex(’lemó’); soundex | soundex | soundex | soundex ---------+---------+---------+--------- H450 | H450 | L253 | L50 Metaphone db=# select metaphone(’halloween’,5), metaphone(’halouin’,5), db-# metaphone(’les mots’,5), metaphone(’lemó’,5); metaphone | metaphone | metaphone | metaphone -----------+-----------+-----------+----------- HLWN | HLN | LSMTS | LM Double metaphone db=# select dmetaphone(’halloween’), dmetaphone(’halouin’), db-# dmetaphone(’les mots’), dmetaphone(’lemó’); dmetaphone | dmetaphone | dmetaphone | dmetaphone ------------+------------+------------+------------ HLN | HLN | LSMT | LM Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 10/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Levenshtein Distance The minumum number of operations (delete, insert, substitute) required to transform a string into another. Case sensitive. db=# select levenshtein(’Zürich’,’zuerich’); levenshtein ------------- 3 Normalized distance: d ( a , b ) d norm ( a , b ) = 1 − max ( length ( a ) , length ( b )) db=# SELECT 1 - (levenshtein(’Zürich’,’zuerich’)::REAL / db-# greatest(length(’Zürich’),length(’zuerich’))::REAL) AS levenshtein_norm; levenshtein_norm ------------------- 0.571428567171097 Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 11/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Full Text search - 1/8 Full Text Search is a huge chapter in PostgreSQL and would deserve a dedicated session. In this context we can only give an overview of the main characteristics of the technique. Lexeme : A basic lexical unit of a language consisting of one word or several words, the elements of which do not separately convey the meaning of the whole. tsvector : The PostgreSQL equivalent of a collection of lexeme. to_tsvector : Function to convert a text into a collection of lexemes. to_tsquery : A collection of lexemes combined with boolean operators ( &, |, ! ). db=> SELECT to_tsvector(’english’,’There are many nice stories about creating lexemes’); to_tsvector ------------------------------------------------- ’creat’:7 ’lexem’:8 ’mani’:3 ’nice’:4 ’stori’:5 db=> SELECT to_tsquery(’english’,’People & talk & stories & lexemes’); to_tsquery -------------------------------------- ’peopl’ & ’talk’ & ’stori’ & ’lexem’ Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 12/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Full Text search - 2/8 The match operator @@ db=> SELECT to_tsquery(’english’,’People & talk & stories & lexemes’) @@ db-> to_tsvector(’english’,’There are many nice stories about creating lexemes’); ?column? ---------- f db=> SELECT to_tsquery(’english’,’(People & talk) | stories | lexemes’) @@ db-> to_tsvector(’english’,’There are many nice stories about creating lexemes’); ?column? ---------- t Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 13/38

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Full Text search - 3/8 From the PostgreSQL documentation: Text search parsers break documents into tokens and classify each token (for example, as words or numbers). Text search dictionaries convert tokens to normalized form and reject stop words. Text search templates provide the functions underlying dictionaries. (A dictionary simply specifies a template and a set of parameters for the template.) Text search configurations select a parser and a set of dictionaries to use to normalize the tokens produced by the parser. Charles Clavadetscher Swiss PostgreSQL Users Group Fuzzy Matching In PostgreSQL 14/38

Fuzzy Matching In PostgreSQL A Story From The Trenches Charles - PowerPoint PPT Presentation

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Fuzzy Matching In PostgreSQL A Story From The Trenches Charles Clavadetscher Swiss PostgreSQL Users Group Swiss PGDay 2016, Rapperswil, 24.06.2016 Charles Clavadetscher Swiss

On Fuzzy Soft Rings Banu Pazar Varol and Halis Ayg un Department of Mathematics, Kocaeli

Applications Three sample applications Fuzzy inferno Nostalgic cow Twilight Eden Fuzzy inferno

7.5 Bipartite Matching Matching Matching. Input: undirected graph G = (V, E). M E

7 Transformations of Fuzzy Sets Fuzzy Systems Engineering Toward Human-Centric Computing

Semi-Heuristic Target-Based Fuzzy Target . . . Fuzzy Target . . . Fuzzy Decision Procedures:

M odels for Inexact Reasoning Fuzzy Logic Lesson 8 Fuzzy Controllers M aster in

11 Fuzzy Rule-Based Models Fuzzy Systems Engineering Toward Human-Centric Computing Contents

PostgreSQL Who, What, When, Where, Why, How? 1 QUIS? Who's involved with PostgreSQL? Core

Hacking PostgreSQL Stephen Frost Crunchy Data stephen@crunchydata.com FOSDEM 2019 February 3,

Hacking PostgreSQL Stephen Frost Crunchy Data stephen@crunchydata.com PGConf.EU 2018 October

PostgreSQL SQL-MED Ibrar Ahmed Senior Software Engineer @ Percona PostgreSQL Consultant What?

Breaking PostgreSQL at Scale. Christophe Pettus PostgreSQL Experts pgDay Paris 2019

PostgreSQL Provider The PostgreSQL provider gives the ability to deploy and congure resources

Fuzzy Reasoning Outline Introduction Bivalent & Multivalent Logics Fundamental

A fuzzy clustering method using Genetic Algorithm and Fuzzy Subtractive Clustering Thanh Le, Tom

M odels for Inexact Reasoning Fuzzy Logic Lesson 1 Crisp and Fuzzy Sets M aster in

Desktop components Jens Bache-Wiig Agenda Intro Project status Coding QA

Repairing without Retraining: Avoiding Disparate Impact with Counterfactual Distributions Hao

Computing Closed-Form Solutions of Integrable Connections Thomas Cluzeau thomas.cluzeau@xlim.fr

A Systematic Study of Neural Discourse Models for Implicit Discourse Relation Attapol T.

Combining Solr and Elasticsearch to Improve Autosuggestion on Mobile Local Search Toan Vinh Luu,

Lessons Learned from Building a Large Multilingual, Multi-region Website in Drupal 8 Stella

Meas rDroid An Android Measurement Framework Johann Schlamp Georg Carle May 2, 2013 The Meas

Captioning Images with Diverse Objects Lisa Anne Subhashini Marcus Raymond Kate Trevor

Fuzzy Matching In PostgreSQL A Story From The Trenches Charles - PowerPoint PPT Presentation

Introduction Simple Matching Fuzzy Matching Use Case Conclusion Fuzzy Matching In PostgreSQL A Story From The Trenches Charles Clavadetscher Swiss PostgreSQL Users Group Swiss PGDay 2016, Rapperswil, 24.06.2016 Charles Clavadetscher Swiss

On Fuzzy Soft Rings Banu Pazar Varol and Halis Ayg un Department of Mathematics, Kocaeli

Applications Three sample applications Fuzzy inferno Nostalgic cow Twilight Eden Fuzzy inferno

7.5 Bipartite Matching Matching Matching. Input: undirected graph G = (V, E). M E

7 Transformations of Fuzzy Sets Fuzzy Systems Engineering Toward Human-Centric Computing

Semi-Heuristic Target-Based Fuzzy Target . . . Fuzzy Target . . . Fuzzy Decision Procedures:

M odels for Inexact Reasoning Fuzzy Logic Lesson 8 Fuzzy Controllers M aster in

11 Fuzzy Rule-Based Models Fuzzy Systems Engineering Toward Human-Centric Computing Contents

PostgreSQL Who, What, When, Where, Why, How? 1 QUIS? Who's involved with PostgreSQL? Core

Hacking PostgreSQL Stephen Frost Crunchy Data stephen@crunchydata.com FOSDEM 2019 February 3,

Hacking PostgreSQL Stephen Frost Crunchy Data stephen@crunchydata.com PGConf.EU 2018 October

PostgreSQL SQL-MED Ibrar Ahmed Senior Software Engineer @ Percona PostgreSQL Consultant What?

Breaking PostgreSQL at Scale. Christophe Pettus PostgreSQL Experts pgDay Paris 2019

PostgreSQL Provider The PostgreSQL provider gives the ability to deploy and congure resources

Fuzzy Reasoning Outline Introduction Bivalent &amp; Multivalent Logics Fundamental

A fuzzy clustering method using Genetic Algorithm and Fuzzy Subtractive Clustering Thanh Le, Tom

M odels for Inexact Reasoning Fuzzy Logic Lesson 1 Crisp and Fuzzy Sets M aster in

Desktop components Jens Bache-Wiig Agenda Intro Project status Coding QA

Repairing without Retraining: Avoiding Disparate Impact with Counterfactual Distributions Hao

Computing Closed-Form Solutions of Integrable Connections Thomas Cluzeau thomas.cluzeau@xlim.fr

A Systematic Study of Neural Discourse Models for Implicit Discourse Relation Attapol T.

Combining Solr and Elasticsearch to Improve Autosuggestion on Mobile Local Search Toan Vinh Luu,

Lessons Learned from Building a Large Multilingual, Multi-region Website in Drupal 8 Stella

Meas rDroid An Android Measurement Framework Johann Schlamp Georg Carle May 2, 2013 The Meas

Captioning Images with Diverse Objects Lisa Anne Subhashini Marcus Raymond Kate Trevor

Fuzzy Reasoning Outline Introduction Bivalent & Multivalent Logics Fundamental