Full-Text Search in Django with PostgreSQL | EuroPython 2017 - - - PowerPoint PPT Presentation

full text search in django with postgresql
SMART_READER_LITE
LIVE PREVIEW

Full-Text Search in Django with PostgreSQL | EuroPython 2017 - - - PowerPoint PPT Presentation

Full-Text Search in Django with PostgreSQL | EuroPython 2017 - Rimini, 2017-07-12 Paolo Melchiorre - @pauloxnet Paolo olo Melchior lchiorre re | Computer Science Engineer Backend Python Developer (>10yrs) Django


slide-1
SLIDE 1

Full-Text Search

in Django with

PostgreSQL

EuroPython 2017 - Rimini, 2017-07-12

 Paolo Melchiorre - @pauloxnet

 |

slide-2
SLIDE 2

 Paolo

  • lo Melchior

lchiorre re |

▪Computer Science Engineer ▪Backend Python Developer (>10yrs) ▪Django Developer (~5yrs) ▪Senior Software Engineer @ 20Tab ▪Happy Remote Worker ▪PostgreSQL user, not a DBA

2

slide-3
SLIDE 3

 Goal

  • al

|

“To show how we have used Django Full-Text Search and PostgreSQL in a Real Project”

3

slide-4
SLIDE 4

 Motivation Motivation |

“To implement Full-Text Search using only Django and PostgreSQL functionalities, without resorting to external tools.”

4

slide-5
SLIDE 5

 Agend genda |

▪Full-Text Search ▪Existing Solutions ▪PostgreSQL Full-Text Search ▪Django Full-Text Search Support ▪www.concertiaroma.com project ▪What’s next ▪Conclusions ▪Questions

5

slide-6
SLIDE 6

 Full- ull-Tex ext Se Search arch |

“… Full-Text Search* refers to techniques for Searching a single computer-stored Document or a Collection in a Full-Text Database …”

  • - Wikipedia

* FTS = Full-Text Search

6

slide-7
SLIDE 7

 Fea eatur tures of

  • f a FTS

FTS |

▪Stemming ▪Ranking ▪Stop-words ▪Multiple languages support ▪Accent support ▪Indexing ▪Phrase search

7

slide-8
SLIDE 8

 Tes este ted Solutions Solutions |

8

slide-9
SLIDE 9

 Ela lastics ticsea earch ch |

Project: Snap Market (~500k mobile users) Issues: ▪Management problems ▪Patching a Java plug-in

@@ -52,7 +52,8 @@ public class DecompoundTokenFilter … {

  • posIncAtt.setPositionIncrement(0);

+ if (!subwordsonly) + posIncAtt.setPositionIncrement(0); return true; }

9

slide-10
SLIDE 10

 Ap Apach che Solr Solr |

Project: GoalScout (~25k videos) Issues: ▪Synchronization problems ▪All writes to PostgreSQL and reads from Solr

10

slide-11
SLIDE 11

 Existing xisting Solutions Solutions |

PROS  ▪Full featured solutions ▪Resources (documentations, articles, …) CONS  ▪Synchronization ▪Mandatory use of driver (haystack, bungiesearch…) ▪Ops Oriented: focus on system integrations

11

slide-12
SLIDE 12

 FTS TS in Pos PostgreSQL reSQL |

▪FTS Support since version 8.3 (~2008) ▪TSVECTOR to represent text data ▪TSQUERY to represent search predicates ▪Special Indexes (GIN, GIST) ▪Phrase Search since version 9.6 (~2016)

12

slide-13
SLIDE 13

 Wh What are Docu Documents ments |

“… a Document is the Unit of searching in a Full-Text Search system; for example, a magazine Article or email Message …”

  • - PostgreSQL documentation

13

slide-14
SLIDE 14

 Django Support |

▪Module: django.contrib.postgres ▪FTS Support since version 1.10 (2016) ▪BRIN and GIN indexes since version 1.11 (2017) ▪Dev Oriented: focus on programming

14

slide-15
SLIDE 15

 Making ing quer queries ies |

class Blog(models.Model): name = models.CharField(max_length=100) tagline = models.TextField() class Author(models.Model): name = models.CharField(max_length=200) email = models.EmailField() class Entry(models.Model): blog = models.ForeignKey(Blog) headline = models.CharField(max_length=255) body_text = models.TextField() pub_date = models.DateField() authors = models.ManyToManyField(Author)

15

slide-16
SLIDE 16

 St Stand andard rd queries ueries |

>>> Author.objects.filter(name__contains='Terry') [<Author: Terry Gilliam>, <Author: Terry Jones>] >>> Author.objects.filter(name__icontains='Erry') [<Author: Terry Gilliam>, <Author: Terry Jones>, <Author: Jerry Lewis>]

16

slide-17
SLIDE 17

 Unaccented Unaccented query uery |

>>> from django.contrib.postgres.operations import UnaccentExtension >>> UnaccentExtension() >>> Author.objects.filter(name__unaccent__icontains='Hélène') [<Author: Helen Mirren>, <Author: Helena Bonham Carter>, <Author: Hélène Joy>]

17

slide-18
SLIDE 18

 Trigr rigram simila imilar |

>>> from django.contrib.postgres.operations import TrigramExtension >>> TrigramExtension() >>> Author.objects.filter(name__unaccent__trigram_similar='Hélèn') [<Author: Helen Mirren>, <Author: Helena Bonham Carter>, <Author: Hélène Joy>]

18

slide-19
SLIDE 19

 The The search search look lookup up |

>>> Entry.objects.filter(body_text__search='Cheese') [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>]

19

slide-20
SLIDE 20

 Sea SearchV rchVector ector |

>>> from django.contrib.postgres.search import SearchVector >>> Entry.objects.annotate( ... search=SearchVector('body_text', 'blog__tagline'), ... ).filter(search='Cheese') [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>]

20

slide-21
SLIDE 21

 Sea SearchQuery rchQuery |

>>> from django.contrib.postgres.search import SearchQuery >>> SearchQuery('potato') & SearchQuery('ireland') # potato AND ireland >>> SearchQuery('potato') | SearchQuery('penguin') # potato OR penguin >>> ~SearchQuery('sausage') # NOT sausage

21

slide-22
SLIDE 22

 SearchRank SearchRank |

>>> from django.contrib.postgres.search import ( ... SearchQuery, SearchRank, SearchVector ... ) >>> vector = SearchVector('body_text') >>> query = SearchQuery('cheese') >>> Entry.objects.annotate( ... rank=SearchRank(vector, query) ... ).order_by('-rank') [<Entry: Cheese on Toast recipes>, <Entry: Pizza recipes>]

22

slide-23
SLIDE 23

 Se Search arch confg confgurat uration ion |

>>> from django.contrib.postgres.search import ( ... SearchQuery, SearchVector ... ) >>> Entry.objects.annotate( ... search=SearchVector('body_text', config='french'), ... ).filter(search=SearchQuery('œuf', config='french')) [<Entry: Pain perdu>] >>> from django.db.models import F >>> Entry.objects.annotate( ... search=SearchVector('body_text', config=F('blog__lang')), ... ).filter(search=SearchQuery('œuf', config=F('blog__lang'))) [<Entry: Pain perdu>]

23

slide-24
SLIDE 24

 Weighting ighting queries ueries |

>>> from django.contrib.postgres.search import ( ... SearchQuery, SearchRank, SearchVector ... ) >>> vector = SearchVector('body_text', weight='A') + ... SearchVector('blog__tagline', weight='B') >>> query = SearchQuery('cheese') >>> Entry.objects.annotate( ... rank=SearchRank(vector, query) ... ).filter(rank__gte=0.3).order_by('rank')

24

slide-25
SLIDE 25

 Sea SearchV chVectorFie ectorField ld |

>>> Entry.objects.update( ... search_vector=SearchVector('body_text') ... ) >>> Entry.objects.filter(search_vector='cheese') [<Entry: Cheese on Toast recipes>, <Entry: Pizza recipes>]

25

slide-26
SLIDE 26

 www.concertiaroma.com|

“… today's shows in the Capital” * The numbers of the project: ~ 1k venues > 12k bands > 15k shows ~ 200 festivals ~ 30k user/month * since ~2014

26

slide-27
SLIDE 27

 Ver ersion sion 2.0 |

Python 2.7 - Django 1.7 - PostgreSQL 9.1 - SQL LIKE

27

slide-28
SLIDE 28

 Version 3.0 |

Python 3.6 - Django 1.11 - PostgreSQL 9.6 - PG FTS

28

slide-29
SLIDE 29

 Band Band Manager |

LANG = 'english' class BandManager(models.Manager): def search(self, text): vector = ( SearchVector('nickname', weight='A', config=LANG) + SearchVector('genres__name', weight='B', config=LANG)+ SearchVector('description', weight='D', config=LANG) ) query = SearchQuery(text, config=LANG) rate = SearchRank(vector, query) return self.get_queryset().annotate(rate=rate).filter( search=query).annotate(search=vector).distinct( 'id', 'rate').order_by('-rate', 'id')

29

slide-30
SLIDE 30

 Band Band T est Setup |

class BandTest(TestCase): def setUp(self): metal, _ = Genre.objects.get_or_create(name='Metal') doom, _ = Genre.objects.get_or_create(name='Doom') doomraiser, _ = Contact.objects.get_or_create( nickname='Doom raiser', description='Lorem…') doomraiser.genres.add(doom) forgotten_tomb, _ = Contact.objects.get_or_create( nickname='Forgotten Tomb', description='Lorem…') forgotten_tomb.genres.add(doom) ....

30

slide-31
SLIDE 31

 Band Band T est Method |

class BandTest(TestCase): def setUp(self): ... def test_band_search(self): band_queryset = Band.objects.search( 'doom').values_list('nickname', 'rate') band_list = [ ('Doom raiser', 0.675475), ('The Foreshadowin', 0.258369), ('Forgotten Tomb', 0.243171)] self.assertSequenceEqual( list(OrderedDict(band_queryset).items()), band_list)

31

slide-32
SLIDE 32

 What’s next |

▪Misspelling support ▪Multiple language configuration ▪Search suggestions ▪SearchVectorField with triggers ▪JSON/JSONB Full-Text Search ▪RUM indexing

32

slide-33
SLIDE 33

 Conclusions |

Conditions to implement this solution: ▪No extra dependencies ▪Not too complex searches ▪Easy management ▪No need to synchronize data ▪PostgreSQL already in your stack ▪Python-only environment

33

slide-34
SLIDE 34

 Resources |

▪postgresql.org/docs/9.6/static/textsearch.html ▪github.com/damoti/django-tsvector-field ▪en.wikipedia.org/wiki/Full-text_search ▪docs.djangoproject.com/en/1.11/ref/contrib/postgres ▪PostgreSQL & Django source codes ▪Stack Overflow ▪Google ;-)

34

slide-35
SLIDE 35

 Acknowledgements |

Marc Tamlyn for all the Support for django.contrib.postgres

35

slide-36
SLIDE 36

 Thank you |

 BY -  SA (Attribution-ShareAlike) creativecommons.org/licenses/by-sa

Slides speakerdeck.com/pauloxnet

36

slide-37
SLIDE 37

 Questions ? |

After the talk, Please!

 *

* Speak Slowly I'm not a native English speaker

37

slide-38
SLIDE 38

 Contacts |

www.paulox.net

twitter.com/pauloxnet

linkedin.com/in/paolomelchiorre

github.com/pauloxnet

38