Language support and linguistics in Lucene, Solr and ElasticSearch




slide-1
SLIDE 1

Language support and linguistics

in Lucene, Solr and ElasticSearch and the eco-system

Christian Moen cm@atilika.com June 3rd, 2013

slide-2
SLIDE 2

About me

  • MSc. in computer science, University of Oslo, Norway
  • Worked with search at FAST (now Microsoft) for 10 years
  • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
  • 5 years in Services doing solution delivery, technical sales, etc. in Tokyo, Japan
  • Founded アティリカ株式会社 in October, 2009
  • We help companies innovate using new technologies and good ideas
  • We do information retrieval, natural language processing and big data
  • We are based in Tokyo, but we have clients everywhere
  • We are a small company, but our customers are typically very big companies
  • Newbie Lucene & Solr Committer
  • Mostly been working on Japanese language support (Kuromoji) so far
  • Working on Korean support from a code donation (LUCENE-4956)
  • Please write to me at cm@atilika.com or cm@apache.org
slide-3
SLIDE 3

About this talk

  • Basic searching and matching
  • Challenges with natural language
  • Basic measurements for search quality
  • Linguistics in Apache Lucene
  • Linguistics in ElasticSearch (quick intro)
  • Linguistics in Apache Solr
  • Linguistics in the NLP eco-system
  • Summary and practical advice
slide-4
SLIDE 4

Hands-on demos

  • Hands-on 1: Working with Apache Lucene analyzers
  • Hands-on 2: Multi-lingual search using ElasticSearch
  • Hands-on 3: Multi-lingual search with Apache Solr
  • Hands-on 4: Other text processing using OpenNLP

slide-5
SLIDE 5

What is a search engine?

slide-6
SLIDE 6

Documents

Two documents (1 & 2) with English text:

  1  Sushi is very tasty in Japan
  2  Visiting the Tsukiji fish market is very fun

slide-7
SLIDE 7

Text segmentation

  • 1. Two documents (1 & 2) with English text

  1  Sushi is very tasty in Japan
  2  Visiting the Tsukiji fish market is very fun

  • 2. Documents are turned into searchable terms (tokenization)

  1  Sushi | is | very | tasty | in | Japan
  2  Visiting | the | Tsukiji | fish | market | is | very | fun

  • 3. Terms/tokens are converted to lowercase form (normalization)

  1  sushi | is | very | tasty | in | japan
  2  visiting | the | tsukiji | fish | market | is | very | fun

slide-9
SLIDE 9

Document indexing

Tokenized documents with normalized tokens:

  1  sushi is very tasty in japan
  2  visiting the tsukiji fish market is very fun

Inverted index - tokens are mapped to the ids of the documents that contain them:

  term      document ids
  sushi     1
  is        1, 2
  very      1, 2
  tasty     1
  in        1
  japan     1
  visiting  2
  the       2
  tsukiji   2
  fish      2
  market    2
  fun       2
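The indexing steps on these slides can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene stores its index: whitespace splitting stands in for a real tokenizer, and the function names `tokenize` and `build_inverted_index` are made up for this example.

```python
from collections import defaultdict

def tokenize(text):
    """Split on whitespace and lowercase each token (the normalization step)."""
    return [token.lower() for token in text.split()]

def build_inverted_index(docs):
    """Map each term to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "Sushi is very tasty in Japan",
    2: "Visiting the Tsukiji fish market is very fun",
}
index = build_inverted_index(docs)
print(index["very"])   # [1, 2]
print(index["sushi"])  # [1]
```

Terms that occur in both documents, such as is and very, end up with both document ids in their posting list.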

slide-11
SLIDE 11

Searching

Inverted index (term: document ids):

  sushi: 1 · is: 1, 2 · very: 1, 2 · tasty: 1 · in: 1 · japan: 1
  visiting: 2 · the: 2 · tsukiji: 2 · fish: 2 · market: 2 · fun: 2

  • Query: very tasty sushi
  • Parsed query: very AND tasty AND sushi
  • Hits: document 1

slide-18
SLIDE 18

Searching

Inverted index (term: document ids):

  sushi: 1 · is: 1, 2 · very: 1, 2 · tasty: 1 · in: 1 · japan: 1
  visiting: 2 · the: 2 · tsukiji: 2 · fish: 2 · market: 2 · fun: 2

  • Query: visit fun market
  • Parsed query: visit AND fun AND market
  • visit ≠ visiting
  • No hits

(all terms need to match)
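The two searches walked through above can be reproduced with a small boolean AND over posting lists. This is a sketch under simplifying assumptions: the index is a plain dictionary copied from the slides, the query is split on whitespace, and `search_and` is a name invented for this example.

```python
def search_and(index, query):
    """Return the ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    postings = [set(index.get(term, ())) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

index = {
    "sushi": [1], "is": [1, 2], "very": [1, 2], "tasty": [1], "in": [1],
    "japan": [1], "visiting": [2], "the": [2], "tsukiji": [2],
    "fish": [2], "market": [2], "fun": [2],
}
print(search_and(index, "very tasty sushi"))  # [1]
print(search_and(index, "visit fun market"))  # []
```

The second query returns nothing because visit has no posting list of its own; only the surface form visiting was indexed.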

slide-24
SLIDE 24

What’s the problem?

  • Search engines are not magical answering machines
  • They match terms in queries against terms in documents, and order matches by rank

slide-25
SLIDE 25

Key takeaways

  • Text processing affects search quality in a big way because it affects matching
  • The “magic” of a search engine is often provided by high-quality text processing
  • Garbage in ⇒ Garbage out

slide-26
SLIDE 26

Natural language and search

slide-27
SLIDE 27

日本語 English Deutsch Français العربية

slide-28
SLIDE 28

English

Pale ale is a beer made through warm fermentation using pale malt and is one of the world's major beer styles.

  • How do we want to index world's?
  • Should a search for style match styles? And should ferment match fermentation?

slide-31
SLIDE 31

German

Das Oktoberfest ist das größte Volksfest der Welt und es findet in der bayerischen Landeshauptstadt München statt.

The Oktoberfest is the world’s largest festival and it takes place in the Bavarian capital Munich.

  • How do we want to search ü, ö and ß?
  • Do we want a search for hauptstadt to match Landeshauptstadt?

slide-35
SLIDE 35

French

Le champagne est un vin pétillant français protégé par une appellation d'origine contrôlée.

Champagne is a French sparkling wine with a protected designation of origin.

  • How do we want to search é, ç and ô?
  • Do we want a search for aoc to match appellation d'origine contrôlée?

slide-39
SLIDE 39

Arabic

دْنـِع مَرَكلا زْوُمُر نِم اًزْمَر ةَلْيـــــــِصلا هَيِبَرَعلا ةَوْهَقلا رَبـَتـْعُت .يِبَرَعلا َملاَعلا ىِف ْبَرَعلا

Original Arabian coffee is considered a symbol of generosity among the Arabs in the Arab world.

  • Reads from right to left
  • How do we want to search ةَلْيـــــــِصلا?
  • Do we want to normalize diacritics?
  • Do we want to correct the common spelling mistake for ىِف and ه?

Diacritics normalized (removed):

دنع مركلا زومر نم ازمر ةليـــــــصلا ةيبرعلا ةوهقلا ربتعت .يبرعلا ملاعلا ىف برعلا
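Diacritic normalization of the kind asked about here can be sketched with Python's standard `unicodedata` module. This is a rough illustration, not the normalization Lucene performs: it drops every nonspacing combining mark (which also affects other scripts) and additionally removes the tatweel stretching character seen in the example word. The function name and the sample words are assumptions made for this example.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic kashida, used only to stretch a word visually

def strip_arabic_diacritics(text):
    """Remove nonspacing combining marks (such as harakat) and tatweel."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )

# Vocalized spelling of the word for "Arabic": fatha, fatha, kasra, shadda
vocalized = "\u0639\u064E\u0631\u064E\u0628\u0650\u064A\u0651"
plain = "\u0639\u0631\u0628\u064A"
print(strip_arabic_diacritics(vocalized) == plain)  # True
```

Applying the same function to both the indexed text and the query makes vocalized and unvocalized spellings match each other.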

slide-46
SLIDE 46

Japanese

JR新宿駅の近くにビールを飲みに行こうか?

Shall we go for a beer near JR Shinjuku station?

  • What are the words in this sentence? Which tokens do we index?
  • Words are implicit in Japanese - there is no white space that separates them
  • But how do we find the tokens?

slide-52
SLIDE 52

Japanese

JR新宿駅の近くにビールを飲みに行こうか?

Shall we go for a beer near JR Shinjuku station?

  • Do we want 飲む (to drink) to match 飲み?
  • Do we want ビール to match ﾋﾞｰﾙ? Does half-width match full-width?
  • Do we want (emoji) to match?
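The half-width versus full-width question can be illustrated with Unicode compatibility normalization (NFKC) from Python's standard library. This is a sketch of one possible character normalization, not a description of what Kuromoji or any Lucene filter does; `normalize_width` is a name chosen for this example.

```python
import unicodedata

def normalize_width(text):
    """Fold half-width katakana and full-width Latin letters via Unicode NFKC."""
    return unicodedata.normalize("NFKC", text)

half_width_beer = "\uFF8B\uFF9E\uFF70\uFF99"  # half-width katakana spelling of "beer"
full_width_beer = "\u30D3\u30FC\u30EB"        # ビール
print(normalize_width(half_width_beer) == full_width_beer)  # True
print(normalize_width("\uFF2A\uFF32"))                      # JR
```

If the same normalization runs at index time and at query time, both spellings map to the same term.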

slide-55
SLIDE 55

Common traits

  • Segmenting source text into tokens
  • Dealing with non-space separated languages
  • Handling punctuation in space separated languages
  • Segmenting compounds into their parts
  • Apply relevant linguistic normalizations
  • Character normalization
  • Morphological (or grammatical) normalizations
  • Spelling variations
  • Synonyms and stopwords
slide-56
SLIDE 56

Key take-aways

  • Natural language is very complex
  • Each language is different with its own set of complexities
  • We have had a high-level look at Japanese, English, German, French and Arabic
  • But there is also Greek, Hebrew, Chinese, Korean, Russian, Thai, Spanish and many more ...
  • Search needs per-language processing
  • Many considerations to be made (often application-specific)

slide-57
SLIDE 57

Basic search quality measurements

slide-58
SLIDE 58

Precision

Fraction of retrieved documents that are relevant

$$\text{precision} = \frac{|\{\text{relevant docs}\} \cap \{\text{retrieved docs}\}|}{|\{\text{retrieved docs}\}|}$$

slide-59
SLIDE 59

Recall

$$\text{recall} = \frac{|\{\text{relevant docs}\} \cap \{\text{retrieved docs}\}|}{|\{\text{relevant docs}\}|}$$

Fraction of relevant documents that are retrieved

slide-60
SLIDE 60

Precision vs. Recall

  • Should I optimize for precision or recall?
  • That depends on your application
  • In practice, a lot of tuning work is about improving recall without hurting precision
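Both measures are easy to compute once the relevant and retrieved document ids are known. The sketch below uses made-up document ids, and the function name `precision_recall` is an assumption for this example.

```python
def precision_recall(relevant, retrieved):
    """Compute precision and recall from two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents are relevant; the engine returned three, two of them relevant.
precision, recall = precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
print(precision, recall)
```

Here precision is 2/3 and recall is 2/4. Returning more documents could raise recall, but every extra non-relevant hit lowers precision.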

slide-63
SLIDE 63

Linguistics in Lucene

slide-64
SLIDE 64

Simplified architecture

document or query → Lucene analysis chain / Analyzer → Index

  • 1. Analyzes queries or documents in a pipelined fashion before indexing or searching
  • 2. Analysis itself is done by an analyzer on a per-field basis
  • 3. Key plug-in point for linguistics in Lucene

slide-66
SLIDE 66

Analyzers

What does an Analyzer do?

  • An Analyzer takes text as its input and turns it into a stream of tokens
  • Tokens are produced by a Tokenizer
  • Tokens can be processed further by a chain of TokenFilters downstream

slide-70
SLIDE 70

Analyzer high-level concepts

Reader → Tokenizer → TokenFilter → TokenFilter → TokenFilter

Reader

  • Stream to be analyzed is provided by a Reader (from java.io)
  • Can have a chain of associated CharFilters (not discussed)

Tokenizer

  • Segments text provided by the Reader into tokens
  • Most interesting things happen in the incrementToken() method

TokenFilter

  • Updates, mutates or enriches tokens
  • Most interesting things happen in the incrementToken() method

slide-71
SLIDE 71

Lucene processing example

Input: Le champagne est protégé par une appellation d'origine contrôlée.

FrenchAnalyzer processes the input in five steps:

  • 1. StandardTokenizer: Le | champagne | est | protégé | par | une | appellation | d'origine | contrôlée
  • 2. ElisionFilter: Le | champagne | est | protégé | par | une | appellation | origine | contrôlée
  • 3. LowerCaseFilter: le | champagne | est | protégé | par | une | appellation | origine | contrôlée
  • 4. StopFilter: champagne | protégé | appellation | origine | contrôlée
  • 5. FrenchLightStemFilter: champagn | proteg | apel | origin | control

slide-84
SLIDE 84

Analyzer processing model

  • Analyzers provide a TokenStream
  • Retrieve it by calling tokenStream(field, reader)
  • tokenStream() bundles together tokenizers and any additional filters necessary for analysis
  • Input is advanced by incrementToken()
  • Information about the token itself is provided by so-called TokenAttributes attached to the stream
  • Attributes for term text, offset, token type, etc.
  • TokenAttributes are updated on incrementToken()
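The pull-based processing model can be imitated in Python with chained generators, where each stage pulls tokens from the one before it. This is only an analogy to the tokenizer and filter chain, not Lucene's API: the stop list and elision list are tiny illustrative subsets, the regular expression is a stand-in for StandardTokenizer, and stemming is left out because the light stemmer's rules are not reproduced here.

```python
import re
import unicodedata

STOPWORDS = {"le", "est", "par", "une"}  # tiny illustrative stop list (assumption)
ELISIONS = ("l'", "d'", "qu'")           # illustrative subset of French contractions

def tokenizer(text):
    """Yield word tokens, keeping apostrophes that occur inside a word."""
    for match in re.finditer(r"\w+(?:'\w+)*", unicodedata.normalize("NFC", text)):
        yield match.group()

def elision_filter(tokens):
    """Strip a leading elided article such as d' from each token."""
    for token in tokens:
        for prefix in ELISIONS:
            if token.lower().startswith(prefix):
                token = token[len(prefix):]
                break
        yield token

def lowercase_filter(tokens):
    """Lowercase every token."""
    for token in tokens:
        yield token.lower()

def stop_filter(tokens):
    """Drop tokens that are on the stop list."""
    for token in tokens:
        if token not in STOPWORDS:
            yield token

def analyze(text):
    """Chain the tokenizer and the filters, then drain the resulting stream."""
    stream = tokenizer(text)
    for token_filter in (elision_filter, lowercase_filter, stop_filter):
        stream = token_filter(stream)
    return list(stream)

print(analyze("Le champagne est protégé par une appellation d'origine contrôlée."))
```

The result corresponds to the StopFilter output in the French example. No token is produced until the final `list()` call pulls it through the chain, which is loosely comparable to repeatedly calling incrementToken().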
slide-85
SLIDE 85

Hands-on: Working with analyzers in code

slide-86
SLIDE 86

Synonyms

slide-87
SLIDE 87

Synonyms

  • Synonyms are flexible and easy-to-use
  • Very powerful tools for improving recall
  • Two types of synonyms
  • One way/mapping “sparkling wine => champagne”
  • Two way/equivalence “aoc, appellation d'origine contrôlée”
  • Can be applied index-time or query-time
  • Apply synonyms on one side - not both
  • Best practice is to apply synonyms query-side
  • Allows for updating synonyms without reindexing
  • Allows for turning synonyms on and off easily
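Query-side synonym handling with the two synonym types can be sketched as a lookup that runs before the query is sent to the index. This is a simplified illustration, not Lucene's synonym filter: the data structures and the `expand` function are assumptions for this example, and the entries are the two examples from the slide.

```python
# One way/mapping: the left-hand side is replaced by the right-hand side
ONE_WAY = {"sparkling wine": ["champagne"]}
# Two way/equivalence: every member of a group matches all the others
EQUIVALENT = [{"aoc", "appellation d'origine contrôlée"}]

def expand(term):
    """Return the variants to search for a query term or phrase."""
    term = term.lower()
    if term in ONE_WAY:
        return sorted(ONE_WAY[term])
    for group in EQUIVALENT:
        if term in group:
            return sorted(group)
    return [term]

print(expand("sparkling wine"))  # ['champagne']
print(expand("champagne"))       # ['champagne']
print(len(expand("AOC")))        # 2
```

Because the expansion happens at query time, changing either table takes effect on the next search without reindexing.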
slide-88
SLIDE 88

Hands-on: French analysis with synonyms

slide-89
SLIDE 89

Linguistics in ElasticSearch (quick intro)

slide-90
SLIDE 90

ElasticSearch linguistics highlights

  • Uses Lucene analyzers, tokenizers & filters
  • Analyzers are made available through a provider interface
  • Some analyzers are available through plugins, e.g. kuromoji, smartcn, icu, etc.
  • Analyzers can be set up in your mapping
  • Analyzers can also be chosen based on a field in your document, e.g. a lang field
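Setting analyzers in a mapping might look roughly like the request body built below. This is a hedged sketch: the field names are hypothetical, the `text` field type follows more recent ElasticSearch releases (releases from the time of this talk used `string`), and the `kuromoji` analyzer is only available when the corresponding analysis plugin is installed.

```python
import json

# Hypothetical per-language fields, each bound to a language analyzer by name
mapping = {
    "mappings": {
        "properties": {
            "title_fr": {"type": "text", "analyzer": "french"},
            "title_de": {"type": "text", "analyzer": "german"},
            "title_ja": {"type": "text", "analyzer": "kuromoji"},
        }
    }
}
body = json.dumps(mapping, indent=2)
print(body)
```

The printed JSON would be sent as the body of an index-creation request; the exact structure depends on the ElasticSearch version in use.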

slide-91
SLIDE 91

Hands-on: Simple multi-language example

slide-92
SLIDE 92

Linguistics in Solr

slide-93
SLIDE 93

Linguistics in Solr

  • Uses Lucene analyzers, tokenizers & filters
  • Linguistic processing is defined by field types in schema.xml
  • Different processing can be applied on indexing and querying side if desired
  • A rich set of pre-defined and ready-to-use per-language field types are available
  • Defaults can be used as starting points for further configuration or as they are

slide-94
SLIDE 94

French in schema.xml

<!-- French -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- removes l', etc -->
    <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
    <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
  </analyzer>
</fieldType>

<!-- French -->
<field name="title" type="text_fr" indexed="true" stored="true"/>
<field name="body" type="text_fr" indexed="true" stored="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>

slide-95
SLIDE 95

Arabic in schema.xml

<!-- Arabic -->
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- for any non-arabic -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" enablePositionIncrements="true"/>
    <!-- normalizes alef maksura to yeh, etc -->
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Arabic -->
<field name="title" type="text_ar" indexed="true" stored="true"/>
<field name="body" type="text_ar" indexed="true" stored="true"/>
<dynamicField name="*_ar" type="text_ar" indexed="true" stored="true"/>

slide-96
SLIDE 96

Field types in schema.xml

  • text_ar Arabic
  • text_bg Bulgarian
  • text_ca Catalan
  • text_cjk CJK
  • text_cz Czech
  • text_da Danish
  • text_de German
  • text_el Greek
  • text_es Spanish
  • text_eu Basque
  • text_fa Farsi
  • text_fi Finnish
  • text_fr French
  • text_ga Irish
  • text_gl Galician
  • text_hi Hindi
  • text_hu Hungarian
  • text_hy Armenian
  • text_id Indonesian
  • text_it Italian
  • text_lv Latvian
  • text_nl Dutch
  • text_no Norwegian
  • text_pt Portuguese
  • text_ro Romanian
  • text_ru Russian
  • text_sv Swedish
  • text_th Thai
  • text_tr Turkish
slide-97
SLIDE 97

Field types in schema.xml

Coming soon (LUCENE-4956), in addition to the field types listed on the previous slide:

  • text_ko Korean
slide-98
SLIDE 98

Solr processing

slide-99
SLIDE 99

Adding document details

Request: <add> <doc> <field> ∙∙∙ </field> </doc> </add>
Document fields: id, title, body

UpdateRequestHandler handles the request

  • 1. Receives a document via HTTP in XML (or JSON, CSV, ...)
  • 2. Converts document to a SolrInputDocument
  • 3. Activates the update chain

slide-103
SLIDE 103

Adding document details

Update chain of UpdateRequestProcessors

  • 1. Processes a document at a time with operation (add)
  • 2. Plugin logic can mutate SolrInputDocument, e.g. add fields or do other processing as desired
  • 3. In this example, an update processor added a lang field by analyzing body
  • 4. Finish by calling RunUpdateProcessor (usually)

Document fields before the update chain: id, title, body
Document fields after the update chain: id, title, body, lang
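The document-level processing described here can be mimicked with a list of functions that each receive and return the document. This is a toy sketch, not Solr's API: the language detector only looks at Unicode ranges, and all function names are invented for this example.

```python
def detect_language(text):
    """Toy detector: guess the language from the script of the first telling character."""
    for ch in text:
        if "\u3040" <= ch <= "\u30ff":  # hiragana and katakana
            return "ja"
        if "\u0600" <= ch <= "\u06ff":  # Arabic
            return "ar"
    return "en"

def language_processor(doc):
    """Add a lang field by analyzing the body field."""
    doc["lang"] = detect_language(doc.get("body", ""))
    return doc

def process(doc, chain):
    """Pass one document through every processor in the chain, in order."""
    for processor in chain:
        doc = processor(doc)
    return doc

index = []

def run_update(doc):
    """Final step of the chain: hand the document over for indexing."""
    index.append(doc)
    return doc

process({"id": "1", "title": "Beer", "body": "ビールを飲みに行こう"}, [language_processor, run_update])
print(index[0]["lang"])  # ja
```

A real language identifier would use statistical models rather than script ranges, but the shape of the chain is the same: the document enters with id, title and body, and leaves with an extra lang field.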

slide-110
SLIDE 110

Adding document details

Lucene analyzer chain

  • 1. Fields are analyzed individually
  • 2. No analysis on id
  • 3. Field title is processed
  • 4. Field body is processed
  • 5. Field lang is processed; it uses a different analyzer chain
  • 6. All fields analyzed: id, title, body and lang are added to the index

slide-123
SLIDE 123

Index

id ... title ... body ... genre ...

Adding document details

slide-124
SLIDE 124

Search details

query → SearchHandler → Search components → Analysis chain → Index → Search components → SearchHandler → result

slide-131
SLIDE 131

Hands-on: Multi-lingual search with Solr

slide-132
SLIDE 132

Multi-language challenges

  • How do we detect language accurately?
  • Indexing side is feasible (accuracy > 99.1%), but query side is hard because of ambiguity
  • How to deal with language on the query side?
  • Supply language to use in the application (best if possible)
  • Search all relevant language variants (OR query)
  • Search a fallback field using n-gramming
  • Boost important language or content

Not knowing the language of the query terms will most likely have a negative impact on overall rank
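Searching all language variants with boosts can be sketched as a small query builder that emits Lucene query syntax. The field names, the n-gram fallback field and the boost values are hypothetical, and the sketch assumes the user query contains plain terms only, with no query-syntax characters that would need escaping.

```python
def build_multilingual_query(user_query, fields_with_boosts):
    """Build one OR query that searches every language variant of a field."""
    clauses = [
        f"{field}:({user_query})^{boost}"
        for field, boost in fields_with_boosts.items()
    ]
    return " OR ".join(clauses)

query = build_multilingual_query(
    "champagne",
    {"title_fr": 2.0, "title_de": 1.0, "title_ngram": 0.5},
)
print(query)
# title_fr:(champagne)^2.0 OR title_de:(champagne)^1.0 OR title_ngram:(champagne)^0.5
```

Each field applies its own query-side analysis chain, so the same user input is stemmed differently per language, and the boosts let the most likely language dominate the ranking.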

slide-133
SLIDE 133

NLP eco-system

slide-134
SLIDE 134

Basis Technology

  • High-end provider of text analytics software
  • Rosette Linguistics Platform (RLP) highlights
  • Language and encoding identification (55 languages and 45 encodings)
  • Segmentation for Chinese, Japanese and Korean
  • De-compounding for German, Dutch, Korean, etc.
  • Lemmatization for a range of languages
  • Part-of-speech tagging for a range of languages
  • Sentence boundary detection
  • Named entity extraction
  • Name indexing, transliteration and matching
  • Integrates well with Lucene/Solr
slide-135
SLIDE 135

Apache OpenNLP

  • Machine learning toolkit for NLP
  • Implements a range of common and best-practice algorithms
  • Very easy-to-use tools and APIs targeted towards NLP
  • Features and applications
  • Tokenization
  • Sentence segmentation
  • Part-of-speech tagging
  • Named entity recognition
  • Chunking
  • Licensing terms
  • Code itself has an Apache License 2.0
  • Some models are available, but licensing terms and F-scores are unclear...
  • See LUCENE-2899 for OpenNLP as a Lucene Analyzer (work in progress)
slide-136
SLIDE 136

Hands-on: Basic text processing with OpenNLP

slide-137
SLIDE 137

Other eco-system options

slide-138
SLIDE 138

Summary

slide-139
SLIDE 139

Summary

  • Getting languages right is a hard problem
  • Linguistics helps improve search quality
  • Linguistics in Lucene, ElasticSearch and Solr
  • A wide range of languages are supported out-of-the-box
  • Considerations to be made on indexing and query side
  • Lucene Analyzers work on a per-field level
  • Solr UpdateRequestProcessors work on the document level
  • Solr has functionality for automatically detecting language (available in ElasticSearch as a plugin)

  • Linguistics options also available in the eco-system
slide-140
SLIDE 140

Practical advice

slide-141
SLIDE 141

Practical advice

  • Understand your content and your users’ needs
  • Understand your language and its issues
  • Understand what users want from search
  • Do you have issues with recall?
  • Consider synonyms, stemming
  • Consider compound-segmentation for European languages
  • Consider WordDelimiterFilter, phonetic matching
  • Do you have issues with precision?
  • Consider using ANDs instead of ORs for terms
  • Consider improving content quality? Search fewer fields?
  • Is some content more important than other content?
  • Consider boosting content with a boost query
slide-142
SLIDE 142

Thank you

  • Jan Høydahl (www.cominvent.com): thanks for some slide material
  • Bushra Zawaydeh: thanks for fun Arabic language lessons
  • Gaute Lambertsen: thanks for helping with talk preparations

slide-143
SLIDE 143

Example code

  • Example code will be available on Github
  • https://github.com/atilika/berlin-buzzwords-2013
  • Get started using
  • git clone git://github.com/atilika/berlin-buzzwords-2013.git
  • less berlin-buzzwords-2013/README.md
  • Contact us if you have questions
  • hello@atilika.com
slide-144
SLIDE 144

ありがとうございました

Thank you very much

ليزج اركش

Vielen Dank Merci beaucoup