slide-1
SLIDE 1

Flexible Similarity Search of Semantic Vectors Using Fulltext Search Engines

Michal Růžička, Vít Novotný, Petr Sojka; Jan Pomikálek, Radim Řehůřek

Masaryk University, Faculty of Informatics, Brno, Czech Republic mruzicka@mail.muni.cz, witiko@mail.muni.cz, sojka@fi.muni.cz; RaRe Technologies honza@rare-technologies.com, radim@rare-technologies.com

https://mir.fi.muni.cz/ https://rare-technologies.com/

Illustrations by Jiří Franek.

slide-2
SLIDE 2

Outline

1 Semantic Indexing and Searching
2 String Encoding of Semantic Vectors
3 Results

Flexible Similarity Search of Semantic Vectors Using Fulltext Search Engines, ISWC 2017 workshop HSSUES, Vienna, Austria, October 21, 2017
slide-4
SLIDE 4

Semantic Indexing

Indexing pipeline (each stage with the form of the data it outputs):

  • Input Document: document as a file (e-mail, …)
  • DataReader (e.g. pdf2text): document as plain text
  • Tokenizer (e.g. tokenizer): document as a token list
  • Segmenter (e.g. paragraph / logical part [table, formula] segmenter): document as a segment list
  • Segment2Vec SemanticModeler (e.g. TI, LSI, deep learning, doc2vec): document as a list of points representing segments
  • Index of Vectors: segments in all documents
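The indexing stages listed on this slide can be sketched as a chain of small functions. This is a toy illustration only: the function names, the fixed-size segmenter and the bag-of-words vectorizer are assumptions made for the example, standing in for the real components (pdf2text, LSI, doc2vec and so on) that the slide names.

```python
import re
from collections import Counter


def read_text(document):
    """Stand-in for the DataReader stage; here the input is already plain text."""
    return document


def tokenize(text):
    """Stand-in for the Tokenizer stage: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def segment(tokens, size=4):
    """Stand-in for the Segmenter stage: fixed-size runs of tokens (an assumption)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def segment_to_vector(seg, vocabulary):
    """Stand-in for Segment2Vec: term counts over a fixed vocabulary (an assumption)."""
    counts = Counter(seg)
    return [float(counts[term]) for term in vocabulary]


def index_documents(documents, vocabulary, size=4):
    """Run every document through the pipeline and collect one vector per segment."""
    index = []
    for doc_id, raw in documents.items():
        tokens = tokenize(read_text(raw))
        for seg_no, seg in enumerate(segment(tokens, size)):
            index.append((doc_id, seg_no, segment_to_vector(seg, vocabulary)))
    return index


documents = {"d1": "Cats chase mice. Mice fear cats and dogs."}
vocabulary = ["cats", "mice", "dogs"]
index = index_documents(documents, vocabulary)
```

Each index entry keeps the document identifier and the segment number next to the vector, so that a matching segment can later be traced back to its document.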
slide-5
SLIDE 5

Semantic Searching with Nuggets

Search pipeline:

  • Query Document (query document as a file) → Indexing Pipeline → Query Nuggets (query as semantic vectors)
  • Indexed documents (doc, doc, doc) → Document Nuggets (semantic vectors)
  • Query Nuggets and Document Nuggets → Similarity Search → Candidate Nuggets
  • Candidate Nuggets → Ranker → Results as Sorted Nuggets → Results as Sorted Documents
slide-6
SLIDE 6

Re-Ranking Techniques

1 Fast: find candidate nuggets via Elasticsearch.
2 Slow but precise: re-rank candidate nuggets with exact similarity metric.
  • Cosine similarity.
  • Euclidean similarity.
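A minimal sketch of the second, precise phase: given candidate nuggets already returned by the fast fulltext search, re-rank them with an exact metric. The `rerank` function and its dictionary input are hypothetical, and treating "Euclidean similarity" as negated Euclidean distance is an assumption, since the slide does not define it.

```python
import math


def cosine_similarity(a, b):
    """Exact cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean_similarity(a, b):
    """Assumed definition: negated Euclidean distance, so closer means higher."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rerank(query, candidates, metric="cosine"):
    """Return candidate nugget ids sorted from most to least similar.

    candidates maps a nugget id to its full-precision vector.
    """
    score = cosine_similarity if metric == "cosine" else euclidean_similarity
    return sorted(candidates, key=lambda nid: score(query, candidates[nid]), reverse=True)


query = [1.0, 0.0]
candidates = {"a": [0.0, 1.0], "b": [2.0, 0.1], "c": [1.0, 1.0]}
by_cosine = rerank(query, candidates, metric="cosine")
by_euclidean = rerank(query, candidates, metric="euclidean")
```

The two metrics can order the same candidates differently: here nugget "b" points in almost the same direction as the query and wins under cosine similarity, while nugget "c" lies closer to the query and wins under the Euclidean variant.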
slide-20
SLIDE 20

String Encoding of Semantic Vectors

  • Encoding of semantic vectors to strings (feature tokens):
  • Semantic vector of three dimensions:

$\vec{x} = [0.12, -0.13, 0.065]$

  • Rounding to two decimal places, string encoded:

$\vec{x} \approx [0.12, -0.13, 0.07]$ → ['0P2i0d12', '1P2ineg0d13', '2P2i0d07']

  • Feature tokens:
    • 0P2i0d12
    • 1P2ineg0d13
    • 2P2i0d07
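Reading the tokens above, each one appears to consist of the dimension index, the marker `P2` for a precision of two decimal places, the letter `i`, and the rounded value with `d` in place of the decimal point and `neg` in place of a minus sign. The sketch below reproduces the three tokens under that reading. The function names are hypothetical, and the half-up rounding mode is an assumption chosen so that 0.065 becomes 0.07 as on the slide.

```python
from decimal import Decimal, ROUND_HALF_UP


def encode_feature(index, value, precision=2):
    """Encode one vector component as a feature token such as '1P2ineg0d13'."""
    quantum = Decimal(10) ** -precision  # e.g. Decimal('0.01') for precision 2
    # Going through str() avoids binary floating-point surprises when rounding.
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    sign = "neg" if rounded < 0 else ""
    digits = str(abs(rounded)).replace(".", "d")
    return f"{index}P{precision}i{sign}{digits}"


def encode_vector(vector, precision=2):
    """Encode every component of a vector, keeping its position as the prefix."""
    return [encode_feature(i, v, precision) for i, v in enumerate(vector)]


tokens = encode_vector([0.12, -0.13, 0.065])
# tokens == ['0P2i0d12', '1P2ineg0d13', '2P2i0d07']
```

Because each token carries its dimension index, two vectors share a token only when they agree on the rounded value of the same dimension, which is what lets a fulltext engine match them as ordinary terms.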
slide-21
SLIDE 21

High-Pass Filtering – Speed Optimization

  • High-pass filtering of the vector $\vec{x} = [0.12, -0.13, 0.065]$:

trim: Fixed threshold, for example 0.1. Keep only 0.12 and −0.13 from $\vec{x}$, as $|0.065| < 0.1$.

best: Fixed number of the best values is used, for example only the best one. Keep only −0.13 from $\vec{x}$, as $|-0.13|$ is the highest absolute value in $\vec{x}$.

Speed optimization of the search for candidate nuggets without significant impact on the quality.
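The two filtering strategies can be sketched as follows. The function names and the choice to return (index, value) pairs are assumptions for illustration; keeping the index preserves which dimension each surviving value belongs to. Whether a value exactly equal to the threshold is kept is also an assumption, as the slide only shows a value strictly below it being dropped.

```python
def trim(vector, threshold):
    """Keep components whose absolute value is at least the fixed threshold."""
    return [(i, v) for i, v in enumerate(vector) if abs(v) >= threshold]


def best(vector, count):
    """Keep the given number of components with the highest absolute values."""
    ranked = sorted(enumerate(vector), key=lambda pair: abs(pair[1]), reverse=True)
    return sorted(ranked[:count])  # restore dimension order


x = [0.12, -0.13, 0.065]
trimmed = trim(x, 0.1)  # [(0, 0.12), (1, -0.13)]
top_one = best(x, 1)    # [(1, -0.13)]
```

Fewer surviving components mean fewer feature tokens per query, which is where the speed-up in the candidate search would come from.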
slide-24
SLIDE 24

Datasets

en-wiki The English Wikipedia dataset.

  • LSA with 400 dimensions
  • doc2vec with 400 dimensions.

wiki-2014+gigaword-5 Pre-trained word vectors from Wikipedia and English Gigaword Fifth Edition.

  • GloVe with 50, 100, 200, and 300 dimensions.

common-crawl Pre-trained word vectors from the Common Crawl project.

  • GloVe with 300 dimensions.

twitter Pre-trained word vectors from the Twitter social network.

  • GloVe with 25, 50, 100, and 200 dimensions.

texmex Image descriptors provided by the TEXMEX project.

  • SIFT descriptors of images with 128 dimensions.
slide-25
SLIDE 25

Comparison of Results

[Figure: Precision@10 (vertical axis) against page size from 100 to 600 (horizontal axis), with one curve per number of best features used: all, 90, 40, 17, 6. Panels: English Wikipedia, cosine similarity; TEXMEX SIFT descriptors, cosine similarity; TEXMEX SIFT descriptors, Euclidean similarity.]
slide-30
SLIDE 30

Summary

Flexible: Different input data formats / tokenizers / segmenters / semantic models / re-ranking methods / fulltext search engines / …

Similarity Search: Cosine / Euclidean / … similarity.

of Semantic Vectors: LSI / deep learning / doc2vec / …

using Fulltext Search Engines: Sphinx, Lucene, Elasticsearch, Solr, …
slide-31
SLIDE 31

Questions?
slide-32
SLIDE 32

Illustrations by Jiří Franek.

RŮŽIČKA, Michal, Vít NOVOTNÝ, Petr SOJKA, Jan POMIKÁLEK and Radim ŘEHŮŘEK. Flexible Similarity Search of Semantic Vectors Using Fulltext Search Engines. In CEUR Workshop Proceedings, Vol. 1923. Vienna, Austria: [publisher not stated], 2017. p. 1–12, 12 pp. ISSN 1613-0073. https://usc-isi-i2.github.io/ISWC17workshop/accepted-papers/HSSUES_2017_paper_2.pdf

RYGL, Jan, Jan POMIKÁLEK, Radim ŘEHŮŘEK, Michal RŮŽIČKA, Vít NOVOTNÝ and Petr SOJKA. Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines. In Proceedings of the 2nd Workshop on Representation Learning for NLP. Vancouver, Canada: Association for Computational Linguistics, 2017. p. 81–90, 179 pp. ISBN 978-1-945626-62-3. DOI: https://doi.org/10.18653/v1/W17-2611

RYGL, Jan, Petr SOJKA, Michal RŮŽIČKA and Radim ŘEHŮŘEK. ScaleText: The Design of a Scalable, Adaptable and User-Friendly Document System for Similarity Searches: Digging for Nuggets of Wisdom in Text. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. Brno: Tribun EU, 2016. p. 79–87, 9 pp. ISBN 978-80-263-1095-2. https://nlp.fi.muni.cz/raslan/2016/paper08-Rygl_Sojka_etal.pdf