IR in Practice
(a.k.a. Elastic 4 IR)
Dr. Guido Zuccon
University of Queensland
g.zuccon@uq.edu.au
www.ielab.io
https://github.com/ielab/afirm2019
Plan of the Session
The architecture of a typical IR system and that of Elasticsearch
activities
instructions and exercises. Most activities come with code
activity relies on Java code.
Basic building blocks:
* Figures from Croft, Metzler, Strohman, “Search Engines: Information Retrieval in Practice”
Free download at:
Indexing process
Text acquisition: Crawler, Feeds, Conversion, Document Data Store
Text transformation: Parser (tokenizer), Stopping, Stemming, Link Analysis, Information Extraction, Classifiers
Index creation: Document Statistics, Weighting, Inversion, Index Distribution
Query process
User interaction: Query input, Query transformation, Results output
Ranking: Scoring, Performance Optimisation, Distribution
Evaluation: Logging, Ranking analysis (effectiveness), Performance analysis (efficiency)
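As a rough illustration of how a few of these building blocks fit together (parser, stopping, inversion, scoring), here is a minimal sketch in Python. The stop list, the TF-IDF weighting and the three toy documents are assumptions made for this example; they are not taken from the slides, and this is not how Elasticsearch implements these steps.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def analyse(text):
    """Parser (tokenizer) followed by stopping; stemming is omitted."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Inversion: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(analyse(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Scoring: rank documents by a simple sum of tf * idf."""
    scores = defaultdict(float)
    for term in analyse(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy collection (hypothetical documents).
docs = {
    "d1": "The anatomy of a search engine",
    "d2": "Search engines in practice: indexing and search",
    "d3": "Stemming and stopping",
}
index = build_index(docs)
results = search(index, len(docs), "search engine")
ranking = [doc_id for doc_id, _ in results]
```

Because no stemmer is applied here, "engine" and "engines" are indexed as different terms, which is why only d1 matches both query terms.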
(in the early days) from: S Brin, L Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer networks and ISDN systems, 1998
(a non-exhaustive list)
lucene.apache.org/solr/, https://www.elastic.co/, http://anserini.io/
Pyindri)
Research Industry Industry Industry Research Research Research Research Research
Distributed indexing/querying
Horizontally scalable
Multi-tenancy enabled
RESTful API
Transaction logs
Large, powerful query syntax
when searching on a specific type: types are used to include in one index data that is related but of a different nature (mapping types were limited to one per index in Elasticsearch 6.x and have since been phased out)
_index="library_catalog"
_type="books", _type="movies"
q="Shrek", q="Lord of the Rings"
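To make the index/type example concrete, the sketch below builds the URL of a URI search restricted to one type. It assumes an Elasticsearch version that still supports mapping types in the path (5.x or 6.x) and a node listening on localhost:9200; the helper name `type_search_url` and the pairing of each query with a type are hypothetical choices for illustration. No request is actually sent.

```python
from urllib.parse import quote, urlencode


def type_search_url(index, doc_type, query, host="localhost", port=9200):
    """Build /<index>/<type>/_search?q=<query> (type-scoped URI search).

    Assumes a version of Elasticsearch that still accepts a type in the path.
    """
    path = "/".join(quote(part, safe="") for part in (index, doc_type, "_search"))
    return f"http://{host}:{port}/{path}?{urlencode({'q': query})}"


# Hypothetical pairing of queries and types, for illustration only.
movies_url = type_search_url("library_catalog", "movies", "Shrek")
books_url = type_search_url("library_catalog", "books", "Lord of the Rings")
```

Both URLs target the same index; only the type segment changes, which is what lets related data of a different nature share one index.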
hardware limits of each node, by splitting a data index across multiple nodes.
shards
parallelisation of operations (also within the same node): multiple machines (or cores in one machine) can work on the same query at the same time.
index creation (default is 5 up to Elasticsearch 6.x; 1 from 7.0).
Node 1: Shard 1, Shard 3
Node 2: Shard 2, Shard 4
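The idea of splitting an index into shards can be sketched as a routing function: each document is assigned to one primary shard by hashing a routing value (by default the document ID) modulo the number of primary shards. The sketch below follows that scheme but uses `zlib.crc32` as a stand-in hash, which is an assumption made to keep the example in the standard library; Elasticsearch uses its own hash function, so the shard numbers here will not match a real cluster.

```python
import zlib


def shard_for(doc_id, num_primary_shards=5):
    """Pick a shard as hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash used only for this sketch.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def route(doc_ids, num_primary_shards=5):
    """Group document ids by the shard they would be routed to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_primary_shards)].append(doc_id)
    return shards


# Twenty hypothetical document ids spread over four shards.
doc_ids = [f"doc-{i}" for i in range(20)]
shards = route(doc_ids, num_primary_shards=4)
```

The modulo also suggests why the number of primary shards is set when the index is created: changing it would change where existing documents are expected to live.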
nodes to create replica shards
reliability and increased performance for search queries,
the replicas in parallel
indexing (default 1)
Node 1: Shard 1, Shard 3, Replica 2, Replica 4
Node 2: Shard 2, Shard 4, Replica 1, Replica 3
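The layout in this diagram can be reproduced with a toy allocation rule: primaries are placed round-robin across nodes, and each replica goes to a different node from its primary. This is only a sketch under that assumption; in a real cluster the placement is decided by Elasticsearch's own allocator, and the function name `allocate` is hypothetical.

```python
def allocate(num_shards, num_nodes, num_replicas=1):
    """Toy allocation: round-robin primaries, replicas on the following nodes."""
    if num_replicas >= num_nodes:
        # A replica must never share a node with its primary (or another copy).
        raise ValueError("need more nodes than replicas per shard")
    nodes = {n: {"primaries": [], "replicas": []} for n in range(1, num_nodes + 1)}
    for shard in range(1, num_shards + 1):
        primary_node = (shard - 1) % num_nodes  # zero-based node position
        nodes[primary_node + 1]["primaries"].append(shard)
        for offset in range(1, num_replicas + 1):
            replica_node = (primary_node + offset) % num_nodes
            nodes[replica_node + 1]["replicas"].append(shard)
    return nodes


# Four shards, two nodes, one replica per shard, as in the diagram.
layout = allocate(num_shards=4, num_nodes=2, num_replicas=1)
```

With this layout, losing either node still leaves one copy of every shard on the surviving node, and searches can be served from primaries and replicas in parallel.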
Tool Console (from Elastic developers)
<REST verb> /<indexname>/<API>
http://localhost:9200/example/doc/AV3OdFAZ7fqYee2bfgSQ
Hostname of the ES node: localhost
Port where ES is listening: 9200
Index name: example
Type name: doc
Document ID: AV3OdFAZ7fqYee2bfgSQ
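The anatomy of the document URL above can be checked with a few lines of Python. This is a minimal sketch that assumes the path has exactly three segments (index, type, document ID), as in the example; the helper name `parse_document_url` is hypothetical and no request is sent to Elasticsearch.

```python
from urllib.parse import urlsplit


def parse_document_url(url):
    """Split http://<host>:<port>/<index>/<type>/<id> into its parts."""
    parts = urlsplit(url)
    index, doc_type, doc_id = parts.path.strip("/").split("/")
    return {
        "host": parts.hostname,
        "port": parts.port,
        "index": index,
        "type": doc_type,
        "id": doc_id,
    }


doc = parse_document_url("http://localhost:9200/example/doc/AV3OdFAZ7fqYee2bfgSQ")
```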
afirm2019/tree/master/hands-on
ClueWeb12)
activity-1/
model
activity-2/
the default retrieval model
activity-3/
This can be used e.g. to extend retrieval models
activity-4/
run searches with it via Elasticsearch
5.x.x (uses Java) and one for 6.x.x (uses Python/script similarity). We shall see the one for Elasticsearch 6.x.x.
activity-5/
index
document priors
activity-6/
that you can use in a search engine GUI (e.g. for a user-based experiment)
activity-7/
stack for ingesting, searching, analysing and visualising data from any source, in any format, and in real time.
Glasgow
including for streams/tweet, diversity, learning to rank
theoretical definition of model
the Language Technologies Institute at Carnegie Mellon University.
Indri
purposes, on top of Lucene
to set standard state-of-the-art benchmarking
If you have questions or follow-ups from this practical session, you can contact me at g.zuccon@uq.edu.au.
Thanks to the ielab team for developing parts of the activities we have seen today; in particular Harrisen Scells, Jimmy, Anton van der Vegt and Daniel Locke.