PISA: Performant Indexes and Search for Academia (Antonio Mallia)

PISA: Performant Indexes and Search for Academia
Antonio Mallia (New York University), Michal Siedlaczek (New York University), Joel Mackenzie (RMIT University), Torsten Suel (New York University)
The Open-Source IR Replicability Challenge (OSIRRC 2019), July 25, 2019

PISA: Performant Indexes and Search for Academia
github.com/pisa-engine/pisa/

Design Overview
PISA is designed to be efficient, extensible, and easy to use.
• Modern C++17 implementation
• Low-level optimizations: CPU intrinsics, branch prediction hinting, and SIMD instructions
• Extensible: pluggable parsers, stemmers, compression codecs, and query processing algorithms
• Indexing, parsing, and sharding capabilities
• Implementation of document reordering
• Free and open-source with a permissive license

Index building pipeline
[Figure: pipeline diagram. Collection -> parse -> Forward Index -> invert -> Canonical Inverted Index -> (reorder documents) -> compress -> Compressed Index. Index Metadata (term lexicon, document lexicon) is extracted alongside, and the index can be exported to an External System.]
• Parsing Collection: several archive parsers, an HTML content parser, a tokenizer, and a stemming algorithm.
• Indexing: produces an inverted index in an uncompressed and universally readable format from a forward index.
• Document Reordering: reassigns the document identifiers within the inverted index: Random, URL, MinHash, and BP.
• Index Compression: Variable Byte encoders, word-aligned encoders, monotonic encoders, and frame-of-reference encoders.
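The invert and compress stages can be illustrated with a minimal Python sketch. It is not PISA's C++ implementation or on-disk format: the function names (`invert`, `to_gaps`, `vbyte_encode`, and so on) are hypothetical, and the byte layout shown (7 payload bits per byte, high bit set on the final byte of each integer) is only one of several Variable Byte conventions, chosen here as an assumption.

```python
from collections import defaultdict
from itertools import accumulate


def invert(forward_index):
    """Turn {doc_id: [term, ...]} into {term: [(doc_id, freq), ...]}, sorted by doc_id."""
    postings = defaultdict(dict)
    for doc_id in sorted(forward_index):
        for term in forward_index[doc_id]:
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    return {term: sorted(docs.items()) for term, docs in postings.items()}


def to_gaps(doc_ids):
    """Replace increasing doc ids by the first id followed by successive differences."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps):
    """Inverse of to_gaps: a running sum restores the original doc ids."""
    return list(accumulate(gaps))


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low bits first.

    Assumed convention: the high bit is set only on the last byte of each integer.
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, value, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # last byte of the current integer
            numbers.append(value | ((byte & 0x7F) << shift))
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return numbers
```

Gap encoding is what makes Variable Byte effective: small differences between neighbouring document identifiers fit in a single byte, which is also why reordering documents so that similar ones get nearby identifiers tends to shrink the index.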

Supported Collections
• Robust04 consists of newswire articles from a variety of sources from the late 1980s through to the mid 1990s.
• Core17 corresponds to the New York Times news collection, which contains news articles published between 1987 and 2007.
• Core18 is the TREC Washington Post Corpus, which consists of news articles and blog posts from January 2012 through August 2017.
• Gov2 is the TREC 2004 Terabyte Track test collection, consisting of around 25 million .gov sites crawled in early 2004; the documents are truncated to 256 kB.
• ClueWeb09 is the ClueWeb 2009 Category B collection, consisting of around 50 million English web pages crawled between January and February 2009.
• ClueWeb12 is the ClueWeb 2012 Category B-13 collection, which contains around 52 million English web pages crawled between February and May 2012.

Feature Overview
• Scoring: BM25, Language Models, DPH, PL2
• Search: Boolean and scored conjunctions or disjunctions
• Traversal strategy: Document-at-a-Time or Term-at-a-Time
• Dynamic pruning algorithms: MaxScore and WAND, and their Block-Max counterparts, Block-Max MaxScore (BMM) and Block-Max WAND (BMW)
• Variable-sized blocks can be built (in lieu of fixed-sized blocks) to support the variable-block family of Block-Max algorithms, such as Variable Block-Max WAND (VBMW)
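As a rough illustration of scored disjunctive, Document-at-a-Time retrieval, the sketch below scores every document that contains at least one query term and keeps the best k in a min-heap. It is an exhaustive baseline, not one of the dynamic pruning algorithms listed above, and it is not PISA's code. The function names, the postings layout (`{term: [(doc_id, tf), ...]}` sorted by document id), the particular BM25 variant, and the defaults `k1=0.9`, `b=0.4` are all assumptions made for the example.

```python
import heapq
import math


def bm25(tf, df, doc_len, avg_len, num_docs, k1=0.9, b=0.4):
    """One common BM25 variant; k1 and b defaults are assumptions, not PISA's settings."""
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / norm


def daat_top_k(query_terms, index, doc_lens, k):
    """Exhaustive Document-at-a-Time disjunction over postings sorted by doc id.

    index:    {term: [(doc_id, tf), ...]}
    doc_lens: {doc_id: length in tokens}
    Returns up to k (score, doc_id) pairs, best first.
    """
    num_docs = len(doc_lens)
    avg_len = sum(doc_lens.values()) / num_docs
    lists = [index[t] for t in query_terms if t in index]
    cursors = [0] * len(lists)
    heap = []  # min-heap holding the current top-k as (score, doc_id)

    while True:
        # The next document to score is the smallest doc id under any cursor.
        heads = [lst[pos][0] for lst, pos in zip(lists, cursors) if pos < len(lst)]
        if not heads:
            break
        doc_id = min(heads)

        score = 0.0
        for i, lst in enumerate(lists):
            pos = cursors[i]
            if pos < len(lst) and lst[pos][0] == doc_id:
                tf = lst[pos][1]
                score += bm25(tf, len(lst), doc_lens[doc_id], avg_len, num_docs)
                cursors[i] += 1  # advance only the lists that matched

        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))

    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))
```

Pruning algorithms such as MaxScore and WAND return the same top-k (they are rank-safe) but use per-term or per-block score upper bounds, together with the heap's current minimum score, to skip documents that cannot enter the result set.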

System Effectiveness
We process rank-safe, disjunctive, top-k queries to depth k = 1,000.

Collection   Topics     MAP      P@30     NDCG@20
Robust04     All        0.2534   0.3120   0.4221
Core17       All        0.2078   0.4260   0.3898
Core18       All        0.2384   0.3500   0.3927
Gov2         701-750    0.2638   0.4776   0.4070
Gov2         751-800    0.3305   0.5487   0.5073
Gov2         801-850    0.2950   0.4680   0.4925
ClueWeb09    51-100     0.1009   0.2521   0.1509
ClueWeb09    101-150    0.1093   0.2507   0.2177
ClueWeb09    151-200    0.1054   0.2100   0.1311
ClueWeb12    201-250    0.0449   0.1940   0.1529
ClueWeb12    251-300    0.0217   0.1240   0.1484
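For readers unfamiliar with the three metrics, the following Python sketch shows how they are typically computed for a single topic. The function names are hypothetical, and the NDCG formulation (gain equal to the relevance grade, discount of log2(rank + 1)) is one common variant assumed here; evaluation tools differ in such details, so this sketch should not be expected to reproduce the figures above exactly. MAP is the mean of the per-topic average precision.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(ranked, gains, k):
    """NDCG@k with gain = relevance grade and discount = log2(rank + 1) (assumed variant).

    gains: {doc_id: relevance grade}; documents not listed count as grade 0.
    """
    dcg = sum(
        gains.get(doc, 0) / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```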

Future Plans
PISA is still a relatively young project, aspiring to become a more widely used tool for IR experimentation. Many relevant features can still be developed to further enrich the framework:
• Precomputed quantized partial scores
• Query expansion
• Score-at-a-Time
• Boilerplate removal
• Learning-To-Rank (LTR)
• Distributed indexes

Lessons Learned
• Docker is good for reproducibility
• Architecture-optimized binaries are not portable using Docker
• The collection format can still cause some issues
• Performance does not seem to be affected by the use of Docker

Thank you for your attention! Any questions?
