Updateable fields in Lucene and other Codec applications

Andrzej Białecki
Agenda
§ Codec API primer
§ Some interesting Codec applications
- TeeCodec and TeeDirectory
- FilteringCodec
- Single-pass IndexSplitter
§ Field-level updates in Lucene
- Current document-level update design
- Proposed “stacked” design
- Implementation details and status
- Limitations
About the speaker
§ Lucene user since 2003 (1.2-dev…)
§ Created Luke – the Lucene Index Toolbox
§ Apache Nutch, Hadoop, Solr committer, Lucene PMC member, ASF member
§ LucidWorks developer
Codec API
Data encoding and file formats
§ Lucene 3.x and before
- Tuned to pre-defined data types
- Combinations of delta encoding and variable-length byte encodings
- Hardcoded choices – impossible to customize
- Dependencies on specific file-system behaviors (e.g. seek back & overwrite)
- Data coding happened in many places
§ Lucene 4 and onwards
- All data writing and reading abstracted from data encoding (file formats)
- Highly customizable, easy-to-use API
Codec API
§ Codec implementations provide “formats”
- SegmentInfoFormat, PostingsFormat, StoredFieldsFormat, TermVectorsFormat, DocValuesFormat
§ Formats provide consumers (to write to) and producers (to read from)
- FieldsConsumer, TermsConsumer, PostingsConsumer, StoredFieldsWriter / StoredFieldsReader …
§ Consumers and producers offer an item-level API (e.g. to read terms, postings, stored fields, etc.)
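To make this concrete, here is a minimal sketch (assuming the Lucene 4.0 API; the field name and the choice of the Pulsing format are arbitrary) of a Codec that keeps the Lucene40 defaults but swaps in a different PostingsFormat for one field:

import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40Codec;

// Sketch: reuse all Lucene40 formats, but encode the "id" field's
// postings with the Pulsing format (which inlines low-frequency
// postings directly into the term dictionary).
public class PerFieldDemoCodec extends Lucene40Codec {
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    if ("id".equals(field)) {
      return PostingsFormat.forName("Pulsing40");
    }
    return super.getPostingsFormatForField(field);
  }
}

An IndexWriter picks this up via IndexWriterConfig.setCodec(new PerFieldDemoCodec()); reading needs no extra configuration, because the per-field format names are recorded in the segment.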
Codec Coding Craziness!
§ Many new data encoding schemes have been implemented
- Lucene40, Pulsing, Appending
§ Still many more on the way!
- PForDelta, int-block Simple9/16, VSEncoding, Bloom-filtered, etc. …
§ Lucene became an excellent platform for IR research and experimentation
- Easy to implement your own index format
Some interesting Codec applications
TeeCodec
§ Use case:
- A real-time copy of the index, with different data encoding / compression
§ TeeCodec writes the same index data to many locations simultaneously
- Map<Directory,Codec> outputs
- The same fields / terms / postings written to multiple outputs, using possibly different Codec-s
§ TeeDirectory replicates the files not covered by the Codec API (e.g. segments.gen)
FilteringCodec
§ Use case:
- Discard some less useful index data on the fly
§ Simple boolean decisions to pass / skip:
- Stored fields (add / skip / modify field content)
- Indexed fields (all data related to a field, i.e. terms + postings)
- Terms (all postings for a term)
- Postings (some postings for a term)
- Payloads (add / skip / modify payloads for a term's postings)
§ Output: Directory + Codec
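As an illustration, the pass / skip decisions above could be captured in a policy interface like the following sketch (all names here are invented; LUCENE-2632 defines the actual API):

import org.apache.lucene.index.IndexableField;
import org.apache.lucene.util.BytesRef;

// Hypothetical callbacks a FilteringCodec could consult while copying
// index data to its output Codec; names are illustrative only.
public interface FilteringPolicy {
  /** Keep or drop a whole indexed field (terms + postings). */
  boolean keepIndexedField(String field);

  /** Keep or drop all postings of one term. */
  boolean keepTerm(String field, BytesRef term);

  /** Keep or drop a single posting of a term. */
  boolean keepPosting(String field, BytesRef term, int docID);

  /** Return the stored field unchanged, a modified copy, or null to skip it. */
  IndexableField filterStoredField(IndexableField field);
}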
Example: index pruning
§ On-the-fly pruning, i.e. no post-processing
[Diagram: IndexWriter → TeeCodec → Lucene40Codec → SSD → IndexReader, and → FilteringCodec → AppendingCodec → HDFS → IndexReader]
Example: Single-pass IndexSplitter
§ Each FilteringCodec selects a subset of data
- Not necessarily disjoint!
[Diagram: IndexWriter → TeeCodec → FilteringCodec 1 → Lucene40Codec → Directory 1, FilteringCodec 2 → Lucene40Codec → Directory 2, FilteringCodec 3 → Lucene40Codec → Directory 3]
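A sketch of how such a split could be wired up (TeeCodec, FilteringCodec, and the ModuloFilter policy are invented names for this illustration; the real classes live in the LUCENE-2632 patches):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.lucene40.Lucene40Codec;
import org.apache.lucene.store.Directory;

// Hypothetical wiring: one TeeCodec fanning out to N filtered outputs,
// each keeping the documents whose internal ID falls in its share.
public class SplitterSketch {
  static Codec buildSplitter(Directory[] targets) {
    Map<Directory, Codec> outputs = new HashMap<Directory, Codec>();
    for (int part = 0; part < targets.length; part++) {
      // ModuloFilter(n, k): keepPosting(...) returns docID % n == k
      outputs.put(targets[part],
          new FilteringCodec(new Lucene40Codec(), new ModuloFilter(targets.length, part)));
    }
    return new TeeCodec(outputs); // pass to IndexWriterConfig.setCodec(...)
  }
}

Because each FilteringCodec applies its own predicate, the selected subsets may deliberately overlap – e.g. every output could keep a shared field.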
Field-level index updates
Current index update design
§ Document-level “update” is really a “delete + add”
- The old document ID* is hidden via the “liveDocs” bitset
- Term and collection statistics are wrong for a time
- Only a segment merge actually removes the deleted document’s data (stored fields, postings, etc.)
§ And fixes term / collection statistics
- The new document is added to a new segment, with a different ID*

* Internal document ID (segment scope) – an ephemeral int, not preserved in segment merges
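For reference, this is today’s document-level call (real Lucene 4 API; the id field name is arbitrary) – the whole document must be re-supplied and re-analyzed even if only one field changed:

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class WholeDocUpdate {
  // Atomically deletes all documents matching the term and adds the
  // new document – exactly the "delete + add" described above.
  static void replaceDoc(IndexWriter writer, Document fullDoc) throws Exception {
    writer.updateDocument(new Term("id", "42"), fullDoc);
  }
}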
Problems with the current design
§ Updates are document-level
§ Users have to store all fields
§ All indexed fields have to be analyzed again
§ A costly operation for large documents with small, frequent updates
§ Some workarounds exist:
- ParallelReader with a large static index + a small dynamic index – tricky to sync internal IDs!
- ExternalFileField – simple float values, sorted in memory to match doc ID-s
- Application-level join between indexes, or index + DB
Let’s change it
“Stacked” field-level updates
§ Per-field updates, both stored and inverted data
§ Updated field data is “stacked” on top of the old data
§ Old data is “covered” by the updates
§ Paper by Ercegovac, Josifovski, Li et al.:
- “Supporting Sub-Document Updates and Queries in an Inverted Index”, CIKM ’08

[Diagram: updated field values stacked on top of the original values, covering them]
Proposed “stacked” field updates
§ Field updates represented as new documents
- Contain only the updated field values
§ An additional stored field keeps the original doc ID? OR
§ Change & sort the ID-s to match the main segment?
§ Updates are written as separate segments
§ On reading, data from the main and the “stacked” segments is somehow merged on the fly
- Internal ID-s have to be matched for the join
§ The original ID from the main index
§ A re-mapped, or identical, ID from the stacked segment?
- Older data is replaced with the new data from the “stacked” segments
§ Re-use existing APIs when possible
NOTE: work in progress
§ This is a work in progress
§ Very early stage
§ DO NOT expect this to work today – it doesn’t!
- It’s a car frame + a pile of loose parts
Writing “stacked” updates
§ Updates are regular Lucene Document-s
- With an added “original ID” (oid) stored field
- OR re-sorted to match the internal IDs of the main segment?
§ Initial design
- Additional IndexWriter-s / DocumentWriter-s – UpdateWriter-s
- Create regular Lucene segments
§ E.g. using a different namespace (u_0f5 for updates of _0f5)
- Flush needs to be synced with the main IndexWriter
- SegmentInfos modified to record references to the update segments
- Segment merging in the main index closes UpdateWriter-s
§ Convenience methods in IndexWriter
- IW.updateDocument(int n, Document newFields) – see the sketch below
§ End result: additional segment(s) containing updates
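A sketch of how the proposed convenience method might be used (the int-addressed updateDocument overload is the LUCENE-3837 proposal, not a released Lucene API; the field names are invented):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class UpdateSketch {
  // Update two fields of the document whose internal ID is n; all of
  // its other fields stay untouched in the main segment.
  static void updateTwoFields(IndexWriter writer, int n) throws IOException {
    Document newFields = new Document();
    newFields.add(new TextField("title", "updated title", Field.Store.YES));
    newFields.add(new StoredField("price", 9.99f));
    writer.updateDocument(n, newFields); // proposed per-field "stacked" update
  }
}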
… to be continued …
§ Interactions between the UpdateWriter-s and the main IndexWriter
§ Support multiple stacked segments
§ Evaluate strategies
- Map ID-s on reading, OR
- Change & sort ID-s on write
§ Support NRT
Reading “stacked” updates
Combining updates with originals
§ Updates may contain a single field or multiple fields
- Need to keep track of which updated field is where
§ Multiple updates of the same document
- The last update should win
§ ID-s in the updates != ID-s in the main segment!
- Need a mapping structure between internal ID-s
- OR: sort the updates so that the ID-s match
§ ID mapping – costs at retrieval time
§ ID sorting – costs at creation time

* Initial simplification: max. 1 update segment per main segment
Unsorted “stacked” updates
Runtime ID re-mapping
Unsorted updates – ID mismatch
§ Resolve ID-s at runtime:
- Use the stored original ID-s (newID → oldID)
- Invert the relation and sort (oldID → newID)
§ Use a (sparse!) per-field map of oldID → newID for lookup and translation
§ E.g. when iterating over docs, for each ID in the old ID-s:
- Check whether the oldID exists in the updates
- If it exists, translate to the newID and return the newID’s data
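A minimal sketch of such a sparse per-field map (the class and its layout are assumptions for illustration, not code from the branch):

import java.util.HashMap;
import java.util.Map;

// Sparse per-field map: original docID -> docID of its latest update.
// One instance per (field, segment) pair, built when the reader opens.
public class FieldUpdateMap {
  private final Map<Integer, Integer> oldToNew = new HashMap<Integer, Integer>();

  public void put(int oldID, int newID) {
    Integer prev = oldToNew.get(oldID);
    if (prev == null || newID > prev) {
      oldToNew.put(oldID, newID); // later update wins ("last update wins")
    }
  }

  /** @return docID in the "updates" segment, or -1 if this field of oldID was never updated. */
  public int translate(int oldID) {
    Integer newID = oldToNew.get(oldID);
    return newID == null ? -1 : newID;
  }
}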
Stacked stored fields
Original segment
- Any non-inverted fields
- Stored fields, norms or docValues
id   f1      f2
10   abba    c-b
11   b-ad    -b-c
12   ca--d   c-c
13   da-da   b--b

Funny-looking field values? This is just to later illustrate the tokenization – one character becomes one token, and then it becomes one index term.
Stacked stored fields
Original segment + “updates” segment:
- Several versions of a field
- Fields spread over several updates (documents)
- Internal IDs don’t match!
- Store the original ID (oid)

“Updates” segment:
id  oid  f1    f2    f3
0   12   ba-a
1   10   ac    --cb
2   13               -ee
3   13   dab
4   10   ad-c

(The original segment is as on the previous slide.)
Stacked stored fields
ID per-field mapping (in memory?):
- Build a map from the original IDs to the IDs of the updates
- Sort by oid
- One sparse map per field
- The latest field value wins (“last update wins!”)
- Fast lookup needed

ID per-field mapping (oid → update docID):
oid  f1  f2  f3
10   4   1
12   0
13   3       2
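Using the FieldUpdateMap sketch from the previous section, the table above could be built by scanning the “updates” segment in docID order (the helper is invented; ascending docID order makes “last update wins” automatic):

import java.util.HashMap;
import java.util.Map;

public class MappingBuilder {
  static Map<String, FieldUpdateMap> buildExampleMapping() {
    Map<String, FieldUpdateMap> perField = new HashMap<String, FieldUpdateMap>();
    record(perField, "f1", 12, 0); // doc 0: f1 update of oid 12
    record(perField, "f1", 10, 1); // doc 1: f1 + f2 updates of oid 10
    record(perField, "f2", 10, 1);
    record(perField, "f3", 13, 2); // doc 2: f3 update of oid 13
    record(perField, "f1", 13, 3); // doc 3: f1 update of oid 13
    record(perField, "f1", 10, 4); // doc 4: supersedes doc 1 ("discard 1:f1")
    return perField;
  }

  static void record(Map<String, FieldUpdateMap> m, String field, int oid, int updateID) {
    FieldUpdateMap map = m.get(field);
    if (map == null) {
      map = new FieldUpdateMap();
      m.put(field, map);
    }
    map.put(oid, updateID);
  }
}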
Stacked stored fields
Applying the mapping (last update wins! → discard 1:f1) yields the combined “stacked” view:

id   f1    f2    f3
10   ad-c  --cb
11   b-ad  -b-c
12   ba-a  c-c
13   dab   b--b  -ee
Stacked stored fields – lookup
§ Initialize the mapping table from the “updates” segment
- Doc 1’s f1 (the first update of oid 10) is obsolete – discard
§ Get the stored fields for doc 10:
- Check in the mapping table which fields are updated
- Retrieve f1 from doc 4 and f2 from doc 1 in “updates” (NOTE: the major cost of this approach – random seeks!)
- Retrieve any other original fields for doc 10 from the main segment
- Return a combined iterator of field values (see the sketch below)
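A sketch of that combined lookup (FieldUpdateMap is the hypothetical class from earlier; the SegmentDocReader abstraction is likewise invented for illustration):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexableField;

// Hypothetical stored-fields view over a main segment plus one
// "updates" segment, following the scheme described above.
public class StackedStoredFieldsReader {
  interface SegmentDocReader { // invented for this sketch
    Document document(int docID) throws IOException;
    IndexableField field(int docID, String name) throws IOException;
  }

  private final Map<String, FieldUpdateMap> perField; // field -> oid→newID map
  private final SegmentDocReader main, updates;

  StackedStoredFieldsReader(Map<String, FieldUpdateMap> perField,
                            SegmentDocReader main, SegmentDocReader updates) {
    this.perField = perField;
    this.main = main;
    this.updates = updates;
  }

  Document document(int docID) throws IOException {
    Map<String, Integer> updated = new HashMap<String, Integer>();
    for (Map.Entry<String, FieldUpdateMap> e : perField.entrySet()) {
      int newID = e.getValue().translate(docID);
      if (newID >= 0) {
        updated.put(e.getKey(), newID);
      }
    }
    Document result = new Document();
    for (Map.Entry<String, Integer> e : updated.entrySet()) {
      // Random seek into the "updates" segment – the major cost.
      result.add(updates.field(e.getValue(), e.getKey()));
    }
    for (IndexableField f : main.document(docID).getFields()) {
      if (!updated.containsKey(f.name())) {
        result.add(f); // original field, not overridden by any update
      }
    }
    return result;
  }
}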
Stacked inverted fields
§ Inverted fields have:
- Fields
- Term dictionary + term freqs
- Document frequencies
- Positions
- Attributes (offsets, payloads, …)
§ …and norms, but norms are non-inverted == like stored fields
§ Updates should overlay “cells” for each term at <field,term,doc>
- Positions, attributes
- Discard all old data from the cell

Original segment:
- 10. f1: abba, f2: c-b
- 11. f1: b-ad, f2: -b-c
- 12. f1: ca--d, f2: c-c
- 13. f1: da-da, f2: b--b

[Diagram: the inverted view of the original segment – per-field postings, term → (docID: positions), for f1 and f2]
Stacked inverted fields
“Updates” segment – documents containing updates of inverted fields:
- 0. f1: ba-a (oid: 12)
- 1. f1: ac, f2: --cb (oid: 10)
- 2. f3: -ee (oid: 13)
- 3. f1: dab (oid: 13)
- 4. f1: ad-c (oid: 10)

[Diagram: postings of the original segment (f1, f2) alongside postings of the “updates” segment (f1, f2, f3)]
Stacked inverted fields
§ ID mapping table:
- The same sparse table!
- Take the latest postings at the new doc ID
- Ignore the original postings at the original doc ID

[Diagram: original and “updates” postings joined through the per-field ID mapping; last update wins! → discard 1:f1]
Stacked inverted fields
[Diagram: the resulting combined view – postings for f1, f2, and f3 after overlaying the “updates” segment on the original via the mapping]
Stacked inverted fields – lookup
§ TermsEnum and DocsEnum need a merged list of terms and a merged list of ID-s per term
§ Re-use the mapping table for the “updates” segment
§ Iterate over the posting list of “f1:a”:
- Check both lists!
- ID 10: present in the mappings → discard the original in-doc postings and retrieve the new postings from <f1,a,doc4> in “updates” (NOTE: major cost – random seek!)
- ID not present in the mappings → return the original in-doc postings
- Advance to the next doc ID
[Diagram: merging the “f1:a” posting lists of the original and “updates” segments through the mapping table]
Implementation details
§ SegmentInfos extended to keep the names of “stacked” segments
- “Stacked” segments live in a different namespace
§ Stacked Codec *Producers that combine & remap data
§ SegmentReader / SegmentCoreReaders modified to:
- Check for and open a “stacked” SegmentReader
- Read and construct the ID mapping table
- Create stacked Codec *Producers initialized with:
§ The original format *Producers
§ The stacked format *Producers
§ The ID mapping table
Merged fields
§ Field lists merge easily
- Trivial, very little data to cache & merge
§ StoredFieldsProducer merges easily
§ However, the TermsEnum and DocsEnum enumerators need more complex handling …
Leapfrog enumerators
§ Terms and postings have to be merged
- But we don’t want to fully read all the data!
§ Use “leapfrog” enumeration instead (see the sketch below)
- INIT: advance both the main and the stacked enum
- Return from the smaller, and keep advancing & returning from the smaller until it reaches (or exceeds) the current value from the larger
- If the values are equal, then merge the data – again, in a leapfrog fashion – and advance both
§ Similar to MultiTermsEnum, but simpler
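A minimal, self-contained sketch of the leapfrog idea on two sorted docID lists (plain Java, not Lucene’s enumerator classes; on equal IDs the stacked side wins, mirroring “last update wins”):

import java.util.ArrayList;
import java.util.List;

public class LeapfrogMerge {
  /** Merge two ascending docID lists; ties resolve in favor of "stacked". */
  static List<Integer> merge(int[] main, int[] stacked) {
    List<Integer> out = new ArrayList<Integer>();
    int i = 0, j = 0;
    while (i < main.length && j < stacked.length) {
      if (main[i] < stacked[j]) {
        out.add(main[i++]);      // only in main: keep the original data
      } else if (main[i] > stacked[j]) {
        out.add(stacked[j++]);   // only in stacked: new data
      } else {
        out.add(stacked[j++]);   // equal: stacked data replaces the original
        i++;
      }
    }
    while (i < main.length) out.add(main[i++]);       // drain the remainders
    while (j < stacked.length) out.add(stacked[j++]);
    return out;
  }
}

A full implementation would run the same advance-the-smaller loop lazily over TermsEnum / DocsEnum instead of materializing a list.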
Segment merging
§ Merging segments with “stacked” updates is trivial, because …
- All Codec enumerators already present a unified view of the data!
§ Just delete both the main and the “stacked” segment after the merge is completed
- The updates are already rolled into the new segment
Limitations
§ Search-time costs
- The mapping table consumes memory
- Overheads of merging postings and field values
- Many random seeks in “stacked” segments due to the oldID → newID translation
§ Trade-offs
- Performance impact is minimized if this data is completely in memory → fast seeks
- Memory consumption is minimized if this data is on disk → slow seeks
- Conclusion: the size of the updates should be kept small
§ Difficult to implement Near-Real-Time updates?
- The mapping table needs incremental updates, not full rebuilds
… to be continued …
§ Evaluate the cost of runtime re-mapping of ID-s and random seeking
§ Extend the design to support multi-segment stacks
§ Handle deletion of fields
Current status
§ LUCENE-3837
§ Branch in Subversion – lucene3837
§ Very early stage – experiments
§ Initial code for StackedCodec formats and SegmentReader modifications
§ Help needed!
Summary & QA
§ Codec API in Lucene 4
§ Some Codec applications: tee, filtering, splitting
- http://issues.apache.org/jira/browse/LUCENE-2632
§ Field-level index updates
- “Stacked” design, using adjacent segments
- ID mapping table
§ Help needed!
- http://issues.apache.org/jira/browse/LUCENE-3837
§ More questions?
Bonus slides
TeeDirectory
§ Makes literal copies of Directory data
- As it’s being created, byte by byte
§ Simple API:
Directory out = new TeeDirectory(main, others…);
§ Can exclude some files from copying, by prefix
- E.g. “_0” – exclude all files of segment _0
§ Can perform an initial sync
- Bulk copy from the existing main directory to the copies
§ Mirroring on the fly – more fine-grained than commit-based replication
- Quicker convergence of the copies with the main dir
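A hypothetical usage example following the API line above (TeeDirectory is from this same body of work, not core Lucene; the paths are invented):

import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TeeDirectoryDemo {
  static Directory openMirrored() throws Exception {
    Directory main = FSDirectory.open(new File("/data/index"));
    Directory backup = FSDirectory.open(new File("/mnt/replica/index"));
    // Every file created in "main" is simultaneously written to "backup".
    return new TeeDirectory(main, backup);
  }
}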
Sorted “stacked” updates
Changing and syncing ID-s on each update (briefly)
Sorted updates
§ Essentially the ParallelReader approach
- Requires synchronized ID-s between segments
- Some data structures need “fillers” for absent ID-s
§ Updates arrive out of order
- Updates initially get unsynced ID-s
§ On flush of the segment with updates:
- Multiple updates have to be collapsed into single documents
- ID-s have to be remapped
- The “updates” segment has to be re-written
§ LUCENE-2482 IndexSorter – a possible implementation
Reading sorted updates
§ A variant of ParallelReader
- If data is present both in the main and in the secondary indexes, return the secondary data and drop the main data
§ Nearly no loss of performance or memory!
§ But requires re-building and sorting (a rewrite) of the secondary segment on every update
§ LUCENE-3837 uses the “unsorted” design, with the ID mapping table and runtime re-mapping