A Time Machine for Text Search

  1. A Time Machine for Text Search Klaus Berberich, Srikanta Bedathur, Thomas Neumann, Gerhard Weikum Max-Planck Institute for Informatics, Saarbrücken, Germany

  2. Motivation
     • Historical information needs, e.g., contemporary (~2001) articles about the movie “Harry Potter and the Sorcerer’s Stone”
     • Relevant pages have disappeared but are preserved by Web archives (e.g., archive.org)
     • Search over Web archives is limited and ignores the time-axis
     Klaus Berberich – A Time Machine for Text Search

  7. Motivation
     • Time-Travel Text Search extends keyword querying by a time-point of interest t
       “harry potter” @ 2001/11/14
     • Other temporally versioned text collections
       • Wikis
       • Repositories (e.g., controlled by CVS, Subversion)
       • Your Desktop

  8. Outline
     • Motivation
     • Collection, Query, and Relevance Model
     • Time-Travel Inverted File Index
     • Reducing Index Size
     • Tuning Index Performance
     • Experimental Evaluation
     • Conclusions

  9. Collection Model
     • Document d is a sequence of time-stamped versions
     • Version is a vector of searchable terms
     • Document deletion results in tombstone version ⊥
     • Discrete time, timestamps are non-negative
     • State of document collection as of time t
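
The collection model above can be illustrated with a small sketch. The slide's formula for the collection state did not survive, so everything here is an assumption for illustration: the names `Version`, `version_at`, and `collection_state` are hypothetical, a version is modeled as a term-to-frequency dictionary, and the tombstone ⊥ is modeled as `None`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Version:
    timestamp: int                    # discrete, non-negative creation time
    terms: Optional[Dict[str, int]]   # term -> frequency; None models the tombstone


def version_at(versions: List[Version], t: int) -> Optional[Version]:
    """Return the version of one document that is valid at time t, if any.

    Assumes `versions` is sorted by ascending timestamp.
    """
    current = None
    for v in versions:
        if v.timestamp <= t:
            current = v
        else:
            break
    # No version yet, or the latest one is a tombstone: document absent at t.
    if current is None or current.terms is None:
        return None
    return current


def collection_state(docs: Dict[str, List[Version]], t: int) -> Dict[str, Version]:
    """Return the versions that make up the collection as of time t."""
    state = {}
    for doc_id, versions in docs.items():
        v = version_at(versions, t)
        if v is not None:
            state[doc_id] = v
    return state


# Made-up example data: d1 is deleted at time 5, d8 exists during [7, 9).
docs = {
    "d1": [Version(1, {"harry": 2}),
           Version(2, {"harry": 1, "potter": 1}),
           Version(5, None)],
    "d8": [Version(7, {"harry": 3}), Version(9, None)],
}
```

In this sketch each version is implicitly valid from its own timestamp until the next version of the same document appears, which is one plausible reading of "sequence of time-stamped versions".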

  10. Query Model
      • Time-travel query q^t consists of
        • keyword part q (i.e., a set of query terms)
        • time-point of interest t
      • Time-travel query q^t is evaluated over D^t so that only versions that existed at time t are considered

  11. Relevance Model
      • We adapt Okapi BM25 as a relevance model
      • Term-frequency score (TF)
      • Inverse document-frequency score (IDF)
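
The TF and IDF formulas on this slide did not survive extraction. As a rough illustration only, the sketch below uses the textbook Okapi BM25 form and assumes that the adaptation consists of taking the collection statistics (number of documents, document frequency, average document length) as of the query time t. The function names and the defaults `k1=1.2`, `b=0.75` are conventional choices, not values taken from the slides.

```python
import math
from typing import Dict, Iterable


def w_tf(tf: int, doc_len: float, avg_doc_len: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 term-frequency score of a term in one document version."""
    norm = (1 - b) + b * doc_len / avg_doc_len   # length normalization
    return (k1 + 1) * tf / (k1 * norm + tf)


def w_idf(num_docs_at_t: int, doc_freq_at_t: int) -> float:
    """BM25 inverse document-frequency score, using statistics as of time t."""
    return math.log((num_docs_at_t - doc_freq_at_t + 0.5) / (doc_freq_at_t + 0.5))


def score(query_terms: Iterable[str], term_freqs: Dict[str, int],
          doc_len: float, avg_doc_len: float,
          num_docs_at_t: int, doc_freqs_at_t: Dict[str, int]) -> float:
    """Sum of TF * IDF over the query terms that occur in the version."""
    total = 0.0
    for term in query_terms:
        tf = term_freqs.get(term, 0)
        if tf == 0:
            continue  # term absent from this version contributes nothing
        total += (w_tf(tf, doc_len, avg_doc_len)
                  * w_idf(num_docs_at_t, doc_freqs_at_t[term]))
    return total
```

Splitting the score into a per-version TF part and a per-time IDF part matches the slide's separate TF and IDF bullets, but the exact formulas used by the authors may differ from this sketch.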

  17. Time-Travel Inverted File Index
      • Idea: Transparently extend “IR’s workhorse” so that the existing wealth of extensions remains applicable
      • We extend postings by a validity time-interval
      [Figure: IDF and TF B+-trees keyed by term (“harry”, “potter”); the TF index list for “harry” holds the postings ( d1, 11.2, [t1, t2) ), ( d8, 10.9, [t7, t9) ), ( d1, 10.6, [t2, t5) )]

  20. Time-Travel Inverted File Index
      • Time-travel query q^t can be processed by scanning index lists while ignoring non-relevant postings
      • Example: “harry”@t8
      [Figure: IDF and TF B+-trees keyed by term (“harry”, “potter”); the TF index list for “harry” is scanned over the postings ( d1, 11.2, [t1, t2) ), ( d8, 10.9, [t7, t9) ), ( d1, 10.6, [t2, t5) ); w_idf(“harry”, t8) = 3.08]
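
The “harry”@t8 example can be traced with a minimal sketch. The postings and the IDF value 3.08 come from the slide; the names `Posting` and `time_travel_scan`, the mapping of t1..t9 to the integers 1..9, and the combination of TF and IDF by multiplication are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Posting(NamedTuple):
    doc_id: str
    tf_score: float
    begin: int   # start of validity interval, inclusive
    end: int     # end of validity interval, exclusive


def time_travel_scan(index_list: List[Posting], t: int,
                     idf: float) -> List[Tuple[str, float]]:
    """Scan one index list and keep only postings valid at time t."""
    results = []
    for p in index_list:
        if p.begin <= t < p.end:          # half-open interval [begin, end)
            results.append((p.doc_id, p.tf_score * idf))
    # Highest score first.
    results.sort(key=lambda r: r[1], reverse=True)
    return results


# Index list for "harry" as shown on the slide, with t_i mapped to i.
harry = [
    Posting("d1", 11.2, 1, 2),
    Posting("d8", 10.9, 7, 9),
    Posting("d1", 10.6, 2, 5),
]

hits = time_travel_scan(harry, t=8, idf=3.08)
# Only the d8 posting has a validity interval containing t8.
```

The scan still reads every posting in the list, including those that are not valid at t, which is why the list length matters for query cost.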

  30. Reducing Index Size
      • Shortcoming: Since we create one posting per version per term, the resulting index is very large (Wikipedia Revision History: ~8.6B postings)
      [Figure: TF B+-tree; the index list for “harry” with the postings ( d1, 11.2, [t1, t2) ), ( d8, 10.9, [t7, t9) ), ( d1, 10.6, [t2, t5) ), labeled “HUGE!!!”]
