  1. Accelerating Document Retrieval and Ranking for Cognitive Applications Presenters: Tim Kaldewey – Performance Architect, David Wendt – Performance Engineer

  2. Disclaimer The author's views expressed in this presentation do not necessarily reflect the views of IBM.

  3. Watson evolution *http://www-03.ibm.com/software/businesscasestudies/us/en/corp?synkey=Y362451T34615G34

  4. Watson evolution 40x* *http://www-03.ibm.com/software/businesscasestudies/us/en/corp?synkey=Y362451T34615G34

  5. A “brainwave” for answering a question (timeline chart; x-axis: Time [ms])

  6. Background • Querying unstructured data (text) to identify relevant documents is a prerequisite for many cognitive data processing (NLP) tasks • The large number of queries and the volume of unstructured data require a highly performant mechanism. Example: - a Lucene index of Wikipedia (5 million docs) is 105 GB - an average search comprises 7 terms (keywords) - on average 115 thousand documents are scored per search • Scoring of candidate documents and passages is highly parallelizable ➔ Acceleration can be leveraged to improve response time and/or enable more complex queries to improve accuracy

  7. Document Search Question: “This provincial government of Canada is officially known as the government of Newfoundland and what region?” The index is organized in term-document format. • Retrieve the documents that are most likely to have the answer(s) to the question • Search for documents that contain the words from the question • Rank the documents based on – how frequently the words and word combinations appear in each document – the distance between these words in those documents

  8. Anatomy of a Lucene Query Turn the question text into a Lucene query to retrieve relevant documents. Question: “This provincial government of Canada is officially known as the government of Newfoundland and what region?” Query: +canada +newfoundland +provinci +govern +offici +known^0.5 +region "provinci govern"~2 "govern canada"~2 "offici known"~2^0.9 "known govern"~2 "govern newfoundland"~2 "offici region"~3 • Words are stemmed and some stop words (the, of, as, …) are removed. • Key words become term clauses: canada newfoundland provinci govern offici … – Scores are computed based on term frequency. • Word pairs (phrases) become span clauses: "provinci govern"~2 … – Scores are computed based on the frequency of the phrase and the word distance between its terms. • Complex queries (e.g. nested span clauses) can improve accuracy by scoring more relevant documents higher.

  9. Scoring term clauses • Lucene is very efficient, making only one pass to match and score • Index format is optimized for speed in matching terms to documents • For each document, score each term clause and then sum the scores • Scorer takes three values: – Term frequency – Document length – Term probability
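
A minimal sketch of what such a per-document term-clause scorer could look like, combining the three inputs the slide names. The TF-IDF-style formula and the function name are assumptions for illustration, not the presenters' actual scorer:

```cuda
#include <cmath>

// Hypothetical per-document term-clause scorer (illustrative formula, not the exact one used).
// It combines the three values mentioned on the slide: term frequency, document length,
// and the term's overall probability in the corpus.
__host__ __device__ float score_term_clause(int term_freq,    // occurrences of the term in this document
                                            int doc_length,   // number of terms in this document
                                            float term_prob)  // term's probability over the whole corpus
{
    if (term_freq == 0) return 0.0f;
    float tf   = sqrtf((float)term_freq);           // dampened term frequency
    float idf  = logf(1.0f / (term_prob + 1e-9f));  // rarer terms weigh more
    float norm = 1.0f / sqrtf((float)doc_length);   // normalize long documents down
    return tf * idf * norm;
}
```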

  10. Scoring span clauses "provinci govern"~2 "govern canada"~2 "offici known"~2 "known govern"~2 "govern newfoundland"~2 "offici region"~3 Scoring here uses a ‘sloppy’ frequency value calculated from how often the term pair appears and how close together the terms are. Clause form: span(term1,term2,slop,order) Example: span(provinci,govern,2,false)

  11. Scoring span clauses – continued span(provinci,govern,2,false) • Position vectors vary in length per term per document.
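
As a rough sketch, matching a two-term span over the per-document position vectors could look like the following. The 1/(distance+1) contribution mirrors Lucene's classic sloppy-frequency idea, but the exact formula and loop structure here are assumptions, not the kernel actually used:

```cuda
// Sketch of matching span(term1, term2, slop, order) within one document.
// pos1/pos2 are the (sorted) position vectors of the two terms in that document.
__host__ __device__ inline float sloppy_freq(const int* pos1, int n1,
                                             const int* pos2, int n2,
                                             int slop, bool ordered)
{
    float freq = 0.0f;
    for (int i = 0; i < n1; ++i) {
        for (int j = 0; j < n2; ++j) {
            int dist = pos2[j] - pos1[i];
            if (ordered && dist < 0) continue;      // ordered spans require term1 before term2
            if (dist < 0) dist = -dist;
            if (dist - 1 <= slop)                   // terms close enough to form a span
                freq += 1.0f / (float)(dist + 1);   // closer pairs contribute more
        }
    }
    return freq;
}
```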

  12. Analysis • Scoring for each document is independent from other documents • At the end, scores are sorted to provide the document rank order

  13. Perfect for GPU • Floating point operations for thousands of items (documents) that can occur in parallel • Each query clause is implemented as a set of kernels and the scores accumulate in a float array where each element is the score for a unique document • The top N ranked document ids are returned to the host application
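
A sketch of the "one thread per candidate document" pattern the slide describes, accumulating into a per-document score array. The flat data layout and the reuse of the score_term_clause sketch above are assumptions for illustration:

```cuda
// Term-clause scorer sketched earlier (assumed visible in this file).
__host__ __device__ float score_term_clause(int term_freq, int doc_length, float term_prob);

// One thread per candidate document; each thread sums its document's term-clause
// scores into doc_scores[doc]. Span-clause kernels would add into the same array.
__global__ void score_term_clauses(const int*   term_freqs,   // [num_docs * num_terms]
                                   const int*   doc_lengths,  // [num_docs]
                                   const float* term_probs,   // [num_terms]
                                   int num_docs, int num_terms,
                                   float*       doc_scores)   // [num_docs], accumulated in place
{
    int doc = blockIdx.x * blockDim.x + threadIdx.x;
    if (doc >= num_docs) return;

    float score = 0.0f;
    for (int t = 0; t < num_terms; ++t)             // all term clauses in a single kernel (loop)
        score += score_term_clause(term_freqs[doc * num_terms + t],
                                   doc_lengths[doc], term_probs[t]);
    doc_scores[doc] += score;
}
```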

  14. Scoring on the GPU • We used the Thrust library for sorting and intersecting to more easily include a CPU-only alternative • All term clauses are scored first and can be calculated in a single kernel (loop) • Spans are computed to maximize caching of term position values • Once scored, the results are sorted and the top N document ids are returned along with their scores Only 5 custom kernels were required.
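
The final sort-and-select step might look roughly like this with Thrust; the function name and the surrounding data flow are illustrative, not the presenters' actual code:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <algorithm>
#include <vector>

// Sort document ids by score (highest first) and return the top N ids to the host,
// as described on the slide. Scores are reordered in place alongside the ids.
std::vector<int> top_n_documents(thrust::device_vector<float>& doc_scores, int n)
{
    const int num_docs = static_cast<int>(doc_scores.size());
    thrust::device_vector<int> doc_ids(num_docs);
    thrust::sequence(doc_ids.begin(), doc_ids.end());   // ids 0, 1, 2, ...

    // Key-value sort: keys are scores, values are document ids, descending by score.
    thrust::sort_by_key(doc_scores.begin(), doc_scores.end(),
                        doc_ids.begin(), thrust::greater<float>());

    std::vector<int> top(std::min(n, num_docs));
    thrust::copy_n(doc_ids.begin(), top.size(), top.begin());  // copy the top-N ids back to the host
    return top;
}
```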

  15. Results

  16. Making it Real • Accessing the index data: ids, frequencies, positions • Managing GPU access • Recursion for nested clauses • Scoring special cases • Coverage of query types

  17. Shared index data • First approach was to create a custom index with only the values we needed for scoring. • Sharing the index with the rest of Lucene would be ideal but how much would this cost us?

  18. Shared index data - results

  19. Managing GPU access • Need to handle simultaneous queries from many host threads • A dedicated set of streams – one per host thread – to handle each query • Limited the number of streams based on the available GPU memory and index size • Once the GPU is fully utilized, additional host threads can be blocked or can fall back to calling Lucene directly
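
A sketch of the "one stream per host query thread" idea, with the pool size chosen from the available GPU memory and index size; the class and its acquire/release interface are assumptions for illustration:

```cuda
#include <cuda_runtime.h>
#include <mutex>
#include <vector>

// Each concurrent query gets its own CUDA stream. When no stream is free the GPU is
// considered saturated and the caller can block or fall back to scoring with Lucene on the CPU.
class StreamPool {
public:
    explicit StreamPool(int max_streams) {        // max_streams derived from GPU memory / index size
        for (int i = 0; i < max_streams; ++i) {
            cudaStream_t s;
            cudaStreamCreate(&s);
            free_.push_back(s);
        }
    }
    cudaStream_t try_acquire() {                  // nullptr => no free stream, use the CPU fallback
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return nullptr;
        cudaStream_t s = free_.back();
        free_.pop_back();
        return s;
    }
    void release(cudaStream_t s) {                // return the stream once the query finishes
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(s);
    }
private:
    std::vector<cudaStream_t> free_;
    std::mutex mutex_;
};
```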

  20. Recursion for nested spans • Although CUDA supports recursion, the unknown stack size becomes an issue. • Implemented the recursion as loops with a fake stack managed in global memory
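
A simplified sketch of turning the recursive walk over nested span clauses into a loop with an explicitly managed stack (shown as a small fixed-size array for brevity; the slide places the stack in global memory). The clause-tree layout, the score_leaf helper, and the plain summing of child scores are assumptions for illustration:

```cuda
// Hypothetical clause-tree node: children stored as a contiguous index range.
struct ClauseNode {
    int first_child;    // index of the first child clause, or -1 for a leaf (term clause)
    int num_children;   // per-clause parameters (slop, order, boost, ...) omitted here
};

__device__ float score_leaf(const ClauseNode& node, int doc);  // assumed helper for leaf clauses

__device__ float score_nested(const ClauseNode* nodes, int root, int doc)
{
    const int MAX_STACK = 64;            // bounded explicit stack instead of device recursion
    int stack[MAX_STACK];
    int top = 0;
    float score = 0.0f;

    stack[top++] = root;
    while (top > 0) {
        const ClauseNode node = nodes[stack[--top]];            // pop the next clause to evaluate
        if (node.first_child < 0) {
            score += score_leaf(node, doc);                     // leaf: score directly
        } else {
            for (int c = 0; c < node.num_children && top < MAX_STACK; ++c)
                stack[top++] = node.first_child + c;            // push children instead of recursing
        }
    }
    return score;
}
```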

  21. Query Types vs Coverage • Query types are unique combinations of search clauses: terms, spanNear, spanOr, nested spans, etc. • Coverage progression is from most common clause type to least common.

  22. Scoring span clauses has special cases • There are some special cases like when phrases overlap.

  23. Conclusion • Speed up by half an order of magnitude • Many challenges: shared index, query types, recursion, … • GPU performance is even higher for complex queries – Words appearing in many documents, requiring more threads – Complex span clauses with many position values • Speeding up the query allows building more complex queries and scoring documents better, which may help improve accuracy

  24. Questions?
