Accelerating Document Retrieval and Ranking for Cognitive Applications - PowerPoint PPT Presentation



SLIDE 1

Accelerating Document Retrieval and Ranking for Cognitive Applications

Presenters:

Tim Kaldewey – Performance Architect
David Wendt – Performance Engineer

SLIDE 2

Disclaimer

The authors' views expressed in this presentation do not necessarily reflect the views of IBM.

SLIDE 3

Watson evolution

*http://www-03.ibm.com/software/businesscasestudies/us/en/corp?synkey=Y362451T34615G34

SLIDE 4

Watson evolution

40x*

*http://www-03.ibm.com/software/businesscasestudies/us/en/corp?synkey=Y362451T34615G34

SLIDE 5

A “brainwave” for answering a question

[Timeline chart; axis: Time [ms]]

SLIDE 6

Background

  • Querying unstructured data (text) to identify relevant documents is a prerequisite for many cognitive data processing tasks (NLP)
  • The large number of queries and the volume of unstructured data require a highly performant mechanism

Example: a Lucene index of Wikipedia (5 million docs) is 105GB

  • The average search comprises 7 terms (keywords)
  • On average, 115 thousand documents are scored per search
  • Scoring of candidate documents and passages is highly parallelizable

➔ Acceleration can be leveraged to improve response time and/or enable more complex queries to improve accuracy

SLIDE 7

Document Search

  • Retrieve the documents that are most likely to have the answer(s) to the question
  • Search for documents that contain the words from the question
  • Rank the documents based on
    – How frequently the words and word combinations appear in each document
    – The distance between these words in those documents

Example question: This provincial government of Canada is officially known as the government of Newfoundland and what region?

The index is organized in term-document format.

SLIDE 8

Anatomy of a Lucene Query

  • Words are stemmed and some stop words (the, of, as, …) are removed.
  • Keywords become term clauses: canada newfoundland provinci govern offici …
    – Scores are computed based on term frequency.
  • Word pairs (phrases) become span clauses: "provinci govern"~2 …
    – Scores are computed based on the frequency of the phrase and the distance between its words.
  • Complex queries (e.g. nested span clauses) can improve accuracy by scoring more relevant documents higher.

Example question: This provincial government of Canada is officially known as the government of Newfoundland and what region?

+canada +newfoundland +provinci +govern +offici +known^0.5 +region "provinci govern"~2 "govern canada"~2 "offici known"~2^0.9 "known govern"~2 "govern newfoundland"~2 "offici region"~3

Turn text into a Lucene query to retrieve relevant documents.
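As a sketch of this transformation (hedged: this is not Lucene's actual analyzer — the stop-word list and the stemmer below are toy stand-ins, and the real query on this slide also carries boosts such as ^0.5 and per-phrase slop values):

```python
# Toy stop-word list for the sketch; Lucene ships much larger ones.
STOP_WORDS = {"this", "of", "is", "as", "the", "and", "what", "to", "a"}

def stem(word):
    # Stand-in for a real stemmer (Lucene uses Porter-style stemming,
    # e.g. "provincial" -> "provinci", "government" -> "govern").
    for suffix in ("ally", "ment", "al"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def build_query(question):
    words = [w.strip("?.,").lower() for w in question.split()]
    keywords = [stem(w) for w in words if w not in STOP_WORDS]
    # Deduplicated keywords become required term clauses.
    term_clauses = ["+" + t for t in dict.fromkeys(keywords)]
    # Adjacent keyword pairs become span clauses with a slop of 2.
    span_clauses = ['"%s %s"~2' % (a, b) for a, b in zip(keywords, keywords[1:])]
    return term_clauses, span_clauses
```

Running `build_query` on the example question yields required terms such as `+provinci` and `+govern` plus span clauses such as `"provinci govern"~2`, mirroring the shape (though not every detail) of the slide's query.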

SLIDE 9

Scoring term clauses

  • Lucene is very efficient, making only one pass to match and score
  • The index format is optimized for speed in matching terms to documents
  • For each document, score each term clause and then sum the scores
  • The scorer takes three values:
    – Term frequency
    – Document length
    – Term probability
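The slide names the scorer's three inputs but not its formula, so the following is one plausible shape for such a scorer, not Lucene's exact similarity: term frequency is dampened sublinearly, rarer terms (lower corpus probability) weigh more, and longer documents are normalized down.

```python
import math

def term_score(term_freq, doc_length, term_prob):
    # Hypothetical term-clause scorer built from the three slide inputs.
    tf = math.sqrt(term_freq)                  # sublinear term frequency
    rarity = math.log(1.0 / term_prob)         # rarer term -> larger weight
    length_norm = 1.0 / math.sqrt(doc_length)  # long-document normalization
    return tf * rarity * length_norm

def document_score(clause_values):
    # "For each document, score each term clause and then sum the scores."
    return sum(term_score(tf, dl, p) for tf, dl, p in clause_values)
```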

SLIDE 10

Scoring span clauses

Scoring here uses a ‘sloppy’ frequency value, calculated from how often the term pair appears and how close together the terms are.

Clause form: span(term1,term2,slop,order)
Example: span(provinci,govern,2,false)

"provinci govern"~2 "govern canada"~2 "offici known"~2 "known govern"~2 "govern newfoundland"~2 "offici region"~3

SLIDE 11

Scoring span clauses – continued

  • Position vectors vary in length per term per document.

span(provinci,govern,2,false)
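A minimal sketch of computing the ‘sloppy’ frequency from the two terms' position vectors (hedged: Lucene's real span matcher iterates over sorted position streams rather than using a nested loop, and its exact decay function may differ):

```python
def sloppy_freq(pos1, pos2, slop, ordered=False):
    # pos1, pos2: word positions of the two terms within one document.
    # Each occurrence pair within `slop` intervening words contributes
    # a weight that decays with distance, so closer pairs count more.
    total = 0.0
    for p1 in pos1:
        for p2 in pos2:
            if ordered and p2 <= p1:
                continue  # respect the `order` flag of the span clause
            between = abs(p2 - p1) - 1  # words between the two terms
            if 0 <= between <= slop:
                total += 1.0 / (between + 1)
    return total
```

For span(provinci,govern,2,false): adjacent occurrences contribute 1.0 each, while a pair separated by two words contributes only 1/3.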

SLIDE 12

Analysis

  • Scoring for each document is independent from other documents
  • At the end, scores are sorted to provide the document rank order
SLIDE 13

Perfect for GPU

  • Floating point operations for thousands of items (documents) that can occur in parallel
  • Each query clause is implemented as a set of kernels, and the scores accumulate in a float array where each element is the score for a unique document
  • The top N ranked document ids are returned to the host application
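The accumulation pattern can be sketched on the CPU as follows, assuming (as an illustration, not the talk's actual data layout) that each clause produces (doc_id, partial_score) pairs; on the GPU these additions happen in parallel, one thread per document:

```python
def score_query(num_docs, clauses):
    # scores[i] accumulates the score of document i -- the single float
    # array the slide describes, shared by all clause "kernels".
    scores = [0.0] * num_docs
    for clause in clauses:
        for doc_id, partial in clause:
            scores[doc_id] += partial
    return scores
```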

SLIDE 14

Scoring on the GPU

  • We used the Thrust libraries for sorting and intersecting, which made it easy to include a CPU-only alternative
  • All term clauses are scored first and can be calculated in a single kernel (loop)
  • Spans are computed so as to maximize caching of term position values
  • Once scored, the results are sorted and the top N document ids are returned along with their scores

Only 5 custom kernels were required.
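The final ranking step might look like this CPU sketch, where Python's sort stands in for `thrust::sort_by_key` (or the CPU-only alternative the slide mentions):

```python
def top_n(scores, n):
    # Pair each score with its document id, sort descending by score,
    # and keep the best n -- sort-by-key followed by truncation.
    ranked = sorted(enumerate(scores), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]  # [(doc_id, score), ...], best first
```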

SLIDE 15

Results

SLIDE 16

Making it Real

  • Accessing the index data: ids, frequencies, positions
  • Managing GPU access
  • Recursion for nested clauses
  • Scoring special cases
  • Coverage of query types
SLIDE 17

Shared index data

  • First approach was to create a custom index with only the values we needed for scoring.
  • Sharing the index with the rest of Lucene would be ideal but how much would this cost us?
SLIDE 18

Shared index data - results

SLIDE 19

Managing GPU access

  • Need to handle simultaneous queries from many host threads
  • A dedicated set of streams – one per host thread – handles each query
  • The number of streams is limited based on the available GPU memory and index size
  • Once the GPU is fully utilized, additional host threads can be blocked or can fall back to calling Lucene directly
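The admission policy above can be sketched with a counting semaphore; the `gpu_fn` / `cpu_fallback` callables are hypothetical names for illustration, not from the talk:

```python
import threading

class StreamPool:
    # At most `max_streams` queries run on the GPU at once; extra host
    # threads fall back to the CPU path, as the slide describes.
    def __init__(self, max_streams):
        self._slots = threading.Semaphore(max_streams)

    def run(self, gpu_fn, cpu_fallback):
        if self._slots.acquire(blocking=False):
            try:
                return gpu_fn()       # got a stream slot
            finally:
                self._slots.release()
        return cpu_fallback()         # GPU busy: call Lucene directly
```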

SLIDE 20

Recursion for nested spans

  • Although CUDA supports recursion, the unknown stack size becomes an issue.
  • Implemented the recursion as loops, managing a fake stack in global memory.
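A minimal illustration of the loop-plus-explicit-stack pattern, assuming a toy clause tree and a toy score combiner (the real implementation keeps this stack in GPU global memory and uses the actual span scorers):

```python
def evaluate(clause):
    # clause is ("term", score) or ("span", left_clause, right_clause).
    stack = [(clause, False)]   # explicit stack replacing recursion
    results = []                # value stack for evaluated sub-clauses
    while stack:
        node, children_done = stack.pop()
        if node[0] == "term":
            results.append(node[1])
        elif not children_done:
            # Revisit this span node after both children are evaluated.
            stack.append((node, True))
            stack.append((node[2], False))
            stack.append((node[1], False))
        else:
            right = results.pop()
            left = results.pop()
            results.append(left + right)  # toy combiner for the sketch
    return results[0]
```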
SLIDE 21

Query Types vs Coverage

  • Query types are unique combinations of search clauses: terms, spanNear, spanOr, nested spans, etc.
  • Coverage progressed from the most common clause type to the least common.

SLIDE 22

Scoring span clauses has special cases

  • There are some special cases like when phrases overlap.
SLIDE 23

Conclusion

  • Speed-up of half an order of magnitude
  • Many challenges: shared index, query types, recursion, …
  • GPU performance is even higher for complex queries
    – Words resulting in many documents, requiring more threads
    – Complex span clauses with many position values
  • Speeding up queries allows building more complex queries and scoring documents better, which may help improve accuracy

SLIDE 24

Questions?