Vector Space Scoring: Introduction to Information Retrieval, INF 141

Donald J. Patterson. Content adapted from Hinrich Schütze, http://www.informationretrieval.org


  1. Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org

  2. Querying Corpus-wide statistics

  4. Querying Corpus-wide statistics • Collection Frequency, cf • Define: The total number of occurrences of the term in the entire corpus • Document Frequency, df • Define: The total number of documents in the corpus that contain the term
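
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `corpus_statistics` and the toy corpus are assumptions, and the documents are taken to be already tokenized.

```python
from collections import Counter

def corpus_statistics(docs):
    """Return (cf, df) Counters for a list of tokenized documents."""
    cf = Counter()  # collection frequency: total occurrences across the corpus
    df = Counter()  # document frequency: number of documents containing the term
    for tokens in docs:
        cf.update(tokens)
        df.update(set(tokens))  # count each term at most once per document
    return cf, df

# Hypothetical toy corpus, already tokenized and lowercased.
docs = [
    "try to try again".split(),
    "insurance insurance insurance".split(),
    "try insurance".split(),
]
cf, df = corpus_statistics(docs)
print(cf["try"], df["try"])              # 3 2
print(cf["insurance"], df["insurance"])  # 4 2
```

The only difference between the two counters is the `set(...)` call, which is what makes df ignore repeated occurrences inside one document.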

  7. Querying Corpus-wide statistics
      Word        Collection Frequency   Document Frequency
      insurance   10440                  3997
      try         10422                  8760
     • The two words have nearly the same cf but very different df, which suggests that df is better than cf at discriminating between documents • How do we use df?
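
One way to read the table is to divide cf by df, which gives the average number of occurrences in the documents that actually contain the word. The sketch below uses only the four numbers from the slide; the helper name is an assumption made for illustration.

```python
# (cf, df) pairs copied from the slide's table.
stats = {"insurance": (10440, 3997), "try": (10422, 8760)}

def occurrences_per_containing_doc(cf, df):
    """Average number of occurrences in the documents that contain the term."""
    return cf / df

ratios = {
    word: occurrences_per_containing_doc(cf, df)
    for word, (cf, df) in stats.items()
}
for word, ratio in ratios.items():
    print(f"{word}: {ratio:.2f}")  # insurance: 2.61, then try: 1.19
```

Under this reading, "insurance" is concentrated in fewer documents while "try" is spread thinly across many, even though the totals are almost equal.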

  18. Querying Corpus-wide statistics • Term-Frequency, Inverse Document Frequency Weights • “tf-idf” • tf = term frequency • some measure of term density in a document • idf = inverse document frequency • a measure of the informativeness of a term • its rarity across the corpus • could be just a count of documents with the term • more commonly it is: $\mathrm{idf}_t = \log \frac{|\text{corpus}|}{\mathrm{df}_t}$

  19. Querying TF-IDF Examples • $\mathrm{idf}_t = \log \frac{|\text{corpus}|}{\mathrm{df}_t} = \log_{10} \frac{1{,}000{,}000}{\mathrm{df}_t}$
      term        df_t        idf_t
      calpurnia   1           6
      animal      100         4
      sunday      1,000       3
      fly         10,000      2
      under       100,000     1
      the         1,000,000   0
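
The idf column can be checked with a short script. This is a sketch under the slide's assumption of a 1,000,000-document corpus and a base-10 logarithm; the function name `idf` is illustrative, and "animal" is taken as df = 100, the value consistent with its listed idf of 4.

```python
import math

N = 1_000_000  # corpus size assumed on the slide

def idf(df, n_docs=N):
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(n_docs / df)

doc_freqs = {
    "calpurnia": 1,
    "animal": 100,  # assumption: the df that matches the listed idf of 4
    "sunday": 1_000,
    "fly": 10_000,
    "under": 100_000,
    "the": 1_000_000,
}
idf_values = {term: idf(df) for term, df in doc_freqs.items()}
for term, value in idf_values.items():
    print(f"{term:<10} {doc_freqs[term]:>9,} {value:.1f}")
```

A term that appears in every document gets idf 0, so it contributes nothing to a tf-idf score no matter how often it occurs.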

  20. Querying TF-IDF Summary • Assign a tf-idf weight to each term t in a document d: $\mathrm{tfidf}(t, d) = (1 + \log \mathrm{tf}_{t,d}) \cdot \log \frac{|\text{corpus}|}{\mathrm{df}_t}$ • Increases with the number of occurrences of the term in a doc. • Increases with the rarity of the term across the entire corpus • Three different metrics • term frequency • document frequency • collection/corpus frequency
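
The weight can be written as a small function. This is a minimal sketch: the slide does not fix the logarithm base, so base 10 is assumed here to match the idf examples, and a term that is absent from the document (tf = 0) is given weight 0 because log 0 is undefined.

```python
import math

def tf_idf(tf, df, n_docs):
    """(1 + log10 tf) * log10(n_docs / df); 0.0 when the term is absent."""
    if tf == 0:
        return 0.0  # assumption: absent terms get zero weight
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

n_docs = 1_000_000
print(tf_idf(1, 1_000, n_docs))       # one occurrence of a fairly rare term
print(tf_idf(10, 1_000, n_docs))      # more occurrences: larger weight
print(tf_idf(10, 100_000, n_docs))    # same tf, more common term: smaller weight
print(tf_idf(10, 1_000_000, n_docs))  # term in every document: zero weight
```

The four calls mirror the two bullet points: the weight grows with tf for a fixed df, and shrinks as df grows for a fixed tf.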

  21. Querying Now, real-valued term-document matrices • Bag of words model • Each element of matrix is tf-idf value
      term        Antony and   Julius   The       Hamlet   Othello   Macbeth
                  Cleopatra    Caesar   Tempest
      Antony      13.1         11.4     0.0       0.0      0.0       0.0
      Brutus      3.0          8.3      0.0       1.0      0.0       0.0
      Caesar      2.3          2.3      0.0       0.5      0.3       0.3
      Calpurnia   0.0          11.2     0.0       0.0      0.0       0.0
      Cleopatra   17.7         0.0      0.0       0.0      0.0       0.0
      mercy       0.5          0.0      0.7       0.9      0.9       0.3
      worser      1.2          0.0      0.6       0.6      0.6       0.0
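
A matrix of this shape can be built from tokenized documents as sketched below. The toy corpus and the function name `tf_idf_matrix` are assumptions for illustration (they do not reproduce the Shakespeare numbers), and base-10 logarithms are assumed as in the idf examples.

```python
import math
from collections import Counter

def tf_idf_matrix(docs):
    """Return (terms, matrix); matrix[i][j] is the weight of terms[i] in docs[j]."""
    counts = [Counter(tokens) for tokens in docs]  # bag of words: order is discarded
    terms = sorted(set().union(*docs))
    n_docs = len(docs)
    df = {t: sum(1 for c in counts if t in c) for t in terms}
    matrix = []
    for t in terms:
        idf = math.log10(n_docs / df[t])
        matrix.append(
            [(1 + math.log10(c[t])) * idf if c[t] else 0.0 for c in counts]
        )
    return terms, matrix

# Hypothetical three-document corpus.
docs = [
    "antony brutus caesar caesar".split(),
    "brutus caesar calpurnia".split(),
    "mercy worser mercy".split(),
]
terms, matrix = tf_idf_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:<10}", "  ".join(f"{w:.2f}" for w in row))
```

Rows are terms and columns are documents, the same orientation as the slide's matrix.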

  22. Querying Vector Space Scoring • That is a nice matrix, but • How does it relate to scoring? • Next, vector space scoring

  23. Vector Space Scoring Vector Space Model • Define: Vector Space Model • Representing a set of documents as vectors in a common vector space. • It is fundamental to many operations • (query,document) pair scoring • document classification • document clustering • Queries are represented as a document • A short one, but mathematically equivalent

  24. Vector Space Scoring Vector Space Model • Define: Vector Space Model • A document, d, is defined as a vector: $\vec{V}(d)$ • One component for each term in the dictionary • Assume each component is the tf-idf score of its term: $\vec{V}(d)_t = (1 + \log \mathrm{tf}_{t,d}) \cdot \log \frac{|\text{corpus}|}{\mathrm{df}_t}$ • A corpus is many vectors together. • A document can be thought of as a point in a multidimensional space, with axes related to terms.
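
The definition can be sketched as a function that maps a token list to one component per dictionary term. The dictionary, document frequencies, and corpus size below are hypothetical values chosen to give round numbers, and base-10 logarithms are assumed. Since a query is treated as a short document, the same function is applied to it.

```python
import math
from collections import Counter

def document_vector(tokens, dictionary, df, n_docs):
    """One tf-idf component per dictionary term, in dictionary order."""
    tf = Counter(tokens)
    return [
        (1 + math.log10(tf[t])) * math.log10(n_docs / df[t]) if tf[t] else 0.0
        for t in dictionary
    ]

# Hypothetical dictionary and document frequencies for a 1,000-document corpus.
dictionary = ["antony", "brutus", "mercy", "worser"]
df = {"antony": 10, "brutus": 100, "mercy": 100, "worser": 1}
n_docs = 1000

doc = ["antony"] * 10 + ["brutus", "mercy"]
query = "worser mercy".split()  # a query is just a short document

doc_vec = document_vector(doc, dictionary, df, n_docs)
query_vec = document_vector(query, dictionary, df, n_docs)
print([round(x, 2) for x in doc_vec])    # [4.0, 1.0, 1.0, 0.0]
print([round(x, 2) for x in query_vec])  # [0.0, 0.0, 1.0, 3.0]
```

Both vectors have the same length and the same axis order, which is what allows a document and a query to be compared as points in one space.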

  25. Vector Space Scoring Vector Space Model • Recall our Shakespeare Example: the tf-idf term-document matrix from slide 21.

  28. Vector Space Scoring Vector Space Model • Recall our Shakespeare Example: the columns are labeled as the document vectors $\vec{V}(d_1)$, $\vec{V}(d_2)$, ..., $\vec{V}(d_6)$, with the component $\vec{V}(d_6)_7$ also labeled
                  V(d_1)       V(d_2)   V(d_3)    V(d_4)   V(d_5)    V(d_6)
      term        Antony and   Julius   The       Hamlet   Othello   Macbeth
                  Cleopatra    Caesar   Tempest
      Antony      13.1         11.4     0.0       0.0      0.0       0.0
      Brutus      3.0          8.3      0.0       1.0      0.0       0.0
      Caesar      2.3          2.3      0.0       0.5      0.3       0.3
      Calpurnia   0.0          11.2     0.0       0.0      0.0       0.0
      Cleopatra   17.7         0.0      0.0       0.0      0.0       0.0
      mercy       0.5          0.0      0.7       0.9      0.9       0.3
      worser      1.2          0.0      0.6       0.6      0.6       0.0
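
Reading a document vector off the matrix amounts to taking one column. The sketch below copies the values from the Shakespeare matrix; the helper name `document_column` is an assumption, and reading $\vec{V}(d_6)_7$ as the seventh component of the sixth column is an interpretation of the slide's label.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
terms = ["Antony", "Brutus", "Caesar", "Calpurnia", "Cleopatra", "mercy", "worser"]

# tf-idf weights copied from the Shakespeare matrix: one row per term.
weights = [
    [13.1, 11.4, 0.0, 0.0, 0.0, 0.0],
    [3.0, 8.3, 0.0, 1.0, 0.0, 0.0],
    [2.3, 2.3, 0.0, 0.5, 0.3, 0.3],
    [0.0, 11.2, 0.0, 0.0, 0.0, 0.0],
    [17.7, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.7, 0.9, 0.9, 0.3],
    [1.2, 0.0, 0.6, 0.6, 0.6, 0.0],
]

def document_column(j):
    """Return the document vector for plays[j] (0-based column index)."""
    return [row[j] for row in weights]

v_d1 = document_column(0)
v_d6 = document_column(5)
print(plays[0], v_d1)  # [13.1, 3.0, 2.3, 0.0, 17.7, 0.5, 1.2]
print(plays[5], v_d6)  # [0.0, 0.0, 0.3, 0.0, 0.0, 0.3, 0.0]
print(terms[6], v_d6[6])  # seventh component of V(d_6): worser 0.0
```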

  30. Vector Space Scoring Vector Space Model • Recall our Shakespeare Example: [Plot: the six plays (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth) as points on the Antony and Brutus term axes.]

  32. Vector Space Scoring Vector Space Model • Recall our Shakespeare Example: [Plot: the six plays (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth) as points on the mercy and worser term axes.]

  33. Vector Space Scoring Query as a vector • So a query can also be plotted in the same space • “worser mercy” • To score, we ask: • How similar are two points? • How to answer? [Plot: the query and the six plays (Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth) as points on the mercy and worser term axes.]
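
The slide leaves the question open. As a hedged illustration only, the sketch below places the query in the (mercy, worser) plane and tries two common candidate measures, Euclidean distance and cosine similarity. The slides shown here do not say which measure the course adopts, and the query weights of 1.0 per term are an assumption. The play coordinates are copied from the Shakespeare matrix.

```python
import math

# (mercy, worser) tf-idf weights copied from the Shakespeare matrix.
points = {
    "Antony and Cleopatra": (0.5, 1.2),
    "Julius Caesar": (0.0, 0.0),
    "The Tempest": (0.7, 0.6),
    "Hamlet": (0.9, 0.6),
    "Othello": (0.9, 0.6),
    "Macbeth": (0.3, 0.0),
}
query = (1.0, 1.0)  # assumption: weight 1.0 for each of "mercy" and "worser"

def euclidean(a, b):
    """Straight-line distance between two points (smaller means closer)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

distances = {play: euclidean(query, p) for play, p in points.items()}
cosines = {play: cosine(query, p) for play, p in points.items()}
for play in points:
    print(f"{play:<22} distance={distances[play]:.3f}  cosine={cosines[play]:.3f}")

nearest = min(distances, key=distances.get)
most_aligned = max(cosines, key=cosines.get)
```

On these numbers the two measures do not pick the same play: distance favors Hamlet (tied with Othello), while cosine favors The Tempest, so the choice of measure matters.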
