Subtopic Ranking Based on Hierarchical Headings


  1. Subtopic Ranking Based on Hierarchical Headings Tomohiro Manabe and Keishi Tajima Graduate School of Informatics, Kyoto Univ. {manabe@dl.kuis, tajima@i}.kyoto-u.ac.jp

  2. What are subtopics? • We focus on a topic given as a keyword query • A subtopic of a given keyword query is another keyword query that specializes and/or disambiguates the search intent of the given query • Examples: for the query “harry potter”, ✔ “harry potter movie” but ✘ “harry potter hp”; for the query “office”, ✔ “office workplace” but ✘ “office office” Reference: Sakai, T., Dou, Z., Yamamoto, T., Liu, Y., Zhang, M., and Song, R. (2013). Overview of the NTCIR-10 INTENT-2 task. In NTCIR.

  3. Why are subtopics important? Subtopics are useful for • Query suggestion/completion • Search result diversification, by including a few pages for each subtopic in the search result

  4. Our Problem: Subtopic Ranking • Query suggestion/completion: Which subtopic should be suggested? • Search result diversification: Which subtopic should be included in the search results? Subtopic Ranking Problem: Sorting subtopics by their intent probabilities (the probability that the user intends that subtopic)

  5. Our Idea: Hierarchical Headings are useful We use the hierarchical heading structure in documents. It consists of: • Nested logical blocks • Each block has its own heading • A heading describes its own and descendant blocks Assumption 1: Hierarchical headings represent hierarchical topics

  6. Example • Document (heading: text): Programming: “All about computer programming skills.”; Schools: “Top schools for computer …”; Courses: “Specifically, the most famous …”; Degrees: “Some schools award degrees …”; Jobs: “Programming skills are required …”. Schools and Jobs are child blocks of Programming; Courses and Degrees are child blocks of Schools. • Subtopics: Programming schools, Programming school courses, Programming school degrees, Programming jobs

  7. E.g. the Schools block contains more letters and more descendant blocks than the Jobs block • The authors must have assumed that readers need more information on “Schools” • This suggests that “Schools” has a higher intent probability Assumption 2: Subtopics with more content are more important

  8. Overview of our Assumptions and Methods Our assumptions are: • Hierarchical headings represent hierarchical topics • Topics with more content are more important Our subtopic ranking method: 1. Score blocks based on their content quantity 2. Score subtopics by integrating the scores of blocks matching the subtopics 3. Rank the subtopics based on their scores

  9. Matching between Subtopics and Blocks A subtopic matches a block iff all words in the subtopic appear either in the heading of the block or in the headings of its ancestor blocks. Before comparing, we perform basic preprocessing • Tokenization • Stop word filtering • Stemming

  10. Example of Matching • Subtopic “programming schools” matches block “Schools” in the example document (Programming contains Schools and Jobs; Schools contains Courses and Degrees) • NOTE: if a subtopic matches a block, its descendant blocks also match it, but we only consider top-most matching blocks
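
The matching rule on slides 9 and 10 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Block` class, the tiny stop-word list, and the crude plural-stripping "stemmer" are all hypothetical stand-ins (the slides do not say which tokenizer, stop-word list, or stemmer the authors used).

```python
import re
from dataclasses import dataclass, field

# Assumed, deliberately tiny stop-word list (not from the slides).
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "to"}

@dataclass
class Block:
    heading: str
    children: list["Block"] = field(default_factory=list)

def terms(text: str) -> set[str]:
    """Tokenize, drop stop words, and apply a crude stand-in for stemming."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stripping a trailing "s" is only a placeholder for a real stemmer.
    return {t[:-1] if t.endswith("s") and len(t) > 3 else t
            for t in tokens if t not in STOP_WORDS}

def topmost_matches(root: Block, subtopic: str) -> list[Block]:
    """Return the top-most blocks whose own and ancestor headings cover the subtopic."""
    query = terms(subtopic)
    found: list[Block] = []

    def visit(block: Block, inherited: set[str]) -> None:
        seen = inherited | terms(block.heading)
        if query <= seen:
            found.append(block)  # descendants also match, so stop descending here
            return
        for child in block.children:
            visit(child, seen)

    visit(root, set())
    return found

# Example document from the slides.
doc = Block("Programming", [
    Block("Schools", [Block("Courses"), Block("Degrees")]),
    Block("Jobs"),
])
print([b.heading for b in topmost_matches(doc, "programming schools")])  # ['Schools']
```

Stopping the descent at the first matching block is one simple way to keep only top-most matches.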

  11. Overview of our Methods 1. Score blocks based on their content quantity: we compare 4 block-scoring methods 2. Score subtopics by integrating scores of blocks matching the subtopics: we compare 4 integration methods 3. Rank the subtopics based on their scores: we compare 2 ranking methods Total: 4 × 4 × 2 = 32 methods

  12. Overview of our Methods Our subtopic ranking methods: 1. Score blocks based on their content quantity: we compare 4 block-scoring methods 2. Score subtopics by integrating scores of blocks matching the subtopics: we compare 4 integration methods 3. Rank the subtopics based on their scores: we compare 2 ranking methods

  13. 1. Scoring Blocks Based on Content Quantity We compare four block-scoring methods: 1-A. Length scoring 1-B. Log-scale scoring 1-C. Bottom-up scoring 1-D. Top-down scoring

  14. 1-A. Length Scoring Idea: A block with more text is more important Score a block by the number of letters in it • Including those in descendant blocks Example: Programming 3,000 letters; Schools 2,500 letters; Courses 1,600 letters; Degrees 400 letters; Jobs 440 letters
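
A minimal sketch of length scoring on the example document. The `Block` class and the per-block `own_letters` values are assumptions: the slide only gives totals, so the own-text counts below (60 for Programming, 500 for Schools) are back-computed so that the totals agree with the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    heading: str
    own_letters: int  # letters directly in this block (assumed values, see lead-in)
    children: list = field(default_factory=list)

def length_score(block: Block) -> int:
    """Letters in the block, including those in all descendant blocks."""
    return block.own_letters + sum(length_score(c) for c in block.children)

courses = Block("Courses", 1600)
degrees = Block("Degrees", 400)
schools = Block("Schools", 500, [courses, degrees])
jobs = Block("Jobs", 440)
doc = Block("Programming", 60, [schools, jobs])

print(length_score(doc), length_score(schools), length_score(jobs))  # 3000 2500 440
```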

  15. 1-B. Log-Scale Scoring Idea: The importance of a block is not linearly proportional to its content quantity Score a block by the logarithm of the number of letters in it Example: Programming log(3,000) ≈ 3.5; Schools log(2,500) ≈ 3.4; Courses log(1,600) ≈ 3.2; Degrees log(400) ≈ 2.6; Jobs log(440) ≈ 2.6
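
The example values can be reproduced with a base-10 logarithm. The base is an inference from the slide's numbers (log(3,000) ≈ 3.5), not something the slide states; the function name is hypothetical.

```python
import math

# Letter counts per block, taken from the length-scoring example.
letters = {"Programming": 3000, "Schools": 2500, "Courses": 1600,
           "Degrees": 400, "Jobs": 440}

def log_scale_score(n_letters: int) -> float:
    """Base-10 logarithm of the letter count (assumes n_letters > 0)."""
    return math.log10(n_letters)

log_scores = {h: round(log_scale_score(n), 1) for h, n in letters.items()}
print(log_scores)
```

Note how the gap between Schools and Jobs shrinks from 2,500 vs. 440 under length scoring to about 3.4 vs. 2.6 here.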

  16. 1-C. Bottom-up Scoring Idea: The importance of some topics is independent of text length • e.g. telephone number Score a block by the number of blocks in it (including itself) Example: Programming 1+3+1=5; Schools 1+1+1=3; Courses 1; Degrees 1; Jobs 1
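
Bottom-up scoring is a simple recursive count. In this sketch a block is represented as a `(heading, children)` tuple, which is just a convenient assumption for illustration.

```python
# Each block is (heading, list of child blocks); the layout follows the example document.
courses = ("Courses", [])
degrees = ("Degrees", [])
schools = ("Schools", [courses, degrees])
jobs = ("Jobs", [])
doc = ("Programming", [schools, jobs])

def bottom_up_score(block) -> int:
    """Number of blocks in the subtree rooted at this block, including itself."""
    _heading, children = block
    return 1 + sum(bottom_up_score(child) for child in children)

print(bottom_up_score(doc), bottom_up_score(schools), bottom_up_score(jobs))  # 5 3 1
```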

  17. 1-D. Top-down Scoring Idea: Authors often divide a block into child blocks that have equal importance score = parent’s score / (|sibling| + 1) Example: Programming 1; Schools 1 / (2 + 1) = 1/3; Courses (1/3) / (2 + 1) = 1/9; Degrees (1/3) / (2 + 1) = 1/9; Jobs 1 / (2 + 1) = 1/3
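
A sketch of top-down scoring that reproduces the example fractions. One assumption to flag: judging from the arithmetic on the slide (1 / (2 + 1) for a parent with two child blocks), the denominator is read here as the number of child blocks of the parent plus one; the slide does not explain what the "+ 1" stands for.

```python
from fractions import Fraction

# Each block is (heading, list of child blocks).
doc = ("Programming", [
    ("Schools", [("Courses", []), ("Degrees", [])]),
    ("Jobs", []),
])

def top_down_scores(block, score=Fraction(1), out=None):
    """Give the root score 1 and each child parent_score / (number of children + 1)."""
    out = {} if out is None else out
    heading, children = block
    out[heading] = score
    for child in children:
        top_down_scores(child, score / (len(children) + 1), out)
    return out

scores = top_down_scores(doc)
print({h: str(s) for h, s in scores.items()})
# {'Programming': '1', 'Schools': '1/3', 'Courses': '1/9', 'Degrees': '1/9', 'Jobs': '1/3'}
```

Using `Fraction` keeps the values exact, so they can be compared directly with the slide.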

  18. Overview of our Methods Our subtopic ranking methods: 1. Score blocks based on their content quantity: we compare 4 block-scoring methods 2. Score subtopics by integrating scores of blocks matching the subtopics: we compare 4 integration methods 3. Rank the subtopics based on their scores: we compare 2 ranking methods

  19. 2. Score Subtopics by Integrating Scores of Matching Blocks 2-1. Integrate the block scores into document scores 2-2. Integrate the document scores into the final score (Figure: matching blocks with scores 300, 200, and 500; the document scores and the final score are not yet computed.)

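  20. 2-1. Integrate Block Scores into Document Score • Simply sum up the scores of all matching blocks in each document Example: a document with one matching block (score 300) gets document score 300; a document with two matching blocks (scores 200 and 500) gets document score 700 = 200 + 500; the final score is not yet computed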
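
As a small illustration of step 2-1, the document scores in the example are plain sums. The document labels are placeholders, since the slide does not name the documents.

```python
# Scores of the matching blocks in each document (values from the slide).
matching_block_scores = {
    "document 1": [300],
    "document 2": [200, 500],
}

# Step 2-1: the document score is the sum of its matching block scores.
document_scores = {doc: sum(scores) for doc, scores in matching_block_scores.items()}
print(document_scores)  # {'document 1': 300, 'document 2': 700}
```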

  21. 2-2. Integrate Document Scores into the Final Score We compare four integration methods: 2-2-a. Simple Summation 2-2-b. Per-Document Normalization 2-2-c. Per-Domain Normalization 2-2-d. Hybrid Normalization

  22. 2-2-a. Simple Summation Simply sum up the scores of multiple documents • The score of a subtopic is its content quantity in the whole corpus Example: document scores 400, 0, and 100 give a final score of 500

  23. 2-2-b. Per-Document Normalization • In the summation method, documents with more content have a bigger influence on scores • However, each document may be equally important Divide each document score by the score of the root block of the document Example: 400 / 500 + 0 / 900 + 100 / 100 = 1.8
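
Simple summation and per-document normalization can be compared on the same three documents. In this sketch each document is a hypothetical `(matched_score, root_score)` pair, with the values taken from the slides; a root score of zero is assumed not to occur.

```python
# (score of matching blocks, score of the root block) for each document.
docs = [(400, 500), (0, 900), (100, 100)]

def simple_summation(docs) -> float:
    """2-2-a: add up the document scores as they are."""
    return sum(matched for matched, _root in docs)

def per_document_normalization(docs) -> float:
    """2-2-b: divide each document score by its root block score, then add up."""
    return sum(matched / root for matched, root in docs)

print(simple_summation(docs))                      # 500
print(round(per_document_normalization(docs), 2))  # 1.8
```

With normalization, the small third document (100 / 100) contributes more than the large first one (400 / 500).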

  24. 2-2-c. Per-Domain Normalization • We can also consider per-domain normalization Divide the total score of matching blocks in a domain by the total score of root blocks in the domain Example: http://abc.com/ has one document (400 / 500) and http://def.com/ has two (0 / 900 and 100 / 100); the domain scores are 400 / 500 and (100 + 0) / (900 + 100), so the final score is 0.9

  25. 2-2-d. Hybrid Normalization Apply both page-based and domain-based normalization Example: the per-document scores are 400 / 500 for http://abc.com/ and 0 / 900 and 100 / 100 for http://def.com/; the domain scores are 0.8 / 1 and (0 + 1) / 2, so the final score is 1.3
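
The two domain-aware variants can be sketched as below. The `(domain, matched_score, root_score)` tuple layout and the function names are assumptions. The hybrid formula (average of per-document normalized scores within each domain) is inferred from the example arithmetic, 0.8 / 1 and (0 + 1) / 2, rather than stated on the slide.

```python
from collections import defaultdict

# (domain, score of matching blocks, score of the root block) for each document.
docs = [
    ("abc.com", 400, 500),
    ("def.com", 0, 900),
    ("def.com", 100, 100),
]

def per_domain_normalization(docs) -> float:
    """2-2-c: per domain, total matching score / total root score; then add up."""
    matched, root = defaultdict(float), defaultdict(float)
    for domain, m, r in docs:
        matched[domain] += m
        root[domain] += r
    return sum(matched[d] / root[d] for d in matched)

def hybrid_normalization(docs) -> float:
    """2-2-d (as inferred): per domain, mean of per-document normalized scores."""
    normalized = defaultdict(list)
    for domain, m, r in docs:
        normalized[domain].append(m / r)
    return sum(sum(v) / len(v) for v in normalized.values())

print(round(per_domain_normalization(docs), 2))  # 0.9
print(round(hybrid_normalization(docs), 2))      # 1.3
```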

  26. Overview of our Methods Our subtopic ranking methods: 1. Score blocks based on their content quantity: we compare 4 block-scoring methods 2. Score subtopics by integrating scores of blocks matching the subtopics: we compare 4 integration methods 3. Rank the subtopics based on their scores: we compare 2 ranking methods

  27. 3. Rank the Subtopics Based on Their Scores We compare 2 ranking methods: 3-A. Simple Ranking Method 3-B. Diversified Ranking Method

  28. 3-A. Simple Ranking Method • Simply sort subtopics by their scores Example (length scores of the blocks: Programming 3,000 letters; Schools 2,500; Courses 1,600; Degrees 400; Jobs 440): Subtopic (Score): Programming Schools (2,500); Programming School Courses (1,600); Programming Jobs (440)
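
The simple ranking method amounts to a descending sort by score, as in this short sketch using the scores from the example table.

```python
# Subtopic scores from the example table.
scores = {
    "Programming Jobs": 440,
    "Programming School Courses": 1600,
    "Programming Schools": 2500,
}

# Sort subtopics by score, highest first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
# ['Programming Schools', 'Programming School Courses', 'Programming Jobs']
```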

  29. 3-B. Diversified Ranking Method • As search result diversification is an important application, we also want a diversified ranking of subtopics • The basic idea is: • If a block matches an already-ranked subtopic, the topic of the block is already included in the ranking • So even if the block also matches some lower-ranked subtopics, the block should not contribute to their scores
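
One way to realize this basic idea is a greedy loop that re-scores the remaining subtopics after each pick. The sketch below is an assumption, not the authors' algorithm: the slide gives only the idea, and the data layout, the function name, and the alphabetical tie-breaking are all hypothetical. Block scores are the length scores from the running example.

```python
def diversified_ranking(topmost, matches, score):
    """Greedy ranking sketch.

    topmost: subtopic -> ids of its top-most matching blocks
    matches: block id -> all subtopics the block matches (including via ancestors)
    score:   block id -> block score
    """
    ranked, remaining = [], set(topmost)
    while remaining:
        def current(subtopic):
            # Blocks that match an already-ranked subtopic no longer contribute.
            return sum(score[b] for b in topmost[subtopic]
                       if not (matches[b] & set(ranked)))
        best = max(sorted(remaining), key=current)  # ties broken alphabetically
        ranked.append(best)
        remaining.remove(best)
    return ranked

score = {"Schools": 2500, "Courses": 1600, "Degrees": 400, "Jobs": 440}
topmost = {
    "programming schools": ["Schools"],
    "programming school courses": ["Courses"],
    "programming school degrees": ["Degrees"],
    "programming jobs": ["Jobs"],
}
matches = {
    "Schools": {"programming schools"},
    "Courses": {"programming schools", "programming school courses"},
    "Degrees": {"programming schools", "programming school degrees"},
    "Jobs": {"programming jobs"},
}
ranking = diversified_ranking(topmost, matches, score)
print(ranking)
```

Under these assumptions, "programming jobs" moves ahead of "programming school courses": once "programming schools" is ranked, the Courses block already counts as covered.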
