
Subtopic Ranking Based on Hierarchical Headings
Tomohiro Manabe and Keishi Tajima
Graduate School of Informatics, Kyoto Univ.
{manabe@dl.kuis, tajima@i}.kyoto-u.ac.jp

What are subtopics?
• We focus on a topic given as a keyword query
• A subtopic of a given keyword query is another keyword query that specializes and/or disambiguates the search intent of the given query
Examples:
• Query “harry potter”: ✔ “harry potter movie”, ✘ “harry potter hp”
• Query “office”: ✔ “office workplace”, ✘ “office office”
Sakai, T., Dou, Z., Yamamoto, T., Liu, Y., Zhang, M., and Song, R. (2013). Overview of the NTCIR-10 INTENT-2 task. In NTCIR.

Why are subtopics important?
Subtopics are useful for:
• Query suggestion/completion
• Search result diversification
  • By including a few pages for each subtopic in the search result

Our Problem: Subtopic Ranking
• Query suggestion/completion
  • Which subtopic should be suggested?
• Search result diversification
  • Which subtopic should be included in the search results?
Subtopic Ranking Problem: sorting subtopics by their intent probabilities (the probability that the user intends that subtopic)

Our Idea: Hierarchical Headings Are Useful
We use the hierarchical heading structure in documents. It consists of:
• Nested logical blocks
• Each block has its own heading
• A heading describes its own block and its descendant blocks
Assumption 1: Hierarchical headings represent hierarchical topics

Example Document
Programming: All about computer programming skills.
  Schools: Top schools for computer …
    Courses: Specifically, the most famous …
    Degrees: Some schools award degrees …
  Jobs: Programming skills are required …
Subtopics:
• Programming schools
• Programming school courses
• Programming school degrees
• Programming jobs

E.g. in the example document, the Schools block contains more letters and descendant blocks than the Jobs block
• Authors must have assumed the readers need more information on “Schools”
• It suggests that “Schools” has a higher intent probability
Assumption 2: Subtopics with more content are more important

Overview of our Assumptions and Methods
Our assumptions are:
• Hierarchical headings represent hierarchical topics
• Topics with more content are more important
Our subtopic ranking method:
1. Score blocks based on their content quantity
2. Score subtopics by integrating the scores of blocks matching the subtopics
3. Rank the subtopics based on their scores

Matching between Subtopics and Blocks
A subtopic matches a block iff all words in the subtopic appear either in the heading of the block or in the headings of its ancestor blocks.
Before comparing, we perform basic preprocessing:
• Tokenization
• Stop word filtering
• Stemming

Example of Matching
Subtopic “programming schools” matches block “Schools” in the example document (Programming > Schools > Courses, Degrees; Programming > Jobs).
NOTE: if a subtopic matches a block, its descendant blocks also match it, but we only consider the top-most matching blocks.
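The matching rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Block` class, the tiny stop-word list, and the plural-stripping "stemmer" are hypothetical stand-ins for whatever tokenizer, stop-word filter, and stemmer are actually used.

```python
import re
from dataclasses import dataclass, field

# Illustrative stop-word list (an assumption, not taken from the slides).
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "on", "to"}


def normalize(text):
    """Tokenize, drop stop words, and crudely stem by stripping a final 's'."""
    stems = set()
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        if tok in STOP_WORDS:
            continue
        if tok.endswith("s") and len(tok) > 3:
            tok = tok[:-1]  # stand-in for a real stemmer
        stems.add(tok)
    return stems


@dataclass
class Block:
    heading: str
    children: list = field(default_factory=list)


def topmost_matches(block, subtopic, inherited=frozenset()):
    """Return the headings of the top-most blocks that match the subtopic."""
    # Words available to this block: its own heading plus all ancestor headings.
    words = inherited | normalize(block.heading)
    if normalize(subtopic) <= words:
        return [block.heading]  # descendants also match, but are not reported
    found = []
    for child in block.children:
        found.extend(topmost_matches(child, subtopic, words))
    return found


# The example document from the slides.
doc = Block("Programming", [
    Block("Schools", [Block("Courses"), Block("Degrees")]),
    Block("Jobs"),
])
```

For example, `topmost_matches(doc, "programming schools")` returns only the Schools block, even though Courses and Degrees would also satisfy the word condition.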

Overview of our Methods
1. Score blocks based on their content quantity
   We compare 4 block-scoring methods
2. Score subtopics by integrating scores of blocks matching the subtopics
   We compare 4 integration methods
3. Rank the subtopics based on their scores
   We compare 2 ranking methods
Total: 4 x 4 x 2 = 32 methods


1. Scoring Blocks Based on Content Quantity
We compare four block-scoring methods:
1-A. Length scoring
1-B. Log-scale scoring
1-C. Bottom-up scoring
1-D. Top-down scoring

1-A. Length Scoring
Idea: A block with more text is more important
Score a block by the number of letters in it
• Including those in descendant blocks
Example:
Programming: 3,000 letters
  Schools: 2,500 letters
    Courses: 1,600 letters
    Degrees: 400 letters
  Jobs: 440 letters
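A minimal sketch of length scoring, assuming a hypothetical `Block` class that stores the letters written directly in a block. The `own_letters` values below are not on the slide; they are back-computed so that the totals agree with the example.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    heading: str
    own_letters: int  # letters directly in this block, excluding child blocks
    children: list = field(default_factory=list)


def length_score(block):
    """Number of letters in the block, including its descendant blocks."""
    return block.own_letters + sum(length_score(c) for c in block.children)


# own_letters are assumed values chosen to reproduce the totals in the example.
courses = Block("Courses", 1600)
degrees = Block("Degrees", 400)
schools = Block("Schools", 500, [courses, degrees])
jobs = Block("Jobs", 440)
doc = Block("Programming", 60, [schools, jobs])
```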

1-B. Log-Scale Scoring
Idea: The importance of a block is not linearly proportional to its content quantity
Score a block by the logarithm of the number of letters in it
Example:
Programming: log(3,000) ≈ 3.5
  Schools: log(2,500) ≈ 3.4
    Courses: log(1,600) ≈ 3.2
    Degrees: log(400) ≈ 2.6
  Jobs: log(440) ≈ 2.6
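The example values can be checked with a few lines of Python. The base of the logarithm is not stated on the slide; base 10 is assumed here because it reproduces the figures shown.

```python
import math


def log_scale_score(letters):
    """Base-10 logarithm of the letter count (the base is an assumption)."""
    return math.log10(letters)


# Letter counts from the length-scoring example.
letter_counts = {
    "Programming": 3000,
    "Schools": 2500,
    "Courses": 1600,
    "Degrees": 400,
    "Jobs": 440,
}
scores = {h: round(log_scale_score(n), 1) for h, n in letter_counts.items()}
```

Note how Degrees (400 letters) and Jobs (440 letters) become indistinguishable at one decimal place, and how the gap between Programming and Courses shrinks from almost 2x to about 0.3.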

1-C. Bottom-up Scoring
Idea: The importance of some topics is independent of text length
• e.g. telephone number
Score a block by the number of blocks in it (including itself)
Example:
Programming: 1 + 3 + 1 = 5
  Schools: 1 + 1 + 1 = 3
    Courses: 1
    Degrees: 1
  Jobs: 1
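Bottom-up scoring amounts to counting the nodes of each subtree. A minimal sketch, assuming the document is represented as nested `(heading, children)` tuples (a representation chosen here only for illustration):

```python
def bottom_up_score(block):
    """Number of blocks in the subtree rooted at `block`, including itself."""
    _heading, children = block
    return 1 + sum(bottom_up_score(child) for child in children)


# The example document as (heading, children) tuples.
courses = ("Courses", [])
degrees = ("Degrees", [])
schools = ("Schools", [courses, degrees])
jobs = ("Jobs", [])
doc = ("Programming", [schools, jobs])
```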

1-D. Top-down Scoring
Idea: Authors often divide a block into child blocks that have equal importance
score = parent’s score / (|siblings| + 1)
Example:
Programming: 1
  Schools: 1 / (2 + 1) = 1/3
    Courses: (1/3) / (2 + 1) = 1/9
    Degrees: (1/3) / (2 + 1) = 1/9
  Jobs: 1 / (2 + 1) = 1/3
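A sketch of top-down scoring that reproduces the example values. It rests on one reading of the example, namely that the divisor is the number of child blocks of the parent (the block together with its siblings) plus 1, and that the root block scores 1. The nested `(heading, children)` tuples are again an illustrative representation.

```python
from fractions import Fraction


def top_down_scores(block, score=Fraction(1), out=None):
    """Map each heading to its score; the root block is assumed to score 1."""
    heading, children = block
    if out is None:
        out = {}
    out[heading] = score
    for child in children:
        # Assumed divisor: number of child blocks of this block, plus 1.
        top_down_scores(child, score / (len(children) + 1), out)
    return out


doc = ("Programming", [
    ("Schools", [("Courses", []), ("Degrees", [])]),
    ("Jobs", []),
])
scores = top_down_scores(doc)
```

Exact fractions are used so the results can be compared directly with 1/3 and 1/9.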

2. Score Subtopics by Integrating Scores of Matching Blocks
2-1. Integrate the block scores into document scores
2-2. Integrate the document scores into the final score
[Figure: matching blocks with scores 300, 200, and 500; the document scores and the final score are still unknown (???)]

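2-1. Integrate Block Scores into Document Score
• Simply sum up the scores of all matching blocks in each document
[Figure: a document with one matching block (score 300) gets score 300; a document with two matching blocks (scores 200 and 500) gets score 700 = 200 + 500; the final score is still unknown (???)]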

2-2. Integrate Document Scores into the Final Score
We compare four integration methods:
2-2-a. Simple Summation
2-2-b. Per-Document Normalization
2-2-c. Per-Domain Normalization
2-2-d. Hybrid Normalization

2-2-a. Simple Summation
Simply sum up the scores of multiple documents
• The score of a subtopic is its content quantity in the whole corpus
[Figure: document scores 400, 0, and 100 sum to a final score of 500]

2-2-b. Per-Document Normalization
• In the summation method, documents with more content have a bigger influence on scores
• However, each document may be equally important
Divide the score of each document by the score of its root block
[Figure: document scores 400 / 500, 0 / 900, and 100 / 100 sum to a final score of 1.8]

2-2-c. Per-Domain Normalization
• We can also consider per-domain normalization
Divide the total score of matching blocks in a domain by the total score of root blocks in the domain
[Figure: http://abc.com/ has one document (400 / 500), so its score is 400 / 500; http://def.com/ has two documents (0 / 900 and 100 / 100), so its score is (100 + 0) / (900 + 100); final score: 0.9]

2-2-d. Hybrid Normalization
Apply both page-based and domain-based normalization
[Figure: http://abc.com/ has one document (400 / 500), so its score is 0.8 / 1; http://def.com/ has two documents (0 / 900 and 100 / 100), so its score is (0 + 1) / 2; final score: 1.3]
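The four integration methods can be compared side by side on the three example documents. The sketch below is one reading of the figures, not a definitive implementation: the tuple layout and function names are assumptions, and the hybrid method is interpreted as averaging per-document normalized scores within each domain, which matches 0.8 / 1 and (0 + 1) / 2.

```python
from collections import defaultdict
from fractions import Fraction

# One tuple per document:
# (domain, summed score of matching blocks, score of the root block)
docs = [
    ("abc.com", 400, 500),
    ("def.com", 0, 900),
    ("def.com", 100, 100),
]


def simple_summation(docs):
    return sum(matched for _, matched, _ in docs)


def per_document(docs):
    # Each document's score is divided by the score of its root block.
    return sum(Fraction(matched, root) for _, matched, root in docs)


def per_domain(docs):
    matched_by_domain = defaultdict(int)
    root_by_domain = defaultdict(int)
    for domain, matched, root in docs:
        matched_by_domain[domain] += matched
        root_by_domain[domain] += root
    return sum(
        Fraction(matched_by_domain[d], root_by_domain[d])
        for d in matched_by_domain
    )


def hybrid(docs):
    # Assumed reading: mean of per-document normalized scores in each domain.
    normalized = defaultdict(list)
    for domain, matched, root in docs:
        normalized[domain].append(Fraction(matched, root))
    return sum(sum(v) / len(v) for v in normalized.values())
```

On this data the four functions give 500, 9/5 (1.8), 9/10 (0.9), and 13/10 (1.3) respectively, which agree with the final scores in the figures.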

3. Rank the Subtopics Based on Their Scores
We compare 2 ranking methods:
3-A. Simple Ranking Method
3-B. Diversified Ranking Method

3-A. Simple Ranking Method
• Simply sort subtopics by their scores
Block letter counts in the example document: Programming 3,000; Schools 2,500; Courses 1,600; Degrees 400; Jobs 440
Example:
Subtopic                     Score
Programming Schools          2,500
Programming School Courses   1,600
Programming Jobs               440

3-B. Diversified Ranking Method
• As search result diversification is an important application, we also want a diversified ranking of subtopics
• The basic idea is:
  • If a block matches an already-ranked subtopic, the topic of the block is already included in the ranking
  • So even if the block also matches some lower-ranked subtopics, the block should not contribute to their scores
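One way to read the basic idea is as a greedy loop: pick the best remaining subtopic, then stop its matching blocks from contributing to any later subtopic. The sketch below follows that reading. The greedy formulation, the function name, and the block and subtopic identifiers are all assumptions made for illustration, and the data is hypothetical.

```python
def diversified_ranking(matches, block_scores):
    """Greedily rank subtopics, ignoring blocks already covered by the ranking.

    matches: subtopic -> set of ids of the blocks that match it
    block_scores: block id -> score
    """
    remaining = dict(matches)
    used = set()  # blocks matching an already-ranked subtopic
    ranking = []
    while remaining:
        def score(subtopic):
            return sum(block_scores[b] for b in remaining[subtopic] - used)

        best = max(remaining, key=score)
        ranking.append((best, score(best)))
        used |= remaining.pop(best)
    return ranking


# Hypothetical data: block b3 matches both s1 and s2.
block_scores = {"b1": 300, "b2": 200, "b3": 500}
matches = {"s1": {"b1", "b3"}, "s2": {"b3"}, "s3": {"b2"}}
ranking = diversified_ranking(matches, block_scores)
```

With simple ranking the order would be s1 (800), s2 (500), s3 (200). Here s2 drops below s3 because its only matching block, b3, is already covered once s1 has been ranked.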
