  1. Categorizing Web Viewership Using Statistical Models of Web Navigation and Text Classification
     Alan L. Montgomery and Brett Gordon, Carnegie Mellon University
     Marketing Science Conference, University of Alberta, Edmonton, 28 June 2002
     Outline
     • Clickstream Example – What topic is the user looking at on each page?
     • Information Sources – Dmoz.org classification; Text classification; User browsing model
     • Results
     • Conclusions

  2. Clickstream Example
     What topic is this user browsing on each of the following pages?
     [Web page screenshot; true class: Business]

  3. [Web page screenshots; true classes: Business, Business]

  4. [Web page screenshots; true classes: Sports, Sports]

  5. [Web page screenshots; true classes: Sports, News]

  6. [Web page screenshots; true classes: News, News]

  7. [Web page screenshot; true class: News]
     Full sequence of true classes: {Business} {Business} {Business} {Sports} {Sports} {Sports} {News} {News} {News} {News}
     User Demographics: Sex: Male; Age: 22; Occupation: Student; Income: < $30,000; State: Pennsylvania; Country: U.S.A.

  8. Information Sources
     Clickstream Data
     • Panel of representative web users collected by Jupiter Media Metrix
     • Sample of 30 randomly selected users who browsed during April 2002
       – 38k URL viewings
       – 13k unique URLs visited
       – 1,550 domains
     • Average user
       – Views 1,300 URLs
       – Active for 9 hours/month
     Classification Information
     • Dmoz.org – pages classified by human experts
     • Page content – text classification algorithms from computer science / information retrieval

  9. Dmoz.org Categories
     • Largest, most comprehensive human-edited directory of the web
     • Constructed and maintained by volunteers (open source); original set donated by Netscape
     • Used by Netscape, AOL, Google, Lycos, Hotbot, DirectHit, etc.
     • Over 3 million sites classified, 438k categories, 43k editors (Dec 2001)
     Top-level categories: 1. Arts, 2. Business, 3. Computers, 4. Games, 5. Health, 6. Home, 7. News, 8. Recreation, 9. Reference, 10. Science, 11. Shopping, 12. Society, 13. Sports, 14. Adult
     Problem
     • The web is very large and dynamic, and only a fraction of pages can be classified
       – 147m hosts (Jan 2002, Internet Domain Survey, isc.org)
       – 1b+ web pages (?)
     • Only a fraction of the web pages in our panel are categorized
       – 1.3% of web pages are exactly categorized
       – 7.3% categorized within one level
       – 10% categorized within two levels
       – 74% of pages have no classification information

  10. Text Classification Background
      • Information Retrieval
        – Overview (Baeza-Yates and Ribeiro-Neto 2000, Chakrabarti 2000)
        – Naïve Bayes (Joachims 1997)
        – Support Vector Machines (Vapnik 1995, Joachims 1998)
        – Feature Selection (Mladenic and Grobelnik 1998, Yang and Pedersen 1998)
        – Latent Semantic Indexing
        – Language Models (MacKay and Peto 1994)

  11. [Web page screenshot; true class: Sports]
      Page Contents = HTML Code + Regular Text

  12. Tokenization & Lexical Parsing • HTML code is removed • Punctuation is removed • All words are converted to lowercase • Stopwords are removed – Common, non-informative words such as ‘the’, ‘and’, ‘with’, ‘an’, etc… Determine the term frequency (TF) of each remaining unique word 23 Result: Document Vector home 2 game 8 hit 4 runs 6 threw 2 ejected 1 baseball 5 major 2 league 2 bat 2 24 12

  13. Classifying Document Vectors
      Test document: home 2, game 8, hit 4, runs 6, threw 2, ejected 1, baseball 5, major 2, league 2, bat 2
      Which class should the test document be assigned to?
      Candidate class term distributions:
      • {News}: bush 58, congress 92, tax 48, cynic 16, politician 23, forest 9, major 3, world 29, summit 31, federal 64
      • {Sports}: game 97, football 32, hit 45, goal 84, umpire 23, won 12, league 58, baseball 39, soccer 21, runs 26
      • {Shopping}: sale 87, customer 28, cart 24, game 16, microsoft 31, buy 93, order 75, pants 21, nike 8, tax 19

  14. Classifying Document Vectors
      P( {News} | Test Doc ) = 0.02
      P( {Sports} | Test Doc ) = 0.91
      P( {Shopping} | Test Doc ) = 0.07
      The test document is assigned to the {Sports} class, whose term distribution (game 97, football 32, hit 45, goal 84, umpire 23, won 12, league 58, baseball 39, soccer 21, runs 26) is the closest match.

  15. Classification Model
      • A document is a vector of term frequency (TF) values; each category has its own term distribution
      • Words in a document are generated by a multinomial model of the term distribution in a given class:
        $d \sim \mathrm{Multinomial}\big(n,\; p^{c} = (p^{c}_{1}, p^{c}_{2}, \ldots, p^{c}_{|V|})\big)$
      • Classification:
        $\hat{c} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} P(c) \prod_{i=1}^{|V|} P(w_{i} \mid c)^{n_{i}}$
        where $|V|$ is the vocabulary size and $n_{i}$ is the number of times word $i$ appears in the document
      Results
      • 25% correct classification
      • Compare with random guessing of 7%
      • More advanced techniques perform slightly better:
        – Shrinkage of word term frequencies (McCallum et al. 1998)
        – n-gram models
        – Support Vector Machines
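The classification rule above can be sketched in a few lines of Python. This is an illustration, not the authors' implementation: it reuses the toy class vectors from the earlier slides as if they were estimated term distributions, assumes a uniform class prior, and adds Laplace smoothing (a choice the slides do not specify) so unseen words such as "threw" or "bat" do not zero out a class.

```python
import math
from collections import Counter

# Toy class term-frequency vectors taken from the earlier slide;
# a real classifier would estimate these from many labeled pages.
CLASS_TF = {
    "News":     {"bush": 58, "congress": 92, "tax": 48, "cynic": 16, "politician": 23,
                 "forest": 9, "major": 3, "world": 29, "summit": 31, "federal": 64},
    "Sports":   {"game": 97, "football": 32, "hit": 45, "goal": 84, "umpire": 23,
                 "won": 12, "league": 58, "baseball": 39, "soccer": 21, "runs": 26},
    "Shopping": {"sale": 87, "customer": 28, "cart": 24, "game": 16, "microsoft": 31,
                 "buy": 93, "order": 75, "pants": 21, "nike": 8, "tax": 19},
}
PRIOR = {"News": 1/3, "Sports": 1/3, "Shopping": 1/3}  # assumed uniform prior

def classify(doc_tf, class_tf=CLASS_TF, prior=PRIOR, alpha=1.0):
    """Multinomial naive Bayes: argmax_c P(c) * prod_i P(w_i|c)^n_i,
    computed in log space with add-alpha (Laplace) smoothing."""
    vocab = set().union(*[tf.keys() for tf in class_tf.values()]) | set(doc_tf)
    scores = {}
    for c, tf in class_tf.items():
        total = sum(tf.values()) + alpha * len(vocab)
        log_p = math.log(prior[c])
        for word, n_i in doc_tf.items():
            p_word = (tf.get(word, 0) + alpha) / total   # smoothed P(w_i | c)
            log_p += n_i * math.log(p_word)
        scores[c] = log_p
    return max(scores, key=scores.get), scores

test_doc = Counter({"home": 2, "game": 8, "hit": 4, "runs": 6, "threw": 2,
                    "ejected": 1, "baseball": 5, "major": 2, "league": 2, "bat": 2})
print(classify(test_doc)[0])   # favours "Sports" for this test document
```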

  16. User Browsing Model
      • Web browsing is “sticky” or persistent: users tend to view a series of pages within the same category and then switch to another topic
      • Example: {News} {News} {News}

  17. Markov Switching Model
      Pooled transition matrix (heterogeneity across users); final row gives the unconditional category shares:

                    arts   bus  comp  game  hlth  home  news   rec   ref   sci  shop   soc sport adult
      arts           83%    4%    5%    2%    1%    2%    6%    3%    2%    6%    2%    3%    4%    1%
      business        3%   73%    5%    3%    2%    3%    6%    2%    3%    3%    3%    2%    3%    2%
      computers       5%   11%   79%    3%    3%    7%    5%    3%    4%    4%    5%    5%    2%    2%
      games           1%    3%    2%   90%    1%    1%    1%    1%    0%    1%    1%    1%    1%    0%
      health          0%    0%    0%    0%   84%    1%    1%    0%    0%    1%    0%    1%    0%    0%
      home            0%    1%    1%    0%    1%   80%    1%    1%    0%    1%    1%    1%    0%    0%
      news            1%    1%    1%    0%    1%    0%   69%    0%    0%    1%    0%    1%    1%    0%
      recreation      1%    1%    1%    0%    1%    1%    1%   86%    1%    1%    1%    1%    1%    0%
      reference       0%    1%    1%    0%    1%    0%    1%    0%   85%    2%    0%    1%    1%    0%
      science         1%    0%    0%    0%    1%    1%    1%    0%    1%   75%    0%    1%    0%    0%
      shopping        1%    3%    2%    1%    1%    2%    1%    1%    0%    1%   86%    1%    1%    0%
      society         1%    1%    2%    0%    2%    1%    3%    1%    2%    2%    0%   82%    1%    1%
      sports          2%    1%    1%    0%    0%    0%    3%    1%    1%    0%    0%    1%   85%    0%
      adult           1%    1%    1%    0%    0%    0%    1%    0%    0%    0%    0%    1%    0%   93%
      (uncond.)      16%   10%   19%   11%    2%    3%    2%    6%    3%    2%    7%    6%    5%    7%

      Implications
      • Suppose we have the following sequence: ? {News} {News}
      • Using Bayes' rule, we can determine that there is a 97% probability the unknown page is news (unconditional = 2%, conditional on the last observation = 69%)
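A toy sketch of the Bayes' rule update the Implications bullet describes: start from the unconditional category probabilities and update them using the transition probabilities into {News}. The matrix orientation (which axis is the "from" category), the restriction to a handful of categories, and conditioning on only the immediately following page are all simplifying assumptions of this sketch; the slide's 97% figure comes from the authors' full model, which also accounts for user-level heterogeneity.

```python
# Toy Bayes' rule update for the sequence "? {News} {News}".
# Assumption: p_next_news[c] is read off the matrix above as
# P(next page is News | current page is in category c).
unconditional = {"news": 0.02, "arts": 0.16, "business": 0.10, "computers": 0.19,
                 "sports": 0.05, "shopping": 0.07}
p_next_news   = {"news": 0.69, "arts": 0.01, "business": 0.01, "computers": 0.01,
                 "sports": 0.01, "shopping": 0.00}

# Posterior over the unknown page's category c, given the next page is News:
#   P(c | next = News) ∝ P(c) * P(next = News | c)
joint = {c: unconditional[c] * p_next_news[c] for c in unconditional}
total = sum(joint.values())
posterior = {c: p / total for c, p in joint.items()}
print(posterior["news"])   # far larger than the 2% unconditional probability
```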

  18. Results Methodology
      Bayesian setup to combine information from:
      • Known categories based on exact matches
      • Text classification
      • Markov model of user browsing
        – Introduce heterogeneity by assuming that the conditional transition probability vectors are drawn from a Dirichlet distribution
      • Similarity of other pages in the same domain
        – Assume that the category of each page within a domain follows a Dirichlet distribution, so if we are at a “news” site then its pages are more likely to be classified as “news”
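A schematic sketch of how independent evidence sources can be combined in a Bayesian setup of this kind: each source contributes a per-category likelihood, and the log-likelihoods are summed with the prior before renormalizing. The sources, category subset, and numbers below are illustrative placeholders only; the authors' actual model additionally uses Dirichlet priors for browsing heterogeneity and domain effects.

```python
import math

CATEGORIES = ["news", "sports", "shopping"]   # abbreviated category set for illustration

def combine(prior, *likelihoods):
    """Combine a prior with several independent per-category likelihoods:
    P(c | all evidence) ∝ P(c) * Π_k P(evidence_k | c), in log space."""
    log_post = {c: math.log(prior[c]) for c in CATEGORIES}
    for lik in likelihoods:
        for c in CATEGORIES:
            log_post[c] += math.log(lik[c])
    m = max(log_post.values())                       # normalize back to probabilities
    unnorm = {c: math.exp(v - m) for c, v in log_post.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Placeholder evidence for one unlabeled page (illustrative numbers only):
prior           = {"news": 0.02, "sports": 0.05, "shopping": 0.07}  # unconditional shares
text_evidence   = {"news": 0.02, "sports": 0.91, "shopping": 0.07}  # text classifier output
domain_evidence = {"news": 0.10, "sports": 0.80, "shopping": 0.10}  # other pages in the domain
browse_evidence = {"news": 0.03, "sports": 0.85, "shopping": 0.01}  # browsing-model prediction

print(combine(prior, text_evidence, domain_evidence, browse_evidence))
```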

  19. Findings
      Random guessing       7%
      Text Classification  25%
      + Domain Model       41%
      + Browsing Model     78%
      Conclusions

  20. Summary
      • Each technique alone (text classification, browsing model, or domain model) performs only fairly well (~25% classification)
      • Combining these techniques results in very good (~80%) classification rates
      • Future directions: larger datasets and newer text classification and user browsing models
      Applications
      • Newsgroups
        – Gather information from newsgroups and determine whether consumers are responding positively or negatively
      • E-mail
        – Scan e-mail text for similarities to known problems/topics
      • Better search engines
        – Instead of experts classifying pages, we can mine the information collected by ISPs and classify pages automatically
      • Adult filters
        – The US Appeals Court struck down the Children's Internet Protection Act on the grounds that the technology was inadequate
