Categorizing Web Viewership Using Statistical Models of Web Navigation and Text Classification
Alan L. Montgomery and Brett Gordon
Carnegie Mellon University
Marketing Science Conference, University of Alberta, Edmonton
28 June 2002

Outline
• Clickstream Example
– What topic is the user looking at on each page?
• Information Sources
– Dmoz.org classification
– Text classification
– User browsing model
• Results
• Conclusions

Clickstream Example
What topics is this user browsing on each of the following pages?

[Ten page screenshots in viewing order, each labeled with its true class]
Pages 1-3: True Class: Business
Pages 4-6: True Class: Sports
Pages 7-10: True Class: News

{Business} {Business} {Business} {Sports} {Sports} {Sports} {News} {News} {News} {News}

User Demographics
Sex: Male
Age: 22
Occupation: Student
Income: < $30,000
State: Pennsylvania
Country: U.S.A.

Information Sources

Data

Clickstream Data
• Panel of representative web users collected by Jupiter Media Metrix
• Sample of 30 randomly selected users who browsed during April 2002
– 38k URL viewings
– 13k unique URLs visited
– 1,550 domains
• Average user
– Views 1300 URLs
– Active for 9 hours/month

Classification Information
• Dmoz.org - Pages classified by human experts
• Page Content - Text classification algorithms from Comp. Sci./Inform. Retr.

Dmoz.org Categories
1. Arts
2. Business
3. Computers
4. Games
5. Health
6. Home
7. News
8. Recreation
9. Reference
10. Science
11. Shopping
12. Society
13. Sports
14. Adult

• Largest, most comprehensive human-edited directory of the web
• Constructed and maintained by volunteers (open-source), and original set donated by Netscape
• Used by Netscape, AOL, Google, Lycos, Hotbot, DirectHit, etc.
• Over 3m sites classified, 438k categories, 43k editors (Dec 2001)

Problem
• Web is very large and dynamic and only a fraction of pages can be classified
– 147m hosts (Jan 2002, Internet Domain Survey, isc.org)
– 1b+ (?) web pages
• Only a fraction of the web pages in our panel are categorized
– 1.3% of web pages are exactly categorized
– 7.3% categorized within one level
– 10% categorized within two levels
– 74% of pages have no classification information

Text Classification

Background
• Information Retrieval
– Overview (Baeza-Yates and Ribeiro-Neto 2000, Chakrabarti 2000)
– Naïve Bayes (Joachims 1997)
– Support Vector Machines (Vapnik 1995 and Joachims 1998)
– Feature Selection (Mladenic and Grobelnik 1998, Yang and Pedersen 1998)
– Latent Semantic Indexing
– Language Models (MacKay and Peto 1994)

[Screenshot of an example page. True Class: Sports]

Page Contents = HTML Code + Regular Text

Tokenization & Lexical Parsing
• HTML code is removed
• Punctuation is removed
• All words are converted to lowercase
• Stopwords are removed
– Common, non-informative words such as ‘the’, ‘and’, ‘with’, ‘an’, etc.
Determine the term frequency (TF) of each remaining unique word

Result: Document Vector
home      2
game      8
hit       4
runs      6
threw     2
ejected   1
baseball  5
major     2
league    2
bat       2
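The tokenization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the stopword list is a tiny stand-in, and the names `document_vector` and `_TextExtractor` are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "and", "with", "an", "a", "of", "in", "to", "at", "on"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def document_vector(html):
    """Return a Counter mapping each remaining word to its term frequency."""
    extractor = _TextExtractor()
    extractor.feed(html)   # strips the HTML markup
    extractor.close()
    text = " ".join(extractor.parts).lower()   # lowercase everything
    words = re.findall(r"[a-z]+", text)        # drops punctuation and digits
    return Counter(w for w in words if w not in STOPWORDS)


page = "<html><body><h1>The Game</h1><p>A hit, and the runs: game on!</p></body></html>"
vector = document_vector(page)
```

Here `vector` holds the term frequencies of the words that survive stopword removal, in the same form as the document vector on the slide.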

Classifying Document Vectors

Test Document
home 2, game 8, hit 4, runs 6, threw 2, ejected 1, baseball 5, major 2, league 2, bat 2

Which class does the test document belong to?

{News Class}      {Sports Class}    {Shopping Class}
bush 58           game 97           sale 87
congress 92       football 32       customer 28
tax 48            hit 45            cart 24
cynic 16          goal 84           game 16
politician 23     umpire 23         microsoft 31
forest 9          won 12            buy 93
major 3           league 58         order 75
world 29          baseball 39       pants 21
summit 31         soccer 21         nike 8
federal 64        runs 26           tax 19

Classifying Document Vectors

P( {News} | Test Doc) = 0.02
P( {Sports} | Test Doc) = 0.91
P( {Shopping} | Test Doc) = 0.07

Best match: {Sports Class}, P( {Sports} | Test Doc) = 0.91

Classification Model
• A document is a vector of term frequency (TF) values; each category has its own term distribution
• Words in a document are generated by a multinomial model of the term distribution in a given class:

$d \sim M\{n,\ p^c = (p_1^c, p_2^c, \ldots, p_{|V|}^c)\}$

• Classification:

$\arg\max_{c \in C} \{P(c \mid d)\} = \arg\max_{c \in C} \{P(c) \prod_{i=1}^{|V|} P(w_i \mid c)^{n_i^c}\}$

$|V|$: vocabulary size
$n_i^c$: # of times word $i$ appears in class $c$

Results
• 25% correct classification
• Compare with random guessing of 7%
• More advanced techniques perform slightly better:
– Shrinkage of word term frequencies (McCallum et al 1998)
– n-gram models
– Support Vector Machines
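The multinomial classification rule from the Classification Model slide can be sketched as follows, using the term counts from the Classifying Document Vectors example. Several things here are assumptions rather than the authors' settings: a uniform prior P(c), add-one smoothing for unseen words, and an exponent equal to each word's count in the test document (the usual multinomial naive Bayes form). The exact probabilities will therefore not necessarily match the slide.

```python
import math

# Class term counts copied from the "Classifying Document Vectors" slides.
CLASS_COUNTS = {
    "News": {"bush": 58, "congress": 92, "tax": 48, "cynic": 16, "politician": 23,
             "forest": 9, "major": 3, "world": 29, "summit": 31, "federal": 64},
    "Sports": {"game": 97, "football": 32, "hit": 45, "goal": 84, "umpire": 23,
               "won": 12, "league": 58, "baseball": 39, "soccer": 21, "runs": 26},
    "Shopping": {"sale": 87, "customer": 28, "cart": 24, "game": 16, "microsoft": 31,
                 "buy": 93, "order": 75, "pants": 21, "nike": 8, "tax": 19},
}

TEST_DOC = {"home": 2, "game": 8, "hit": 4, "runs": 6, "threw": 2,
            "ejected": 1, "baseball": 5, "major": 2, "league": 2, "bat": 2}


def posterior(doc, class_counts, alpha=1.0):
    """Return P(c | doc) for each class under multinomial naive Bayes.

    alpha is the add-alpha smoothing constant (an assumption, not from the slides).
    """
    vocab = set(doc)
    for counts in class_counts.values():
        vocab.update(counts)

    log_scores = {}
    for label, counts in class_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        score = math.log(1.0 / len(class_counts))  # uniform prior P(c) (assumption)
        for word, n in doc.items():
            # log of P(w_i | c) raised to the word's count in the document
            score += n * math.log((counts.get(word, 0) + alpha) / total)
        log_scores[label] = score

    # Normalize in log space to avoid underflow.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(weights.values())
    return {label: w / norm for label, w in weights.items()}


probs = posterior(TEST_DOC, CLASS_COUNTS)
best = max(probs, key=probs.get)
```

Working in log space keeps the product over the vocabulary from underflowing when documents are long.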

User Browsing Model

User Browsing Model
• Web browsing is “sticky” or persistent: users tend to view a series of pages within the same category and then switch to another topic
• Example: {News} {News} {News}

Markov Switching Model

| | arts | business | computers | games | health | home | news | recreation | reference | science | shopping | society | sports | adult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| arts | 83% | 4% | 5% | 2% | 1% | 2% | 6% | 3% | 2% | 6% | 2% | 3% | 4% | 1% |
| business | 3% | 73% | 5% | 3% | 2% | 3% | 6% | 2% | 3% | 3% | 3% | 2% | 3% | 2% |
| computers | 5% | 11% | 79% | 3% | 3% | 7% | 5% | 3% | 4% | 4% | 5% | 5% | 2% | 2% |
| games | 1% | 3% | 2% | 90% | 1% | 1% | 1% | 1% | 0% | 1% | 1% | 1% | 1% | 0% |
| health | 0% | 0% | 0% | 0% | 84% | 1% | 1% | 0% | 0% | 1% | 0% | 1% | 0% | 0% |
| home | 0% | 1% | 1% | 0% | 1% | 80% | 1% | 1% | 0% | 1% | 1% | 1% | 0% | 0% |
| news | 1% | 1% | 1% | 0% | 1% | 0% | 69% | 0% | 0% | 1% | 0% | 1% | 1% | 0% |
| recreation | 1% | 1% | 1% | 0% | 1% | 1% | 1% | 86% | 1% | 1% | 1% | 1% | 1% | 0% |
| reference | 0% | 1% | 1% | 0% | 1% | 0% | 1% | 0% | 85% | 2% | 0% | 1% | 1% | 0% |
| science | 1% | 0% | 0% | 0% | 1% | 1% | 1% | 0% | 1% | 75% | 0% | 1% | 0% | 0% |
| shopping | 1% | 3% | 2% | 1% | 1% | 2% | 1% | 1% | 0% | 1% | 86% | 1% | 1% | 0% |
| society | 1% | 1% | 2% | 0% | 2% | 1% | 3% | 1% | 2% | 2% | 0% | 82% | 1% | 1% |
| sports | 2% | 1% | 1% | 0% | 0% | 0% | 3% | 1% | 1% | 0% | 0% | 1% | 85% | 0% |
| adult | 1% | 1% | 1% | 0% | 0% | 0% | 1% | 0% | 0% | 0% | 0% | 1% | 0% | 93% |
| | 16% | 10% | 19% | 11% | 2% | 3% | 2% | 6% | 3% | 2% | 7% | 6% | 5% | 7% |

Pooled transition matrix, heterogeneity across users

Implications
• Suppose we have the following sequence: ? {News} {News}
• Using Bayes' rule, we can determine that there is a 97% probability of news (unconditional = 2%, conditional on last observation = 69%)
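The Bayes' rule step on the Implications slide can be illustrated with a one-step calculation from the table. This sketch rests on two assumptions about how the table is read: each column conditions on the current category (the columns sum to roughly 100%), and the final unlabeled row gives the unconditional category shares (its news entry is 2%, matching the slide). It is a simplified calculation and is not meant to reproduce the 97% figure quoted on the slide.

```python
CATEGORIES = ["arts", "business", "computers", "games", "health", "home", "news",
              "recreation", "reference", "science", "shopping", "society",
              "sports", "adult"]

# Final row of the table, in percent, read as unconditional shares (assumption).
UNCONDITIONAL = [16, 10, 19, 11, 2, 3, 2, 6, 3, 2, 7, 6, 5, 7]

# "news" row of the table, in percent, read as
# P(next page = news | current page = column category) (assumption).
P_NEWS_GIVEN_CURRENT = [1, 1, 1, 0, 1, 0, 69, 0, 0, 1, 0, 1, 1, 0]


def posterior_current(prior, likelihood):
    """Bayes' rule: P(current = c | next = news) is proportional to
    P(next = news | current = c) * P(current = c)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


post = dict(zip(CATEGORIES, posterior_current(UNCONDITIONAL, P_NEWS_GIVEN_CURRENT)))
most_likely = max(post, key=post.get)
```

Because both inputs are in percent, the scaling cancels in the normalization. Even this one-step version moves the unknown page from a 2% unconditional share to news being the most likely category.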

Results

Methodology
Bayesian setup to combine information from:
• Known categories based on exact matches
• Text classification
• Markov Model of User Browsing
– Introduce heterogeneity by assuming that conditional transition probability vectors are drawn from a Dirichlet distribution
• Similarity of other pages in the same domain
– Assume that the category of each page within a domain follows a Dirichlet distribution, so if we are at a “news” site then pages are more likely to be classified as “news”
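One simple way to see how page-level evidence and a browsing model can be combined is hidden Markov model smoothing (the forward-backward algorithm). This is a swapped-in, simplified technique for illustration, not the authors' Bayesian setup: it uses a single fixed transition matrix with no Dirichlet heterogeneity and no domain model. The three-category matrix, the evidence values, and the name `forward_backward` are all hypothetical.

```python
def normalize(values):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(values)
    return [v / total for v in values]


def forward_backward(prior, transition, evidence):
    """Posterior category probabilities for every page in a browsing sequence.

    prior[k]         : P(first page is category k)
    transition[j][k] : P(next page is k | current page is j)
    evidence[t][k]   : per-page likelihood of category k, e.g. a text
                       classifier score, or a one-hot vector for an exact match
    """
    n_states, n_pages = len(prior), len(evidence)

    # Forward pass: filtered probabilities using pages up to t.
    alpha = [normalize([prior[k] * evidence[0][k] for k in range(n_states)])]
    for t in range(1, n_pages):
        predicted = [sum(alpha[-1][j] * transition[j][k] for j in range(n_states))
                     for k in range(n_states)]
        alpha.append(normalize([predicted[k] * evidence[t][k]
                                for k in range(n_states)]))

    # Backward pass: information from pages after t.
    beta = [[1.0] * n_states for _ in range(n_pages)]
    for t in range(n_pages - 2, -1, -1):
        beta[t] = normalize([
            sum(transition[j][k] * evidence[t + 1][k] * beta[t + 1][k]
                for k in range(n_states))
            for j in range(n_states)])

    return [normalize([a * b for a, b in zip(alpha[t], beta[t])])
            for t in range(n_pages)]


# Toy example with categories [business, sports, news] and "sticky" browsing.
PRIOR = [1 / 3, 1 / 3, 1 / 3]
TRANSITION = [[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]]
# Sequence "? {News} {News}": no information on page 1, exact matches afterwards.
EVIDENCE = [[1 / 3, 1 / 3, 1 / 3],
            [0.0, 0.0, 1.0],
            [0.0, 0.0, 1.0]]

posteriors = forward_backward(PRIOR, TRANSITION, EVIDENCE)
```

In this toy run the unclassified first page is pulled toward news purely by the pages that follow it, which is the intuition behind adding the browsing model to text classification.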

Findings
Random guessing        7%
Text Classification    25%
+ Domain Model         41%
+ Browsing Model       78%

Conclusions

Summary
• Each technique (text classification, browsing model, or domain model) performs only fairly well (~25% classification)
• Combining these techniques together results in very good (~80%) classification rates
• Future directions: larger datasets and newer text classification and user browsing models

Applications
• Newsgroups
– Gather information from newsgroups and determine whether consumers are responding positively or negatively
• E-mail
– Scan e-mail text for similarities to known problems/topics
• Better search engines
– Instead of experts classifying pages we can mine the information collected by ISPs and classify it automatically
• Adult filters
– US Appeals Court struck down Children’s Internet Protection Act on the grounds that technology was inadequate
