Categorizing Web Viewership Using Statistical Models of Web Navigation and Text Classification

Alan L. Montgomery and Brett Gordon

Carnegie Mellon University

Marketing Science Conference, University of Alberta, Edmonton, 28 June 2002

Outline

  • Clickstream Example

– What topic is the user looking at on each page?

  • Information Sources

– Dmoz.org classification
– Text classification
– User browsing model

  • Results
  • Conclusions

Clickstream Example

What topics is this user browsing on each of the following pages?

[Screenshots of ten pages from one user's session, shown in viewing order with their true classes]

  • Pages 1-3: True Class: Business
  • Pages 4-6: True Class: Sports
  • Pages 7-10: True Class: News

User Demographics

Sex: Male
Age: 22
Occupation: Student
Income: < $30,000
State: Pennsylvania
Country: U.S.A.

{Business} {Business} {Business} {Sports} {Sports} {Sports} {News} {News} {News} {News}

Information Sources

Data

Clickstream Data

  • Panel of representative web users collected by Jupiter Media Metrix
  • Sample of 30 randomly selected users who browsed during April 2002

– 38k URL viewings
– 13k unique URLs visited
– 1,550 domains

  • Average user

– Views 1300 URLs
– Active for 9 hours/month

Classification Information

  • Dmoz.org - Pages classified by human experts
  • Page Content - Text classification algorithms from Comp. Sci./Inform. Retr.

Dmoz.org

  • Largest, most comprehensive human-edited directory of the web
  • Constructed and maintained by volunteers (open-source); original set donated by Netscape
  • Used by Netscape, AOL, Google, Lycos, Hotbot, DirectHit, etc.
  • Over 3m sites classified, 438k categories, 43k editors (Dec 2001)

Categories

1. Arts
2. Business
3. Computers
4. Games
5. Health
6. Home
7. News
8. Recreation
9. Reference
10. Science
11. Shopping
12. Society
13. Sports
14. Adult

Problem

  • Web is very large and dynamic and only a fraction of pages can be classified

– 147m hosts (Jan 2002, Internet Domain Survey, isc.org)
– 1b+ (?) web pages

  • Only a fraction of the web pages in our panel are categorized

– 1.3% of web pages are exactly categorized
– 7.3% categorized within one level
– 10% categorized within two levels
– 74% of pages have no classification information

Text Classification

Background

  • Information Retrieval

– Overview (Baeza-Yates and Ribeiro-Neto 2000, Chakrabarti 2000)
– Naïve Bayes (Joachims 1997)
– Support Vector Machines (Vapnik 1995 and Joachims 1998)
– Feature Selection (Mladenic and Grobelnik 1998, Yang and Pedersen 1998)
– Latent Semantic Indexing
– Language Models (MacKay and Peto 1994)

True Class: Sports

Page Contents = HTML Code + Regular Text

Tokenization & Lexical Parsing

  • HTML code is removed
  • Punctuation is removed
  • All words are converted to lowercase
  • Stopwords are removed

– Common, non-informative words such as ‘the’, ‘and’, ‘with’, ‘an’, etc…

Determine the term frequency (TF) of each remaining unique word

Result: Document Vector

Term       TF
home       2
game       8
hit        4
runs       6
threw      2
ejected    1
baseball   5
major      2
league     2
bat        2
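
The parsing steps on the Tokenization & Lexical Parsing slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the regular expressions, the tiny `STOPWORDS` set, and the function name `document_vector` are assumptions made for the example, and a production system would likely use a real HTML parser and a fuller stopword list.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"the", "and", "with", "an", "a", "of", "in", "to"}


def document_vector(html):
    """Turn raw page contents into a term-frequency (TF) vector."""
    text = re.sub(r"<[^>]+>", " ", html)    # 1. remove HTML tags (crude)
    text = re.sub(r"[^\w\s]", " ", text)    # 2. remove punctuation
    words = text.lower().split()            # 3. convert to lowercase
    kept = [w for w in words if w not in STOPWORDS]  # 4. remove stopwords
    return Counter(kept)                    # 5. count each remaining word


example = document_vector("<p>The game: a home run, and the game ended!</p>")
```

Here `example` holds a count of 2 for `game` and 1 each for `home`, `run`, and `ended`.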

Classifying Document Vectors

Test Document: home 2, game 8, hit 4, runs 6, threw 2, ejected 1, baseball 5, major 2, league 2, bat 2

{News Class}: bush 58, congress 92, tax 48, cynic 16, politician 23, forest 9, major 3, world 29, summit 31, federal 64
{Sports Class}: game 97, football 32, hit 45, goal 84, umpire 23, won 12, league 58, baseball 39, soccer 21, runs 26
{Shopping Class}: sale 87, customer 28, cart 24, game 16, microsoft 31, buy 93, order 75, pants 21, nike 8, tax 19

Which class does the test document belong to: {News Class}, {Sports Class}, or {Shopping Class}?

P( {News} | Test Doc) = 0.02
P( {Sports} | Test Doc) = 0.91
P( {Shopping} | Test Doc) = 0.07

The test document is assigned to {Sports Class}: P( {Sports} | Test Doc) = 0.91

Classification Model

  • A document is a vector of term frequency (TF) values; each category has its own term distribution
  • Words in a document are generated by a multinomial model of the term distribution in a given class:

$$ d \sim M\{n,\; p_c = (p_{1c}, p_{2c}, \ldots, p_{|V|c})\} $$

  • Classification:

$$ \arg\max_{c \in C} \{P(c \mid d)\} = \arg\max_{c \in C} \Big\{ P(c) \prod_{i=1}^{|V|} P(w_i \mid c)^{n_{ic}} \Big\} $$

|V|: vocabulary size
n_{ic}: # of times word i appears in class c
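
As a rough illustration of the multinomial classification rule, the sketch below scores the test document from the Classifying Document Vectors slides against the three class vectors shown there. Several choices are assumptions not stated on the slides: a uniform prior P(c), add-one (Laplace) smoothing for unseen words, the use of the test document's own word counts as exponents, and the function name `class_posteriors`. Because it uses only the ten terms printed per class, it is not expected to reproduce the slide's 0.02 / 0.91 / 0.07 exactly.

```python
import math

# Term counts per class, copied from the slides.
CLASS_COUNTS = {
    "news": {"bush": 58, "congress": 92, "tax": 48, "cynic": 16, "politician": 23,
             "forest": 9, "major": 3, "world": 29, "summit": 31, "federal": 64},
    "sports": {"game": 97, "football": 32, "hit": 45, "goal": 84, "umpire": 23,
               "won": 12, "league": 58, "baseball": 39, "soccer": 21, "runs": 26},
    "shopping": {"sale": 87, "customer": 28, "cart": 24, "game": 16, "microsoft": 31,
                 "buy": 93, "order": 75, "pants": 21, "nike": 8, "tax": 19},
}
TEST_DOC = {"home": 2, "game": 8, "hit": 4, "runs": 6, "threw": 2,
            "ejected": 1, "baseball": 5, "major": 2, "league": 2, "bat": 2}


def class_posteriors(doc, class_counts, alpha=1.0):
    """Return P(c | doc) under a smoothed multinomial model with a uniform prior."""
    vocab = set(doc)
    for counts in class_counts.values():
        vocab.update(counts)
    log_scores = {}
    for label, counts in class_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        score = math.log(1.0 / len(class_counts))  # uniform P(c): an assumption
        for word, n in doc.items():
            # n * log P(w_i | c), with add-alpha smoothing for unseen words
            score += n * math.log((counts.get(word, 0) + alpha) / denom)
        log_scores[label] = score
    top = max(log_scores.values())  # subtract the max for numerical stability
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


posteriors = class_posteriors(TEST_DOC, CLASS_COUNTS)
predicted = max(posteriors, key=posteriors.get)
```

With these inputs `predicted` is `"sports"`, in line with the class the slides assign to the test document.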

Results

  • 25% correct classification
  • Compare with random guessing of 7% (1 in 14 categories)
  • More advanced techniques perform slightly better:

– Shrinkage of word term frequencies (McCallum et al. 1998)
– n-gram models
– Support Vector Machines

User Browsing Model

  • Web browsing is “sticky” or persistent: users tend to view a series of pages within the same category and then switch to another topic

  • Example:

{News} {News} {News}

Markov Switching Model

Columns follow the same category order as the rows (abbreviated); the final row is unlabeled on the slide.

              arts   bus  comp games  hlth  home  news   rec   ref   sci  shop   soc sport adult
arts           83%    4%    5%    2%    1%    2%    6%    3%    2%    6%    2%    3%    4%    1%
business        3%   73%    5%    3%    2%    3%    6%    2%    3%    3%    3%    2%    3%    2%
computers       5%   11%   79%    3%    3%    7%    5%    3%    4%    4%    5%    5%    2%    2%
games           1%    3%    2%   90%    1%    1%    1%    1%    0%    1%    1%    1%    1%    0%
health          0%    0%    0%    0%   84%    1%    1%    0%    0%    1%    0%    1%    0%    0%
home            0%    1%    1%    0%    1%   80%    1%    1%    0%    1%    1%    1%    0%    0%
news            1%    1%    1%    0%    1%    0%   69%    0%    0%    1%    0%    1%    1%    0%
recreation      1%    1%    1%    0%    1%    1%    1%   86%    1%    1%    1%    1%    1%    0%
reference       0%    1%    1%    0%    1%    0%    1%    0%   85%    2%    0%    1%    1%    0%
science         1%    0%    0%    0%    1%    1%    1%    0%    1%   75%    0%    1%    0%    0%
shopping        1%    3%    2%    1%    1%    2%    1%    1%    0%    1%   86%    1%    1%    0%
society         1%    1%    2%    0%    2%    1%    3%    1%    2%    2%    0%   82%    1%    1%
sports          2%    1%    1%    0%    0%    0%    3%    1%    1%    0%    0%    1%   85%    0%
adult           1%    1%    1%    0%    0%    0%    1%    0%    0%    0%    0%    1%    0%   93%
(unlabeled)    16%   10%   19%   11%    2%    3%    2%    6%    3%    2%    7%    6%    5%    7%

Pooled transition matrix, heterogeneity across users
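
One way to read the diagonal of the pooled matrix: if browsing followed a first-order Markov chain with these probabilities, the number of consecutive pages viewed in one category would be geometrically distributed with mean 1 / (1 - p_stay). The sketch below applies that formula to the diagonal entries. The helper name `expected_run_length` is hypothetical, and the calculation ignores the user heterogeneity the slide mentions.

```python
# Diagonal of the pooled transition matrix: probability of staying in a category.
STAY_PROBABILITY = {
    "arts": 0.83, "business": 0.73, "computers": 0.79, "games": 0.90,
    "health": 0.84, "home": 0.80, "news": 0.69, "recreation": 0.86,
    "reference": 0.85, "science": 0.75, "shopping": 0.86, "society": 0.82,
    "sports": 0.85, "adult": 0.93,
}


def expected_run_length(p_stay):
    """Mean number of consecutive pages in a category (geometric distribution)."""
    return 1.0 / (1.0 - p_stay)


run_lengths = {c: expected_run_length(p) for c, p in STAY_PROBABILITY.items()}
stickiest = max(run_lengths, key=run_lengths.get)
least_sticky = min(run_lengths, key=run_lengths.get)
```

Under this simplification news is the least sticky category (about 3.2 pages per run) and adult the stickiest (about 14.3 pages per run).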

Implications

  • Suppose we have the following sequence:

{News} ? {News}

  • Using Bayes' Rule, we can determine that there is a 97% probability that the unknown page is news (unconditional = 2%, conditional on last observation = 69%)
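
The Bayes' Rule step can be sketched as follows: the probability that the middle page belongs to category k is proportional to P(k | previous = news) × P(next = news | k). The code uses only the rounded news row and news column of the pooled matrix from the Markov Switching Model slide. Because the unknown page sits between two news pages, the product is the same whichever way the matrix is oriented. The function name `middle_page_posterior` is hypothetical. With these rounded pooled values the sketch gives roughly 99% for news rather than the 97% on the slide, which presumably reflects the authors' own estimates.

```python
CATEGORIES = ["arts", "business", "computers", "games", "health", "home", "news",
              "recreation", "reference", "science", "shopping", "society",
              "sports", "adult"]

# Percentages copied from the pooled transition matrix.
NEWS_COLUMN = [6, 6, 5, 1, 1, 1, 69, 1, 1, 1, 1, 3, 3, 1]
NEWS_ROW = [1, 1, 1, 0, 1, 0, 69, 0, 0, 1, 0, 1, 1, 0]


def middle_page_posterior(into_middle, out_of_middle):
    """Posterior over the unknown middle page given both neighbouring pages."""
    weights = [a * b for a, b in zip(into_middle, out_of_middle)]
    total = sum(weights)
    return [w / total for w in weights]


posterior = middle_page_posterior(NEWS_COLUMN, NEWS_ROW)
p_news = posterior[CATEGORIES.index("news")]
```

The same function could be given the row and column of any other category to handle sequences such as {Sports} ? {Sports}.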

Results

Methodology

Bayesian setup to combine information from:

  • Known categories based on exact matches
  • Text classification
  • Markov Model of User Browsing

– Introduce heterogeneity by assuming that conditional transition probability vectors are drawn from a Dirichlet distribution

  • Similarity of other pages in the same domain

– Assume that the category of each page within a domain follows a Dirichlet distribution, so if we are at a “news” site then pages are more likely to be classified as “news”
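
The slides do not spell out the full Bayesian model, so the following is only a toy sketch of how such sources might be combined. It assumes the sources are conditionally independent, multiplies their category scores, and renormalizes. The domain evidence is summarized by the posterior mean of a symmetric Dirichlet prior over category shares. All numbers, the three-category setup, and the function names `dirichlet_mean` and `combine` are hypothetical.

```python
def dirichlet_mean(counts, alpha=1.0):
    """Posterior mean of category shares under a symmetric Dirichlet(alpha) prior."""
    total = sum(counts.values()) + alpha * len(counts)
    return {c: (n + alpha) / total for c, n in counts.items()}


def combine(*sources):
    """Multiply per-category scores from each source, then normalize to sum to 1."""
    categories = list(sources[0])
    product = {c: 1.0 for c in categories}
    for source in sources:
        for c in categories:
            product[c] *= source[c]
    total = sum(product.values())
    return {c: v / total for c, v in product.items()}


# Hypothetical inputs for one uncategorized page.
text_scores = {"news": 0.35, "sports": 0.40, "shopping": 0.25}     # text classifier
browsing_scores = {"news": 0.70, "sports": 0.20, "shopping": 0.10}  # previous page was news
domain_counts = {"news": 8, "sports": 1, "shopping": 0}            # known pages in this domain

combined = combine(text_scores, browsing_scores, dirichlet_mean(domain_counts))
best = max(combined, key=combined.get)
```

In this made-up case the text classifier alone slightly favors sports, but the browsing and domain evidence shift the combined estimate to news.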

Findings

Method                  Correct classification
Random guessing         7%
Text Classification     25%
+ Domain Model          41%
+ Browsing Model        78%

Conclusions

Summary

  • Each technique (text classification, browsing model, or domain model) performs only fairly well (~25% classification)
  • Combining these techniques together results in very good (~80%) classification rates
  • Future directions: larger datasets and newer text classification and user browsing models

Applications

  • Newsgroups

– Gather information from newsgroups and determine whether consumers are responding positively or negatively

  • E-mail

– Scan e-mail text for similarities to known problems/topics

  • Better search engines

– Instead of experts classifying pages we can mine the information collected by ISPs and classify it automatically

  • Adult filters

– US Appeals Court struck down Children’s Internet Protection Act on the grounds that technology was inadequate