Intersection Graphs for Text Analysis


SLIDE 1

Intersection Graphs for Text Analysis

Elizabeth Leeds, leedsem@nswc.navy.mil
David Marchette, marchettedj@nswc.navy.mil
Naval Surface Warfare Center, Code B10

Interface 2004 – p.1/16

SLIDE 2

Overview

  • bag-of-words approach to document encoding
  • word weighting by mutual information
      • only “important” words are kept
  • intersection graphs are used to analyze document relationships
      • each document is a vertex
      • an edge exists between two documents if they share important words

SLIDE 3

Mutual Information

Let $c_{wd}$ be the number of times that the word $w$ has occurred in the document $d$, and let $N_C$ be the total number of words (counting duplicates) in the corpus $C$. Let $f_{wd} = c_{wd} / N_C$. Then the mutual information between document $d$ and word $w$ is given by

$$m_C(w,d) = \log \frac{f_{wd}}{\sum_{k \in C} f_{wk} \, \sum_{v} f_{vd}} \qquad (1)$$

Let $N_d$ be the number of words (counting duplicates) in document $d$. Let $c_w$ be the number of times that the word $w$ appears in the corpus $C$. Since $\sum_{k \in C} f_{wk} = c_w / N_C$ and $\sum_{v} f_{vd} = N_d / N_C$, equation (1) reduces to

$$m_C(w,d) = \log \frac{c_{wd} / N_d}{c_w / N_C}$$

SLIDE 4

Mutual Information - Summary

  • $c_{wd}$: the number of times that the word $w$ appears in the document $d$.
  • $c_w$: the number of times that the word $w$ appears in the corpus $C$.
  • $N_d$: the number of words (counting duplicates) in document $d$.
  • $N_C$: the total number of words (counting duplicates) in the corpus $C$.

$$m_C(w,d) = \log \frac{c_{wd} / N_d}{c_w / N_C} \qquad (2)$$
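
A minimal Python sketch of the summary formula, reading the weight as the log of a word's relative frequency in the document over its relative frequency in the corpus. The function name, the toy documents, and the use of the natural logarithm are assumptions made for illustration; the slides do not state the base of the logarithm.

```python
import math
from collections import Counter


def mutual_information(docs):
    """Return one {word: weight} dict per document.

    docs is a list of token lists. The weight of word w in document d is
    log((c_wd / N_d) / (c_w / N_C)), with counts that include duplicates.
    """
    doc_counts = [Counter(tokens) for tokens in docs]  # c_wd for each document
    corpus_counts = Counter()                          # c_w over the corpus
    for counts in doc_counts:
        corpus_counts.update(counts)
    n_corpus = sum(corpus_counts.values())             # N_C

    weights = []
    for counts in doc_counts:
        n_doc = sum(counts.values())                   # N_d
        weights.append({
            w: math.log((c / n_doc) / (corpus_counts[w] / n_corpus))
            for w, c in counts.items()
        })
    return weights


# Hypothetical two-document corpus.
docs = [
    "the galaxy and the star".split(),
    "the tribe and the ritual".split(),
]
mi = mutual_information(docs)
```

In this toy corpus "the" is equally frequent in the first document and in the corpus, so its weight is zero, while "galaxy" occurs only in the first document and gets a positive weight.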

SLIDE 5

Intersection Graphs and the KSS Random Intersection Graph

$G = (V, E)$ is an intersection graph if a set $S_v$ can be assigned to each vertex $v \in V$ so that $vw \in E$ exactly when $S_v \cap S_w \neq \emptyset$.

To define a random intersection graph, let $p \in [0, 1]$ and let $M = \{1, 2, \ldots, m\}$. Define $n$ random subsets $S_i$, $i = 1, \ldots, n$, of the set $M$, where each element of $M$ is selected for the subset $S_i$ with probability $p$. Then $G(n, m, p)$ is the intersection graph of the sets $S_i$.

Karonski, Scheinerman, Singer-Cohen (1999). On Random Intersection Graphs: The Subgraph Problem. Combinatorics, Probability and Computing, Vol. 8, pp. 131-159.
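
The random intersection graph construction can be sketched in a few lines of Python. This is an illustration under the usual reading of the model (n vertices, a ground set of m elements, independent inclusion with probability p); the function name and the `seed` parameter are assumptions, not part of the slides.

```python
import random
from itertools import combinations


def random_intersection_graph(n, m, p, seed=None):
    """Sample n random subsets of {1, ..., m} and join intersecting pairs."""
    rng = random.Random(seed)
    universe = range(1, m + 1)
    # Each element enters each subset independently with probability p.
    sets = [{x for x in universe if rng.random() < p} for _ in range(n)]
    # Vertices i and j are adjacent exactly when their subsets intersect.
    edges = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if sets[i] & sets[j]
    }
    return sets, edges


sets, edges = random_intersection_graph(n=6, m=10, p=0.2, seed=0)
```

With p = 1 every subset equals the full ground set and the graph is complete; with p = 0 every subset is empty and the graph has no edges.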

SLIDE 6

Thresholding

For each document (vertex) we have a set of words, with each word assigned a weight. Let $S_j$ be the set of words contained in document $j$. Let $\{m_1, m_2, \ldots, m_{|S_j|}\}$ be the ordered set containing the weights for each word in $S_j$.

Consider two types of thresholding:

$$T_1(m, t) = \begin{cases} 0 & m < t \\ m & m \ge t \end{cases} \qquad T_2(m, t) = \begin{cases} 0 & m < t \\ 1 & m \ge t \end{cases}$$

SLIDE 7

Defining Edges

Under the KSS model, $v_i v_j \in E$ if $S_i \cap S_j \neq \emptyset$. Modify this by taking $v_i v_j \in E$ if:

[edge conditions not recoverable from the extracted slide text]

SLIDE 8

Procedure

word        weight
a             0.03
about        -0.26
abstract      4.22
accent        5.83
...            ...
word          1.52
would        -0.26
year          0.50
young         2.79
yowlumni      5.83

[Figure: intersection graph. Graph size = 500; mutual information threshold = 1; 141 edges between classes. Classes: ANTHRO, ASTRO, BEHAVIOR, EARTH, LIFE, MATH&COMP, MEDICINE, PHYSICS.]

SLIDE 9

Intersection Graph

[Figure: intersection graph. Graph size = 500; mutual information threshold = 1; 141 edges between classes. Classes: ANTHRO, ASTRO, BEHAVIOR, EARTH, LIFE, MATH&COMP, MEDICINE, PHYSICS.]

  • vertices are documents
  • threshold determines which words are important
  • edge between documents that share important words
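
The construction on this slide can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes a word counts as important when its weight is at least the threshold, and the function name, weights, and class labels are made up for the example.

```python
from itertools import combinations


def intersection_graph(doc_weights, threshold):
    """Map each edge (i, j) to the important words the two documents share."""
    # Assumption: a word is "important" when its weight is >= threshold.
    important = [
        {w for w, m in weights.items() if m >= threshold}
        for weights in doc_weights
    ]
    edges = {}
    for i, j in combinations(range(len(important)), 2):
        shared = important[i] & important[j]
        if shared:
            edges[(i, j)] = shared
    return edges


# Hypothetical per-document word weights and class labels.
doc_weights = [
    {"the": 0.0, "galaxy": 2.1, "orbit": 1.4},
    {"the": 0.1, "galaxy": 1.8, "telescope": 2.5},
    {"the": 0.0, "ritual": 2.2, "orbit": 1.1},
]
labels = ["ASTRO", "ASTRO", "ANTHRO"]

edges = intersection_graph(doc_weights, threshold=1.0)
between_classes = sum(1 for i, j in edges if labels[i] != labels[j])
```

Keeping the shared words on each edge makes it easy to see which vocabulary is responsible for a link between documents of different classes.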

SLIDE 10

Mutual Information

  • The weight is based on the frequency of the word in the document compared to the frequency of the word in other documents in the corpus.
  • Words that are important have large weights.
  • Throw out words with small weights.
      • Reduces dimensionality.
      • Reduces the noise.
  • What does "important" mean in terms of the mutual information?
  • Use graphs to select threshold value defining importance.
  • This is different than the usual stopper list.
      • Document/corpus dependent stopper list.
      • Requires no knowledge of the language.

SLIDE 11

Using Mutual Information to Threshold

[Figure: fraction of edges out of class versus MI threshold, for graph sizes 300, 400, and 500.]
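
A curve of this kind could be computed along the following lines. This is a hedged sketch: the function name, the toy weights and labels, the ">= threshold" rule for important words, and the choice to return `None` when the graph has no edges are all assumptions.

```python
from itertools import combinations


def fraction_out_of_class(doc_weights, labels, threshold):
    """Fraction of edges joining documents with different class labels."""
    kept = [
        {w for w, m in weights.items() if m >= threshold}
        for weights in doc_weights
    ]
    edges = [
        (i, j)
        for i, j in combinations(range(len(kept)), 2)
        if kept[i] & kept[j]
    ]
    if not edges:
        return None  # the fraction is undefined for an empty graph
    out = sum(1 for i, j in edges if labels[i] != labels[j])
    return out / len(edges)


# Hypothetical per-document word weights and class labels.
doc_weights = [
    {"the": 0.0, "galaxy": 2.1, "orbit": 1.4},
    {"the": 0.1, "galaxy": 1.8, "telescope": 2.5},
    {"the": 0.0, "ritual": 2.2, "orbit": 1.1},
]
labels = ["ASTRO", "ASTRO", "ANTHRO"]

curve = {
    t: fraction_out_of_class(doc_weights, labels, t)
    for t in (-1.0, 1.0, 1.5, 2.0)
}
```

In this toy example a low threshold keeps the common word "the", which links every pair of documents, so most edges cross classes; raising the threshold removes those edges first.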

SLIDE 12

Adding Documents to the Corpus

  • Add a new set of documents to the corpus.
  • The weights on (importance of) the words in the original documents will change.
  • What does the intersection graph tell us about this change?
  • How can we use documents or sets of documents to force connections in the intersection graph?
  • Mathematically, a new set of documents changes the weight on a word by the same amount across all original documents.

SLIDE 13

Adding Documents to the Corpus

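Let the document $d$ be in the corpus $C$. Suppose we add a new set of documents, $C'$, to $C$ and measure the change of $m_C(w,d)$ under this change in corpus. The change in the mutual information of word $w$ in document $d$ under the addition of the set of documents $C'$ is

$$\Delta_{C'} m_C(w,d) = m_{C \cup C'}(w,d) - m_C(w,d) = \log \frac{c_{wd} / N_d}{c_{w,C \cup C'} / N_{C \cup C'}} - \log \frac{c_{wd} / N_d}{c_{w,C} / N_C} = \log \frac{c_{w,C} / N_C}{c_{w,C \cup C'} / N_{C \cup C'}} \qquad (3)$$

Here $c_{wd}$ is the number of times $w$ appears in $d$, $N_d$ is the number of words in $d$, and $c_{w,X}$ and $N_X$ are the number of times $w$ appears in the corpus $X$ and the total number of words in $X$.

The change in the mutual information for the word $w$ does not depend on the document $d$.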
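
The claim that the change does not depend on the document can be checked numerically. The sketch below uses made-up documents and a hypothetical helper `mi`, with the natural logarithm assumed; it computes the change in weight of one word in two different documents when a third document is added to the corpus.

```python
import math


def mi(word, doc, corpus):
    """log((c_wd / N_d) / (c_w / N_C)) for a token-list document and corpus."""
    c_wd = doc.count(word)                        # occurrences in the document
    n_d = len(doc)                                # document length
    c_w = sum(d.count(word) for d in corpus)      # occurrences in the corpus
    n_c = sum(len(d) for d in corpus)             # corpus length
    return math.log((c_wd / n_d) / (c_w / n_c))


# Hypothetical original corpus and added documents.
d1 = "star orbit star data".split()
d2 = "ritual data star tribe tribe".split()
corpus = [d1, d2]
added = ["data model data proof".split()]
bigger = corpus + added

delta1 = mi("data", d1, bigger) - mi("data", d1, corpus)
delta2 = mi("data", d2, bigger) - mi("data", d2, corpus)
```

Both differences equal log((2/9) / (4/13)): the word "data" occurs 2 times among 9 words before the addition and 4 times among 13 words after it, and the document-specific terms cancel.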

SLIDE 14

Adding Documents to the Corpus

word        weight
a             0.03
about        -0.26
abstract      4.22
accent        5.83
...            ...
word          1.52
would        -0.26
year          0.50
young         2.79
yowlumni      5.83

[Figure: intersection graph. Graph size = 300; mutual information threshold = 0.5; 31 edges between classes. Classes: ANTHRO, ASTRO.]

SLIDE 16

Adding Documents to the Corpus

word        weight   new weight
a             0.03         0.02
about        -0.26        -0.14
abstract      4.22         4.76
accent        5.83         6.15
...            ...          ...
word          1.52         4.23
would        -0.26         0.03
year          0.50         2.67
young         2.79         4.12
yowlumni      5.83         6.24

[Figure: intersection graph. Graph size = 300; mutual information threshold = 0.5; 31 edges between classes. Classes: ANTHRO, ASTRO.]

SLIDE 17

Adding Documents to the Corpus

word        weight   new weight
a             0.03         0.02
about        -0.26        -0.14
abstract      4.22         4.76
accent        5.83         6.15
...            ...          ...
word          1.52         4.23
would        -0.26         0.03
year          0.50         2.67
young         2.79         4.12
yowlumni      5.83         6.24

[Figure: intersection graph. Graph size = 300; mutual information threshold = 0.5; 27 edges between classes. Classes: ANTHRO, ASTRO, MATH&COMP.]

SLIDE 18

Adding Documents to the Corpus

[Figure: fraction of edges out of class versus MI threshold, for four corpora: ANTHRO, ASTRO; ANTHRO, ASTRO, MED; ANTHRO, ASTRO, MED, EARTH; all 8 classes.]

SLIDE 19

Future Work

  • Optimal threshold based on the "size" of the corpus
  • Unsupervised case
  • Creating Random Documents
  • Spectral Graph Analysis
