Distilling Conceptual Connections from MeSH Co-Occurrences



slide-1
SLIDE 1

Distilling Conceptual Connections from MeSH Co-Occurrences

Padmini Srinivasan, Dimitar Hristovski presented @ MEDINFO 2004

15 February 2005 INLS 279 Bioinformatics Seminar Patrick Herron SILS, UNC Chapel Hill

slide-2
SLIDE 2

Goals

  • Analyze MeSH heading/subheading pair co-occurrences, e.g., (diabetes / drug therapy) with (chemical / therapeutic use), where each side is a (heading or concept / subheading) pair
  • Interestingness: select semantically meaningful ones (via chi-sq) that are relatively domain independent
  • Develop a “reasonable representation” of each pair: “a weighted vector […] [emphasizing] verb based functional aspects of the underlying semantics”
  • The ultimate goal: such pairs may aid in generating connections across disciplines across all of MEDLINE

slide-3
SLIDE 3

Reducing the problem space: using the SN

  • In order to reduce the scale of the problem space…

20,742 concepts * 82 subheadings = 1.7 mil heading/subheading pairs; (1.7 mil)^2/2 possible pairings of those = about 1.5 trillion

  • and in order to be as domain-independent as possible…
  • Represent concepts by their semantic types in the SN (the UMLS Semantic Network)
  • 20,742 concepts -> 134 semantic types, used as (semantic type/subheading)
  • Problem space down to (134*82)^2/2 = 60 mil (about four orders of magnitude smaller)
  • Only 1 mil of that 60 mil is meaningful
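
The arithmetic on this slide can be checked with a short sketch. The three counts come from the slide; the constant and function names are illustrative only, and the n^2/2 approximation is the one the slide uses.

```python
# Counts taken from the slide; names are illustrative only.
N_CONCEPTS = 20_742      # MeSH concepts (headings)
N_SUBHEADINGS = 82       # MeSH subheadings
N_SEMANTIC_TYPES = 134   # semantic types in the SN


def pairings(n_units: int) -> int:
    """Approximate number of pairings among n_units items, n^2 / 2 as on the slide."""
    return n_units * n_units // 2


heading_units = N_CONCEPTS * N_SUBHEADINGS      # heading/subheading units
st_units = N_SEMANTIC_TYPES * N_SUBHEADINGS     # semantic type/subheading units

full_space = pairings(heading_units)
reduced_space = pairings(st_units)

print(f"heading/subheading units: {heading_units:,}")
print(f"pairings before reduction: {full_space:.2e}")
print(f"st/sh units: {st_units:,}")
print(f"pairings after reduction: {reduced_space:,}")
print(f"reduction factor: {full_space // reduced_space:,}")
```

The number of units drops by about two orders of magnitude, and because the pairing count grows with the square of the units, the pair space shrinks by roughly four.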
slide-4
SLIDE 4

Method

  • Usual approaches vs. Srinivasan et al
  • Extraction of [(st/sh)i, (st/sh)j] pairs
  • Selection of background dataset
  • Analysis of a single pair
slide-5
SLIDE 5

Usual approaches

  • First hand-pick relevant verbs & then extract their arguments
  • Set of 1 mil co-occurrences too big for manual approach
  • We want to extract interesting verbs based on MeSH co-occurrence
slide-6
SLIDE 6

Srinivasan et al approach

  • Two step approach:
  • 1. Automatically identify/extract key verbs associated with a [(st/sh)i, (st/sh)j] pair
  • 2. Use these verbs to extract highly related nouns (Ns) and noun phrases (NPs)
  • This paper establishes one method for step 1; work on step 2 is for a later date
  • We want a weighted verb vector for MeSH co-occurrences

slide-7
SLIDE 7

Extraction of [(st/sh)i, (st/sh)j] pairs

  • Corpus: MEDLINE to 2001, converted into rows of MEDLINE record id (MRI), heading/subheading
  • Transformed into MRI, semantic type/subheading (st/sh)
  • Then [(st/sh)i, (st/sh)j] pairings picked & frequencies noted
  • 30% of pairings occur only once and 97% occur fewer than 500 times, over 11 million records
  • Pairings further culled by two more criteria: 1. (freq > 500) brought the total down to 31,000 pairs & 2. (observed co-occurrence >= 1.25 * expected co-occurrence) lowered the total to 22,000 pairs
  • 250 randomly selected; documents were reliably retrieved for 228 of those
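
A minimal sketch of the pairing step, assuming the rows have already been transformed into (MRI, st/sh) tuples. The function name and the toy rows are invented for illustration; the slides do not say how the authors implemented this.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_pair_cooccurrences(rows):
    """Count, per unordered st/sh pair, the number of records containing both.

    rows: iterable of (record_id, st_sh) tuples.
    """
    by_record = defaultdict(set)
    for record_id, st_sh in rows:
        by_record[record_id].add(st_sh)

    counts = Counter()
    for units in by_record.values():
        # Sort so that each unordered pair has a single canonical key.
        for a, b in combinations(sorted(units), 2):
            counts[(a, b)] += 1
    return counts


# Toy rows, invented for illustration.
rows = [
    (1, "Disease or Syndrome/drug therapy"),
    (1, "Pharmacologic Substance/therapeutic use"),
    (2, "Disease or Syndrome/drug therapy"),
    (2, "Pharmacologic Substance/therapeutic use"),
    (2, "Lipid/adverse effects"),
]
pair_counts = count_pair_cooccurrences(rows)
for pair, freq in pair_counts.most_common():
    print(freq, pair)
```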

slide-8
SLIDE 8

Co-occurrence calculations

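  • Observed co-occurrence of pair A,B: number of docs containing both A and B
  • Expected co-occurrence of pair A,B, as a probability: (# docs w/ A * # docs w/ B) / (total number of documents)^2
  • Expected co-occurrence of pair A,B, as a count: (# docs w/ A * # docs w/ B) / total number of documents
  • Keep the pair if (observed - expected)/expected >= 0.25, i.e. observed >= 1.25 * expected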
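
The two culling criteria (frequency above 500, observed at least 1.25 times expected) can be sketched as follows. The function names, default thresholds as parameters, and the toy counts are assumptions for illustration, not the authors' code.

```python
def expected_cooccurrence(docs_with_a: int, docs_with_b: int, total_docs: int) -> float:
    """Expected number of docs containing both A and B if A and B were independent."""
    return docs_with_a * docs_with_b / total_docs


def keep_pair(observed: int, docs_with_a: int, docs_with_b: int, total_docs: int,
              min_freq: int = 500, min_ratio: float = 1.25) -> bool:
    """Apply both criteria: observed > min_freq and observed >= min_ratio * expected."""
    expected = expected_cooccurrence(docs_with_a, docs_with_b, total_docs)
    return observed > min_freq and observed >= min_ratio * expected


# Toy counts, invented for illustration: the expected count here is 2,000.
TOTAL = 11_000_000
print(expected_cooccurrence(200_000, 110_000, TOTAL))
print(keep_pair(2_600, 200_000, 110_000, TOTAL))  # ratio 1.3
print(keep_pair(2_400, 200_000, 110_000, TOTAL))  # ratio 1.2
```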
slide-9
SLIDE 9

Selection of background dataset

  • 100,000 MEDLINE records randomly selected
  • Title + abstract POS-tagged
  • Verbs extracted & used as vector to represent doc; verbs transformed to infinitives
  • IDF for each verb as log2(100,000/df)
  • BV (background vector): set of (verb, IDF) pairs for each record (is this our doc vector D?)
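
The IDF weighting can be sketched as below, assuming POS tagging and conversion to infinitives have already been done upstream, so that each record is just a set of verbs. The function name and the four toy records are invented; the slide uses 100,000 records.

```python
import math
from collections import Counter


def verb_idf(doc_verbs):
    """Return verb -> log2(N / df), where df is the number of docs containing the verb.

    doc_verbs: list with one collection of infinitive verbs per record.
    """
    n_docs = len(doc_verbs)
    df = Counter(verb for verbs in doc_verbs for verb in set(verbs))
    return {verb: math.log2(n_docs / count) for verb, count in df.items()}


# Toy background set of 4 records, invented for illustration.
background = [
    {"be", "treat"},
    {"be", "treat", "withdraw"},
    {"be", "show"},
    {"be"},
]
idf = verb_idf(background)
for verb in sorted(idf):
    print(verb, idf[verb])
```

A verb that appears in every record gets weight 0, so very common verbs contribute nothing to the vector.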

slide-10
SLIDE 10

Analysis of a single pair

  • Identify all docs in which pair appears
  • 2/3rds of docs placed in training set
  • Other 1/3rd plus random 1/3rd from MEDLINE in test set

  • Create verb profile for pair
  • Test verb profiles
slide-11
SLIDE 11

Creating verb profile

  • Docs POS-tagged & verbs (Vs) extracted
  • Rules for inclusion: V must occur in at least 5 docs; V frequency in training set must be significantly different from its freq in BV, using Pearson’s chi-square test (null hypothesis: difference between expected and observed frequencies is random)
  • Formation of the profile vector, to which weights are assigned just like with the BV, but…
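
The inclusion rules can be sketched as a Pearson chi-square test on a 2x2 table (docs with and without the verb, in the training set and in the background set). The table layout, the 0.05 significance level and the function names are assumptions; the slides do not give these details.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Critical value of chi-square with 1 degree of freedom at the assumed 0.05 level.
CRITICAL_05_DF1 = 3.841


def include_verb(train_df: int, train_size: int, bg_df: int, bg_size: int,
                 min_docs: int = 5) -> bool:
    """Keep a verb if it occurs in at least min_docs training docs and its
    document frequency differs significantly from the background."""
    if train_df < min_docs:
        return False
    stat = chi_square_2x2(train_df, train_size - train_df,
                          bg_df, bg_size - bg_df)
    return stat > CRITICAL_05_DF1


# Toy counts, invented for illustration.
print(include_verb(60, 200, 5_000, 100_000))  # 30% of training docs vs 5% of background
print(include_verb(10, 200, 5_000, 100_000))  # 5% in both sets
print(include_verb(4, 200, 50, 100_000))      # fails the 5-doc minimum
```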

slide-12
SLIDE 12

Formation of the profile vector

  • Four different profile vectors are formed:
    – V, AugTF
    – V, AugTF*IDF
    – V, TF*IDF
    – V, IDF
  • Empirical question as to how each performs
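
A sketch of the four weightings. The slides do not define AugTF; this assumes the common augmented term frequency, 0.5 + 0.5 * tf / max_tf, which may differ from what the paper uses. The function name and toy values are invented.

```python
def profile_vectors(tf, idf):
    """Build the four candidate profile vectors for one pair.

    tf:  verb -> raw frequency in the pair's training docs
    idf: verb -> IDF from the background set
    AugTF is ASSUMED to be 0.5 + 0.5 * tf / max_tf.
    """
    max_tf = max(tf.values())
    aug_tf = {verb: 0.5 + 0.5 * count / max_tf for verb, count in tf.items()}
    return {
        "AugTF": aug_tf,
        "AugTF*IDF": {verb: aug_tf[verb] * idf[verb] for verb in tf},
        "TF*IDF": {verb: tf[verb] * idf[verb] for verb in tf},
        "IDF": {verb: idf[verb] for verb in tf},
    }


# Toy values, invented for illustration.
vectors = profile_vectors(tf={"treat": 4, "withdraw": 2},
                          idf={"treat": 1.0, "withdraw": 2.0})
for name, vector in vectors.items():
    print(name, vector)
```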

slide-13
SLIDE 13

Testing profiles

  • Test each of the 4 PVs against the doc vectors D of the random set and of that 1/3rd of documents in which the pair occurs
  • Similarity of a background vector D to a PV as dot product of the two vectors (D, PV)
  • Mean similarity (D, PV) of random set & mean similarity of topic set calculated
  • Variance/information/interestingness: significant difference in similarities for a pair indicates interestingness
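
The similarity test can be sketched with sparse dictionaries as vectors. The function names and toy vectors are invented, and no significance test is included because the slides do not say which one the authors used.

```python
from statistics import mean


def dot(doc_vec, profile_vec):
    """Dot product of two sparse vectors stored as verb -> weight dicts."""
    return sum(weight * profile_vec[verb]
               for verb, weight in doc_vec.items() if verb in profile_vec)


def mean_similarity(doc_vecs, profile_vec):
    """Mean dot-product similarity of a set of doc vectors to one profile vector."""
    return mean(dot(doc, profile_vec) for doc in doc_vecs)


# Toy vectors, invented for illustration.
profile = {"treat": 1.0, "withdraw": 1.5}
topic_docs = [{"treat": 1.0, "withdraw": 2.0}, {"treat": 1.0}]
random_docs = [{"be": 0.0, "treat": 1.0}, {"show": 1.0}]

topic_mean = mean_similarity(topic_docs, profile)
random_mean = mean_similarity(random_docs, profile)
print("topic set:", topic_mean)
print("random set:", random_mean)
```

A pair counts as interesting when the topic-set mean is significantly higher than the random-set mean.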

slide-14
SLIDE 14

Results: Verb profiles

  • Example of top five verbs from the AugTF*IDF PV:

Pair: [(disease or syndrome, drug) & (lipid, adverse effect)]
Verbs: withdraw, warrant, undertake, treat, tolerate

  • Authors claim these verb profiles will provide useful constraints for extracting pair-associated nouns

slide-15
SLIDE 15

Results: test set

slide-16
SLIDE 16

Results: evaluation

  • Each bar represents one of the 228 pairs plus standard error
  • If a line touches the diagonal then the similarity is possibly random
  • All pairs show significant similarity “in the right direction”
slide-17
SLIDE 17

Conclusions

  • Verbs are important
  • Verb profiles from docs with MeSH co-occurrence pairs are different from docs not covered by the pair: verb profiles can be used to characterize other docs w/ co-occurrence
slide-18
SLIDE 18

Questions

  • How did Srinivasan et al. decide only 1 mil of the 60 mil possible [(st/sh), (st/sh)] pairings are meaningful? Chi sq test with null hypothesis that pairings are random?
  • What is the meaning of those similarity values? How significant?
  • Once we have noun-verb representations of MeSH co-occurrences, what does that get us?
  • What’s truly interesting about MeSH co-occurrence? Connecting otherwise disparate pieces of the literature