Topic Modeling and Clustering NIH Grants



slide-1
SLIDE 1

Topic Modeling and Clustering NIH Grants

[Figure: map of NIH grants. Labels: Health Services; Molecular/Cellular Biology; Neural Systems; Immunology & Infectious Diseases; NIH Institutes]

Red labels: global structure
Black labels: automatic, based on NIH review panels

nihmaps.org

slide-2
SLIDE 2

Two Complementary Analytic Systems

Map-like Clusters

  • Based on overall textual similarity between grants
  • Represent groupings of grants that share a common thematic focus

Automated Topics

  • Groupings of words determined by statistical analysis
  • Represent shared categories: each document consists of multiple topics

slide-3
SLIDE 3

Background – NINDS Effort

slide-4
SLIDE 4
Methods 1: Topic Modeling using LDA

  • Based on language in documents, not keywords
      – Native categories that are latent in the text; captures shared discourses
      – Groups of words rather than individual concepts
      – Context sensitive: accommodates diverse word meanings
  • Documents assigned to multiple categories
      – Documents treated as mixtures of topics
  • Quantitative information on textual content

Topic allocation (proportion of words assigned to a topic) serves as a proxy for document relevance. Topic allocations are used for:

      – Topic-based queries (user sets document threshold)
      – Ranking for lists of retrieved documents
      – Topic proportions for retrieved document sets
      – Measuring similarity between documents
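
The four uses of topic allocation listed on this slide can be sketched in a few lines. This is an illustrative sketch only, not the presenters' implementation: the function names, the toy per-token topic assignments, and the choice of cosine similarity for comparing documents are all assumptions (the slides do not name a similarity measure).

```python
import math
from collections import Counter

def topic_allocation(token_topics, num_topics):
    """Proportion of a document's word tokens assigned to each topic."""
    total = len(token_topics)
    if total == 0:
        return [0.0] * num_topics
    counts = Counter(token_topics)
    return [counts[k] / total for k in range(num_topics)]

def topic_query(allocations, topic, threshold):
    """Documents whose allocation to `topic` meets the threshold, ranked high to low."""
    hits = [(doc, alloc[topic]) for doc, alloc in allocations.items()
            if alloc[topic] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

def mean_proportions(allocations, doc_ids):
    """Average topic proportions over a retrieved document set."""
    num_topics = len(allocations[doc_ids[0]])
    return [sum(allocations[d][k] for d in doc_ids) / len(doc_ids)
            for k in range(num_topics)]

def cosine_similarity(a, b):
    """Similarity of two allocation vectors (cosine is an assumed choice)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical topic assignment for each word token of three toy documents.
token_topics = {
    "grant_a": [0, 0, 1, 2],
    "grant_b": [0, 1, 1, 1],
    "grant_c": [2, 2, 2, 2],
}
allocations = {doc: topic_allocation(z, 3) for doc, z in token_topics.items()}

# Topic-based query with a user-set threshold; results come back ranked.
ranked = topic_query(allocations, topic=1, threshold=0.2)
```

Raising the threshold trades recall for precision: at 0.2 both `grant_a` and `grant_b` are returned for topic 1, while at 0.5 only `grant_b` would be.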

slide-5
SLIDE 5

Methods 2: Graph Based Clustering using DrL

  • Force-directed algorithm
      – Nodes are attracted by similarity and repelled by areas of density
      – Highly scalable
      – Documents clustered based on lexical (topic and word) similarities
  • Surprising features
      – Local: thematically coherent clusters with striking face-validity
      – Intermediate: clusters linked in a lattice; links are formed by documents with “between cluster” focus
      – Global: compelling organization based on language rather than NIH bureaucratic structure
  • Interactive visual framework
      – Recognize patterns in the data that otherwise would be obscure
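
To make the force-directed idea concrete, here is a toy layout loop. It is not DrL: it uses simple pairwise repulsion as a stand-in for DrL's density-based repulsion, and the function name, parameters, and similarity matrix are hypothetical. It only illustrates how attraction weighted by similarity pulls related documents together while repulsion keeps groups apart.

```python
import math

def force_directed_layout(weights, iterations=300, step=0.05, repulsion=0.1):
    """Toy 2-D layout: attraction scales with similarity, repulsion with closeness."""
    n = len(weights)
    # Start on a unit circle so the result is deterministic.
    pos = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for _ in range(iterations):
        moves = []
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                d2 = dx * dx + dy * dy + 1e-9   # avoid division by zero
                pull = weights[i][j]            # similarity draws i toward j
                push = repulsion / d2           # nearby nodes push i away
                fx += (pull - push) * dx
                fy += (pull - push) * dy
            moves.append((fx, fy))
        # Apply all moves at once so every force uses the same snapshot.
        for p, (fx, fy) in zip(pos, moves):
            p[0] += step * fx
            p[1] += step * fy
    return pos

# Hypothetical similarities: documents 0-1 and 2-3 form two similar pairs.
weights = [
    [0.00, 1.00, 0.05, 0.05],
    [1.00, 0.00, 0.05, 0.05],
    [0.05, 0.05, 0.00, 1.00],
    [0.05, 0.05, 1.00, 0.00],
]
pos = force_directed_layout(weights)
within = math.dist(pos[0], pos[1])    # distance inside a similar pair
between = math.dist(pos[0], pos[2])   # distance across pairs
```

In this toy run the two similar pairs end up closer to each other than to the other pair, which is the local clustering behaviour the slide describes. A real layout at the scale of the NIH grant set needs an algorithm designed for that scale, such as DrL.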

slide-6
SLIDE 6
slide-7
SLIDE 7

Uploaded NSF Cognitive/Neuro Grants

Document Upload for Analysis against NIH Awards

slide-8
SLIDE 8

Document Upload for Analysis against NIH Awards

NIH Top Similar to Uploaded Documents

NIH Study Sections

  • CVP - Central Visual Processing
  • COG - Cognitive Neuroscience
  • CP - Cognition and Perception
  • LCOM - Language and Communication

NIH Program Directors

  • STEINMETZ, MICHAEL (NEI)
  • BABCOCK, DEBRA (NINDS)
  • VOGEL, MICHAEL (NIMH)
  • ROSSI, ANDREW (NIMH)

slide-9
SLIDE 9

Acknowledgements

  • Original Inspiration and Development
      – Gully Burns (USC)
      – Dave Newman (UC Irvine)
      – Bruce Herr (Indiana)
      – Katy Borner (Indiana)
  • Topic Modeling
      – Dave Newman (UC Irvine)
      – Hanna Wallach (UMass)
      – David Mimno (UMass)
      – Andrew McCallum (UMass)
  • Map and User Interface Design
      – Bruce Herr (Chalklabs)
      – Gavin LaRowe (Chalklabs)
      – Nathan Skiba (Chalklabs)
  • Advice and Assistance
      – Many, many people

slide-10
SLIDE 10

Machine Learned Topics vs. NIH Study Sections

slide-11
SLIDE 11

Machine Learned Topics vs. NIH Study Sections

slide-12
SLIDE 12

Somatosensory & Chemosensory Systems Study Section

[Figure: two panels, “Subset Inside the Bounded Region” and “Subset Outside the Bounded Region”. Values as extracted, panel assignment not recoverable: 73, 27, 14, 6, 123, 18, 3, 2]

Topics and clusters reveal research categories within Study Sections that are highly relevant to NIH Institutes

slide-13
SLIDE 13

Machine Learned Topics vs. NIH RCDC Categories

NIH RCDC Category: Sleep Research

  • Combined query with individual topics reveals finer-grained research categories
  • Once again, clearly relevant to NIH Institutes
slide-14
SLIDE 14

NIH RCDC Category: Sleep Research

[Figure: two highlighted clusters, labeled 196 (29%) and 183 (27%)]

Two prominent clusters account for ~56% of the awards

slide-15
SLIDE 15

Research Trend Analysis

  • Screened for topics that changed over this time period
  • Biggest “hit”: microRNA
  • Co-occurring topics
      – 2007 vs. 2009: Cellular/molecular biology vs. Complex physiology, Cancer biomarkers
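
One simple way to screen for topics that changed between two years is to compare each topic's mean allocation per year and rank topics by the size of the change. The slides do not describe the actual screening method, so treat this as a sketch under that assumption; the function names and the toy allocation vectors are hypothetical.

```python
def mean_allocation_by_year(records, num_topics):
    """records: iterable of (year, allocation vector). Returns {year: mean vector}."""
    sums, counts = {}, {}
    for year, alloc in records:
        acc = sums.setdefault(year, [0.0] * num_topics)
        for k, value in enumerate(alloc):
            acc[k] += value
        counts[year] = counts.get(year, 0) + 1
    return {year: [v / counts[year] for v in acc] for year, acc in sums.items()}

def rank_topic_changes(records, num_topics, start_year, end_year):
    """Topics ordered by absolute change in mean allocation between two years."""
    means = mean_allocation_by_year(records, num_topics)
    changes = [(k, means[end_year][k] - means[start_year][k])
               for k in range(num_topics)]
    return sorted(changes, key=lambda change: abs(change[1]), reverse=True)

# Hypothetical allocations for four grants over three topics.
records = [
    (2007, [0.50, 0.50, 0.00]),
    (2007, [0.50, 0.25, 0.25]),
    (2009, [0.25, 0.25, 0.50]),
    (2009, [0.25, 0.25, 0.50]),
]
trend = rank_topic_changes(records, num_topics=3, start_year=2007, end_year=2009)
```

Here topic 2 rises the most (from 0.125 to 0.5 on average) and would be the first "hit" to inspect. With real data one would also want to control for changes in the number and size of grants per year.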

slide-16
SLIDE 16

Topic Representation Within the Database

  • Co-occurring and similar topics give surrounding concept space
  • Associated document metadata determined post-hoc by mutual information scoring
  • Associated title words and document phrases give extra lexical information
  • Full grant list ranked by topic allocation allows determination of threshold for tagging accuracy
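
The slide mentions mutual information scoring for linking topics to document metadata but gives no formula. One plausible reading is pointwise mutual information (PMI) between "document is tagged with the topic" and "document carries the metadata value". The sketch below assumes that reading; the field name `institute`, the topic labels, and the toy documents are all hypothetical.

```python
import math

def pmi(docs, topic, institute):
    """Pointwise mutual information between a topic tag and a metadata value."""
    n = len(docs)
    n_topic = sum(1 for d in docs if topic in d["topics"])
    n_value = sum(1 for d in docs if d["institute"] == institute)
    n_both = sum(1 for d in docs
                 if topic in d["topics"] and d["institute"] == institute)
    if n_both == 0:
        return float("-inf")   # never observed together
    # log2 of P(topic, value) / (P(topic) * P(value)), written with counts
    return math.log2(n_both * n / (n_topic * n_value))

# Hypothetical tagged documents.
docs = [
    {"topics": {"sleep"}, "institute": "NINDS"},
    {"topics": {"sleep"}, "institute": "NINDS"},
    {"topics": {"vision"}, "institute": "NEI"},
    {"topics": {"vision", "sleep"}, "institute": "NEI"},
]
vision_nei = pmi(docs, "vision", "NEI")
sleep_ninds = pmi(docs, "sleep", "NINDS")
sleep_nei = pmi(docs, "sleep", "NEI")
```

A positive score means the topic and the metadata value occur together more often than independence would predict; a negative score means less often. Ranking metadata values by this score for each topic gives the "associated metadata" list.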

slide-17
SLIDE 17

Topic Evaluation

[Figure: Word Co-Occurrence Score vs. Topic Size (% total), with two example topics marked. A: Good Topic. B: Poor Topic]

Good (94.3%), Intermediate (3.4%), Poor (2.3%)
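
The slide does not define its word co-occurrence score. As an illustration of the general idea, the sketch below computes a document co-occurrence coherence of the kind proposed by Mimno and colleagues: for each pair of a topic's top words, it takes the log of how often the two words share a document relative to how often the earlier word appears at all. Whether this matches the score plotted on the slide is an assumption, and the toy corpus is hypothetical.

```python
import math

def cooccurrence_score(top_words, documents):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words.

    D(...) counts documents containing all the given words. Every top word
    is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score

# Hypothetical tokenized documents.
documents = [
    ["neuron", "synapse", "axon"],
    ["neuron", "synapse"],
    ["synapse", "axon", "neuron"],
    ["grant", "budget"],
]
good_topic = cooccurrence_score(["neuron", "synapse", "axon"], documents)
poor_topic = cooccurrence_score(["neuron", "budget", "axon"], documents)
```

The topic whose top words regularly appear in the same documents scores higher than the one that mixes unrelated words, which is the distinction between the "good" and "poor" examples on the slide.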

slide-18
SLIDE 18

Grant Similarity vs. Layout Proximity

Initiating Grant:
Title: Structure and Regulation of the GPCR-G Protein Interactions in the Visual System

Topic Mix of Initiating Grant:

  a: GPCR Signaling
  b: Photoreceptor Cells
  c: Protein Structure

100 Most Similar Grants:

[Figure: map showing the 100 most similar grants, grouped in clusters labeled a, b, and c]

Similar grants are not necessarily proximal on the graph; instead, they are clustered.

Clusters correspond to the topics of the initiating grant.

slide-19
SLIDE 19

Document Retrieval Performance

  • Retrieval performance: graph proximity vs. document similarity values
  • For large document sets (n = 100-300), graph performance is comparable to that of the similarity scores, which are the graph's inputs
  • This is the number of grants in all the previous examples
  • Graph adds information: documents are clustered

[Figure legend: Most Similar; Closest on Graph]
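
The slide does not say how retrieval performance was measured. A simple way to compare the two retrieval routes is to take the k most similar documents and the k documents closest on the layout for the same query, then measure how much the two sets overlap. The sketch below does that; the function names, similarity values, and coordinates are hypothetical.

```python
import math

def top_k_by_similarity(query, similarity, k):
    """The k documents with the highest similarity score to the query."""
    scores = similarity[query]
    return sorted(scores, key=scores.get, reverse=True)[:k]

def top_k_by_proximity(query, positions, k):
    """The k documents closest to the query on the 2-D layout."""
    others = [doc for doc in positions if doc != query]
    return sorted(others,
                  key=lambda doc: math.dist(positions[query], positions[doc]))[:k]

def overlap_at_k(query, similarity, positions, k):
    """Fraction of the k most similar documents that are also among the k closest."""
    similar = set(top_k_by_similarity(query, similarity, k))
    closest = set(top_k_by_proximity(query, positions, k))
    return len(similar & closest) / k

# Hypothetical similarity scores and layout coordinates for one query grant.
similarity = {"q": {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}}
positions = {"q": (0, 0), "a": (1, 0), "b": (5, 5), "c": (0, 2), "d": (9, 9)}
overlap = overlap_at_k("q", similarity, positions, k=2)
```

In this toy case only half of the two most similar documents are also the two closest on the layout, echoing the earlier point that similar grants are not necessarily proximal. The slide's claim is that for larger sets (n = 100-300) the two routes perform comparably.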

slide-20
SLIDE 20

Topic word assignments are sensitive to the contexts in which the words occur

[Figure: documents that contain the word “play”, linked to topics that include the word “play”]

  • Document 1: “play” > Topic 77 > [music]
  • Document 2: “play” > Topic 82 > [literature]
  • Document 3: “play” > Topic 166 > [game]

Each instance of a word is assigned to a specific topic, depending on assignments of other words in the document. Topics are mixtures of words, each with an associated probability, which are learned from word co-occurrence within documents.
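
The context sensitivity described here can be illustrated with the conditional used in collapsed Gibbs sampling for LDA, where a token's topic weight is the product of "how much of this document is already in the topic" and "how likely the word is under the topic". The slides name LDA but not the inference method, so Gibbs sampling is an assumption, as are all counts and hyperparameter values below; only the topic numbers and labels come from the slide.

```python
def topic_probabilities(word, doc_topic_counts, topic_word_counts,
                        topic_totals, vocab_size, alpha=0.1, beta=0.01):
    """Normalized topic weights for one word token, given the other assignments.

    In a full sampler the counts would exclude the token being resampled.
    """
    weights = {}
    for topic, total in topic_totals.items():
        doc_part = doc_topic_counts.get(topic, 0) + alpha
        word_part = ((topic_word_counts[topic].get(word, 0) + beta)
                     / (total + vocab_size * beta))
        weights[topic] = doc_part * word_part
    norm = sum(weights.values())
    return {topic: w / norm for topic, w in weights.items()}

# Hypothetical counts: "play" is equally common in all three topics.
topic_word_counts = {
    77: {"play": 30, "piano": 50},    # music
    82: {"play": 30, "novel": 50},    # literature
    166: {"play": 30, "team": 50},    # game
}
topic_totals = {77: 1000, 82: 1000, 166: 1000}

# Topic counts of the *other* words in two hypothetical documents.
music_doc = {77: 8, 82: 1, 166: 0}
game_doc = {77: 0, 82: 1, 166: 8}

p_music = topic_probabilities("play", music_doc, topic_word_counts,
                              topic_totals, vocab_size=5000)
p_game = topic_probabilities("play", game_doc, topic_word_counts,
                             topic_totals, vocab_size=5000)
best_in_music_doc = max(p_music, key=p_music.get)
best_in_game_doc = max(p_game, key=p_game.get)
```

Because the word itself is equally likely under all three topics in this toy setup, the surrounding words decide the outcome: the same token "play" leans toward Topic 77 in the music-heavy document and toward Topic 166 in the game-heavy one.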