Modeling Science:
Discovering Themes in Large Collections of Documents
David M. Blei
Department of Computer Science Princeton University
May 14, 2007 Joint work with John Lafferty (CMU)
- D. Blei
Modeling Science 1 / 29
1. Draw each topic βi ∼ Dir(η), for i ∈ {1, ..., K}.
2. For each document:
   1. Draw topic proportions θd ∼ Dir(α).
   2. For each word:
      1. Draw Zd,n ∼ Mult(θd).
      2. Draw Wd,n ∼ Mult(β_{Zd,n}).
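The generative process above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes symmetric Dirichlet priors (scalar η and α), integer word ids, and fixed document lengths. The names `sample_dirichlet`, `sample_index`, and `generate_corpus` are hypothetical helpers introduced only for this example.

```python
import random

def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [g / total for g in draws]

def sample_index(rng, probs):
    """Draw one index from a discrete distribution (a single multinomial trial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(num_topics, vocab_size, doc_lengths, eta, alpha, seed=0):
    """Simulate documents from the generative process (assumed symmetric priors)."""
    rng = random.Random(seed)
    # Step 1: draw each topic as a distribution over the vocabulary.
    topics = [sample_dirichlet(rng, eta, vocab_size) for _ in range(num_topics)]
    corpus = []
    for length in doc_lengths:
        # Step 2.1: draw the topic proportions for this document.
        theta = sample_dirichlet(rng, alpha, num_topics)
        assignments, words = [], []
        for _ in range(length):
            z = sample_index(rng, theta)      # Step 2.2.1: topic assignment
            w = sample_index(rng, topics[z])  # Step 2.2.2: word from that topic
            assignments.append(z)
            words.append(w)
        corpus.append({"theta": theta, "z": assignments, "w": words})
    return topics, corpus

# Illustrative sizes only; these values are not taken from the talk.
topics, corpus = generate_corpus(
    num_topics=3, vocab_size=8, doc_lengths=[5, 7], eta=0.5, alpha=0.5
)
```

Each entry of `corpus` keeps the latent variables (`theta`, `z`) next to the observed words (`w`), which makes clear what inference has to recover when only `w` is seen.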
[Figure: topic probabilities. x-axis: Topics (1 to 96); y-axis: Probability (0.0 to 0.4).]
Topic words (one topic per line):
human genome dna genetic genes sequence gene molecular sequencing map information genetics mapping project sequences
evolution evolutionary species [...] life [...] biology groups phylogenetic living diversity group new two common
disease host bacteria diseases resistance bacterial new strains control infectious malaria parasite parasites united tuberculosis
computer models information data computers system network systems model parallel methods networks software new simulations
"Infrared Reflectance in Leaf-Sitting Neotropical Frogs" (1977) "Instantaneous Photography" (1890)
[Figure: plate diagram repeated three times. Nodes: α, θd, Zd,n, Wd,n. Plates: N, D, K.]
[Figure panels: Original article; Topic proportions.]
[Figure panels: Original article; Most likely words from top topics.]
Most likely words from top topics (one topic per line):
sequence genome genes sequences human gene dna sequencing chromosome regions analysis data genomic number
devices device materials current high gate light silicon material technology electrical fiber power based
data information network web computer language networks time software system words algorithm number internet
1880  electric machine power engine steam two machines iron battery wire
1890  electric power company steam electrical machine two system motor engine
1900  apparatus steam power engine engineering water construction engineer room feet
1910  air water engineering apparatus room laboratory engineer made gas tube
1920  apparatus tube air pressure water glass gas made laboratory mercury
1930  tube apparatus glass air mercury laboratory pressure made gas small
1940  air tube apparatus glass laboratory rubber pressure small mercury gas
1950  tube apparatus glass air chamber instrument small laboratory pressure rubber
1960  tube system temperature air heat chamber power high instrument control
1970  air heat power system temperature chamber high flow tube design
1980  high power design heat system systems devices instruments control large
1990  materials high power current applications technology devices design device heat
2000  devices device materials current gate high light silicon material technology
[Figure: two panels with a time axis from 1880 to 2000. Panel "Theoretical Physics": RELATIVITY, LASER, FORCE. Panel "Neuroscience": NERVE, OXYGEN, NEURON.]
[Equation; only the trailing term survives: ... Σ_{j=1} exp{x_j})}]
wild type mutant mutations mutants mutation
plants plant gene genes arabidopsis p53 cell cycle activity cyclin regulation amino acids cdna sequence isolated protein gene disease mutations families mutation rna dna rna polymerase cleavage site cells cell expression cell lines bone marrow
united states women universities students education
science scientists says research people research funding support nih program
surface tip image sample device
laser
light electrons quantum materials
polymer polymers molecules
volcanic deposits magma eruption volcanism
mantle crust upper mantle meteorites ratios earthquake earthquakes fault images data ancient found impact million years ago africa climate
ice changes climate change
cells proteins researchers protein found
patients disease treatment drugs clinical
genetic population populations differences variation
fossil record birds fossils dinosaurs fossil sequence sequences genome dna sequencing bacteria bacterial host resistance parasite development embryos drosophila genes expression species forest forests populations ecosystems
synapses ltp glutamate synaptic neurons
neurons stimulus motor visual cortical
atmospheric measurements stratosphere concentrations
sun solar wind earth planets planet co2 carbon carbon dioxide methane water
receptor receptors ligand ligands apoptosis
proteins protein binding domain domains activated tyrosine phosphorylation activation phosphorylation kinase magnetic magnetic field spin superconductivity superconducting physicists particles physics particle experiment surface liquid surfaces fluid model
reaction reactions molecule molecules transition state
enzyme enzymes iron active site reduction pressure high pressure pressures core inner core
brain memory subjects left task
computer problem information computers problems
stars astronomers universe galaxies galaxy
virus hiv aids infection viruses mice antigen t cells antigens immune response