Latent Dirichlet Allocation
Alberto Bietti
Too much information
Topic modeling
Discover the hidden thematic structure in each document of an archive
[Figure: most probable words of four topics, one topic per line]
human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
evolution, evolutionary, species, life, biology, groups, phylogenetic, living, diversity, group, new, two, common
disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations
[Figure: images with annotation labels: SKY WATER TREE MOUNTAIN PEOPLE SCOTLAND WATER FLOWER HILLS TREE SKY WATER BUILDING PEOPLE WATER FISH WATER OCEAN TREE CORAL PEOPLE MARKET PATTERN TEXTILE DISPLAY BIRDS NEST TREE BRANCH LEAVES]
[Figure: graphical models of (a) the unigram model (node w, plates N and M) and (b) the mixture of unigrams (nodes z and w, plates N and M)]
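[Figure: graphical model of LDA]
α: proportions parameter
θ_d: per-document topic proportions
z_{d,n}: per-word topic assignment
w_{d,n}: observed word
β_k: topics
η: topic parameter
Plates: N (words in a document), D (documents), K (topics)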
$$
p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid \beta_{z_n})
$$
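The factorisation above can be read as a recipe for generating one document: draw θ from a Dirichlet, then for each word draw a topic z_n from θ and a word w_n from the topic β_{z_n}. Below is a minimal sketch of that process using only the Python standard library. The function names, the two-topic `beta` matrix and the four-word vocabulary are hypothetical and chosen purely for illustration; they do not come from the slides.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalising independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def generate_document(alpha, beta, n_words, rng):
    """Sample (theta, z, w) for one document.

    alpha: list of K positive Dirichlet parameters.
    beta:  K rows, each a distribution over the V vocabulary items.
    """
    theta = sample_dirichlet(alpha, rng)                 # per-document topic proportions
    topics = range(len(alpha))
    vocab = range(len(beta[0]))
    z = rng.choices(topics, weights=theta, k=n_words)    # per-word topic assignments
    w = [rng.choices(vocab, weights=beta[k], k=1)[0] for k in z]  # observed words
    return theta, z, w


def log_likelihood_given_theta(theta, z, w, beta):
    """log of prod_n p(z_n | theta) p(w_n | beta_{z_n}); the p(theta | alpha) factor is left out."""
    return sum(math.log(theta[k]) + math.log(beta[k][v]) for k, v in zip(z, w))


# Hypothetical toy setting: K = 2 topics over a vocabulary of V = 4 word ids.
rng = random.Random(0)
alpha = [0.5, 0.5]
beta = [
    [0.50, 0.40, 0.05, 0.05],
    [0.05, 0.05, 0.40, 0.50],
]
theta, z, w = generate_document(alpha, beta, n_words=10, rng=rng)
print(theta, z, w, log_likelihood_given_theta(theta, z, w, beta))
```

Sampling the Dirichlet through normalised Gamma draws is one standard construction; it keeps the sketch free of third-party dependencies.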
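$$
p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}
$$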
where Γ(·) is the Gamma function. This distribution is defined over the vectors θ such that θ_i ≥ 0 for all i and \sum_{i=1}^{k} θ_i = 1 (the simplex).
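As a sanity check on the Dirichlet density, the following sketch evaluates its logarithm with `math.lgamma` and enforces the simplex constraint. The function name `dirichlet_log_pdf` and the tolerance are assumptions made for this example. Points with some θ_i = 0 are treated as outside the support to avoid log(0), which is a simplification.

```python
import math


def dirichlet_log_pdf(theta, alpha, tol=1e-9):
    """Log-density of Dirichlet(alpha) at theta.

    Returns -inf when theta is not strictly inside the simplex
    (simplification: boundary points with theta_i == 0 are rejected).
    """
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length")
    if any(t <= 0.0 for t in theta) or abs(sum(theta) - 1.0) > tol:
        return -math.inf
    # log Gamma(sum_i alpha_i) - sum_i log Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    # sum_i (alpha_i - 1) log theta_i
    log_kernel = sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    return log_norm + log_kernel


# With alpha = (1, 1, 1) the density is uniform on the simplex, with value Gamma(3) = 2.
print(math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])))
```

Working in log space avoids overflow in Γ when the α_i are large.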
[Figures: two sets of bar plots; x-axis: item (1 to 5), y-axis: value (0.0 to 1.0)]
[Figure with three panels: Topics; Documents; Topic proportions and assignments]
... the data
$$
p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}
$$
Unfortunately, this is very difficult to compute, as ...
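To see where the difficulty comes from, the denominator can be written out by marginalising the joint distribution over θ and z. The expression below is the standard form for LDA, written in the notation of the joint distribution given earlier. It is added here as a reading aid and is not taken from the slide itself.

```latex
p(w \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n = 1}^{K} p(z_n \mid \theta)\, p(w_n \mid \beta_{z_n}) \right)
    d\theta
```

The sum over topics inside the product couples θ and β, so the integral over the simplex has no closed form. In practice the posterior is approximated, for example with variational inference or sampling.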
[Figure: bar plot; x-axis: Topics (1 to 96), y-axis: Probability (0.0 to 0.4)]
[Figure: most probable words of four topics, one topic per line]
human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
evolution, evolutionary, species, life, biology, groups, phylogenetic, living, diversity, group, new, two, common
disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations
[Figure: most probable words of four further topics, one topic per line]
problem, problems, mathematical, number, new, mathematics, university, two, first, numbers, work, time, mathematicians, chaos, chaotic
model, rate, constant, distribution, time, number, size, values, value, average, rates, data, density, measured, models
selection, male, males, females, sex, species, female, evolution, populations, population, sexual, behavior, evolutionary, genetic, reproductive
species, forest, ecology, fish, ecological, conservation, diversity, population, natural, ecosystems, populations, endangered, tropical, forests, ecosystem
... dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
... Communications of the ACM, to appear.