IR&DM ’13/’14
III.4 Statistical Language Models
1. Basics of Statistical Language Models
2. Query-Likelihood Approaches
3. Smoothing Methods
4. Divergence Approaches
5. Extensions
Based on MRS Chapter 12 and [Zhai 2008]
stop : 0.1, continue : 0.9
dog : 0.5, cat : 0.4, hog : 0.1
P(⟨hog⟩) = 0.1 × 0.1
P(⟨cat, dog⟩) = 0.4 × 0.9 × 0.5 × 0.1
P(⟨dog, dog, hog⟩) = 0.5 × 0.9 × 0.5 × 0.9 × 0.1 × 0.1
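The three products above can be checked with a short script. This is a minimal sketch, assuming the usual reading of the example: each term is emitted with its listed probability, followed by a continue step (0.9) between terms and a stop step (0.1) after the last term. The function name `string_probability` is illustrative and not from the slides.

```python
import math

# Emission probabilities and stop/continue probabilities from the example.
EMIT = {"dog": 0.5, "cat": 0.4, "hog": 0.1}
P_STOP = 0.1
P_CONTINUE = 0.9


def string_probability(terms):
    """Probability of generating exactly this term sequence and then stopping."""
    if not terms:
        raise ValueError("the example only covers non-empty sequences")
    p = 1.0
    for i, term in enumerate(terms):
        p *= EMIT.get(term, 0.0)  # unseen terms get probability 0
        is_last = i == len(terms) - 1
        p *= P_STOP if is_last else P_CONTINUE
    return p


for seq in (["hog"], ["cat", "dog"], ["dog", "dog", "hog"]):
    print(seq, string_probability(seq))
```

The printed values are the evaluated products 0.01, 0.018 and 0.002025, up to floating-point rounding.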
soccer : 0.20, goal : 0.15, tennis : 0.10, player : 0.05, …
party : 0.20, debate : 0.20, scandal : 0.15, election : 0.05, …
apple : 0.20, pie : 0.15, …
cake : 0.20, apple : 0.15, …
P(q|d1), P(q|d2)
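\[
P(q \mid d) = \prod_{t \in q} \left( \lambda \, \frac{tf(t,d)}{|d|} + (1-\lambda) \, \frac{tf(t,D)}{|D|} \right)
\]
\[
\log P(q \mid d) = \sum_{t \in q} \log \left( 1 + \frac{\lambda}{1-\lambda} \cdot \frac{tf(t,d)}{|d|} \cdot \frac{|D|}{tf(t,D)} \right) + \sum_{t \in q} \log \left( (1-\lambda) \, \frac{tf(t,D)}{|D|} \right)
\]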
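The symbols above (λ, 1−λ, tf(t,d), |d|, tf(t,D), |D|) are the ingredients of linearly interpolated smoothing, commonly called Jelinek-Mercer smoothing. The following is a minimal sketch, assuming P(t|d) = λ · tf(t,d)/|d| + (1−λ) · tf(t,D)/|D| and a query likelihood that multiplies these per-term probabilities. The toy documents, the query, and the name `query_likelihood` are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
import math


def query_likelihood(query, doc, collection, lam=0.5):
    """P(q|d) with linear interpolation between document and collection model.

    Assumes non-empty documents; `lam` weights the document model.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(t for d in collection for t in d)
    coll_len = sum(coll_tf.values())
    p = 1.0
    for t in query:
        p_doc = doc_tf[t] / len(doc)      # tf(t,d) / |d|
        p_coll = coll_tf[t] / coll_len    # tf(t,D) / |D|
        p *= lam * p_doc + (1 - lam) * p_coll
    return p


# Hypothetical two-document collection for illustration.
d1 = ["apple", "pie", "apple", "recipe"]
d2 = ["cake", "recipe", "cake", "apple"]
collection = [d1, d2]
query = ["apple", "pie"]

for name, doc in (("d1", d1), ("d2", d2)):
    print(name, query_likelihood(query, doc, collection))
```

In this toy collection d2 does not contain "pie", yet its score stays above zero because the collection model contributes (1−λ) · tf(t,D)/|D| for every query term.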
apple : 0.20, pie : 0.15, …
cake : 0.20, apple : 0.15, …
apple : 0.20, muffin : 0.15, …
D(θq‖θd1), D(θq‖θd2)
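θq: apple : 0.50, muffin : 0.50
θd: apple : 0.25, muffin : 0.25, recipe : 0.10, water : 0.10, sugar : 0.30
\[
D(\theta_q \,\|\, \theta_d) = P(\text{apple} \mid \theta_q) \log_2 \frac{P(\text{apple} \mid \theta_q)}{P(\text{apple} \mid \theta_d)} + P(\text{muffin} \mid \theta_q) \log_2 \frac{P(\text{muffin} \mid \theta_q)}{P(\text{muffin} \mid \theta_d)}
\]
\[
= 0.50 \log_2 \frac{0.50}{0.25} + 0.50 \log_2 \frac{0.50}{0.25} = 1.00
\]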
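The worked example above can be reproduced numerically. This is a minimal sketch, assuming base-2 logarithms (the only base for which the result is exactly 1.00) and summing only over terms with non-zero probability under the query model. The function name `kl_divergence` is illustrative.

```python
import math


def kl_divergence(p, q):
    """D(p || q) in bits; terms with p(t) = 0 contribute nothing."""
    total = 0.0
    for term, p_t in p.items():
        if p_t == 0.0:
            continue
        q_t = q.get(term, 0.0)
        if q_t == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_t * math.log2(p_t / q_t)
    return total


# Query and document language models from the example.
theta_q = {"apple": 0.50, "muffin": 0.50}
theta_d = {"apple": 0.25, "muffin": 0.25, "recipe": 0.10,
           "water": 0.10, "sugar": 0.30}

print(kl_divergence(theta_q, theta_d))
```

The terms recipe, water and sugar do not enter the sum because the query model assigns them no probability, which is why the result matches the two-term calculation.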
…munich’s flying dutchman… …one of bayern’s most valuable players… …winning soccer’s most prestigious champions league… …with the dutch national team…
[Figure: a term-document view next to a term-topic-document view]
m = 6 (terms)
t1 : bak(e,ing)
t2 : recipe(s)
t3 : bread
t4 : cake
t5 : pastr(y,ies)
t6 : pie
n = 5 (documents)
d1 : how to bake bread without recipes
d2 : the classic art of viennese pastry
d3 : numerical recipes: the art of scientific computing
d4 : breads, pastries, pies and cakes: quantity baking recipes
d5 : pastry: a book of best french recipes
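The example above implies a 6 × 5 term-document matrix. The following is a minimal sketch that builds it from the listed terms and titles, assuming raw occurrence counts as entries and assuming that bread, cake and pie also match their plural forms (needed for d4). The regular expressions and the name `term_document_matrix` are illustrative choices, not from the slides.

```python
import re

# Patterns follow the stems listed above; the optional plural "s" on
# bread, cake and pie is an assumption so that d4 matches.
TERMS = [
    ("t1", r"bak(e|ing)"),
    ("t2", r"recipes?"),
    ("t3", r"breads?"),
    ("t4", r"cakes?"),
    ("t5", r"pastr(y|ies)"),
    ("t6", r"pies?"),
]

DOCS = [
    "how to bake bread without recipes",
    "the classic art of viennese pastry",
    "numerical recipes: the art of scientific computing",
    "breads, pastries, pies and cakes: quantity baking recipes",
    "pastry: a book of best french recipes",
]


def term_document_matrix(terms, docs):
    """Rows are terms, columns are documents, entries are occurrence counts."""
    matrix = []
    for _, pattern in terms:
        row = []
        for doc in docs:
            tokens = re.findall(r"[a-z]+", doc.lower())
            row.append(sum(1 for tok in tokens if re.fullmatch(pattern, tok)))
        matrix.append(row)
    return matrix


A = term_document_matrix(TERMS, DOCS)
for (name, _), row in zip(TERMS, A):
    print(name, row)
```

Each printed row lists the counts of one term across d1 to d5.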
[Figure: plate diagram with nodes Dirichlet(α), Multinomial(θ, M), Multinomial(β, k), topic t, word w and plates N, D. Legend: latent (hidden) RV; observable RV (data); per document; per word occurrence]
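The labels in this diagram suggest a generative process that can be sketched in code. This is a minimal sketch under stated assumptions, not a reading confirmed by the slide text: per document, topic proportions θ are drawn from Dirichlet(α); per word occurrence, a latent topic t is drawn from θ and an observable word w is drawn from that topic's word distribution β. The vocabulary, the α and β values, and the function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(alpha, beta, vocab, length, rng):
    """Return (theta, [(topic, word), ...]) for one synthetic document."""
    theta = sample_dirichlet(alpha, rng)              # once per document
    pairs = []
    for _ in range(length):                           # once per word occurrence
        t = rng.choices(range(len(alpha)), weights=theta)[0]  # latent topic
        w = rng.choices(vocab, weights=beta[t])[0]            # observed word
        pairs.append((t, w))
    return theta, pairs


# Hypothetical two-topic setup for illustration.
vocab = ["soccer", "goal", "party", "debate"]
alpha = [0.5, 0.5]
beta = [
    [0.5, 0.4, 0.05, 0.05],   # topic 0 favors soccer, goal
    [0.05, 0.05, 0.5, 0.4],   # topic 1 favors party, debate
]

rng = random.Random(0)
theta, pairs = generate_document(alpha, beta, vocab, length=8, rng=rng)
print(theta)
print(pairs)
```

Only the words would be observed in real data; the topic assignments and θ are the hidden quantities that inference has to recover.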
[Figure: plate diagram with nodes Dirichlet(α), Multinomial(θ, M), Multinomial(β, k), topic t, word w and plates N, D, shown next to an unrolled example: document d with topics t and words w]
[Figure: two plate diagrams. First: Dirichlet(α), Multinomial(θ, M), Multinomial(β, k), topic t, word w, plates N and D. Second: Dirichlet(γ), Multinomial(θ, k), Multinomial(φ, k), topic t, plates N and D]
The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. “Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants, an act every bit as important as our traditional areas of support in health, medical research, education and the social services,” Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants. Lincoln Center’s share will be $200,000 for its new building, which will house young artists and provide new public facilities. The Metropolitan Opera Co. and New York Philharmonic will receive $400,000 each. The Juilliard School, where music and the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter
donation, too.
Figure 8: An example article from the AP corpus. Each color codes a different factor from which