SLIDE 1
A semi-automatic structure learning method for language modeling
Vitor Pera
Faculdade de Engenharia da Universidade do Porto (FEUP)
September 11, 2019
SLIDE 2
Outline
- Linguistic Classes Prediction Model (LCPM)
- LCPM’s Structure Learning Method
SLIDE 3
Linguistic Classes Prediction Model (LCPM)
- Multiclass-dependent N-gram (M > N > 1):
  P(ω_t | ω_{1:t−1}) = Σ_{c_t ∈ C(ω_t)} P(ω_t | c_t, ω_{1:t−1}) P(c_t | ω_{1:t−1})
                     ≈ Σ_{c_t ∈ C(ω_t)} P(ω_t | c_t, ω_{t−N+1:t−1}) P(c_t | c_{t−M+1:t−1})
- LCPM (FLM formalism): with each class mapped to a vector of K linguistic factors (c ↔ f^{1:K}), the class predictor becomes
  P(c_t | c_{t−M+1:t−1}) → P(f^{1:K}_t | f^{1:K}_{t−M+1:t−1})
- LCPM structure learning (goal):
  - accurate and simple structure
  - two-step method
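To make the decomposition concrete, here is a minimal runnable Python sketch; the class map C(·) and all probability tables are toy assumptions for illustration, not the paper's data:

# Toy instance of the multiclass-dependent N-gram decomposition above.

# C(w): the set of linguistic classes word w can belong to
classes_of = {"casa": {"NOUN", "VERB"}}

# P(w_t | c_t, w_{t-N+1:t-1}) with N = 2 (one word of history)
p_word_given_class = {
    ("casa", "NOUN", ("a",)): 0.020,
    ("casa", "VERB", ("a",)): 0.001,
}

# P(c_t | c_{t-M+1:t-1}) with M = 3 (two classes of history)
p_class = {
    ("NOUN", ("DET", "ADJ")): 0.50,
    ("VERB", ("DET", "ADJ")): 0.05,
}

def p_next_word(word, word_hist, class_hist):
    """P(w_t | history) ~= sum over c_t in C(w_t) of
    P(w_t | c_t, w_{t-N+1:t-1}) * P(c_t | c_{t-M+1:t-1})."""
    return sum(p_word_given_class.get((word, c, word_hist), 0.0)
               * p_class.get((c, class_hist), 0.0)
               for c in classes_of.get(word, ()))

print(p_next_word("casa", ("a",), ("DET", "ADJ")))  # 0.02*0.5 + 0.001*0.05 = 0.01005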
SLIDE 4
LCPM’s Structure Learning Method - Step 1: Intro
- Given:
  - the need for an LCPM to compute P(f^{1:K}_t | f^{1:K}_{t−M+1:t−1}) (factors not yet known)
  - common knowledge of Linguistics
  - full knowledge of the specific language interface
- Solve (non-automatically):
  - Which linguistic features should be used?
  - Which linguistic features exhibit some special statistical independence property?
SLIDE 5
LCPM’s Structure Learning Method - Step 1: Procedure
- 1. Choose the linguistic features (→ f^{1:K}):
  - informative for modeling P(ω_t | f^{1:K}_t, ω_{t−N+1:t−1})
  - adequate to the data resources (annotation and robustness)
- 2. Make the (credible) assumption:
  f^n_t is statistically independent of any other factors, given its own history, iff 1 ≤ n ≤ J
  (accordingly, split f^{1:K} → f^{1:J} ++ f^{J+1:K}, 1 ≤ J < K)
- Resulting LCPM factorization:
  P(f^{1:K}_t | f^{1:K}_{t−M+1:t−1}) ≈ [∏_{i=1}^{J} P(f^i_t | f^i_{t−M+1:t−1})] · P(f^{J+1:K}_t | f^{1:J}_t, f^{1:K}_{t−M+1:t−1})
  (the second factor is the subject of Step 2)
SLIDE 6
LCPM’s Structure Learning Method - Step 1: Example
Given some application and a corpus annotated with multiple tags:
- 1. Admit the following tags are judged the most appropriate:
  - Part-of-speech (POS)
  - Semantic tag (ST)
  - Gender inflection (GI)
- 2. Assume that, of these three LFs, only ST can be predicted based solely on its own history:
  - ST → f^1
  - (POS, GI) → f^{2:3}
This yields the LCPM approximation:
  P(f^{1:3}_t | f^{1:3}_{t−M+1:t−1}) ≈ P(f^1_t | f^1_{t−M+1:t−1}) · P(f^{2:3}_t | f^1_t, f^{1:3}_{t−M+1:t−1})
SLIDE 7
LCPM’s Structure Learning Method - Step 2: Intro
- Goal: learn the structure of the statistical model that computes P(f^{J+1:K}_t | f^{1:J}_t, f^{1:K}_{t−M+1:t−1}); more precisely ...
- Determine automatically Z ⊂ f^{1:K}_{t−M+1:t−1} such that
  - |Z| is fixed and |Z| ≪ |f^{1:K}_{t−M+1:t−1}| (robustness constraint)
  - P(f^{J+1:K}_t | f^{1:J}_t, Z) approximates the original conditional probabilities according to Information Theory based criteria
Notation simplification (hereafter): X = f^{1:J}_t; Y = f^{J+1:K}_t; Z ⊂ W = f^{1:K}_{t−M+1:t−1}; → P(Y | X, Z)
SLIDE 8
LCPM’s SL Method - Step 2: Rules to determine Z
- Information Theory measures:
  - conditional entropy, H(Y|X)
  - conditional mutual information (CMI), I(Y; Z|X)
  - cross-context conditional mutual information (CCCMI), I_{X_l}(Y; Z|X_m)
- Possible/experimented rules (→ P(Y|X, Z) with Z ⊂ W; see the sketch after this slide):
  - To discard Z*: if I(Y; Z*|X) < η H(Y|X), then Z* is non-relevant
  - To determine Z*: Z* = argmax_{Z⊂W, |Z|=ζ} I(Y; Z|X)
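A minimal Python sketch of these two measures and the discard rule, computed from an enumerable joint table; the joint distribution and the threshold η are illustrative assumptions (the argmax rule would evaluate cond_mi for every Z ⊂ W with |Z| = ζ and keep the best):

from collections import defaultdict
from math import log2

def cond_entropy(joint):
    """H(B|A) from a table {(a, b): P(a, b)}; a may itself be a tuple."""
    pa = defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
    return -sum(p * log2(p / pa[a]) for (a, b), p in joint.items() if p > 0)

def cond_mi(joint_xyz):
    """I(Y;Z|X) = H(Y|X) - H(Y|X,Z), from a table {(x, y, z): P(x, y, z)}."""
    xy, xz_y = defaultdict(float), defaultdict(float)
    for (x, y, z), p in joint_xyz.items():
        xy[(x, y)] += p
        xz_y[((x, z), y)] += p
    return cond_entropy(xy) - cond_entropy(xz_y)

# Toy joint P(X, Y, Z*) for one candidate Z* (illustrative assumption)
joint = {("F", "A", "C"): 0.30, ("F", "B", "D"): 0.20,
         ("S", "A", "D"): 0.25, ("S", "B", "C"): 0.25}

# Discard rule: Z* is non-relevant if I(Y;Z*|X) < eta * H(Y|X)
eta = 0.05
xy = defaultdict(float)
for (x, y, z), p in joint.items():
    xy[(x, y)] += p
print(cond_mi(joint) < eta * cond_entropy(xy))  # False: this Z* is relevant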
SLIDE 9
LCPM’s SL Method - Step 2: Rules to determine Z (cont.)
- Rule to determine Z* using the “Utility” measure N_λ:
  Z* = argmax_{Z⊂W, |Z|=ζ} N_λ(Y; Z|X), 0 ≤ λ ≤ 1
  where N_λ(Y; Z|X) represents
  Σ_{X_m} P(X_m) [ I(Y; Z|X_m) − λ Σ_{X_l ≠ X_m} P(X_l) I_{X_l}(Y; Z|X_m) ]
  and I_{X_l}(Y; Z|X_m) represents
  Σ_Y Σ_Z P(Y, Z|X_l) log [ P(Y, Z|X_m) / (P(Y|X_m) P(Z|X_m)) ]
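A minimal Python sketch of the utility measure, assuming small enumerable joint tables; the toy joint below is an illustrative assumption (in it each context has disjoint (y, z) support, so the cross-context terms vanish and N_0 = N_1):

from collections import defaultdict
from math import log2

def cccmi(joint_xyz, xl, xm):
    """Cross-context CMI I_{Xl}(Y;Z|Xm): expectation under context xl of the
    log-ratio defined by context xm; cccmi(j, x, x) is the ordinary per-context
    CMI I(Y;Z|X=x). Input: {(x, y, z): P(x, y, z)}."""
    def conditionals(x0):
        px = sum(p for (x, y, z), p in joint_xyz.items() if x == x0)
        pyz, py, pz = defaultdict(float), defaultdict(float), defaultdict(float)
        for (x, y, z), p in joint_xyz.items():
            if x == x0:
                pyz[(y, z)] += p / px
                py[y] += p / px
                pz[z] += p / px
        return pyz, py, pz
    pyz_l, _, _ = conditionals(xl)
    pyz_m, py_m, pz_m = conditionals(xm)
    # Terms with zero probability under the reference context xm are skipped.
    return sum(p * log2(pyz_m[(y, z)] / (py_m[y] * pz_m[z]))
               for (y, z), p in pyz_l.items() if pyz_m[(y, z)] > 0)

def utility(joint_xyz, lam):
    """N_lambda(Y;Z|X) = sum_m P(Xm) [ I(Y;Z|Xm)
       - lam * sum_{l != m} P(Xl) * I_{Xl}(Y;Z|Xm) ]."""
    px = defaultdict(float)
    for (x, y, z), p in joint_xyz.items():
        px[x] += p
    return sum(px[m] * (cccmi(joint_xyz, m, m)
                        - lam * sum(px[l] * cccmi(joint_xyz, l, m)
                                    for l in px if l != m))
               for m in px)

joint = {("F", "A", "C"): 0.30, ("F", "B", "D"): 0.20,
         ("S", "A", "D"): 0.25, ("S", "B", "C"): 0.25}
print(utility(joint, 0.0), utility(joint, 1.0))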
SLIDE 10
LCPM’s SL Method - Step 2: Example
Problem: choose Z1 or Z2 to model P(Y|X, Z); X ∈ {F, S}, Y ∈ {A, B, U}, Z1 ∈ {C, D, V}, Z2 ∈ {E, F, W}
Data: P(X = F) = P(X = S)
“Utility” & solutions:
- N_0(Y; Z1|X) < N_0(Y; Z2|X) (near equality) ∴ λ = 0 ⇒ choose Z2
- N_1(Y; Z1|X) > N_1(Y; Z2|X) ∴ λ = 1 ⇒ choose Z1
SLIDE 11
LCPM’s SL Method - Step 2: Algorithm to define Z
Input: f^{1:K}_{t−M+1:t}, J, K, M, ζ, λ, γ, η, Data
Output: set of factors Z

for each z ∈ f^{1:K}_{t−M+1:t−1} do                      // factor relevance
    if I(f^{J+1:K}_t; z | f^{1:J}_t) < γ H(f^{J+1:K}_t | f^{1:J}_t) then
        remove z from f^{1:K}_{t−M+1:t−1}
    end
end
Sort f^{1:K}_{t−M+1:t−1} by descending order of N_λ(f^{J+1:K}_t; z | f^{1:J}_t)
Z ← ∅
repeat                                                   // factor redundancy
    z ← next non-processed element of f^{1:K}_{t−M+1:t−1}
    if I(f^{J+1:K}_t; z | f^{1:J}_t) > η I(z; r | f^{1:J}_t), ∀r ∈ Z then
        add z to Z
    end
until |Z| = ζ or all elements of f^{1:K}_{t−M+1:t−1} are processed
Output Z
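A compact runnable Python sketch of this selection loop; the estimator callbacks (rel, red, util) and the toy scores are assumptions standing in for the data-driven estimates of the information measures:

def select_factors(candidates, rel, red, h_y_x, util, zeta, gamma, eta):
    """Greedy sketch of the Step-2 algorithm above (estimators assumed):
    rel(z)    ~ I(f^{J+1:K}_t ; z | f^{1:J}_t)  relevance of candidate z
    red(z, r) ~ I(z ; r | f^{1:J}_t)            redundancy between z and kept r
    h_y_x     ~ H(f^{J+1:K}_t | f^{1:J}_t)
    util(z)   ~ N_lambda(f^{J+1:K}_t ; z | f^{1:J}_t)."""
    # Relevance filter: drop factors telling too little about the target factors
    kept = [z for z in candidates if rel(z) >= gamma * h_y_x]
    # Rank surviving candidates by decreasing utility
    kept.sort(key=util, reverse=True)
    # Redundancy filter: keep z only if no already-selected r subsumes it
    Z = []
    for z in kept:
        if all(rel(z) > eta * red(z, r) for r in Z):
            Z.append(z)
        if len(Z) == zeta:
            break
    return Z

# Toy usage with made-up estimator values (illustrative assumption):
scores = {"g_{t-1}": 0.9, "g_{t-2}": 0.6, "m_{t-1}": 0.5, "n_{t-2}": 0.1}
print(select_factors(list(scores), rel=scores.get,
                     red=lambda z, r: 0.3, h_y_x=1.0,
                     util=scores.get, zeta=2, gamma=0.2, eta=1.0))
# -> ['g_{t-1}', 'g_{t-2}']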
SLIDE 12
Preliminary Results
- Text corpus (vocabulary size ≈ 200K) whose annotations include:
  - m: part-of-speech (13 tags: ADJ, ADV, ...)
  - g: gender inflection (3 tags: M, F, N)
  - n: number inflection (3 tags: S, P, U)
- Task: select Z ⊂ W = {n_t, m_{t−1}, g_{t−1}, n_{t−1}, m_{t−2}, g_{t−2}, n_{t−2}, ...} maximizing the utility N_λ(g_t; Z|m_t) (→ P(g_t|m_t, Z))
- Results:
  Case             | λ | Z sorted by decreasing N_λ
  g = N and n = U  | 0 | {g_{t−1}, g_{t−2}, m_{t−1}, ...}
                   | 1 | {g_{t−1}, m_{t−1}, g_{t−2}, ...}
  Whole data       | 0 | {n_t, g_{t−1}, m_{t−1}, ...}
                   | 1 | {g_{t−1}, n_{t−2}, g_{t−2}, ...}
SLIDE 13
Conclusions
- Method for learning LCPM structure
- Guidelines: seek an accurate and simple structure (FLM approach: keep just the relevant and non-redundant factors and dependencies)
- Process:
  - Step 1: manually set the initial structure (linguistic knowledge)
  - Step 2: automatically “prune” the structure (data-driven algorithm based on Information Theory concepts)
- Preliminary results seem promising; larger experiments are needed to get conclusive results
SLIDE 14
References
- 1. J. Bilmes, “Natural Statistical Models for Automatic Speech Recognition”, PhD Thesis, International Computer Science Institute, Berkeley, CA, 1999.
- 2. K. Kirchhoff, J. Bilmes, and K. Duh, “Factored Language Model Tutorial”, Technical Report, Dept. of Electrical Engineering, University of Washington, 2008.
- 3. H. Schmid, “Improvements in Part-of-Speech Tagging with an Application to German”, Proc. ACL SIGDAT Workshop, Dublin, Ireland, 1995.
- 4. D. Santos and P. Rocha, “Evaluating CETEMPúblico, a free resource for Portuguese”, Proc. ACL, 2001.