
A semi-automatic structure learning method for language modeling

Vitor Pera
September 11, 2019

Faculdade de Engenharia da Universidade do Porto (FEUP)


Outline

  • Linguistic Classes Prediction Model (LCPM)
  • LCPM’s Structure Learning Method
  • Preliminary Results
  • Conclusions
  • References


Linguistic Classes Prediction Model (LCPM)

  • Multiclass-dependent N-gram (M > N > 1):

$$P(\omega_t \mid \omega_{1:t-1}) = \sum_{c_t \in C(\omega_t)} P(\omega_t \mid c_t, \omega_{1:t-1})\, P(c_t \mid \omega_{1:t-1}) \approx \sum_{c_t \in C(\omega_t)} P(\omega_t \mid c_t, \omega_{t-N+1:t-1})\, P(c_t \mid c_{t-M+1:t-1})$$

  • LCPM (FLM formalism):

$$P(c_t \mid c_{t-M+1:t-1}) \xrightarrow{\; c \,\leftrightarrow\, f^{1:K} \;} P(f^{1:K}_t \mid f^{1:K}_{t-M+1:t-1})$$

  • LCPM structure learning (goal):
  • accurate and simple
  • two-step method
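To make the decomposition above concrete, here is a minimal Python sketch (not part of the original slides) that evaluates the class-marginalized sum with toy probability tables; the words, classes, and probability values are all illustrative assumptions.

```python
# A minimal, self-contained sketch of the multiclass-dependent N-gram
# decomposition. All tables, words, and classes below are illustrative
# assumptions, not part of the presented work.

# P(w_t | c_t, w_{t-N+1:t-1}): word given its class and word history (N = 2)
P_WORD = {
    ("casa", "NOUN", ("a",)): 0.30,
    ("casa", "VERB", ("a",)): 0.02,
}

# P(c_t | c_{t-M+1:t-1}): class given the class history (M = 3)
P_CLASS = {
    ("NOUN", ("DET", "ADJ")): 0.60,
    ("VERB", ("DET", "ADJ")): 0.05,
}

# C(w): candidate classes of each word (a toy ambiguity set)
CLASSES = {"casa": ["NOUN", "VERB"]}

def p_next_word(word, word_hist, class_hist):
    """P(w_t | history) ~= sum over c_t in C(w_t) of
    P(w_t | c_t, w_{t-N+1:t-1}) * P(c_t | c_{t-M+1:t-1})."""
    return sum(P_WORD.get((word, c, word_hist), 0.0)
               * P_CLASS.get((c, class_hist), 0.0)
               for c in CLASSES.get(word, []))

print(p_next_word("casa", ("a",), ("DET", "ADJ")))  # 0.30*0.60 + 0.02*0.05 = 0.181
```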


LCPM’s Structure Learning Method - Step 1: Intro

  • Given
  • the need for an LCPM to compute $P(f^{1:K}_t \mid f^{1:K}_{t-M+1:t-1})$ (factors not yet known)
  • common knowledge of Linguistics
  • full knowledge of the specific language interface
  • Solve (non-automatically)
  • Which linguistic features should be used?
  • Which linguistic features exhibit some special statistical independence property?


LCPM’s Structure Learning Method - Step 1: Procedure

  • 1. Choose the linguistic features (→ $f^{1:K}$)
  • informative to model $P(\omega_t \mid f^{1:K}_t, \omega_{t-N+1:t-1})$
  • adequate to the data resources (annotation and robustness)
  • 2. Make the (credible) assumption: $f^n_t$ is statistically independent of any other factors, given its own history, iff $1 \le n \le J$ (accordingly, split $f^{1:K} \to f^{1:J} \mathbin{+\!+} f^{J+1:K}$, with $1 \le J < K$)

This yields the LCPM factorization, whose last factor is the target of Step 2 (a toy sketch follows below):

$$P(f^{1:K}_t \mid f^{1:K}_{t-M+1:t-1}) \approx \prod_{i=1}^{J} P(f^i_t \mid f^i_{t-M+1:t-1}) \cdot P(f^{J+1:K}_t \mid f^{1:J}_t,\, f^{1:K}_{t-M+1:t-1})$$
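As a concrete reading of this factorization, the sketch below evaluates the two-part product for a K = 3, J = 1 case; the function name, factor labels, history length (M = 2), and probabilities are invented for illustration.

```python
# A toy evaluation of the Step-1 factorization (illustrative values only).

def lcpm_prob(f_t, f_hist, own_models, joint_model, J):
    """P(f^{1:K}_t | f^{1:K}_{t-M+1:t-1}) ~=
    prod_{i<=J} P(f^i_t | f^i history) * P(f^{J+1:K}_t | f^{1:J}_t, full history)."""
    p = 1.0
    for i in range(J):
        own_hist = tuple(step[i] for step in f_hist)   # f^i_{t-M+1:t-1}
        p *= own_models[i].get((f_t[i], own_hist), 0.0)
    # Remaining factors, conditioned on f^{1:J}_t and the full factor history:
    p *= joint_model.get((tuple(f_t[J:]), tuple(f_t[:J]), f_hist), 0.0)
    return p

# K = 3, J = 1 (e.g. f^1 = ST, f^{2:3} = (POS, GI)); M = 2 for brevity.
f_hist = (("LOC", "DET", "M"),)                         # f^{1:3}_{t-1}
f_t = ("ORG", "NOUN", "F")                              # f^{1:3}_t
own_models = [{("ORG", ("LOC",)): 0.4}]                 # P(f^1_t | f^1_{t-1})
joint_model = {(("NOUN", "F"), ("ORG",), f_hist): 0.7}  # P(f^{2:3}_t | ...)
print(lcpm_prob(f_t, f_hist, own_models, joint_model, J=1))  # 0.4 * 0.7 = 0.28
```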

LCPM’s Structure Learning Method - Step 1: Example

Given some application and a corpus annotated with multiple tags:

  • 1. Admit the following tags are judged the most appropriate:
  • Part-of-speech (POS)
  • Semantic tag (ST)
  • Gender inflection (GI)
  • 2. Assume that, of these three LFs, only ST can be predicted based uniquely on its own history:
  • ST → $f^1$
  • (POS, GI) → $f^{2:3}$

This results in the LCPM approximation:

$$P(f^{1:3}_t \mid f^{1:3}_{t-M+1:t-1}) \approx P(f^1_t \mid f^1_{t-M+1:t-1})\, P(f^{2:3}_t \mid f^1_t,\, f^{1:3}_{t-M+1:t-1})$$

LCPM’s Structure Learning Method - Step 2: Intro

  • Goal: learn the structure of a statistical model to compute $P(f^{J+1:K}_t \mid f^{1:J}_t,\, f^{1:K}_{t-M+1:t-1})$; more precisely ...
  • automatically determine $Z \subset f^{1:K}_{t-M+1:t-1}$ such that
  • $|Z|$ is fixed and $|Z| \ll |f^{1:K}_{t-M+1:t-1}|$ (robustness constraint)
  • and $P(f^{J+1:K}_t \mid f^{1:J}_t, Z)$ approximates the original conditional probabilities according to Information Theory based criteria

Notation simplification (hereafter): $X = f^{1:J}_t$; $Y = f^{J+1:K}_t$; $Z \subset W = f^{1:K}_{t-M+1:t-1}$; the target becomes $P(Y \mid X, Z)$.

LCPM’s SL Method - Step 2: Rules to determine Z

  • Information Theory measures
  • conditional entropy, $H(Y \mid X)$
  • conditional mutual information (CMI), $I(Y; Z \mid X)$
  • cross-context conditional mutual information (CCCMI), $I_{X_l}(Y; Z \mid X_m)$
  • Possible/experimented rules (→ $P(Y \mid X, Z)$ with $Z \subset W$)
  • to discard $Z^*$: if $I(Y; Z^* \mid X) < \eta\, H(Y \mid X)$, then $Z^*$ is non-relevant
  • to determine $Z^*$:

$$Z^* = \operatorname*{argmax}_{Z \subset W,\; |Z| = \zeta} \{\, I(Y; Z \mid X) \,\}$$
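A small sketch of how the relevance rule might be applied, computing $H(Y \mid X)$ and $I(Y; Z \mid X)$ from an explicit joint distribution; the joint table and the threshold value are made-up illustrations.

```python
import math
from collections import defaultdict

def marginal(p_xyz, idx):
    """Marginalize a joint table P(x, y, z) onto the given key positions."""
    m = defaultdict(float)
    for key, p in p_xyz.items():
        m[tuple(key[i] for i in idx)] += p
    return m

def h_y_given_x(p_xyz):
    """Conditional entropy H(Y|X) = -sum_{x,y} P(x,y) log2 P(y|x)."""
    p_xy, p_x = marginal(p_xyz, (0, 1)), marginal(p_xyz, (0,))
    return -sum(p * math.log2(p / p_x[(x,)]) for (x, y), p in p_xy.items() if p > 0)

def cmi(p_xyz):
    """I(Y;Z|X) = sum P(x,y,z) log2 [ P(y,z|x) / (P(y|x) P(z|x)) ]."""
    p_xy, p_xz = marginal(p_xyz, (0, 1)), marginal(p_xyz, (0, 2))
    p_x = marginal(p_xyz, (0,))
    return sum(p * math.log2(p * p_x[(x,)] / (p_xy[(x, y)] * p_xz[(x, z)]))
               for (x, y, z), p in p_xyz.items() if p > 0)

# Toy joint distribution P(X, Y, Z) and threshold (illustrative numbers):
p = {("F", "A", "C"): 0.25, ("F", "B", "D"): 0.25,
     ("S", "A", "D"): 0.25, ("S", "B", "C"): 0.25}
eta = 0.1
print(cmi(p), h_y_given_x(p))          # 1.0 bit each for this table
print(cmi(p) >= eta * h_y_given_x(p))  # True: this Z would not be discarded
```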

LCPM’s SL Method - Step 2: Rules to determine Z (cont.)

  • Rule to determine $Z^*$ using the “Utility” measure $N_\lambda$:

$$Z^* = \operatorname*{argmax}_{Z \subset W,\; |Z| = \zeta} \{\, N_\lambda(Y; Z \mid X) \,\}, \quad 0 < \lambda \le 1$$

where $N_\lambda(Y; Z \mid X)$ represents

$$\sum_{X_m} P(X_m) \Big[ I(Y; Z \mid X_m) - \lambda \sum_{X_l \ne X_m} P(X_l)\, I_{X_l}(Y; Z \mid X_m) \Big]$$

and $I_{X_l}(Y; Z \mid X_m)$ represents

$$\sum_{Y} \sum_{Z} P(Y, Z \mid X_l) \log \frac{P(Y, Z \mid X_m)}{P(Y \mid X_m)\, P(Z \mid X_m)}$$
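The sketch below turns these definitions into code. It reads $I(Y; Z \mid X_m)$ as the CCCMI with $l = m$, which follows from the definition above, so one function covers both terms of $N_\lambda$; the joint table is again an invented illustration.

```python
import math
from collections import defaultdict

def context_dists(p_xyz):
    """From a joint P(x,y,z): return P(x), P(y,z|x), P(y|x), P(z|x)."""
    p_x = defaultdict(float)
    for (x, y, z), p in p_xyz.items():
        p_x[x] += p
    p_yz = {(x, y, z): p / p_x[x] for (x, y, z), p in p_xyz.items()}
    p_y, p_z = defaultdict(float), defaultdict(float)
    for (x, y, z), p in p_yz.items():
        p_y[(x, y)] += p
        p_z[(x, z)] += p
    return p_x, p_yz, p_y, p_z

def cccmi(p_yz, p_y, p_z, xl, xm):
    """I_{X_l}(Y;Z|X_m): expectation under context X_l of the MI
    log-ratio measured in context X_m (with l = m this is I(Y;Z|X_m))."""
    total = 0.0
    for (x, y, z), p in p_yz.items():
        if x != xl or p <= 0:
            continue
        num = p_yz.get((xm, y, z), 0.0)
        den = p_y.get((xm, y), 0.0) * p_z.get((xm, z), 0.0)
        if num > 0 and den > 0:
            total += p * math.log2(num / den)
    return total

def utility(p_xyz, lam):
    """N_lambda(Y;Z|X) = sum_m P(X_m) [ I(Y;Z|X_m)
    - lam * sum_{l != m} P(X_l) I_{X_l}(Y;Z|X_m) ]."""
    p_x, p_yz, p_y, p_z = context_dists(p_xyz)
    return sum(pm * (cccmi(p_yz, p_y, p_z, xm, xm)
                     - lam * sum(pl * cccmi(p_yz, p_y, p_z, xl, xm)
                                 for xl, pl in p_x.items() if xl != xm))
               for xm, pm in p_x.items())

p = {("F", "A", "C"): 0.25, ("F", "B", "D"): 0.25,
     ("S", "A", "D"): 0.25, ("S", "B", "C"): 0.25}
print(utility(p, lam=0.0), utility(p, lam=1.0))
```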

LCPM’s SL Method - Step 2: Example

Problem: choose Z1 or Z2 to model $P(Y \mid X, Z)$; $X \in \{F, S\}$, $Y \in \{A, B, U\}$, $Z_1 \in \{C, D, V\}$, $Z_2 \in \{E, F, W\}$.

Data: $P(X = F) = P(X = S)$.

“Utility” & solutions:

  • $N_0(Y; Z_1 \mid X) < N_0(Y; Z_2 \mid X)$ (near equality) ∴ $\lambda = 0 \Rightarrow$ choose $Z_2$
  • $N_1(Y; Z_1 \mid X) > N_1(Y; Z_2 \mid X)$ ∴ $\lambda = 1 \Rightarrow$ choose $Z_1$


LCPM’s SL Method - Step 2: Algorithm to define Z

Input: $f^{1:K}_{t-M+1:t}$, J, K, M, ζ, λ, γ, η, Data
Output: set of factors Z

for each z ∈ $f^{1:K}_{t-M+1:t-1}$ do                      // factor relevance
    if $I(f^{J+1:K}_t; z \mid f^{1:J}_t) < \gamma\, H(f^{J+1:K}_t \mid f^{1:J}_t)$ then
        remove z from $f^{1:K}_{t-M+1:t-1}$
    end
end
sort $f^{1:K}_{t-M+1:t-1}$ by descending order of $N_\lambda(f^{J+1:K}_t; z \mid f^{1:J}_t)$
Z ← ∅
repeat                                                     // factor redundancy
    z ← next non-processed element of $f^{1:K}_{t-M+1:t-1}$
    if $I(f^{J+1:K}_t; z \mid f^{1:J}_t) > \eta\, I(z; r \mid f^{1:J}_t)$ for all r ∈ Z then
        add z to Z
    end
until |Z| = ζ or all elements of $f^{1:K}_{t-M+1:t-1}$ are processed
output Z
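The pseudocode above could be rendered in Python roughly as follows. The information measures are passed in as callables, since the deck leaves their estimation to the data; the function name, parameter names, and the toy scores in the usage are all invented for illustration.

```python
# A runnable sketch of the selection algorithm. The measures I(.;.|.),
# H(.|.) and N_lambda are supplied as callables; every score below is
# a made-up illustration, not an estimate from real data.

def select_factors(candidates, cmi, h_y_given_x, utility, pair_cmi,
                   zeta, gamma, eta):
    """Greedy choice of Z from the candidate history factors.

    cmi(z)        -> I(f^{J+1:K}_t; z | f^{1:J}_t)  relevance of z
    h_y_given_x() -> H(f^{J+1:K}_t | f^{1:J}_t)     scale of the relevance test
    utility(z)    -> N_lambda(...)                  ranking score (lambda fixed inside)
    pair_cmi(z,r) -> I(z; r | f^{1:J}_t)            redundancy between candidates
    """
    # Relevance filter: drop factors carrying too little information on Y.
    threshold = gamma * h_y_given_x()
    survivors = [z for z in candidates if cmi(z) >= threshold]
    # Rank the survivors by decreasing utility.
    survivors.sort(key=utility, reverse=True)
    # Redundancy filter: keep z only if it says more about Y than it
    # shares with every factor already selected.
    selected = []
    for z in survivors:
        if all(cmi(z) > eta * pair_cmi(z, r) for r in selected):
            selected.append(z)
        if len(selected) == zeta:
            break
    return selected

# Toy usage with hand-made scores:
scores = {"g_t-1": 0.9, "m_t-1": 0.6, "g_t-2": 0.55, "n_t": 0.1}
z = select_factors(list(scores), cmi=scores.get, h_y_given_x=lambda: 1.0,
                   utility=scores.get, pair_cmi=lambda a, b: 0.4,
                   zeta=2, gamma=0.2, eta=1.0)
print(z)  # ['g_t-1', 'm_t-1']
```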


Preliminary Results

  • Text corpus (vocabulary size ≈ 200K) whose annotations include:
  • m - part-of-speech (13 tags: ADJ, ADV, ...)
  • g - gender inflection (3 tags: M, F, N)
  • n - number inflection (3 tags: S, P, U)

Task: select $Z \subset W = \{n_t, m_{t-1}, g_{t-1}, n_{t-1}, m_{t-2}, g_{t-2}, n_{t-2}, \dots\}$ maximizing the Utility $N_\lambda(g_t; Z \mid m_t)$ (→ $P(g_t \mid m_t, Z)$)

  • Results

    Case             λ   Z sorted by decreasing N_λ
    g = N and n = U  0   {g_{t−1}, g_{t−2}, m_{t−1}, ...}
    g = N and n = U  1   {g_{t−1}, m_{t−1}, g_{t−2}, ...}
    Whole data       0   {n_t, g_{t−1}, m_{t−1}, ...}
    Whole data       1   {g_{t−1}, n_{t−2}, g_{t−2}, ...}

Conclusions

  • Method for learning the LCPM structure
  • Guidelines: seek an accurate and simple structure (FLM approach: keep just the relevant and non-redundant factors and dependencies)
  • Process:
  • Step 1 - manually set the initial structure (linguistic knowledge)
  • Step 2 - automatically “prune” the structure (data-driven algorithm based on Information Theory concepts)
  • Preliminary results seem promising; larger experiments are needed to get conclusive results


References

  1. J. Bilmes, “Natural Statistical Models for Automatic Speech Recognition”, PhD Thesis, International Computer Science Institute, Berkeley, CA, 1999.
  2. K. Kirchhoff, J. Bilmes, and K. Duh, “Factored Language Model Tutorial”, Tech. Report, Dept. of Electrical Engineering, Univ. of Washington, 2008.
  3. H. Schmid, “Improvements in Part-of-Speech Tagging with an Application to German”, Proc. ACL SIGDAT Workshop, Dublin, Ireland, 1995.
  4. D. Santos and P. Rocha, “Evaluating CETEMPúblico, a free resource for Portuguese”, Proc. 39th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 2001.