

SLIDE 1

Nantes, November 2013 1

Learning probabilistic finite automata

Colin de la Higuera University of Nantes

SLIDE 2

Acknowledgements

Laurent Miclet, Jose Oncina, Tim Oates, Rafael Carrasco, Paco Casacuberta, Rémi Eyraud, Philippe Ezequel, Henning Fernau, Thierry Murgue, Franck Thollard, Enrique Vidal, Frédéric Tantini,...

The list is necessarily incomplete; apologies to those who have been forgotten.

http://pagesperso.lina.univ-nantes.fr/~cdlh/slides/ Chapters 5 and 16

SLIDE 3

Outline

1. PFA
2. Distances between distributions
3. FFA
4. Basic elements for learning PFA
5. ALERGIA
6. MDI and DSAI
7. Open questions

SLIDE 4

1 PFA

Probabilistic finite (state) automata

SLIDE 5

Practical motivations

(Computational biology, speech recognition, web services, automatic translation, image processing …)

- A lot of positive data
- Not necessarily any negative data
- No ideal target
- Noise

SLIDE 6

The grammar induction problem, revisited

The data consists of positive strings, «generated» following an unknown distribution.

The goal is now to find (learn) this distribution, or the grammar/automaton that is used to generate the strings.

SLIDE 7

Success of the probabilistic models

- n-grams
- Hidden Markov Models
- Probabilistic grammars

SLIDE 8

[Figure: a DPFA over {a, b} with fractional transition and halting probabilities (1/2, 1/3, 2/3, 3/4, …)]

DPFA: Deterministic Probabilistic Finite Automaton

SLIDE 9

[Figure: the same DPFA, with the path reading abab highlighted]

PrA(abab) = 3/4 × 2/3 × 1/3 × 1/2 × 1/2 = 1/24

SLIDE 10

[Figure: a DPFA over {a, b} with decimal transition probabilities 0.65, 0.35, 0.9, 0.7, 0.3, 0.7 and halting probabilities 0.1 and 0.3]

SLIDE 11

[Figure: a non-deterministic automaton over {a, b} with fractional probabilities]

PFA: Probabilistic Finite (state) Automaton

SLIDE 12

[Figure: the same automaton with some transitions labelled ε]

ε-PFA: Probabilistic Finite (state) Automaton with ε-transitions

SLIDE 13

How useful are these automata?

- They can define a distribution over Σ*
- They do not tell us if a string belongs to a language
- They are good candidates for grammar induction
- There is (was?) not that much written theory

SLIDE 14

Basic references

- The HMM literature
- Azaria Paz 1973: Introduction to Probabilistic Automata
- Chapter 5 of my book
- Probabilistic Finite-State Machines, by Vidal, Thollard, cdlh, Casacuberta & Carrasco
- Grammatical Inference papers

SLIDE 15

Automata, definitions

Let D be a distribution over Σ*

0≤PrD(w)≤1

∑w∈Σ* PrD(w)=1

SLIDE 16

A Probabilistic Finite (state) Automaton is a tuple <Q, Σ, IP, FP, δP>:

- Q: a set of states
- IP: Q → [0;1], the initial probabilities
- FP: Q → [0;1], the final (halting) probabilities
- δP: Q × Σ × Q → [0;1], the transition probabilities

SLIDE 17

What does a PFA do?

It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities:

PrA(w) = ∑πi∈paths(w) Pr(πi)

where πi = qi0 ai1 qi1 ai2 … ain qin and

Pr(πi) = IP(qi0) · FP(qin) · ∏j δP(qij−1, aij, qij)

Note that if λ-transitions are allowed, the sum may be infinite.

SLIDE 18

Pr(aba) = 0.7×0.4×0.1×1 + 0.7×0.4×0.45×0.2 = 0.028 + 0.0252 = 0.0532

[Figure: a non-deterministic PFA over {a, b} with transition probabilities 0.45, 0.35, 0.4, 0.7, 0.3, 0.1, 0.4 and halting probabilities 0.2, 0.1, 1]

SLIDE 19

- A non-deterministic PFA: possibly many initial states
- A λ-PFA: a PFA with λ-transitions and perhaps many initial states
- A DPFA: a deterministic PFA (only one initial state)

SLIDE 20

Consistency

A PFA is consistent if:

PrA(Σ*) = 1
∀x∈Σ*, 0 ≤ PrA(x) ≤ 1

SLIDE 21

Consistency theorem

A is consistent if every state is useful (accessible and co-accessible) and, ∀q∈Q, FP(q) + ∑q'∈Q,a∈Σ δP(q,a,q') = 1
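The condition of the theorem can be checked mechanically. A minimal sketch, assuming a dictionary-based representation (`final[q]` for FP(q), `trans[q]` as a list of (symbol, next state, probability) triples; this representation is an assumption for illustration, not from the slides):

```python
# Consistency check for a PFA given as:
#   final[q] = FP(q)
#   trans[q] = list of (symbol, q_next, probability) triples
def is_consistent(final, trans, tol=1e-9):
    for q in final:
        # Sum of all outgoing transition probabilities of state q
        outgoing = sum(p for (_a, _q2, p) in trans.get(q, []))
        # The theorem's condition: FP(q) + sum of outgoing = 1
        if abs(final[q] + outgoing - 1.0) > tol:
            return False
    return True

# A one-state DPFA: halt with probability 0.3, loop on 'a' with 0.7
final = {0: 0.3}
trans = {0: [('a', 0, 0.7)]}
print(is_consistent(final, trans))  # True
```

Note that the code only checks the local sum condition; usefulness of every state (accessibility and co-accessibility) would need a separate graph traversal.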

SLIDE 22

Equivalence between models

There is an equivalence between PFA and HMM… but HMM usually define distributions over each Σ^n.
SLIDE 23

[Figure: a football HMM whose states emit win/draw/lose with probabilities such as 1/4, 1/2, 3/4]

A football HMM

SLIDE 24

Equivalence between PFA with λ-transitions and PFA without λ-transitions

(cdlh 2003, Hanneforth & cdlh 2009)

Many initial states can be transformed into one initial state with λ-transitions; λ-transitions can be removed in polynomial time.

Strategy: number the states, eliminate λ-loops first, then the transitions with the highest-ranking arrival state.

SLIDE 25

PFA are strictly more powerful than DPFA

Folk theorem. And you can't even tell in advance if you are in a good case or not (see Denis & Esposito 2004).

SLIDE 26

Example

[Figure: a two-state PFA over {a} with fractional probabilities 1/2, 1/3, 2/3]

This distribution cannot be modelled by a DPFA.

SLIDE 27

What does a DPFA over Σ = {a} look like?

[Figure: a chain of states linked by a-transitions]

And with this architecture you cannot generate the previous one.

SLIDE 28

Parsing issues

Computation of the probability of a string or of a set of strings.

Deterministic case: simple, just apply the definitions. Technically, rather sum up logs: this is easier, safer and cheaper.

SLIDE 29

Pr(aba) = 0.7×0.9×0.35×0 = 0
Pr(abb) = 0.7×0.9×0.65×0.3 = 0.12285

[Figure: the DPFA of slide 10]

SLIDE 30

Non-deterministic case:

Pr(aba) = 0.7×0.4×0.1×1 + 0.7×0.4×0.45×0.2 = 0.028 + 0.0252 = 0.0532

[Figure: the non-deterministic PFA of slide 18]

SLIDE 31

In the literature

The computation of the probability of a string is done by dynamic programming, in O(n²m).

Two algorithms: Backward and Forward.

If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm.

SLIDE 32

Forward algorithm

A[i,j] = Pr(qi | a1..aj)

(the probability of being in state qi after having read a1..aj)

A[i,0] = IP(qi)
A[i,j+1] = ∑k≤|Q| A[k,j] · δP(qk, aj+1, qi)
Pr(a1..an) = ∑k≤|Q| A[k,n] · FP(qk)
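The three recurrence lines above translate directly into code. A sketch, assuming IP, FP and δP are given as dictionaries (this representation is an illustrative assumption):

```python
def forward_probability(w, states, init, final, trans):
    """Pr_A(w) by the Forward algorithm.
    init[q] = IP(q), final[q] = FP(q),
    trans[(q, a, q2)] = transition probability (missing = 0)."""
    # A[q] = probability of being in q after reading a prefix of w
    A = {q: init.get(q, 0.0) for q in states}
    for a in w:
        # A[i,j+1] = sum_k A[k,j] * deltaP(qk, a, qi)
        A = {q: sum(A[k] * trans.get((k, a, q), 0.0) for k in states)
             for q in states}
    # Pr(w) = sum_k A[k,n] * FP(qk)
    return sum(A[q] * final.get(q, 0.0) for q in states)

# Two-state example: from state 0, read 'a' and either loop (0.5) or
# move to the halting state 1 (0.5)
states = [0, 1]
init = {0: 1.0}
final = {1: 1.0}
trans = {(0, 'a', 0): 0.5, (0, 'a', 1): 0.5}
print(forward_probability('a', states, init, final, trans))  # 0.5
```

Following the slide's earlier advice, a production version would rather accumulate log-probabilities to avoid underflow on long strings.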

SLIDE 33

2 Distances

What for?

- Estimate the quality of a language model
- Have an indicator of the convergence of learning algorithms
- Construct kernels

SLIDE 34

2.1 Entropy

How many bits do we need to correct our model?

Two distributions over Σ*: D and D'. The Kullback-Leibler divergence (or relative entropy) between D and D' is:

∑w∈Σ* PrD(w) × (log PrD(w) − log PrD'(w))
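When both distributions are given explicitly on a finite support, the sum can be evaluated directly. A small sketch (dictionary inputs are an assumption for illustration):

```python
import math

def kl_divergence(pr_d, pr_dprime):
    """KL divergence sum_w PrD(w) * (log PrD(w) - log PrD'(w)),
    in bits, over the finite support of D.
    pr_d, pr_dprime: dicts mapping string -> probability."""
    total = 0.0
    for w, p in pr_d.items():
        if p > 0.0:
            # Each term compares the code lengths under D and D'
            total += p * (math.log2(p) - math.log2(pr_dprime[w]))
    return total

d = {'a': 0.5, 'b': 0.5}
print(kl_divergence(d, d))  # 0.0, identical distributions
```

The divergence is infinite as soon as D gives positive probability to a string that D' gives probability zero, which is why smoothing matters in practice.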

SLIDE 35

2.2 Perplexity

The idea is to allow the computation of the divergence, but relative to a test set S.

An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set.

SLIDE 36

∏w∈S PrD(w)^(−1/|S|) = 1 / ( ∏w∈S PrD(w) )^(1/|S|)

Problem if some probability is null...
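The formula above can be computed in log-space to avoid underflow. A sketch, where the model is passed as a callable (an assumption for illustration):

```python
import math

def perplexity(test_set, pr_d):
    """Inverse geometric mean of the model's probabilities on S.
    Returns infinity if any probability is zero, which is exactly
    the problem mentioned above."""
    log_sum = 0.0
    for w in test_set:
        p = pr_d(w)
        if p == 0.0:
            return math.inf
        log_sum += math.log(p)
    # exp(-(1/|S|) * sum of logs) = (product of probabilities)^(-1/|S|)
    return math.exp(-log_sum / len(test_set))

# A model giving every test string probability 0.5: perplexity ~2.0
print(perplexity(['a', 'b', 'a'], lambda w: 0.5))
```

Lower perplexity means the model assigns more probability mass to the test strings.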

SLIDE 37

Why multiply? (1)

We are trying to compute the probability of independently drawing the different strings in the set S.

SLIDE 38

Why multiply? (2)

Suppose we have two predictors for a coin toss:

Predictor 1: heads 60%, tails 40%
Predictor 2: heads 100%

The tests are H: 6, T: 4. Arithmetic mean:

P1: 0.6×0.6 + 0.4×0.4 = 0.36 + 0.16 = 0.52
P2: 0.6

Predictor 2 would be the better predictor ;-)

SLIDE 39

2.3 Distance d2

d2(D, D') = ∑w∈Σ* (PrD(w) − PrD'(w))²

Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002). This also means that equivalence of PFA is in P.
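The exact polynomial-time algorithm of Carrasco & cdlh works on the PFA structure directly; as a simpler illustration of what the quantity measures, here is a brute-force approximation that truncates the sum at a maximum string length (the callable inputs and the cut-off are assumptions of this sketch, not the cited algorithm):

```python
from itertools import product

def d2_truncated(pr_d, pr_dprime, alphabet, max_len):
    """Approximates d2(D, D') by summing the squared differences
    over all strings of length <= max_len."""
    total = 0.0
    for n in range(max_len + 1):
        for w in product(alphabet, repeat=n):
            diff = pr_d(w) - pr_dprime(w)
            total += diff * diff
    return total

# Geometric distribution over a*: Pr(a^n) = 0.5^(n+1)
pr = lambda w: 0.5 ** (len(w) + 1)
print(d2_truncated(pr, pr, ('a',), 10))  # 0.0, identical distributions
```

Since probabilities decay with length for a consistent PFA, the truncation error vanishes as `max_len` grows.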

SLIDE 40

3 FFA

Frequency Finite (state) Automata

SLIDE 41

A learning sample

A learning sample is a multiset: strings appear with a frequency (or multiplicity).

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}

SLIDE 42

DFFA

A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state, every transition, and for entering the initial state, such that the sum of what enters a state is equal to the sum of what exits it, and the sum of what halts is equal to what starts.

SLIDE 43

Example

[Figure: a three-state DFFA with initial frequency 6, transition frequencies a:1, a:2, a:5, b:3, b:4, b:5 and halting frequencies 1, 2, 3]

SLIDE 44

From a DFFA to a DPFA

[Figure: the same automaton with relative frequencies such as 3/13, 1/6, 2/7, b:5/13, b:3/6, a:1/7, a:2/6, a:5/13, b:4/7, 6/6]

Frequencies become relative frequencies by dividing by the sum of exiting frequencies.
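The normalization step is a one-liner per state. A sketch, assuming a dict-based DFFA representation (halting counts and per-symbol (count, target) pairs; the representation is illustrative, not from the slides):

```python
def dffa_to_dpfa(final_freq, trans_freq):
    """Turn frequencies into relative frequencies.
    final_freq[q] = halting count of state q
    trans_freq[q] = {symbol: (count, q_next)}"""
    final_p, trans_p = {}, {}
    for q in final_freq:
        # Total mass leaving q: halts plus all outgoing transitions
        total = final_freq[q] + sum(c for c, _ in trans_freq.get(q, {}).values())
        final_p[q] = final_freq[q] / total
        trans_p[q] = {a: (c / total, q2)
                      for a, (c, q2) in trans_freq.get(q, {}).items()}
    return final_p, trans_p

# One state visited 12 times: 6 halts, 2 a-loops, 4 b-loops
final_p, trans_p = dffa_to_dpfa({0: 6}, {0: {'a': (2, 0), 'b': (4, 0)}})
print(final_p[0])  # 0.5
```

By construction each state's probabilities sum to 1, so the resulting DPFA is consistent.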

SLIDE 45

From a DFA and a sample to a DFFA

[Figure: the DFFA obtained by counting the paths of the sample through the DFA]

S = {λ, aaaa, ab, babb, bbbb, bbbbaa}

SLIDE 46

Note

Another sample may lead to the same DFFA. Doing the same with an NFA is a much harder problem: typically what the Baum-Welch (EM) algorithm has been invented for…

SLIDE 47

The frequency prefix tree acceptor

The data is a multiset. The FTA is the smallest tree-like FFA consistent with the data. It can be transformed into a PFA if needed.

SLIDE 48

From the sample to the FTA

[Figure: FTA(S), a frequency prefix tree with root frequency 14, root halting frequency 3, and edges such as a:7, a:6, a:4, a:2, b:4, b:4, b:2]

S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}
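Building the FTA is a single pass over the multiset. A minimal sketch using nested lists as tree nodes (the node encoding is an assumption of this sketch):

```python
def build_fta(multisample):
    """Frequency prefix tree acceptor.
    Node = [halt_count, {symbol: [edge_count, child_node]}]"""
    root = [0, {}]
    for w, freq in multisample.items():
        node = root
        for a in w:
            if a not in node[1]:
                node[1][a] = [0, [0, {}]]
            node[1][a][0] += freq          # edge frequency
            node = node[1][a][1]
        node[0] += freq                    # halting frequency
    return root

fta = build_fta({'': 3, 'aaa': 4, 'aaba': 2, 'ababa': 1, 'bb': 3, 'bbaaa': 1})
print(fta[0])            # 3: strings halting at the root (λ)
print(fta[1]['a'][0])    # 7: strings starting with a (aaa, aaba, ababa)
```

These counts match the slide's figure: the root sees all 14 strings, of which 3 halt immediately and 7 start with a.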

SLIDE 49

Red, Blue and White states

[Figure: a partially-built automaton whose states are coloured Red, Blue or White]

- Red states are confirmed states
- Blue states are the (non-Red) successors of the Red states
- White states are the others

Same as with DFA and what RPNI does.

SLIDE 50

Merge and fold

[Figure: an FFA with total frequency 100 and states λ, b, a; suppose we decide to merge one state with another]

SLIDE 51

Merge and fold

[Figure: the same FFA; first disconnect the merged state and reconnect its incoming transition to the target state]

SLIDE 52

Merge and fold

[Figure: the same FFA; then fold the subtree of the merged state into the target]

SLIDE 53

Merge and fold

[Figure: the FFA after folding, with total 100, frequencies 60, 10, 9, 10, 11, 4 and transitions a:26, a:4, a:10, a:10, b:24, b:9, b:30]
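The redirect-then-fold steps of the last four slides can be sketched on a trie of frequency counts (a simplified illustration; the slides' automata are more general than this tree-shaped version):

```python
class Node:
    """FFA state: halting count plus outgoing [count, child] per symbol."""
    def __init__(self, final=0):
        self.final = final
        self.trans = {}   # symbol -> [count, Node]

def fold(p, q):
    """Recursively fold the subtree rooted at q into p."""
    p.final += q.final
    for a, (c, qc) in q.trans.items():
        if a in p.trans:
            p.trans[a][0] += c        # add the edge frequencies
            fold(p.trans[a][1], qc)   # and keep folding below
        else:
            p.trans[a] = [c, qc]      # graft the missing branch

def merge_and_fold(parent, a, p):
    """Redirect parent --a--> q onto p, then fold q into p."""
    c, q = parent.trans[a]
    parent.trans[a] = [c, p]
    fold(p, q)

# Merge the a-successor of the root into the root itself (creates a loop)
root, x, y, z = Node(1), Node(2), Node(1), Node(2)
root.trans['a'] = [3, x]; root.trans['b'] = [2, z]; x.trans['b'] = [1, y]
merge_and_fold(root, 'a', root)
print(root.final)           # 3: the halts of x were folded into the root
print(root.trans['b'][0])   # 3: b-edge frequencies 2 and 1 were added
```

Frequencies are added, never renormalized, so the result is again a well-formed FFA.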

SLIDE 54

State merging algorithm

A = FTA(S); Blue = {δ(qI, a): a ∈ Σ}; Red = {qI}
while Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red: d(Ap, Aq) is small
        then A = merge_and_fold(A, p, q)
        else Red = Red ∪ {q}
    Blue = {δ(q, a): q ∈ Red} − Red

SLIDE 55

The real question

How do we decide if d(Ap, Aq) is small? We need to:
- use a distance…
- be able to compute this distance
- if possible, update the computation easily
- have properties related to this distance

SLIDE 56

Deciding if two distributions are similar

If the two distributions are known, equality can be tested. The distance (L2 norm) between distributions can be exactly computed. But what if the two distributions are unknown?

SLIDE 57

Taking decisions

[Figure: the FFA of slide 50; suppose we want to merge one of its states with another]

SLIDE 58

Taking decisions

Yes, if the two distributions induced are similar.

[Figure: the two subautomata rooted at the candidate states, with frequencies 9, 11, 4 and transitions a:4, a:4, b:24, b:9]

SLIDE 59

5 Alergia

SLIDE 60

Alergia's test

D1 ≈ D2 if ∀x PrD1(x) ≈ PrD2(x)

Easier to test:
PrD1(λ) = PrD2(λ)
∀a∈Σ, PrD1(aΣ*) = PrD2(aΣ*)

And do this recursively! Of course, do it on frequencies.

SLIDE 61

Hoeffding bounds

γ indicates whether the relative frequencies f1/n1 and f2/n2 are sufficiently close:

γ = | f1/n1 − f2/n2 |

Accept the merge when

γ < √(½ ln(2/α)) · (1/√n1 + 1/√n2)
SLIDE 62

A run of Alergia: our learning multisample

S = {λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}

SLIDE 63

Parameter α is arbitrarily set to 0.05. We choose 30 as the value for threshold t0.

Note that for the Blue states that have a frequency less than the threshold, a special merging operation takes place.

SLIDE 64

[Figure: the FTA of the multisample, with total frequency 1000, root halting frequency 490, edges a:257 and b:253 from the root, and further nodes down to depth 5]

SLIDE 65

Can we merge λ and a?

Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:
490/1000 with 128/257, 257/1000 with 64/257, 253/1000 with 65/257, ...

All tests return true.

SLIDE 66

Merge…

[Figure: the FTA with states λ and a being merged]

SLIDE 67

And fold.

[Figure: the automaton after folding, with total frequency 1000, frequencies 660, 52, 225 and transitions a:341, a:77, b:340, b:38, a:16, a:10, b:9, b:8]

SLIDE 68

Next merge? λ with b?

[Figure: the same automaton, with the b-successor of the root highlighted]

SLIDE 69

Can we merge λ and b?

Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*:
660/1341 and 225/340 are different (giving γ = 0.162).

On the other hand, the bound is

√(½ ln(2/α)) · (1/√n1 + 1/√n2) = 0.111

SLIDE 70

Promotion.

[Figure: state b is promoted to Red; the automaton is otherwise unchanged]

SLIDE 71

Merge.

[Figure: the next Blue state is merged with a Red state]
SLIDE 72

And fold.

[Figure: the automaton after folding, with total 1000, frequencies 660, 291 and transitions a:341, a:95, b:340, b:49, a:11, b:9]

SLIDE 73

Merge.

[Figure: another merge, on the automaton with frequencies 660, 225 and transitions a:341, a:95, b:340, b:49, a:11]

SLIDE 74

And fold.

[Figure: the final DFFA, with total 1000, frequencies 698 and 302, and transitions a:354, a:96, b:351, b:49]

As a PFA:

[Figure: the same automaton with relative frequencies .698, .302, a:.354, a:.096, b:.351, b:.049]

SLIDE 75

Conclusion and logic

Alergia builds a DFFA in polynomial time. Alergia can identify DPFA in the limit with probability 1. There is no good characterisation of Alergia's properties.

SLIDE 76

6 DSAI and MDI

Why not change the criterion?

SLIDE 77

Criterion for DSAI

Use a distinguishing string; use the L∞ norm. Two distributions are different if there is a string with a very different probability. Such a string is called μ-distinguishable.

The question becomes: is there a string x such that |PrA,q(x) − PrA,q'(x)| > μ?

SLIDE 78

(much more to DSAI)

D. Ron, Y. Singer, and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of Colt 1995, pages 31–40, 1995.

PAC learnability results, in the case where targets are acyclic graphs.

SLIDE 79

Criterion for MDI

An MDL-inspired heuristic. The criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?

F. Thollard, P. Dupont, and C. de la Higuera. Probabilistic Dfa inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

SLIDE 80

a PFA/HMM learning competition

Organisation committee:

♦ Hasan Ibne Akram, Technische Universität München, Germany
♦ Rémi Eyraud, Aix-Marseille Université, France
♦ Jeffrey Heinz, University of Delaware, USA
♦ Colin de la Higuera, University of Nantes, France
♦ James Scicluna, University of Nantes, France
♦ Sicco Verwer, Radboud University Nijmegen, The Netherlands

SLIDE 81

ICGI'12 - Workshop 81

Scientific Committee

Pieter Adriaans, University of Amsterdam, The Netherlands
Dana Angluin, Yale University, USA
Alexander Clark, Royal Holloway University of London, United Kingdom
Pierre Dupont, Université catholique de Louvain, Belgium
Ricard Gavaldà, Universitat Politécnica de Catalunya, Spain
Colin de la Higuera, University of Nantes, France
Jean-Christophe Janodet, University of Evry, France
Tim Oates, University of Maryland in Baltimore County, USA
Jose Oncina, University of Alicante, Spain
Menno van Zaanen, Tilburg University, The Netherlands

SLIDE 82

Timeline

December 2011: first ideas
February 2012: website, first baselines and the first data set on-line
March 2012: first phase (training phase)
May 20: second phase (competition)
June 5: first real-world problem available
July 3: end of the competition
September 7: special session at ICGI'12

SLIDE 83

Target Generation

Targets were generated completely at random. 4 kinds of targets:
- HMM
- PDFA
- PFA
- Markov chains (used only during the training phase)

5 to 75 states; alphabets of 4 to 24 letters. All initial, symbol and transition probabilities were drawn from a Dirichlet distribution.

SLIDE 84

Target Generation

Symbol sparsity: the percentage of possible state-symbol pairs selected for the target (between 20% and 80%). A state is randomly selected, then a not-already-taken symbol for this state; one transition is generated by selecting a target state.

Transition sparsity: the percentage of additional transitions (between 0% and 20%), selected without replacement from the set of possible transitions, modified to remain uniform over the source states and transition labels.

SLIDE 85

Evaluation Score

A perplexity measure, where PrT is the probability in the target and PrC is the submitted probability (these probabilities have to be normalized on the test set).

Equivalent to the Kullback–Leibler divergence; independent of a specific model.

SLIDE 86

Real Data

Natural language problem: 10 000 POS sequences (+1 000 unique for test) selected from over 100 000 obtained with the Frog Dutch tagger (11 symbols) on a corpus of Dutch translations of Jules Verne books.

Discretized sensor signals: 20 000 strings (+1 000 for test) corresponding to windows of length 20 over the fuel usage of trucks, selected from almost 500 000 available windows.

Evaluation: submissions were compared with the probabilities obtained with a 3-gram trained on the whole data set.

SLIDE 87

Overall score

For each problem:
- 5 points were given to the leader (the participant with the smallest perplexity score)
- 3 points to the second
- 2 points to the third
- 1 point to the fourth

The sum of the points gave the overall ranking.

SLIDE 88

Train and test sets

Access only to registered participants. 51 problems for the training phase; 48 problems for the competition phase (+2 real-world problems). 1 000 strings in each test set; 20 000 or 100 000 strings in the train sets.

SLIDE 89

Baseline Algorithms

2 simple baselines in Python:
- frequency of the strings in the sets (train + test)
- usual 3-gram on the strings of the sets (train + test)

An implementation of the Baum-Welch algorithm in Python. An implementation of ALERGIA in OpenFST and Visual Studio. Good page rank of this page (no registration needed).

SLIDE 90

Competition activity

724 visits (max: 54 in one day); 196 unique visitors; IPs from 37 countries, 14 countries with 5 or more IPs.

38 registered participants; 16 submitted at least one solution; 2 787 submissions; 5 participants scored points; 4 participants ranked first for at least one day.

SLIDE 91

Overall results

Rank | Team name | Overall score
1 | Shibata-Yoshinaka | 212
2 | Mans Hulden | 124
3 | David Llorens | 122
4 | Raphael Bailly | 75
5 | Fabio Kepler | 14

SLIDE 92

Overall Scores Evolution

SLIDE 93

7 Conclusion and open questions

SLIDE 94

Appendix

Stern-Brocot trees; identification of probabilities. If we were able to discover the structure, how do we identify the probabilities?

SLIDE 95

By estimation: the edge is used 1501 times out of 3000 passages through the state.

[Figure: a state visited 3000 times, with an a-edge used 1501 times]

SLIDE 96

Stern-Brocot trees (Stern 1858, Brocot 1860)

Can be constructed from two simple adjacent fractions by the «mean» (mediant) operation:

a/b ⊕ c/d = (a+c)/(b+d)
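Searching the tree by repeated mediants gives simple fractions approximating a target value. A sketch (the traversal depth and the `Fraction`-based encoding are choices of this illustration):

```python
from fractions import Fraction

def stern_brocot_approx(x, depth):
    """Walk the Stern-Brocot tree 'depth' levels toward x by taking
    mediants (a+c)/(b+d); returns the last node visited."""
    (a, b), (c, d) = (0, 1), (1, 0)   # 1/0 plays the role of infinity
    m = Fraction(1, 1)
    for _ in range(depth):
        m = Fraction(a + c, b + d)    # mediant of the two bounds
        if x < m:
            c, d = m.numerator, m.denominator   # go left
        elif x > m:
            a, b = m.numerator, m.denominator   # go right
        else:
            break                     # x reached exactly
    return m

# The estimate 1501/3000 of slide 95: two steps already land on 1/2
print(stern_brocot_approx(Fraction(1501, 3000), 2))  # 1/2
```

This is the idea behind the identification trick of the next slides: return a nearby simple fraction instead of the raw estimate c(x)/n.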

SLIDE 97

[Figure: the first levels of the Stern-Brocot tree: 1/1; 1/2, 2/1; 1/3, 2/3, 3/2, 3/1; 1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2, 4/1]

SLIDE 98

Idea:

Instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.
SLIDE 99

Iterated logarithm:

With probability 1, for a co-finite number of values of n we have:

| c(x)/n − a/b | < λ √( (log log n) / n ),  ∀λ > 1