Learning probabilistic finite automata
Colin de la Higuera
University of Nantes
Nantes, November 2013
Acknowledgements: Laurent Miclet, Jose Oncina, Tim Oates, Rafael Carrasco, Paco Casacuberta, Rémi Eyraud, Philippe Ezequel, Henning Fernau, Thierry Murgue, Franck Thollard, Enrique Vidal, Frédéric Tantini, …
The list is necessarily incomplete; apologies to those who have been forgotten.
Slides: http://pagesperso.lina.univ-nantes.fr/~cdlh/slides/ (Chapters 5 and 16)
1. PFA
2. Distances between distributions
3. FFA
4. Basic elements for learning PFA
5. ALERGIA
6. MDI and DSAI
7. Open questions
Probabilistic finite (state) automata
(Computational biology, speech recognition, web services, automatic translation, image processing, …)
A lot of positive data; not necessarily any negative data; no ideal target; noise.
The data consists of positive strings, «generated» following an unknown distribution.
The goal is now to find (learn) the distribution used to generate the strings.
n-grams, Hidden Markov Models, probabilistic grammars.
[Figure: a DPFA with transition probabilities 1/4, 1/3, 1/2, 1/2, 1/2, 2/3, 3/4]
DPFA: Deterministic Probabilistic Finite Automaton
[Same DPFA as above]
PrA(abab) = 1/2 × 1/2 × 1/3 × 2/3 × 3/4 = 1/24
[Figure: a DPFA with transition probabilities 0.7, 0.9, 0.65, 0.35, 0.3, 0.1 and final probabilities 0.3, 0.7]
[Figure: a PFA with fractional transition probabilities]
PFA: Probabilistic Finite (state) Automaton
[Figure: an ε-PFA with fractional transition probabilities]
ε-PFA: Probabilistic Finite (state) Automaton with ε-transitions
They can define a distribution over Σ*.
They do not tell us if a string belongs to a language.
They are good candidates for grammar induction.
There is (was?) not that much written theory.
The HMM literature.
Azaria Paz 1973: Introduction to Probabilistic Automata.
Chapter 5 of my book.
Probabilistic Finite-State Machines, Vidal, Thollard, cdlh, Casacuberta & Carrasco.
Grammatical inference papers.
Let D be a distribution over Σ*
A Probabilistic Finite (state) Automaton is a tuple <Q, Σ, IP, FP, δP>:
Q a set of states; IP : Q→[0;1] the initial probabilities; FP : Q→[0;1] the final probabilities; δP : Q×Σ×Q→[0;1] the transition probabilities.
It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities:

PrA(w) = ∑ πi∈paths(w) Pr(πi), where πi = qi0 ai1 qi1 ai2 … ain qin and
Pr(πi) = IP(qi0) · δP(qi0, ai1, qi1) · … · δP(qin−1, ain, qin) · FP(qin)

Note that if λ-transitions are allowed the sum may be infinite.
Pr(aba) = 0.7·0.4·0.1·1 + 0.7·0.4·0.45·0.2 = 0.028 + 0.0252 = 0.0532
[Figure: a non-deterministic PFA with transition probabilities 0.7, 0.45, 0.4, 0.35, 0.3, 0.1 and final probabilities 1, 0.2]
Non-deterministic PFA: possibly many initial states;
λ-PFA: a PFA with λ-transitions and perhaps many initial states;
DPFA: a deterministic PFA (only one initial state).
PrA(Σ*) = 1 and ∀x∈Σ*, 0 ≤ PrA(x) ≤ 1
Equivalence between PFA and HMM… But the HMMs usually define distributions over Σn (one distribution per string length).
[Figure: an HMM and an equivalent PFA, with fractional probabilities 1/4, 1/2, 3/4, …]
Equivalence between PFA with λ-transitions and PFA without λ-transitions (cdlh 2003, Hanneforth & cdlh 2009).
Many initial states can be transformed into one initial state with λ-transitions;
λ-transitions can be removed in polynomial time.
Strategy: number the states; eliminate first the λ-loops, then the transitions with highest-ranking arrival state.
Folk theorem: you can't even tell in advance if you are in a good case or not (see Denis & Esposito 2004).
Example: [Figure: a PFA with fractional transition probabilities]
This distribution cannot be modelled by a DPFA.
And with this architecture you cannot generate the previous one
Computation of the probability of a string or of a set of strings.
Deterministic case: simple, apply the definitions.
Technically, rather sum up logs: this is easier, safer and cheaper.
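The log-summing idea above can be sketched as follows. The dict-based DPFA encoding (and the example automaton) is a hypothetical one chosen for illustration, not the deck's:

```python
# Sketch: computing Pr(w) in a DPFA by summing log-probabilities
# instead of multiplying raw probabilities (avoids underflow).
import math

# Hypothetical DPFA: delta[state][symbol] = (next_state, probability)
delta = {0: {'a': (1, 0.7), 'b': (0, 0.3)},
         1: {'a': (0, 0.9), 'b': (1, 0.05)}}
final = {0: 0.0, 1: 0.05}   # FP: probability of halting in each state
initial_state = 0

def log_prob(w):
    """Return log Pr(w), or -inf if w leaves the automaton."""
    q, logp = initial_state, 0.0
    for a in w:
        if a not in delta.get(q, {}):
            return float('-inf')
        q, p = delta[q][a]
        if p == 0.0:
            return float('-inf')
        logp += math.log(p)
    return logp + (math.log(final[q]) if final[q] > 0 else float('-inf'))
```

Exponentiating only at the very end (if at all) keeps the computation numerically safe for long strings.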
Pr(aba) = 0.7·0.9·0.35·0 = 0
Pr(abb) = 0.7·0.9·0.65·0.3 = 0.12285
[Figure: the DPFA with transition probabilities 0.7, 0.9, 0.65, 0.35, 0.3, 0.1]
Non-deterministic case:
Pr(aba) = 0.7·0.4·0.1·1 + 0.7·0.4·0.45·0.2 = 0.028 + 0.0252 = 0.0532
[Figure: the non-deterministic PFA from before]
The computation of the probability of a string is by dynamic programming: O(n²m) (n states, string of length m).
2 algorithms: Backward and Forward.
If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm.
A[i,j] = Pr(qi | a1..aj)
(The probability of being in state qi after having read a1..aj)
A[i,0] = IP(qi)
A[i,j+1] = ∑k≤|Q| A[k,j] · δP(qk, aj+1, qi)
Pr(a1..an) = ∑k≤|Q| A[k,n] · FP(qk)
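The Forward recurrence can be transcribed directly into code. This is a sketch; the PFA encoding (lists for IP and FP, a dict for δP) is an assumption, not the deck's notation:

```python
# Sketch of the Forward algorithm: A[i] holds the probability of being
# in state i after reading the prefix consumed so far.
def forward_prob(w, n_states, IP, FP, dP):
    """Pr(w) for a (possibly non-deterministic) PFA.
    IP, FP: lists of initial/final probabilities, indexed by state.
    dP: dict mapping (q, a, q2) -> transition probability."""
    A = list(IP)                      # A[i,0] = IP(qi)
    for a in w:
        A = [sum(A[k] * dP.get((k, a, i), 0.0) for k in range(n_states))
             for i in range(n_states)]
    return sum(A[k] * FP[k] for k in range(n_states))
```

Each symbol costs O(n²), giving the O(n²m) bound mentioned above; replacing `sum` by `max` (over paths) would give the Viterbi variant.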
What for?
Estimate the quality of a language model.
Have an indicator of the convergence of learning algorithms.
Construct kernels.
How many bits do we need to correct our model?
Two distributions over Σ*: D and D'.
Kullback-Leibler divergence (or relative entropy) between D and D':
∑w∈Σ* PrD(w) × (log PrD(w) − log PrD'(w))
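For distributions given explicitly on a finite support, the divergence above is a one-line sum. This is only a sketch (the general Σ* case needs the automata themselves, as in Carrasco & cdlh 2002):

```python
# Sketch: KL divergence sum_w D(w) * (log D(w) - log D'(w)),
# for two distributions given as dicts string -> probability.
import math

def kl_divergence(D, Dp):
    total = 0.0
    for w, p in D.items():
        if p > 0.0:
            # raises ValueError if Dp[w] == 0 while D(w) > 0
            # (the divergence is then infinite)
            total += p * (math.log2(p) - math.log2(Dp[w]))
    return total
```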
The idea is to allow the computation of the divergence, but relative to a test set S.
An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set.
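The inverse geometric mean can be sketched as follows (computed via logs to avoid underflow; the interface is an assumption):

```python
# Sketch: perplexity as the inverse geometric mean of the model
# probabilities of the strings of the test set S.
import math

def perplexity(probs):
    """probs: list of model probabilities Pr(x) for each x in S."""
    log_sum = sum(math.log2(p) for p in probs)   # log of the product
    return 2 ** (-log_sum / len(probs))          # inverse geometric mean
```

A model assigning probability 1/4 to every test string has perplexity 4: it is "as confused" as a uniform choice among 4 options.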
PP(S) = ( ∏x∈S Pr(x) )^(−1/|S|)
We are trying to compute the probability of
independently drawing the different strings in set S
Suppose we have two predictors for a coin toss.
Predictor 1: heads 60%, tails 40%. Predictor 2: heads 100%.
The test results are H: 6, T: 4.
Arithmetic mean: P1: 0.6·0.6 + 0.4·0.4 = 0.36 + 0.16 = 0.52; P2: 0.6·1 + 0.4·0 = 0.6.
Predictor 2 would be the better predictor ;-)
(The geometric mean would instead give Predictor 2 a score of 0, since it assigns probability 0 to tails.)
d2(D, D') = √( ∑w∈Σ* (PrD(w) − PrD'(w))² )
It can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002). This also means that equivalence of PFA is in P.
Frequency Finite (state) Automata
The sample is a multiset: strings appear with a frequency (or multiplicity).
S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}
A deterministic frequency finite automaton (DFFA) is a DFA with a frequency function returning a positive integer for every state and every transition, and for entering the initial state, such that at each state the sum of what enters is equal to the sum of what exits, and the sum of what halts is equal to what starts.
[Figure: a DFFA with initial frequency 6; halting frequencies 3, 1, 2; transitions a:1, a:2, a:5, b:3, b:4, b:5]
[Figure: the same automaton with relative frequencies 3/13, 1/6, 2/7; b:5/13, b:3/6, a:1/7, a:2/6, a:5/13, b:4/7; initial 6/6]
Frequencies become relative frequencies by dividing by the sum of the exiting frequencies.
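The normalization step can be sketched as follows. The dict-based representation (and the transition targets in the test) are assumptions for illustration:

```python
# Sketch: turning a DFFA's frequencies into relative frequencies by
# dividing each state's halting and transition counts by the total
# count exiting that state.
def normalize(state_final_freq, trans_freq):
    """state_final_freq: state -> halting count;
    trans_freq: (q, a) -> (q2, count).
    Returns (FP, dP): final and transition probabilities."""
    totals = dict(state_final_freq)              # exiting mass per state
    for (q, a), (q2, c) in trans_freq.items():
        totals[q] = totals.get(q, 0) + c
    FP = {q: f / totals[q] for q, f in state_final_freq.items()}
    dP = {(q, a): (q2, c / totals[q]) for (q, a), (q2, c) in trans_freq.items()}
    return FP, dP
```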
[Figure: the DFFA above]
S = {λ, aaaa, ab, babb, bbbb, bbbbaa}
Another sample may lead to the same DFFA.
Doing the same with an NFA is a much harder problem: typically what the Baum-Welch (EM) algorithm has been invented for…
The data is a multiset.
The FTA is the smallest tree-like FFA consistent with the data.
It can be transformed into a PFA if needed.
FTA(S) for S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}:
[Figure: the frequency prefix tree acceptor, a tree-shaped FFA with root frequency 14]
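Building the FTA amounts to counting, for every prefix, the strings passing through it and halting at it. A sketch, with a representation (dicts keyed by prefixes) chosen for simplicity:

```python
# Sketch: the FTA as a prefix tree whose nodes carry the frequency of
# strings passing through them, plus a halting frequency.
from collections import defaultdict

def build_fta(sample):
    """sample: dict string -> multiplicity.
    Returns (through, halt, edges):
      through[prefix]: number of strings passing through that node,
      halt[prefix]:    number of strings halting there,
      edges:           set of (prefix, symbol) tree transitions."""
    through, halt = defaultdict(int), defaultdict(int)
    edges = set()
    for w, count in sample.items():
        for i in range(len(w) + 1):
            through[w[:i]] += count
            if i < len(w):
                edges.add((w[:i], w[i]))
        halt[w] += count
    return dict(through), dict(halt), edges

S = {'': 3, 'aaa': 4, 'aaba': 2, 'ababa': 1, 'bb': 3, 'bbaaa': 1}
through, halt, edges = build_fta(S)
```

On the sample S above the root carries 14 (the 14 strings), the a-child 7 and the b-child 4, matching the figure.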
[Figure: a prefix tree over {a, b}; Red states and their successors]
Blue states: the successors of the Red states.
Same as with DFA and what RPNI does.
[Figure: a red-blue DFFA (total 100); state frequencies 60, 9, 10, 11, 6, 4; red states λ, a, b]
Suppose we decide to merge a blue state with one of the red states.
[Same figure]
First disconnect the blue state and reconnect it to the red state.
[Same figure]
Then fold.
After folding:
[Figure: the automaton after folding (total 100); frequencies 60, 10, 9, 10, 11, 4; a:26, a:4, a:10, a:10, b:24, b:9, b:30]
A = FTA(S); Red = {qI}; Blue = {δ(qI, a) : a ∈ Σ}
while Blue ≠ ∅ do
    choose q from Blue such that Freq(q) ≥ t0
    if ∃p ∈ Red : d(Ap, Aq) is small
        then A = merge_and_fold(A, p, q)
        else Red = Red ∪ {q}
    Blue = {δ(q, a) : q ∈ Red, a ∈ Σ} − Red
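The red-blue loop above can be sketched in code. Everything below the control flow — `delta`, `freq`, `compatible`, `merge_and_fold` — is a placeholder passed in by the caller; this is not a full ALERGIA implementation:

```python
# Sketch of the ALERGIA red-blue control flow; all helpers are assumed given.
def alergia_skeleton(A, alphabet, freq, delta, compatible, merge_and_fold, t0, q_I):
    """A: automaton being built; delta(A, q, a) -> successor state or None;
    freq(A, q): frequency of state q; compatible: the merge test."""
    red = [q_I]
    blue = [s for a in alphabet if (s := delta(A, q_I, a)) is not None]
    while blue:
        q = blue.pop(0)
        if freq(A, q) < t0:
            continue                  # low-frequency states: handled separately
        for p in red:
            if compatible(A, p, q):   # e.g. a Hoeffding-style compatibility test
                A = merge_and_fold(A, p, q)
                break
        else:
            red.append(q)             # promotion
        blue = [s for r in red for a in alphabet
                if (s := delta(A, r, a)) is not None and s not in red]
    return A, red

# Toy instantiation: a 3-state tree and a test that never accepts a merge,
# so every sufficiently frequent blue state gets promoted.
states = {'': 10, 'a': 6, 'b': 4}
toy_delta = lambda A, q, a: q + a if (q + a) in A else None
toy_freq = lambda A, q: A[q]
never_merge = lambda A, p, q: False
```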
How do we decide if d(Ap, Aq) is small?
Use a distance… Be able to compute this distance. If possible, update the computation easily. Have properties related to this distance.
If the two distributions are known, equality can be tested.
The distance (L2 norm) between distributions can be exactly computed.
But what if the two distributions are unknown?
[Figure: the red-blue DFFA again (total 100)]
Suppose we want to merge a blue state with a red state.
Yes, if the two distributions induced are similar.
[Figure: the two sub-automata rooted at the candidate states; frequencies 9, 11, 4; a:4, a:4, b:24, b:9]
D1 ≈ D2 if ∀x, PrD1(x) ≈ PrD2(x)
Easier to test: PrD1(λ) = PrD2(λ) and ∀a∈Σ, PrD1(aΣ*) = PrD2(aΣ*)
And do this recursively! Of course, do it on frequencies.
γ = | f1/n1 − f2/n2 | indicates if the relative frequencies f1/n1 and f2/n2 are sufficiently close. The test (a Hoeffding bound) accepts when

γ < √( ½ · ln(2/α) ) · ( 1/√n1 + 1/√n2 )
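The compatibility test above translates directly into code; a sketch:

```python
# Sketch: ALERGIA's Hoeffding-bound test. Two relative frequencies
# f1/n1 and f2/n2 are deemed (significantly) different when their
# gap exceeds sqrt(0.5 * ln(2/alpha)) * (1/sqrt(n1) + 1/sqrt(n2)).
import math

def hoeffding_different(f1, n1, f2, n2, alpha=0.05):
    gamma = abs(f1 / n1 - f2 / n2)
    bound = (math.sqrt(0.5 * math.log(2 / alpha))
             * (1 / math.sqrt(n1) + 1 / math.sqrt(n2)))
    return gamma > bound
```

With the deck's later numbers (660/1341 vs 225/340, α = 0.05) the bound is ≈ 0.111 and the gap exceeds it, so the merge is rejected.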
S={λ(490), a(128), b(170), aa(31), ab(42), ba(38), bb(14), aaa(8), aab(10), aba(10), abb(4), baa(9), bab(4), bba(3), bbb(6), aaaa(2), aaab(2), aaba(3), aabb(2), abaa(2), abab(2), abba(2), abbb(1), baaa(2), baab(2), baba(1), babb(1), bbaa(1), bbab(1), bbba(1), aaaaa(1), aaaab(1), aaaba(1), aabaa(1), aabab(1), aabba(1), abbaa(1), abbab(1)}
Parameter α is arbitrarily set to 0.05. We choose 30 as the value for the threshold t0.
Note that for the blue states that have a frequency less than the threshold, a special merging operation takes place.
[Figure: the FTA built from the 1000-string sample (total 1000); first-level halting frequencies λ:490, a:128, b:170]
Compare λ and a, aΣ* and aaΣ*, bΣ* and abΣ*:
490/1000 with 128/257, 257/1000 with 64/257, 253/1000 with 65/257, …
All tests return true.
Merge…
[Figure: the FTA with the chosen states being merged (total 1000)]
And fold.
[Figure: the resulting automaton (total 1000); frequencies 660, 52, 225; a:341, a:77, b:340, b:38, …]
Next merge? λ with b?
[Figure: the current automaton (total 1000)]
Compare λ and b, aΣ* and baΣ*, bΣ* and bbΣ*:
660/1341 and 225/340 are different (giving γ = 0.162).
On the other hand, √(½ · ln(2/α)) · (1/√n1 + 1/√n2) = 0.111.
Promotion: the blue state becomes red.
[Figure: the automaton (total 1000)]
Merge.
[Figure: the automaton with the next pair of states to merge]
And fold.
[Figure: the resulting automaton (total 1000); frequencies 660, 291; a:341, a:95, b:340, b:49, a:11, …]
Merge.
[Figure: the automaton (total 1000); frequencies 660, 225; a:341, a:95, b:340, b:49, a:11, …]
And fold.
[Figure: the final DFFA (total 1000); frequencies 698 and 302; transitions a:354, a:96, b:351, b:49]
As a PFA: probabilities .698 and .302; transitions a:.354, a:.096, b:.351, b:.049.
Alergia builds a DFFA in polynomial time.
Alergia can identify DPFA in the limit with probability 1.
No good definition of Alergia's properties.
Why not change the criterion?
Use a distinguishing string; use the norm L∞.
Two distributions are different if there is a string with a very different probability.
Such a string is called µ-distinguishable.
The question becomes: is there a string x such that |PrA,q(x) − PrA,q'(x)| > µ?
Ron, Singer & Tishby: On the learnability and usage of acyclic probabilistic finite automata, COLT 1995.
PAC learnability results, in the case where targets are acyclic graphs.
MDL-inspired heuristic. The criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity?
Thollard, Dupont & cdlh: Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.
Organisation committee:
♦ Hasan Ibne Akram, Technische Universität München, Germany
♦ Rémi Eyraud, Aix-Marseille Université, France
♦ Jeffrey Heinz, University of Delaware, USA
♦ Colin de la Higuera, University of Nantes, France
♦ James Scicluna, University of Nantes, France
♦ Sicco Verwer, Radboud University Nijmegen, The Netherlands
ICGI'12 - Workshop 81
Pieter Adriaans, University of Amsterdam, The Netherlands
Dana Angluin, Yale University, USA
Alexander Clark, Royal Holloway University of London, United Kingdom
Pierre Dupont, Université catholique de Louvain, Belgium
Ricard Gavaldà, Universitat Politècnica de Catalunya, Spain
Colin de la Higuera, University of Nantes, France
Jean-Christophe Janodet, University of Evry, France
Tim Oates, University of Maryland in Baltimore County, USA
Jose Oncina, University of Alicante, Spain
Menno van Zaanen, Tilburg University, The Netherlands
December 2011: first ideas
February 2012: website, first baselines and the first data set on-line
March 2012: first phase (training phase)
May 20: second phase (competition)
June 5: first real-world problem available
July 3: end of the competition
September 7: special session at ICGI'12
Targets were generated completely at random. 4 kinds of targets: HMM, PDFA, PFA, Markov chains (used only during the training phase).
5 to 75 states; 4- to 24-letter alphabets.
All initial, symbol and transition probabilities drawn from a Dirichlet distribution.
Symbol sparsity: percentage of possible state-symbol pairs selected for the target (between 20% and 80%). A state is randomly selected, then a not-already-taken symbol for this state; one transition is generated by selecting a target state.
Transition sparsity: percentage of additional transitions (between 0% and 20%), selected without replacement from the set of possible transitions, modified to remain uniform over the source states and transition labels.
A perplexity measure: 2^(−∑x PrT(x) · log₂ PrC(x)), where PrT is the probability in the target and PrC is the submitted probability (these probabilities have to be normalized on the test set).
Equivalent to the Kullback–Leibler divergence; independent of a specific model.
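The measure can be sketched as follows. The exact scoring details are in the PAutomaC documentation; the function below is an illustration under the normalization assumption stated above:

```python
# Sketch of the competition measure: 2 ** (- sum_x PrT(x) * log2 PrC(x)),
# with both distributions normalized over the test set.
import math

def pautomac_score(pr_target, pr_candidate):
    """pr_target, pr_candidate: dicts test-string -> probability."""
    zt = sum(pr_target.values())
    zc = sum(pr_candidate.values())
    s = sum((pt / zt) * math.log2(pr_candidate[x] / zc)
            for x, pt in pr_target.items())
    return 2 ** (-s)
```

A candidate identical to the target achieves the minimum (the target's own perplexity); any mismatch only increases the score.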
Natural language problem: 10 000 POS sequences (+1 000 unique ones for test) selected from the 100 000+ obtained with the Frog Dutch tagger (11 symbols) on a corpus of Dutch translations of Jules Verne books.
Discretized sensor signals: 20 000 strings (+1 000 for test) corresponding to windows of length 20 over the fuel usage of trucks, selected from almost 500 000 available windows.
Evaluation: submissions were compared with the probabilities obtained with a 3-gram trained on the whole data set.
For each problem:
5 points were given to the leader (the participant with the smallest perplexity score), 3 points to the second, 2 points to the third, 1 point to the fourth.
The sum of the points gave the overall ranking.
Access only to registered participants.
51 problems for the training phase; 48 problems for the competition phase (+2 real-world problems).
1 000 strings in each test set; 20 000 or 100 000 strings in the train sets.
2 simple baselines in Python: the frequency of the strings in the sets (train + test), and a usual 3-gram on the strings of the sets (train + test).
An implementation of the Baum-Welch algorithm in Python.
An implementation of ALERGIA in OpenFST and Visual Studio.
Good page rank of this page (no registration needed).
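A 3-gram baseline of the kind mentioned above can be sketched in a few lines. The padding convention (`^^` at the start, `$` at the end) and the unsmoothed counts are assumptions, not the competition's actual baseline code:

```python
# Sketch: an unsmoothed 3-gram model over strings, with start/end padding.
from collections import defaultdict

def train_3gram(strings):
    ctx, tri = defaultdict(int), defaultdict(int)
    for s in strings:
        padded = '^^' + s + '$'
        for i in range(2, len(padded)):
            tri[padded[i-2:i+1]] += 1   # count each trigram
            ctx[padded[i-2:i]] += 1     # and its 2-symbol context
    return ctx, tri

def prob_3gram(s, ctx, tri):
    p, padded = 1.0, '^^' + s + '$'
    for i in range(2, len(padded)):
        c = ctx.get(padded[i-2:i], 0)
        p *= tri.get(padded[i-2:i+1], 0) / c if c else 0.0
    return p
```

A real baseline would add smoothing so unseen trigrams do not force the probability to 0.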
724 visits (max: 54 in one day); 196 unique visitors; IPs from 37 countries, 14 countries with 5 or more IPs.
38 registered participants; 16 submitted at least one solution; 2 787 submissions; 5 participants scored points; 4 participants ranked first on at least one day.
Rank  Team name          Overall score
1     Shibata-Yoshinaka  212
2     Mans Hulden        124
3     David Llorens      122
4     Raphael Bailly     75
5     Fabio Kepler       14
Stern-Brocot trees: identification of probabilities.
If we were able to discover the structure, how do we identify the probabilities?
By estimation: the edge is used 1501 times out of 3000 passages through the state, giving the estimate 1501/3000.
A fraction in the tree can be constructed from two simple adjacent fractions by the «mean» (mediant) operation:
m(a/b, c/d) = (a+c)/(b+d)
[Figure: the first levels of the Stern-Brocot tree: 1/1; 1/2, 2/1; 1/3, 2/3, 3/2, 3/1; 1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2, 4/1]
Instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation.
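The search can be sketched as a walk down the tree by repeated mediants (the stopping tolerance `eps` is a parameter I introduce for illustration):

```python
# Sketch: walk the Stern-Brocot tree, taking the mediant of the current
# bounds at each step, and stop at the first fraction a/b close enough
# to the observed relative frequency c/n.
def stern_brocot_approx(c, n, eps):
    """Smallest-depth fraction a/b with |c/n - a/b| < eps."""
    (la, lb), (ra, rb) = (0, 1), (1, 0)         # bounds 0/1 and 1/0
    while True:
        a, b = la + ra, lb + rb                 # mediant of the two bounds
        if abs(c / n - a / b) < eps:
            return a, b
        if c / n < a / b:
            ra, rb = a, b                       # target is smaller: go left
        else:
            la, lb = a, b                       # target is larger: go right
```

On the deck's example, 1501/3000 with a loose tolerance is immediately simplified to 1/2, which is the point: the estimate snaps to the simple underlying probability.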
| c(x)/n − a/b | < λ · (log log n)/n, ∀λ > 1
With probability 1, for a co-finite number of values of n.