SLIDE 1

Learning Automata with Hankel Matrices

Borja Balle

Amazon Research Cambridge

Highlights — London, September 2017

SLIDE 2

Weighted Finite Automata (WFA) (over ℝ)

Graphical Representation: a two-state automaton over Σ = {a, b}; state q1 carries initial weight −1 and final weight 1.2, state q2 carries initial weight 0.5, and each transition carries one weight per symbol (a, 1.2 and b, 2 on the q1 self-loop; a, −1 and b, −2 from q1 to q2; a, 3.2 and b, 5 on the q2 self-loop; a, −2 and b, 0 from q2 to q1), matching the matrices below.

Algebraic Representation: A = ⟨α, β, {A_a}_{a∈Σ}⟩ with

α = [−1, 0.5]ᵀ   β = [1.2, ·]ᵀ   A_a = [[1.2, −1], [−2, 3.2]]   A_b = [[2, −2], [5, 0]]

Behavioral Representation: each WFA A computes a function A : Σ* → ℝ given by

A(x_1 ··· x_T) = αᵀ A_{x_1} ··· A_{x_T} β
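As a quick illustration of the behavioral representation, here is a minimal NumPy sketch that evaluates the WFA above on a string; the second entry of β is a placeholder value, since it is not fully specified above.

```python
import numpy as np

# Example WFA from this slide; beta's second entry is a placeholder.
alpha = np.array([-1.0, 0.5])                    # initial weight vector
beta = np.array([1.2, 0.0])                      # final weight vector (second entry assumed)
A = {
    "a": np.array([[1.2, -1.0], [-2.0, 3.2]]),   # transition operator for symbol a
    "b": np.array([[2.0, -2.0], [5.0, 0.0]]),    # transition operator for symbol b
}

def wfa_value(word):
    """Compute A(x_1 ... x_T) = alpha^T A_{x_1} ... A_{x_T} beta."""
    v = alpha
    for symbol in word:
        v = v @ A[symbol]
    return float(v @ beta)

print(wfa_value("ab"))   # value assigned to the string "ab"
print(wfa_value(""))     # empty string: alpha^T beta
```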

SLIDE 3

In This Talk...

§ Describe a core algorithm common to many algorithms for learning weighted automata
§ Explain the role this core plays in three learning problems in different setups
§ Survey extensions to more complex models and some applications

SLIDE 4

Outline

  • 1. From Hankel Matrices to Weighted Automata
  • 2. From Data to Hankel Matrices
  • 3. From Theory to Practice
SLIDE 5

Outline

  • 1. From Hankel Matrices to Weighted Automata
  • 2. From Data to Hankel Matrices
  • 3. From Theory to Practice
SLIDE 6

Hankel Matrices and Fliess’ Theorem

Given f : Σ* → ℝ, define its Hankel matrix H_f ∈ ℝ^{Σ*×Σ*} by

H_f(p, s) = f(p·s)  for all p, s ∈ Σ*

Rows are indexed by prefixes and columns by suffixes, so for example H_f(ε, a) = H_f(a, ε) = f(a) and H_f(a, a) = f(aa): every decomposition of the same string p·s carries the same value.

Theorem [Fli74]

  • 1. The rank of H_f is finite if and only if f is computed by a WFA
  • 2. The rank rank(f) = rank(H_f) equals the number of states of a minimal WFA computing f
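To make the theorem concrete, here is a small sketch (assuming a NumPy environment; the helper name is illustrative) that builds a finite Hankel block and checks its rank numerically for a rank-one function.

```python
import numpy as np

def hankel_block(f, prefixes, suffixes):
    """Finite Hankel sub-block: entry (i, j) holds f(prefixes[i] + suffixes[j])."""
    return np.array([[f(p + s) for s in suffixes] for p in prefixes])

# Toy function f(x) = 0.5^|x|, which has rank 1; by Fliess' theorem every
# finite Hankel block of f has rank at most 1.
f = lambda x: 0.5 ** len(x)
basis = ["", "a", "b", "aa", "ab"]
H = hankel_block(f, basis, basis)
print(np.linalg.matrix_rank(H))   # -> 1
```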
SLIDE 7

The Structure of Hankel Matrices

A(p_1 ··· p_T s_1 ··· s_{T'}) = αᵀ A_{p_1} ··· A_{p_T} A_{s_1} ··· A_{s_{T'}} β

so the Hankel matrix factors as H = P S, where the row of P indexed by prefix p = p_1 ··· p_T is αᵀ A_{p_1} ··· A_{p_T} and the column of S indexed by suffix s = s_1 ··· s_{T'} is A_{s_1} ··· A_{s_{T'}} β.

A(p_1 ··· p_T a s_1 ··· s_{T'}) = αᵀ A_{p_1} ··· A_{p_T} A_a A_{s_1} ··· A_{s_{T'}} β

so the block of entries indexed by strings of the form p·a·s satisfies H_a = P A_a S.

Algebraically: factorizing H lets us solve for A_a:

H = P S  ⟹  H_a = P A_a S  ⟹  A_a = P⁺ H_a S⁺
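A minimal NumPy sketch of this identity, assuming an exact rank factorization and noise-free blocks (all concrete matrices below are toy placeholders, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_prefixes, n_suffixes = 2, 5, 4
P = rng.normal(size=(n_prefixes, n))        # forward factor (rows indexed by prefixes)
S = rng.normal(size=(n, n_suffixes))        # backward factor (columns indexed by suffixes)
A_a = np.array([[1.2, -1.0], [-2.0, 3.2]])  # operator we want to recover

H = P @ S                                    # Hankel block
H_a = P @ A_a @ S                            # shifted block for symbol a

# A_a = P^+ H_a S^+ holds exactly when P has full column rank and S full row rank.
A_a_rec = np.linalg.pinv(P) @ H_a @ np.linalg.pinv(S)
print(np.allclose(A_a_rec, A_a))             # -> True
```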

SLIDE 8

SVD-based Reconstruction [HKZ09; Bal+14]

Inputs

§ Desired number of states r
§ Basis B = (P, S) with P, S ⊂ Σ*, ε ∈ P ∩ S
§ Finite Hankel blocks indexed by prefixes and suffixes in B:
  § H_B ∈ ℝ^{P×S}
  § H_B,Σ = {H_B,a ∈ ℝ^{P×S} : a ∈ Σ}

Algorithm: Spectral(H_B, H_B,Σ, r)

  • 1. Compute the rank-r SVD of H_B ≈ U D Vᵀ
  • 2. Let A_a = D⁻¹ Uᵀ H_B,a V
  • 3. Let α = Vᵀ H_B(ε, −) and β = D⁻¹ Uᵀ H_B(−, ε)
  • 4. Return A = ⟨α, β, {A_a}⟩

Running time:

  • 1. SVD takes O(|P||S|r)
  • 2. Matrix multiplications take O(|Σ||P||S|r)
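A minimal NumPy sketch of the routine above, assuming the empty string is stored at index 0 of both the prefix and suffix lists (the function signature is illustrative, not a reference implementation):

```python
import numpy as np

def spectral(H, H_sigma, r, eps_row=0, eps_col=0):
    """Sketch of Spectral(H_B, H_B,Sigma, r) via a rank-r truncated SVD.

    H        : |P| x |S| Hankel block over the basis B = (P, S)
    H_sigma  : dict mapping each symbol a to its |P| x |S| block H_B,a
    r        : desired number of states
    eps_row  : index of the empty prefix in P (assumed to be 0)
    eps_col  : index of the empty suffix in S (assumed to be 0)
    """
    U, d, Vt = np.linalg.svd(H, full_matrices=False)
    U, D_inv, V = U[:, :r], np.diag(1.0 / d[:r]), Vt[:r, :].T     # rank-r truncation
    A = {a: D_inv @ U.T @ H_a @ V for a, H_a in H_sigma.items()}  # A_a = D^-1 U^T H_B,a V
    alpha = V.T @ H[eps_row, :]                                   # alpha = V^T H_B(eps, -)
    beta = D_inv @ U.T @ H[:, eps_col]                            # beta  = D^-1 U^T H_B(-, eps)
    return alpha, beta, A
```

Fed with exact sub-blocks of a rank-r Hankel matrix, this returns a WFA computing the same function (the recovery property on the next slide).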
SLIDE 9

Properties of Spectral [HKZ09; Bal13; BM15a]

Consistency

§ If P is prefix-closed, S is suffix-closed, and r = rank(H_B) = rank([H_B | H_B,Σ])
§ Then for all p ∈ P, s ∈ S, a ∈ Σ, the WFA A = Spectral(H_B, H_B,Σ, r) satisfies

A(p·s) = H_B(p, s) and A(p·a·s) = H_B,a(p, s)

Recovery

§ If H_B and H_B,Σ are sub-blocks of H_f with r = rank(f) = rank(H_B)
§ Then the WFA A = Spectral(H_B, H_B,Σ, r) satisfies A ≡ f

Robustness

§ If r = rank(H_B) = rank([H_B | H_B,Σ]) and ‖H_B − Ĥ_B‖ ≤ ε and ‖H_B,a − Ĥ_B,a‖ ≤ ε for all a ∈ Σ
§ Then ⟨α, β, {A_a}⟩ = Spectral(H_B, H_B,Σ, r) and ⟨α̂, β̂, {Â_a}⟩ = Spectral(Ĥ_B, Ĥ_B,Σ, r) satisfy

‖α − α̂‖, ‖β − β̂‖, ‖A_a − Â_a‖ ≤ ε

SLIDE 10

Outline

  • 1. From Hankel Matrices to Weighted Automata
  • 2. From Data to Hankel Matrices
  • 3. From Theory to Practice
SLIDE 11

Learning Models

  • 1. Exact query learning: membership + equivalence queries [BV96; BBM06; BM15a]
  • 2. Distributional PAC learning: samples from a stochastic WFA [HKZ09; BDR09; Bal+14]
  • 3. Statistical learning: optimize output predictions wrt a loss function [BM12; BM15b]
SLIDE 12

Exact Learning of WFA with Queries

Setup:

§ Unknown f : Σ* → ℝ with rank(f) = n
§ Membership oracle: MQ_f(x) returns f(x) for any x ∈ Σ*
§ Equivalence oracle: EQ_f(A) returns true if f ≡ A and (false, z) if f(z) ≠ A(z)

Algorithm:

  • 1. Initialize P = S = {ε} and maintain B = (P, S)
  • 2. Let A = Spectral(H_B, H_B,Σ, rank(H_B))
  • 3. While EQ(A) = (false, z):

    3.1 Let z = p·a·s with p the longest prefix of z in P
    3.2 Let S = S ∪ suffixes(s)
    3.3 While there exist p ∈ P and a ∈ Σ such that H_B,a(p, −) ∉ rowspan(H_B), add p·a to P
    3.4 Let A = Spectral(H_B, H_B,Σ, rank(H_B))
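A minimal NumPy sketch of the closure test behind step 3.3 (the function name and numerical tolerance are illustrative assumptions):

```python
import numpy as np

def in_rowspan(H, row, tol=1e-10):
    """Return True if `row` lies (numerically) in the row span of H.

    Step 3.3 adds p.a to P whenever some row H_B,a(p, -) fails this test,
    so that the basis stays closed before re-running Spectral.
    """
    coeffs, *_ = np.linalg.lstsq(H.T, row, rcond=None)   # best combination of rows of H
    return np.linalg.norm(row - H.T @ coeffs) <= tol
```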

Analysis:

§ At most n + 1 calls to EQ_f and O(|Σ| n² L) calls to MQ_f, where L = max |z|
§ Can be improved to O((|Σ| + log L) n²) calls to MQ_f; calls to EQ_f can be reduced by increasing calls to MQ_f

SLIDE 13

PAC Learning Stochastic WFA

Setup:

§ Unknown f : Σ* → ℝ with rank(f) = n defining a probability distribution on Σ*
§ Data: x^(1), ..., x^(m) i.i.d. strings sampled from f
§ Parameters: n and B = (P, S) such that rank(H_B) = n and ε ∈ P ∩ S

Algorithm:

  • 1. Estimate the Hankel matrices Ĥ_B and Ĥ_B,a for all a ∈ Σ using the empirical probabilities

    f̂(x) = (1/m) Σ_{i=1}^m 1[x^(i) = x]

  • 2. Return Â = Spectral(Ĥ_B, Ĥ_B,Σ, n)
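A minimal sketch of the estimation in step 1, assuming strings are plain Python strings and the sample fits in memory (names are illustrative):

```python
import numpy as np
from collections import Counter

def empirical_hankel(samples, prefixes, suffixes, alphabet):
    """Estimate the Hankel block H_B and shifted blocks H_B,a from i.i.d. strings.

    Each entry is the empirical probability f_hat(x) = (1/m) * #{i : x^(i) = x}.
    """
    m = len(samples)
    counts = Counter(samples)
    f_hat = lambda x: counts[x] / m
    H = np.array([[f_hat(p + s) for s in suffixes] for p in prefixes])
    H_sigma = {a: np.array([[f_hat(p + a + s) for s in suffixes] for p in prefixes])
               for a in alphabet}
    return H, H_sigma
```

Step 2 then passes these estimates to the SVD-based reconstruction.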

Analysis:

§ Running time is O(|P·S| m + |Σ||P||S| n)
§ With high probability Σ_{|x|≤L} |f(x) − Â(x)| = O( L² |Σ| √n / (σ_n(H_B)² √m) )

SLIDE 14

Statistical Learning of WFA

Setup:

§ Unknown distribution D over Σ* × ℝ
§ Data: (x^(1), y^(1)), ..., (x^(m), y^(m)) i.i.d. string-label pairs sampled from D
§ Parameters: n, convex loss function ℓ : ℝ × ℝ → ℝ₊, convex regularizer R, regularization parameter λ > 0, and B = (P, S) with ε ∈ P ∩ S

Algorithm:

  • 1. Build B' = (P', S) with P' = P ∪ P·Σ
  • 2. Find the Hankel matrix Ĥ_B' solving min_H (1/m) Σ_{i=1}^m ℓ(H(x^(i)), y^(i)) + λ R(H)
  • 3. Return Â = Spectral(Ĥ_B, Ĥ_B,Σ, n), where Ĥ_B and Ĥ_B,Σ are submatrices of Ĥ_B'
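A sketch of step 2 for one concrete instantiation, squared loss with a nuclear-norm regularizer in the spirit of the constrained matrix completion of [BM12]; the use of the cvxpy package and all names below are assumptions for illustration, not the setup used in the original works.

```python
import cvxpy as cp

def fit_hankel(train, prefixes, suffixes, lam):
    """Sketch: minimize (1/m) * sum_i (H(x_i) - y_i)^2 + lam * ||H||_* over
    matrices H with Hankel structure on the basis B' = (P', S).

    `train` is a list of (string, label) pairs; each training string is assumed
    to admit at least one split x = p + s with p in prefixes and s in suffixes.
    """
    H = cp.Variable((len(prefixes), len(suffixes)))

    # Hankel constraints: all cells (p, s) with the same concatenation p + s share one value.
    cells, constraints = {}, []
    for i, p in enumerate(prefixes):
        for j, s in enumerate(suffixes):
            if p + s in cells:
                constraints.append(H[i, j] == H[cells[p + s]])
            else:
                cells[p + s] = (i, j)

    # Squared loss on one representative cell per training string.
    losses = [cp.square(H[cells[x]] - y) for x, y in train if x in cells]
    objective = cp.Minimize(sum(losses) / len(train) + lam * cp.normNuc(H))
    cp.Problem(objective, constraints).solve()
    return H.value   # estimated Hankel block over B'
```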

Analysis:

§ Running time is polynomial in n, m, |Σ|, |P|, and |S|
§ With high probability

E_{(x,y)~D}[ℓ(Â(x), y)] ≤ (1/m) Σ_{i=1}^m ℓ(Â(x^(i)), y^(i)) + O(1/√m)

SLIDE 15

Outline

  • 1. From Hankel Matrices to Weighted Automata
  • 2. From Data to Hankel Matrices
  • 3. From Theory to Practice
SLIDE 16

Extensions

  • 1. More complex models

§ Transducers and taggers [BQC11; Qua+14]
§ Grammars and tree automata [Luq+12; Bal+14; RBC16]
§ Reactive models [BBP15; LBP16; BM17a]

  • 2. More realistic setups

§ Multiple related tasks [RBP17]
§ Timing data [BBP15; LBP16]
§ Single trajectory [BM17a]
§ Probabilistic models [BHP14]

  • 3. Deeper theory

§ Convex relaxations [BQC12]
§ Generalization bounds [BM15b; BM17b]
§ Approximate minimisation [BPP15]
§ Bisimulation metrics [BGP17]

SLIDE 17

And It Works Too!

Spectral methods are competitive against traditional methods:

§ Expectation maximization
§ Conditional random fields
§ Tensor decompositions

In a variety of problems:

§ Sequence tagging
§ Constituency and dependency parsing
§ Timing and geometry learning
§ POS-level language modelling

[Experimental plots: L1 distance vs. number of training samples (HMM, k-HMM, FST); Hamming accuracy vs. training samples (averaged perceptron, CRF, and spectral/max-margin variants); runtime of initialization and model building (CO, Tensor, EM); relative error vs. Hankel rank; word error rate vs. number of states for spectral models with different bases against unigram/bigram baselines; results for SVTA and SVTA* by sentence length.]

SLIDE 18

Open Problems and Current Trends

§ Optimal selection of P and S from data
§ Scalable convex optimization over sets of Hankel matrices
§ Constraining the output WFA (e.g. probabilistic automata)
§ Relations between learning and approximate minimisation
§ How much of this can be extended to WFA over semirings?
§ Spectral methods for initializing non-convex gradient-based learning algorithms

SLIDE 19

Conclusion

Take home points

§ A single building block based on SVD of Hankel matrices
§ Implementation only requires linear algebra
§ Analysis involves linear algebra, probability, convex optimization
§ Can be made practical for a variety of models and applications

Want to know more?

§ EMNLP’14 tutorial (with slides, video, and code)

https://borjaballe.github.io/emnlp14-tutorial/

§ Survey papers [BM15a; TJ15]
§ Python toolkit Sp2Learn [Arr+16]
§ Neighbouring literature: predictive state representations (PSR) [LSS02] and observable operator models (OOM) [Jae00]
SLIDE 20

Thanks To All My Collaborators!

Xavier Carreras, Mehryar Mohri, Prakash Panangaden, Joelle Pineau, Doina Precup, Ariadna Quattoni, Guillaume Rabusseau, Franco M. Luque, Pierre-Luc Bacon, Pascale Gourdeau, Odalric-Ambrym Maillard, Will Hamilton, Lucas Langer, Shay Cohen, Amir Globerson

SLIDE 21

Bibliography I

[Arr+16] D. Arrivault, D. Benielli, F. Denis, and R. Eyraud. “Sp2Learn: A Toolbox for the Spectral Learning of Weighted Automata”. In: ICGI. 2016.
[Bal+14] B. Balle, X. Carreras, F. M. Luque, and A. Quattoni. “Spectral learning of weighted automata: A forward-backward perspective”. In: Machine Learning (2014).
[Bal13] B. Balle. “Learning Finite-State Machines: Algorithmic and Statistical Aspects”. PhD thesis. Universitat Politècnica de Catalunya, 2013.
[BBM06] L. Bisht, N. H. Bshouty, and H. Mazzawi. “On Optimal Learning Algorithms for Multiplicity Automata”. In: COLT. 2006.
[BBP15] P.-L. Bacon, B. Balle, and D. Precup. “Learning and Planning with Timing Information in Markov Decision Processes”. In: UAI. 2015.
[BDR09] R. Bailly, F. Denis, and L. Ralaivola. “Grammatical inference as a principal component analysis problem”. In: ICML. 2009.

SLIDE 22

Bibliography II

[BGP17] B. Balle, P. Gourdeau, and P. Panangaden. “Bisimulation Metrics for Weighted Automata”. In: ICALP. 2017.
[BHP14] B. Balle, W. L. Hamilton, and J. Pineau. “Methods of Moments for Learning Stochastic Languages: Unified Presentation and Empirical Comparison”. In: ICML. 2014.
[BM12] B. Balle and M. Mohri. “Spectral learning of general weighted automata via constrained matrix completion”. In: NIPS. 2012.
[BM15a] B. Balle and M. Mohri. “Learning Weighted Automata (invited paper)”. In: CAI. 2015.
[BM15b] B. Balle and M. Mohri. “On the Rademacher complexity of weighted automata”. In: ALT. 2015.
[BM17a] B. Balle and O.-A. Maillard. “Spectral Learning from a Single Trajectory under Finite-State Policies”. In: ICML. 2017.

SLIDE 23

Bibliography III

[BM17b] B. Balle and M. Mohri. “Generalization Bounds for Learning Weighted Automata”. In: Theor. Comput. Sci. (to appear) (2017).
[BPP15] B. Balle, P. Panangaden, and D. Precup. “A Canonical Form for Weighted Automata and Applications to Approximate Minimization”. In: LICS. 2015.
[BQC11] B. Balle, A. Quattoni, and X. Carreras. “A spectral learning algorithm for finite state transducers”. In: ECML-PKDD. 2011.
[BQC12] B. Balle, A. Quattoni, and X. Carreras. “Local loss optimization in operator models: A new insight into spectral learning”. In: ICML. 2012.
[BV96] F. Bergadano and S. Varricchio. “Learning behaviors of automata from multiplicity and equivalence queries”. In: SIAM Journal on Computing (1996).
[Fli74] M. Fliess. “Matrices de Hankel”. In: Journal de Mathématiques Pures et Appliquées (1974).
[HKZ09] D. Hsu, S. M. Kakade, and T. Zhang. “A spectral algorithm for learning hidden Markov models”. In: COLT. 2009.

SLIDE 24

Bibliography IV

[Jae00] H. Jaeger. “Observable operator models for discrete stochastic time series”. In: Neural Computation (2000).
[LBP16] L. Langer, B. Balle, and D. Precup. “Learning Multi-Step Predictive State Representations”. In: IJCAI. 2016.
[LSS02] M. Littman, R. S. Sutton, and S. Singh. “Predictive representations of state”. In: NIPS. 2002.
[Luq+12] F. M. Luque, A. Quattoni, B. Balle, and X. Carreras. “Spectral learning in non-deterministic dependency parsing”. In: EACL. 2012.
[Qua+14] A. Quattoni, B. Balle, X. Carreras, and A. Globerson. “Spectral Regularization for Max-Margin Sequence Tagging”. In: ICML. 2014.
[RBC16] G. Rabusseau, B. Balle, and S. B. Cohen. “Low-Rank Approximation of Weighted Tree Automata”. In: AISTATS. 2016.
[RBP17] G. Rabusseau, B. Balle, and J. Pineau. “Multitask Spectral Learning of Weighted Automata”. In: NIPS. 2017.

SLIDE 25

Bibliography V

[TJ15] M. R. Thon and H. Jaeger. “Links between multiplicity automata, observable operator models and predictive state representations: a unified learning framework”. In: Journal of Machine Learning Research (2015).

SLIDE 26

Learning Automata with Hankel Matrices

Borja Balle

Amazon Research Cambridge

Highlights — London, September 2017