SLIDE 1

volata: a generic classification system Sequence classification by combining... Experiments and discussion

Sequences classification by least general generalisations

Fabien Torre joint work with F. Tantini and A. Terlutte

INRIA LNE and CNRS LIFL (Mostrare) – LORIA (Parole)

ICGI 2010, Valencia

  • F. Tantini & A. Terlutte & F. Torre (ICGI 2010)

1 / 22 Sequences classification by lgg

SLIDE 2

Outline of the talk

1. volata: a generic classification system
2. Sequence classification by combining...
     Automata
     Balls of words
3. Experiments and discussion

SLIDE 4

Supervised classification with volata

A supervised classification problem needs to define:

1. an input space: X;
2. a finite set of discrete classes: Y;
3. a hypothesis language: H ⊇ X, where each h ∈ H is a function h : X → Y;
4. a subsumption relation ⪰ between hypotheses.

volata requires a generalisation operator, called least general generalisation (lgg), and then provides several learning methods, in particular ensemble methods.

SLIDE 6

Least General Generalisations

Definition: lgg
Given a set of examples E ⊆ X, a hypothesis h ∈ H is a least general generalisation of E iff:
  • ∀e ∈ E : h ⪰ e;
  • there exists no h′ such that ∀e ∈ E : h′ ⪰ e and h ≻ h′.

... and if the lgg is unique
Two possible definitions of the lgg operator:
  • lgg(e1, e2, . . . , en ∈ X) returns h ∈ H;
  • lgg(hn−1, en) returns h ∈ H.
We prefer the second one, which is more efficient for learning.
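The two forms of the lgg operator can be illustrated on the rectangle hypotheses used later in the talk. The sketch below is only an illustration: the names Rect, lgg and lgg_all are assumptions, not part of volata. It shows that the set form can be obtained by folding the incremental form over the examples.

```python
from functools import reduce
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle [x_min, x_max] x [y_min, y_max] (hypothetical type)."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def covers(self, p):
        x, y = p
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def lgg(h, e):
    """Incremental form lgg(h_{n-1}, e_n): smallest rectangle containing h and e."""
    x, y = e
    if h is None:  # no hypothesis yet: the degenerate rectangle reduced to e
        return Rect(x, x, y, y)
    return Rect(min(h.x_min, x), max(h.x_max, x),
                min(h.y_min, y), max(h.y_max, y))


def lgg_all(examples):
    """Set form lgg(e_1, ..., e_n), obtained by folding the incremental form."""
    return reduce(lgg, examples, None)


points = [(1.0, 2.0), (4.0, 0.5), (3.0, 3.0)]
box = lgg_all(points)
```

For rectangles the result does not depend on the order of the examples, which is what uniqueness of the lgg means here.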

SLIDE 9

volata: a three-level architecture

(level 1) lgg operator: computes the least general generalisation of a set of examples. Follows from (H, ⪰).

(level 2) Generalisation of examples using lgg and classes. For a given class, cg (correct generalisation) generalises examples one by one and checks the correctness of each generalisation with respect to the other classes. Depends on the presentation order of examples.

(level 3) Learning of full classifiers. GloBoost is an ensemble method that uses the order dependency of cg to obtain random correct hypotheses and combines them with a "one hypothesis, one vote" principle. Importance of diversity.

Only the first level depends on the hypothesis language H; the other two are generic.
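The three levels can be sketched on rectangles in the plane. This is a minimal sketch under stated assumptions, not the volata implementation: hypotheses are tuples (x0, x1, y0, y1), each round learns one hypothesis per class from a shuffled example order, and a point covered by no hypothesis gets no prediction. All function names are hypothetical.

```python
import random
from collections import Counter


def lgg(h, e):
    """Level 1: smallest rectangle (x0, x1, y0, y1) containing h and point e."""
    x, y = e
    if h is None:
        return (x, x, y, y)
    return (min(h[0], x), max(h[1], x), min(h[2], y), max(h[3], y))


def covers(h, p):
    return h[0] <= p[0] <= h[1] and h[2] <= p[1] <= h[3]


def cg(positives, negatives):
    """Level 2: generalise examples one by one, keeping only correct steps."""
    h = None
    for e in positives:
        candidate = lgg(h, e)
        if not any(covers(candidate, n) for n in negatives):
            h = candidate  # the generalisation covers no other class: keep it
    return h


def globoost(data, rounds, rng):
    """Level 3 (assumed form): collect (hypothesis, label) voters over random orders."""
    labels = sorted({label for _, label in data})
    voters = []
    for _ in range(rounds):
        for target in labels:
            pos = [p for p, label in data if label == target]
            neg = [p for p, label in data if label != target]
            rng.shuffle(pos)  # cg is order dependent, so this adds diversity
            h = cg(pos, neg)
            if h is not None:
                voters.append((h, target))
    return voters


def predict(voters, p):
    """One hypothesis, one vote: majority label among the hypotheses covering p."""
    votes = Counter(label for h, label in voters if covers(h, p))
    return votes.most_common(1)[0][0] if votes else None


data = [((0, 0), "A"), ((1, 1), "A"), ((0, 1), "A"),
        ((3, 3), "B"), ((4, 4), "B"), ((4, 3), "B")]
voters = globoost(data, rounds=5, rng=random.Random(0))
```

Only lgg and covers depend on the hypothesis language; cg, globoost and predict would stay unchanged for automata or balls of words.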

SLIDE 29

lgg, cg and GloBoost in action

In the plane, with examples/points and hypotheses/rectangles: lgg, then cg, then GloBoost [animated figure].

This works only because least general rectangles are unique...

SLIDE 31

The case of disks in the plane

In the plane, with examples/points and hypotheses/disks, there is an infinity of least general disks that include three points.

Requirements to classify with volata:

1. choose an appropriate (H, ⪰);
2. guarantee that least general hypotheses are unique;
3. define the corresponding lgg operator.

Application to grammatical inference and sequence classification?

SLIDE 34

Least general generalisations and grammatical inference

Comparison
  • Learning in the limit. When a positive example arrives, the learner must propose a language that contains the examples seen so far and, eventually, the target language.
  • lgg operator in the GI context. Given a language L and a word w, it provides the smallest language that contains L and w.

lgg operators learn in the limit!

lgg operators in learnability proofs
Available for:
  • k-TSS automata [García and Vidal, 1990];
  • 0-reversible automata [Angluin, 1982];
but not for balls of words [de la Higuera et al., 2008].

SLIDE 42

lgg operators for k-TSS and 0-reversible automata

lgg-tssi:
1. S = {λ, aa, aba} and k = 3;
2. learned automaton [figure: states λ, a, aa, ab, ba, bb; transitions labelled a and b];
3. adding abba and abbba.

lgg-zr:
1. S = {λ, aa, bb, abab, baba};
2. learned automaton [figure: states {0,6}, {1,5}, 2, {3,4}; transitions labelled a and b];
3. adding abba.

Now able to combine such automata and predict the class of sequences!
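The lgg-tssi example above can be replayed with a small sketch. It assumes one common representation of a k-testable language in the strict sense: the sets of allowed prefixes and suffixes of length k-1, the allowed factors of length k, and the words shorter than k-1 stored as they are. The class KTSS and the function lgg_tss are hypothetical names, and the code works on these sets rather than on the automaton itself.

```python
from dataclasses import dataclass, field


@dataclass
class KTSS:
    """A k-TSS language described by four sets (assumed representation)."""
    k: int
    short: set = field(default_factory=set)     # words shorter than k-1
    prefixes: set = field(default_factory=set)  # allowed prefixes of length k-1
    suffixes: set = field(default_factory=set)  # allowed suffixes of length k-1
    factors: set = field(default_factory=set)   # allowed factors of length k

    def parts(self, w):
        k = self.k
        prefix = w[:k - 1]
        suffix = w[len(w) - (k - 1):]
        factors = {w[i:i + k] for i in range(len(w) - k + 1)}
        return prefix, suffix, factors

    def accepts(self, w):
        if len(w) < self.k - 1:
            return w in self.short
        p, s, f = self.parts(w)
        return p in self.prefixes and s in self.suffixes and f <= self.factors


def lgg_tss(h, w):
    """Incremental lgg(h, w): add the parts of w to the sets of h (in place)."""
    if len(w) < h.k - 1:
        h.short.add(w)
    else:
        p, s, f = h.parts(w)
        h.prefixes.add(p)
        h.suffixes.add(s)
        h.factors |= f
    return h


h = KTSS(k=3)
for w in ["", "aa", "aba"]:  # S = {λ, aa, aba}, with λ written as ""
    h = lgg_tss(h, w)
```

With these sets, adding abba makes abba accepted but not abbba; adding abbba introduces the factor bbb, after which any number of b between ab and ba is accepted.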

SLIDE 44

Balls of words: another representation for languages

Balls of words [de la Higuera et al., 2008]
  • three unit-cost edit operations:
      insertion:    aab → aabb
      deletion:     aab → ab
      substitution: aab → abb
  • edit distance: d(w1, w2) is the minimum number of operations needed to transform w1 into w2;
  • a language is defined by a centre (a word) and a radius: e ∈ B2(bab) (or B2(bab) ⪰ e) iff d(bab, e) ≤ 2;
  • a ball of words represents a finite language;
  • learnability results, from positive examples, with noisy data.

lgg operator for balls of words?
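The definitions above translate directly into code. The following is a minimal sketch of the unit-cost edit distance, computed by the classical dynamic programme, and of ball membership; the function names are illustrative.

```python
def edit_distance(u, v):
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(v) + 1))  # distances from the empty prefix of u
    for i, a in enumerate(u, start=1):
        cur = [i]
        for j, b in enumerate(v, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute, or match for free
        prev = cur
    return prev[-1]


def in_ball(centre, radius, word):
    """word ∈ B_radius(centre) iff d(centre, word) ≤ radius."""
    return edit_distance(centre, word) <= radius
```

For instance, in_ball("bab", 2, "a") holds, since deleting both occurrences of b takes two operations.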

SLIDE 45

Multiple least general generalisations for balls of words

A sample with multiple lgg
  • Let E = [a, b, ab];
  • h = B1(a) subsumes all examples in E;
  • h′ = B1(b) subsumes all examples in E;
  • h contains aa, h′ contains bb: h and h′ are not comparable;
  • both h and h′ are lgg of E, because no ball included in both h and h′ subsumes E (for instance, B1(λ) does not subsume ab ∈ E).

The unique-lgg assumption does not hold for balls of words. We have to define a new generalisation operator for balls!
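The claims of this sample can be checked mechanically. The sketch below uses a memoised recursive edit distance; the helper names d and subsumes are illustrative, and λ is written as the empty string.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def d(u, v):
    """Unit-cost edit distance, recursive definition with memoisation."""
    if not u or not v:
        return len(u) + len(v)
    return min(d(u[1:], v) + 1,                      # delete u[0]
               d(u, v[1:]) + 1,                      # insert v[0]
               d(u[1:], v[1:]) + (u[0] != v[0]))     # substitute or match


def subsumes(centre, radius, examples):
    """True iff the ball B_radius(centre) contains every example."""
    return all(d(centre, e) <= radius for e in examples)


E = ["a", "b", "ab"]
both_subsume = subsumes("a", 1, E) and subsumes("b", 1, E)
incomparable = (d("a", "aa") <= 1 < d("b", "aa")) and (d("b", "bb") <= 1 < d("a", "bb"))
```

Both flags evaluate to True, while subsumes("", 1, E) is False because ab is at distance 2 from λ.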

SLIDE 46

An algorithm for generalising balls of words

Algorithm gball
Require: e ∈ X an example, h = Br(o) ∈ H a hypothesis.
Ensure: g ∈ H a generalisation of e and h (g ⪰ h and g ⪰ e).
1: c = o →* e   {a shortest path}
2: let x, y be integers and o′ a word on c such that o →(x) o′ →(y) e
3: x = d(o, o′), y = d(o′, e), x + y = d(o, e)
4: r′ = max(r + x, y)
5: return Br′(o′)

About gball
  • no experimental difference between strategies to choose o′;
  • properties of this algorithm?
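The algorithm can be sketched as follows. This is one possible reading, not the authors' code: a ball is a pair (centre, radius), the shortest path is recovered by backtracking in the edit-distance table (other shortest paths may exist), and x is drawn uniformly at random when it is not given. All names are hypothetical.

```python
import random


def edit_table(u, v):
    """Dynamic-programming table: t[i][j] = d(u[:i], v[:j])."""
    t = [[0] * (len(v) + 1) for _ in range(len(u) + 1)]
    for i in range(len(u) + 1):
        t[i][0] = i
    for j in range(len(v) + 1):
        t[0][j] = j
    for i in range(1, len(u) + 1):
        for j in range(1, len(v) + 1):
            t[i][j] = min(t[i - 1][j] + 1, t[i][j - 1] + 1,
                          t[i - 1][j - 1] + (u[i - 1] != v[j - 1]))
    return t


def shortest_path(u, v):
    """Words on one shortest edit path from u to v, both ends included."""
    t = edit_table(u, v)
    i, j = len(u), len(v)
    word, path = u, [u]
    # Invariant while backtracking: word == u[:i] + v[j:].
    while i > 0 or j > 0:
        if i > 0 and j > 0 and u[i - 1] == v[j - 1] and t[i][j] == t[i - 1][j - 1]:
            i, j = i - 1, j - 1                          # match: no edit
            continue
        if i > 0 and j > 0 and t[i][j] == t[i - 1][j - 1] + 1:
            word = word[:i - 1] + v[j - 1] + word[i:]    # substitution
            i, j = i - 1, j - 1
        elif i > 0 and t[i][j] == t[i - 1][j] + 1:
            word = word[:i - 1] + word[i:]               # deletion
            i -= 1
        else:
            word = word[:i] + v[j - 1] + word[i:]        # insertion
            j -= 1
        path.append(word)
    return path


def gball(h, e, x=None, rng=random):
    """Generalise the ball h = (o, r) with example e; returns (o', r')."""
    o, r = h
    path = shortest_path(o, e)       # step 1: a shortest path c
    dist = len(path) - 1             # d(o, e)
    if x is None:
        x = rng.randint(0, dist)     # assumed random strategy for choosing o'
    x = max(0, min(x, dist))
    y = dist - x                     # steps 2 and 3: x + y = d(o, e)
    return path[x], max(r + x, y)    # steps 4 and 5
```

Starting from B0(a) with the example b, x = 0 gives ("a", 1) and x = 1 gives ("b", 1), the two outcomes discussed later in the talk.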

SLIDE 51

Order dependency

[Figure: generalisation of the examples valencia, venezia and olencix in different orders. Words shown in the figure: vlencia, vencia, venzia; ovencia, olencia; volencix, valencix, valcix.]

Dependency on the presentation order of examples: an interesting property for ensemble methods!

SLIDE 52

Monotonic generalisation...

If g = gball(e, h) then g ⪰ e and g ⪰ h.

Proof. Recall from the algorithm: x + y = d(o, e) and r′ = max(r + x, y).
  • g ⪰ e because d(o′, e) = y ≤ r′ and therefore e ∈ Br′(o′);
  • g ⪰ h because, for each word w ∈ Br(o), the triangle inequality gives d(o′, w) ≤ d(o′, o) + d(o, w), therefore d(o′, w) ≤ x + r ≤ r′ and w ∈ Br′(o′).

SLIDE 53

... but not least general

Counterexample
  • Let E = [a, b];
  • first hypothesis = first example: h = B0(a);
  • two possibilities to choose o′ on the path a →(1) b: either gball(h, b) = B1(a), or gball(h, b) = B1(b);
  • but the ball B1(λ) contains E and is more specific than the computed hypotheses: B1(a) ⪰ B1(λ) and B1(b) ⪰ B1(λ).

Our method generalises a little too much...
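The counterexample can be verified by enumerating the balls explicitly. The sketch below assumes the two-letter alphabet {a, b}, which the slide does not state, and writes λ as the empty string; the function names are illustrative.

```python
from itertools import product


def d(u, v):
    """Unit-cost edit distance by dynamic programming."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        cur = [i]
        for j, b in enumerate(v, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def ball(centre, radius, alphabet="ab"):
    """All words over `alphabet` within `radius` of `centre`.

    A word in the ball has at most len(centre) + radius letters,
    so a bounded enumeration is enough.
    """
    max_len = len(centre) + radius
    words = ("".join(t) for n in range(max_len + 1)
             for t in product(alphabet, repeat=n))
    return {w for w in words if d(centre, w) <= radius}


E = {"a", "b"}
b_lambda, b_a, b_b = ball("", 1), ball("a", 1), ball("b", 1)
```

Here b_lambda contains E and is a strict subset of both b_a and b_b, so neither output of gball is least general for this sample.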

SLIDE 55

Non monotonic correct generalisation

Example
Recall: cg generalises using gball and checks correctness.
Let E+ = [λ, b, a] and E− = [bb]. cg with gball and x = 1 provides successively:
  • B0(λ) (initial hypothesis);
  • B1(b) (rejected because B1(b) accepts bb);
  • B1(a) (validated).
B1(a) ⪰ b, although adding b was previously rejected. This implies a less efficient AdaBoost implementation...

Summary
  • Balls have multiple lgg.
  • gball is order dependent and monotonic, but does not give an lgg ball and leads to a non-monotonic cg.
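This example can be replayed with a short sketch. To stay small it assumes a restricted gball in which the new centre o′ is the example itself (x = d(o, e), y = 0), which coincides with the choice x = 1 on this sample because every example is at distance 1 from the current centre. The names gball_to_example and cg_trace are hypothetical.

```python
def d(u, v):
    """Unit-cost edit distance by dynamic programming."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        cur = [i]
        for j, b in enumerate(v, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def gball_to_example(h, e):
    """gball restricted to o' = e: r' = max(r + x, y) with x = d(o, e), y = 0."""
    o, r = h
    return e, r + d(o, e)


def accepts(h, w):
    centre, radius = h
    return d(centre, w) <= radius


def cg_trace(positives, negatives):
    """cg over balls, recording each candidate and whether it was kept."""
    h = (positives[0], 0)  # initial hypothesis: ball of radius 0 on the first example
    trace = [(h, "initial")]
    for e in positives[1:]:
        g = gball_to_example(h, e)
        if any(accepts(g, n) for n in negatives):
            trace.append((g, "rejected"))   # keep the previous hypothesis
        else:
            h = g
            trace.append((g, "validated"))
    return h, trace


h, trace = cg_trace(["", "b", "a"], ["bb"])  # λ is the empty string
```

The final hypothesis is B1(a), which accepts b even though the candidate built from b was rejected one step earlier.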

SLIDE 57

Experiments (1): UCI sequential datasets [Asuncion and Newman, 2007]

Classical GI methods against ensemble methods.

Datasets and protocol
  • tic-tac-toe, badges, us-first-names, promoters, splice;
  • 10-fold cross validation, 90% of data for learning.

Competitors
  • Majority, RPNI, TraxBar, Red-Blue;
  • GloBoost + lgg-tssi and GloBoost + lgg-zr;
  • GloBoost + gball and random strategy.

SLIDE 58

Results on UCI datasets

                  tic-tac-toe   badges    promoters  first-name  splice
Majority          65.34 %       71.43 %   50.00 %    81.62 %     50.26 %
RPNI              91.13 %       62.24 %   -          81.42 %     -
TraxBar           90.81 %       57.48 %   56.60 %    81.37 %     58.33 %
Red-Blue          93.89 %       61.09 %   63.02 %    82.83 %     54.65 %
Glgg-tssi 1 000   91.47 %       72.69 %   61.13 %    89.50 %     78.07 %
Glgg-zr 1 000     98.36 %       71.43 %   50.00 %    83.07 %     -
Ggball 1 000      92.95 %       80.41 %   87.63 %    87.10 %     93.76 %
Ggball 10 000     94.69 %       81.39 %   88.43 %    88.80 %     95.63 %
Ggball 100 000    94.96 %       81.36 %   89.08 %    89.06 %     95.62 %

Observations:
  • combinations are better than classical methods;
  • reversible: too specific; automata: too slow;
  • balls: speed + diversity; balls: good in genomics.

SLIDE 60

Experiments (2): handwritten digit classification

volata and balls on a real problem.

Datasets and protocol
  • NIST Special Database 3;
  • 10 classes, 10 568 examples;
  • 10-fold cross validation, 10% of data for learning.

Competitors
  • SeDiL [Boyer et al., 2008];
  • GloBoost + gball and random strategy.

Results

SeDiL             95.86 %
Ggball 1 000      93.81 %
Ggball 10 000     95.93 %
Ggball 100 000    96.32 %

Good results for volata and balls!

SLIDE 61

Summary and perspectives

Automata
  • unique lgg for two classes;
  • corresponding lgg available and embedded in volata;
  • not fast, very specific;
  • other language classes with or without unique least general generalisations?
  • classes that reach regular languages by union.

Balls
  • gball usable by volata;
  • fast generalisations and fast classifications;
  • great diversity, many balls;
  • good experimental results;
  • real applications;
  • study hollow balls.

Thank you for your attention. Any questions?

SLIDE 63

A ball of first-names

A learned ball: B7(LRLRTSVKCA) contains 346 female first-names (all at distance 7) and 0 male:

ALBERTHA BERTA DRUSILLA ELSA FRANCESCA HORTENSIA JESSIKA KRYSTINA LORENZA MIRTA NERISSA OCTAVIA PARTICIA REBBECA SYLVIA TERESSA URSULA VERONICA etc.

  • average distance between examples: 4.9
  • average min distance between examples: 1.2
  • maximal min distance between examples: 3.0

SLIDE 64

Bibliography I

Angluin, D. (1982). Inference of reversible languages. Journal of the ACM, 29(3):741–765.

Asuncion, A. and Newman, D. (2007). UCI machine learning repository.

Boyer, L., Esposito, Y., Habrard, A., Oncina, J., and Sebban, M. (2008). SeDiL: Software for edit distance learning. In Daelemans, W., Goethals, B., and Morik, K., editors, Proceedings of the 19th European Conference on Machine Learning, pages 672–677. Springer.

SLIDE 65

Bibliography II

de la Higuera, C., Janodet, J.-C., and Tantini, F. (2008). Learning languages from bounded resources: the case of the DFA and the balls of strings. In Clark, A., Coste, F., and Miclet, L., editors, Proceedings of the 9th International Conference in Grammatical Inference, volume 5278 of Lecture Notes in Artificial Intelligence, pages 43–56. Springer.

García, P. and Vidal, E. (1990). Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell., 12(9):920–925.