SLIDE 1

Active Learning by the Naive Credal Classifier

Alessandro Antonucci∗, Giorgio Corani∗, Sandra Gabaglio†

∗Istituto “Dalle Molle” di Studi sull’Intelligenza Artificiale - Lugano (Switzerland) †ISIN/SUPSI - Lugano (Switzerland)

PGM’12 - Granada, September 20, 2012

SLIDE 2

Active Learning

Class variable C (values in C) and attributes A := (A1, . . . , Ak).

Training dataset (supervised):
(c^(1), a_1^(1), . . . , a_k^(1))
(c^(2), a_1^(2), . . . , a_k^(2))
. . .
(c^(n), a_1^(n), . . . , a_k^(n))

Test set/instance (unsupervised): (∗, ã_1, . . . , ã_k)

SLIDES 3-8

Active Learning (build-up of the previous slide)

From the training dataset a classifier is learned. An active set of unlabelled instances,
(∗, a_1^(a), . . . , a_k^(a)), (∗, a_1^(b), . . . , a_k^(b)), (∗, a_1^(c), . . . , a_k^(c)),
is then scored by an active-learning score (values such as .3, .5, .8 in the example). The highest-scoring instance is annotated by an expert, becoming (c_b, a_1^(b), . . . , a_k^(b)), and is moved into the training set. Retraining on the enlarged dataset yields an actively learned classifier (more accurate).
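The pool-based loop sketched above can be written generically. The following is a minimal Python sketch, where `train`, `score`, and `annotate` are hypothetical stand-ins for the classifier's learning procedure, the active-learning score, and the expert labelling step.

```python
def active_learning_loop(train_set, active_set, train, score, annotate, budget):
    """Generic pool-based active-learning loop (sketch).

    train_set:  list of (label, attributes) pairs
    active_set: list of unlabelled attribute vectors
    train:      callable(train_set) -> classifier
    score:      callable(classifier, attributes) -> AL score (higher = harder)
    annotate:   callable(attributes) -> label (the human expert)
    budget:     number of annotations we can afford
    """
    classifier = train(train_set)
    for _ in range(min(budget, len(active_set))):
        # Pick the hardest-to-classify instance in the active set.
        best = max(active_set, key=lambda a: score(classifier, a))
        active_set.remove(best)
        # The expert supplies the missing class label.
        train_set.append((annotate(best), best))
        # Retrain on the enlarged training set.
        classifier = train(train_set)
    return classifier
```

The `train`/`score`/`annotate` callables are placeholders: any classifier and any of the scores discussed later (US, CUS, QbC, CQbC) can be plugged in.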

SLIDES 9-12

Accuracy Trajectories

With a constant AL score, the pick among the active-set instances is random: as instances are added, the variance error decreases and accuracy increases.

[Figure: accuracy on the test set vs. training-set size, from N (active set full) through N+d, N+2d, . . . , N+kd (active set empty), comparing the Random Pick trajectory with that of an AL algorithm.]

AL algorithms should do better than random pick!
SLIDE 13

Naive Classifiers

Structure: the class C is the parent of all attributes A1, A2, . . . , Ak. Given the class C, the attributes are independent.

NAIVE BAYES (NBC)
A Bayesian network quantified from data with a flat Dirichlet prior Dir(st):
P(c, a) = P(c) · ∏_{i=1}^{k} P(a_i | c).
Given a test instance a, it assigns the class c∗ := arg max_{c ∈ C} P(c | a). For each pair c′, c′′ ∈ C, the dominance test is
P(c′ | a) / P(c′′ | a) = P(c′, a) / P(c′′, a) > 1.

NAIVE CREDAL (NCC)
Bayesian networks quantified by a set of priors, the imprecise Dirichlet model:
T ≡ { Dir(st) : t > 0, ∑_i t_i = 1 }.
It returns a set C∗ ⊆ C of optimal (undominated) classes, via the conservative dominance test
min_{t ∈ T} P_t(c′, a) / P_t(c′′, a) > 1.
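The two dominance tests can be illustrated with a deliberately simplified sketch. Everything below is an assumption for illustration: discrete attributes, binary attribute values, and imprecision restricted to the class prior (the full NCC/IDM lets t enter the attribute priors too, so this is a simplification, not the actual NCC test).

```python
from collections import Counter, defaultdict

def learn_counts(data):
    """Count class and (attribute value, class) frequencies.

    data: list of (c, (a_1, ..., a_k)) pairs with discrete attributes.
    """
    n_c = Counter()
    n_ac = defaultdict(Counter)  # n_ac[i][(a_i, c)]
    for c, attrs in data:
        n_c[c] += 1
        for i, a in enumerate(attrs):
            n_ac[i][(a, c)] += 1
    return n_c, n_ac

def nbc_joint(c, attrs, n_c, n_ac, classes, s=2.0):
    """P(c, a) under a flat Dirichlet prior Dir(st) with uniform t."""
    n = sum(n_c.values())
    p = (n_c[c] + s / len(classes)) / (n + s)  # class prior term
    for i, a in enumerate(attrs):
        # Smoothed P(a_i | c); binary attribute values assumed.
        p *= (n_ac[i][(a, c)] + s / 2) / (n_c[c] + s)
    return p

def credal_dominates(c1, c2, attrs, n_c, n_ac, s=2.0):
    """Conservative test min_t P_t(c1,a)/P_t(c2,a) > 1, with imprecision
    only in the class prior: the minimum is approached as t_c1 -> 0 and
    t_c2 -> 1 (a simplification of the exact NCC test)."""
    n = sum(n_c.values())
    lo = n_c[c1] / (n + s)        # lower class prior for c1
    hi = (n_c[c2] + s) / (n + s)  # upper class prior for c2
    def lik(c):
        p = 1.0
        for i, a in enumerate(attrs):
            p *= (n_ac[i][(a, c)] + s / 2) / (n_c[c] + s)
        return p
    return lo * lik(c1) > hi * lik(c2)
```

On well-separated data both tests agree; near the decision boundary the credal test may refuse to dominate in either direction, which is exactly the indeterminacy NCC uses to flag hard instances.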

slide-15
SLIDE 15

Naive Classifiers

C Ak . . . A2 A1

Given class C, attributes independent

NAIVE BAYES (NBC)

A BN quantified from data by a flat Dirichlet prior Dir(st) P(c, a) = P(c) · m

i=1 P(ai|c)

Given test instance a, assigns class c∗ := arg maxc∈C P(c|f) ∀c′, c′′ ∈ C, dominance test

P(c′|a) P(c′′|a) = P(c′,a) P(c′′,a) > 1

NAIVE CREDAL (NCC)

BNs quantified by set of priors Imprecise Dirichlet model

T ≡

  • Dir(st) : t > 0,

i ti = 1

  • A set C∗ ⊆ C of optimal

(undominated) classes Conservative dominance test mint∈T

Pt(c′,a) Pt(c′′,a) > 1

slide-16
SLIDE 16

Naive Classifiers

C Ak . . . A2 A1

Given class C, attributes independent

NAIVE BAYES (NBC)

A BN quantified from data by a flat Dirichlet prior Dir(st) P(c, a) = P(c) · m

i=1 P(ai|c)

Given test instance a, assigns class c∗ := arg maxc∈C P(c|f) ∀c′, c′′ ∈ C, dominance test

P(c′|a) P(c′′|a) = P(c′,a) P(c′′,a) > 1

NAIVE CREDAL (NCC)

BNs quantified by set of priors Imprecise Dirichlet model

T ≡

  • Dir(st) : t > 0,

i ti = 1

  • A set C∗ ⊆ C of optimal

(undominated) classes Conservative dominance test mint∈T

Pt(c′,a) Pt(c′′,a) > 1

SLIDE 17

Uncertainty Samplings

The AL score(a) measures how hard to classify an instance is: difficult/ambiguous instances contribute more to learning.

Uncertainty Sampling (US)
Based on the NBC posterior P(C | a). The smaller the probability of the most probable class, the harder the instance:
score(a) ≡ −P(c∗ | a).

Credal Uncertainty Sampling (CUS)
Based on the set of NCC posteriors P(C | a). The weaker the dominances, the harder the instance. If C = {c′, c′′} (binary class):
score(a) ≡ − max { min_t P(c′ | a) / P(c′′ | a), min_t P(c′′ | a) / P(c′ | a) }.
More than two classes? Take the max over all pairs (c′, c′′) ∈ C².
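A minimal sketch of the two scores. For CUS, the exact NCC minimum over t is assumed unavailable and is replaced by the conservative bound min_t P(c′|a)/P(c′′|a) ≥ lower(c′)/upper(c′′), computed from interval posteriors; this is an illustrative approximation, not the test used in the paper.

```python
def us_score(posterior):
    """Uncertainty sampling: score(a) = -P(c*|a).

    posterior: dict c -> P(c|a).
    """
    return -max(posterior.values())

def cus_score(lower, upper):
    """Credal uncertainty sampling (sketch) from interval posteriors.

    lower/upper: dicts c -> lower/upper posterior probability.
    The exact minimum min_t P(c'|a)/P(c''|a) is replaced by the
    conservative bound lower[c'] / upper[c''], maximised over pairs.
    """
    best = float("-inf")
    for c1 in lower:
        for c2 in lower:
            if c1 != c2:
                best = max(best, lower[c1] / upper[c2])
    return -best
```

A strongly dominated instance gets a very negative score; an ambiguous one (overlapping intervals) gets a score near or above −1, so it is queried first.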


SLIDE 20

Preliminary Experiments

US vs. CUS: very similar performance. Importance sampling, score′(a) = P(a) · score(a), can also be applied with NCC, but brings no significant improvement: the naive model cannot provide realistic probabilistic estimates of P(a).

[Figure: accuracy vs. number of training instances for US and CUS on diabetes (200-600 instances, accuracy 0.65-0.75) and iris (20-40 instances, accuracy 0.84-0.88).]
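The importance-sampling variant is a one-liner once P(a) is available. The sketch below assumes a hypothetical `joint` callable returning P(c, a) (e.g. from a learned NBC), and recovers P(a) by marginalising over the classes.

```python
def importance_score(score, joint, a, classes):
    """score'(a) = P(a) * score(a), with P(a) = sum_c P(c, a).

    score:   callable(a) -> base AL score (e.g. US or CUS)
    joint:   callable(c, a) -> P(c, a), e.g. from a learned NBC
    classes: iterable of class values
    """
    p_a = sum(joint(c, a) for c in classes)
    return p_a * score(a)
```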


SLIDE 23

Query-by-committee (QbC)

Pick instances whose posterior is not robust w.r.t. resampling. The training set is resampled (by bootstrap) q times; for each bootstrap j = 1, . . . , q, an NBC is learned from the data, giving the posterior P^(j)(C | a). Compute the average (the center of mass of a polytope in the simplex):
P̃_a(c) := (1/q) ∑_{j=1}^{q} P^(j)(c | a).
The score is the average KL distance from the center of mass:
score(a) := (1/q) ∑_{j=1}^{q} KL[ P^(j)(C | a), P̃_a(C) ].

[Figure: simplex over P(c′), P(c′′), P(c′′′), showing the committee posteriors and their center of mass.]

NCC detects robustness w.r.t. the prior (no need for resampling)!
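The bootstrap committee and its KL-to-center-of-mass score can be sketched directly from the formulas above. The `posterior` callable is a hypothetical stand-in for "learn an NBC from this bootstrap and compute P(C|a)".

```python
import math
import random

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (same support)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def qbc_score(train_set, a, posterior, q=10, seed=0):
    """Query-by-committee score for instance a (sketch).

    posterior: callable(train_set, a) -> list of class probabilities,
               a stand-in for learning an NBC and computing P(C|a).
    """
    rng = random.Random(seed)
    committee = []
    for _ in range(q):
        # Bootstrap resample of the training set (with replacement).
        boot = [rng.choice(train_set) for _ in train_set]
        committee.append(posterior(boot, a))
    k = len(committee[0])
    # Center of mass of the committee posteriors.
    com = [sum(p[i] for p in committee) / q for i in range(k)]
    # Average KL distance from the center of mass.
    return sum(kl(p, com) for p in committee) / q
```

If the posterior is insensitive to the resampling the score is (numerically) zero; the more the committee members spread out in the simplex, the larger the score.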


SLIDE 28

Credal Query-by-committee (CQbC)

The credal set P(C | a) of NCC posteriors provides the committee members (hard to compute for the global IDM, easy for the local one). Behaviourally (Walley), only the extremes of the credal set matter. Can committee members inside the convex hull be ignored? True for QbC! CQbC can therefore consider only the extremes, with no need to tune the number q of committee members.


SLIDE 32

Credal Query-by-committee (CQbC) (ii)

1. Compute the lower and upper probabilities {P̲(c | a), P̄(c | a)} for each c ∈ C.
2. Evaluate the vertices of the consistent credal set (P̲ ≤ P ≤ P̄).
3. Apply standard QbC to these vertices.

Comment: the average KL distance from the center of mass acts as an uncertainty measure for credal sets; AL can be used as a benchmark to test uncertainty measures.
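The three steps above can be sketched for credal sets defined by probability intervals. The sketch uses the standard characterisation that each vertex of such a set has every coordinate at its lower or upper bound except at most one, which normalisation fixes; this is an illustrative implementation, with the exact bound computation left to the caller.

```python
import math
from itertools import product

def interval_vertices(lower, upper, tol=1e-9):
    """Vertices of the credal set {P : lower <= P <= upper, sum P = 1}.

    Each vertex has every coordinate at a bound except at most one
    'free' coordinate fixed by normalisation.
    """
    k = len(lower)
    vertices = []
    for free in range(k):
        others = [i for i in range(k) if i != free]
        for choice in product((0, 1), repeat=k - 1):
            p = [0.0] * k
            for i, bit in zip(others, choice):
                p[i] = upper[i] if bit else lower[i]
            p[free] = 1.0 - sum(p[i] for i in others)
            feasible = lower[free] - tol <= p[free] <= upper[free] + tol
            duplicate = any(all(abs(x - y) < tol for x, y in zip(p, v))
                            for v in vertices)
            if feasible and not duplicate:
                vertices.append(p)
    return vertices

def cqbc_score(lower, upper, eps=1e-12):
    """Step 3: standard QbC (average KL from the center of mass)
    applied to the vertices of the consistent credal set."""
    vs = interval_vertices(lower, upper)
    k = len(lower)
    com = [sum(v[i] for v in vs) / len(vs) for i in range(k)]
    kl = lambda p, q: sum(pi * math.log((pi + eps) / (qi + eps))
                          for pi, qi in zip(p, q))
    return sum(kl(v, com) for v in vs) / len(vs)
```

A degenerate interval (lower = upper) yields a single vertex and a zero score; wider intervals yield more spread-out vertices and a larger score.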


SLIDE 40

Experiments

[Figure: accuracy vs. number of training instances for QbC and CQbC on four datasets: diabetes (200-600 instances, accuracy 0.65-0.75), iris (20-40, 0.84-0.88), solar-flare-C (100-200, 0.75-0.85), liver-disorders (100-200, 0.55-0.65).]

SLIDE 41

Conclusions and Outlook

  • A novel algorithm for AL with NBC, based on its credal version (NCC)
  • Competitive performance, based on a more intrinsic approach to sensitivity analysis
  • AL as a benchmark to test uncertainty measures for credal sets
  • To be extended to more general classifiers