SLIDE 1 Active Learning by the Naive Credal Classifier
Alessandro Antonucci∗, Giorgio Corani∗, Sandra Gabaglio†
∗Istituto “Dalle Molle” di Studi sull’Intelligenza Artificiale - Lugano (Switzerland) †ISIN/SUPSI - Lugano (Switzerland)
PGM’12 - Granada, September 20, 2012
SLIDES 2–8 Active Learning
Class C (values in C), attributes A := (A1, . . . , Ak)
Training dataset (supervised):
c(1), a1(1), . . . , ak(1)
c(2), a1(2), . . . , ak(2)
. . .
c(n), a1(n), . . . , ak(n)
Test set/instance (unsupervised): ∗, ã1, . . . , ãk
A classifier is learned from the training data and queried on the test instances.
Active set (unsupervised):
∗, a1(a), . . . , ak(a)
∗, a1(b), . . . , ak(b)
∗, a1(c), . . . , ak(c)
An AL score is computed for each active-set instance (e.g., .3, .5, .8). The top-scoring instance is annotated by an oracle (e.g., cb, a1(b), . . . , ak(b)) and moved to the training set. Retraining on the augmented data yields an actively learned classifier (more accurate).
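The loop sketched on these slides (score the active set, query the oracle for the top-scoring instance, retrain) can be written generically. This is a minimal sketch, not the authors' implementation; all names (`score`, `query_label`, `batch`) are illustrative:

```python
def active_learning_loop(train, active_pool, score, query_label, rounds, batch=1):
    """Generic pool-based active learning: at each round, rank the
    unlabelled pool by the AL score, query labels for the top-scoring
    instances and move them into the training set."""
    train = list(train)
    pool = list(active_pool)
    for _ in range(rounds):
        if not pool:
            break  # active set empty: nothing left to query
        # higher score = harder / more informative instance
        pool.sort(key=score, reverse=True)
        for _ in range(min(batch, len(pool))):
            x = pool.pop(0)
            train.append((query_label(x), x))  # oracle annotates the instance
    return train, pool

# toy usage: a constant score degenerates to an arbitrary (random-like) pick
train, pool = active_learning_loop(
    train=[(0, "a"), (1, "b")],
    active_pool=["c", "d", "e"],
    score=lambda x: 0.0,
    query_label=lambda x: 1,
    rounds=2,
)
```

The classifier retraining step is left outside the loop on purpose: any `score` closing over the current classifier can be plugged in (US, CUS, QbC, CQbC below).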
SLIDES 9–12 Accuracy Trajectories
Constant AL score: random pick among the active-set instances (variance error decreases, accuracy increases).
[Plot: accuracy (on test set) vs. training set size, from N (active set FULL) through N+d, N+2d, . . . , N+kd (active set EMPTY); one curve for the random pick, one for the AL algorithm.]
AL algorithms should do better!
SLIDES 13–16 Naive Classifiers
C → A1, A2, . . . , Ak (naive topology)
Given the class C, the attributes are assumed independent.
NAIVE BAYES (NBC)
A BN quantified from data with a flat Dirichlet prior Dir(st):
P(c, a) = P(c) · ∏_{i=1}^{k} P(ai | c)
Given a test instance a, it assigns the class c∗ := arg max_{c∈C} P(c|a).
For each c′, c′′ ∈ C, dominance test: P(c′|a) / P(c′′|a) = P(c′, a) / P(c′′, a) > 1.
NAIVE CREDAL (NCC)
A set of BNs quantified by a set of Dirichlet priors (imprecise Dirichlet model):
T ≡ { t : ∑_i ti = 1, ti > 0 }
It returns the set of undominated classes, via the conservative dominance test:
min_{t∈T} P_t(c′, a) / P_t(c′′, a) > 1.
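The NBC quantification and dominance test above can be sketched on categorical data. This is an illustrative sketch under a flat prior (uniform t, as on the slide); the function names are my own:

```python
def nbc_joint(data, c, a, s=1.0):
    """Naive-Bayes joint P(c, a) from counts, smoothed by a flat
    Dirichlet prior Dir(st) with uniform t (sketch, not the paper's code).
    data: list of (class, attribute-tuple) pairs; a: attribute tuple."""
    n = len(data)
    classes = sorted({ci for ci, _ in data})
    t_c = 1.0 / len(classes)                      # flat prior over classes
    n_c = sum(1 for ci, _ in data if ci == c)
    p = (n_c + s * t_c) / (n + s)                 # P(c)
    for i in range(len(a)):
        vals = sorted({ai[i] for _, ai in data})
        t_a = 1.0 / len(vals)                     # flat prior over values of Ai
        n_ca = sum(1 for ci, ai in data if ci == c and ai[i] == a[i])
        p *= (n_ca + s * t_a) / (n_c + s)         # P(ai | c)
    return p

def nbc_classify(data, a, classes):
    # c* maximizes P(c, a): equivalently P(c*, a)/P(c'', a) > 1 for all c''
    return max(classes, key=lambda c: nbc_joint(data, c, a))
```

The NCC's conservative test would instead minimize the ratio P_t(c′, a)/P_t(c′′, a) over all t in the simplex T, rather than fixing t to be uniform.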
SLIDES 17–19 Uncertainty Samplings
The AL score(a) measures how hard to classify an instance is: difficult/ambiguous instances contribute more to learning.
Uncertainty Sampling (US)
Based on the NBC posterior P(C|a). The smaller the probability of the most probable class, the harder the instance is to classify:
score(a) ≡ −P(c∗|a)
Credal Uncertainty Sampling (CUS)
Based on the set of NCC posteriors P(C|a). The weaker the dominances, the harder the instance is to classify. If C = {c′, c′′} (binary class):
score(a) ≡ − max { min_t P_t(c′|a) / P_t(c′′|a), min_t P_t(c′′|a) / P_t(c′|a) }
For more than two classes: max over all pairs (c′, c′′) ∈ C².
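The two scores can be contrasted in a few lines. A minimal sketch, assuming the NCC's lower dominance ratios have already been computed elsewhere (the names below are illustrative):

```python
def us_score(posterior):
    """Uncertainty sampling: score(a) = -P(c*|a). The flatter the NBC
    posterior, the higher the score."""
    return -max(posterior.values())

def cus_score_binary(low_ratio_fwd, low_ratio_bwd):
    """Credal uncertainty sampling, binary class (sketch): the arguments
    are min_t P_t(c'|a)/P_t(c''|a) and min_t P_t(c''|a)/P_t(c'|a) from
    the NCC. The weaker the strongest dominance, the higher the score."""
    return -max(low_ratio_fwd, low_ratio_bwd)

# a flat posterior scores higher (harder to classify) than a peaked one
assert us_score({"c1": 0.5, "c2": 0.5}) > us_score({"c1": 0.9, "c2": 0.1})
```

With more than two classes, `cus_score_binary` would be applied to every pair (c′, c′′) and the maximum ratio negated, as on the slide.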
SLIDES 20–22 Preliminary Experiments
US vs. CUS: very similar performance.
Importance sampling: score′(a) = P(a) · score(a); this can also be done with the NCC. No significant improvement: naive models cannot provide realistic probabilistic estimates of P(a).
[Plots: accuracy vs. training instances for US and CUS on diabetes (200–600 instances, accuracy 0.65–0.75) and iris (20–40 instances, accuracy 0.84–0.88).]
SLIDES 23–27 Query-by-committee (QbC)
Pick the instances whose posterior is not robust w.r.t. resampling.
The training set is resampled (by bootstrap) q times; from each bootstrap replicate an NBC is learned, yielding posteriors P(j)(C|a), j = 1, . . . , q.
Compute their average (the center of mass of a polytope in the simplex):
P̃a(c) := (1/q) ∑_{j=1}^{q} P(j)(c|a)
The score is the average KL distance from the center of mass:
score(a) := (1/q) ∑_{j=1}^{q} KL[P(j)(C|a), P̃a(C)]
[Simplex plot over P(c′), P(c′′), P(c′′′): the committee posteriors and their center of mass.]
The NCC detects robustness w.r.t. the prior (no need of resampling)!
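The QbC score above is the average KL divergence of the committee posteriors from their center of mass. A self-contained sketch (illustrative names, posteriors given as dicts over the same classes):

```python
import math

def qbc_score(posteriors):
    """Query-by-committee score: posteriors is a list of q committee
    distributions P(j)(C|a); the score is their average KL divergence
    from the center of mass (their componentwise average)."""
    q = len(posteriors)
    classes = posteriors[0].keys()
    com = {c: sum(p[c] for p in posteriors) / q for c in classes}  # center of mass
    def kl(p, m):
        # KL[p || m], with the convention 0 * log(0/x) = 0
        return sum(p[c] * math.log(p[c] / m[c]) for c in classes if p[c] > 0)
    return sum(kl(p, com) for p in posteriors) / q
```

An agreeing committee scores zero; the more the bootstrap posteriors disagree, the larger the score, and the more informative the instance is deemed.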
SLIDES 28–31 Credal Query-by-committee (CQbC)
The credal set P(C|a) of NCC posteriors provides the committee members (hard/easy to compute for the global/local IDM).
Behaviourally (Walley), only the extremes of the credal set matter. Can the committee members inside the convex hull be ignored? This holds for QbC! CQbC can therefore consider only the extremes; there is no need to tune the number q of committee members.
SLIDES 32–39 Credal Query-by-committee (CQbC) (ii)
1. Compute the lower and upper probabilities {P̲(c|a), P̄(c|a)} for each c ∈ C
2. Evaluate the vertices of the consistent credal set (P̲ ≤ P ≤ P̄)
3. Apply standard QbC to these vertices
Comment
The average KL distance from the center of mass acts as an uncertainty measure for credal sets; AL serves as a benchmark for uncertainty measures.
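For a binary class the three steps above are easy to trace: the credal set has exactly two vertices, determined by the lower and upper probability of one class, and standard QbC is run on them. A sketch for the binary case only (general credal sets need proper vertex enumeration; names are illustrative):

```python
import math

def credal_vertices_binary(p_low, p_up):
    """Step 2, binary class: the credal set [p_low, p_up] on the first
    class has two vertices, each a full distribution over both classes."""
    return [(p_low, 1 - p_low), (p_up, 1 - p_up)]

def cqbc_score(p_low, p_up):
    """Steps 2-3: run the QbC machinery on the extreme points of the
    credal set instead of on bootstrap replicates."""
    vertices = credal_vertices_binary(p_low, p_up)
    q = len(vertices)
    com = tuple(sum(v[i] for v in vertices) / q for i in range(2))
    def kl(p, m):
        return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)
    return sum(kl(v, com) for v in vertices) / q

# a wider probability interval (less robust posterior) scores higher
assert cqbc_score(0.3, 0.7) > cqbc_score(0.45, 0.55)
```

A precise posterior (p_low = p_up) scores zero, mirroring the agreeing-committee case of QbC.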
SLIDE 40 Experiments
[Plots: accuracy vs. training instances for QbC and CQbC on diabetes (200–600 instances, accuracy 0.65–0.75), iris (20–40, 0.84–0.88), solar-flare-C (100–200, 0.75–0.85), and liver-disorders (100–200, 0.55–0.65).]
SLIDE 41 Conclusions and Outlooks
A novel algorithm for AL with the NBC, based on its credal version.
Competitive performance, based on a more intrinsic approach to sensitivity analysis.
AL as a benchmark to test uncertainty measures for credal sets.
To be extended to more general classifiers.