SUPERSET LEARNING AND DATA IMPRECISIATION - Eyke Hüllermeier - PowerPoint PPT Presentation


SLIDE 1

Eyke Hüllermeier

Intelligent Systems Group Department of Computer Science University of Paderborn, Germany

eyke@upb.de

SUPERSET LEARNING AND DATA IMPRECISIATION

TFML 2017, Krakow, 15-FEB-2017

SLIDE 2

OUTLINE

PART 1: Superset learning - what it is about ...
PART 2: Optimistic loss minimization - a general approach to superset learning ...
PART 3: Data imprecisiation - using superset learning for weighted learning ...

SLIDE 3

SUPERSET LEARNING

... is a specific type of weakly supervised learning, studied under different names in machine learning:

  • learning from partial labels
  • multiple label learning
  • learning from ambiguously labeled examples
  • ...

... also connected to learning from coarse data in statistics (Rubin, 1976; Heitjan and Rubin, 1991), missing values, and data augmentation (Tanner and Wong, 2012).

SLIDE 4

SUPERSET LEARNING

  • Consider a standard setting of supervised learning with instance space X, output space Y, and hypothesis space H.
  • Output values yn ∈ Y associated with training instances xn, n = 1, ..., N, are not necessarily observed precisely but only characterised in terms of supersets Yn ∋ yn.
  • The set of imprecise/ambiguous/coarse observations is denoted O = {(x1, Y1), ..., (xN, YN)}.
  • An instantiation of O, denoted D, is obtained by replacing each Yn with a candidate yn ∈ Yn.
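
To make the notation concrete, here is a minimal sketch (illustrative only; the variable names and data are not from the slides) of superset-labelled data O and the enumeration of its instantiations D:

    from itertools import product

    # Superset-labelled observations O = [(x_n, Y_n)]: each instance x_n comes with
    # a *set* Y_n of candidate labels that is guaranteed to contain the
    # (unobserved) true label y_n.
    O = [
        ([1.0, 0.3], {"cat"}),           # precisely labelled example
        ([0.2, 1.7], {"cat", "dog"}),    # ambiguous: true label is cat OR dog
        ([2.1, 0.9], {"dog", "fox"}),    # ambiguous: true label is dog OR fox
    ]

    def instantiations(observations):
        """Enumerate all instantiations D of O, i.e. all precise data sets obtained
        by picking one candidate y_n from each superset Y_n."""
        xs = [x for x, _ in observations]
        for choice in product(*(sorted(Y) for _, Y in observations)):
            yield list(zip(xs, choice))

    for D in instantiations(O):
        print(D)   # 1 * 2 * 2 = 4 candidate precise data sets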

SLIDE 5

EXAMPLE: CLASSIFICATION

Classes

SLIDE 6

EXAMPLE: CLASSIFICATION

Classes

One of many instantiations

SLIDE 7

EXAMPLE: REGRESSION

One of infinitely many instantiations

SLIDE 8

DATA DISAMBIGUATION

How to learn from (super)set-valued data?

SLIDE 9

DATA DISAMBIGUATION

Classes

SLIDE 10

DATA DISAMBIGUATION

Classes

SLIDE 11

DATA DISAMBIGUATION

Classes

SLIDE 12

DATA DISAMBIGUATION

SLIDE 13

DATA DISAMBIGUATION

SLIDE 14

DATA DISAMBIGUATION

MORE PLAUSIBLE: A plausible instantiation that can be fitted reasonably well with a LINEAR model!
LESS PLAUSIBLE: A less plausible instantiation, because there is no LINEAR model with a good fit!
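
The following sketch illustrates this plausibility argument with made-up interval-valued regression data: each candidate instantiation (coarsely discretised to interval endpoints) is scored by the residual error of the best-fitting linear model, so instantiations admitting a good linear fit count as more plausible.

    import numpy as np
    from itertools import product

    # Hypothetical interval-valued regression data: y_n is only known to lie in [lo_n, hi_n].
    X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    Y = [(0.8, 1.2), (1.5, 3.5), (2.8, 3.2), (3.0, 6.0), (4.7, 5.3)]

    def linear_fit_error(x, y):
        """Sum of squared residuals of the least-squares line through (x, y)."""
        A = np.column_stack([x, np.ones_like(x)])
        _, res, _, _ = np.linalg.lstsq(A, y, rcond=None)
        return res[0] if res.size else 0.0

    # Score candidate instantiations; using only interval endpoints is a coarse simplification.
    candidates = product(*[(lo, hi) for lo, hi in Y])
    best = min(candidates, key=lambda ys: linear_fit_error(X, np.array(ys)))
    print("most plausible (w.r.t. a linear model) instantiation:", best)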

SLIDE 15

DATA DISAMBIGUATION

PLAUSIBLE / PLAUSIBLE: Both instantiations can be fitted quite well with a QUADRATIC model!

It all depends on how you look at the data!

SLIDE 16

DATA DISAMBIGUATION


assume both class distributions to be Gaussian

SLIDE 17

DATA DISAMBIGUATION

plausible instantiation


assume both class distributions to be Gaussian

quadratic discriminant
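
A sketch of the same idea for this Gaussian example (the data points and helper below are purely hypothetical): each candidate instantiation of the ambiguous labels is scored by the log-likelihood of class-wise Gaussian fits; the instantiation under which two Gaussian classes explain the data best is the most plausible one, and its decision boundary is the quadratic discriminant.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gaussian_loglik(X, labels):
        """Log-likelihood of the data under one Gaussian per class (QDA-style model)."""
        total = 0.0
        for c in np.unique(labels):
            Xc = X[labels == c]
            mu = Xc.mean(axis=0)
            cov = np.cov(Xc.T) + 1e-3 * np.eye(X.shape[1])   # small ridge for stability
            total += multivariate_normal(mu, cov).logpdf(Xc).sum()
        return total

    # Hypothetical 2-D data: six precisely labelled points and two ambiguous ones
    # whose label superset is {0, 1}.
    X = np.array([[0.0, 0.1], [0.5, -0.2], [0.3, 0.4],
                  [4.8, 5.1], [5.2, 4.9], [5.0, 5.4],
                  [2.5, 2.4], [2.6, 2.7]])
    fixed = [0, 0, 0, 1, 1, 1]

    candidates = [(a, b) for a in (0, 1) for b in (0, 1)]
    best = max(candidates, key=lambda ab: gaussian_loglik(X, np.array(fixed + list(ab))))
    print("most plausible labels for the two ambiguous points:", best)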

SLIDE 18

DATA DISAMBIGUATION

implausible instantiation


assume both class distributions to be Gaussian

SLIDE 19

DATA DISAMBIGUATION

Model identification and data disambiguation should be performed simultaneously:

(Figure: DATA <-> MODEL, with identification in one direction and disambiguation in the other)

... quite natural from a Bayesian perspective: P(h, D) = P(h) P(D | h) = P(D) P(h | D)

SLIDE 20

OUTLINE

PART 1: Superset learning
PART 2: Optimistic loss minimization
PART 3: Data imprecisiation

SLIDE 21

MAXIMUM LIKELIHOOD ESTIMATION

The imprecise observation depends only on the true data, not on the model.

(Figure: MODEL --generation--> precise DATA --imprecisiation/coarsening--> imprecise DATA)

Likelihood of a model h ∈ H: ℓ(h) = P(O, D | h) = P(D | h) P(O | D, h) = P(D | h) P(O | D)

SLIDE 22

SUPERSET ASSUMPTION

The imprecise data is a superset of the precise data, but no other assumption is made.

(Figure: MODEL --generation--> precise DATA --imprecisiation/ambiguation--> imprecise DATA)

SLIDE 23

GENERALIZED ERM

how well the (precise) model fits the imprecise data

We derive a principle of generalized empirical risk minimization with the empirical risk

Remp(h) = (1/N) Σ_{n=1}^{N} L*(Yn, h(xn))

and the optimistic superset loss (OSL) function

L*(Y, ŷ) = min { L(y, ŷ) | y ∈ Y }.
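
The OSL is straightforward to implement. The following sketch (illustrative, not the authors' code; data and classifier are made up) computes it for an arbitrary base loss and plugs it into the generalized empirical risk, here with a 0/1 base loss on set-valued class labels:

    def osl(loss, Y, y_hat):
        """Optimistic superset loss: the most favourable base loss over all candidates in Y."""
        return min(loss(y, y_hat) for y in Y)

    def generalized_empirical_risk(loss, data, h):
        """R_emp(h) = (1/N) * sum_n L*(Y_n, h(x_n)) over superset-labelled data."""
        return sum(osl(loss, Y, h(x)) for x, Y in data) / len(data)

    zero_one = lambda y, y_hat: 0.0 if y == y_hat else 1.0

    # Hypothetical superset-labelled data and a trivial constant classifier.
    data = [((0.1,), {"a"}), ((0.9,), {"a", "b"}), ((1.7,), {"b", "c"})]
    h = lambda x: "b"
    print(generalized_empirical_risk(zero_one, data, h))   # 1/3: zero loss on the two sets containing "b"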

SLIDE 24

SPECIAL CASES

SLIDE 25

GENERALIZATION TO FUZZY DATA

(Figure: membership functions of an interval vs. a fuzzy interval, both with height 1)

SLIDE 26

GENERALIZATION TO FUZZY DATA

(Figure: the α-cut of a fuzzy observation at membership level α, and the associated loss)

SLIDE 27

GENERALIZATION TO FUZZY DATA

LOSS: L**(Y, ŷ) = ∫_0^1 L*([Y]_α, ŷ) dα

RISK: Remp(h) = (1/N) Σ_{n=1}^{N} L**(Yn, h(xn))
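
A minimal sketch of how L** can be approximated numerically (illustrative only; the trapezoidal membership function and the grid approximation are assumptions, not taken from the slides): discretise the α-integral, extract the α-cut of a fuzzy interval, and apply the OSL at each level.

    import numpy as np

    def alpha_cut(core, support, alpha):
        """α-cut of a trapezoidal fuzzy interval with the given core and support."""
        (c_lo, c_hi), (s_lo, s_hi) = core, support
        return (s_lo + alpha * (c_lo - s_lo), s_hi - alpha * (s_hi - c_hi))

    def osl_interval(loss, interval, y_hat, grid=101):
        """Optimistic superset loss of a real interval: minimise the base loss over the interval."""
        lo, hi = interval
        return min(loss(y, y_hat) for y in np.linspace(lo, hi, grid))

    def fuzzy_osl(loss, core, support, y_hat, levels=50):
        """L**(Y, y_hat) = integral over alpha of L*([Y]_alpha, y_hat), approximated on a grid."""
        alphas = np.linspace(0.0, 1.0, levels)
        return np.mean([osl_interval(loss, alpha_cut(core, support, a), y_hat) for a in alphas])

    squared = lambda y, y_hat: (y - y_hat) ** 2
    # Fuzzy observation with core [2, 3] and support [1, 4]; prediction 5.0.
    print(fuzzy_osl(squared, core=(2.0, 3.0), support=(1.0, 4.0), y_hat=5.0))

For a fuzzy interval centred at a precise observation, this construction smooths the base loss; that smoothed loss is what the next two slides refer to as a (generalized) Huber loss.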

SLIDE 28

GENERALIZATION TO FUZZY DATA

→ Huber loss!

(Figure: the resulting loss L**(Y, ŷ))

SLIDE 29

GENERALIZATION TO FUZZY DATA

→ (generalized) Huber loss!

SLIDE 30

STRUCTURED OUTPUT PREDICTION

Superset learning naturally applies to learning problems with structured outputs, which are often only partially specified and can then be associated with the set of all consistent completions.

SLIDE 31

LABEL RANKING

... is the problem of learning a model that maps instances to TOTAL ORDERS over a fixed set of alternatives/labels:

D ≻ A ≻ C ≻ B

SLIDE 32

LABEL RANKING

Example: an instance described by the feature vector (0, 37, 46, 325, 1, 0) is mapped to the total order A ≻ D ≻ C ≻ B (... likes more ... reads more ... recommends more ...).
SLIDE 33

LABEL RANKING

Training data is typically incomplete!

Example: for the instance (0, 37, 46, 325, 1, 0), only the partial order A ≻ C is observed.
SLIDE 34

LABEL RANKING

Training data is typically incomplete!

The incomplete observation is associated with its set of linear extensions, i.e. all total orders over the label set that are consistent with it (see the sketch below).
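
A small sketch (illustrative, not from the slides) of how the superset of an incomplete ranking can be materialised: enumerate all permutations of the label set and keep those consistent with the observed pairwise preferences.

    from itertools import permutations

    def linear_extensions(labels, preferences):
        """All total orders of `labels` consistent with the observed pairwise
        preferences, given as (a, b) pairs meaning 'a is preferred to b'."""
        exts = []
        for order in permutations(labels):
            pos = {lab: i for i, lab in enumerate(order)}
            if all(pos[a] < pos[b] for a, b in preferences):
                exts.append(order)
        return exts

    # Observed incomplete ranking: A preferred to C; label set {A, B, C, D}.
    supersets = linear_extensions("ABCD", [("A", "C")])
    print(len(supersets), "consistent total orders")   # 12 of the 24 permutations
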
SLIDE 35

LABEL RANKING LOSSES

(Figure: Kendall and Spearman losses for label ranking; see the sketch below)
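
Building on the previous sketch, the OSL of a ranking loss simply takes the best (smallest) loss over all linear extensions; here with the Kendall distance (number of discordant label pairs) as base loss. This is an illustrative implementation, not the authors' code.

    from itertools import combinations, permutations

    def kendall_distance(order1, order2):
        """Number of label pairs ranked in opposite directions by two total orders."""
        pos1 = {lab: i for i, lab in enumerate(order1)}
        pos2 = {lab: i for i, lab in enumerate(order2)}
        return sum((pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
                   for a, b in combinations(order1, 2))

    # Superset of an incomplete observation: all total orders of {A, B, C, D} with A before C.
    extensions = [p for p in permutations("ABCD") if p.index("A") < p.index("C")]

    def osl_ranking(predicted_order):
        """Optimistic superset loss of a prediction: best Kendall distance over the extensions."""
        return min(kendall_distance(ext, predicted_order) for ext in extensions)

    print(osl_ranking("ADCB"))   # 0: the prediction is itself consistent with A before C
    print(osl_ranking("CBAD"))   # 2: no consistent order can agree with C before A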

SLIDE 36

EXPERIMENTAL STUDIES

(Figure: learning curves on the data sets authorship, glass, iris, pendigits, segment, vowel, and wine)

  • Cheng and Hüllermeier (2015) compare an approach to label ranking based on superset learning with a state-of-the-art label ranker based on the Plackett-Luce (PL) model.
  • Two missing-label scenarios: missing at random, top-rank.
  • General conclusion: the superset-learning approach is more robust toward incompleteness.
SLIDE 37

OUTLINE

PART 1: Superset learning
PART 2: Optimistic loss minimization
PART 3: Data imprecisiation

SLIDE 38

DATA IMPRECISIATION

So far: Observations are imprecise/incomplete, and we have to deal with that! Now: Deliberately turn precise into imprecise data, so as to modulate the influence of an observation on the learning process!

SLIDE 39

EXAMPLE WEIGHING

SLIDE 40

EXAMPLE WEIGHING

SLIDE 41

EXAMPLE WEIGHING

We suggest an alternative way of weighing examples, namely, via „data imprecisiation“ ...

(Figure: full support for the precise observation)

SLIDE 42

EXAMPLE WEIGHING

SLIDE 43

EXAMPLE WEIGHING

weighing through „imprecisiation“

SLIDE 44

EXAMPLE WEIGHING

Different ways of (individually) discounting the loss function: weighted loss vs. OSL (see the sketch below).

In (Lu and Hüllermeier, 2015), we empirically compared standard locally weighted linear regression with this approach and essentially found no difference.
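
A minimal sketch of the two discounting schemes for regression, under the assumption that "imprecisiating" a precise value y with weight w means replacing it by the interval [y − ε, y + ε], with ε growing as w shrinks; the mapping ε = (1 − w) · ε_max used here is purely illustrative.

    def weighted_loss(w, y, y_hat):
        """Classical example weighting: scale the squared loss by the weight w."""
        return w * (y - y_hat) ** 2

    def osl_discounted_loss(w, y, y_hat, eps_max=2.0):
        """Discounting via data imprecisiation: replace y by the interval
        [y - eps, y + eps] with eps = (1 - w) * eps_max, and take the OSL of the
        squared loss, i.e. the loss at the closest point of the interval."""
        eps = (1.0 - w) * eps_max
        return max(0.0, abs(y - y_hat) - eps) ** 2

    for w in (1.0, 0.5, 0.0):
        print(w, weighted_loss(w, y=3.0, y_hat=5.0), osl_discounted_loss(w, y=3.0, y_hat=5.0))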

SLIDE 45

EXAMPLE WEIGHING

We suggest an alternative way of weighing examples, namely, via „data imprecisiation“ ...

(Figure: fuzzy label sets for a certainly positive vs. a less certainly positive example)

SLIDE 46

FUZZY MARGIN LOSSES

(Figure: GENERALIZED HINGE LOSS for weights w = 1, 3/4, 1/2, 1/4, 0; see the sketch below)
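
The following is a sketch of one way such a fuzzy margin loss can arise, under the assumption that a positive example with weight w is imprecisiated into a fuzzy label set in which the observed class has membership 1 and the opposite class membership 1 − w; integrating the OSL over the α-cuts then yields a convex combination of the hinge loss (w = 1) and the "hat" loss of slide 48 (w = 0). This construction is an assumption on my part, not a verbatim formula from the slides.

    def hinge(y, score):
        """Standard hinge loss for label y in {-1, +1} and real-valued score."""
        return max(0.0, 1.0 - y * score)

    def hat(score):
        """'Hat' loss: the OSL of the hinge loss for the full label set {-1, +1}."""
        return min(hinge(+1, score), hinge(-1, score))

    def fuzzy_hinge(w, y, score):
        """Fuzzy margin loss for an example observed as class y with weight w:
        membership 1 for y and 1 - w for -y, integrated over the alpha-cuts."""
        return (1.0 - w) * hat(score) + w * hinge(y, score)

    for w in (1.0, 0.75, 0.5, 0.25, 0.0):
        print(w, [round(fuzzy_hinge(w, +1, s), 2) for s in (-2.0, 0.0, 2.0)])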

SLIDE 47

FUZZY MARGIN LOSSES

Different ways of (individually) discounting the loss function.

(Figure: weighted hinge loss vs. OSL-based fuzzy loss, each for weights w = 1, 3/4, 1/2, 1/4, 0)

SLIDE 48

THE HAT LOSS

SLIDE 49

DATA DISAMBIGUATION

SLIDE 50

DATA DISAMBIGUATION

SLIDE 51

DATA DISAMBIGUATION

SLIDE 52

EXPERIMENTS

Robust loss minimization techniques:

§ Robust truncated-hinge-loss support vector machine (RSVM) trains SVMs with a truncated version of the hinge loss in order to be more robust toward outliers and noisy data (Wu and Liu, 2007).
§ One-step weighted SVM (OWSVM) first trains a standard SVM; then it weighs each training example based on its distance to the decision boundary and retrains using the weighted hinge loss (Wu and Liu, 2013).
§ Our approach (FLSVM) is the same as OWSVM, except for the weighted loss: instead of using a simple weighting of the hinge loss, we use the optimistic fuzzy loss.

The non-convex optimization problem is solved by the concave-convex procedure (Yuille and Rangarajan, 2002).

SLIDE 53

EXPERIMENTAL RESULTS

SLIDE 54

THEORETICAL FOUNDATIONS

Under what conditions is (successful) learning in the superset setting actually possible?

SLIDE 55

THEORETICAL FOUNDATIONS

SLIDE 56

THEORETICAL FOUNDATIONS

systematic imprecisiation

SLIDE 57

THEORETICAL FOUNDATIONS

non-systematic imprecisiation

SLIDE 58

THEORETICAL FOUNDATIONS

Liu and Dietterich (2014) consider the ambiguity degree, which is defined as the largest probability that a particular distractor label co-occurs with the true label in multi-class classification:

γ = sup { P_{Y ∼ D_s(x, y)}(ℓ ∈ Y) | (x, y) ∈ X × Y, ℓ ∈ Y, p(x, y) > 0, ℓ ≠ y }
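
As a concrete reading of this definition, the sketch below estimates the ambiguity degree from superset-labelled data for which the true labels are known (a hypothetical diagnostic, since the true labels are hidden in practice). It is a simplified empirical proxy that aggregates per true class rather than taking the pointwise supremum of the definition: for each (true label, distractor) pair, it computes how often the distractor appears in the candidate set and returns the largest such frequency.

    from collections import defaultdict

    def empirical_ambiguity_degree(samples):
        """samples: (true_label, candidate_set) pairs with true_label in candidate_set.
        Returns the largest relative frequency, over pairs (true label y, distractor l != y),
        with which l appears in the candidate sets of examples whose true label is y."""
        labels = {y for y, _ in samples} | {l for _, Y in samples for l in Y}
        counts, totals = defaultdict(int), defaultdict(int)
        for y, Y in samples:
            totals[y] += 1
            for l in labels:
                if l != y and l in Y:
                    counts[(y, l)] += 1
        return max((c / totals[y] for (y, l), c in counts.items()), default=0.0)

    samples = [("cat", {"cat", "dog"}), ("cat", {"cat", "dog"}), ("cat", {"cat"}),
               ("dog", {"dog"})]
    print(empirical_ambiguity_degree(samples))   # 2/3: "dog" co-occurs with true label "cat" in 2 of 3 cases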

SLIDE 59

THEORETICAL FOUNDATIONS

Let θ = log(2 / (1 + γ)) and dH the Natarajan dimension of H. Define

n0(H, ε, δ) = (4 / (θ ε)) ( dH ( log(4 dH) + 2 log L + log(1 / (θ ε)) ) + log(1 / δ) + 1 ).

Then, in the realizable case, with probability at least 1 − δ, the model with the smallest empirical superset loss on a set of training data of size n > n0(H, ε, δ) has a generalisation error of at most ε.
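
To get a feeling for the bound, the sketch below evaluates the sample-size expression as reconstructed above for some made-up values of the Natarajan dimension dH, number of labels L, ambiguity degree γ, and accuracy/confidence parameters; note how n0 blows up as γ approaches 1 (θ → 0).

    import math

    def n0(d_H, L, gamma, eps, delta):
        """Sample size sufficient for (eps, delta)-learning in the realizable superset
        setting, following the formula stated on the previous slide."""
        theta = math.log(2.0 / (1.0 + gamma))
        return (4.0 / (theta * eps)) * (
            d_H * (math.log(4.0 * d_H) + 2.0 * math.log(L) + math.log(1.0 / (theta * eps)))
            + math.log(1.0 / delta) + 1.0
        )

    for gamma in (0.1, 0.5, 0.9, 0.99):
        print(gamma, round(n0(d_H=10, L=5, gamma=gamma, eps=0.05, delta=0.05)))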

SLIDE 60

THEORETICAL FOUNDATIONS

The balanced benefit condition:

0 ≤ η1 ≤ inf_{h ∈ H} RS(h) / R(h) ≤ sup_{h ∈ H} RS(h) / R(h) ≤ η2 ≤ 1,

where RS(h) is the expected superset loss of h. For a sufficiently large sample size,

R(ĥ) ≤ R(h*) + ∆(dH, ε, δ, η1, η2)

with probability 1 − δ, where h* is the Bayes predictor and ĥ the empirical (superset) risk minimizer; in general, ∆ cannot be made arbitrarily small.

SLIDE 61

SUMMARY AND OUTLOOK

§ Method for superset learning based on optimistic loss minimization, performing simultaneous model identification and data disambiguation.
§ Our framework covers several existing methods as special cases but also supports the systematic development of new methods.
§ Completely generic principle (classification, regression, structured output prediction, ...).
§ Example weighing via data imprecisiation (→ „modeling data“).
§ Works for regression and classification, but seems to be even more interesting for other problems, including ranking, transfer learning, ...
§ More future work: algorithmic solutions for specific instantiations of our framework, theoretical foundations.

SLIDE 62

REFERENCES

  • E. Hüllermeier and W. Cheng (2015). Superset Learning Based on Generalized Loss Minimization. Proc. ECML/PKDD 2015.
  • E. Hüllermeier (2014). Learning from Imprecise and Fuzzy Observations: Data Disambiguation through Generalized Loss Minimization. International Journal of Approximate Reasoning, 55(7):1519-1534.
  • S. Lu and E. Hüllermeier (2015). Locally Weighted Regression through Data Imprecisiation. Workshop Computational Intelligence, Dortmund.
  • D.B. Rubin (1976). Inference and missing data. Biometrika, 63(3):581-592.
  • D.F. Heitjan and D.B. Rubin (1991). Ignorability and coarse data. The Annals of Statistics, 19(4):2244-2253.