SLIDE 1

Semi-Supervised Learning Tutorial

Xiaojin Zhu

Department of Computer Sciences University of Wisconsin, Madison, USA

ICML 2007

SLIDE 2

Outline

1. Introduction to Semi-Supervised Learning
2. Semi-Supervised Learning Algorithms: Self Training, Generative Models, S3VMs, Graph-Based Algorithms, Multiview Algorithms
3. Semi-Supervised Learning in Nature
4. Some Challenges for Future Research

SLIDE 3

Introduction to Semi-Supervised Learning

Outline recap: Section 1, Introduction to Semi-Supervised Learning.
SLIDE 4

Introduction to Semi-Supervised Learning

Disclaimer

This tutorial reflects my subjective opinions. Much related work could not be included. Thanks to Olivier Chapelle for some of the S3VM figures.
SLIDES 5-9

Introduction to Semi-Supervised Learning

Why bother?

Because people want better performance for free.

The traditional view: unlabeled data is cheap, while labeled data can be hard to get:
◮ human annotation is boring
◮ labels may require experts
◮ labels may require special devices
◮ your graduate student is on vacation

SLIDE 10

Introduction to Semi-Supervised Learning

Example of hard-to-get labels

Task: speech analysis
◮ Switchboard dataset: telephone conversation transcription
◮ 400 hours of annotation time for each hour of speech
◮ phonetic transcriptions, e.g. "film" ⇒ f ih n uh gl n m, "be all" ⇒ bcl b iy iy tr ao tr ao l dl

SLIDE 11

Introduction to Semi-Supervised Learning

Another example of hard-to-get labels

Task: natural language parsing
◮ Penn Chinese Treebank: 2 years to annotate 4000 sentences
◮ example sentence: "The National Track and Field Championship has finished."

SLIDE 12

Introduction to Semi-Supervised Learning

Example of not-so-hard-to-get labels

A little secret: for some tasks, it may not be too difficult to label 1000+ instances. Task: image categorization of "eclipse".

SLIDE 13

Introduction to Semi-Supervised Learning

Example of not-so-hard-to-get labels

There are ways like the ESP game (www.espgame.org) to encourage "human computation" for more labels.

SLIDE 14

Introduction to Semi-Supervised Learning

Example of not-so-hard-to-get labels

Nonetheless, in this tutorial we will learn how to use unlabeled data to improve classification.

SLIDE 15

Introduction to Semi-Supervised Learning

The Learning Problem

Goal: use both labeled and unlabeled data to build better learners than using either one alone.

SLIDE 16

Introduction to Semi-Supervised Learning

Notation

◮ input instance x, label y
◮ learner f : X → Y
◮ labeled data (Xl, Yl) = {(x1:l, y1:l)}
◮ unlabeled data Xu = {xl+1:n}, available during training
◮ usually l ≪ n
◮ test data Xtest = {xn+1:}, not available during training

SLIDE 17

Introduction to Semi-Supervised Learning

Semi-supervised vs. transductive learning

◮ labeled data (Xl, Yl) = {(x1:l, y1:l)}
◮ unlabeled data Xu = {xl+1:n}, available during training
◮ test data Xtest = {xn+1:}, not available during training

Semi-supervised learning is ultimately applied to the test data (inductive). Transductive learning is only concerned with the unlabeled data.

SLIDES 18-19

Introduction to Semi-Supervised Learning

Why the name

◮ supervised learning (classification, regression): {(x1:n, y1:n)}
◮ semi-supervised classification/regression: {(x1:l, y1:l), xl+1:n, xtest}
◮ transductive classification/regression: {(x1:l, y1:l), xl+1:n}
◮ semi-supervised clustering: {x1:n, must-links and cannot-links}
◮ unsupervised learning (clustering): {x1:n}

We will mainly discuss semi-supervised classification.

SLIDES 20-21

Introduction to Semi-Supervised Learning

How can unlabeled data ever help?

[Figure: labeled data, unlabeled data, and the decision boundaries learned from labeled data only vs. labeled and unlabeled data.]

Assuming each class is a coherent group (e.g. Gaussian), adding unlabeled data shifts the decision boundary. This is only one of many ways to use unlabeled data.

SLIDE 22

Introduction to Semi-Supervised Learning

Does unlabeled data always help?

Unfortunately, this is not always the case.

SLIDES 23-24

Semi-Supervised Learning Algorithms

Outline recap: Section 2, Semi-Supervised Learning Algorithms, beginning with Self Training.

SLIDE 25

Semi-Supervised Learning Algorithms: Self Training

Self-training algorithm

Assumption: one's own high-confidence predictions are correct.

Self-training algorithm:
1. Train f from (Xl, Yl)
2. Predict on x ∈ Xu
3. Add (x, f(x)) to the labeled data
4. Repeat
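To make the loop concrete, here is a minimal Python sketch of self-training. It assumes a scikit-learn-style base classifier with fit and predict_proba, and classes encoded as 0..K−1; the confidence threshold and iteration cap are illustrative choices, not part of the tutorial.

import numpy as np

def self_train(clf, X_l, y_l, X_u, threshold=0.95, max_iter=20):
    # clf: any classifier exposing fit(X, y) and predict_proba(X)
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(max_iter):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        conf = proba.max(axis=1)          # confidence of the predicted label
        pick = conf >= threshold          # pseudo-label only confident points
        if not pick.any():
            break
        X_l = np.vstack([X_l, X_u[pick]])
        y_l = np.concatenate([y_l, proba[pick].argmax(axis=1)])
        X_u = X_u[~pick]
    return clf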

SLIDE 26

Semi-Supervised Learning Algorithms: Self Training

Variations in self-training

◮ Add a few most confident (x, f(x)) to the labeled data
◮ Add all (x, f(x)) to the labeled data
◮ Add all (x, f(x)) to the labeled data, weighing each by confidence

SLIDE 27

Semi-Supervised Learning Algorithms: Self Training

Self-training example: image categorization

Each image is divided into small patches: a 10 × 10 grid, with random patch sizes in the range 10-20 pixels.

[Figure: an image overlaid with the sampled patches.]

SLIDE 28

Semi-Supervised Learning Algorithms: Self Training

Self-training example: image categorization

All patches are normalized. Define a dictionary of 200 "visual words" (cluster centroids) with 200-means clustering on all patches. Represent a patch by the index of its closest visual word.

SLIDE 29

Semi-Supervised Learning Algorithms: Self Training

The bag-of-word representation of images

An image maps to a 200-dimensional count vector over the visual-word dictionary, e.g.
→ 1:0 2:1 3:2 4:2 5:0 6:0 7:0 8:3 9:0 10:3 11:31 ... 199:0 200:0

SLIDE 30

Semi-Supervised Learning Algorithms: Self Training

Self-training example: image categorization

1. Train a naïve Bayes classifier on the two initial labeled images
2. Classify the unlabeled data, sort by confidence log p(y = astronomy|x)

SLIDE 31

Semi-Supervised Learning Algorithms: Self Training

Self-training example: image categorization

3. Add the most confident images and their predicted labels to the labeled data
4. Re-train the classifier and repeat

SLIDE 32

Semi-Supervised Learning Algorithms: Self Training

Advantages of self-training

◮ The simplest semi-supervised learning method
◮ A wrapper method: applies to existing (complex) classifiers
◮ Often used in real tasks like natural language processing

SLIDE 33

Semi-Supervised Learning Algorithms: Self Training

Disadvantages of self-training

◮ Early mistakes can reinforce themselves. Heuristic solutions exist, e.g. "un-label" an instance if its confidence falls below a threshold.
◮ Little can be said in terms of convergence. But there are special cases where self-training is equivalent to the Expectation-Maximization (EM) algorithm, and special cases (e.g. linear functions) where the closed-form solution is known.

SLIDE 34

Semi-Supervised Learning Algorithms: Generative Models

Outline recap: Section 2 continues with Generative Models.

SLIDE 35

Semi-Supervised Learning Algorithms: Generative Models

A simple example of generative models

Labeled data (Xl, Yl): [Figure: labeled points from two classes on a 2-D plane.]

Assuming each class has a Gaussian distribution, what is the decision boundary?

SLIDE 36

Semi-Supervised Learning Algorithms: Generative Models

A simple example of generative models

Model parameters: θ = {w1, w2, µ1, µ2, Σ1, Σ2}

The GMM:
p(x, y|θ) = p(y|θ) p(x|y, θ) = w_y N(x; µ_y, Σ_y)

Classification:
p(y|x, θ) = p(x, y|θ) / Σ_{y′} p(x, y′|θ)

SLIDE 37

Semi-Supervised Learning Algorithms: Generative Models

A simple example of generative models

The most likely model, and its decision boundary: [Figure: two fitted Gaussians and the induced decision boundary.]

SLIDE 38

Semi-Supervised Learning Algorithms: Generative Models

A simple example of generative models

Adding unlabeled data: [Figure: the labeled points plus many unlabeled points.]

SLIDE 39

Semi-Supervised Learning Algorithms: Generative Models

A simple example of generative models

With unlabeled data, the most likely model and its decision boundary: [Figure: the refitted Gaussians and the shifted decision boundary.]

SLIDE 40

Semi-Supervised Learning Algorithms: Generative Models

A simple example of generative models

They are different because they maximize different quantities: p(Xl, Yl|θ) vs. p(Xl, Yl, Xu|θ). [Figure: the two fitted models and their decision boundaries side by side.]

SLIDE 41

Semi-Supervised Learning Algorithms: Generative Models

Generative models for semi-supervised learning

Assumption: the full generative model p(X, Y|θ).

Quantity of interest:
p(Xl, Yl, Xu|θ) = Σ_{Yu} p(Xl, Yl, Xu, Yu|θ)

Find the maximum likelihood estimate (MLE) of θ, the maximum a posteriori (MAP) estimate, or be Bayesian.

SLIDE 42

Semi-Supervised Learning Algorithms: Generative Models

Examples of some generative models

Often used in semi-supervised learning:
◮ Mixture of Gaussian distributions (GMM): image classification; the EM algorithm
◮ Mixture of multinomial distributions (naïve Bayes): text categorization; the EM algorithm
◮ Hidden Markov Models (HMM): speech recognition; the Baum-Welch algorithm

SLIDE 43

Semi-Supervised Learning Algorithms: Generative Models

Case study: GMM

For simplicity, consider binary classification with a GMM using MLE.

Labeled data only:
log p(Xl, Yl|θ) = Σ_{i=1}^{l} log p(yi|θ) p(xi|yi, θ)
The MLE for θ is trivial (class frequency, sample mean, sample covariance).

Labeled and unlabeled data:
log p(Xl, Yl, Xu|θ) = Σ_{i=1}^{l} log p(yi|θ) p(xi|yi, θ) + Σ_{i=l+1}^{l+u} log Σ_{y=1}^{2} p(y|θ) p(xi|y, θ)
The MLE is harder (hidden variables). The Expectation-Maximization (EM) algorithm is one method to find a local optimum.

SLIDES 44-45

Semi-Supervised Learning Algorithms: Generative Models

The EM algorithm for GMM

1. Start from the MLE θ = {w, µ, Σ}_{1:2} on (Xl, Yl), then repeat:
2. E-step: compute the expected label p(y|x, θ) = p(x, y|θ) / Σ_{y′} p(x, y′|θ) for all x ∈ Xu
   ◮ label a p(y = 1|x, θ)-fraction of x with class 1
   ◮ label a p(y = 2|x, θ)-fraction of x with class 2
3. M-step: update the MLE θ with the (now labeled) Xu
   ◮ wc = proportion of class c
   ◮ µc = sample mean of class c
   ◮ Σc = sample covariance of class c

EM can be viewed as a special form of self-training.
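A minimal numpy/scipy sketch of these EM updates for a 1-D, two-class GMM follows; the variance floor of 1e-6 and the fixed iteration count are illustrative safeguards, not part of the tutorial, and both classes are assumed present in the labeled data.

import numpy as np
from scipy.stats import norm

def ssl_gmm_em(x_l, y_l, x_u, n_iter=50):
    # 1-D data, two classes labeled 1 and 2, following the slide's notation
    w  = np.array([np.mean(y_l == 1), np.mean(y_l == 2)])
    mu = np.array([x_l[y_l == 1].mean(), x_l[y_l == 2].mean()])
    sd = np.array([x_l[y_l == 1].std() + 1e-6, x_l[y_l == 2].std() + 1e-6])
    for _ in range(n_iter):
        # E-step: responsibilities q_i(k) on the unlabeled points
        p = np.stack([w[k] * norm.pdf(x_u, mu[k], sd[k]) for k in range(2)])
        q = p / p.sum(axis=0)
        # M-step: labeled counts plus fractional unlabeled counts
        for k in range(2):
            lab = (y_l == k + 1).astype(float)
            n_k = lab.sum() + q[k].sum()
            w[k]  = n_k / (len(x_l) + len(x_u))
            mu[k] = (lab @ x_l + q[k] @ x_u) / n_k
            var   = (lab @ (x_l - mu[k])**2 + q[k] @ (x_u - mu[k])**2) / n_k
            sd[k] = np.sqrt(var + 1e-6)
    return w, mu, sd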

SLIDE 46

Semi-Supervised Learning Algorithms: Generative Models

The EM algorithm in general

Setup:
◮ observed data D = (Xl, Yl, Xu)
◮ hidden data H = Yu
◮ p(D|θ) = Σ_H p(D, H|θ)

Goal: find θ to maximize p(D|θ).

Properties:
◮ EM starts from an arbitrary θ0
◮ E-step: q(H) = p(H|D, θ)
◮ M-step: maximize Σ_H q(H) log p(D, H|θ)
◮ EM iteratively improves p(D|θ)
◮ EM converges to a local maximum of the likelihood

SLIDE 47

Semi-Supervised Learning Algorithms: Generative Models

Generative models for semi-supervised learning: beyond EM

The key is to maximize p(Xl, Yl, Xu|θ). EM is just one way to maximize it. Other ways to find the parameters are possible too, e.g. variational approximation or direct optimization.

SLIDE 48

Semi-Supervised Learning Algorithms: Generative Models

Advantages of generative models

◮ Clear, well-studied probabilistic framework
◮ Can be extremely effective if the model is close to correct

SLIDE 49

Semi-Supervised Learning Algorithms: Generative Models

Disadvantages of generative models

◮ Often difficult to verify the correctness of the model
◮ Model identifiability
◮ EM local optima
◮ Unlabeled data may hurt if the generative model is wrong

[Figure: two-class data where the true classes do not match the assumed Gaussian components.]

For example, classifying text by topic vs. by genre.

SLIDE 50

Semi-Supervised Learning Algorithms: Generative Models

Unlabeled data may hurt semi-supervised learning

If the generative model is wrong, the high-likelihood fit can give the wrong classification while a lower-likelihood fit is correct. [Figure: a wrong high-likelihood model vs. a correct low-likelihood model.]

SLIDE 51

Semi-Supervised Learning Algorithms: Generative Models

Heuristics to lessen the danger

◮ Carefully construct the generative model to reflect the task, e.g. multiple Gaussian distributions per class instead of a single one
◮ Down-weight the unlabeled data (λ < 1):
log p(Xl, Yl, Xu|θ) = Σ_{i=1}^{l} log p(yi|θ) p(xi|yi, θ) + λ Σ_{i=l+1}^{l+u} log Σ_{y=1}^{2} p(y|θ) p(xi|y, θ)

SLIDE 52

Semi-Supervised Learning Algorithms: Generative Models

Related method: cluster-and-label

Instead of probabilistic generative models, any clustering algorithm can be used for semi-supervised classification:
◮ Run your favorite clustering algorithm on Xl ∪ Xu.
◮ Label all points within a cluster by the majority of the labeled points in that cluster.

Pro: yet another simple method using existing algorithms. Con: can be difficult to analyze. A sketch follows below.
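A minimal sketch using scikit-learn's KMeans as the clustering algorithm; any other clusterer could be substituted, and the fallback to the global majority for clusters containing no labeled points is an illustrative choice.

import numpy as np
from sklearn.cluster import KMeans  # any clustering algorithm works here

def cluster_and_label(X_l, y_l, X_u, n_clusters=2):
    X = np.vstack([X_l, X_u])
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    c_l, c_u = clusters[:len(X_l)], clusters[len(X_l):]
    y_u = np.empty(len(X_u), dtype=y_l.dtype)
    for c in range(n_clusters):
        members = y_l[c_l == c]
        if len(members) == 0:
            members = y_l                    # fall back to the global majority
        vals, counts = np.unique(members, return_counts=True)
        y_u[c_u == c] = vals[counts.argmax()]  # majority vote of labeled points
    return y_u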

SLIDE 53

Semi-Supervised Learning Algorithms: S3VMs

Outline recap: Section 2 continues with S3VMs.

SLIDE 54

Semi-Supervised Learning Algorithms: S3VMs

Semi-supervised Support Vector Machines

Semi-supervised SVMs (S3VMs) = Transductive SVMs (TSVMs). They maximize the "unlabeled data margin". [Figure: a decision boundary placed in a low-density region between the labeled + and − points.]

SLIDE 55

Semi-Supervised Learning Algorithms: S3VMs

S3VMs

Assumption: unlabeled data from different classes are separated with large margin.

S3VM idea:
◮ Enumerate all 2^u possible labelings of Xu
◮ Build one standard SVM for each labeling (and Xl)
◮ Pick the SVM with the largest margin

SLIDE 56

Semi-Supervised Learning Algorithms: S3VMs

Standard SVM review

Problem setup:
◮ two classes y ∈ {+1, −1}
◮ labeled data (Xl, Yl)
◮ a kernel K
◮ the reproducing kernel Hilbert space HK

SVM finds a function f(x) = h(x) + b with h ∈ HK. Classify x by sign(f(x)).

SLIDE 57

Semi-Supervised Learning Algorithms: S3VMs

Standard soft margin SVMs

Try to keep labeled points outside the margin, while maximizing the margin:

min_{h,b,ξ} Σ_{i=1}^{l} ξi + λ‖h‖²_HK
subject to yi(h(xi) + b) ≥ 1 − ξi, ∀i = 1 . . . l
           ξi ≥ 0

The ξ's are slack variables.

SLIDE 58

Semi-Supervised Learning Algorithms: S3VMs

Hinge function

min_ξ ξ subject to ξ ≥ z, ξ ≥ 0

If z ≤ 0, min ξ = 0. If z > 0, min ξ = z. Therefore the constrained optimization problem above is equivalent to the hinge function (z)+ = max(z, 0).

SLIDE 59

Semi-Supervised Learning Algorithms: S3VMs

SVM with the hinge function

Let zi = 1 − yi(h(xi) + b) = 1 − yi f(xi). The problem

min_{h,b,ξ} Σ_{i=1}^{l} ξi + λ‖h‖²_HK
subject to yi(h(xi) + b) ≥ 1 − ξi, ∀i = 1 . . . l; ξi ≥ 0

is equivalent to

min_f Σ_{i=1}^{l} (1 − yi f(xi))+ + λ‖h‖²_HK

SLIDE 60

Semi-Supervised Learning Algorithms: S3VMs

The hinge loss in standard SVMs

min_f Σ_{i=1}^{l} (1 − yi f(xi))+ + λ‖h‖²_HK

yi f(xi) is known as the margin, and (1 − yi f(xi))+ is the hinge loss. [Figure: the hinge loss as a function of yi f(xi).] It prefers labeled points on the 'correct' side.

SLIDE 61

Semi-Supervised Learning Algorithms: S3VMs

S3VM objective function

How do we incorporate unlabeled points? Assign putative labels sign(f(x)) to x ∈ Xu. Then sign(f(x)) f(x) = |f(x)|, so the hinge loss on unlabeled points becomes (1 − yi f(xi))+ = (1 − |f(xi)|)+.

S3VM objective:
min_f Σ_{i=1}^{l} (1 − yi f(xi))+ + λ1‖h‖²_HK + λ2 Σ_{i=l+1}^{n} (1 − |f(xi)|)+

SLIDE 62

Semi-Supervised Learning Algorithms: S3VMs

The hat loss on unlabeled data

(1 − |f(xi)|)+  [Figure: the hat loss as a function of f(xi).]

It prefers f(x) ≥ 1 or f(x) ≤ −1, i.e. unlabeled instances away from the decision boundary f(x) = 0. A numeric sketch of the full objective follows.
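To make the two losses concrete, here is a small numpy sketch of the S3VM objective for a linear model f(x) = w·x + b; it only evaluates the objective and is not tied to any particular optimizer.

import numpy as np

def hinge(z):
    return np.maximum(z, 0.0)                # (z)_+

def s3vm_objective(w, b, X_l, y_l, X_u, lam1, lam2):
    f_l = X_l @ w + b
    f_u = X_u @ w + b
    labeled   = hinge(1 - y_l * f_l).sum()       # hinge loss on labeled data
    unlabeled = hinge(1 - np.abs(f_u)).sum()     # hat loss on unlabeled data
    return labeled + lam1 * (w @ w) + lam2 * unlabeled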

SLIDE 63

Semi-Supervised Learning Algorithms: S3VMs

Avoiding unlabeled data in the margin

S3VM objective:
min_f Σ_{i=1}^{l} (1 − yi f(xi))+ + λ1‖h‖²_HK + λ2 Σ_{i=l+1}^{n} (1 − |f(xi)|)+

The third term prefers unlabeled points outside the margin. Equivalently, the decision boundary f = 0 wants to be placed so that there are few unlabeled points near it. [Figure: the boundary avoiding the dense unlabeled region.]

SLIDE 64

Semi-Supervised Learning Algorithms: S3VMs

The class balancing constraint

Directly optimizing the S3VM objective often produces unbalanced classification: most points fall in one class.

Heuristic class balance:
(1/(n−l)) Σ_{i=l+1}^{n} yi = (1/l) Σ_{i=1}^{l} yi

Relaxed class balancing constraint:
(1/(n−l)) Σ_{i=l+1}^{n} f(xi) = (1/l) Σ_{i=1}^{l} yi

SLIDE 65

Semi-Supervised Learning Algorithms: S3VMs

The S3VM algorithm

1. Input: kernel K, weights λ1, λ2, (Xl, Yl), Xu
2. Solve the optimization problem for f(x) = h(x) + b, h(x) ∈ HK:
   min_f Σ_{i=1}^{l} (1 − yi f(xi))+ + λ1‖h‖²_HK + λ2 Σ_{i=l+1}^{n} (1 − |f(xi)|)+
   s.t. (1/(n−l)) Σ_{i=l+1}^{n} f(xi) = (1/l) Σ_{i=1}^{l} yi
3. Classify a new test point x by sign(f(x))

SLIDE 66

Semi-Supervised Learning Algorithms: S3VMs

The S3VM optimization challenge

The SVM objective is convex; the semi-supervised SVM objective is non-convex. [Figure: the convex hinge loss vs. the non-convex hat loss.]

Finding a solution for the semi-supervised SVM is difficult, and this has been the focus of S3VM research. Different approaches: SVMlight, ∇S3VM, continuation S3VM, deterministic annealing, CCCP, Branch and Bound, SDP convex relaxation, etc.

SLIDE 67

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 1: SVMlight

◮ Local combinatorial search
◮ Assign hard labels to unlabeled data
◮ Outer loop: "anneal" λ2 from zero up
◮ Inner loop: pairwise label switches

SLIDE 68

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 1: SVMlight

1. Train an SVM with (Xl, Yl).
2. Sort Xu by f(Xu). Label y = 1, −1 for the appropriate portions.
3. FOR λ̃ ← 10^{−5} λ2, . . . , λ2:
   REPEAT:
     min_f Σ_{i=1}^{l} (1 − yi f(xi))+ + λ1‖h‖²_HK + λ̃ Σ_{i=l+1}^{n} (1 − yi f(xi))+
     IF ∃(i, j) switchable THEN switch yi, yj
   UNTIL no labels are switchable

SLIDE 69

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 1: SVMlight

i, j ∈ Xu are switchable if yi = 1, yj = −1 and
loss(yi = 1, f(xi)) + loss(yj = −1, f(xj)) > loss(yi = −1, f(xi)) + loss(yj = 1, f(xj))
with the hinge loss loss(y, f) = (1 − y f)+. [Figure: a positive and a negative pseudo-label on the wrong sides of the boundary, which a switch improves.]
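The switch test in code, a direct transcription of the condition above:

import numpy as np

def hinge(z):
    return np.maximum(z, 0.0)

def switchable(y_i, y_j, f_i, f_j):
    # Switching a (+1, -1) pair of pseudo-labels must strictly lower
    # the combined hinge loss.
    if not (y_i == 1 and y_j == -1):
        return False
    before = hinge(1 - y_i * f_i) + hinge(1 - y_j * f_j)
    after  = hinge(1 + y_i * f_i) + hinge(1 + y_j * f_j)  # labels flipped
    return before > after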

SLIDE 70

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 2: ∇S3VM

Make the S3VM a standard unconstrained optimization problem:
◮ Revert the kernel to the primal space
◮ A trick makes the class balancing constraint implicit
◮ Smooth the hat loss so it is differentiable (though still non-convex)

SLIDE 71

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 2: ∇S3VM

Revert the kernel to the primal space:
◮ Given kernel k(xi, xj), want z such that zi⊤ zj = k(xi, xj)
◮ Cholesky factor of the Gram matrix K = B⊤B, or eigendecomposition K = UΛU⊤, B = Λ^{1/2} U⊤ (the kernel PCA map)
◮ The z's are the columns of B
◮ f(xi) = w⊤ zi + b, where w is the primal parameter
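A numpy sketch of the kernel PCA map via eigendecomposition; the clipping of tiny negative eigenvalues is an illustrative numerical guard.

import numpy as np

def kernel_pca_map(K):
    # Factor the Gram matrix K = B^T B; the columns of B are the primal
    # representations z_i, satisfying z_i . z_j = K_ij.
    lam, U = np.linalg.eigh(K)
    lam = np.clip(lam, 0.0, None)       # guard against tiny negative eigenvalues
    B = np.diag(np.sqrt(lam)) @ U.T
    return B                             # z_i = B[:, i]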

SLIDE 72

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 2: ∇S3VM

Hide the class balancing constraint:
(1/(n−l)) Σ_{i=l+1}^{n} (w⊤zi + b) = (1/l) Σ_{i=1}^{l} yi

◮ Center the unlabeled data so that Σ_{i=l+1}^{n} zi = 0, and
◮ Fix b = (1/l) Σ_{i=1}^{l} yi

Then the class balancing constraint is automatically satisfied.

SLIDE 73

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 2: ∇S3VM

Smooth the hat loss (1 − |f|)+ with a similar-looking Gaussian curve exp(−5f²). [Figure: the hat loss and its Gaussian surrogate.]

SLIDE 74

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 2: ∇S3VM

The ∇S3VM problem (with b = (1/l) Σ_{i=1}^{l} yi fixed):

min_w Σ_{i=1}^{l} (1 − yi(w⊤zi + b))+ + λ1‖w‖² + λ2 Σ_{i=l+1}^{n} exp(−5 (w⊤zi + b)²)

Again, λ2 is increased gradually as a heuristic to try to avoid bad local optima.
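A simplified sketch of the annealing loop with scipy; it uses the derivative-free Powell method because the hinge term is not differentiable, whereas the actual ∇S3VM smooths the hinge and uses gradient methods. The function names and the 5-step annealing schedule are illustrative.

import numpy as np
from scipy.optimize import minimize

def grad_s3vm(Z_l, y_l, Z_u, lam1, lam2_max, steps=5):
    # Z_l, Z_u: primal representations (rows = points); Z_u assumed centered.
    b = y_l.mean()                            # fixed bias from the slide
    w = np.zeros(Z_l.shape[1])
    def objective(w, lam2):
        f_l = Z_l @ w + b
        f_u = Z_u @ w + b
        return (np.maximum(1 - y_l * f_l, 0).sum()
                + lam1 * (w @ w)
                + lam2 * np.exp(-5 * f_u**2).sum())
    # Anneal lam2 upward, warm-starting from the previous solution.
    for lam2 in np.linspace(lam2_max / steps, lam2_max, steps):
        w = minimize(objective, w, args=(lam2,), method='Powell').x
    return w, b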

SLIDE 75

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 3: Continuation method

Global optimization on the non-convex S3VM objective function:
◮ Convolve the objective with a Gaussian to smooth it
◮ With enough smoothing, the global minimum is easy to find
◮ Gradually decrease the smoothing, using the previous solution as the starting point
◮ Stop when there is no smoothing

SLIDE 76

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 3: Continuation method

1. Input: S3VM objective R(w), initial weight w0, sequence γ0 > γ1 > . . . > γp = 0
2. Convolve: Rγ(w) = (πγ)^{−d/2} ∫ R(w − t) exp(−‖t‖²/γ) dt
3. FOR i = 0 . . . p: starting from wi, find a local minimizer wi+1 of Rγi

SLIDE 77

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 4: CCCP

The Concave-Convex Procedure:
◮ The non-convex hat loss is the sum of a convex term and a concave term
◮ Upper bound the concave term with a line
◮ Iteratively minimize the resulting sequence of convex functions

SLIDE 78

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 4: CCCP

The hat loss decomposes into a convex part and a concave part:
(1 − |f|)+ = (|f| − 1)+ + (−|f|) + 1
[Figure: hat loss = convex term + concave term + constant 1.]

SLIDE 79

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 4: CCCP

To minimize R(w) = Rvex(w) + Rcave(w):
1. Input a starting point w0
2. t = 0
3. WHILE ∇R(wt) ≠ 0:
   wt+1 = arg min_z Rvex(z) + ∇Rcave(wt)⊤(z − wt) + Rcave(wt)
   t = t + 1
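A generic, toy-scale sketch of the CCCP iteration under these assumptions: r_vex is a convex callable, r_cave_grad returns the gradient of the concave part, and scipy's default minimizer handles each convex surrogate. This is not the S3VM-specific implementation.

import numpy as np
from scipy.optimize import minimize

def cccp(r_vex, r_cave_grad, w0, n_iter=20):
    # At each step, replace the concave part by its linearization at w_t
    # and minimize the resulting convex surrogate.
    w = np.atleast_1d(np.asarray(w0, dtype=float))
    for _ in range(n_iter):
        g = r_cave_grad(w)                    # gradient of the concave term at w_t
        surrogate = lambda z, g=g, w=w: r_vex(z) + g @ (z - w)
        w_new = minimize(surrogate, w).x
        if np.allclose(w_new, w):
            break
        w = w_new
    return w

# Toy usage: minimize z^2 (convex) + (-|z|) (concave, gradient -sign(z)):
# cccp(lambda z: z @ z, lambda z: -np.sign(z), w0=2.0)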

SLIDE 80

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 5: Branch and Bound

All the previous S3VM implementations suffer from local optima. Branch and bound (BB) finds the exact global solution, using the classic branch-and-bound search technique from AI. Unfortunately it can only handle a few hundred unlabeled points.

SLIDE 81

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 5: Branch and Bound

Combinatorial optimization over a tree of partial labelings of Xu:
◮ Root node: nothing in Xu labeled
◮ Child node: one more x ∈ Xu labeled than in its parent
◮ Leaf nodes: all of Xu labeled

Partial labelings have a non-decreasing S3VM objective:
min_f Σ_{i=1}^{l} (1 − yi f(xi))+ + λ1‖h‖²_HK + λ2 Σ_{i ∈ labeled so far} (1 − yi f(xi))+

SLIDE 82

Semi-Supervised Learning Algorithms: S3VMs

S3VM implementation 5: Branch and Bound

◮ Depth-first search on the tree
◮ Keep the best complete objective so far
◮ Prune an internal node (and its subtree) if it is worse than the best objective

SLIDE 83

Semi-Supervised Learning Algorithms: S3VMs

Advantages of S3VMs

◮ Applicable wherever SVMs are applicable
◮ Clear mathematical framework

SLIDE 84

Semi-Supervised Learning Algorithms: S3VMs

Disadvantages of S3VMs

◮ Optimization is difficult
◮ Can be trapped in bad local optima
◮ A more modest assumption than generative models or graph-based methods, hence potentially a smaller gain

SLIDE 85

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Outline recap: Section 2 continues with Graph-Based Algorithms.

SLIDE 86

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Example: text classification

Classify astronomy vs. travel articles; similarity is measured by content word overlap. [Figure: documents d1-d4 with feature words such as asteroid, bright, comet, year, zodiac vs. airport, bike, camp, yellowstone, zion.]

SLIDE 87

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

When labeled data alone fails

No overlapping words! [Figure: the labeled documents d1, d2 share no content words with the test documents d3, d4.]

SLIDE 88

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Unlabeled data as stepping stones

Labels "propagate" via similar unlabeled articles. [Figure: unlabeled documents d5-d9 bridge the word-overlap gap between d1, d2 and d3, d4.]

SLIDE 89

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Another example

Handwritten digit recognition with pixel-wise Euclidean distance: two images may not be similar directly, but are 'indirectly' similar through stepping-stone images. [Figure: a chain of gradually changing digits.]

SLIDE 90

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Graph-based semi-supervised learning

Assumption: a graph is given on the labeled and unlabeled data. Instances connected by a heavy edge tend to have the same label.

SLIDE 91

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

The graph

◮ Nodes: Xl ∪ Xu
◮ Edges: similarity weights computed from features, e.g.
  - k-nearest-neighbor graph, unweighted (0/1 weights)
  - fully connected graph, weight decaying with distance: wij = exp(−‖xi − xj‖²/σ²)
◮ Want: implied similarity via all paths

A sketch of both graph constructions follows. [Figure: documents d1-d4 connected by weighted edges.]
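Two minimal numpy sketches of the graph constructions above; the O(n²) pairwise-distance computation is for illustration only.

import numpy as np

def gaussian_graph(X, sigma):
    # Fully connected graph with weights w_ij = exp(-||xi - xj||^2 / sigma^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / sigma**2)
    np.fill_diagonal(W, 0.0)                # no self-loops
    return W

def knn_graph(X, k):
    # Unweighted, symmetrized k-nearest-neighbor graph
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)
    W = np.zeros_like(sq)
    nn = np.argsort(sq, axis=1)[:, :k]
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, nn.ravel()] = 1.0
    return np.maximum(W, W.T)               # symmetrize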

SLIDE 92

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

An example graph

A graph for person identification, with time, color, and face edges. [Figure: image 4005 and its five neighbors, connected by one time edge, three color edges, and one face edge.]

SLIDE 93

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Some graph-based algorithms

◮ mincut
◮ harmonic functions
◮ local and global consistency
◮ manifold regularization

SLIDE 94

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

The mincut algorithm

The graph mincut problem: fix Yl, find Yu ∈ {0, 1}^{n−l} to minimize Σ_{ij} wij |yi − yj|.

Equivalently, solve the optimization problem
min_{Y ∈ {0,1}^n} ∞ Σ_{i=1}^{l} (yi − Yli)² + Σ_{ij} wij (yi − yj)²

A combinatorial problem, but it has a polynomial-time solution.
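A sketch of this formulation as an s-t minimum cut solved with networkx max-flow; the 'source'/'sink' terminals and the infinite-capacity clamping edges are the standard reduction, and it assumes both classes appear among the labeled points.

import networkx as nx

def graph_mincut(W, labeled_idx, y_l):
    # W: symmetric weight matrix; labeled_idx: indices of labeled nodes;
    # y_l: their 0/1 labels. Labeled nodes are clamped to a terminal with
    # infinite-capacity edges, then the cut assigns every node a label.
    n = len(W)
    G = nx.DiGraph()
    for i in range(n):
        for j in range(n):
            if i != j and W[i][j] > 0:
                G.add_edge(i, j, capacity=float(W[i][j]))
    for i, y in zip(labeled_idx, y_l):
        if y == 1:
            G.add_edge('source', i, capacity=float('inf'))
        else:
            G.add_edge(i, 'sink', capacity=float('inf'))
    _, (source_side, _) = nx.minimum_cut(G, 'source', 'sink')
    return [1 if i in source_side else 0 for i in range(n)]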

SLIDE 95

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

The mincut algorithm

Mincut computes the modes of a Boltzmann machine. There might be multiple modes; one solution is to randomly perturb the weights and average the results. [Figure: a graph where two different minimum cuts separate + from −.]

SLIDE 96

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

The harmonic function

Relaxing the discrete labels to continuous values in R, the harmonic function f satisfies:
◮ f(xi) = yi for i = 1 . . . l
◮ f minimizes the energy Σ_{i∼j} wij (f(xi) − f(xj))²
◮ f is the mean of a Gaussian random field
◮ f is the average of its neighbors:
f(xi) = Σ_{j∼i} wij f(xj) / Σ_{j∼i} wij, ∀xi ∈ Xu

SLIDE 97

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

An electric network interpretation

◮ Edges are resistors with conductance wij (resistance Rij = 1/wij)
◮ A 1-volt battery connects the labeled points y = 0 and y = 1
◮ The voltage at the nodes is the harmonic function f
◮ Implied similarity: similar voltage if many paths exist

[Figure: the graph drawn as a resistor network attached to a +1 volt battery.]

SLIDE 98

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

A random walk interpretation

◮ Randomly walk from node i to node j with probability wij / Σ_k wik
◮ Stop if we hit a labeled node
◮ The harmonic function is f(i) = Pr(hit a label-1 node first | start from i)

[Figure: a random walk from node i that terminates at a labeled node.]

SLIDE 99

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

An algorithm to compute the harmonic function

One way to compute the harmonic function:
1. Initially, set f(xi) = yi for i = 1 . . . l, and f(xj) arbitrarily (e.g. 0) for xj ∈ Xu.
2. Repeat until convergence: set f(xi) = Σ_{j∼i} wij f(xj) / Σ_{j∼i} wij for all xi ∈ Xu, i.e. the average of the neighbors. Note f(Xl) stays fixed.

This can be viewed as a special case of self-training too. A sketch follows.
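The averaging iteration in numpy; it assumes every node has positive degree and that the first l rows of W correspond to the labeled points.

import numpy as np

def harmonic_iterative(W, y_l, n_iter=1000):
    # W: (n x n) symmetric weights; y_l: labels in {0, 1} for the first l nodes
    l, n = len(y_l), len(W)
    f = np.zeros(n)
    f[:l] = y_l
    for _ in range(n_iter):
        # each node becomes the weighted average of its neighbors...
        f_new = (W @ f) / W.sum(axis=1)
        f[l:] = f_new[l:]                # ...but labeled values stay clamped
    return f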

SLIDE 100

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

The graph Laplacian

We can also compute f in closed form using the graph Laplacian:
◮ n × n weight matrix W on Xl ∪ Xu: symmetric, non-negative
◮ diagonal degree matrix D: Dii = Σ_{j=1}^{n} Wij
◮ graph Laplacian matrix: ∆ = D − W

The energy can be rewritten as
Σ_{i∼j} wij (f(xi) − f(xj))² = f⊤∆f

SLIDE 101

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Harmonic solution with the Laplacian

The harmonic solution minimizes the energy subject to the given labels:
min_f Σ_{i=1}^{l} (f(xi) − yi)² + f⊤∆f

Partition the Laplacian into labeled and unlabeled blocks:
∆ = [∆ll ∆lu; ∆ul ∆uu]

Harmonic solution:
fu = −∆uu^{−1} ∆ul Yl

The normalized Laplacian L = D^{−1/2} ∆ D^{−1/2} = I − D^{−1/2} W D^{−1/2}, or the powers ∆^p, L^p (p > 0), are often used too.
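The closed-form solution in numpy, mirroring fu = −∆uu⁻¹ ∆ul Yl (again with the first l rows labeled):

import numpy as np

def harmonic_closed_form(W, y_l):
    l = len(y_l)
    D = np.diag(W.sum(axis=1))
    Lap = D - W                              # graph Laplacian
    L_uu, L_ul = Lap[l:, l:], Lap[l:, :l]
    f_u = -np.linalg.solve(L_uu, L_ul @ np.asarray(y_l, dtype=float))
    return f_u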

SLIDE 102

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Graph spectrum

∆ = Σ_{i=1}^{n} λi φi φi⊤

[Figure: the first 20 eigenvectors of a graph Laplacian, with eigenvalues from λ1 = 0.00 up to λ20 = 3.96.]

SLIDE 103

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Relation to spectral clustering

f can be decomposed as f = Σ_i αi φi, so that f⊤∆f = Σ_i αi² λi

◮ f wants basis vectors φi with small λi
◮ φ's with small λ's correspond to clusters
◮ f is a balance between spectral clustering and obeying the labeled data

SLIDE 104

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Problems with the harmonic solution

The harmonic solution has two issues:
◮ It fixes the given labels Yl. What if some labels are wrong? We want the flexibility to occasionally disagree with the given labels.
◮ It cannot handle new test points directly: f is only defined on Xu. We have to add new test points to the graph and find a new harmonic solution.

SLIDE 105

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Local and global consistency

Allow f(Xl) to be different from Yl, but penalize the difference. Introduce a balance between the labeled-data fit and the graph energy:

min_f Σ_{i=1}^{l} (f(xi) − yi)² + λ f⊤∆f

SLIDE 106

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Manifold regularization

Manifold regularization solves both issues:
◮ Allows but penalizes f(Xl) ≠ Yl, using the hinge loss
◮ Automatically applies to new test data: it defines a function in the RKHS induced by a kernel K, f(x) = h(x) + b with h(x) ∈ HK
◮ Still prefers a low energy f⊤_{1:n} ∆ f_{1:n}

min_f Σ_{i=1}^{l} (1 − yi f(xi))+ + λ1‖h‖²_HK + λ2 f⊤_{1:n} ∆ f_{1:n}

SLIDE 107

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Manifold regularization algorithm

1. Input: kernel K, weights λ1, λ2, (Xl, Yl), Xu
2. Construct the similarity graph W from Xl, Xu; compute the graph Laplacian ∆
3. Solve the optimization problem for f(x) = h(x) + b, h(x) ∈ HK:
   min_f Σ_{i=1}^{l} (1 − yi f(xi))+ + λ1‖h‖²_HK + λ2 f⊤_{1:n} ∆ f_{1:n}
4. Classify a new test point x by sign(f(x))
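A sketch of a squared-loss variant (Laplacian regularized least squares) with a linear model, chosen here only because it admits a closed form; the tutorial's version uses the hinge loss and an RKHS function, which requires an iterative solver.

import numpy as np

def laprls_linear(X_l, y_l, X_u, W, lam1, lam2):
    # Squared-loss variant with a linear model f(x) = w.x; y_l in {-1, +1}.
    # Minimizes ||X_l w - y||^2 + lam1 ||w||^2 + lam2 (Xw)^T Laplacian (Xw).
    X = np.vstack([X_l, X_u])
    D = np.diag(W.sum(axis=1))
    Lap = D - W                              # graph Laplacian on all n points
    d = X.shape[1]
    A = X_l.T @ X_l + lam1 * np.eye(d) + lam2 * X.T @ Lap @ X
    w = np.linalg.solve(A, X_l.T @ y_l)
    return w                                 # classify by sign(X_new @ w)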

SLIDE 108

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Advantages of graph-based methods

◮ Clear mathematical framework
◮ Performance is strong if the graph happens to fit the task
◮ The (pseudo)inverse of the Laplacian can be viewed as a kernel matrix
◮ Can be extended to directed graphs

SLIDE 109

Semi-Supervised Learning Algorithms: Graph-Based Algorithms

Disadvantages of graph-based methods

◮ Performance is poor if the graph is bad
◮ Sensitive to graph structure and edge weights

SLIDE 110

Semi-Supervised Learning Algorithms: Multiview Algorithms

Outline recap: Section 2 concludes with Multiview Algorithms.

SLIDE 111

Semi-Supervised Learning Algorithms: Multiview Algorithms

Co-training

Two views of an item: its image and the surrounding HTML text. [Figure: a web page containing an image and descriptive text.]

SLIDE 112

Semi-Supervised Learning Algorithms: Multiview Algorithms

Feature split

Each instance is represented by two sets of features, x = [x(1); x(2)]:
◮ x(1) = image features
◮ x(2) = web page text

This is a natural feature split (or multiple views). Co-training idea: train an image classifier and a text classifier, and let the two classifiers teach each other.

SLIDE 113

Semi-Supervised Learning Algorithms: Multiview Algorithms

Co-training assumptions

Assumptions:
◮ a feature split x = [x(1); x(2)] exists
◮ x(1) or x(2) alone is sufficient to train a good classifier
◮ x(1) and x(2) are conditionally independent given the class

[Figure: the same data seen in the X1 view and the X2 view, each separable on its own.]

SLIDE 114

Semi-Supervised Learning Algorithms: Multiview Algorithms

Co-training algorithm

1. Train two classifiers: f(1) from (Xl(1), Yl), f(2) from (Xl(2), Yl).
2. Classify Xu with f(1) and f(2) separately.
3. Add f(1)'s k most confident (x, f(1)(x)) to f(2)'s labeled data.
4. Add f(2)'s k most confident (x, f(2)(x)) to f(1)'s labeled data.
5. Repeat.
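A minimal sketch of the loop, assuming two scikit-learn-style classifiers with fit/predict_proba and classes encoded 0..K−1; k and the round count are illustrative.

import numpy as np

def co_train(clf1, clf2, X1_l, X2_l, y_l, X1_u, X2_u, k=5, n_rounds=10):
    # idx1/y1: extra pool for clf1 (labeled by clf2); idx2/y2: vice versa
    idx1, y1 = np.array([], dtype=int), np.array([], dtype=int)
    idx2, y2 = np.array([], dtype=int), np.array([], dtype=int)
    remaining = np.arange(len(X1_u))
    for _ in range(n_rounds):
        clf1.fit(np.vstack([X1_l, X1_u[idx1]]), np.concatenate([y_l, y1]))
        clf2.fit(np.vstack([X2_l, X2_u[idx2]]), np.concatenate([y_l, y2]))
        for view in (1, 2):
            if len(remaining) == 0:
                break
            clf, X_view = (clf1, X1_u) if view == 1 else (clf2, X2_u)
            proba = clf.predict_proba(X_view[remaining])
            order = np.argsort(proba.max(axis=1))[-k:]   # k most confident
            top, labels = remaining[order], proba[order].argmax(axis=1)
            if view == 1:    # clf1's confident picks go to clf2's pool
                idx2, y2 = np.concatenate([idx2, top]), np.concatenate([y2, labels])
            else:
                idx1, y1 = np.concatenate([idx1, top]), np.concatenate([y1, labels])
            remaining = np.setdiff1d(remaining, top)
    return clf1, clf2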

SLIDE 115

Semi-Supervised Learning Algorithms: Multiview Algorithms

Pros and cons of co-training

Pros:
◮ Simple wrapper method; applies to almost all existing classifiers
◮ Less sensitive to mistakes than self-training

Cons:
◮ Natural feature splits may not exist
◮ Models using BOTH features should do better

SLIDE 116

Semi-Supervised Learning Algorithms: Multiview Algorithms

Variants of co-training

◮ Co-EM: add all, not just the top k; each classifier probabilistically labels Xu and adds (x, y) with weight P(y|x)
◮ Fake feature split: create a random, artificial feature split, then apply co-training
◮ Multiview: agreement among multiple classifiers; no feature split; train multiple classifiers of different types, classify the unlabeled data with all of them, and add the majority-vote label

SLIDE 117

Semi-Supervised Learning Algorithms: Multiview Algorithms

Multiview learning

A regularized risk minimization framework that encourages multi-learner agreement:

min_{f1,...,fM} Σ_{v=1}^{M} [ Σ_{i=1}^{l} c(yi, fv(xi)) + λ1‖fv‖²_K ] + λ2 Σ_{u,v=1}^{M} Σ_{i=l+1}^{n} (fu(xi) − fv(xi))²

M learners; c() is the loss function, e.g. the hinge loss.

SLIDE 118

Semi-Supervised Learning in Nature

Outline recap: Section 3, Semi-Supervised Learning in Nature.

SLIDE 119

Semi-Supervised Learning in Nature

Do we learn from both labeled and unlabeled data?

Learning existed long before machine learning. Do humans perform semi-supervised learning? It seems so. We discuss three human experiments:
1. visual recognition with temporal association
2. infant word-object mapping
3. novel object categorization

SLIDE 120

Semi-Supervised Learning in Nature

Visual recognition with temporal association

A face seen from two angles looks very different, yet we associate the two views easily. The image sequence (unlabeled data) might be the glue: artificially wrong sequences (person A's profile morphing into B's frontal view) damage people's ability to match test profile and frontal images.

SLIDE 121

Semi-Supervised Learning in Nature

Infant word-object mapping

17-month-old infants listen to a word and see an object; measure their ability to associate the word and the object:
◮ If the word was heard many times before (without seeing the object; unlabeled data), the association is stronger.
◮ If the word was not heard before, the association is weaker.

This is similar to cluster-then-label.

SLIDES 122-123

Semi-Supervised Learning in Nature

Novel object categorization

[Figure: as on slides 20-21, labeled data, unlabeled data, and the decision boundary with and without unlabeled data.]

Assuming each class is a coherent group (e.g. Gaussian), machine learning shifts the decision boundary when unlabeled data is added. Do we humans shift our decision boundary too?

SLIDES 124-125

Semi-Supervised Learning in Nature

Human learning: a behavioral experiment

Determine the human decision boundary with labeled data only vs. labeled and unlabeled data.

Participants and materials:
◮ 22 UW students
◮ told the visual stimuli (examples) are microscopic pollens
◮ stimuli displayed one at a time; press 'b' or 'n' to classify
◮ the label is audio feedback

SLIDE 126

Semi-Supervised Learning in Nature

Visual stimuli

Stimuli are parameterized by a continuous scalar x. [Figure: example stimuli at x values from −2.5 to 2.5.]

SLIDES 127-132

Semi-Supervised Learning in Nature

Experiment procedure

[Figure: a left-shifted Gaussian mixture over x, with the range examples and test examples marked.]

6 blocks:
1. 20 labeled points at x = −1, 1
2. 21 test examples in [−1, 1] (all examples are unlabeled from now on)
3. 230 examples drawn from the offset GMM, plus 21 range examples in [−2.5, 2.5]
4. similar to block 3
5. similar to block 3
6. 21 test examples in [−1, 1] again

12 participants receive the left-offset GMM, 10 receive the right-offset GMM. Record their decisions and response times.

SLIDES 133-134

Semi-Supervised Learning in Nature

Observation 1: unlabeled data affects the decision boundary

[Figure: percent class-2 responses vs. x for test-1 (all subjects) and test-2 (L-subjects and R-subjects).]

Average decision boundary after seeing labeled data (block 2): x = 0.11. After seeing labeled and unlabeled data (block 6): L-subjects x = −0.10, R-subjects x = 0.48.

slide-135
SLIDE 135

Semi-Supervised Learning in Nature

Observation 2: unlabeled data affects reaction time

−1 −0.5 0.5 1 400 450 500 550 600 650 700 750 800 850 900 x reaction time (ms) test−1, all test−2, L−subjects test−2, R−subjects

longer reaction time → harder example → closer to decision boundary

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 120 / 135

slide-136
SLIDE 136

Semi-Supervised Learning in Nature

Observation 2: unlabeled data affects reaction time

−1 −0.5 0.5 1 400 450 500 550 600 650 700 750 800 850 900 x reaction time (ms) test−1, all test−2, L−subjects test−2, R−subjects

longer reaction time → harder example → closer to decision boundary block 2: reaction time peak near x = 0.11

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 120 / 135

slide-137
SLIDE 137

Semi-Supervised Learning in Nature

Observation 2: unlabeled data affects reaction time

−1 −0.5 0.5 1 400 450 500 550 600 650 700 750 800 850 900 x reaction time (ms) test−1, all test−2, L−subjects test−2, R−subjects

longer reaction time → harder example → closer to decision boundary
block 2: reaction time peak near x = 0.11
block 6: overall faster, reflecting familiarity with the experiment

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 120 / 135

slide-138
SLIDE 138

Semi-Supervised Learning in Nature

Observation 2: unlabeled data affects reaction time

[Figure: reaction time (ms) vs. x ∈ [−1, 1]; curves for test-1 (all), test-2 (L-subjects), test-2 (R-subjects).]

longer reaction time → harder example → closer to decision boundary
block 2: reaction time peak near x = 0.11
block 6: overall faster, reflecting familiarity with the experiment
L-subjects' reaction times plateau around x = −0.1; R-subjects' peak around x = 0.6
Reaction times, too, suggest a decision-boundary shift.

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 120 / 135

slide-139
SLIDE 139

Semi-Supervised Learning in Nature

Machine learning: Gaussian Mixture Model

We can explain the human experiment with a semi-supervised machine learning model.

A Gaussian Mixture Model $\theta = \{w_1, \mu_1, \sigma_1^2, w_2, \mu_2, \sigma_2^2\}$ with 2 components:

$$w_1 N(\mu_1, \sigma_1^2) + w_2 N(\mu_2, \sigma_2^2), \qquad w_1 + w_2 = 1,\ w_i \ge 0$$

Prior: $w_k \sim \mathrm{Uniform}[0, 1]$, $\mu_k \sim N(0, \infty)$, $\sigma_k^2 \sim \mathrm{Inv}\text{-}\chi^2(\nu, s^2)$, $k = 1, 2$

Data (assume: remember all, order independent): $D = \{(x_1, y_1), \ldots, (x_l, y_l), x_{l+1}, \ldots, x_n\}$

Goal: find $\theta_{\mathrm{MAP}} = \arg\max_\theta p(\theta)\, p(D \mid \theta)$

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 121 / 135

slide-140
SLIDE 140

Semi-Supervised Learning in Nature

EM

Maximize the objective ($\lambda \le 1$ is the weight on unlabeled examples):

$$\log p(\theta) + \sum_{i=1}^{l} \log p(x_i, y_i \mid \theta) + \lambda \sum_{i=l+1}^{n} \log p(x_i \mid \theta)$$

E-step:

$$q_i(k) \propto w_k N(x_i; \mu_k, \sigma_k^2), \qquad i = l+1, \ldots, n;\ k = 1, 2$$

M-step (with $e_{ik} = (x_i - \mu_k)^2$):

$$\mu_k = \frac{\sum_{i=1}^{l} \delta(y_i, k)\, x_i + \lambda \sum_{i=l+1}^{n} q_i(k)\, x_i}{\sum_{i=1}^{l} \delta(y_i, k) + \lambda \sum_{i=l+1}^{n} q_i(k)}$$

$$\sigma_k^2 = \frac{\nu s^2 + \sum_{i=1}^{l} \delta(y_i, k)\, e_{ik} + \lambda \sum_{i=l+1}^{n} q_i(k)\, e_{ik}}{\nu + 2 + \sum_{i=1}^{l} \delta(y_i, k) + \lambda \sum_{i=l+1}^{n} q_i(k)}$$

$$w_k = \frac{\sum_{i=1}^{l} \delta(y_i, k) + \lambda \sum_{i=l+1}^{n} q_i(k)}{l + \lambda(n - l)}$$

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 122 / 135
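A minimal NumPy sketch of exactly these updates, for 1-D data, two components, and labels in {1, 2}. The hyperparameters ν and s² and the initialization are illustrative assumptions; λ = 0.06 anticipates the fitted value reported two slides later.

```python
# MAP EM for a 2-component 1-D GMM with down-weighted (lambda) unlabeled
# data, following the slide's updates. nu, s2, and the initialization are
# illustrative assumptions.
import numpy as np

def ssl_gmm_em(xl, yl, xu, lam=0.06, nu=1.0, s2=0.25, iters=100):
    x = np.concatenate([xl, xu])
    l, n = len(xl), len(xl) + len(xu)
    # labeled points have hard responsibilities delta(y_i, k)
    delta = np.stack([(yl == 1), (yl == 2)], axis=1).astype(float)
    mu = np.array([xl[yl == 1].mean(), xl[yl == 2].mean()])
    sig2 = np.array([s2, s2])
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: q_i(k) ∝ w_k N(x_i; mu_k, sig2_k) for unlabeled points
        dens = w * np.exp(-(xu[:, None] - mu) ** 2 / (2 * sig2)) / np.sqrt(sig2)
        q = dens / dens.sum(axis=1, keepdims=True)
        # M-step: labeled terms get weight 1, unlabeled terms weight lam
        r = np.vstack([delta, lam * q])   # effective responsibilities, (n, 2)
        rk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / rk
        e = (x[:, None] - mu) ** 2        # e_ik = (x_i - mu_k)^2
        sig2 = (nu * s2 + (r * e).sum(axis=0)) / (nu + 2 + rk)
        w = rk / (l + lam * (n - l))
    return w, mu, sig2

# usage: xl = np.array([-1.0]*10 + [1.0]*10); yl = np.array([1]*10 + [2]*10)
# The decision boundary is the x where w1 N(x;mu1,sig2_1) = w2 N(x;mu2,sig2_2).
```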

slide-141
SLIDE 141

Semi-Supervised Learning in Nature

Model fitting result 1

GMM predicts decision boundary shift:

[Figure: fitted p(y = 2 | x) vs. x ∈ [−1, 1]; curves for test-1, test-2 (L-data), test-2 (R-data).]

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 123 / 135

slide-142
SLIDE 142

Semi-Supervised Learning in Nature

Model fitting result 2

Unlabeled data seem to be worth less than labeled data (λ = 0.06)

[Figure: fitted decision boundary vs. λ (log scale, 10⁻² to 10⁰); separate curves for L-data and R-data.]

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 124 / 135

slide-143
SLIDE 143

Semi-Supervised Learning in Nature

Model fitting result 3

GMM explains reaction time:

[Figure: fitted reaction time (ms) vs. x ∈ [−1, 1]; curves for test-1, test-2 (L-data), test-2 (R-data).]

t = aH(x) + b

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 125 / 135
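A small sketch of the fit t = aH(x) + b, under the assumption that H(x) is the entropy of the fitted model's label posterior p(y | x) (a natural reading: the most uncertain x take the longest). The coefficients a and b are placeholders; in the experiment they would be fit to the observed reaction times.

```python
# Sketch of t = a*H(x) + b, assuming H(x) is the entropy of the fitted
# GMM's label posterior p(y|x); a and b here are placeholder values.
import numpy as np

def posterior(x, w, mu, sig2):
    # p(y = k | x) under the 2-component GMM (shared constants cancel)
    dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * sig2)) / np.sqrt(sig2)
    return dens / dens.sum(axis=1, keepdims=True)

def predicted_rt(x, w, mu, sig2, a=400.0, b=450.0):
    p = posterior(x, w, mu, sig2)
    H = -(p * np.log2(np.clip(p, 1e-12, 1.0))).sum(axis=1)  # entropy, bits
    return a * H + b  # peaks where the posterior is most uncertain
```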

slide-144
SLIDE 144

Semi-Supervised Learning in Nature

Findings

Humans and machines both perform semi-supervised learning. Understanding natural learning may lead to new machine learning algorithms.

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 126 / 135

slide-145
SLIDE 145

Some Challenges for Future Research

Outline

1. Introduction to Semi-Supervised Learning
2. Semi-Supervised Learning Algorithms: Self Training, Generative Models, S3VMs, Graph-Based Algorithms, Multiview Algorithms
3. Semi-Supervised Learning in Nature
4. Some Challenges for Future Research

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 127 / 135

slide-146
SLIDE 146

Some Challenges for Future Research

Challenge 0: Real SSL tasks

What tasks can be dramatically improved by SSL, so that new functionalities are enabled?

Move from two-moons to the real world.

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 128 / 135

slide-147
SLIDE 147

Some Challenges for Future Research

Challenge 1: New SSL assumptions

Generative models, multiview, graph methods, S3VMs:

Generative:
$$\sum_{i=1}^{l} \log p(y_i \mid \theta)\, p(x_i \mid y_i, \theta) + \lambda \sum_{i=l+1}^{n} \log \sum_{y=1}^{c} p(y \mid \theta)\, p(x_i \mid y, \theta)$$

Multiview:
$$\min_f \sum_{v=1}^{M} \left[ \sum_{i=1}^{l} c(y_i, f_v(x_i)) + \lambda_1 \|f_v\|_K^2 \right] + \lambda_2 \sum_{u,v=1}^{M} \sum_{i=l+1}^{n} (f_u(x_i) - f_v(x_i))^2$$

Graph:
$$\min_f \sum_{i=1}^{l} c(y_i, f(x_i)) + \lambda_1 \|f\|_K^2 + \lambda_2 \sum_{i,j=1}^{n} w_{ij} (f(x_i) - f(x_j))^2$$

S3VM:
$$\min_f \sum_{i=1}^{l} (1 - y_i f(x_i))_+ + \lambda_1 \|f\|_K^2 + \lambda_2 \sum_{i=l+1}^{n} (1 - |f(x_i)|)_+$$

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 129 / 135
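For concreteness, a sketch that evaluates the last objective above (the S3VM): hinge loss on labeled points, a norm regularizer, and the "hat" loss (1 − |f(x)|)+ pushing unlabeled points out of the margin. The linear form of f and the λ values are assumptions; actually minimizing this non-convex objective is the hard part, and the sketch only evaluates it.

```python
# Evaluate the S3VM objective for a linear f(x) = w·x + b (a sketch;
# the linear form and the lambda values are illustrative assumptions).
import numpy as np

def s3vm_objective(wb, xl, yl, xu, lam1=1.0, lam2=1.0):
    w, b = wb[:-1], wb[-1]
    fl, fu = xl @ w + b, xu @ w + b
    hinge = np.maximum(0.0, 1.0 - yl * fl).sum()   # labels yl in {-1, +1}
    hat = np.maximum(0.0, 1.0 - np.abs(fu)).sum()  # unlabeled "hat" loss
    return hinge + lam1 * (w @ w) + lam2 * hat
```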

slide-148
SLIDE 148

Some Challenges for Future Research

Challenge 1: New SSL assumptions

What other assumptions can we make on unlabeled data? For example:

Label dissimilarity ($y_i \neq y_j$):

$$\sum_{i,j} w_{ij} (f(x_i) - s_{ij} f(x_j))^2$$

where $w_{ij}$ is the edge confidence, and $s_{ij} = 1$ for same label, $s_{ij} = -1$ for different labels.

Order preference ($y_i - y_j \ge d$) for regression:

$$(d - (f(x_i) - f(x_j)))_+$$

New assumptions may lead to new SSL algorithms.

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 130 / 135
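A sketch of the two penalty terms above: the signed graph-smoothness term with s_ij ∈ {+1, −1}, and the hinged order-preference term for regression. The edge and preference encodings are illustrative choices.

```python
# Sketches of the two proposed penalties; the encodings are illustrative.

def signed_graph_penalty(f, edges):
    # edges: (i, j, w_ij, s_ij); s_ij = +1 same label, -1 different label
    return sum(w * (f[i] - s * f[j]) ** 2 for i, j, w, s in edges)

def order_preference_penalty(f, prefs):
    # prefs: (i, j, d) encoding the belief y_i - y_j >= d
    return sum(max(0.0, d - (f[i] - f[j])) for i, j, d in prefs)
```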

slide-149
SLIDE 149

Some Challenges for Future Research

Challenge 2: Efficiency on huge unlabeled datasets

Some recent SSL datasets as reported in research papers:

[Figure: labeled vs. unlabeled data sizes reported in recent SSL papers, on log scales from 10² to 10¹⁰, with reference points for a full stadium, US internet users, and the world population.]

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 131 / 135

slide-150
SLIDE 150

Some Challenges for Future Research

Challenge 3: Safe SSL

no pain, no gain

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 132 / 135

slide-151
SLIDE 151

Some Challenges for Future Research

Challenge 3: Safe SSL

no pain, no gain
no model assumption, no gain

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 132 / 135

slide-152
SLIDE 152

Some Challenges for Future Research

Challenge 3: Safe SSL

no pain, no gain
no model assumption, no gain
wrong model assumption, no gain, a lot of pain

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 132 / 135

slide-153
SLIDE 153

Some Challenges for Future Research

Challenge 3: Safe SSL

no pain, no gain
no model assumption, no gain
wrong model assumption, no gain, a lot of pain

An example where S3VMs and graph-based methods will not work, but a GMM will:

[Figure: a one-dimensional data distribution over x ∈ [−5, 5] illustrating the example.]

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 132 / 135

slide-154
SLIDE 154

Some Challenges for Future Research

Challenge 3: Safe SSL

How do we know that we are making the right model assumptions? Which semi-supervised learning method should I use?

If I have labeled AND unlabeled data, I should do at least as well as with only the labeled data.

How can we make sure that SSL is "safe"?

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 133 / 135

slide-155
SLIDE 155

Some Challenges for Future Research

Challenge 4: What can we borrow from Natural Learning?

Example: semi-supervised learning with trees

A tree over labeled and unlabeled data (inspired by taxonomy)
A label mutation process over the edges defines a prior

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 134 / 135
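The slide only names the idea, so here is one plausible reading, as a heavily hedged sketch: a prior over labelings obtained by forward-sampling a label at the root and flipping it with small probability along each edge. The mutation probability and the sampling scheme are illustrative assumptions, not the specific model the slide refers to.

```python
# One illustrative reading of a label-mutation prior over a tree: labels
# propagate from the root and flip with probability eps along each edge.
import random

def sample_labels(tree, root, eps=0.05):
    # tree: dict node -> list of children; returns labels for all nodes
    labels = {root: random.choice([0, 1])}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in tree.get(u, []):
            labels[v] = labels[u] ^ (random.random() < eps)  # mutate
            stack.append(v)
    return labels

# Repeated sampling induces a prior in which nearby leaves (the labeled
# and unlabeled data points) tend to share labels.
```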

slide-156
SLIDE 156

Some Challenges for Future Research

References

1. Olivier Chapelle, Alexander Zien, Bernhard Schölkopf (Eds.) (2006). Semi-Supervised Learning. MIT Press.

2. Xiaojin Zhu (2005). Semi-Supervised Learning Literature Survey. Technical Report TR-1530, Department of Computer Sciences, University of Wisconsin-Madison.

3. Matthias Seeger (2001). Learning with Labeled and Unlabeled Data. Technical Report, University of Edinburgh.

... and the references therein.

Thank you

Xiaojin Zhu (Univ. Wisconsin, Madison) Semi-Supervised Learning Tutorial ICML 2007 135 / 135