Semi-Supervised Learning, Jia-Bin Huang, Virginia Tech, Spring 2019



SLIDE 1

Semi-Supervised Learning

Jia-Bin Huang Virginia Tech

Spring 2019

ECE-5424G / CS-5824

SLIDE 2

Administrative

  • HW 4 due April 10
SLIDE 3

Recommender Systems

  • Motivation
  • Problem formulation
  • Content-based recommendations
  • Collaborative filtering
  • Mean normalization
SLIDE 4

Problem motivation

| Movie                | Alice (1) | Bob (2) | Carol (3) | Dave (4) | $x_1$ (romance) | $x_2$ (action) |
|----------------------|-----------|---------|-----------|----------|-----------------|----------------|
| Love at last         | 5         | 5       | 0         | 0        | 0.9             | 0              |
| Romance forever      | 5         | ?       | ?         | 0        | 1.0             | 0.01           |
| Cute puppies of love | ?         | 4       | 0         | ?        | 0.99            | 0              |
| Nonstop car chases   | 0         | 0       | 5         | 4        | 0.1             | 1.0            |
| Swords vs. karate    | 0         | 0       | 5         | ?        | 0               | 0.9            |
SLIDE 5

Problem motivation

| Movie                | Alice (1) | Bob (2) | Carol (3) | Dave (4) | $x_1$ (romance) | $x_2$ (action) |
|----------------------|-----------|---------|-----------|----------|-----------------|----------------|
| Love at last         | 5         | 5       | 0         | 0        | ?               | ?              |
| Romance forever      | 5         | ?       | ?         | 0        | ?               | ?              |
| Cute puppies of love | ?         | 4       | 0         | ?        | ?               | ?              |
| Nonstop car chases   | 0         | 0       | 5         | 4        | ?               | ?              |
| Swords vs. karate    | 0         | 0       | 5         | ?        | ?               | ?              |

Given $\theta^{(1)} = [0; 5; 0]$, $\theta^{(2)} = [0; 5; 0]$, $\theta^{(3)} = [0; 0; 5]$, $\theta^{(4)} = [0; 0; 5]$, infer $x^{(1)} = \,?$

SLIDE 6

Optimization algorithm

  • Given $\theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(n_u)}$, to learn $x^{(i)}$:

$$\min_{x^{(i)}} \; \frac{1}{2} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

  • Given $\theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(n_u)}$, to learn $x^{(1)}, x^{(2)}, \cdots, x^{(n_m)}$:

$$\min_{x^{(1)}, \cdots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

SLIDE 7

Collaborative filtering

  • Given $x^{(1)}, x^{(2)}, \cdots, x^{(n_m)}$ (and movie ratings), can estimate $\theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(n_u)}$
  • Given $\theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(n_u)}$, can estimate $x^{(1)}, x^{(2)}, \cdots, x^{(n_m)}$

SLIDE 8

Collaborative filtering optimization objective

  • Given $x^{(1)}, \cdots, x^{(n_m)}$, estimate $\theta^{(1)}, \cdots, \theta^{(n_u)}$:

$$\min_{\theta^{(1)}, \cdots, \theta^{(n_u)}} \; \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$

  • Given $\theta^{(1)}, \cdots, \theta^{(n_u)}$, estimate $x^{(1)}, \cdots, x^{(n_m)}$:

$$\min_{x^{(1)}, \cdots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

SLIDE 9

Collaborative filtering optimization objective

  • Given $x^{(1)}, \cdots, x^{(n_m)}$, estimate $\theta^{(1)}, \cdots, \theta^{(n_u)}$:

$$\min_{\theta^{(1)}, \cdots, \theta^{(n_u)}} \; \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$

  • Given $\theta^{(1)}, \cdots, \theta^{(n_u)}$, estimate $x^{(1)}, \cdots, x^{(n_m)}$:

$$\min_{x^{(1)}, \cdots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

  • Minimize $x^{(1)}, \cdots, x^{(n_m)}$ and $\theta^{(1)}, \cdots, \theta^{(n_u)}$ simultaneously:

$$J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

SLIDE 10

Collaborative filtering optimization objective

$$J(x^{(1)}, \cdots, x^{(n_m)}, \theta^{(1)}, \cdots, \theta^{(n_u)}) = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

SLIDE 11

Collaborative filtering algorithm

  • Initialize $x^{(1)}, \cdots, x^{(n_m)}, \theta^{(1)}, \cdots, \theta^{(n_u)}$ to small random values
  • Minimize $J(x^{(1)}, \cdots, x^{(n_m)}, \theta^{(1)}, \cdots, \theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). For every $j = 1, \cdots, n_u$ and $i = 1, \cdots, n_m$:

$$x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)$$

$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)$$

  • For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^\top x$
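The two updates above vectorize over all movies and users at once. A minimal NumPy sketch (the function name and hyperparameters are illustrative, not from the slides):

```python
import numpy as np

def collab_filter(Y, R, n_features=2, lam=0.001, alpha=0.01, iters=8000, seed=0):
    """Joint gradient descent on movie features X and user parameters Theta.

    Y: (n_m, n_u) ratings matrix; R: (n_m, n_u) 0/1 mask of observed ratings.
    Returns X (n_m, n) and Theta (n_u, n) minimizing the regularized
    squared error over the observed entries.
    """
    rng = np.random.default_rng(seed)
    n_m, n_u = Y.shape
    X = 0.1 * rng.standard_normal((n_m, n_features))
    Theta = 0.1 * rng.standard_normal((n_u, n_features))
    for _ in range(iters):
        E = (X @ Theta.T - Y) * R           # residuals on rated entries only
        X_grad = E @ Theta + lam * X        # dJ/dX
        Theta_grad = E.T @ X + lam * Theta  # dJ/dTheta
        X -= alpha * X_grad
        Theta -= alpha * Theta_grad
    return X, Theta
```

Predicted ratings for every (movie, user) pair are then simply `X @ Theta.T`.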

SLIDE 12

Collaborative filtering

| Movie                | Alice (1) | Bob (2) | Carol (3) | Dave (4) |
|----------------------|-----------|---------|-----------|----------|
| Love at last         | 5         | 5       | 0         | 0        |
| Romance forever      | 5         | ?       | ?         | 0        |
| Cute puppies of love | ?         | 4       | 0         | ?        |
| Nonstop car chases   | 0         | 0       | 5         | 4        |
| Swords vs. karate    | 0         | 0       | 5         | ?        |

SLIDE 13

Collaborative filtering

  • Predicted ratings:

$$X = \begin{bmatrix} - \, (x^{(1)})^\top - \\ - \, (x^{(2)})^\top - \\ \vdots \\ - \, (x^{(n_m)})^\top - \end{bmatrix} \qquad \Theta = \begin{bmatrix} - \, (\theta^{(1)})^\top - \\ - \, (\theta^{(2)})^\top - \\ \vdots \\ - \, (\theta^{(n_u)})^\top - \end{bmatrix} \qquad Y = X \Theta^\top$$

Low-rank matrix factorization

SLIDE 14

Finding related movies/products

  • For each product $i$, we learn a feature vector $x^{(i)} \in \mathbb{R}^n$ ($x_1$: romance, $x_2$: action, $x_3$: comedy, ...)
  • How to find movies $j$ related to movie $i$? A small $\| x^{(i)} - x^{(j)} \|$ means movies $i$ and $j$ are "similar"
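Once features are learned, related movies are just nearest neighbors in feature space. A tiny NumPy helper (the function name and the feature values in the usage note are illustrative):

```python
import numpy as np

def most_related(X, i, top_k=2):
    """Indices of the top_k movies whose learned feature vectors are
    closest (Euclidean distance) to movie i's vector."""
    d = np.linalg.norm(X - X[i], axis=1)  # distance to every movie
    d[i] = np.inf                         # exclude the movie itself
    return np.argsort(d)[:top_k]
```

For example, with rows like `[0.9, 0]` (romance) and `[0.1, 1.0]` (action), the nearest neighbor of a romance movie is another romance movie.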

SLIDE 15

Recommender Systems

  • Motivation
  • Problem formulation
  • Content-based recommendations
  • Collaborative filtering
  • Mean normalization
SLIDE 16

Users who have not rated any movies

$$\frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

$\theta^{(5)} = 0$

| Movie                | Alice (1) | Bob (2) | Carol (3) | Dave (4) | Eve (5) |
|----------------------|-----------|---------|-----------|----------|---------|
| Love at last         | 5         | 5       | 0         | 0        | ?       |
| Romance forever      | 5         | ?       | ?         | 0        | ?       |
| Cute puppies of love | ?         | 4       | 0         | ?        | ?       |
| Nonstop car chases   | 0         | 0       | 5         | 4        | ?       |
| Swords vs. karate    | 0         | 0       | 5         | ?        | ?       |

SLIDE 17

Users who have not rated any movies

(Same objective and ratings table as Slide 16.) For Eve, no terms with $r(i,5) = 1$ appear in the squared-error sum, so minimizing the objective drives $\theta^{(5)} = 0$ and every predicted rating $(\theta^{(5)})^\top x^{(i)}$ is 0.

SLIDE 18

Mean normalization

  • For user $j$, on movie $i$ predict: $(\theta^{(j)})^\top x^{(i)} + \mu_i$, where $\mu_i$ is movie $i$'s mean rating
  • User 5 (Eve): $\theta^{(5)} = 0$, so the prediction is $(\theta^{(5)})^\top x^{(i)} + \mu_i = \mu_i$
  • Learn $\theta^{(j)}$, $x^{(i)}$ on the mean-normalized ratings
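Mean normalization only needs the per-movie average over rated entries. A minimal sketch (names illustrative):

```python
import numpy as np

def mean_normalize(Y, R):
    """Per-movie mean over rated entries, plus the mean-subtracted ratings.

    Y: (n_m, n_u) ratings; R: 0/1 mask of observed ratings.
    A user with theta = 0 then gets predicted rating mu[i] for movie i.
    """
    mu = (Y * R).sum(axis=1) / np.maximum(R.sum(axis=1), 1)  # movie means
    Y_norm = (Y - mu[:, None]) * R   # subtract the mean only where rated
    return Y_norm, mu
```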

SLIDE 19

Recommender Systems

  • Motivation
  • Problem formulation
  • Content-based recommendations
  • Collaborative filtering
  • Mean normalization
SLIDE 20

Review: Supervised Learning

  • K nearest neighbor
  • Linear Regression
  • Naïve Bayes
  • Logistic Regression
  • Support Vector Machines
  • Neural Networks
SLIDE 21

Review: Unsupervised Learning

  • Clustering, K-Means
  • Expectation maximization
  • Dimensionality reduction
  • Anomaly detection
  • Recommendation system
SLIDE 22

Advanced Topics

  • Semi-supervised learning
  • Probabilistic graphical models
  • Generative models
  • Sequence prediction models
  • Deep reinforcement learning
SLIDE 23

Semi-supervised Learning

  • Motivation
  • Problem formulation
  • Consistency regularization
  • Entropy-based method
  • Pseudo-labeling
SLIDE 25

Classic Paradigm Insufficient Nowadays

  • Modern applications: massive amounts of raw data
  • Only a tiny fraction can be annotated by human experts

Examples: protein sequences, billions of webpages, images

SLIDE 26

Semi-supervised Learning

SLIDE 27

Active Learning

SLIDE 28

Semi-supervised Learning

  • Motivation
  • Problem formulation
  • Consistency regularization
  • Entropy-based method
  • Pseudo-labeling
SLIDE 29

Semi-supervised Learning Problem Formulation

  • Labeled data: $S_l = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m_l)}, y^{(m_l)}) \}$
  • Unlabeled data: $S_u = \{ x^{(1)}, x^{(2)}, \cdots, x^{(m_u)} \}$
  • Goal: Learn a hypothesis $h_\theta$ (e.g., a classifier) that has small error
SLIDE 30

Combining labeled and unlabeled data

  • Classical methods:
  • Transductive SVM [Joachims '99]
  • Co-training [Blum and Mitchell '98]
  • Graph-based methods [Blum and Chawla '01] [Zhu, Ghahramani, Lafferty '03]

SLIDE 31

Transductive SVM

  • The separator goes through low-density regions of the space (large margin)

SLIDE 32

SVM

Inputs: $x_l^{(i)}, y_l^{(i)}$

$$\min_{\theta} \; \frac{1}{2} \sum_{k=1}^{n} \theta_k^2 \quad \text{s.t.} \quad y_l^{(i)} \, \theta^\top x_l^{(i)} \geq 1$$

Transductive SVM

Inputs: $x_l^{(i)}, y_l^{(i)}, x_u^{(i)}$

$$\min_{\theta, \hat{y}_u} \; \frac{1}{2} \sum_{k=1}^{n} \theta_k^2 \quad \text{s.t.} \quad y_l^{(i)} \, \theta^\top x_l^{(i)} \geq 1, \qquad \hat{y}_u^{(i)} \, \theta^\top x_u^{(i)} \geq 1, \qquad \hat{y}_u^{(i)} \in \{ -1, 1 \}$$

SLIDE 33

Transductive SVMs

  • First maximize the margin over the labeled points
  • Use this separator to give initial labels to the unlabeled points
  • Try flipping labels of unlabeled points to see if doing so can increase the margin
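The three steps above can be sketched end to end. This toy version (all names hypothetical) replaces the hard-margin SVM solver with subgradient descent on a soft-margin hinge objective, then greedily flips unlabeled labels whenever a flip lowers the joint objective:

```python
import numpy as np

def hinge_obj(theta, X, y, lam=0.01):
    # Soft-margin objective: mean hinge loss + L2 penalty on theta.
    margins = y * (X @ theta)
    return np.maximum(0.0, 1.0 - margins).mean() + lam * (theta @ theta)

def train_hinge(X, y, lam=0.01, lr=0.1, iters=500):
    # Linear classifier via subgradient descent on hinge_obj.
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        active = y * (X @ theta) < 1  # points inside the margin
        grad = -(y[active, None] * X[active]).sum(axis=0) / len(y) + 2 * lam * theta
        theta -= lr * grad
    return theta

def transductive_svm(Xl, yl, Xu, rounds=5):
    # 1) Maximize the margin over the labeled points only.
    theta = train_hinge(Xl, yl)
    # 2) Initial labels for the unlabeled points from this separator.
    yu = np.sign(Xu @ theta)
    yu[yu == 0] = 1.0
    X = np.vstack([Xl, Xu])
    for _ in range(rounds):
        theta = train_hinge(X, np.concatenate([yl, yu]))
        # 3) Flip an unlabeled label whenever that lowers the objective.
        improved = False
        for i in range(len(yu)):
            trial = yu.copy()
            trial[i] = -trial[i]
            if (hinge_obj(theta, X, np.concatenate([yl, trial]))
                    < hinge_obj(theta, X, np.concatenate([yl, yu]))):
                yu, improved = trial, True
        if not improved:
            break
    return theta, yu
```

Real transductive SVMs also balance the class ratio on the unlabeled set and anneal the weight of the unlabeled hinge terms; both are omitted here for brevity.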

SLIDE 34

Deep Semi-supervised Learning

SLIDE 35

Semi-supervised Learning

  • Motivation
  • Problem formulation
  • Consistency regularization
  • Entropy-based method
  • Pseudo-labeling
SLIDE 36

Stochastic Perturbations / Π-Model

  • Realistic perturbations $x \rightarrow \hat{x}$ of data points $x \in D_{UL}$ should not significantly change the output of $h_\theta(x)$
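This consistency idea turns directly into a loss on unlabeled data: perturb each input twice and penalize disagreement between the two predictions. A minimal sketch using Gaussian input noise (names illustrative; real Π-model implementations perturb with dropout and data augmentation):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(predict, X_unlabeled, noise_scale=0.1, seed=0):
    """Pi-model style loss: mean squared difference between predictions
    on two stochastically perturbed copies of each unlabeled input."""
    rng = np.random.default_rng(seed)
    X1 = X_unlabeled + noise_scale * rng.standard_normal(X_unlabeled.shape)
    X2 = X_unlabeled + noise_scale * rng.standard_normal(X_unlabeled.shape)
    return np.mean((predict(X1) - predict(X2)) ** 2)
```

This term needs no labels, so it can be added to the supervised loss with some weight and averaged over the whole unlabeled set.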

SLIDE 37

Temporal Ensembling

SLIDE 38

Mean Teacher

SLIDE 39

Virtual Adversarial Training

SLIDE 40

Semi-supervised Learning

  • Motivation
  • Problem formulation
  • Consistency regularization
  • Entropy-based method
  • Pseudo-labeling
SLIDE 41

EntMin (Entropy Minimization)

  • Encourages more confident predictions on unlabeled data by penalizing the entropy of the model's predicted class distribution
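Concretely, the extra loss term is the average entropy of the predicted class distribution on unlabeled examples (a sketch; the helper name is illustrative):

```python
import numpy as np

def entropy_loss(probs, eps=1e-12):
    """Average prediction entropy; adding this to the training loss on
    unlabeled data pushes predictions toward low-entropy (confident) ones."""
    return float(-np.mean(np.sum(probs * np.log(probs + eps), axis=1)))
```

A uniform prediction has maximal entropy, a one-hot prediction has (near) zero, so gradient descent on this term sharpens the model's outputs.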
SLIDE 42

Semi-supervised Learning

  • Motivation
  • Problem formulation
  • Consistency regularization
  • Entropy-based method
  • Pseudo-labeling
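Pseudo-labeling itself needs little more than a confidence threshold: treat sufficiently confident predictions on unlabeled data as if they were ground-truth labels and add them to the training set. A minimal sketch (names and threshold illustrative):

```python
import numpy as np

def pseudo_label(probs, threshold=0.95):
    """Select unlabeled examples whose top predicted probability clears the
    threshold; return their indices and the hard (argmax) labels to train on."""
    confidence = probs.max(axis=1)
    keep = np.where(confidence >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)
```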
SLIDE 43

Comparison

SLIDE 44

Varying number of labels

SLIDE 45

Class mismatch between labeled and unlabeled datasets hurts performance

SLIDE 46

Lessons

  • Standardized architecture + equal budget for tuning hyperparameters
  • Unlabeled data from a different class distribution is not that useful
  • Most methods don't work well in the very low labeled-data regime
  • Transferring pre-trained ImageNet features produces a lower error rate
  • Conclusions are based on small datasets, though