
slide-1
SLIDE 1

Visual Parsing with Weak Supervision

Jia Xu

Department of Computer Sciences University of Wisconsin-Madison

2015-07-30

slide-2
SLIDE 2

Introduction Object Segmentation Scene Parsing Video Parsing Discussion

Research Goal

Teach Computer to See at/beyond Human Level
  • Interpret/summarize/organize visual data on the Internet
  • Help the disabled population (e.g., the blind)

slide-4
SLIDE 4

Visual Parsing

Fundamental Task
  • Semantically parse every pixel in images and videos
  • First step towards high-level applications: self-driving cars, unmanned aerial vehicles, wearable glasses

slide-5
SLIDE 5

Visual Parsing

Fundamental Task: Turning Visual Data Into Knowledge

Every day: > 3.5 million, > 300 million, > 150,000 hours

  • Never Ending Language Learning (Mitchell et al., 2009)
  • Never Ending Image Learner (Chen et al., 2013)

slide-7
SLIDE 7

Challenges

Modern Image Dataset

Annotation | Dataset size
Noisy Label | > 6 Billion
Image-Level | > 14 Million
Bounding Box | ∼ 1 Million
Segmentation | ∼ 5000

[Figure: information versus log(size) for each annotation type]

Far fewer segmentations are annotated for videos!

slide-11
SLIDE 11

Motivation

Bottleneck of Fully Supervised Methods
  • Full annotation is expensive to collect and limited in size

Why Weakly Supervised Learning
  • Weak supervision is easier to obtain, e.g., gaze
  • Large datasets with side/weak annotations are readily available: metadata, tags, text
  • Visual data presents the physical world: shape, geometry, context

slide-12
SLIDE 12

My Thesis Research

  • How can we utilize weakly labeled data effectively for the visual parsing task?
  • When a human enters the visual parsing loop, how can we minimize user effort while still achieving satisfactory parsing results?

slide-13
SLIDE 13

Roadmap

Chapter | Parsing Task | Weak Supervision | Publication
Ch. 2 | Object Segmentation | User Indication | CVPR 2013
Ch. 3 | Scene Parsing | Image-level Tags | CVPR 2014
Ch. 4 | Scene Parsing | Image-level Tags, Bounding Boxes, Partial Labels | CVPR 2015a
Ch. 5 | Video Segmentation | Side Knowledge | ICCV 2013
Ch. 6 | Video Summarization | Human Gaze | CVPR 2015b

slide-15
SLIDE 15

Object Segmentation

slide-18
SLIDE 18

Interactive Object Segmentation

Main Challenges
  1. Semantic gap: what is an object?
  2. Ambiguity of user intention: which object do you want?

A few user scribbles can make segmentation much easier!

slide-19
SLIDE 19

Related work

  • Region-based: Graphcut (Boykov and Jolly, 2001), Grabcut (Rother et al., 2004), Random Walks (Grady, 2006), Geodesic Shortest Path (Bai and Sapiro, 2009), Geodesic Star Convexity (Gulshan et al., 2010)
  • Edge-based: Intelligent Scissors (Mortensen and Barrett, 1998), LabelMe (Russell et al., 2008)

[Figure: example results from GraphCut, GrabCut, Intelligent Scissors, and LabelMe]

slide-20
SLIDE 20

Our Ideas (EulerSeg)

Objective: model a topological constraint while concurrently finding one or more minimum-energy closed contours which satisfy:
  • Foreground seeds must be “inside”
  • Background seeds must be “outside”

[X., Collins, Singh, CVPR 2013]

slide-23
SLIDE 23

Our Ideas (EulerSeg)

Main Advantages
  1. Basic primitives are edgelets (little dependence on # of pixels)
  2. Dense strokes are not needed to learn an appearance model. Results do NOT vary with seed location (interaction constraints are completely geometric in form)
  3. Incorporating connectedness priors and specifying # of closures are easy (Euler characteristic)

slide-25
SLIDE 25

Graph Representation

  • x: face indicator vector
  • y: edge indicator vector
  • z: vertex indicator vector
  • w: indicator vector for foreground boundary edges

Internal edges ($y_i = w_i = 0$) are black, while boundary edges ($y_i = w_i = 1$) are red.

slide-27
SLIDE 27

Discrete Calculus

[Figure: vertex, edge, and face cells; coherent and anti-coherent cell orientation]

Vertex-edge Incidence Matrix: $A_1 = A$, $A_2 = A_1 ./ D$

$$A_{v_k, e_{ij}} = \begin{cases} 1 & k = i, j \\ 0 & \text{otherwise} \end{cases}$$

[Grady and Polimeni, 2010]

slide-28
SLIDE 28

Discrete Calculus

[Figure: vertex, edge, and face cells; coherent and anti-coherent cell orientation]

Edge-face Incidence Matrix: $C_1 = C$, $C_2 = |C|$

$$C_{e,f} = \begin{cases} +1 & e \text{ is incident to } f \text{ and coherently oriented} \\ -1 & e \text{ is incident to } f \text{ and anti-coherently oriented} \\ 0 & \text{otherwise} \end{cases}$$

[Grady and Polimeni, 2010]

slide-29
SLIDE 29

An Example

[Figure: a planar complex with faces f1 to f3, edges e1 to e7, and vertices v1 to v5]

$C$ is the $7 \times 3$ edge-face incidence matrix of this complex (nine nonzero entries, each $\pm 1$). The face indicator $x$ selects two of the three faces, and $b = Cx$ has four nonzero entries ($+1, -1, +1, -1$).

slide-34
SLIDE 34

Euler Characteristic

[Figure: the example complex with faces f1 to f3, edges e1 to e7, and vertices v1 to v5]

  • Number of faces ($\mathbf{1}^T x$): 2
  • Number of nodes ($\mathbf{1}^T z$): 4
  • Number of edges ($\mathbf{1}^T y$): 5
  • Number of connected components ($\mathbf{1}^T x + \mathbf{1}^T z - \mathbf{1}^T y$): 1
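
The counts above can be reproduced on a small mesh. The sketch below is only an illustration: the mesh (three triangles over five vertices), the names `FACES`, `EDGES`, `edge_face_incidence`, and `indicators`, and the low-to-high edge orientation are assumptions, not the slide's exact figure. It builds the edge-face incidence matrix C, computes b = Cx, and derives w, y, z using the relations w = |Cx| and 2y = w + |C|x.

```python
# Hypothetical planar complex (not the slide's exact figure): 5 vertices,
# 3 triangular faces given as consistently oriented vertex cycles, 7 edges.
FACES = [(0, 1, 2), (1, 3, 2), (2, 3, 4)]
N_VERTICES = 5
EDGES = sorted({tuple(sorted((f[i], f[(i + 1) % 3]))) for f in FACES for i in range(3)})


def edge_face_incidence(faces, edges):
    """C[e][f] is +1 / -1 when edge e bounds face f coherently / anti-coherently."""
    C = [[0] * len(faces) for _ in edges]
    for j, face in enumerate(faces):
        for i in range(3):
            a, b = face[i], face[(i + 1) % 3]
            e = edges.index((min(a, b), max(a, b)))
            # Assumed convention: every edge points from its lower to its higher vertex.
            C[e][j] = 1 if a < b else -1
    return C


def indicators(C, x, edges, n_vertices):
    b = [sum(c * xi for c, xi in zip(row, x)) for row in C]            # b = C x
    w = [abs(v) for v in b]                                              # boundary edges
    cover = [sum(abs(c) * xi for c, xi in zip(row, x)) for row in C]     # |C| x
    y = [(wi + ci) // 2 for wi, ci in zip(w, cover)]                     # 2y = w + |C| x
    z = [0] * n_vertices
    for (u, v), yi in zip(edges, y):
        if yi:
            z[u] = z[v] = 1   # a vertex is active when it touches an active edge
    return b, w, y, z


C = edge_face_incidence(FACES, EDGES)
x = [1, 1, 0]                 # select the first two faces
b, w, y, z = indicators(C, x, EDGES, N_VERTICES)
components = sum(x) + sum(z) - sum(y)
```

On this mesh the shared edge of the two selected faces cancels in b, so only the four outer edges are marked as boundary, and the counts are 2 faces, 4 vertices, 5 edges, and 1 connected component.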

slide-35
SLIDE 35

Problem Formulation

Optimization Model

$$
\begin{aligned}
\min_{w,x,y,z}\quad & f(w) \\
\text{s.t.}\quad & w = |C_1 x|, \quad 2y = w + C_2 x, \\
& A_2 y \le z \le A_1 y, \\
& \mathbf{1}^T x + \mathbf{1}^T z - \mathbf{1}^T y = n, \\
& x_1 \le x \le \mathbf{1} - x_0, \\
& w_i, x_j, y_k, z_l \in \{0, 1\}.
\end{aligned}
$$

slide-37
SLIDE 37

Ratio Objective

[Figure: input image and three candidate solutions]

Quantity | Solution 1 | Solution 2 | Solution 3
$N^T w$ | 38.48 | 164.77 | 389.61
$D^T w$ | 52 | 288 | 865
$N^T w / D^T w$ | 0.7400 | 0.5721 | 0.4504

slide-38
SLIDE 38

Problem Formulation

Optimization Model

$$
\begin{aligned}
\min_{w,x,y,z}\quad & \frac{N^T w}{D^T w} \\
\text{s.t.}\quad & w = |C_1 x|, \quad 2y = w + C_2 x, \\
& A_2 y \le z \le A_1 y, \\
& \mathbf{1}^T x + \mathbf{1}^T z - \mathbf{1}^T y = n, \\
& x_1 \le x \le \mathbf{1} - x_0, \\
& w_i, x_j, y_k, z_l \in \{0, 1\}.
\end{aligned}
$$

slide-40
SLIDE 40

Minimizing a Ratio Cost

  • Solved by minimizing $\psi(t, w) = (N - tD)^T w$ over feasible $w$ for a sequence of chosen values of $t$
  • With an initial finite bounding interval $[t_l, t_u]$
  • Pick $t_0 = \frac{t_l + t_u}{2}$, and let $\bar{w} = \arg\min_w \psi(t_0, w)$
  • $\psi(t_0, \bar{w}) = 0$: $N^T \bar{w} / D^T \bar{w} = t_0$, terminate with solution $t_0$
  • $\psi(t_0, \bar{w}) < 0$: $N^T \bar{w} / D^T \bar{w} < t_0$, $t_u \leftarrow N^T \bar{w} / D^T \bar{w}$
  • $\psi(t_0, \bar{w}) > 0$: $N^T \bar{w} / D^T \bar{w} > t_0$, $t_l \leftarrow t_0$
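
The interval-halving scheme above can be sketched on a toy problem. In the sketch below, the feasible set is replaced by an explicit list of candidates, each summarized by its pair $(N^T w, D^T w)$ with $D^T w > 0$; in the actual method the inner minimization is an integer program, so the list, the function name `minimize_ratio`, and the tolerance are illustrative assumptions only.

```python
def minimize_ratio(candidates, t_lo, t_hi, tol=1e-9, max_iter=200):
    """Bisection on t for min (N^T w) / (D^T w) over a finite candidate list.

    candidates: list of (N^T w, D^T w) pairs with D^T w > 0 (a stand-in for
    the feasible set). Returns (index of best candidate, its ratio), or None
    if no candidate with ratio below the initial t_hi was found.
    """
    best = None
    for _ in range(max_iter):
        if t_hi - t_lo <= tol:
            break
        t0 = 0.5 * (t_lo + t_hi)
        # Inner problem: minimize psi(t0, w) = N^T w - t0 * D^T w.
        idx = min(range(len(candidates)),
                  key=lambda i: candidates[i][0] - t0 * candidates[i][1])
        num, den = candidates[idx]
        psi = num - t0 * den
        if psi == 0:
            return idx, t0
        if psi < 0:
            t_hi = num / den      # the minimizer's ratio is a tighter upper bound
            best = idx
        else:
            t_lo = t0             # no candidate has ratio below t0
    return None if best is None else (best, t_hi)


# The three (N^T w, D^T w) pairs from the ratio-objective slide.
solutions = [(38.48, 52.0), (164.77, 288.0), (389.61, 865.0)]
best_index, best_ratio = minimize_ratio(solutions, 0.0, 1.0)
```

With the three pairs from the ratio-objective slide and the starting interval [0, 1], the search settles on the third solution, whose ratio is 389.61 / 865.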

slide-41
SLIDE 41

Qualitative Results

[Figure: columns show Original, Truth, BJ, SP, RW, GSCseq, EulerSeg, EulerSeg-0]

slide-42
SLIDE 42

Quantitative Evaluation

F-Measure

$$P = \frac{|A \cap T|}{|A|}, \quad R = \frac{|A \cap T|}{|T|}, \quad F = \frac{2PR}{P + R}$$

How much effort to reach F = 0.95 (using a robot user)?

Method | BJ | RW | SP | GSCseq | EulerSeg
User Scribbles | 5.51 | 6.48 | 4.54 | 2.30 | 2.06

Seeds tell MORE than link/cannot link

[Gulshan et al., 2010]
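
As a small illustration of the F-measure defined above, the sketch below treats the predicted region A and the ground truth T as sets of pixel indices. The function name `f_measure` and the convention of returning zeros when the sets do not overlap are assumptions made for this example.

```python
def f_measure(predicted, truth):
    """Return (precision, recall, F) for two collections of pixel indices."""
    a, t = set(predicted), set(truth)
    overlap = len(a & t)
    if overlap == 0:
        return 0.0, 0.0, 0.0      # assumed convention for disjoint or empty sets
    precision = overlap / len(a)
    recall = overlap / len(t)
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = f_measure({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
```

Here two of the four predicted pixels are correct and two of the six true pixels are recovered, so P = 1/2, R = 1/3, and F = 0.4.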

slide-45
SLIDE 45

Semantic Segmentation

Building Tree Boat Person Bad Object Labels

slide-46
SLIDE 46

Weakly Supervised Semantic Segmentation

Motivation
  • Annotation: presence of image classes
  • Tags readily available in online photo collections
  • Easier to obtain than segmentations

[Figure: an image tagged "sky, building, tree" and its regions labeled sky, building, and tree]

[X., Schwing, Urtasun, CVPR 2014]

slide-47
SLIDE 47

Cosegmentation

Concurrently segment common foreground objects from a set of images

[Collins, X., Grady, Singh, CVPR 2012]
[Mukherjee, Singh, X., Collins, ECCV 2012]
[Collins, Liu, X., Mukherjee, Singh, ECCV 2014]

slide-48
SLIDE 48

Latent Structured Prediction

Graphical Model
  • Presence/absence of a class: $y_i \in \{0, 1\}$
  • Semantic superpixel label: $h_j \in \{1, \ldots, C\}$
  • Image evidence: $x$

[Figure: graphical model over $y_1, \ldots, y_C$, $h_1, \ldots, h_N$, and $x, x_1, \ldots, x_N$, in two panels: learning/inference with tags, and inference without tags]

[X., Schwing, Urtasun, CVPR 2014]

slide-50
SLIDE 50

How About Other Forms of Weak Supervision

[Figure: tag, bounding box, and partial label annotations for an image containing Sky, Boat, Sea, Person]

Unified Model

$$
\begin{aligned}
\min_{W,H}\quad & \frac{1}{2}\mathrm{tr}(W^T W) + \lambda \sum_{p=1}^{n} \xi(W; x_p, h_p) \\
\text{s.t.}\quad & H \mathbf{1}_C = \mathbf{1}_n, \quad H \in \{0, 1\}^{n \times C}, \quad H \in \mathcal{S}
\end{aligned}
$$

[X., Schwing, Urtasun, CVPR, 2015]

slide-52
SLIDE 52

Max-Margin Objective

Denote
  • $X = [x_1^T, \cdots, x_p^T, \cdots, x_n^T] \in \mathbb{R}^{n \times d}$: feature matrix
  • $H = [h_1^T, \cdots, h_p^T, \cdots, h_n^T] \in \{0, 1\}^{n \times C}$: hidden label matrix
  • $W \in \mathbb{R}^{d \times C}$: feature weighting matrix

$$\min_{W,H} \; \frac{1}{2}\mathrm{tr}(W^T W) + \lambda \sum_{p=1}^{n} \sum_{c=1}^{C} \xi(w_c; x_p, h_p^c)$$

where

$$\xi(w_c; x_p, h_p^c) = \begin{cases} \max(0, 1 + w_c^T x_p), & h_p^c = 0 \\ \mu^c \max(0, 1 - w_c^T x_p), & h_p^c = 1 \end{cases}
\qquad
\mu^c = \frac{\sum_{p=1}^{n} \mathbb{1}(h_p^c = 0)}{\sum_{p=1}^{n} \mathbb{1}(h_p^c = 1)}$$

[Zhao et al., 2008, Zhao et al., 2009]
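
To make the class-balanced hinge loss concrete, the following sketch evaluates the objective for fixed W and H on a tiny hand-made example. The data, the function names, and the requirement that every class has at least one positive superpixel are assumptions of this illustration; it evaluates the objective only and does not perform the optimization.

```python
def class_weight(h_col):
    """mu^c: number of negatives divided by number of positives for one class."""
    negatives = sum(1 for h in h_col if h == 0)
    positives = sum(1 for h in h_col if h == 1)
    return negatives / positives   # assumes at least one positive per class


def xi(w_c, x_p, h, mu):
    """Hinge loss for one superpixel and one class."""
    score = sum(a * b for a, b in zip(w_c, x_p))
    if h == 0:
        return max(0.0, 1.0 + score)
    return mu * max(0.0, 1.0 - score)


def objective(W_cols, X, H, lam):
    """0.5 * tr(W^T W) + lam * sum over superpixels p and classes c of xi."""
    reg = 0.5 * sum(w * w for col in W_cols for w in col)
    loss = 0.0
    for c, w_c in enumerate(W_cols):
        h_col = [row[c] for row in H]
        mu = class_weight(h_col)
        loss += sum(xi(w_c, x_p, h, mu) for x_p, h in zip(X, h_col))
    return reg + lam * loss


# Hypothetical data: 3 superpixels, 2 features, 2 classes.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H = [[1, 0], [0, 1], [0, 1]]
W_cols = [[1.0, 0.0], [0.0, 1.0]]   # W stored column by column (one w_c per class)
value = objective(W_cols, X, H, lam=1.0)
```

For this example the regularizer contributes 1, class 0 contributes a loss of 3, and class 1 a loss of 1, so the objective evaluates to 5.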

slide-54
SLIDE 54

Supervision Space as Constraints

  • Unlabeled/Cosegmentation/Transductive: $\mathcal{S} = \emptyset$
  • Image-level tags: $\mathcal{S} = \{H \le BZ, \; B^T H \ge Z\}$
  • Bounding boxes: $\mathcal{S} = \{H \le \hat{B}\hat{Z}, \; \hat{B}^T H \ge \hat{Z}\}$
  • Semi-supervision: $\mathcal{S} = \{H_\Omega = \hat{H}_\Omega\}$

An Example (2 images, 5 superpixels (2+3), 3 classes)

Here $B \in \{0, 1\}^{5 \times 2}$, $Z \in \{0, 1\}^{2 \times 3}$, and $H \in \{0, 1\}^{5 \times 3}$ satisfy $H \le BZ$ and $B^T H \ge Z$.

[Example matrices: B has five ones, Z has four, H has five, BZ has ten, and the nonzero entries of $B^T H$ are 1, 1, 1, 2]
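
The image-level tag constraints can be checked mechanically. The sketch below uses hypothetical matrices with the same shapes as the example (2 images, 2+3 superpixels, 3 classes); the specific entries of `B`, `Z`, and `H` are assumptions chosen for illustration, not the values shown on the slide. `H <= BZ` says a superpixel may only take a class tagged on its image, and `B^T H >= Z` says every tagged class must be used by at least one superpixel of that image.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(row) for row in zip(*A)]


def leq(A, B):
    """Element-wise A <= B for two matrices of equal shape."""
    return all(a <= b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def satisfies_tags(H, B, Z):
    """Check H <= BZ and B^T H >= Z."""
    return leq(H, matmul(B, Z)) and leq(Z, matmul(transpose(B), H))


# Hypothetical example: superpixels 0-1 belong to image 0, superpixels 2-4 to image 1.
B = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
Z = [[1, 1, 0], [0, 1, 1]]                      # image 0: classes 0,1; image 1: classes 1,2
H = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
H_bad = [[0, 0, 1]] + H[1:]                     # superpixel 0 takes an untagged class

ok = satisfies_tags(H, B, Z)
bad = satisfies_tags(H_bad, B, Z)
```

In this example `ok` is true, while `H_bad` fails because class 2 is not among the tags of image 0.
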

slide-57
SLIDE 57

Optimization Model

$$
\begin{aligned}
\min_{W,H}\quad & \frac{1}{2}\mathrm{tr}(W^T W) + \lambda \sum_{p=1}^{n} \xi(W; x_p, h_p) \\
\text{s.t.}\quad & H \mathbf{1}_C = \mathbf{1}_n, \quad H \in \{0, 1\}^{n \times C}, \quad H \in \mathcal{S}
\end{aligned}
$$

Observations
  • Challenge: non-convex mixed integer programming
  • The optimization problem is bi-convex, i.e., it is convex w.r.t. W if H is fixed, and convex w.r.t. H if W is fixed
  • Constraints are linear and they only involve the super-pixel assignment matrix H

slide-60
SLIDE 60

Learning Algorithm

Alternating Between
  • Fix H, solve for W independently for each class (1-vs-all linear SVM)
  • Fix W, infer super-pixel labels H in parallel w.r.t. images (small LP instances)

Inference

$$\max_{H} \; \mathrm{tr}((XW)^T H) \quad \text{s.t.} \quad H \mathbf{1}_C = \mathbf{1}_n, \; H \in \{0, 1\}^{n \times C}, \; H \in \mathcal{S}$$

Proposition: Fixing W and solving for H using a linear program gives the integral optimal solution.
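
A simplified view of the inference step: when only the one-label-per-superpixel constraint and the upper bound $H \le BZ$ are enforced, the objective $\mathrm{tr}((XW)^T H)$ decomposes over superpixels, and each one simply takes its highest-scoring allowed class. The sketch below implements only this reduced case; it deliberately omits the lower-bound constraint $B^T H \ge Z$, which couples superpixels and is why the slides use a linear program. The function name and the toy scores are assumptions.

```python
def infer_labels(scores, allowed):
    """Pick, for each superpixel, the allowed class with the highest score.

    scores[p][c]  : entry (p, c) of XW
    allowed[p][c] : entry (p, c) of BZ, 1 if class c is tagged on p's image
    Returns the one-hot label matrix H. Assumes every row has an allowed class.
    """
    H = []
    for s_row, a_row in zip(scores, allowed):
        candidates = [c for c, a in enumerate(a_row) if a]
        best = max(candidates, key=lambda c: s_row[c])
        H.append([1 if c == best else 0 for c in range(len(s_row))])
    return H


# Hypothetical scores for 2 superpixels of one image tagged with classes 0 and 1.
scores = [[0.2, 0.9, 0.5], [0.1, 0.3, 0.8]]
allowed = [[1, 1, 0], [1, 1, 0]]
H = infer_labels(scores, allowed)
```

The second superpixel scores highest on class 2, but that class is not tagged on its image, so it falls back to class 1. Note that in this toy result class 0 is never used, which is exactly the situation the omitted $B^T H \ge Z$ constraint rules out.
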
slide-61
SLIDE 61

Theoretical Guarantee

Proposition: Fixing W and solving for H using a linear program gives the integral optimal solution.

Proof. (Sketch) The main idea of our proof is to show that our coefficient matrix is totally unimodular. By Grady 2010: if A is totally unimodular and b is integral, then linear programs of the form $\{\min c^T x \mid Ax = b, x \ge 0\}$ have integral optima, for any c. Hence, the LP relaxation gives the optimal integral solution.

slide-63
SLIDE 63

Computation Efficiency

Model Nature
  • Decomposable
  • Parallelizable
  • Theoretical guarantee of relaxation quality

Running time
  • Orders of magnitude faster than the state of the art (20 min vs. 24 hours)
  • 10 ms to test one image

slide-64
SLIDE 64

Experimental Evaluation

Datasets
  • SIFT-Flow (a.k.a. LabelMe): 2688 images, 33 classes
  • MSRC: 591 images, 21 classes

Accuracy Metric
  • Per-pixel: the number of correctly classified pixels divided by the total number of pixels to be classified
  • Per-class: the average of the accuracies of all the classes

slide-65
SLIDE 65

Comparison to State-of-the-art on Sift-Flow

Method | Supervision | Per-class | Per-pixel
Liu et al., 2011 (PAMI) | full | 24 | 76.7
Farabet et al., 2012 (ICML) | full | 29.5 | 78.5
Farabet et al., 2012 (ICML) balanced | full | 46.0 | 74.2
Eigen et al., 2012 (CVPR) | full | 32.5 | 77.1
Singh et al., 2013 (CVPR) | full | 33.8 | 79.2
Tighe et al., 2013 (IJCV) | full | 30.1 | 77.0
Tighe et al., 2014 (CVPR) | full | 39.3 | 78.6
Yang et al., 2014 (CVPR) | full | 48.7 | 79.8
Vezhnevets et al., 2011 (ICCV) | weak (tags) | 14 | N/A
Vezhnevets et al., 2012 (CVPR) | weak (tags) | 22 | 51
Xu et al., 2014 (CVPR) | weak (tags) | 27.9 | N/A
Ours (1-vs-all) | weak (tags) | 32.0 | 64.4
Ours (ILT) | weak (tags) | 35.0 | 65.0
Ours (1-vs-all + transductive) | weak (tags) | 40.0 | 59.0
Ours (ILT + transductive) | weak (tags) | 41.4 | 62.7

slide-66
SLIDE 66

Comparison to State-of-the-art on MSRC

Method | Supervision | Per-class | Per-pixel
Shotton et al., 2008 (ECCV) | full | 67 | 72
Yao et al., 2012 (CVPR) | full | 79 | 86
Vezhnevets et al., 2011 (ICCV) | weak (tags) | 67 | 67
Liu et al., 2012 (TMM) | weak (tags) | N/A | 71
Ours | weak (tags) | 73 | 70

slide-67
SLIDE 67

Sample Results

[Figure: input, ground truth, and our result for sample images; legend: unlabeled, sky, mountain, road, tree, car, sign, person, field, building]

slide-68
SLIDE 68

Sample Results (continued)

[Figure: input, ground truth, and our result for further sample images; legend: unlabeled, sky, mountain, road, tree, car, sign, person, field, building]

slide-70
SLIDE 70

Other Forms of Weak Supervision

Semi-supervision

[Figure: per-class accuracy (%) and per-pixel accuracy (%) versus superpixel label sample ratio (0.1 to 0.5)]

Bounding Box

[Figure: per-class accuracy (%) and per-pixel accuracy (%) versus box sample ratio (0.1 to 0.5)]

slide-72
SLIDE 72

Online Video Segmentation

  • Background subspace is modeled on a Grassmannian manifold with online updating along the geodesic
  • Spatially contiguous and structured foreground is modeled via group sparsity

[Figure: input, background, and foreground]

[X., Ithapu, Mukherjee, Rehg, Singh, ICCV 2013]

slide-73
SLIDE 73

First Person Vision

Motivation
  • Life-logging with wearable cameras: SenseCam, GoPro, Google Glass
  • Memory aid
  • Gaze provides a form of weak supervision: a window into the mind

slide-75
SLIDE 75

Gaze-enabled Egocentric Video Summarization

[Figure: video summarization along a timeline from 1:00PM to 5:00PM]

What makes a good summary?
  • Relevance
  • Diversity
  • Compactness
  • Personalization

[X., Mukherjee, Li, Warner, Rehg, Singh, CVPR, 2015]

slide-78
SLIDE 78

Relevance and Diversity Measurement

Mutual Information

$$M(V \setminus S; S) = H(V \setminus S) - H(V \setminus S \mid S) = H(V \setminus S) + H(S) - H(V)$$

Entropy

$$H(S) = \frac{1 + \log(2\pi)}{2} |S| + \frac{1}{2} \log(\det(L_S))$$

Maximizing

$$M(S) = \frac{1}{2} \log(\det(L_{V \setminus S})) + \frac{1}{2} \log(\det(L_S))$$

[Krause et al., 2008]

slide-79
SLIDE 79

Relation to Determinantal Point Process

  • Positive semidefinite kernel matrix L indexed by elements of V: $L_{ij} = \frac{v_i^T}{\|v_i\|} \frac{v_j}{\|v_j\|}$
  • For every $S \subseteq V$, we define a diversity score $D(S) = \log(\det(L_S))$

[Kulesza and Taskar, 2012] (Acknowledgement to Jerry :)

slide-81
SLIDE 81

Gaze in Video Summarization

[Figure: a sequence of fixations and saccades with κ values 0.91, 0.85, 0.5, 0.89, 0.81, a threshold, and the resulting subshot 1 and subshot 2]

  • Better temporal segmentation: egocentric video is continuous, but gaze is discrete
  • Personalization: attention measurement from gaze fixations, $I(S) = \sum_{i \in S} c_i$

slide-83
SLIDE 83

Partition Matroid Constraint

Motivation
  • Compactness: cardinality or knapsack constraint?
  • High level supervision: timeline

Partition Matroid Construction
  • Partition the video into b disjoint blocks $P_1, P_2, \cdots, P_b$
  • Limit associated with each block: $\mathcal{I} = \{A : |A \cap P_m| \le f_m, \; m = 1, 2, \cdots, b\}$

[Bilmes, 2013]

slide-86
SLIDE 86

Submodular Formulation

$$\max_{S} \; F(S) = M(S) + \lambda I(S) \quad \text{s.t.} \quad S \in \mathcal{I}$$

Corollary: F(S) is submodular.

Proposition: Greedy local search achieves a $\frac{1}{4}$-approximation factor for our constrained submodular maximization problem.

[Lee et al., 2010] [Filmus and Ward, 2012]
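
The pieces of the formulation can be combined in a short sketch: a cosine-similarity kernel L, the objective $F(S) = \frac{1}{2}\log\det(L_{V \setminus S}) + \frac{1}{2}\log\det(L_S) + \lambda \sum_{i \in S} c_i$, and a partition matroid with per-block limits. The selection loop below is a plain greedy procedure used for illustration only; it is not the local search analyzed in the cited papers and carries no approximation guarantee here. The feature vectors, attention scores `c`, blocks, and limits are all hypothetical, and the vectors are chosen linearly independent so that every submatrix is positive definite.

```python
import math


def cosine_kernel(vectors):
    norms = [math.sqrt(sum(a * a for a in v)) for v in vectors]
    return [[sum(a * b for a, b in zip(u, v)) / (nu * nv)
             for v, nv in zip(vectors, norms)]
            for u, nu in zip(vectors, norms)]


def logdet(M):
    """log det of a symmetric positive definite matrix (Cholesky); 0.0 if empty."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(s)
                total += 2.0 * math.log(L[i][j])
            else:
                L[i][j] = s / L[j][j]
    return total


def submatrix(K, idx):
    return [[K[i][j] for j in idx] for i in idx]


def F(S, K, c, lam):
    chosen = sorted(S)
    rest = [i for i in range(len(K)) if i not in S]
    m = 0.5 * logdet(submatrix(K, rest)) + 0.5 * logdet(submatrix(K, chosen))
    return m + lam * sum(c[i] for i in chosen)


def greedy(K, c, lam, blocks, limits):
    """Add the element with the largest improvement while block limits allow."""
    S = set()
    while True:
        best, best_val = None, F(S, K, c, lam)
        for block, limit in zip(blocks, limits):
            if len(S & set(block)) >= limit:
                continue          # this block of the partition is already full
            for e in block:
                if e not in S:
                    val = F(S | {e}, K, c, lam)
                    if val > best_val:
                        best, best_val = e, val
        if best is None:
            return S
        S.add(best)


# Hypothetical subshots: 0 and 1 look alike, 2 and 3 look alike.
vectors = [(1, 0, 0, 0), (1, 1, 0, 0), (0, 0, 1, 0), (0, 0, 1, 1)]
c = [0.9, 0.2, 0.3, 0.8]                  # hypothetical attention scores
K = cosine_kernel(vectors)
summary = greedy(K, c, lam=1.0, blocks=[[0, 1], [2, 3]], limits=[1, 1])
```

With one subshot allowed per block, the loop first picks subshot 0 (highest attention), then subshot 3 from the other block.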

slide-87
SLIDE 87

Dataset Collection

  • 5 subjects recorded their daily lives
  • 21 videos with gaze
  • 15 hours in total

Annotation: subjects group subshots into events.

slide-89
SLIDE 89

Systematic Evaluation

Evaluation Metric

$$P = \frac{|A \cap T|}{|A|}, \quad R = \frac{|A \cap T|}{|T|}, \quad F = \frac{2PR}{P + R}$$

F-measure on GTEA-GAZE+

Method | uniform | kmeans | uniform(gaze) | kmeans(gaze) | ours
F-measure | 0.161 | 0.215 ± 0.016 | 0.526 | 0.475 ± 0.026 | 0.621

F-measure on Our New Dataset

Method | uniform | kmeans | uniform(gaze) | kmeans(gaze) | ours
F-measure | 0.080 | 0.095 ± 0.030 | 0.476 | 0.509 ± 0.025 | 0.585

slide-90
SLIDE 90

Qualitative Result

[Figure: summaries produced by uniform, k-means, uniform (our subshots), k-means (our subshots), and ours]

Results from GTEA-gaze+ pizza preparation video.

slide-91
SLIDE 91

Qualitative Result

[Figure: summaries produced by uniform, k-means, uniform (our subshots), k-means (our subshots), and ours]

Results from our new dataset: our subject mixes a shake, drinks it, washes his cup, plays chess and texts a friend.

slide-92
SLIDE 92

Qualitative Result

[Figure: summaries produced by uniform, k-means, uniform (our subshots), k-means (our subshots), and ours]

Results from our new dataset: our subject cooks chicken and has a conversation with his roommate.

slide-97
SLIDE 97

Summary

Thesis Contribution
  • An efficient approach for interactive segmentation while minimizing human effort (Ch. 2)
  • A latent graphical model for semantic segmentation using only image level tags (Ch. 3)
  • A unified model for semantic segmentation with various forms of weak supervision (Ch. 4)
  • An online foreground/background video segmentation using Grassmannian subspace learning (Ch. 5)
  • A submodular summarization framework for first person videos (Ch. 6)

slide-98
SLIDE 98

Future: Joint Visual and Textual Parsing

[Figure: graphical model over $y_1, \ldots, y_C$, $h_1, \ldots, h_N$, and $x, x_1, \ldots, x_N$]

  • Enhance graphical model with richer prior knowledge: geometry (Hoiem et al., 2007), co-occurrence, etc.
  • Other forms of supervision: Air Quality Index (AQI)
  • Tackle noisy tags
  • Extend to videos

slide-99
SLIDE 99

Future: Egocentric/Robotic Vision

  • Daily life logging / memory aid
  • Predictive diagnosis for disease
  • First-person vision for robotics
  • Help the blind to sense the visual world

slide-100
SLIDE 100

Acknowledgement

Thesis Committee: Vikas Singh (advisor), Chuck Dyer, Jerry Zhu, Jude Shavlik, Mark Craven

Funding: UW-Epic RAship, NSF RI 1116584, NVIDIA Hardware Gift, Adobe Gift

Collaborators
  • Maxwell Collins (UW-Madison)
  • Chuck Dyer (UW-Madison)
  • Leo Grady (Heartflow)
  • Vamsi Ithapu (UW-Madison)
  • Hyunwoo Kim (UW-Madison)
  • Yin Li (Georgia Tech)
  • Zhe Lin (Adobe Research)
  • Ji Liu (URochester)
  • Lopa Mukherjee (UW-Whitewater)
  • James M. Rehg (Georgia Tech)
  • Alexander Schwing (UToronto)
  • Xiaohui Shen (Adobe Research)
  • Vikas Singh (UW-Madison)
  • Raquel Urtasun (UToronto)
  • Baba Vemuri (UFlorida)
  • Jamieson Warner (UW-Madison)
  • Jerry Zhu (UW-Madison)