
SLIDE 1

Multi-Label Learning with Highly Incomplete Data via Collaborative Embedding

Yufei Han1, Guolei Sun2, Yun Shen1, Xiangliang Zhang2 1. Symantec Research Labs 2. King Abdullah University of Science and Technology

SLIDE 2

Outline

  • Introduction and Problem Definition
  • Our Methods
  • Experimental Results
SLIDE 3

Multi-Label Classification in Cyber Security

  • Multi-class classification: f(x) assigns exactly one class per instance, e.g. f(x) = apple, f(x) = banana, or f(x) = orange
  • Multi-label classification: f(x) can assign several classes at once, e.g. f(x) = {c1, c2, c3}

Keywords: multi-label classification, collaborative embedding, incomplete features

SLIDE 4

Existing popular solutions

  • Binary relevance
    – Constructs one classifier per label independently
    – Does not model label dependencies
  • Label power-set
    – Converts the task into multi-class classification over label subsets: for labels A, B the classes are {}, {A}, {B}, {A,B}
    – 2^n subsets: with 40 labels, 2^40 ≈ 1.1 × 10^12 classes
  • Classifier chains
    – Learn L binary classifiers by formatting the j-th training problem as (xi, y1, ..., yj−1) → yj ∈ {0, 1}
    – Only capture the dependency of yj on y1, ..., yj−1
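The binary relevance baseline above is simple enough to sketch concretely. The following is a minimal illustration, not any paper's implementation: one independent logistic-regression classifier per label column, trained by plain gradient descent (all function names and hyperparameters here are hypothetical choices).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def train_binary_relevance(X, Y, epochs=500, lr=0.1):
    """Fit one independent logistic-regression classifier per label column."""
    N, D = X.shape
    M = Y.shape[1]
    W = np.zeros((M, D))          # one weight vector per label
    for j in range(M):            # labels are treated in isolation:
        w = np.zeros(D)           # no label dependency is modeled
        for _ in range(epochs):
            p = sigmoid(X @ w)
            w -= lr * X.T @ (p - Y[:, j]) / N   # logistic-loss gradient step
        W[j] = w
    return W

def predict_labels(W, X):
    """Threshold each per-label probability at 0.5 independently."""
    return (sigmoid(X @ W.T) >= 0.5).astype(int)
```

Because each label is fit in isolation, the inner loop never sees the other columns of Y, which is exactly the weakness the slide points out.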

SLIDE 5

Use Case of Multi-label Classification

[Figure: machine-day instances with incomplete signature counts as features and incomplete labels; "?" marks missing entries]

  • Instances: machine days
  • Features: incomplete signature counts
  • Labels: incomplete

Training: train a prediction model for a given product

SLIDE 6

Our Problem: A Tale of Two Cities

  • Multi-label learning with incomplete feature values and weak labels
    – Training data X ∈ R^(N×D) (N instances with D features) is partially observed, with Ωi,j = 1 if Xi,j is observed, and Ωi,j = 0 otherwise
    – Label assignment Y ∈ {0, 1}^(N×M) (M is the label dimension) is a positive-unlabeled matrix:
      • Yi,j = 1 indicating that the corresponding instance Xi,: is positively labeled in the j-th label
      • Yi,j = 0 indicating that the entry is unobservable
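To make the two kinds of incompleteness concrete, here is a tiny simulated instance of the setting above (all sizes and sampling rates are hypothetical; in the real problem the ground truth X_true and Y_true are of course unknown):

```python
import numpy as np

rng = np.random.default_rng(42)
N, D, M = 6, 4, 3                      # tiny illustrative sizes

# Fully observed ground truth (never available in practice).
X_true = rng.standard_normal((N, D))
Y_true = (rng.random((N, M)) < 0.4).astype(int)

# Omega masks the feature matrix: 1 = observed, 0 = missing.
Omega = (rng.random((N, D)) < 0.7).astype(int)
X = np.where(Omega == 1, X_true, 0.0)  # missing entries carry no information

# Positive-unlabeled labels: only some true positives are revealed;
# Y[i, j] == 0 means "unobserved", NOT "negative".
reveal = (rng.random((N, M)) < 0.5).astype(int)
Y = Y_true * reveal
```

The key asymmetry is in the last step: a zero in Y is ambiguous (hidden positive or genuine negative), whereas a zero in Omega unambiguously marks a missing feature value.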

SLIDE 7

Our Problem: A Tale of Two Cities

Feature Matrix → Classification Model → Label Matrix

Corrupted / incomplete data:
  • Limited coverage of sensors
  • Privacy control
  • Failure of sensors
  • Partial responses

Weak supervision:
  • Semi-supervised information
  • Positive-unlabeled / partially observed supervision
  • Weak pairwise / triple-wise constraints

SLIDE 8

Existing Approaches

Methods            | Feature Values | Labels                | Transductive/Inductive
BiasMC (ICML'15)   | Complete       | Positive (weak)       | Both
WELL (AAAI'10)     | Complete       | Positive (weak)       | Transductive
LEML (ICML'14)     | Complete       | Positive and negative | Inductive
CoEmbed (AAAI'17)  | Complete       | Positive and negative | Transductive
MC-1 (NIPS'10)     | Missing        | Positive and negative | Transductive
DirtyIMC (NIPS'15) | Noisy          | Positive and negative | Both
Our study          | Missing        | Positive (weak)       | Both

SLIDE 9

Outline

  • Introduction and Problem Definition
  • Our Methods
  • Experimental Results
SLIDE 10

Collaborative Embedding: A Transfer Learning Approach

  • Incomplete feature matrix (signatures of security events): low-rank, LSE-based matrix factorization, X = U V^T
  • Partially observed label matrix (security event classes): cost-sensitive logistic matrix factorization, Y = W H^T (logit loss + R(W))
  • Shared embedding space: W H^T = φ(U V^T)

SLIDE 11

(Recap of the collaborative embedding overview slide.)

SLIDE 12

Feature Matrix Completion

  • Low-rank completion of the partially observed feature matrix:

    U*, V* = argmin_{U,V} ‖Ω_X ∗ (X − U V^T)‖²

  • U: projected features of the data instances
  • V: spanning basis defining the projection subspace
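The masked factorization objective above can be minimized in several ways; the slides do not specify the solver, so the following is only a plausible sketch using plain gradient descent on the factors U and V (function name, learning rate and epoch count are hypothetical):

```python
import numpy as np

def complete_lowrank(X, Omega, rank=2, lr=0.02, epochs=3000, lam=1e-3, seed=0):
    """Minimize || Omega * (X - U V^T) ||_F^2 + lam * (|U|^2 + |V|^2)
    by gradient descent on the factors U (N x rank) and V (D x rank)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    U = 0.1 * rng.standard_normal((N, rank))
    V = 0.1 * rng.standard_normal((D, rank))
    for _ in range(epochs):
        R = Omega * (U @ V.T - X)          # residual on observed entries only
        U_new = U - lr * (R @ V + lam * U)
        V_new = V - lr * (R.T @ U + lam * V)
        U, V = U_new, V_new
    return U, V
```

Only observed entries (Omega = 1) contribute to the residual, yet the low-rank structure propagates information to the missing entries, which is what makes U @ V.T a completion of X.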

SLIDE 13

Collaborative Embedding: A Transfer Learning Approach

(Recap of the collaborative embedding overview slide.)

SLIDE 14

Label Matrix Reconstruction

  • Cost-sensitive logistic matrix factorization on the positive-unlabeled class assignment matrix:

    W*, H* = argmin_{W,H} Σ_{i,j} Γi,j log(1 + e^((1−2Yi,j)(W H^T)i,j)) + λ(‖W‖² + ‖H‖²)

    with Γi,j = α for observed and positively labeled entries (Yi,j = 1), and Γi,j = 1 − α for unobserved, thus unlabeled, entries (Yi,j = 0)

  • Prediction: Y = I(W H^T)
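A toy sketch of this cost-sensitive objective can clarify how the asymmetric weights behave. This is an illustrative gradient-descent implementation under assumed hyperparameters (α, learning rate, rank), not the authors' solver:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def logistic_mf_pu(Y, rank=2, alpha=0.9, lam=0.01, lr=0.05, epochs=3000, seed=0):
    """Cost-sensitive logistic MF on a positive-unlabeled label matrix:
    entries with Y=1 (observed positives) get weight alpha; entries with
    Y=0 (unobserved, possibly positive) get the small weight 1 - alpha."""
    rng = np.random.default_rng(seed)
    N, M = Y.shape
    W = 0.1 * rng.standard_normal((N, rank))
    H = 0.1 * rng.standard_normal((M, rank))
    Gamma = np.where(Y == 1, alpha, 1.0 - alpha)   # cost weights
    S = 1.0 - 2.0 * Y                              # -1 for positives, +1 for unlabeled
    for _ in range(epochs):
        Z = W @ H.T
        G = Gamma * S * sigmoid(S * Z)   # d/dZ of Gamma * log(1 + exp(S * Z))
        W_new = W - lr * (G @ H + lam * W)
        H_new = H - lr * (G.T @ W + lam * H)
        W, H = W_new, H_new
    return W, H
```

Because unlabeled entries are down-weighted by 1 − α, a hidden positive that shares low-rank structure with observed positives is only weakly pushed toward a negative score, so the factorization can still recover it.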

SLIDE 15
ColEmbed: Collaborative Embedding

  • Collaborative embedding as a solution to learning with incomplete features and weak labels:
    – Feature completion
    – Label completion
    – Tolerance to residual error
    – Functional feature extraction

SLIDE 16
Upper Bound of Reconstruction Error

  • Provable reconstruction of the missing label entries
    – M, D: the number of labels and the dimensionality of the feature vectors
    – N: the number of training samples
    – t: the upper bound of the spectral norm of H
    – : the maximum L2-norm of the row vectors in X
  • The label reconstruction error is of the order of 1/(NM(1 − …))

SLIDE 17

ColEmbed-L

  • Linear collaborative embedding: f(X̂) = X̂ S^T
  • Flexible for both transductive and inductive settings
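The slides do not spell out how the linear map S is estimated; one plausible way to fit f(X̂) = X̂ S^T against target scores (e.g. the reconstructed label embedding) is ridge regression, which has a closed form. All names and the regularization choice below are illustrative assumptions:

```python
import numpy as np

def fit_linear_map(X_hat, T, lam=1e-2):
    """Ridge-regression estimate of S in f(X_hat) = X_hat @ S.T, mapping
    completed features X_hat (N x D) to target scores T (N x M).
    Closed form: S^T = (X^T X + lam * I)^{-1} X^T T."""
    D = X_hat.shape[1]
    St = np.linalg.solve(X_hat.T @ X_hat + lam * np.eye(D), X_hat.T @ T)
    return St.T                      # S has shape M x D

def predict_scores(X_new, S):
    """Inductive prediction for unseen instances: f(X) = X S^T."""
    return X_new @ S.T
```

The inductive flexibility comes from the last function: once S is learned, predictions for instances outside the training matrix only require a matrix product.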

SLIDE 18

ColEmbed-NL

  • Non-linear embedding: a linear combination of random feature expansions

Ali Rahimi and Ben Recht, Random Features for Large-Scale Kernel Machines, NIPS 2007
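The cited random-feature construction can be sketched in a few lines. This is a standard Rahimi-Recht random Fourier feature map for the RBF kernel, with hypothetical parameter defaults; the deck does not state which kernel or bandwidth ColEmbed-NL uses:

```python
import numpy as np

def random_fourier_features(X, n_features=500, gamma=0.5, seed=0):
    """Rahimi-Recht random Fourier features approximating the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2): z(x) = sqrt(2/n) * cos(x W^T + b),
    with W drawn from the kernel's spectral density N(0, 2 * gamma * I)."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    Wf = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(n_features, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ Wf.T + b)
```

A linear model on z(X) then approximates a kernel method, which is why "a linear combination of random feature expansions" yields a non-linear embedding at linear-model cost.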

SLIDE 19

ColEmbed-NL

(Recap of the previous slide: non-linear embedding via random feature expansion.)

SLIDE 20

Training Process

  • Stochastic gradient descent for large-scale matrix factorization
  • Non-linear case: …

SLIDE 21

Outline

  • Introduction and Problem Definition
  • Our Methods
  • Experimental Results
SLIDE 22

Empirical Study

  • The empirical study aims at answering the following questions:
    – Is it really helpful to reconstruct features and labels simultaneously?
    – Do transductive and inductive classification present consistently high precision?
    – Does the proposed method provide better classification than the state-of-the-art approaches?
    – Does the proposed method scale well?

SLIDE 23

Methods to Compare

  • Baseline approaches (some handling missing or noisy feature values, others requiring complete feature values):
    – BiasMC (transductive) and BiasMC-I (inductive), by PU-learning
    – LEML (cost-sensitive binomial loss), needs + and − labels
    – LEML (least-squares loss)
    – WELL, weak labels
    – CoEmbed, needs + and − labels
    – MC-1, needs + and − labels
    – DirtyIMC, needs + and − labels
  • For baselines requiring complete feature values, the incomplete feature matrix is first completed using the convex low-rank matrix completion approach, denoted MC-Convex

SLIDE 24

Evaluation Data Sets

  • Benchmark data sets:
    – Real-world IoT device event detection data
    – Public benchmark data

SLIDE 25

Feature Reconstruction

  • Lower errors in estimating the missing feature values, compared to the baseline method

SLIDE 26

Transductive Classification Accuracy

  • Higher classification accuracy than baseline methods
SLIDE 27

Inductive Classification Accuracy

  • Higher classification accuracy than baseline methods
SLIDE 28

On Real-world Security Data

  • Consistently better performance on classifying real-world security data, compared to the baseline methods
  • Evaluated in both transductive and inductive test modes

SLIDE 29

Efficiency Evaluation

  • Runtime in seconds; linear w.r.t. the number of instances
SLIDE 30

Takeaway

  • Collaboratively reconstructing missing feature values and learning missing labels is beneficial for both tasks.
  • Our proposed method is applicable in both transductive and inductive classification settings.
  • Our proposed method performs better than the state-of-the-art approaches.

SLIDE 32

Future Work

  • Learning with incomplete data streams
  • Deep neural nets as a more powerful functional mapping between features and labels
  • Structured feature / label missing patterns
  • Further extension to multi-task learning