Binary Factorization Models for Statistical Relational Learning


slide-1
SLIDE 1

Binary Factorization Models for Statistical Relational Learning

Guillaume Bouchard, Machine Learning for Services, Xerox Research Centre Europe

Collaborators: Beyza Ermis, Behrouz Behmardi, Cedric Archambeau, Dawei Yin, Ehsan Abbasnejad, Julien Perez, Shengbo Guo

slide-2
SLIDE 2

Motivating example: Human biological data (1)

Nine variables:

  • patient's internal (core) temperature
  • patient's surface temperature
  • oxygen saturation
  • blood pressure
  • stability of patient's surface temperature
  • stability of patient's core temperature
  • stability of patient's blood pressure
  • patient's perceived comfort
  • discharge decision

Data excerpt:

| Core temp | Surf temp | O2 sat | BP | Surf stbl | Core stbl | BP stbl | Comfort | Decision |
| mid | low | excellent | mid | stable | stable | stable | 15 | A |
| mid | high | excellent | high | stable | stable | stable | 10 | S |
| high | low | excellent | high | stable | stable | mod-stable | 10 | A |
| mid | low | good | high | stable | unstable | mod-stable | 15 | A |
| mid | mid | excellent | high | stable | stable | stable | 10 | A |
| high | low | good | mid | stable | stable | unstable | 15 | S |
| mid | low | excellent | high | stable | stable | mod-stable | 5 | S |
| high | mid | excellent | mid | unstable | unstable | stable | 10 | S |
| mid | high | good | mid | stable | stable | stable | 10 | S |
| mid | low | excellent | mid | unstable | stable | mod-stable | 10 | S |
| mid | mid | good | mid | stable | stable | stable | 15 | A |
| mid | low | good | high | stable | stable | mod-stable | 10 | A |
| high | high | excellent | high | unstable | stable | unstable | 15 | A |
| mid | high | good | mid | unstable | stable | mod-stable | 10 | A |
| mid | low | good | high | unstable | unstable | stable | 15 | S |

9 variables, 90 patients. Patient data, year 1993. Source: UCI repository. Objective: statistical modelling (tests, predictions, visualisation).

slide-3
SLIDE 3

Motivating example: Human biological data (2)

38M variables, 1092 individuals. Genetic data, year 2012. Source: 1000 Genomes Project.

slide-4
SLIDE 4

Example: Human biological data (3)

Biological data, year 2013. Source: KEGG/DBGET/LinkDB.

slide-5
SLIDE 5

Example: Social network (1)

[Figure: 18×18 "likes"/"dislikes" adjacency matrices among monks, reordered to reveal three groups; edge types between Monk i and Monk j: likes, does not like, influences]

Monk relationships, year 1969. Source: Sampson's monastery data, an 18×18×10 tensor.

slide-6
SLIDE 6

Example: Social network (2)

Bloggers community, year 2012. Source: LiveJournal online social network (http://snap.stanford.edu/data), a 4M×4M×300K tensor.

slide-7
SLIDE 7

Evolution of Machine Learning

  • Early age: one task → many models
  • Today: one task → one model
  • Wish: multiple tasks → one model

We need a generic data model!

slide-8
SLIDE 8

Example: Social network

[Figure: relational schema linking User, Item, Item feature, Tag and Comment entities through relations C1–C4]

Relational data, year 2012. Source: Flickr.

Several prediction tasks:

  • Item recommendation
  • Friend recommendation
  • Automatic tagging
  • Data cleaning
  • Predicting the comment type

slide-9
SLIDE 9

Statistical Relational Learning

Also called multi-relational learning. Goal: predict relationships.

  • Probabilistic relational models
  • Markov logic networks
  • Recently, distributed representations have shown great performance:
      – Collective matrix factorization [Singh & Gordon, 2008]
      – Tensor factorization [Sutskever et al., 2009; Nickel & Tresp, 2010]

slide-10
SLIDE 10

Why factorization?

Distributed representations: there is a continuous latent space (embedding space) that represents entities, and relative positions in this latent space capture intrinsic relations.

Rotational invariance: if a function f(M) of a matrix M is invariant under rotation (including permutation of the rows or columns), then f depends only on the spectrum of M.

slide-11
SLIDE 11

Statistical Relational AI & Factorization

[Figure: timeline from 1999 to 2015 relating three threads: probabilistic models, convex optimization, relational learning]

  • Pearson, Karl. "On lines and planes of closest fit to systems of points in space". 1901.
  • Tipping, Michael E., and Christopher M. Bishop. "Probabilistic principal component analysis". 1999.
  • Friedman, Nir, Lise Getoor, Daphne Koller, and Avi Pfeffer. "Learning probabilistic relational models". 1999.
  • Fazel, Maryam, Haitham Hindi, and Stephen P. Boyd. "A rank minimization heuristic with application to minimum order system approximation". 2001.
  • Richardson, Matthew, and Pedro Domingos. "Markov logic networks". 2006.
  • Salakhutdinov, Ruslan, and Andriy Mnih. "Probabilistic matrix factorization". 2008.
  • Singh, Ajit P., and Geoffrey J. Gordon. "Relational learning via collective matrix factorization". 2008.
  • Candès, Emmanuel J., and Benjamin Recht. "Exact matrix completion via convex optimization". 2009.
  • Wright, John, Arvind Ganesh, Shankar Rao, Yigang Peng, and Yi Ma. "Robust principal component analysis". 2009.
  • Sutskever, Ilya, Ruslan Salakhutdinov, and Josh Tenenbaum. "Modelling relational data using Bayesian clustered tensor factorization". 2009.
  • Nickel, Maximilian, et al. "A three-way model for collective learning on multi-relational data". 2011.
  • Tomioka et al. "Statistical performance of convex tensor decomposition". 2011.

slide-12
SLIDE 12

Outline

Noise models for matrix factorization

  • Missing data
  • Heteroscedastic noise
  • Binary data

Distributed models for statistical relational learning

  • Joint factorization = Multi-view learning
  • Collective Matrix Factorization (CMF)
  • Tensor Factorization (TF)
  • Collective Matrix and Tensor Factorization (CMTF)

Applications

  • Data analysis (print logs, textual data, gene expressions, …)
  • Recommender systems
  • Predictive queries in databases

Learning distributed representations

  • Scalable Binary Tensor Factorization
  • Convex methods
  • Bayesian methods
slide-13
SLIDE 13

Matrix Factorization with Missing Data

slide-14
SLIDE 14

Introduction to factor analysis

Y ∈ ℝ^{n×m} is a fully observed matrix. Each entry is a dot product between latent vectors, plus noise:

  y_ij = Σ_{r=1}^{R} u_ir v_jr + σ ε_ij

  • ε_ij is white Gaussian noise
  • U ∈ ℝ^{n×R} are the row latent variables
  • V ∈ ℝ^{m×R} are the column latent variables

Model fit by maximum likelihood:

  (Û, V̂) = argmin_{U,V} Σ_{i=1}^{n} Σ_{j=1}^{m} ( y_ij − Σ_{r=1}^{R} u_ir v_jr )²
         = argmin_{U,V} ‖Y − U Vᵀ‖_F²

[Figure: graphical model with latent vectors u_i (i = 1..n) and v_j (j = 1..m) generating the entries y_ij]

Closed-form solution: rank-R partial SVD.
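As a concrete illustration (mine, not from the slides), a minimal NumPy sketch of the closed-form fit via truncated SVD:

```python
import numpy as np

def factor_analysis_fit(Y, R):
    """Rank-R maximum-likelihood fit Y ~ U V^T via the truncated SVD."""
    A, s, Bt = np.linalg.svd(Y, full_matrices=False)
    U = A[:, :R] * np.sqrt(s[:R])      # row embeddings, n x R
    V = Bt[:R, :].T * np.sqrt(s[:R])   # column embeddings, m x R
    return U, V

Y = np.random.randn(100, 40)
U, V = factor_analysis_fit(Y, R=5)
print(np.linalg.norm(Y - U @ V.T))     # residual of the best rank-5 fit
```

Splitting the singular values evenly between U and V is one convention among many; any invertible R×R mixing of the factors gives the same fit, which is exactly the rotational invariance mentioned earlier.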

slide-15
SLIDE 15

Factor analysis with missing elements

The matrix Y ∈ ℝ^{n×m} is only observed at positions (i, j) ∈ Ω. We define the binary weight matrix W ∈ {0,1}^{n×m}, with w_ij = 1 iff (i, j) ∈ Ω. Model fit by maximum likelihood:

  (Û, V̂) = argmin_{U,V} Σ_{(i,j)∈Ω} ( y_ij − Σ_{r=1}^{R} u_ir v_jr )²
         = argmin_{U,V} ‖W ⊙ (Y − U Vᵀ)‖_F²

[Figure: graphical model with latent vectors u_i, v_j generating the observed entries y_ij]

No closed-form solution. The method of choice depends on the proportion of missing entries (a sketch of the EM variant follows the list):

  • Low: EM (reweighted SVD)
  • High: stochastic gradient descent
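A minimal sketch (my own, assuming a binary mask W) of the EM / reweighted-SVD approach for the low-missingness regime: impute the missing entries with the current reconstruction, then re-fit by SVD.

```python
import numpy as np

def em_weighted_svd(Y, W, R, n_iter=100):
    """EM for PCA with missing data: alternate imputation and rank-R SVD.
    Y: data (arbitrary values where W == 0); W: binary observation mask."""
    Z = np.where(W == 1, Y, 0.0)                 # initial imputation
    for _ in range(n_iter):
        A, s, Bt = np.linalg.svd(Z, full_matrices=False)
        Y_hat = (A[:, :R] * s[:R]) @ Bt[:R, :]   # current rank-R reconstruction
        Z = np.where(W == 1, Y, Y_hat)           # keep observed, impute the rest
    return Y_hat
```

Each step is dominated by a full SVD, which is why stochastic gradient descent over the observed entries only becomes preferable when most entries are missing.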

slide-16
SLIDE 16

Example: Movie recommendation

Popularized by the Netflix Prize. [Mazumder et al., 2009]

slide-17
SLIDE 17

Binary Matrix Factorization

slide-18
SLIDE 18

Binary matrices

Binary data are very common in practice

  • Facts about the world (true/false)
  • Graphs
  • Choices
  • Discretized values
  • Membership/assignment relations

Problem: many common binary matrices have high rank:

[Figure: an example binary matrix Y; rk(Y) = ?]

slide-19
SLIDE 19

The sign rank

The sign rank of a binary matrix is defined as the smallest rank of a real matrix that, once thresholded, yields the binary matrix. [Alon et al., 1985], [Linial et al., 2007]

[Figure: the same binary matrix Y; sign rank rk±(Y) = ?]

slide-20
SLIDE 20

Sign rank of identity (equality) matrix

slide-21
SLIDE 21

Introduction to binary-data factor analysis

Y ∈ {0,1}^{n×m} is a fully observed binary matrix. Each entry is Bernoulli with the sigmoid of a dot product as its parameter:

  y_ij ~ Bernoulli( σ( Σ_{r=1}^{R} u_ir v_jr ) )

  • U ∈ ℝ^{n×R} are the row latent variables
  • V ∈ ℝ^{m×R} are the column latent variables
  • σ(t) := 1/(1 + e^{−t}) is the sigmoid function

Model fit by maximum likelihood:

  (Û, V̂) = argmin_{U,V} Σ_{i=1}^{n} Σ_{j=1}^{m} ℓ( Σ_{r=1}^{R} u_ir v_jr , y_ij ),   ℓ(x, y) = log(1 + e^{−(2y−1)x})

[Figure: graphical model with latent vectors u_i, v_j generating the binary entries y_ij]

Minimization (see the sketch after this list) using:

  • Alternating logistic regression
  • Gradient descent
  • Proximal methods
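A toy gradient-descent sketch (mine; full-batch and unregularized) of the logistic-loss factorization:

```python
import numpy as np

def logistic_mf(Y, R, lr=0.5, n_iter=2000, seed=0):
    """Binary matrix factorization by gradient descent on the logistic loss."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, R))
    V = 0.1 * rng.standard_normal((m, R))
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-(U @ V.T)))   # predicted probabilities
        G = P - Y                               # gradient of the loss w.r.t. the scores
        U, V = U - lr * (G @ V) / m, V - lr * (G.T @ U) / n
    return U, V

# The logistic loss pushes U V^T toward large +/- values on the 1/0 entries,
# so even the identity matrix is fit well with small R (cf. the next slide).
U, V = logistic_mf(np.eye(20), R=4)
```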
slide-22
SLIDE 22

Factorizing the identity matrix

In practice, the sign rank is hard to find. By minimizing the logistic loss, we get an upper bound on the sign rank. In practice, it scales logarithmically with the dimension.

slide-23
SLIDE 23

Data imputation in structured matrices

slide-24
SLIDE 24

Joint matrix factorization (multi-view learning)

slide-25
SLIDE 25

Joint factorization

We observe n individuals. Y ∈ ℝ^{n×m} holds m similar measurements and Y′ ∈ ℝ^{n×m′} holds m′ similar measurements. We could factorize Y and Y′ independently, but intuitively we can do better by letting the two factorizations "share" information.

[Figure: left-light and right-light face images; can the missing view be predicted?]

slide-26
SLIDE 26

Joint factorization (multi-view)

Share a latent representation across views. Naïve solution: concatenate the matrices.

[Figure: graphical model in which a shared u_i generates Y1_ij through v1_j and Y2_ij through v2_j]

  (Û, V̂1, V̂2) = argmin_{U,V1,V2} ‖Y1 − U V1ᵀ‖_F² + β ‖Y2 − U V2ᵀ‖_F²

[Goldberg et al., 2009]

slide-27
SLIDE 27

Regression as special case

Linear regression, supervised classification:

  • X is the feature matrix
  • y is the vector of outputs

The CMF objective is:

  ‖X − U Vᵀ‖_F² + α ‖y − U w‖²

As α and the rank R increase, this objective approaches the regression objective. For large rank R, α = 1 gives good generalization performance.

[Figure: graphical model linking the features X (through v_j) and the output y (through w) via shared u_i]

slide-28
SLIDE 28

Using view-dependent correlations

[Group Factor Analysis, Klami et al., 2014]

Share all the latent variables:

  (Û, V̂1, V̂2) = argmin_{U,V1,V2} ‖Y1 − U V1ᵀ‖_F² + ‖Y2 − U V2ᵀ‖_F²
              = argmin Σ_k ‖Yk − U Vkᵀ‖_F²

Use a view-dependent factorization (Inter-Battery Factor Analysis):

  (Û, V̂1, V̂2, Û′1, Û′2, V̂′1, V̂′2) = argmin Σ_k ‖Yk − U Vkᵀ − U′k V′kᵀ‖_F²

i.e. add a view-dependent low-rank structure.

slide-29
SLIDE 29

Example: joint image denoising

[Figure: denoising results comparing a view-specific low-rank model, simple matrix concatenation, and independent views]

slide-30
SLIDE 30

Collective matrix factorization

slide-31
SLIDE 31

Collective matrix factorization: main idea

The user embeddings u_i are shared. The movie embeddings m_j are shared. The actor embeddings a_i′ and p_j′ should be shared: a_i′ = p_j′.

[Figure: user–movie, movie–actor and user–preferred-actor relations with embeddings u_i, m_j, a_i′, p_j′]

  (Û1, Û2, Û3) = argmin_{U1,U2,U3} ‖Y1 − U1 U2ᵀ‖_F² + ‖Y2 − U2 U3ᵀ‖_F² + ‖Y3 − U3 U1ᵀ‖_F²
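A sketch of this objective (my own simplification: plain gradient descent, no bias or regularization terms), with three relations sharing entity embeddings:

```python
import numpy as np

def collective_mf(Y1, Y2, Y3, R, lr=0.01, n_iter=2000, seed=0):
    """Collective MF: Y1 ~ U1 U2^T (user-movie), Y2 ~ U2 U3^T (movie-actor),
    Y3 ~ U3 U1^T (actor-user); each entity type has one embedding matrix."""
    rng = np.random.default_rng(seed)
    U1 = 0.1 * rng.standard_normal((Y1.shape[0], R))
    U2 = 0.1 * rng.standard_normal((Y2.shape[0], R))
    U3 = 0.1 * rng.standard_normal((Y3.shape[0], R))
    for _ in range(n_iter):
        E1 = U1 @ U2.T - Y1   # residuals of the three relations
        E2 = U2 @ U3.T - Y2
        E3 = U3 @ U1.T - Y3
        # every embedding collects gradients from both relations it appears in
        g1 = E1 @ U2 + E3.T @ U3
        g2 = E2 @ U3 + E1.T @ U1
        g3 = E3 @ U1 + E2.T @ U2
        U1, U2, U3 = U1 - lr * g1, U2 - lr * g2, U3 - lr * g3
    return U1, U2, U3
```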

slide-32
SLIDE 32

Collective matrix factorization as probabilistic model

[Equation: the value of the m-th relation between entities i and j is modelled as a low-rank term plus a bias plus noise, with the row- and column-entity types indexed per view]

[1] Singh, Ajit P., and Geoffrey J. Gordon. "Relational learning via collective matrix factorization". 2008.

We can learn this model using MAP inference. There is a simple interpretation of this model as the decomposition of a symmetric matrix.

slide-33
SLIDE 33

Relational data

[Figure: schema diagrams for four relational settings]

  • Single-view: 2 entity types, 1 relation
  • Multi-view: M+1 entity types, M relations
  • Augmented multi-view: 3 entity types, 3 relations
  • Relational database: N = 5 entity types, M = 4 relations

slide-34
SLIDE 34

Distributed representations

[Figure: the same four schemas annotated with shared entity embeddings: U1, U2 (single-view); U1, …, U_{M+1} (multi-view); U1, U2, U3 (augmented multi-view); U1, …, U5 (relational database)]

Each entity type gets one embedding matrix, reused across every relation it participates in.

slide-35
SLIDE 35

Building a symmetric data matrix [2]

[Figure: the relations X1, X2, X3 between entity sets e1, e2, e3 are arranged as off-diagonal blocks of one large symmetric matrix Y; unobserved blocks are marked "?", and Y is factorized as U Uᵀ]

[2] Guillaume Bouchard, Shengbo Guo, and Dawei Yin. Convex collective matrix factorization. AISTATS 2013.

slide-36
SLIDE 36

Building a symmetric data matrix [2]

[Figure: the same construction with relations X1, …, X4 between entity sets e1, …, e5 stacked into one large symmetric block matrix Y]

[2] Guillaume Bouchard, Shengbo Guo, and Dawei Yin. Convex collective matrix factorization. AISTATS 2013.

slide-37
SLIDE 37

[Figure: block structure of the symmetric matrix Y built from X1, …, X4 over entity sets e1, …, e5; the joint embedding matrix U is partitioned into blocks Uk, one per entity type, zero outside the rows of that type, so each relation block of Y comes from the product of two blocks of U]
slide-38
SLIDE 38

Experiment: Yale dataset prediction

[Figure: augmented multi-view schema with relations X1, X2, X3 over entity types t1, t2, t3]

Augmented multi-view:

  • X1: left-lighting picture (Person e1, Pixel e2)
  • X2: right-lighting picture (Person e1, Pixel e3)
  • X3: pixel distance lower than a threshold (e2, e3)

Data:

  • 50×50 images → 2500 rows
  • 8 persons with both left- and right-lighting conditions
  • 7 pictures with only left-lighting conditions

Task: predicting the 7 missing right-lighting pictures.

slide-39
SLIDE 39

Experiment: gene expression data

[Figure: augmented multi-view schema with relations X1, X2, X3 over entity types t1, t2, t3]

Augmented multi-view:

  • X1: gene expression (Patients e1, Genes e2)
  • X2: copy number change (Patients e1, Genes e3)
  • X3: augmented matrix (e2, e3) representing the chromosomal proximity of genes in the two views

Data:

  • 40 patients with breast cancer
  • Views: 2 measurements of 4287 genes
  • Task: predicting random missing entries

Baselines:

  • CMF: no group sparsity
  • CCA: no chromosomal proximity
  • PCA: independent views
slide-40
SLIDE 40

Experiment: gene expression data

[Figure: relative RMSE of the compared methods]

40 patients with breast cancer. Views: 2 measurements of 4287 genes. Task: predicting random missing entries.

Group sparsity helps to prevent overfitting.
slide-41
SLIDE 41

Simulation: binary data

Ring-network CMF on binary matrices:

  • 5 factors shared by all matrices
  • 2 factors of low-rank view-specific noise

We learn the models with 10 + 2M factors, letting ARD prune out the extra ones. Task: prediction of the 40% missing entries.

[Figure: ring network of relations X1, …, X5 over entity types t1, …, t5]

slide-42
SLIDE 42

Tensor Factorization

slide-43
SLIDE 43

Linked Open Data

Generic knowledge bases are represented as multi-graphs, i.e. graphs with labelled edges.

[credits: M. Nickel]

slide-44
SLIDE 44

Tensor factorization models: a long history of methods

  • CP decomposition
  • Tucker decomposition
  • DEDICOM [Harshman, 1978]
  • RESCAL [Nickel & Tresp, 2010]

Good predictive capabilities in link prediction. RESCAL factorizes each slice k of the relation tensor Y as

  Y_k = U R_k Uᵀ

[Figure: the three-way data tensor Y]
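A toy gradient-descent sketch of a RESCAL-style factorization (mine; real implementations use alternating least squares):

```python
import numpy as np

def rescal_gd(Y, R, lr=0.01, n_iter=1000, seed=0):
    """Fit Y[k] ~ U Rk[k] U^T for each relation slice k by gradient descent.
    Y: (K, n, n) array, one n x n adjacency slice per relation."""
    rng = np.random.default_rng(seed)
    K, n, _ = Y.shape
    U = 0.1 * rng.standard_normal((n, R))
    Rk = 0.1 * rng.standard_normal((K, R, R))
    for _ in range(n_iter):
        gU = np.zeros_like(U)
        for k in range(K):
            E = U @ Rk[k] @ U.T - Y[k]               # residual of slice k
            gU += E @ U @ Rk[k].T + E.T @ U @ Rk[k]  # d loss / d U
            Rk[k] -= lr * (U.T @ E @ U)              # d loss / d Rk[k]
        U -= lr * gU
    return U, Rk
```

The key property is that the same entity embeddings U appear on both sides of every slice, which is what enables collective learning across relations.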

slide-45
SLIDE 45

Example: US-presidents data

Data extracted from DBpedia:

  • (vice-)presidents
  • parties
  • party memberships
  • information about who was (vice-)president of whom

Task: predict party membership for the persons in the dataset (link prediction). [Nickel, 2011]

slide-46
SLIDE 46

Visualization of latent variables

Data: Freebase (1 Billion entities, 23K relations!) Source: [Bordes et al., 2013]

slide-47
SLIDE 47

Visualization of latent variables (zoom)

Data: Freebase (1 Billion entities, 23K relations!) Source: [Bordes et al., 2013]

slide-48
SLIDE 48

Link prediction

slide-49
SLIDE 49

Collective Matrix and Tensor Factorization

slide-50
SLIDE 50

A family of models: matrix factorization, tensor factorization, coupled factorization.

[Figure: the three factorization families; credits: Evrim Acar]

slide-51
SLIDE 51

[Figure: plate diagram: latent u_i, v_j generate y_ij (n × m)]

Matrix factorization

slide-52
SLIDE 52

[Figure: plate diagram: a shared u_i with view-specific v_j, v′_j generates y_ij (n × m) and y′_ij (n × m′)]

Matrix co-factorization (multi-view)

slide-53
SLIDE 53

[Figure: three entity types u1, u2, u3 (sizes n1, n2, n3) linked by relations y1, y2, y3]

Simple collective factorizing model

slide-54
SLIDE 54

[Figure: four entity types u1, …, u4 linked by relations y1, …, y5]

More complex relational model

slide-55
SLIDE 55

[Figure: five entity types u1, …, u5 linked by relations y1, …, y6]

Collective Matrix and Tensor Factorization (CMTF)

slide-56
SLIDE 56

Collective Matrix and Tensor Factorization (CMTF)

[Figure: relational schema linking User, Item, Item feature, Tag and Comment entities through relations C1–C4]

Relational data, year 2012. Source: Flickr.

Several prediction tasks:

  • Item recommendation
  • Friend recommendation
  • Automatic tagging
  • Data cleaning
  • Predicting the comment type

slide-57
SLIDE 57

Flickr example

Several prediction tasks:

  • Item recommendation
  • Friend recommendation
  • Automatic tagging
  • Data cleaning
  • Predicting the comment type

Methods compared:

  • BPRA: Bayesian Probabilistic Relational Analysis
  • PRA: Probabilistic Relational Analysis
  • PMF: Probabilistic Matrix Factorization
  • BPMF: Bayesian Probabilistic Matrix Factorization
  • TF: Tensor Factorization
  • BPTF: Bayesian Probabilistic Tensor Factorization
  • CMF: Collective Matrix Factorization

[Figure: relational schema (User, Item, Item feature, Tag, Comment; relations C1–C4) and the data-imputation error rate for each method]

slide-58
SLIDE 58

Predictive queries

Probabilistic databases

slide-59
SLIDE 59

Semantic Search API: Predictive SPARQL

Core idea: learn a model on the KB → now we can query missing data! SPARQL is a standard query language for semantic data; Predictive SPARQL generalizes it to probabilistic models.

slide-60
SLIDE 60

Predictive query example

slide-61
SLIDE 61

Scalable Binary tensor factorization

slide-62
SLIDE 62

Binary tensor factorization in sublinear time

Many data matrices contain a lot of zeros that are not missing values. Positives: Ω⁺; negatives: Ω⁻. Can we compute the loss in fewer than O(|Ω⁺| + |Ω⁻|) operations? Answer: yes, for some problems we can do it in O(|Ω⁺|) operations. We apply this to binary matrix and tensor factorization.

[Figure: a large binary relation matrix]

slide-63
SLIDE 63

Quadratic loss is “sparse-friendly”

Observation matrix: Y ∈ ℝ^{n×m} with s non-zero elements, s ≪ min(n, m). Naïve time complexity of computing ‖Y − U Vᵀ‖_F² when r := rank(U Vᵀ) ≪ min(n, m): O(rnm). Do we have to pay that? No: we use the trace trick tr(AB) = tr(BA):

  ‖Y − U Vᵀ‖_F² = ‖Y‖_F² − 2 tr(Yᵀ U Vᵀ) + ‖U Vᵀ‖_F²
               = ‖Y‖_F² − 2 tr(Yᵀ U Vᵀ) + 1ᵀ ((UᵀU) ⊙ (VᵀV)) 1

The three terms cost O(s), O(rs) (only the non-zero positions of Y are needed) and O(r²(n + m)) respectively.

Complexity: O((r + 1)s + r²(n + m)).

[Figure: a sparse observation matrix]
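A runnable sketch of the sparse loss computation (my own, using SciPy's COO format):

```python
import numpy as np
from scipy.sparse import random as sparse_random

def sparse_frobenius_loss(Y, U, V):
    """||Y - U V^T||_F^2 without ever forming the dense n x m matrix U V^T.
    Y: scipy sparse matrix with s non-zeros; U: n x r; V: m x r."""
    Y = Y.tocoo()
    yy = (Y.data ** 2).sum()                                    # O(s)
    # <Y, U V^T> only needs U V^T at the non-zero positions of Y: O(rs)
    cross = (Y.data * np.einsum("ik,ik->i", U[Y.row], V[Y.col])).sum()
    low_rank = ((U.T @ U) * (V.T @ V)).sum()                    # O(r^2 (n + m))
    return yy - 2.0 * cross + low_rank

Y = sparse_random(1000, 800, density=0.01, format="coo")
U, V = np.random.randn(1000, 5), np.random.randn(800, 5)
dense = np.linalg.norm(Y.toarray() - U @ V.T) ** 2
assert np.allclose(sparse_frobenius_loss(Y, U, V), dense)
```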

slide-64
SLIDE 64

Non-quadratic loss

The previous trick does not apply to non-quadratic losses. Solution: upper-bound the loss by a quadratic function, using Jaakkola's local variational bound on the logistic function. Issue: the local variational parameter must be constant across entries for the trace trick to apply.

[Figure: lower-bounding the sigmoid by a scaled Gaussian]
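For completeness, the standard Jaakkola–Jordan bound referred to here: for any ξ,

  log σ(x) ≥ log σ(ξ) + (x − ξ)/2 − λ(ξ)(x² − ξ²),   with λ(ξ) = tanh(ξ/2) / (4ξ),

with equality at x = ±ξ. The right-hand side is quadratic in x, which restores the trace trick above, but only when the variational parameter ξ is shared across entries.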

slide-65
SLIDE 65

Bound refinement

[Figure: successive quadratic bounds b1(x), b2(x), b3(x), b3′(x) tightening around f(x)]

slide-66
SLIDE 66

Split the matrix to improve the bound

slide-67
SLIDE 67

Algorithm

slide-68
SLIDE 68

Results: time-accuracy trade-off

slide-69
SLIDE 69

Results: relational learning datasets

slide-70
SLIDE 70

Convex methods

slide-71
SLIDE 71

Learning distributed models

  min_{θ∈Θ} L(θ; D)

  1. Alternating optimization
  2. Gradient descent
  3. Proximal gradient
  4. ADMM

Convex methods: solve

  min_{X∈C} L(X; D)

over a convex set C, obtained by a convex relaxation or by a change of parameters.

slide-72
SLIDE 72

Introduction to convex factorization

Given Y ∈ ℝ^{n×m}, low-rank matrix approximation: Y ≈ U Vᵀ, where U ∈ ℝ^{n×R} and V ∈ ℝ^{m×R}. Can be obtained by SVD:

  min_{U,V} ‖Y − U Vᵀ‖_F

Missing values? Weighted SVD:

  min_{U,V} ‖W ⊙ (Y − U Vᵀ)‖_F

Intractable! → Use nuclear-norm regularization.

[Figure: Y ≈ U Vᵀ with inner dimension R]
slide-73
SLIDE 73

Convex factorization (nuclear norm)

Original problem:

  min_{U∈ℝ^{n×R}, V∈ℝ^{m×R}} ‖W ⊙ (Y − U Vᵀ)‖_F² + λ rank(U Vᵀ)

Equivalent problem (1):

  min_{X∈ℝ^{n×m}} ‖W ⊙ (Y − X)‖_F² + λ rank(X)

Relaxed problem (2):

  min_{X∈ℝ^{n×m}} ‖W ⊙ (Y − X)‖_F² + λ ‖X‖_*

where ‖X‖_* = Σ_{k=1}^{min(n,m)} σ_k(X) is the nuclear norm (the sum of the singular values).

(2) is convex, (1) is not!

[Figure: Y ≈ U Vᵀ with inner dimension R]
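A sketch of how (2) can be solved by proximal gradient descent; the prox of the nuclear norm is singular-value soft-thresholding (a standard result; this is my illustration rather than the talk's algorithm):

```python
import numpy as np

def svt(X, tau):
    """Prox of tau * nuclear norm: soft-threshold the singular values."""
    A, s, Bt = np.linalg.svd(X, full_matrices=False)
    return (A * np.maximum(s - tau, 0.0)) @ Bt

def complete(Y, W, lam, step=0.5, n_iter=300):
    """Proximal gradient for min_X ||W * (Y - X)||_F^2 + lam * ||X||_*.
    step = 0.5 matches the Lipschitz constant 2 of the smooth term."""
    X = np.zeros_like(Y)
    for _ in range(n_iter):
        grad = 2.0 * W * (X - Y)              # gradient of the data term
        X = svt(X - step * grad, step * lam)  # proximal step
    return X
```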

slide-74
SLIDE 74

Multi-view with overlapping trace norms

slide-75
SLIDE 75

Convex multi-view: experiments

slide-76
SLIDE 76

Convex collective matrix factorization

[Figure: three relations among user, movie and actor entities: Y1 (n1×n2), Y2 (n3×n2), Y3 (n1×n3)]

  min_{X1∈ℝ^{n1×n2}, X2∈ℝ^{n3×n2}, X3∈ℝ^{n1×n3}} Σ_{k=1}^{3} ‖Yk − Xk‖_F² + λ ‖(X1, X2, X3)‖_#

where the collective trace norm is

  ‖(X1, X2, X3)‖_# := min tr(Z) over Z ⪰ 0, with Z the symmetric block matrix whose diagonal blocks are A ∈ ℝ^{n1×n1}, B ∈ ℝ^{n2×n2}, C ∈ ℝ^{n3×n3} and whose off-diagonal blocks are X1, X2, X3.

Can be solved efficiently using convex methods. [B. et al., 2013]

slide-77
SLIDE 77

Example of collective trace norm

slide-78
SLIDE 78

Convex Tensor Factorization

Tomioka et al. (2013) proposed a relaxation of the tensor rank, the overlapped trace norm:

  ‖Y‖ = Σ_k ‖Y_(k)‖_*

where Y_(k) denotes the k-th tensor unfolding.
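A small sketch (mine) of evaluating this overlapped trace norm:

```python
import numpy as np

def overlapped_trace_norm(Y):
    """Sum of nuclear norms of the mode-k unfoldings of a tensor Y."""
    total = 0.0
    for k in range(Y.ndim):
        Yk = np.moveaxis(Y, k, 0).reshape(Y.shape[k], -1)   # mode-k unfolding
        total += np.linalg.svd(Yk, compute_uv=False).sum()  # nuclear norm
    return total

print(overlapped_trace_norm(np.random.randn(10, 8, 6)))
```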

slide-79
SLIDE 79

Bayesian methods

slide-80
SLIDE 80

Collective matrix factorization with group-sparse embeddings

[Equation: low-rank representation plus bias plus noise, as on slide 33]

Impose a group-sparse penalty on the embedding matrix U. Groups are determined by the entity types: one group per entity type and factor → NK groups. Automatic Relevance Determination (ARD): groups with large precisions are removed. Embeddings in the same group share the same precision [3].

[3] Arto Klami, Seppo Virtanen, and Samuel Kaski. Bayesian canonical correlation analysis. JMLR 14, 2013.
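A standard ARD construction consistent with [3] (a sketch; the exact hyper-parameterization used in the talk may differ): each group g of embedding weights shares one precision,

  p(U_g | α_g) = N(0, α_g⁻¹ I),   α_g ~ Gamma(a₀, b₀).

When a group is not needed, its inferred precision α_g grows, its embeddings shrink toward zero, and the corresponding factor is effectively pruned for that entity type.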

slide-81
SLIDE 81

Variational Bayes

slide-82
SLIDE 82

[Figure: block structure of the symmetric matrix Y built from X1, …, X4 over entity sets e1, …, e5, with the block-sparse embedding matrix U partitioned by entity type (repeated from slide 37)]

slide-83
SLIDE 83

Algorithm: Variational Bayes

Variational Bayes with alternating updates [4].

[4] Matthias Seeger and Guillaume Bouchard. Fast variational Bayesian inference for non-conjugate matrix factorization models. JMLR Proceedings Track 22, 2012.

slide-84
SLIDE 84

Simulation: VB vs. MAP

Setup: ring structure, Gaussian noise, entity-set sizes between 40 and 80. MAP requires heavy cross-validation: a grid of more than 1000 values for a0 = b0 and p0 = q0.

slide-85
SLIDE 85

Conclusion

slide-86
SLIDE 86

Conclusion

Take-home message

  • Learn more by fusing different sources
  • Automate machine learning tasks
  • Convexity
      – Good: theoretical guarantees + experts in optimization listened to you
      – Bad: does not tell you how to tune the parameters
  • Bayesian
      – Good: automatic tuning of parameters
      – Bad: no theoretical guarantees

Ongoing work

  • Convex and Bayesian: Is it possible?
  • Efficient flexible implementation of predictive queries
  • Privacy-preserving learning of embeddings
slide-87
SLIDE 87

References

1) Guillaume Bouchard, Shengbo Guo and Dawei Yin. Convex collective matrix factorization. AISTATS 2013.
2) Dawei Yin, Shengbo Guo, Boris Chidlovskii, Brian D. Davison, Cédric Archambeau and Guillaume Bouchard. Connecting comments and tags: improved modeling of social tagging systems. WSDM 2013: 547-556.
3) Matthias Seeger and Guillaume Bouchard. Fast variational Bayesian inference for non-conjugate matrix factorization models. JMLR Proceedings Track 22: 1012-1018 (2012).
4) Mohammad Emtiyaz Khan, Benjamin M. Marlin, Guillaume Bouchard and Kevin P. Murphy. Variational bounds for mixed-data factor analysis. NIPS 2010: 1108-1116.
5) Guillaume Bouchard, Behrouz Behmardi and Cedric Archambeau. Overlapping trace norms for multi-view learning. 2014. http://arxiv.org/abs/1404.6163.
6) Beyza Ermis and Guillaume Bouchard. Binary tensor factorization in sublinear time. UAI 2014.
7) Arto Klami, Abhishek Tripathi and Guillaume Bouchard. Efficient inference for binary matrix and tensor factorization. ICLR 2014.