Binary Factorization Models for Statistical Relational Learning


slide-1
SLIDE 1

Binary Factorization Models for Statistical Relational Learning

Guillaume Bouchard, Machine Learning for Services, Xerox Research Centre Europe

Collaborators: Beyza Ermis, Behrouz Behmardi, Cedric Archambeau, Dawei Yin, Ehsan Abbasnejad, Julien Perez, Shengbo Guo

slide-2
SLIDE 2

Motivating example: Human biological data (1)

Nine variables:

  • patient's internal (core) temperature
  • patient's surface temperature
  • oxygen saturation
  • blood pressure
  • stability of patient's surface temperature
  • stability of patient's core temperature
  • stability of patient's blood pressure
  • patient's perceived comfort
  • discharge decision

Data excerpt:

| Core temp | Surf temp | O2 sat | BP | Surf stbl | Core stbl | BP stbl | Comfort | Decision |
| mid | low | excellent | mid | stable | stable | stable | 15 | A |
| mid | high | excellent | high | stable | stable | stable | 10 | S |
| high | low | excellent | high | stable | stable | mod-stable | 10 | A |
| mid | low | good | high | stable | unstable | mod-stable | 15 | A |
| mid | mid | excellent | high | stable | stable | stable | 10 | A |
| high | low | good | mid | stable | stable | unstable | 15 | S |
| mid | low | excellent | high | stable | stable | mod-stable | 5 | S |
| high | mid | excellent | mid | unstable | unstable | stable | 10 | S |
| mid | high | good | mid | stable | stable | stable | 10 | S |
| mid | low | excellent | mid | unstable | stable | mod-stable | 10 | S |
| mid | mid | good | mid | stable | stable | stable | 15 | A |
| mid | low | good | high | stable | stable | mod-stable | 10 | A |
| high | high | excellent | high | unstable | stable | unstable | 15 | A |
| mid | high | good | mid | unstable | stable | mod-stable | 10 | A |
| mid | low | good | high | unstable | unstable | stable | 15 | S |

9 variables, 90 patients. Patient data, year 1993. Source: UCI repository. Objective: statistical modelling (tests, predictions, visualisation).

slide-3
SLIDE 3

Motivating example: Human biological data (2)

38M variables, 1092 individuals. Genetic data, year 2012. Source: 1000 Genomes Project.

slide-4
SLIDE 4

Example: Human biological data (3)

Biological data, year 2013. Source: KEGG/DBGET/LinkDB.

slide-5
SLIDE 5

Example: Social network (1)

[Figure: 18×18 "likes"/"dislikes" adjacency matrices among monks, reordered to reveal three groups; edge types between Monk i and Monk j: likes, does not like, influences]

Monk relationships, year 1969. Source: Sampson's monastery data, an 18×18×10 tensor.

slide-6
SLIDE 6

Example: Social network (2)

Bloggers community, year 2012. Source: LiveJournal online social network (http://snap.stanford.edu/data), a 4M×4M×300K tensor.

slide-7
SLIDE 7

Evolution of Machine Learning

  • Early age: one task → many models
  • Today: one task → one model
  • Wish: multiple tasks → one model

We need a generic data model!

slide-8
SLIDE 8

Example: Social network

[Figure: relational schema linking User, Item, Item feature, Tag and Comment entities through relations C1–C4]

Relational data, year 2012. Source: Flickr.

Several prediction tasks:

  • Item recommendation
  • Friend recommendation
  • Automatic tagging
  • Data cleaning
  • Predicting the comment type

slide-9
SLIDE 9

Statistical Relational Learning

Also called multi-relational learning. Goal: predict relationships.

  • Probabilistic relational models
  • Markov logic networks
  • Recently, distributed representations have shown great performance:
      – Collective matrix factorization [Singh & Gordon, 2008]
      – Tensor factorization [Sutskever et al., 2009; Nickel & Tresp, 2010]

slide-10
SLIDE 10

Why factorization?

Distributed representations: there is a continuous latent space (embedding space) that represents entities, and relative positions in this latent space capture intrinsic relations.

Rotational invariance: if a function f(M) of a matrix M is invariant under rotation (including permutation of the rows or columns), then f depends only on the spectrum of M.

slide-11
SLIDE 11

Statistical Relational AI & Factorization

[Figure: timeline from 1999 to 2015 relating three threads: probabilistic models, convex optimization, relational learning]

  • Pearson, Karl. "On lines and planes of closest fit to systems of points in space". 1901.
  • Tipping, Michael E., and Christopher M. Bishop. "Probabilistic principal component analysis". 1999.
  • Friedman, Nir, Lise Getoor, Daphne Koller, and Avi Pfeffer. "Learning probabilistic relational models". 1999.
  • Fazel, Maryam, Haitham Hindi, and Stephen P. Boyd. "A rank minimization heuristic with application to minimum order system approximation". 2001.
  • Richardson, Matthew, and Pedro Domingos. "Markov logic networks". 2006.
  • Salakhutdinov, Ruslan, and Andriy Mnih. "Probabilistic matrix factorization". 2008.
  • Singh, Ajit P., and Geoffrey J. Gordon. "Relational learning via collective matrix factorization". 2008.
  • Candès, Emmanuel J., and Benjamin Recht. "Exact matrix completion via convex optimization". 2009.
  • Wright, John, Arvind Ganesh, Shankar Rao, Yigang Peng, and Yi Ma. "Robust principal component analysis". 2009.
  • Sutskever, Ilya, Ruslan Salakhutdinov, and Josh Tenenbaum. "Modelling relational data using Bayesian clustered tensor factorization". 2009.
  • Nickel, Maximilian, et al. "A three-way model for collective learning on multi-relational data". 2011.
  • Tomioka et al. "Statistical performance of convex tensor decomposition". 2011.

slide-12
SLIDE 12

Outline

Noise models for matrix factorization

  • Missing data
  • Heteroscedastic noise
  • Binary data

Distributed models for statistical relational learning

  • Joint factorization = Multi-view learning
  • Collective Matrix Factorization (CMF)
  • Tensor Factorization (TF)
  • Collective Matrix and Tensor Factorization (CMTF)

Applications

  • Data analysis (print logs, textual data, gene expressions, …)
  • Recommender systems
  • Predictive queries in databases

Learning distributed representations

  • Scalable Binary Tensor Factorization
  • Convex methods
  • Bayesian methods
slide-13
SLIDE 13

Matrix Factorization with Missing Data

slide-14
SLIDE 14

Introduction to factor analysis

Y ∈ ℝ^{n×m} is a fully observed matrix. Each entry is a dot product between latent vectors, plus noise:

  y_ij = Σ_{r=1}^{R} u_ir v_jr + σ ε_ij

  • ε_ij is white Gaussian noise
  • U ∈ ℝ^{n×R} are the row latent variables
  • V ∈ ℝ^{m×R} are the column latent variables

Model fit by maximum likelihood:

  (Û, V̂) = argmin_{U,V} Σ_{i=1}^{n} Σ_{j=1}^{m} ( y_ij − Σ_{r=1}^{R} u_ir v_jr )²
         = argmin_{U,V} ‖Y − U Vᵀ‖_F²

[Figure: graphical model with latent vectors u_i (i = 1..n) and v_j (j = 1..m) generating the entries y_ij]

Closed-form solution: rank-R partial SVD.
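As a concrete illustration (mine, not from the slides), a minimal NumPy sketch of the closed-form fit via truncated SVD:

```python
import numpy as np

def factor_analysis_fit(Y, R):
    """Rank-R maximum-likelihood fit Y ~ U V^T via the truncated SVD."""
    A, s, Bt = np.linalg.svd(Y, full_matrices=False)
    U = A[:, :R] * np.sqrt(s[:R])      # row embeddings, n x R
    V = Bt[:R, :].T * np.sqrt(s[:R])   # column embeddings, m x R
    return U, V

Y = np.random.randn(100, 40)
U, V = factor_analysis_fit(Y, R=5)
print(np.linalg.norm(Y - U @ V.T))     # residual of the best rank-5 fit
```

Splitting the singular values evenly between U and V is one convention among many; any invertible R×R mixing of the factors gives the same fit, which is exactly the rotational invariance mentioned earlier.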

slide-15
SLIDE 15

Factor analysis with missing elements

The matrix Y ∈ ℝ^{n×m} is only observed at positions (i, j) ∈ Ω. We define the binary weight matrix W ∈ {0,1}^{n×m}, with w_ij = 1 iff (i, j) ∈ Ω. Model fit by maximum likelihood:

  (Û, V̂) = argmin_{U,V} Σ_{(i,j)∈Ω} ( y_ij − Σ_{r=1}^{R} u_ir v_jr )²
         = argmin_{U,V} ‖W ⊙ (Y − U Vᵀ)‖_F²

[Figure: graphical model with latent vectors u_i, v_j generating the observed entries y_ij]

No closed-form solution. The method of choice depends on the proportion of missing entries (a sketch of the EM variant follows the list):

  • Low: EM (reweighted SVD)
  • High: stochastic gradient descent
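A minimal sketch (my own, assuming a binary mask W) of the EM / reweighted-SVD approach for the low-missingness regime: impute the missing entries with the current reconstruction, then re-fit by SVD.

```python
import numpy as np

def em_weighted_svd(Y, W, R, n_iter=100):
    """EM for PCA with missing data: alternate imputation and rank-R SVD.
    Y: data (arbitrary values where W == 0); W: binary observation mask."""
    Z = np.where(W == 1, Y, 0.0)                 # initial imputation
    for _ in range(n_iter):
        A, s, Bt = np.linalg.svd(Z, full_matrices=False)
        Y_hat = (A[:, :R] * s[:R]) @ Bt[:R, :]   # current rank-R reconstruction
        Z = np.where(W == 1, Y, Y_hat)           # keep observed, impute the rest
    return Y_hat
```

Each step is dominated by a full SVD, which is why stochastic gradient descent over the observed entries only becomes preferable when most entries are missing.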

slide-16
SLIDE 16

Example: Movie recommendation

Popularized by the Netflix Prize. [Mazumder et al., 2009]

slide-17
SLIDE 17

Binary Matrix Factorization

slide-18
SLIDE 18

Binary matrices

Binary data are very common in practice

  • Facts about the world (true/false)
  • Graphs
  • Choices
  • Discretized values
  • Membership/assignment relations

Problem: many common binary matrices have high rank:

[Figure: an example binary matrix Y; rk(Y) = ?]

slide-19
SLIDE 19

The sign rank

The sign rank of a binary matrix is defined as the smallest rank of a real matrix that, once thresholded, yields the binary matrix. [Alon et al., 1985], [Linial et al., 2007]

[Figure: the same binary matrix Y; sign rank rk±(Y) = ?]

slide-20
SLIDE 20

Sign rank of identity (equality) matrix

slide-21
SLIDE 21

Introduction to binary-data factor analysis

Y ∈ {0,1}^{n×m} is a fully observed binary matrix. Each entry is Bernoulli with the sigmoid of a dot product as its parameter:

  y_ij ~ Bernoulli( σ( Σ_{r=1}^{R} u_ir v_jr ) )

  • U ∈ ℝ^{n×R} are the row latent variables
  • V ∈ ℝ^{m×R} are the column latent variables
  • σ(t) := 1/(1 + e^{−t}) is the sigmoid function

Model fit by maximum likelihood:

  (Û, V̂) = argmin_{U,V} Σ_{i=1}^{n} Σ_{j=1}^{m} ℓ( Σ_{r=1}^{R} u_ir v_jr , y_ij ),   ℓ(x, y) = log(1 + e^{−(2y−1)x})

[Figure: graphical model with latent vectors u_i, v_j generating the binary entries y_ij]

Minimization (see the sketch after this list) using:

  • Alternating logistic regression
  • Gradient descent
  • Proximal methods
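A toy gradient-descent sketch (mine; full-batch and unregularized) of the logistic-loss factorization:

```python
import numpy as np

def logistic_mf(Y, R, lr=0.5, n_iter=2000, seed=0):
    """Binary matrix factorization by gradient descent on the logistic loss."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, R))
    V = 0.1 * rng.standard_normal((m, R))
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-(U @ V.T)))   # predicted probabilities
        G = P - Y                               # gradient of the loss w.r.t. the scores
        U, V = U - lr * (G @ V) / m, V - lr * (G.T @ U) / n
    return U, V

# The logistic loss pushes U V^T toward large +/- values on the 1/0 entries,
# so even the identity matrix is fit well with small R (cf. the next slide).
U, V = logistic_mf(np.eye(20), R=4)
```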
slide-22
SLIDE 22

Factorizing the identity matrix

In practice, the sign rank is hard to find. By minimizing the logistic loss, we get an upper bound on the sign rank. In practice, it scales logarithmically with the dimension.

slide-23
SLIDE 23

Data imputation in structured matrices

slide-24
SLIDE 24

Joint matrix factorization (multi-view learning)

slide-25
SLIDE 25

Joint factorization

We observe n individuals. Y ∈ ℝ^{n×m} holds m similar measurements and Y′ ∈ ℝ^{n×m′} holds m′ similar measurements. We could factorize Y and Y′ independently, but intuitively we can do better by letting the two factorizations "share" information.

[Figure: left-light and right-light face images; can the missing view be predicted?]

slide-26
SLIDE 26

Joint factorization (multi-view)

Share a latent representation across views. Naïve solution: concatenate the matrices.

[Figure: graphical model in which a shared u_i generates Y1_ij through v1_j and Y2_ij through v2_j]

  (Û, V̂1, V̂2) = argmin_{U,V1,V2} ‖Y1 − U V1ᵀ‖_F² + β ‖Y2 − U V2ᵀ‖_F²

[Goldberg et al., 2009]

slide-27
SLIDE 27

Regression as special case

Linear regression, supervised classification:

  • X is the feature matrix
  • y is the vector of outputs

The CMF objective is:

  ‖X − U Vᵀ‖_F² + α ‖y − U w‖²

As α and the rank R increase, this objective approaches the regression objective. For large rank R, α = 1 gives good generalization performance.

[Figure: graphical model linking the features X (through v_j) and the output y (through w) via shared u_i]

slide-28
SLIDE 28

Using view-dependent correlations

[Group Factor Analysis, Klami et al., 2014]

Share all the latent variables:

  (Û, V̂1, V̂2) = argmin_{U,V1,V2} ‖Y1 − U V1ᵀ‖_F² + ‖Y2 − U V2ᵀ‖_F²
              = argmin Σ_k ‖Yk − U Vkᵀ‖_F²

Use a view-dependent factorization (Inter-Battery Factor Analysis):

  (Û, V̂1, V̂2, Û′1, Û′2, V̂′1, V̂′2) = argmin Σ_k ‖Yk − U Vkᵀ − U′k V′kᵀ‖_F²

i.e. add a view-dependent low-rank structure.

slide-29
SLIDE 29

Example: joint image denoising

[Figure: denoising results comparing a view-specific low-rank model, simple matrix concatenation, and independent views]

slide-30
SLIDE 30

Collective matrix factorization

slide-31
SLIDE 31

Collective matrix factorization: main idea

The user embeddings u_i are shared. The movie embeddings m_j are shared. The actor embeddings a_i′ and p_j′ should be shared: a_i′ = p_j′.

[Figure: user–movie, movie–actor and user–preferred-actor relations with embeddings u_i, m_j, a_i′, p_j′]

  (Û1, Û2, Û3) = argmin_{U1,U2,U3} ‖Y1 − U1 U2ᵀ‖_F² + ‖Y2 − U2 U3ᵀ‖_F² + ‖Y3 − U3 U1ᵀ‖_F²
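A sketch of this objective (my own simplification: plain gradient descent, no bias or regularization terms), with three relations sharing entity embeddings:

```python
import numpy as np

def collective_mf(Y1, Y2, Y3, R, lr=0.01, n_iter=2000, seed=0):
    """Collective MF: Y1 ~ U1 U2^T (user-movie), Y2 ~ U2 U3^T (movie-actor),
    Y3 ~ U3 U1^T (actor-user); each entity type has one embedding matrix."""
    rng = np.random.default_rng(seed)
    U1 = 0.1 * rng.standard_normal((Y1.shape[0], R))
    U2 = 0.1 * rng.standard_normal((Y2.shape[0], R))
    U3 = 0.1 * rng.standard_normal((Y3.shape[0], R))
    for _ in range(n_iter):
        E1 = U1 @ U2.T - Y1   # residuals of the three relations
        E2 = U2 @ U3.T - Y2
        E3 = U3 @ U1.T - Y3
        # every embedding collects gradients from both relations it appears in
        g1 = E1 @ U2 + E3.T @ U3
        g2 = E2 @ U3 + E1.T @ U1
        g3 = E3 @ U1 + E2.T @ U2
        U1, U2, U3 = U1 - lr * g1, U2 - lr * g2, U3 - lr * g3
    return U1, U2, U3
```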

slide-32
SLIDE 32

Collective matrix factorization as probabilistic model

[Equation: the value of the m-th relation between entities i and j is modelled as a low-rank term plus a bias plus noise, with the row- and column-entity types indexed per view]

[1] Singh, Ajit P., and Geoffrey J. Gordon. "Relational learning via collective matrix factorization". 2008.

We can learn this model using MAP inference. There is a simple interpretation of this model as the decomposition of a symmetric matrix.

slide-33
SLIDE 33

Relational data

[Figure: schema diagrams for four relational settings]

  • Single-view: 2 entity types, 1 relation
  • Multi-view: M+1 entity types, M relations
  • Augmented multi-view: 3 entity types, 3 relations
  • Relational database: N = 5 entity types, M = 4 relations

slide-34
SLIDE 34

Distributed representations

[Figure: the same four schemas annotated with shared entity embeddings: U1, U2 (single-view); U1, …, U_{M+1} (multi-view); U1, U2, U3 (augmented multi-view); U1, …, U5 (relational database)]

Each entity type gets one embedding matrix, reused across every relation it participates in.

slide-35
SLIDE 35

Building a symmetric data matrix [2]

[Figure: the relations X1, X2, X3 between entity sets e1, e2, e3 are arranged as off-diagonal blocks of one large symmetric matrix Y; unobserved blocks are marked "?", and Y is factorized as U Uᵀ]

[2] Guillaume Bouchard, Shengbo Guo, and Dawei Yin. Convex collective matrix factorization. AISTATS 2013.

slide-36
SLIDE 36

Building a symmetric data matrix [2]

[Figure: the same construction with relations X1, …, X4 between entity sets e1, …, e5 stacked into one large symmetric block matrix Y]

[2] Guillaume Bouchard, Shengbo Guo, and Dawei Yin. Convex collective matrix factorization. AISTATS 2013.

slide-37
SLIDE 37

[Figure: block structure of the symmetric matrix Y built from X1, …, X4 over entity sets e1, …, e5; the joint embedding matrix U is partitioned into blocks Uk, one per entity type, zero outside the rows of that type, so each relation block of Y comes from the product of two blocks of U]
slide-38
SLIDE 38

Experiment: Yale dataset prediction

[Figure: augmented multi-view schema with relations X1, X2, X3 over entity types t1, t2, t3]

Augmented multi-view:

  • X1: left-lighting picture (Person e1, Pixel e2)
  • X2: right-lighting picture (Person e1, Pixel e3)
  • X3: pixel distance lower than a threshold (e2, e3)

Data:

  • 50×50 images → 2500 rows
  • 8 persons with both left- and right-lighting conditions
  • 7 pictures with only left-lighting conditions

Task: predicting the 7 missing right-lighting pictures.

slide-39
SLIDE 39

Experiment: gene expression data

[Figure: augmented multi-view schema with relations X1, X2, X3 over entity types t1, t2, t3]

Augmented multi-view:

  • X1: gene expression (Patients e1, Genes e2)
  • X2: copy number change (Patients e1, Genes e3)
  • X3: augmented matrix (e2, e3) representing the chromosomal proximity of genes in the two views

Data:

  • 40 patients with breast cancer
  • Views: 2 measurements of 4287 genes
  • Task: predicting random missing entries

Baselines:

  • CMF: no group sparsity
  • CCA: no chromosomal proximity
  • PCA: independent views
slide-40
SLIDE 40

Experiment: gene expression data

[Figure: relative RMSE of the compared methods]

40 patients with breast cancer. Views: 2 measurements of 4287 genes. Task: predicting random missing entries.

Group sparsity helps to prevent overfitting.
slide-41
SLIDE 41

Simulation: binary data

Ring-network CMF on binary matrices:

  • 5 factors shared by all matrices
  • 2 factors of low-rank view-specific noise

We learn the models with 10 + 2M factors, letting ARD prune out the extra ones. Task: prediction of the 40% missing entries.

[Figure: ring network of relations X1, …, X5 over entity types t1, …, t5]

slide-42
SLIDE 42

Tensor Factorization

slide-43
SLIDE 43

Linked Open Data

Generic knowledge bases are represented as multi-graphs, i.e. graphs with labelled edges.

[credits: M. Nickel]

slide-44
SLIDE 44

Tensor factorization models: a long history of methods

  • CP decomposition
  • Tucker decomposition
  • DEDICOM [Harshman, 1978]
  • RESCAL [Nickel & Tresp, 2010]

Good predictive capabilities in link prediction. RESCAL factorizes each slice k of the relation tensor Y as

  Y_k = U R_k Uᵀ

[Figure: the three-way data tensor Y]
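A toy gradient-descent sketch of a RESCAL-style factorization (mine; real implementations use alternating least squares):

```python
import numpy as np

def rescal_gd(Y, R, lr=0.01, n_iter=1000, seed=0):
    """Fit Y[k] ~ U Rk[k] U^T for each relation slice k by gradient descent.
    Y: (K, n, n) array, one n x n adjacency slice per relation."""
    rng = np.random.default_rng(seed)
    K, n, _ = Y.shape
    U = 0.1 * rng.standard_normal((n, R))
    Rk = 0.1 * rng.standard_normal((K, R, R))
    for _ in range(n_iter):
        gU = np.zeros_like(U)
        for k in range(K):
            E = U @ Rk[k] @ U.T - Y[k]               # residual of slice k
            gU += E @ U @ Rk[k].T + E.T @ U @ Rk[k]  # d loss / d U
            Rk[k] -= lr * (U.T @ E @ U)              # d loss / d Rk[k]
        U -= lr * gU
    return U, Rk
```

The key property is that the same entity embeddings U appear on both sides of every slice, which is what enables collective learning across relations.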

slide-45
SLIDE 45

Example: US-presidents data

Data extracted from DBpedia:

  • (vice-)presidents
  • parties
  • party memberships
  • information about who was (vice-)president of whom

Task: predict party membership for the persons in the dataset (link prediction). [Nickel, 2011]

slide-46
SLIDE 46

Visualization of latent variables

Data: Freebase (1 Billion entities, 23K relations!) Source: [Bordes et al., 2013]

slide-47
SLIDE 47

Visualization of latent variables (zoom)

Data: Freebase (1 Billion entities, 23K relations!) Source: [Bordes et al., 2013]

slide-48
SLIDE 48

Link prediction

slide-49
SLIDE 49

Collective Matrix and Tensor Factorization

slide-50
SLIDE 50

A family of models: matrix factorization, tensor factorization, coupled factorization.

[Figure: the three factorization families; credits: Evrim Acar]

slide-51
SLIDE 51

[Figure: plate diagram: latent u_i, v_j generate y_ij (n × m)]

Matrix factorization

slide-52
SLIDE 52

[Figure: plate diagram: a shared u_i with view-specific v_j, v′_j generates y_ij (n × m) and y′_ij (n × m′)]

Matrix co-factorization (multi-view)

slide-53
SLIDE 53

[Figure: three entity types u1, u2, u3 (sizes n1, n2, n3) linked by relations y1, y2, y3]

Simple collective factorizing model

slide-54
SLIDE 54

[Figure: four entity types u1, …, u4 linked by relations y1, …, y5]

More complex relational model

slide-55
SLIDE 55

[Figure: five entity types u1, …, u5 linked by relations y1, …, y6]

Collective Matrix and Tensor Factorization (CMTF)

slide-56
SLIDE 56

Collective Matrix and Tensor Factorization (CMTF)

[Figure: relational schema linking User, Item, Item feature, Tag and Comment entities through relations C1–C4]

Relational data, year 2012. Source: Flickr.

Several prediction tasks:

  • Item recommendation
  • Friend recommendation
  • Automatic tagging
  • Data cleaning
  • Predicting the comment type

slide-57
SLIDE 57

Flickr example

Several prediction tasks:

  • Item recommendation
  • Friend recommendation
  • Automatic tagging
  • Data cleaning
  • Predicting the comment type

Methods compared:

  • BPRA: Bayesian Probabilistic Relational Analysis
  • PRA: Probabilistic Relational Analysis
  • PMF: Probabilistic Matrix Factorization
  • BPMF: Bayesian Probabilistic Matrix Factorization
  • TF: Tensor Factorization
  • BPTF: Bayesian Probabilistic Tensor Factorization
  • CMF: Collective Matrix Factorization

[Figure: relational schema (User, Item, Item feature, Tag, Comment; relations C1–C4) and the data-imputation error rate for each method]

slide-58
SLIDE 58

Predictive queries

Probabilistic databases

slide-59
SLIDE 59

Semantic Search API: Predictive SPARQL

Core idea: learn a model on the KB → now we can query missing data! SPARQL is a standard query language for semantic data; Predictive SPARQL generalizes it to probabilistic models.

slide-60
SLIDE 60

Predictive query example

slide-61
SLIDE 61

Scalable Binary tensor factorization

slide-62
SLIDE 62

Binary tensor factorization in sublinear time

Many data matrices contain a lot of zeros that are not missing values. Positives: Ω⁺; negatives: Ω⁻. Can we compute the loss in fewer than O(|Ω⁺| + |Ω⁻|) operations? Answer: yes, for some problems we can do it in O(|Ω⁺|) operations. We apply this to binary matrix and tensor factorization.

[Figure: a large binary relation matrix]

slide-63
SLIDE 63

Quadratic loss is “sparse-friendly”

Observation matrix: Y ∈ ℝ^{n×m} with s non-zero elements, s ≪ min(n, m). Naïve time complexity of computing ‖Y − U Vᵀ‖_F² when r := rank(U Vᵀ) ≪ min(n, m): O(rnm). Do we have to pay that? No: we use the trace trick tr(AB) = tr(BA):

  ‖Y − U Vᵀ‖_F² = ‖Y‖_F² − 2 tr(Yᵀ U Vᵀ) + ‖U Vᵀ‖_F²
               = ‖Y‖_F² − 2 tr(Yᵀ U Vᵀ) + 1ᵀ ((UᵀU) ⊙ (VᵀV)) 1

The three terms cost O(s), O(rs) (only the non-zero positions of Y are needed) and O(r²(n + m)) respectively.

Complexity: O((r + 1)s + r²(n + m)).

[Figure: a sparse observation matrix]
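A runnable sketch of the sparse loss computation (my own, using SciPy's COO format):

```python
import numpy as np
from scipy.sparse import random as sparse_random

def sparse_frobenius_loss(Y, U, V):
    """||Y - U V^T||_F^2 without ever forming the dense n x m matrix U V^T.
    Y: scipy sparse matrix with s non-zeros; U: n x r; V: m x r."""
    Y = Y.tocoo()
    yy = (Y.data ** 2).sum()                                    # O(s)
    # <Y, U V^T> only needs U V^T at the non-zero positions of Y: O(rs)
    cross = (Y.data * np.einsum("ik,ik->i", U[Y.row], V[Y.col])).sum()
    low_rank = ((U.T @ U) * (V.T @ V)).sum()                    # O(r^2 (n + m))
    return yy - 2.0 * cross + low_rank

Y = sparse_random(1000, 800, density=0.01, format="coo")
U, V = np.random.randn(1000, 5), np.random.randn(800, 5)
dense = np.linalg.norm(Y.toarray() - U @ V.T) ** 2
assert np.allclose(sparse_frobenius_loss(Y, U, V), dense)
```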

slide-64
SLIDE 64

Non-quadratic loss

The previous trick does not apply to non-quadratic losses. Solution: upper-bound the loss by a quadratic function, using Jaakkola's local variational bound on the logistic function. Issue: the local variational parameter must be constant across entries for the trace trick to apply.

[Figure: lower-bounding the sigmoid by a scaled Gaussian]
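For completeness, the standard Jaakkola–Jordan bound referred to here: for any ξ,

  log σ(x) ≥ log σ(ξ) + (x − ξ)/2 − λ(ξ)(x² − ξ²),   with λ(ξ) = tanh(ξ/2) / (4ξ),

with equality at x = ±ξ. The right-hand side is quadratic in x, which restores the trace trick above, but only when the variational parameter ξ is shared across entries.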

slide-65
SLIDE 65

Bound refinement

[Figure: successive quadratic bounds b1(x), b2(x), b3(x), b3′(x) tightening around f(x)]

slide-66
SLIDE 66

Split the matrix to improve the bound

slide-67
SLIDE 67

Algorithm

slide-68
SLIDE 68

Results: time-accuracy trade-off

slide-69
SLIDE 69

Results: relational learning datasets

slide-70
SLIDE 70

Convex methods

slide-71
SLIDE 71

Learning distributed models

  min_{θ∈Θ} L(θ; D)

  1. Alternating optimization
  2. Gradient descent
  3. Proximal gradient
  4. ADMM

Convex methods: solve

  min_{X∈C} L(X; D)

over a convex set C, obtained by a convex relaxation or by a change of parameters.

slide-72
SLIDE 72

Introduction to convex factorization

Given Y ∈ ℝ^{n×m}, low-rank matrix approximation: Y ≈ U Vᵀ, where U ∈ ℝ^{n×R} and V ∈ ℝ^{m×R}. Can be obtained by SVD:

  min_{U,V} ‖Y − U Vᵀ‖_F

Missing values? Weighted SVD:

  min_{U,V} ‖W ⊙ (Y − U Vᵀ)‖_F

Intractable! → Use nuclear-norm regularization.

[Figure: Y ≈ U Vᵀ with inner dimension R]
slide-73
SLIDE 73

Convex factorization (nuclear norm)

Original problem:

  min_{U∈ℝ^{n×R}, V∈ℝ^{m×R}} ‖W ⊙ (Y − U Vᵀ)‖_F² + λ rank(U Vᵀ)

Equivalent problem (1):

  min_{X∈ℝ^{n×m}} ‖W ⊙ (Y − X)‖_F² + λ rank(X)

Relaxed problem (2):

  min_{X∈ℝ^{n×m}} ‖W ⊙ (Y − X)‖_F² + λ ‖X‖_*

where ‖X‖_* = Σ_{k=1}^{min(n,m)} σ_k(X) is the nuclear norm (the sum of the singular values).

(2) is convex, (1) is not!

[Figure: Y ≈ U Vᵀ with inner dimension R]
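A sketch of how (2) can be solved by proximal gradient descent; the prox of the nuclear norm is singular-value soft-thresholding (a standard result; this is my illustration rather than the talk's algorithm):

```python
import numpy as np

def svt(X, tau):
    """Prox of tau * nuclear norm: soft-threshold the singular values."""
    A, s, Bt = np.linalg.svd(X, full_matrices=False)
    return (A * np.maximum(s - tau, 0.0)) @ Bt

def complete(Y, W, lam, step=0.5, n_iter=300):
    """Proximal gradient for min_X ||W * (Y - X)||_F^2 + lam * ||X||_*.
    step = 0.5 matches the Lipschitz constant 2 of the smooth term."""
    X = np.zeros_like(Y)
    for _ in range(n_iter):
        grad = 2.0 * W * (X - Y)              # gradient of the data term
        X = svt(X - step * grad, step * lam)  # proximal step
    return X
```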

slide-74
SLIDE 74

Multi-view with overlapping trace norms

slide-75
SLIDE 75

Convex multi-view: experiments

slide-76
SLIDE 76

Convex collective matrix factorization

[Figure: three relations among user, movie and actor entities: Y1 (n1×n2), Y2 (n3×n2), Y3 (n1×n3)]

  min_{X1∈ℝ^{n1×n2}, X2∈ℝ^{n3×n2}, X3∈ℝ^{n1×n3}} Σ_{k=1}^{3} ‖Yk − Xk‖_F² + λ ‖(X1, X2, X3)‖_#

where the collective trace norm is

  ‖(X1, X2, X3)‖_# := min tr(Z) over Z ⪰ 0, with Z the symmetric block matrix whose diagonal blocks are A ∈ ℝ^{n1×n1}, B ∈ ℝ^{n2×n2}, C ∈ ℝ^{n3×n3} and whose off-diagonal blocks are X1, X2, X3.

Can be solved efficiently using convex methods. [B. et al., 2013]

slide-77
SLIDE 77

Example of collective trace norm

slide-78
SLIDE 78

Convex Tensor Factorization

Tomioka et al. (2013) proposed a relaxation of the tensor rank, the overlapped trace norm:

  ‖Y‖ = Σ_k ‖Y_(k)‖_*

where Y_(k) denotes the k-th tensor unfolding.
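A small sketch (mine) of evaluating this overlapped trace norm:

```python
import numpy as np

def overlapped_trace_norm(Y):
    """Sum of nuclear norms of the mode-k unfoldings of a tensor Y."""
    total = 0.0
    for k in range(Y.ndim):
        Yk = np.moveaxis(Y, k, 0).reshape(Y.shape[k], -1)   # mode-k unfolding
        total += np.linalg.svd(Yk, compute_uv=False).sum()  # nuclear norm
    return total

print(overlapped_trace_norm(np.random.randn(10, 8, 6)))
```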

slide-79
SLIDE 79

Bayesian methods

slide-80
SLIDE 80

Collective matrix factorization with group-sparse embeddings

[Equation: low-rank representation plus bias plus noise, as on slide 33]

Impose a group-sparse penalty on the embedding matrix U. Groups are determined by the entity types: one group per entity type and factor → NK groups. Automatic Relevance Determination (ARD): groups with large precisions are removed. Embeddings in the same group share the same precision [3].

[3] Arto Klami, Seppo Virtanen, and Samuel Kaski. Bayesian canonical correlation analysis. JMLR 14, 2013.
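A standard ARD construction consistent with [3] (a sketch; the exact hyper-parameterization used in the talk may differ): each group g of embedding weights shares one precision,

  p(U_g | α_g) = N(0, α_g⁻¹ I),   α_g ~ Gamma(a₀, b₀).

When a group is not needed, its inferred precision α_g grows, its embeddings shrink toward zero, and the corresponding factor is effectively pruned for that entity type.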

slide-81
SLIDE 81

Variational Bayes

slide-82
SLIDE 82

[Figure: block structure of the symmetric matrix Y built from X1, …, X4 over entity sets e1, …, e5, with the block-sparse embedding matrix U partitioned by entity type (repeated from slide 37)]

slide-83
SLIDE 83

Algorithm: Variational Bayes

Variational Bayes with alternating updates [4].

[4] Matthias Seeger and Guillaume Bouchard. Fast variational Bayesian inference for non-conjugate matrix factorization models. JMLR Proceedings Track 22, 2012.

slide-84
SLIDE 84

Simulation: VB vs. MAP

Setup: ring structure, Gaussian noise, entity-set sizes between 40 and 80. MAP requires heavy cross-validation: a grid of more than 1000 values for a0 = b0 and p0 = q0.

slide-85
SLIDE 85

Conclusion

slide-86
SLIDE 86

Conclusion

Take-home message

  • Learn more by fusing different sources
  • Automate machine learning tasks
  • Convexity
      – Good: theoretical guarantees + experts in optimization listened to you
      – Bad: does not tell you how to tune the parameters
  • Bayesian
      – Good: automatic tuning of parameters
      – Bad: no theoretical guarantees

Ongoing work

  • Convex and Bayesian: Is it possible?
  • Efficient flexible implementation of predictive queries
  • Privacy-preserving learning of embeddings
slide-87
SLIDE 87

References

1) Guillaume Bouchard, Shengbo Guo and Dawei Yin. Convex collective matrix factorization. AISTATS 2013.
2) Dawei Yin, Shengbo Guo, Boris Chidlovskii, Brian D. Davison, Cédric Archambeau and Guillaume Bouchard. Connecting comments and tags: improved modeling of social tagging systems. WSDM 2013: 547-556.
3) Matthias Seeger and Guillaume Bouchard. Fast variational Bayesian inference for non-conjugate matrix factorization models. JMLR Proceedings Track 22: 1012-1018 (2012).
4) Mohammad Emtiyaz Khan, Benjamin M. Marlin, Guillaume Bouchard and Kevin P. Murphy. Variational bounds for mixed-data factor analysis. NIPS 2010: 1108-1116.
5) Guillaume Bouchard, Behrouz Behmardi and Cedric Archambeau. Overlapping trace norms for multi-view learning. 2014. http://arxiv.org/abs/1404.6163.
6) Beyza Ermis and Guillaume Bouchard. Binary tensor factorization in sublinear time. UAI 2014.
7) Arto Klami, Abhishek Tripathi and Guillaume Bouchard. Efficient inference for binary matrix and tensor factorization. ICLR 2014.