Embedding as a Tool for Algorithm Design
SLIDE 1

Embedding as a Tool for Algorithm Design

Le Song

College of Computing, Center for Machine Learning, Georgia Institute of Technology

SLIDE 2

What is machine learning (ML)?

Design algorithms and systems that can improve their performance with data

The best design pattern for big data? Embedding structures.
SLIDE 3

Ex 1: Prediction for structured data

  • Code graphs: benign or malicious?
  • Drugs/materials: effective or ineffective?
  • Information spread: viral or non-viral?
  • Natural language: positive or negative?

SLIDE 4

Big dataset, explosive feature space

Harvard Clean Energy Project: 2.3 million organic materials. Predict power conversion efficiency (PCE), in the range 0-12%.

Method      Dimension     MAE
Level 6     1.3 billion   0.096
Embedding   0.1 million   0.085

(Figure: hierarchical structure elements, Level 1, Level 2, ..., counted into a sparse feature vector.)

Reduce model size by 10,000 times!

SLIDE 5

Ex 2: Social information network modeling

Who will do what, and when?

(Figure: interaction network among users Christine, Alice, David, and Jacob.)
SLIDE 6

Complex behavior not well modeled

(Figure: users $u_1, u_2, u_3$ interacting with items over time, e.g. David buys a shoe, Alice buys a book, bucketed into Epoch 1, Epoch 2, Epoch 3.)

(Figure: tensor factorization of the user × item × time tensor.)

Questions such a model leaves open: How long until the next interaction? How to deal with no data? Is there really no difference across epochs? How to predict future events?

(Plot: return time prediction, MAE in hours, on ~2 million internet TV views from 7,100 users and 385 programs.)

Reduce error by 5 fold!
SLIDE 7

Ex 3: Combinatorial optimizations over graphs

NP-hard problems

Application              Optimization Problem
Influence maximization   Minimum vertex/set cover
Community discovery      Maximum cut
Resource scheduling      Traveling salesman
SLIDE 8

Simple heuristics do not exploit data

2-approximation for minimum vertex cover. Repeat until all edges are covered:

  • 1. Select the uncovered edge with the largest total degree
  • 2. Add both of its endpoints to the cover

The decision is not data-driven. Can we learn from data?

(Plot: approximation ratios of the heuristic, between 1 and 1.3.)

Learn to be near optimal!
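For reference, here is a minimal Python sketch of this degree-based greedy heuristic (the function and variable names are my own, not from the talk):

```python
from collections import defaultdict

def greedy_vertex_cover(edges):
    """2-approximation sketch: repeatedly take the uncovered edge whose
    endpoints have the largest total degree and add both endpoints."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    cover, uncovered = set(), set(edges)
    while uncovered:
        # Select the uncovered edge maximizing the sum of endpoint degrees.
        u, v = max(uncovered, key=lambda e: degree[e[0]] + degree[e[1]])
        cover.update((u, v))
        uncovered = {e for e in uncovered if e[0] not in cover and e[1] not in cover}
    return cover

print(greedy_vertex_cover([(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]))
```

Every rule in this loop (which edge to pick, which endpoints to add) is hand-designed; the rest of the talk replaces such rules with learned, embedding-based decisions.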

SLIDE 9

Fundamental problems

  • How to describe a node?
  • How to describe the entire structure?
  • How to incorporate various information?
  • How to do it efficiently?

(Figure: a structure $\mathcal{G}$ whose nodes carry attributes/raw information $X_1, \dots, X_6$.)
SLIDE 10

Represent structure as latent variable model (LVM)

Joint likelihood over latent variables $H_i$ and observed features $X_i$ on the LVM $\mathcal{G} = (\mathcal{V}, \mathcal{E})$:

$$p(\{H_i\}, \{X_i\}) \propto \prod_{i \in \mathcal{V}} \Phi(H_i, X_i \mid \theta_v) \prod_{(i,k) \in \mathcal{E}} \Psi(H_i, H_k \mid \theta_e)$$

[Dai, Dai & Song 2016]

(Figure: the structure $\mathcal{G}$ with a latent variable $H_i$ attached to each observed node $X_i$.)

  • $\Phi(H_i, X_i)$: nonnegative node potential
  • $\Psi(H_i, H_k)$: nonnegative edge potential
  • $X_i$: categorical / continuous / raw features
  • $H_i$: continuous latent variables
SLIDE 11

Posterior distribution as features

Integrating out all latent variables except $H_i$ gives the posterior marginal

$$p(H_i \mid \{x_k\}) = \frac{\int p(\{H_k\}, \{x_k\}) \prod_{k \neq i} dH_k}{p(\{x_k\})}$$

[Dai, Dai & Song 2016]

On the LVM $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, these posteriors capture both nodal and topological information, and aggregate information from distant nodes.

  • Features of nodes: $\mu_1(\mathcal{G}, W), \mu_2(\mathcal{G}, W), \dots$
  • Features of the entire structure: $\bar{\mu}(\mathcal{G}, W) = \sum_i \mu_i(\mathcal{G}, W)$
SLIDE 12

Mean field algorithm aggregates information

Approximate the posterior $p(H_i \mid \{x_k\}) \approx q_i(h_i)$ via a fixed point update:

  • 1. Initialize $q_i(h_i)$, $\forall i$
  • 2. Iterate many times:

$$q_i(h_i) \leftarrow \Phi(h_i, x_i) \cdot \exp\Big( \sum_{k \in \mathcal{N}(i)} \int q_k(h_k) \log \Psi(h_i, h_k) \, dh_k \Big), \quad \forall i$$

[Song et al. 11a,b] [Song et al. 10a,b]

Operator view: each update reads a node's own features and its neighbors' current marginals,

$$q_i = \mathcal{T} \circ \big( x_i, \{ q_k(h_k) \}_{k \in \mathcal{N}(i)} \big)$$

(Figure: marginals $q_1(h_1), q_2(h_2), q_5(h_5), q_6(h_6)$ being updated on the graph with node potentials $\Phi(h_i, x_i)$ and edge potentials $\Psi(h_i, h_k)$.)

SLIDE 13

Embedding of distribution

Map a distribution from density space to a point in feature space [Smola, Gretton, Song & Scholkopf 2007]:

$$q(Y) \mapsto \mu_Y := \mathbb{E}_q[\phi(Y)], \qquad \phi(Y) = \big( Y, \; Y^2, \; Y^3, \; \dots \big)^\top$$

i.e. the mean, variance, and higher order moments.

  • The map is injective for a rich enough nonlinear feature $\phi(y)$: distinct distributions $q(Y)$ and $r(Y)$ yield distinct embeddings $\mathbb{E}_q[\phi(Y)]$ and $\mathbb{E}_r[\phi(Y)]$.
  • $\mu_Y$ is then a sufficient statistic of $q(Y)$.
  • Operator view: computations on the density can be carried out on the embedding, $\mathcal{T} \circ q(y) = \mathcal{T} \circ \mu_Y$.
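As a toy illustration (my own, not from the talk), the empirical version of this embedding is just a vector of sample moments, and distinct distributions separate in feature space:

```python
import numpy as np

def mean_embedding(samples, order=3):
    """Empirical E_q[phi(Y)] with phi(y) = (y, y^2, ..., y^order)."""
    return np.array([np.mean(samples ** k) for k in range(1, order + 1)])

rng = np.random.default_rng(0)
q_samples = rng.normal(0.0, 1.0, size=10_000)   # samples from q(Y)
r_samples = rng.normal(0.5, 1.0, size=10_000)   # samples from r(Y)

mu_q, mu_r = mean_embedding(q_samples), mean_embedding(r_samples)
# If phi is rich enough, the embeddings differ iff the distributions differ.
print(mu_q, mu_r, np.linalg.norm(mu_q - mu_r))
```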

SLIDE 14

Structure2vec (S2V): embedding mean field

(Figure: initial embeddings $\mu_1^{(0)}, \dots, \mu_6^{(0)}$, one per node of the graph.)

Approximate the embedding of $p(H_i \mid \{x_k\}) \mapsto \mu_i$ via a fixed point update:

  • 1. Initialize $\mu_i$, $\forall i$
  • 2. Iterate many times:

$$\mu_i \leftarrow \mathcal{T} \circ \big( x_i, \{ \mu_k \}_{k \in \mathcal{N}(i)} \big), \quad \forall i$$
SLIDE 15

Structure2vec (S2V): embedding mean field

(Figure: embeddings $\mu_1^{(1)}, \dots, \mu_6^{(1)}$ after one round of updates.)

Approximate the embedding of $p(H_i \mid \{x_k\}) \mapsto \mu_i$ via a fixed point update:

  • 1. Initialize $\mu_i$, $\forall i$
  • 2. Iterate many times:

$$\mu_i \leftarrow \mathcal{T} \circ \big( x_i, \{ \mu_k \}_{k \in \mathcal{N}(i)} \big), \quad \forall i$$
SLIDE 16

Structure2vec (S2V): embedding mean field

(Figure: embeddings $\mu_1^{(2)}, \dots, \mu_6^{(2)}$ after two rounds of updates.)

Approximate the embedding of $p(H_i \mid \{x_k\}) \mapsto \mu_i$ via a fixed point update:

  • 1. Initialize $\mu_i$, $\forall i$
  • 2. Iterate many times:

$$\mu_i \leftarrow \mathcal{T} \circ \big( x_i, \{ \mu_k \}_{k \in \mathcal{N}(i)} \big), \quad \forall i$$

How to parametrize $\mathcal{T}$? It depends on the unknown potentials $\Phi(H_i, X_i)$ and $\Psi(H_i, H_k)$.
SLIDE 17

Directly parameterize nonlinear mapping

  • E.g., assume $\mu_i \in \mathbb{R}^d$, $x_i \in \mathbb{R}^D$, and use a neural network parameterization:

$$\mu_i \leftarrow \mathcal{T} \circ \big( x_i, \{ \mu_k \}_{k \in \mathcal{N}(i)} \big) \quad \Longrightarrow \quad \mu_i \leftarrow \sigma\Big( W_1 x_i + W_2 \sum_{k \in \mathcal{N}(i)} \mu_k \Big)$$

where $W_1$ is a $d \times D$ matrix, $W_2$ is a $d \times d$ matrix, and $\sigma$ is a nonlinearity such as $\max(0, \cdot)$, $\tanh(\cdot)$, or $\mathrm{sigmoid}(\cdot)$. Any universal nonlinear function will do.

Learn with supervised, unsupervised, or reinforcement learning.
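A minimal NumPy sketch of this parameterized update (the weights, dimensions, and toy graph are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, n = 8, 4, 6                               # embedding dim d, feature dim D
adj = np.zeros((n, n))
for i, k in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]:
    adj[i, k] = adj[k, i] = 1                   # a small cycle graph
X = rng.random((n, D))                          # raw node features x_i
W1 = rng.normal(size=(d, D)) * 0.1              # d x D matrix
W2 = rng.normal(size=(d, d)) * 0.1              # d x d matrix

mu = np.zeros((n, d))                           # 1. initialize mu_i
for _ in range(4):                              # 2. iterate a few rounds
    # mu_i <- sigma(W1 x_i + W2 sum_{k in N(i)} mu_k), with sigma = max(0, .)
    mu = np.maximum(0.0, X @ W1.T + (adj @ mu) @ W2.T)

graph_embedding = mu.sum(axis=0)                # embedding of the entire structure
print(graph_embedding)
```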

SLIDE 18

Embedding belief propagation

Approximate $p(H_i \mid \{x_k\}; \theta)$ on the LVM $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ as

$$q_i(h_i) \propto \Phi(h_i, x_i \mid \theta) \prod_{k \in \mathcal{N}(i)} m_{ki}(h_i)$$

with iterative message updates:

  • 1. Initialize the messages $m_{ij}(h_j)$, $\forall i, j$
  • 2. Iterate many times:

$$m_{ij}(h_j) \leftarrow \int \Phi(h_i, x_i \mid \theta) \, \Psi(h_i, h_j \mid \theta) \prod_{\ell \in \mathcal{N}(i) \setminus j} m_{\ell i}(h_i) \, dh_i, \quad \forall i, j$$

[Song et al. 11a,b] [Song et al. 10a,b]

Operator view: messages and marginals are again aggregations,

$$m_{ij} = \mathcal{T} \circ \big( x_i, \{ m_{\ell i} \}_{\ell \in \mathcal{N}(i) \setminus j} \big), \qquad q_i = \mathcal{T}' \circ \big( x_i, \{ m_{\ell i} \}_{\ell \in \mathcal{N}(i)} \big)$$
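The same trick applies to messages: each directed edge carries an embedded message. A hypothetical NumPy sketch (all names, sizes, and the toy graph are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, n = 8, 4, 4
edges = [(0, 1), (1, 2), (2, 3), (1, 3)]                 # toy graph
directed = edges + [(k, i) for i, k in edges]
nbrs = {v: {u for u, w in directed if w == v} for v in range(n)}
X = rng.random((n, D))
W1 = rng.normal(size=(d, D)) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1

def agg(vectors):                                        # sum of message vectors
    return sum(vectors, np.zeros(d))

nu = {e: np.zeros(d) for e in directed}                  # embedded messages m_ij
for _ in range(4):
    # m_ij <- T(x_i, {m_li : l in N(i) \ j}), here a ReLU network
    nu = {(i, j): np.maximum(0.0, W1 @ X[i] + W2 @ agg(nu[(l, i)] for l in nbrs[i] - {j}))
          for (i, j) in directed}

# Node embeddings aggregate all incoming messages (the second operator T').
mu = [np.maximum(0.0, W1 @ X[i] + W2 @ agg(nu[(l, i)] for l in nbrs[i])) for i in range(n)]
print(mu[0])
```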

SLIDE 19

Ex 1: Prediction for structured data

  • Code graphs: benign or malicious?
  • Drugs/materials: effective or ineffective?
  • Information spread: viral or non-viral?
  • Natural language: positive or negative?

SLIDE 20

Algorithm learning

Given $n$ data points $\mathcal{G}_1, \mathcal{G}_2, \dots, \mathcal{G}_n$ and their labels $z_1, z_2, \dots, z_n$, estimate the parameters $u$ and $W$ via

$$\min_{u, W} \; L(u, W) := \sum_{j=1}^{n} \big( z_j - u^\top \bar{\mu}(W, \mathcal{G}_j) \big)^2$$

Computation          Operation                                     Similar to
Objective L(u, W)    A sequence of nonlinear mappings over graph   Graphical model inference
Gradient ∂L/∂W       Chain rule of derivatives in reverse order    Back propagation in deep learning
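A sketch of evaluating this objective under the Slide 17 parameterization (shapes and data are made up; an autodiff library would supply ∂L/∂W by exactly this reverse-order chain rule):

```python
import numpy as np

def embed_graph(adj, X, W1, W2, rounds=4):
    """Structure2vec-style embedding: iterate the update, then sum the nodes."""
    mu = np.zeros((X.shape[0], W1.shape[0]))
    for _ in range(rounds):
        mu = np.maximum(0.0, X @ W1.T + (adj @ mu) @ W2.T)
    return mu.sum(axis=0)                        # bar{mu}(W, G)

def objective(u, W1, W2, graphs, labels):
    """L(u, W) = sum_j (z_j - u^T bar{mu}(W, G_j))^2, graphs = [(adj, X), ...]."""
    preds = [u @ embed_graph(adj, X, W1, W2) for adj, X in graphs]
    return sum((z - p) ** 2 for z, p in zip(labels, preds))
```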

SLIDE 21

10,000x smaller model but accurate prediction

Harvard Clean Energy Project: predict material efficiency (0-12) for 2.3 million organic molecules; 90% of the data for training, 10% for testing.

Method           Test MAE   Test RMSE   # parameters
Mean predictor   1.986      2.406       1
WL level-3       0.143      0.204       1.6 m
WL level-6       0.096      0.137       1.3 b
S2V-MF           0.091      0.125       0.1 m
S2V-BP           0.085      0.117       0.1 m

~4% relative error

SLIDE 22

Ex 2: Social information network modeling

Who will do what, and when?

(Figure: interaction network among users Christine, Alice, David, and Jacob.)
SLIDE 23

Unroll: time-varying dependency structure

(Figure: interactions at times $t_0 < t_1 < t_2 < t_3$ unrolled into an LVM $\mathcal{G} = (\mathcal{V}, \mathcal{E})$: observed nodes $X_1, \dots, X_6$ carry the user/item raw features and the interaction time/context, and latent variables $H_1, \dots, H_9$ link successive interactions.)
SLIDE 24

Embed filtering/forward belief propagation

Each new embedding combines the embeddings of earlier interactions with the current raw features, e.g.

$$\mu_1 = \sigma(W_0 \cdot X_1), \qquad \mu_2 = \sigma(W_0 \cdot X_2)$$

$$\mu_4 = \sigma(W_1 \cdot \mu_1 + W_2 \cdot \mu_2 + W_3 \cdot X_4), \qquad \mu_5 = \sigma(W_1 \cdot \mu_1 + W_2 \cdot \mu_2 + W_3 \cdot X_4)$$

(Figure: the unrolled LVM $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ over times $t_0, t_1, t_2, t_3$, with user/item raw features and interaction time/context as observed nodes.)
SLIDE 25

Co-evolutionary embedding

(Figure: users Christine, Alice, David, and Jacob interacting with items; each interaction updates the item embedding $\mu_j(t)$ and the user embedding $\mu_v(t)$.)
SLIDE 26

Co-evolutionary embedding

(Figure: the user embeddings $\mu_v(t)$ and item embeddings $\mu_j(t)$ after the next interaction.)
SLIDE 27

Co-evolutionary embedding

(Figure: another interaction arrives; the affected user and item embeddings are updated.)
SLIDE 28

Co-evolutionary embedding

(Figure: the embeddings continue to co-evolve as interactions accumulate.)
SLIDE 29

Co-evolutionary embedding

(Figure: further interactions; the embeddings of the involved users and items are updated again.)
SLIDE 30

Co-evolutionary embedding

(Figure: the two alternating updates: U→I, the interacting user updates the item embedding $\mu_j(t)$, and I→U, the item updates the user embedding $\mu_v(t)$.)
SLIDE 31

From embedding to next interaction time

Link embedding with interaction data using a generative model.

(Figure: observed events $(v_1, j_1, t_1, c_1), \dots, (v_n, j_n, t_n, c_n)$ on a timeline starting at $t_0 = 0$; when and what is the next one?)

Density of the next interaction time between user $v$ and item $j$:

$$q_{vj}(t \mid t_n) = \lambda_{vj}(t \mid t_n) \, S_{vj}(t \mid t_n)$$

with survival function

$$S_{vj}(t \mid t_n) = \exp\Big( -\int_{t_n}^{t} \lambda_{vj}(\tau \mid t_n) \, d\tau \Big)$$

The intensity of interaction is determined by compatibility and time-lapse:

$$\lambda_{vj}(t \mid t_n) = \exp\big( \mu_v(t_n)^\top \mu_j(t_n) \big) \cdot (t - t_n)$$
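A small Python sketch of this generative model (the embeddings and the integration grid are illustrative assumptions): it evaluates the intensity, survival function, and density on a grid, which is enough to predict, say, the expected return time.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_v = rng.normal(size=8) * 0.1                  # user embedding at time t_n
mu_j = rng.normal(size=8) * 0.1                  # item embedding at time t_n
t_n = 0.0                                        # time of the last interaction
compat = np.exp(mu_v @ mu_j)                     # exp(mu_v^T mu_j): compatibility

ts = np.linspace(t_n, t_n + 50.0, 5001)
dt = ts[1] - ts[0]
lam = compat * (ts - t_n)                        # intensity lambda_vj(t | t_n)
surv = np.exp(-np.cumsum(lam) * dt)              # survival S_vj(t | t_n)
dens = lam * surv                                # density q_vj(t | t_n)

expected_return = float(np.sum(ts * dens) * dt)  # E[next interaction time]
print(expected_return)
```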

SLIDE 32

Embedding leads to better prediction

Reddit dataset: prediction of discussion forum participation; 1,000 users, 1,403 groups, ~10K interactions.

(Plots: next item prediction, MAR (mean absolute rank difference), and return time prediction, MAE (mean absolute error, in hours); lower is better, and the embedding model wins on both.)
SLIDE 33

GDELT database: events in news media. The full archive spans more than 215 years and trillions of events. Each event (knowledge item) consists of:

  • Subject --- relation --- object
  • Time

Temporal knowledge graph: what will happen next?
SLIDE 34

Reasoning over time I

An enemy's friend is an enemy.
SLIDE 35

Reasoning over time II

A friend's friend is a friend; a common enemy improves the bond.

EITC / EIDC / EIMC: some form of cooperation.
SLIDE 36

Ex 3: Combinatorial optimizations over graphs

NP-hard problems

Application              Optimization Problem
Influence maximization   Minimum vertex/set cover
Community discovery      Maximum cut
Resource scheduling      Traveling salesman
SLIDE 37

Combinatorial optimization as MDP

Minimum vertex cover: find the smallest number of nodes that covers all edges,

$$\min_{y_i \in \{0,1\}} \sum_{i \in \mathcal{V}} y_i \quad \text{s.t.} \quad y_i + y_k \geq 1, \; \forall (i,k) \in \mathcal{E}$$

Classical greedy heuristic. Repeat until all edges are covered:

  • 1. Compute the total degree of each uncovered edge
  • 2. Select both ends of the uncovered edge with the largest total degree

Instead, view it as a multistage decision making problem with per-step reward

$$r_t = \sum_{j \in \mathcal{V}} y_j^{(t)} - \sum_{j \in \mathcal{V}} y_j^{(t+1)} = -1$$

  • State $S$: the current set of selected nodes
  • Action value function: $Q(S, v)$
  • Greedy policy: $v^* = \mathrm{argmax}_v \, Q(S, v)$, then update the state $S$
SLIDE 38

Graph embedding for state-action value function

  • 1st iteration: embed the graph, then add the best node. Initial state: $S = \{ y_j = 0 \}_{j \in \mathcal{V}}$.
  • State-action value function: $Q(S, v) = \theta_1^\top \sigma(\theta_2 \, \bar{\mu} + \theta_3 \, \mu_v)$, combining the aggregated embedding $\bar{\mu}$ with the individual embedding $\mu_v$.
  • Greedy action: $v^* = \mathrm{argmax}_v \, Q(S, v)$; set $y_{v^*} = 1$ and keep the other $y_k = 0$.
  • 2nd iteration: re-embed the graph, add the next best node, and so on.
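A hypothetical end-to-end sketch of this greedy policy (reusing the Slide 17 embedding update; the weights are untrained random values, so this demonstrates the mechanics, not a learned heuristic):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
theta1 = rng.normal(size=d) * 0.1
theta2 = rng.normal(size=(d, d)) * 0.1
theta3 = rng.normal(size=(d, d)) * 0.1
W1, W2 = rng.normal(size=(d, 1)) * 0.1, rng.normal(size=(d, d)) * 0.1

def embed(adj, selected, rounds=4):
    """State-aware structure2vec: node feature x_i = 1 iff i is already selected."""
    n = adj.shape[0]
    x = np.array([[1.0 if i in selected else 0.0] for i in range(n)])
    mu = np.zeros((n, d))
    for _ in range(rounds):
        mu = np.maximum(0.0, x @ W1.T + (adj @ mu) @ W2.T)
    return mu

def q_value(mu, v):
    """Q(S, v) = theta1^T sigma(theta2 * sum_i mu_i + theta3 * mu_v)."""
    return theta1 @ np.maximum(0.0, theta2 @ mu.sum(axis=0) + theta3 @ mu[v])

def greedy_cover(adj):
    """Repeatedly add the node v maximizing Q(S, v) until all edges are covered."""
    n = adj.shape[0]
    selected = set()
    uncovered = {(i, k) for i in range(n) for k in range(i + 1, n) if adj[i, k]}
    while uncovered:
        mu = embed(adj, selected)                 # re-embed the graph in state S
        v = max(set(range(n)) - selected, key=lambda u: q_value(mu, u))
        selected.add(v)
        uncovered = {e for e in uncovered if v not in e}
    return selected

adj = np.zeros((4, 4))
for i, k in [(0, 1), (1, 2), (2, 3), (1, 3)]:
    adj[i, k] = adj[k, i] = 1
print(greedy_cover(adj))
```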

SLIDE 39

Embedding leads to better heuristic algorithm

Minimum vertex cover: smallest number of nodes to cover all edges. Evaluated on a distribution of scale-free networks; the optimum is approximated by running CPLEX for 1 hour.

(Plot: the learned heuristic attains approximation ratio ≈ 1.)
SLIDE 40

Training converges quite fast

Pre-training: initialize the embedding parameters with ones trained on smaller networks.
SLIDE 41

Also good for traveling salesman problem

(Figure: optimal tours vs. embedding-based tours; the learned tours are 0.07% and 0.5% longer than optimal.)
SLIDE 42

Embedding as a tool for algorithm design

(Figure: recap: from the LVM $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ and posteriors $p(H_i \mid \{x_k\})$ to embeddings of nodes, $\mu_1(\mathcal{G}, W), \mu_2(\mathcal{G}, W), \dots$, and the embedding of the entire structure, $\bar{\mu}(\mathcal{G}, W) = \sum_i \mu_i(\mathcal{G}, W)$.) [Dai, Dai & Song 2016]

  • Embedding structures
  • Learn better? Nonconvex & RL?
  • New system & programming language?