Embedding as a Tool for Algorithm Design
Le Song, College of Computing, Center for Machine Learning, Georgia Institute of Technology

What is machine learning (ML)? Design algorithms and systems that can learn from data.
Structured data are everywhere:
- Code graphs: benign or malicious?
- Drugs/materials: effective or ineffective?
- Information spread: viral or non-viral?
- Natural language: positive or negative?

Example: predict the power conversion efficiency (PCE, 0-12%) of 2.3 million candidate materials.

  Method     Dimension    MAE
  Level 6    1.3 billion  0.096
  Embedding  0.1 million  0.085
[Figure: hierarchical structure elements (Level 1 node types, Level 2 neighborhoods, ...) counted into a sparse feature vector]
Reduce model size by 10,000 times!
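The hierarchical structure counting above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function name, graph encoding, and toy molecule are assumptions. Level 1 counts node labels; each further level counts signatures refined by the neighbors' signatures.

```python
from collections import Counter

def structure_features(adj, labels, levels=2):
    """Hypothetical sketch of hierarchical substructure counting.
    Level 1 counts node labels; level 2 counts (label, sorted
    neighbor-label) patterns, and so on."""
    feats = Counter()
    current = {v: (labels[v],) for v in adj}
    for level in range(1, levels + 1):
        for v, sig in current.items():
            feats[(level, sig)] += 1
        # refine each node's signature with its neighbors' signatures
        current = {v: (current[v], tuple(sorted(current[u] for u in adj[v])))
                   for v in adj}
    return feats

# Toy molecule-like graph: a triangle of carbons with one oxygen attached
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
labels = {0: "C", 1: "C", 2: "C", 3: "O"}
fv = structure_features(adj, labels)
```

The feature vector is sparse: only substructures that actually occur get a count, which is why it can be far smaller than enumerating all Level-6 substructures.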
Recommendation over time: users (Christine, Alice, David, Jacob) interact with items (shoe, book) along a timeline; tensor factorization discretizes time into epochs (Epoch 1, 2, 3) with latent factors u_1, u_2, u_3.
Open questions: How long until a user returns? How to deal with epochs with no data? Is there really no difference between epochs? Can we predict future events?
[Figure: return-time prediction error (MAE, hours), log scale 1-100]
Reduce error by 5 fold! (7,100 users, 385 programs, ~2 million internet TV views)
Applications and their optimization problems over graphs: influence maximization, community discovery, resource scheduling, minimum vertex/set cover, maximum cut, traveling salesman.
2-approximation for minimum vertex cover:
Repeat till all edges are covered:
  1. Select the uncovered edge with the largest total degree, and add both endpoints to the cover.
The decision rule is not data-driven. Can we learn from data?
[Figure: approximation ratio, scale 1 to 1.3]
Learn to be near optimal!
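The greedy rule above can be sketched as follows; this is an illustrative baseline, with the standard step of adding both endpoints of the chosen edge:

```python
def greedy_vertex_cover(edges):
    """2-approximation sketch: repeatedly pick the uncovered edge whose
    endpoints have the largest total degree, add both endpoints."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    cover, uncovered = set(), set(edges)
    while uncovered:
        u, v = max(uncovered, key=lambda e: degree[e[0]] + degree[e[1]])
        cover |= {u, v}
        uncovered = {e for e in uncovered
                     if e[0] not in cover and e[1] not in cover}
    return cover

# Star graph: center 0 connected to 1..4. The heuristic returns a
# 2-node cover, while the optimal cover is just {0} -- exactly the kind
# of fixed, non-data-driven decision rule the talk wants to improve on.
edges = [(0, i) for i in range(1, 5)]
cover = greedy_vertex_cover(edges)
```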
Represent each data point as a graphical model over latent variables H_1, ..., H_6, one per node, each attached to its attributes/raw info x_j:
  p({H_j} | {x_j}) ∝ ∏_{j∈V} Ψ_w(H_j, x_j | θ) · ∏_{(j,k)∈E} Ψ_f(H_j, H_k | θ_f)
[Dai, Dai & Song 2016]
Nonnegative node potentials Ψ_w(H_j, x_j) connect each continuous latent variable H_j to its categorical/continuous/raw features x_j; nonnegative edge potentials Ψ_f(H_j, H_k) connect neighboring latent variables.
The posterior marginal of each latent variable integrates out all the others:
  p(H_j | {x_k}) = ∫ p({H_k}, {x_k}) d{all H_k except H_j} / p({x_k})
[Dai, Dai & Song 2016]
Mean-field inference approximates each posterior marginal with q_j(H_j); the fixed point satisfies
  log q_j(h_j) = const + log Ψ_w(h_j, x_j) + Σ_{k∈N(j)} ∫ q_k(h_k) log Ψ_f(h_j, h_k) dh_k
and the marginals q_1(H_1), q_2(H_2), ..., q_6(H_6) are represented by embeddings μ_1(χ, θ), μ_2(χ, θ), ..., aggregated into μ_G(χ, θ).
[Song et al. 10a,b] [Song et al. 11a,b]
How to represent a distribution as a vector? Embed it via its moments (mean, variance, higher moments):
  φ(X) = (X, X², X³, ...)ᵀ,   μ_X = 𝔼_p[φ(X)],   μ_X′ = 𝔼_q[φ(X)]
Different distributions p(X) and q(X) yield different embeddings μ_X and μ_X′.
[Smola, Gretton, Song & Scholkopf 2007]
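The moment embedding above is easy to estimate empirically. A small sketch (the sampled distributions are chosen for illustration): a standard normal and a uniform with matched mean and variance are indistinguishable by their first two moments, but the fourth moment separates their embeddings.

```python
import numpy as np

def moment_embedding(samples, order=3):
    """Empirical estimate of E[phi(X)] with phi(X) = (X, X^2, ..., X^order)."""
    x = np.asarray(samples, dtype=float)
    return np.array([np.mean(x ** k) for k in range(1, order + 1)])

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, 100_000)                   # N(0, 1)
q = rng.uniform(-np.sqrt(3), np.sqrt(3), 100_000)   # same mean and variance
mu_p = moment_embedding(p, order=4)
mu_q = moment_embedding(q, order=4)
# mu_p and mu_q agree in entries 1-2 but differ in the 4th moment
# (about 3 for the normal vs. 1.8 for the uniform).
```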
[Figures: each node j carries an embedding μ_j^(t); starting from initial values μ_1^(0), ..., μ_6^(0), the embeddings are updated over iterations t = 1, 2 by aggregating neighboring embeddings]
Parameterize the fixed-point update as one nonlinear mapping per iteration:
  μ_j^(t+1) = σ( W_1 x_j + W_2 Σ_{k∈N(j)} μ_k^(t) )
where the nonlinearity σ(·) can be max(0, ·), tanh(·), or sigmoid(⋅), and W_1, W_2 are d × d matrices.
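The iterative update above can be sketched in a few lines. Names and dimensions are illustrative; for generality this sketch lets W_1 be d × p for p-dimensional raw features (the slide's d × d case is p = d):

```python
import numpy as np

def embed_graph(adj, X, W1, W2, iters=2):
    """Sketch of mu_j^{(t+1)} = relu(W1 x_j + W2 * sum_{k in N(j)} mu_k^{(t)}),
    starting from mu^{(0)} = 0."""
    n, d = len(adj), W1.shape[0]
    mu = np.zeros((n, d))
    for _ in range(iters):
        agg = np.array([mu[adj[j]].sum(axis=0) for j in range(n)])
        mu = np.maximum(0.0, X @ W1.T + agg @ W2.T)
    return mu

rng = np.random.default_rng(1)
adj = [[1, 2], [0], [0]]             # small 3-node graph (node 0 is the hub)
X = rng.normal(size=(3, 5))          # raw node features, p = 5
W1 = rng.normal(size=(4, 5)) * 0.1   # d x p
W2 = rng.normal(size=(4, 4)) * 0.1   # d x d
mu = embed_graph(adj, X, W1, W2)
mu_G = mu.sum(axis=0)                # aggregated graph-level embedding
```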
Alternatively, embed the belief propagation fixed point:
  m_{jk}(H_k) ← ∫ Ψ_w(H_j, x_j | θ) Ψ_f(H_j, H_k | θ) ∏_{ℓ∈N(j)∖k} m_{ℓj}(H_j) dH_j,  ∀ j, k
  q_j(H_j) = Ψ_w(H_j, x_j | θ) · ∏_{k∈N(j)} m_{kj}(H_j)
[Song et al. 10a,b] [Song et al. 11a,b]
With the node and edge potentials Ψ_w(H_j, x_j) and Ψ_f(H_j, H_k) embedded, we can predict over structured data: code graphs benign or malicious? Drugs/materials effective or ineffective? Information spread viral or non-viral? Natural language positive or negative?
Supervised learning objective over n labeled graphs χ_1, ..., χ_n:
  min_{W,θ} L(W, θ) := Σ_{j=1}^n ( y_j − Wᵀ μ_G(χ_j, θ) )²
The objective composes a sequence of nonlinear mappings over the graph (graphical model inference), and the gradient ∂L/∂θ follows the chain rule of derivatives in reverse order, i.e., back-propagation as in deep learning.
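For the outer layer the gradient is explicit. A minimal sketch, with embeddings held fixed (random stand-ins for μ_G(χ_j, θ)): the analytic gradient of the squared loss with respect to W matches a finite-difference check, which is exactly what back-propagation computes layer by layer.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 8
mu = rng.normal(size=(n, d))   # stand-ins for graph embeddings mu_G(chi_j, theta)
y = rng.normal(size=n)         # labels
W = rng.normal(size=d)

def loss(W):
    r = y - mu @ W
    return float(r @ r)

# Analytic gradient of sum_j (y_j - W^T mu_j)^2 with respect to W
grad = -2.0 * mu.T @ (y - mu @ W)

# Central finite-difference check, one coordinate at a time
eps = 1e-6
fd = np.array([(loss(W + eps * np.eye(d)[k]) - loss(W - eps * np.eye(d)[k]))
               / (2 * eps) for k in range(d)])
```

The gradient with respect to θ would additionally flow through μ_G by the chain rule, which is the back-propagation step the slide refers to.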
Temporal user-item interactions again: users (Christine, Alice, David, Jacob) interact with items along a timeline starting at t_0.
Represent the interaction history as a graph linking user and item latent variables across events, with user/item raw features and interaction time/context attached to the nodes.
Coevolving embeddings, initialized from raw features and updated at each interaction (σ is a nonlinearity; W_* parameterize one side, V_* the other):
  μ_1 = σ(W_0 · x_1),   μ_2 = σ(V_0 · x_2)
  μ_4 = σ(W_1 · μ_1 + W_2 · μ_2 + W_3 · x_4)
  μ_5 = σ(V_1 · μ_1 + V_2 · μ_2 + V_3 · x_4)
Each event updates both the user's and the item's embedding from both previous embeddings plus the interaction time/context features x_4.
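One coevolution step can be sketched directly from the updates above; parameter names and dimensions are illustrative assumptions:

```python
import numpy as np

sigma = np.tanh  # nonlinearity (illustrative choice)

rng = np.random.default_rng(3)
d, p = 4, 3
# Hypothetical parameters: W_* update the user side, V_* the item side
W1, W2, W3 = (rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1,
              rng.normal(size=(d, p)) * 0.1)
V1, V2, V3 = (rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1,
              rng.normal(size=(d, p)) * 0.1)

mu_user = sigma(rng.normal(size=d))   # mu_1: current user embedding
mu_item = sigma(rng.normal(size=d))   # mu_2: current item embedding
x_event = rng.normal(size=p)          # x_4: interaction time/context features

# One interaction event updates both sides from both old embeddings:
mu_user_new = sigma(W1 @ mu_user + W2 @ mu_item + W3 @ x_event)  # mu_4
mu_item_new = sigma(V1 @ mu_user + V2 @ mu_item + V3 @ x_event)  # mu_5
```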
Model the timing of events with a temporal point process. Given the history (u_1, j_1, t_1, q_1), ..., (u_n, j_n, t_n, q_n) on a timeline starting at t_0 = 0, an intensity λ(t) determines:
  density function: f(t) = λ(t) · S(t)
  survival function: S(t) = exp(−∫_{t_n}^t λ(τ) dτ)
with the intensity driven by the compatibility of the user and item embeddings (e.g., via μ_u(t)ᵀ μ_j(t)).
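The density/survival relation above makes return times easy to simulate. A minimal sketch for the constant-intensity special case (the general time-varying intensity would be sampled by thinning instead):

```python
import math, random

def sample_return_time(lam, t_last=0.0):
    """Inverse-transform sampling for constant intensity lam:
    S(t) = exp(-lam * (t - t_last)), so t = t_last - ln(U) / lam."""
    u = random.random()
    return t_last - math.log(u) / lam

random.seed(0)
lam = 2.0  # events per hour (illustrative rate)
gaps = [sample_return_time(lam) for _ in range(200_000)]
mean_gap = sum(gaps) / len(gaps)   # should approach 1 / lam = 0.5 hours
```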
Applications map to optimization problems over graphs: influence maximization, community discovery, resource scheduling, minimum vertex/set cover, maximum cut, traveling salesman.
These can be written over binary decision variables y_j ∈ {0,1}, e.g., minimize Σ_{j∈V} y_j for minimum vertex cover.
Cast node selection as a multistage decision-making problem with per-step reward
  r_t = Σ_{j∈V} y_j^(t) − Σ_{j∈V} y_j^(t+1) = −1
(each step adds one node). State S: current set of nodes selected. Action-value function: Q(S, j). Greedy policy: j* = argmax_j Q(S, j); then update state S.
1st iteration: embed the graph with state S = {y_j = 0}_{j∈V}; compute the state-action value
  Q(S, j) = θ_1 σ(θ_2 μ_G + θ_3 μ_j)
(aggregated embedding μ_G, individual embedding μ_j); take the greedy action j* = argmax_j Q(S, j) and add the best node.
2nd iteration: set y_{j*} = 1 (other y_k = 0), re-embed the graph, and add the next best node.
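The embed/score/add loop above can be sketched as follows. This is an untrained, illustrative version: parameter names, the relu nonlinearity, and appending the state as an extra node feature are assumptions, not the talk's exact architecture.

```python
import numpy as np

def greedy_solve(adj, X, W1, W2, theta1, theta2, theta3, budget):
    """Greedy sketch: re-embed the graph, score each node with
    Q(S, j) = theta1 . relu(theta2 mu_G + theta3 mu_j), add the argmax."""
    n = len(adj)
    selected = np.zeros(n, dtype=bool)
    for _ in range(budget):
        # re-embed with the current state appended as an extra feature
        feats = np.hstack([X, selected[:, None].astype(float)])
        mu = np.maximum(0.0, feats @ W1.T)
        for _ in range(2):
            agg = np.array([mu[adj[j]].sum(axis=0) for j in range(n)])
            mu = np.maximum(0.0, feats @ W1.T + agg @ W2.T)
        mu_G = mu.sum(axis=0)  # aggregated embedding
        q = np.array([theta1 @ np.maximum(0.0, theta2 @ mu_G + theta3 @ mu[j])
                      for j in range(n)])
        q[selected] = -np.inf  # never re-add a selected node
        selected[int(np.argmax(q))] = True
    return selected

rng = np.random.default_rng(4)
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
n, p, d = 4, 2, 5
X = rng.normal(size=(n, p))
W1 = rng.normal(size=(d, p + 1)) * 0.2
W2 = rng.normal(size=(d, d)) * 0.2
theta1 = rng.normal(size=d)
theta2 = rng.normal(size=(d, d))
theta3 = rng.normal(size=(d, d))
sel = greedy_solve(adj, X, W1, W2, theta1, theta2, theta3, budget=2)
```

In the actual method the parameters would be trained by reinforcement learning so that the greedy policy approaches the optimal solution.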
Minimum vertex cover experiment: find the smallest number of nodes to cover all edges, over a distribution of scale-free networks; the optimum is approximated by running CPLEX for 1 hour.
Pre-training: initialize embedding parameters with ones trained on smaller networks.
[Figure: solution quality; embedding solutions are only 0.07% and 0.5% longer than the optimal]
Recap: posterior marginals p(H_1 | {x_k}), p(H_2 | {x_k}), ... of the graphical model are embedded into μ_1(χ, θ), μ_2(χ, θ), ..., aggregated into the graph embedding μ_G(χ, θ), and learned end-to-end for prediction, temporal modeling, and algorithm design.
[Dai, Dai & Song 2016]