SLIDE 1 Fully Distributed EM for Very Large Datasets
Jason Wolfe Aria Haghighi Dan Klein
Computer Science Division UC Berkeley
SLIDE 2 Overview
Task: unsupervised learning via EM
Focus: models with many local parameters (each relevant to only a few data points)
Approach: fully distributed, localized EM
⋆ parameter locality → less bandwidth
[Figure: millions of data points (parallel Arabic-English sentence pairs; the Arabic is the translation of "US Hosts Middle East Peace Conference Next Week") and millions of parameters spread across numbered nodes, contrasting useful work with communication]
SLIDE 5
Outline
Running example: IBM Model 1 for word alignment
Naive distributed EM
Efficiently distributed EM
SLIDE 6 Word alignment for machine translation
Goal: parallel sentences → word-level translation model
Parameters θ_{st}: the probability that Spanish word s translates to English word t
θ = (θ_{la,the}, θ_{la,chair}, θ_{la,table}, θ_{silla,the}, θ_{silla,chair}, θ_{mesa,the}, θ_{mesa,table})
The unobserved true alignments here correspond to θ_{la,the} = 1.0, θ_{silla,chair} = 1.0, θ_{mesa,table} = 1.0, and 0.0 for all other entries
[Figure: a corpus of parallel sentences ("la silla" / "the chair", "la mesa" / "the table"), the possible alignment arcs, and the unobserved true alignments]
SLIDE 9 IBM Model 1 for word alignment
[Figure: example sentence pair "a Steve no le gustan las ferias grandes" / "Steve does not like big ferris wheels", with candidate alignment arcs filled in one target word at a time]
Each target word is generated by exactly one source word, chosen uniformly at random
IBM Model 1: a simple generative model
For each target position i, independently (sketched in code below):
  choose a source index a_i uniformly at random
  choose a target word T_i ∼ θ_{S_{a_i}, ·}
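A minimal Python rendering of this generative story (an illustration, not code from the talk; generate_target and its arguments are made-up names, and the target length is taken as given rather than modeled):

import random

def generate_target(source_words, target_length, theta):
    # theta[s][t] = p(t | s), the translation distribution of source word s
    target = []
    for _ in range(target_length):
        # choose a source index a_i uniformly at random
        a_i = random.randrange(len(source_words))
        s = source_words[a_i]
        # choose the target word T_i ~ theta_{S_{a_i}, .}
        words = list(theta[s])
        weights = [theta[s][t] for t in words]
        target.append(random.choices(words, weights=weights)[0])
    return target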
SLIDE 19 EM algorithm for IBM Model 1
θ ← some initial guess, e.g. θ_{la,the} = .33, θ_{la,chair} = .33, θ_{la,table} = .33, θ_{silla,the} = .5, ...
Iterate (a code sketch follows below):
1. E-step: estimate alignment counts η
   (a) compute posteriors p(a_i | θ); for "the" in "la silla" / "the chair": .6 = .5/(.33+.5) for silla, and .33/(.33+.5) = .4 for la
   (b) aggregate into expected counts η_{st} (the expected number of times s aligns to t under θ): η_{st} ← Σ over sentence pairs and positions i with T_i = t of θ_{st} / Σ_{s' ∈ S} θ_{s',t}
       e.g. η_{la,the} = .8, η_{la,chair} = .4, η_{la,table} = .4, η_{silla,the} = .6, ...
2. M-step: normalize η to get the new maximum-likelihood θ: θ_{st} ← η_{st} / Σ_{t'} η_{st'}
   e.g. θ_{la,the} = .5, θ_{la,chair} = .25, θ_{la,table} = .25, θ_{silla,the} = .5, ...
[Figure: posterior weights .6 and .4 on the alignment arcs of "la silla" / "the chair" and "la mesa" / "the table"]
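The sketch below spells out one such iteration on a single machine (made-up names, not the talk's code); theta maps each co-occurring (s, t) pair to its current probability:

from collections import defaultdict

def em_iteration(corpus, theta):
    # corpus: list of (source_words, target_words) sentence pairs
    eta = defaultdict(float)
    for source, target in corpus:
        for t in target:
            # E-step (a): normalizer over all candidate sources for t
            z = sum(theta[(s, t)] for s in source)
            for s in source:
                # E-step (b): add the posterior weight to eta_{st}
                eta[(s, t)] += theta[(s, t)] / z
    # M-step: theta_{st} <- eta_{st} / sum_{t'} eta_{st'}
    totals = defaultdict(float)
    for (s, t), c in eta.items():
        totals[s] += c
    return {(s, t): c / totals[s] for (s, t), c in eta.items()}

Run repeatedly on the toy corpus, this reproduces the numbers on the slide: θ_{la,the} moves from .33 to .5 after one iteration and keeps climbing.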
SLIDE 24 EM example continued
[Figure animation, E-steps 1, 2, 3, 4, 5, ..., ∞: the posterior alignment arcs for "la silla" / "the chair" and "la mesa" / "the table" sharpen with each iteration, converging to the true alignments]
SLIDE 30 UN Arabic-English TIDES v2 corpus
[Figure: parallel Arabic-English sentence pairs from the corpus; the Arabic is the translation of "US Hosts Middle East Peace Conference Next Week"]
2.9 million sentence pairs from UN proceedings
243 million unique word pairs (pairs that co-occur in some sentence pair, so a translation is possible)
⇒ 243M parameters in θ and 243M counts in η
Even fitting all (indexed) parameters in the memory of a 32-bit machine can be challenging
SLIDE 31
Outline
Running example: IBM Model 1 for word alignment
Naive distributed EM
Efficiently distributed EM
SLIDE 32 Previous approach: distributing the E-step
[Figure: sentence pairs "la silla" / "the chair" and "la mesa" / "the table" partitioned over E-step nodes 1 and 2; each node ships its partial counts (η_{la,the}, η_{la,chair}, η_{silla,the}, η_{silla,chair}; η_{la,the}, η_{la,table}, η_{mesa,the}, η_{mesa,table}) to a central Reduce/M-step node, which ships all of θ (7 params) back to each node]
1. E-step computations distribute easily: partition the data over k nodes; alignments are independent given θ
2. Nodes communicate partial counts to a central Reduce node
3. The Reduce node does the global M-step
4. Reduce sends the new parameters back
Remaining problems:
  Memory at the Reduce node
  C-step (communication) bandwidth: 5.5B numbers per iteration (on the full dataset with 20 nodes; see the schematic sketch below)
(Chu et al. 2006, Dyer et al. 2008, Newman et al. 2008, ...)
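To make the communication pattern concrete, here is a schematic single-process stand-in for this MapReduce scheme (made-up names; e_step_counts plays the role of the per-node E-step, and in a real cluster each shard and each partial count table lives on a different machine):

from collections import defaultdict

def naive_distributed_iteration(shards, theta, e_step_counts):
    # "Map": each node runs the E-step on its own shard of the data
    partials = [e_step_counts(shard, theta) for shard in shards]
    # C-step: every node ships all of its partial counts to one Reduce node
    eta = defaultdict(float)
    for partial in partials:
        for key, count in partial.items():
            eta[key] += count
    # Global M-step at the Reduce node
    totals = defaultdict(float)
    for (s, t), c in eta.items():
        totals[s] += c
    new_theta = {(s, t): c / totals[s] for (s, t), c in eta.items()}
    # Reduce ships the complete new theta back to every node
    return new_theta

Both shipping steps touch (almost) every parameter on every node, which is why the C-step bandwidth scales with the number of parameters times the number of nodes.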
SLIDE 38
Speedup (on 200K total sentence pairs)
[Chart: iteration time (s, 50-250) vs. number of E-step nodes (1, 2, 5, 10, 20) for MapReduce, broken into E-step, C-step, and M-step time]
SLIDE 39
Common practical solutions
Memory and bandwidth are real problems in practice
Workarounds:
  Use less data
  Ignore rare words
  Train on independent chunks
  Swap to disk
  Distribute over multiple machines
SLIDE 40
Outline
Running example: IBM Model 1 for word alignment
Naive distributed EM
Efficiently distributed EM
SLIDE 41 Distributing the M-step locally
[Figure: paired E-step/M-step nodes for "la silla" / "the chair" and "la mesa" / "the table"; node 1 stores only θ_{la,the}, θ_{la,chair}, θ_{silla,the}, θ_{silla,chair}, node 2 only θ_{la,the}, θ_{la,table}, θ_{mesa,the}, θ_{mesa,table}; Reduce now passes the complete counts η back instead of parameters]
Distribute the M-step alongside the E-step:
  Nodes store only the parameters they need and compute them locally
  Reduce passes back counts rather than parameters
  A node doesn't need to hear about irrelevant source words
  A node doesn't need to tell (or hear) about purely local source words
  But it does need to hear everything about each of its source words, because of the M-step denominator: θ_{st} ← η_{st} / Σ_{t'} η_{st'} (see the sketch below)
Bandwidth savings: 30%
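A minimal sketch of the per-node M-step this implies (made-up names; it assumes the node has already received the complete, globally summed counts for every source word it owns):

from collections import defaultdict

def local_m_step(eta_complete):
    # eta_complete: complete counts eta_{st} for each owned source word s,
    # over all targets t' (required for the denominator sum_{t'} eta_{st'})
    totals = defaultdict(float)
    for (s, t), c in eta_complete.items():
        totals[s] += c
    return {(s, t): c / totals[s] for (s, t), c in eta_complete.items()}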
SLIDE 53 Augmenting η to increase locality
[Figure: with augmented counts, each node exchanges only η_{la,the} and η_{la}; counts for the purely local words silla and mesa never leave their nodes]
Augment η with the redundant totals η_s = Σ_{t'} η_{st'}, accumulated during the E-step
The M-step becomes θ_{st} ← η_{st} / η_s (see the sketch below)
  Increases locality
  Total bandwidth savings: 84% (bigger with more nodes)
  Similar tricks work for other models
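A sketch of the augmented E-step and the purely local M-step it enables (made-up names, same toy data structures as before):

from collections import defaultdict

def e_step_with_totals(shard, theta):
    eta = defaultdict(float)     # eta[(s, t)]
    eta_s = defaultdict(float)   # redundant totals eta_s = sum_{t'} eta_{st'}
    for source, target in shard:
        for t in target:
            z = sum(theta[(s, t)] for s in source)
            for s in source:
                p = theta[(s, t)] / z
                eta[(s, t)] += p
                eta_s[s] += p    # accumulated during the E-step
    return eta, eta_s

def m_step_from_totals(eta, eta_s):
    # Once (eta, eta_s) for shared source words have been summed across
    # nodes, each node normalizes locally: theta_{st} <- eta_{st} / eta_s
    return {(s, t): c / eta_s[s] for (s, t), c in eta.items()}

Only the entries for source words shared with other nodes ever need to be communicated; everything else is consumed where it was produced.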
SLIDE 57 Choice of C-step topology
[Figure: combined EM nodes exchange η_{la,the} and η_{la} directly with each other; several example communication topologies over EM nodes 1-5]
No need for separate Reduce nodes
By choosing the connectivity, we can trade off:
  bandwidth
  latency
  locality
  ...
SLIDE 59 ALLPAIRS topology
[Figure: EM nodes 1-4 with every pair directly connected]
[Chart: iteration time (s, 50-250) vs. number of nodes (1, 2, 5, 10, 20) for AllPairs, broken into E-step, C-step, and M-step time]
Total bandwidth: 3.6B counts per iteration
SLIDE 60 JUNCTIONTREE topology
[Figure: EM nodes 1-5 connected in a tree]
Nodes are embedded in an arbitrary tree structure
Messages contain the counts needed by nodes in both subtrees
The tree can be optimized for: bandwidth, locality, ...
We use a maximum spanning tree to heuristically minimize bandwidth (a construction sketch follows below)
Future work: multiple trees
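One way to build such a tree is Prim's algorithm run for a maximum rather than minimum spanning tree, as sketched below (shared_weight is a made-up scoring function, e.g. the number of parameters two nodes share):

def max_spanning_tree(nodes, shared_weight):
    # Greedily grow the tree, always adding the heaviest edge that
    # reaches a new node; the greedy exchange argument for spanning
    # trees works for maximization exactly as it does for minimization.
    in_tree = [nodes[0]]
    edges = []
    while len(in_tree) < len(nodes):
        u, v = max(
            ((a, b) for a in in_tree for b in nodes if b not in in_tree),
            key=lambda edge: shared_weight(*edge),
        )
        edges.append((u, v))
        in_tree.append(v)
    return edges

Placing heavily overlapping nodes adjacent keeps counts for shared source words on short paths, which is the point of the bandwidth heuristic.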
SLIDE 61 JUNCTIONTREE topology
[Figure: EM nodes 1-5 in a tree]
[Chart: iteration time (s, 50-250) vs. number of nodes (1, 2, 5, 10, 20) for JunctionTree, broken into E-step, C-step, and M-step time]
Total bandwidth: 1.4B counts per iteration
SLIDE 62
Locality in other models
Ex: Latent Dirichlet Allocation (LDA) for topic modeling
Parameters: one unigram distribution p(w|t) per topic
Topic-word parameters are local
A similar augmentation trick to Model 1 applies; details and results in the paper (an illustrative sketch follows below)
Also applies to other EM models, and beyond EM:
  Word locality is extremely common in NLP applications
  Variational inference
  Other computations that make sparse use of expectations
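The paper carries the details; purely to illustrate the analogy, an augmented M-step for LDA's topic-word parameters might look like this (made-up names, assuming redundant per-topic totals are shipped alongside the word-topic counts, as η_s was for Model 1):

def lda_m_step_from_totals(eta_wk, eta_k):
    # eta_wk[(w, k)]: expected count of word w under topic k
    # eta_k[k]: redundant per-topic totals, so each node can
    # normalize p(w | k) = eta_wk / eta_k locally
    return {(w, k): c / eta_k[k] for (w, k), c in eta_wk.items()}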
SLIDE 63
Conclusion
A fully distributed, maximally localized EM algorithm
  exploits parameter locality for significant speedup
  is general: just define η for each datum
  is flexible with respect to communication topology
Many further improvements possible
  intelligent partitioning of data
  running E- and C-steps in parallel
  better topologies (e.g., multiple trees)
  exploiting approximate sparsity/locality
  ...