SLIDE 1 Fully Distributed EM for Very Large Datasets
Jason Wolfe Aria Haghighi Dan Klein
Computer Science Division UC Berkeley
SLIDE 2 Overview
Task: unsupervised learning via EM
Focus: models with many local parameters (each relevant to only a few data points)
Approach: fully distributed, localized EM
⋆ parameter locality → less bandwidth
[Figure: millions of data points (parallel Arabic-English sentence pairs; the Arabic is the translation of "US Hosts Middle East Peace Conference Next Week") and millions of parameters spread across numbered nodes, contrasting useful work with communication]
SLIDE 5
Outline
Running example: IBM Model 1 for word alignment
Naive distributed EM
Efficiently distributed EM
SLIDE 6 Word alignment for machine translation
Goal: parallel sentences → word-level translation model
Parameters θ_{st}: the probability that Spanish word s translates to English word t
θ = (θ_{la,the}, θ_{la,chair}, θ_{la,table}, θ_{silla,the}, θ_{silla,chair}, θ_{mesa,the}, θ_{mesa,table})
The unobserved true alignments here correspond to θ_{la,the} = 1.0, θ_{silla,chair} = 1.0, θ_{mesa,table} = 1.0, and 0.0 for all other entries
[Figure: a corpus of parallel sentences ("la silla" / "the chair", "la mesa" / "the table"), the possible alignment arcs, and the unobserved true alignments]
SLIDE 9 IBM Model 1 for word alignment
[Figure: example sentence pair "a Steve no le gustan las ferias grandes" / "Steve does not like big ferris wheels", with candidate alignment arcs filled in one target word at a time]
Each target word is generated by exactly one source word, chosen uniformly at random
IBM Model 1: a simple generative model
For each target position i, independently (sketched in code below):
  choose a source index a_i uniformly at random
  choose a target word T_i ∼ θ_{S_{a_i}, ·}
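A minimal Python rendering of this generative story (an illustration, not code from the talk; generate_target and its arguments are made-up names, and the target length is taken as given rather than modeled):

import random

def generate_target(source_words, target_length, theta):
    # theta[s][t] = p(t | s), the translation distribution of source word s
    target = []
    for _ in range(target_length):
        # choose a source index a_i uniformly at random
        a_i = random.randrange(len(source_words))
        s = source_words[a_i]
        # choose the target word T_i ~ theta_{S_{a_i}, .}
        words = list(theta[s])
        weights = [theta[s][t] for t in words]
        target.append(random.choices(words, weights=weights)[0])
    return target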
SLIDE 19 EM algorithm for IBM Model 1
θ ← some initial guess, e.g. θ_{la,the} = .33, θ_{la,chair} = .33, θ_{la,table} = .33, θ_{silla,the} = .5, ...
Iterate (a code sketch follows below):
1. E-step: estimate alignment counts η
   (a) compute posteriors p(a_i | θ); for "the" in "la silla" / "the chair": .6 = .5/(.33+.5) for silla, and .33/(.33+.5) = .4 for la
   (b) aggregate into expected counts η_{st} (the expected number of times s aligns to t under θ): η_{st} ← Σ over sentence pairs and positions i with T_i = t of θ_{st} / Σ_{s' ∈ S} θ_{s',t}
       e.g. η_{la,the} = .8, η_{la,chair} = .4, η_{la,table} = .4, η_{silla,the} = .6, ...
2. M-step: normalize η to get the new maximum-likelihood θ: θ_{st} ← η_{st} / Σ_{t'} η_{st'}
   e.g. θ_{la,the} = .5, θ_{la,chair} = .25, θ_{la,table} = .25, θ_{silla,the} = .5, ...
[Figure: posterior weights .6 and .4 on the alignment arcs of "la silla" / "the chair" and "la mesa" / "the table"]
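The sketch below spells out one such iteration on a single machine (made-up names, not the talk's code); theta maps each co-occurring (s, t) pair to its current probability:

from collections import defaultdict

def em_iteration(corpus, theta):
    # corpus: list of (source_words, target_words) sentence pairs
    eta = defaultdict(float)
    for source, target in corpus:
        for t in target:
            # E-step (a): normalizer over all candidate sources for t
            z = sum(theta[(s, t)] for s in source)
            for s in source:
                # E-step (b): add the posterior weight to eta_{st}
                eta[(s, t)] += theta[(s, t)] / z
    # M-step: theta_{st} <- eta_{st} / sum_{t'} eta_{st'}
    totals = defaultdict(float)
    for (s, t), c in eta.items():
        totals[s] += c
    return {(s, t): c / totals[s] for (s, t), c in eta.items()}

Run repeatedly on the toy corpus, this reproduces the numbers on the slide: θ_{la,the} moves from .33 to .5 after one iteration and keeps climbing.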
SLIDE 24 EM example continued
[Figure animation, E-steps 1, 2, 3, 4, 5, ..., ∞: the posterior alignment arcs for "la silla" / "the chair" and "la mesa" / "the table" sharpen with each iteration, converging to the true alignments]
SLIDE 30 UN Arabic-English TIDES v2 corpus
[Figure: parallel Arabic-English sentence pairs from the corpus; the Arabic is the translation of "US Hosts Middle East Peace Conference Next Week"]
2.9 million sentence pairs from UN proceedings
243 million unique word pairs (pairs that co-occur in some sentence pair, so a translation is possible)
⇒ 243M parameters in θ and 243M counts in η
Even fitting all (indexed) parameters in the memory of a 32-bit machine can be challenging
SLIDE 31
Outline
Running example: IBM Model 1 for word alignment
Naive distributed EM
Efficiently distributed EM
SLIDE 32 Previous approach: distributing the E-step
[Figure: sentence pairs "la silla" / "the chair" and "la mesa" / "the table" partitioned over E-step nodes 1 and 2; each node ships its partial counts (η_{la,the}, η_{la,chair}, η_{silla,the}, η_{silla,chair}; η_{la,the}, η_{la,table}, η_{mesa,the}, η_{mesa,table}) to a central Reduce/M-step node, which ships all of θ (7 params) back to each node]
1. E-step computations distribute easily: partition the data over k nodes; alignments are independent given θ
2. Nodes communicate partial counts to a central Reduce node
3. The Reduce node does the global M-step
4. Reduce sends the new parameters back
Remaining problems:
  Memory at the Reduce node
  C-step (communication) bandwidth: 5.5B numbers per iteration (on the full dataset with 20 nodes; see the schematic sketch below)
(Chu et al. 2006, Dyer et al. 2008, Newman et al. 2008, ...)
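To make the communication pattern concrete, here is a schematic single-process stand-in for this MapReduce scheme (made-up names; e_step_counts plays the role of the per-node E-step, and in a real cluster each shard and each partial count table lives on a different machine):

from collections import defaultdict

def naive_distributed_iteration(shards, theta, e_step_counts):
    # "Map": each node runs the E-step on its own shard of the data
    partials = [e_step_counts(shard, theta) for shard in shards]
    # C-step: every node ships all of its partial counts to one Reduce node
    eta = defaultdict(float)
    for partial in partials:
        for key, count in partial.items():
            eta[key] += count
    # Global M-step at the Reduce node
    totals = defaultdict(float)
    for (s, t), c in eta.items():
        totals[s] += c
    new_theta = {(s, t): c / totals[s] for (s, t), c in eta.items()}
    # Reduce ships the complete new theta back to every node
    return new_theta

Both shipping steps touch (almost) every parameter on every node, which is why the C-step bandwidth scales with the number of parameters times the number of nodes.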
SLIDE 38
Speedup (on 200K total sentence pairs)
[Chart: iteration time (s, 50-250) vs. number of E-step nodes (1, 2, 5, 10, 20) for MapReduce, broken into E-step, C-step, and M-step time]
SLIDE 39
Common practical solutions
Memory and bandwidth are real problems in practice
Workarounds:
  Use less data
  Ignore rare words
  Train on independent chunks
  Swap to disk
  Distribute over multiple machines
SLIDE 40
Outline
Running example: IBM Model 1 for word alignment
Naive distributed EM
Efficiently distributed EM
SLIDE 41 Distributing the M-step locally
[Figure: paired E-step/M-step nodes for "la silla" / "the chair" and "la mesa" / "the table"; node 1 stores only θ_{la,the}, θ_{la,chair}, θ_{silla,the}, θ_{silla,chair}, node 2 only θ_{la,the}, θ_{la,table}, θ_{mesa,the}, θ_{mesa,table}; Reduce now passes the complete counts η back instead of parameters]
Distribute the M-step alongside the E-step:
  Nodes store only the parameters they need and compute them locally
  Reduce passes back counts rather than parameters
  A node doesn't need to hear about irrelevant source words
  A node doesn't need to tell (or hear) about purely local source words
  But it does need to hear everything about each of its source words, because of the M-step denominator: θ_{st} ← η_{st} / Σ_{t'} η_{st'} (see the sketch below)
Bandwidth savings: 30%
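A minimal sketch of the per-node M-step this implies (made-up names; it assumes the node has already received the complete, globally summed counts for every source word it owns):

from collections import defaultdict

def local_m_step(eta_complete):
    # eta_complete: complete counts eta_{st} for each owned source word s,
    # over all targets t' (required for the denominator sum_{t'} eta_{st'})
    totals = defaultdict(float)
    for (s, t), c in eta_complete.items():
        totals[s] += c
    return {(s, t): c / totals[s] for (s, t), c in eta_complete.items()}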
SLIDE 53 Augmenting η to increase locality
[Figure: with augmented counts, each node exchanges only η_{la,the} and η_{la}; counts for the purely local words silla and mesa never leave their nodes]
Augment η with the redundant totals η_s = Σ_{t'} η_{st'}, accumulated during the E-step
The M-step becomes θ_{st} ← η_{st} / η_s (see the sketch below)
  Increases locality
  Total bandwidth savings: 84% (bigger with more nodes)
  Similar tricks work for other models
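A sketch of the augmented E-step and the purely local M-step it enables (made-up names, same toy data structures as before):

from collections import defaultdict

def e_step_with_totals(shard, theta):
    eta = defaultdict(float)     # eta[(s, t)]
    eta_s = defaultdict(float)   # redundant totals eta_s = sum_{t'} eta_{st'}
    for source, target in shard:
        for t in target:
            z = sum(theta[(s, t)] for s in source)
            for s in source:
                p = theta[(s, t)] / z
                eta[(s, t)] += p
                eta_s[s] += p    # accumulated during the E-step
    return eta, eta_s

def m_step_from_totals(eta, eta_s):
    # Once (eta, eta_s) for shared source words have been summed across
    # nodes, each node normalizes locally: theta_{st} <- eta_{st} / eta_s
    return {(s, t): c / eta_s[s] for (s, t), c in eta.items()}

Only the entries for source words shared with other nodes ever need to be communicated; everything else is consumed where it was produced.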
SLIDE 57 Choice of C-step topology
[Figure: combined EM nodes exchange η_{la,the} and η_{la} directly with each other; several example communication topologies over EM nodes 1-5]
No need for separate Reduce nodes
By choosing the connectivity, we can trade off:
  bandwidth
  latency
  locality
  ...
SLIDE 59 ALLPAIRS topology
[Figure: EM nodes 1-4 with every pair directly connected]
[Chart: iteration time (s, 50-250) vs. number of nodes (1, 2, 5, 10, 20) for AllPairs, broken into E-step, C-step, and M-step time]
Total bandwidth: 3.6B counts per iteration
SLIDE 60 JUNCTIONTREE topology
[Figure: EM nodes 1-5 connected in a tree]
Nodes are embedded in an arbitrary tree structure
Messages contain the counts needed by nodes in both subtrees
The tree can be optimized for: bandwidth, locality, ...
We use a maximum spanning tree to heuristically minimize bandwidth (a construction sketch follows below)
Future work: multiple trees
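One way to build such a tree is Prim's algorithm run for a maximum rather than minimum spanning tree, as sketched below (shared_weight is a made-up scoring function, e.g. the number of parameters two nodes share):

def max_spanning_tree(nodes, shared_weight):
    # Greedily grow the tree, always adding the heaviest edge that
    # reaches a new node; the greedy exchange argument for spanning
    # trees works for maximization exactly as it does for minimization.
    in_tree = [nodes[0]]
    edges = []
    while len(in_tree) < len(nodes):
        u, v = max(
            ((a, b) for a in in_tree for b in nodes if b not in in_tree),
            key=lambda edge: shared_weight(*edge),
        )
        edges.append((u, v))
        in_tree.append(v)
    return edges

Placing heavily overlapping nodes adjacent keeps counts for shared source words on short paths, which is the point of the bandwidth heuristic.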
SLIDE 61 JUNCTIONTREE topology
[Figure: EM nodes 1-5 in a tree]
[Chart: iteration time (s, 50-250) vs. number of nodes (1, 2, 5, 10, 20) for JunctionTree, broken into E-step, C-step, and M-step time]
Total bandwidth: 1.4B counts per iteration
SLIDE 62
Locality in other models
Ex: Latent Dirichlet Allocation (LDA) for topic modeling
Parameters: one unigram distribution p(w|t) per topic
Topic-word parameters are local
A similar augmentation trick to Model 1 applies; details and results in the paper (an illustrative sketch follows below)
Also applies to other EM models, and beyond EM:
  Word locality is extremely common in NLP applications
  Variational inference
  Other computations that make sparse use of expectations
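The paper carries the details; purely to illustrate the analogy, an augmented M-step for LDA's topic-word parameters might look like this (made-up names, assuming redundant per-topic totals are shipped alongside the word-topic counts, as η_s was for Model 1):

def lda_m_step_from_totals(eta_wk, eta_k):
    # eta_wk[(w, k)]: expected count of word w under topic k
    # eta_k[k]: redundant per-topic totals, so each node can
    # normalize p(w | k) = eta_wk / eta_k locally
    return {(w, k): c / eta_k[k] for (w, k), c in eta_wk.items()}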
SLIDE 63
Conclusion
A fully distributed, maximally localized EM algorithm
  exploits parameter locality for significant speedup
  is general: just define η for each datum
  is flexible with respect to communication topology
Many further improvements possible
  intelligent partitioning of data
  running E- and C-steps in parallel
  better topologies (e.g., multiple trees)
  exploiting approximate sparsity/locality
  ...