

SLIDE 1

Flexible Priors for Deep Hierarchies

Jacob Steinhardt

Wednesday, November 9, 2011

SLIDES 2-5

Hierarchical Modeling

  • many data are well-modeled by an underlying tree

[Figure: phylogenetic tree over Indo-European languages, leaves labeled by subfamily, e.g. [Armenian] Armenian (Eastern), [Indic] Hindi, [Slavic] Russian, [Romance] French, [Germanic] English, [Celtic] Irish]


SLIDES 7-12

Hierarchical Modeling

  • advantages of hierarchical modeling:
      • captures both broad and specific trends
      • facilitates transfer learning
  • issues:
      • the underlying tree may not be known
      • predictions in deep hierarchies can be strongly influenced by the prior

SLIDES 13-18

Learning the Tree

  • major approaches for choosing a tree:
      • agglomerative clustering
      • Bayesian methods (place a prior over trees)
      • stochastic branching processes
      • nested random partitions

SLIDE 19

Agglomerative Clustering

  • start with each datum in its own subtree
  • iteratively merge subtrees based on a similarity metric
  • issues:
      • can’t add new data
      • can’t form hierarchies over latent parameters
      • difficult to incorporate structured domain knowledge
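
As a concrete illustration of the merge loop (my sketch, not code from the talk), here is a toy greedy agglomerative clusterer; the centroid-distance similarity metric and the equal-weight centroid update are arbitrary choices:

    import numpy as np

    def agglomerate(points):
        """Start with each datum in its own subtree, then repeatedly
        merge the two subtrees whose centroids are closest."""
        trees = [(i,) for i in range(len(points))]
        cents = [np.asarray(p, dtype=float) for p in points]
        while len(trees) > 1:
            # distance between every pair of current subtrees
            dists = [(np.linalg.norm(cents[i] - cents[j]), i, j)
                     for i in range(len(trees)) for j in range(i + 1, len(trees))]
            _, i, j = min(dists)
            trees = ([t for k, t in enumerate(trees) if k not in (i, j)]
                     + [(trees[i], trees[j])])
            cents = ([c for k, c in enumerate(cents) if k not in (i, j)]
                     + [(cents[i] + cents[j]) / 2.0])
        return trees[0]  # a nested tuple encoding the binary merge tree

    print(agglomerate([[0.0], [0.1], [1.0], [1.1], [5.0]]))

Note how this exhibits the slide's complaints: once built, the tuple tree has no way to absorb a new datum without re-running the whole loop.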

SLIDE 20

Stochastic Branching Processes

  • fully Bayesian model
  • data start at the top of the tree and branch based on an arrival process (Dirichlet diffusion trees)
  • can also start at the bottom and merge (Kingman coalescents)
SLIDE 21

Stochastic Branching Processes

  • many nice properties:
      • infinitely exchangeable
      • complexity of the tree grows with the data
  • however, latent parameters must undergo a continuous-time diffusion process
  • unclear how to construct such a process for models over discrete data

SLIDE 22

Random Partitions

  • stick-breaking process: a way to partition the unit interval into countably many masses π1, π2, ...
  • draw βk ~ Beta(1, γ)
  • let πk = βk (1 − β1) ··· (1 − βk−1)
  • the distribution over the πk is called a Dirichlet process
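
A minimal sketch of the construction, truncated to the first K sticks (the truncation is my simplification; the process itself yields countably many masses):

    import numpy as np

    def stick_breaking(gamma, K, rng):
        """First K masses: beta_k ~ Beta(1, gamma),
        pi_k = beta_k * (1 - beta_1) * ... * (1 - beta_{k-1})."""
        betas = rng.beta(1.0, gamma, size=K)
        # stick length remaining before the k-th break
        left = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
        return betas * left

    rng = np.random.default_rng(0)
    pi = stick_breaking(gamma=2.0, K=10, rng=rng)
    print(pi, pi.sum())  # sums to just under 1; the tail sticks hold the rest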

SLIDE 23

Random Partitions

  • suppose {πk}k=1,...,∞ are drawn from a Dirichlet process
  • for n = 1, ..., N, let Xn ~ Multinomial({πk})
  • this induces a distribution over partitions of {1, ..., N}
  • given the partition of {1, ..., N}, XN+1 is added to an existing part of size s with probability s/(N+γ), and to a new part with probability γ/(N+γ)
  • this sequential scheme is the Chinese restaurant process
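
The sequential rule transcribes directly into code; a sketch (γ = 1.0 and N = 20 are arbitrary illustrative values):

    import numpy as np

    def crp_partition(N, gamma, rng):
        """Sequential CRP: the (n+1)-st item joins an existing part of size s
        with probability s/(n+gamma), or a new part with probability gamma/(n+gamma)."""
        sizes, labels = [], []
        for n in range(N):
            probs = np.array(sizes + [gamma], dtype=float) / (n + gamma)
            k = rng.choice(len(probs), p=probs)
            if k == len(sizes):
                sizes.append(0)  # open a new part
            sizes[k] += 1
            labels.append(k)
        return labels

    rng = np.random.default_rng(0)
    print(crp_partition(20, gamma=1.0, rng=rng))  # part label of each item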

SLIDE 24

Nested Random Partitions

  • a tree is equivalent to a collection of nested partitions
  • nested tree <=> nested random partitions
  • the partition at each node is given by a Chinese restaurant process
  • issue: when to stop recursing?
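
One way to sidestep the stopping question in a sketch is to truncate at a fixed depth. The toy nested-CRP sampler below (my illustration; γ = 1.0 and depth 3 are arbitrary) runs a separate CRP at every node over the data that reached it:

    import numpy as np

    def ncrp_paths(N, gamma, depth, rng):
        """Assign each of N data a root-to-leaf path of child indices;
        the partition at each node is drawn from a CRP."""
        paths = [[] for _ in range(N)]
        def recurse(data, level):
            if level == depth:
                return
            sizes, children = [], {}
            for n in data:
                weights = np.array(sizes + [gamma], dtype=float)
                k = rng.choice(len(weights), p=weights / weights.sum())
                if k == len(sizes):
                    sizes.append(0)
                sizes[k] += 1
                paths[n].append(k)
                children.setdefault(k, []).append(n)
            for sub in children.values():
                recurse(sub, level + 1)
        recurse(list(range(N)), 0)
        return paths

    rng = np.random.default_rng(0)
    for path in ncrp_paths(6, gamma=1.0, depth=3, rng=rng):
        print(path)  # e.g. [0, 1, 0]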

SLIDE 25

Martingale Property

  • martingale property: E[f(θchild) | θparent] = f(θparent)
  • implies E[f(θv) | θu] = f(θu) for any ancestor u of v
  • says that learning about a child does not change beliefs in expectation
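
A quick Monte Carlo check of the property, using the Beta transition that appears later in the deck with f the identity (c = 5.0 and θparent = 0.3 are arbitrary illustrative values):

    import numpy as np

    rng = np.random.default_rng(0)
    c, theta_parent = 5.0, 0.3
    # theta_child | theta_parent ~ Beta(c*theta_parent, c*(1 - theta_parent))
    children = rng.beta(c * theta_parent, c * (1.0 - theta_parent), size=1_000_000)
    print(children.mean())  # ~0.3: E[theta_child | theta_parent] = theta_parent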

SLIDES 26-29

Doob’s Theorem

  • Let θ1, θ2, ... be a sequence of random variables such that E[f(θn+1) | θn] = f(θn) and supn E[|f(θn)|] < ∞.
  • Then limn→∞ f(θn) exists with probability 1.
  • Intuition: each new random variable reveals more information about f(θ) until it is completely determined.
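
A small simulation of the theorem in action (my illustration): the hierarchical beta process transition from later in the deck, θn+1 | θn ~ Beta(cθn, c(1 − θn)), is a bounded martingale for f the identity, so Doob guarantees the chain settles to a random limit:

    import numpy as np

    rng = np.random.default_rng(0)
    c, theta = 20.0, 0.5
    for n in range(1, 201):
        # clip so the Beta parameters stay strictly positive once
        # theta has converged to 0 or 1 at machine precision
        a = c * min(max(theta, 1e-12), 1.0 - 1e-12)
        theta = rng.beta(a, c - a)
        if n in (1, 10, 50, 200):
            print(n, theta)  # successive draws settle toward a limit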

SLIDE 30

Doob’s Theorem

  • use Doob’s theorem to build an infinitely deep hierarchy
  • data are associated with infinite paths v1, v2, ... down the tree
  • each datum is drawn from a distribution parameterized by limn→∞ f(θvn)

SLIDE 31

Doob’s Theorem

  • all data have infinite depth
  • can think of the effective depth of a datum as the first point where it is in a unique subtree
  • effective depth is O(log N)

SLIDES 32-33

Letting the Complexity Grow with the Data

[Figure: maximum depth and average depth as functions of the number of data points (500 to 3000), comparing nCRP against TSSB variants (TSSB-10-0.5, TSSB-20-1.0, TSSB-50-0.5, TSSB-100-0.8)]

SLIDES 34-40

Hierarchical Beta Processes

  • θv lies in [0,1]^D
  • θv,d | θp(v),d ~ Beta(c·θp(v),d, c·(1 − θp(v),d))
  • the martingale property holds for f(θv) = θv
  • let θ denote the limit
  • Xd | θd ~ Bernoulli(θd)
  • note that Xd | θv,d ~ Bernoulli(θv,d) as well
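
A minimal sketch of sampling from this model along a single root-to-node path (my illustration; c = 20.0, D = 8, and depth 5 are arbitrary choices):

    import numpy as np

    def hbp_path(theta_root, c, depth, rng):
        """Propagate per-feature probabilities down one path:
        theta_{v,d} | theta_{p(v),d} ~ Beta(c*theta_{p(v),d}, c*(1 - theta_{p(v),d}))."""
        theta = np.asarray(theta_root, dtype=float)
        for _ in range(depth):
            theta = rng.beta(c * theta, c * (1.0 - theta))
        return theta

    rng = np.random.default_rng(0)
    theta_root = np.full(8, 0.5)   # D = 8 binary features
    theta_v = hbp_path(theta_root, c=20.0, depth=5, rng=rng)
    x = rng.binomial(1, theta_v)   # X_d | theta_{v,d} ~ Bernoulli(theta_{v,d})
    print(theta_v.round(3), x)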


SLIDE 42

Priors for Deep Hierarchies

  • for the HBP, θv,d converges to 0 or 1
  • rate of convergence: a tower of exponentials, e^(e^(e^(···)))
  • numerical issues + philosophically troubling
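
The numerical issue is easy to exhibit: iterating the Beta transition in double precision, θ reaches exactly 0.0 or 1.0 after remarkably few levels (a sketch; c = 5.0 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    c = 5.0
    collapse_depths = []
    for trial in range(100):
        theta, depth = 0.5, 0
        while 0.0 < theta < 1.0 and depth < 10_000:
            theta = rng.beta(c * theta, c * (1.0 - theta))
            depth += 1
        collapse_depths.append(depth)
    # depth at which float64 rounds theta to exactly 0 or 1
    print(np.median(collapse_depths))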

SLIDE 43

Priors for Deep Hierarchies

  • inverse Wishart time series
  • Σn+1 | Σn ~ InvW(Σn)
  • Σn converges to 0 with probability 1
  • becomes singular to numerical precision
  • the rate is also given by a tower of exponentials
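
A hedged sketch of the time series: the slide leaves the degrees of freedom of InvW unspecified, so ν = d + 2 is my choice here (it makes E[Σn+1 | Σn] = Σn, matching the martingale theme):

    import numpy as np
    from scipy.stats import invwishart

    rng = np.random.default_rng(0)
    d = 3
    nu = d + 2                  # assumed df; E[InvW(nu, Psi)] = Psi/(nu - d - 1) = Psi
    Sigma = np.eye(d)
    for n in range(1, 31):
        Sigma = invwishart(df=nu, scale=Sigma).rvs(random_state=rng)
        if n % 10 == 0:
            # the slide's point: Sigma_n -> 0, so the determinant collapses
            print(n, np.linalg.det(Sigma))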

SLIDE 44

Priors for Deep Hierarchies

  • fundamental issues with the iterated gamma distribution
  • θn+1 | θn ~ Γ(θn)
  • instead, take θn+1 = cθn + dϕn, where ϕn ~ Γ(θn)
  • with c + d = 1 this preserves the martingale property, since E[ϕn | θn] = θn
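
A side-by-side simulation (my illustration) of the naive iterated gamma chain against the proposed correction; with c + d = 1 the correction preserves the conditional mean, E[θn+1 | θn] = cθn + d·E[ϕn | θn] = θn:

    import numpy as np

    rng = np.random.default_rng(0)
    c, d = 0.9, 0.1            # c + d = 1 (illustrative values)
    naive = fixed = 1.0
    for n in range(1, 201):
        naive = rng.gamma(shape=naive) if naive > 0 else 0.0  # theta_{n+1} ~ Gamma(theta_n)
        fixed = c * fixed + d * rng.gamma(shape=fixed)        # theta_{n+1} = c*theta_n + d*phi_n
        if n in (10, 50, 200):
            print(n, naive, fixed)  # naive collapses toward 0; fixed retains mass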

SLIDE 45

Priors for Deep Hierarchies

[Figure: mass vs. depth for the iterated chains. Panels: kappa = 10.0, epsilon = 0.1 on a linear scale (mass 0.1 to 0.8) and on a log scale (mass falling to ~10^-25 by depth 500); the proposed model (mass staying near 0.06 to 0.13 through depth 500); kappa = 10.0, epsilon = 0.0 on a log scale (mass falling to ~10^-10 within depth 8)]