  1. Flexible Priors for Deep Hierarchies Jacob Steinhardt Wednesday, November 9, 2011

  2. Hierarchical Modeling
  • many data are well-modeled by an underlying tree

  5. Hierarchical Modeling
  • many data are well-modeled by an underlying tree
  [Figure: tree over Indo-European languages, labeled by family: Celtic (Irish, Gaelic (Scots), Welsh, Cornish, Breton); Iranian (Tajik, Persian, Kurdish (Central), Pashto, Ossetic); Romance (French, Spanish, Romanian, Portuguese, Italian, Catalan); Germanic (German, Dutch, English, Icelandic, Swedish, Norwegian, Danish); Slavic (Bulgarian, Polish, Slovene, Serbian-Croatian, Ukrainian, Russian, Czech); Baltic (Lithuanian, Latvian); Indic (Panjabi, Hindi, Kashmiri, Sinhala, Nepali, Maithili, Marathi, Bengali); Greek (Modern); Albanian; Armenian (Western, Eastern)]

  6. Hierarchical Modeling

  12. Hierarchical Modeling
  • advantages of hierarchical modeling:
    • captures both broad and specific trends
    • facilitates transfer learning
  • issues:
    • the underlying tree may not be known
    • predictions in deep hierarchies can be strongly influenced by the prior

  18. Learning the Tree
  • major approaches for choosing a tree:
    • agglomerative clustering
    • Bayesian methods (place a prior over trees)
    • stochastic branching processes
    • nested random partitions

  19. Agglomerative Clustering
  • start with each datum in its own subtree
  • iteratively merge subtrees based on a similarity metric
  • issues:
    • can’t add new data
    • can’t form hierarchies over latent parameters
    • difficult to incorporate structured domain knowledge
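The merge loop described on this slide can be sketched in a few lines. Single-linkage distance over 1-D points is an assumption made purely for illustration; the similarity metric is a free choice in agglomerative schemes.

```python
def single_linkage(points):
    """Agglomerative clustering sketch: start with each datum in its own
    subtree, then repeatedly merge the two closest subtrees.

    Uses single-linkage distance over 1-D points for illustration.
    Returns the hierarchy as a nested tuple.
    """
    # each entry is (subtree, list of member points)
    trees = [(p, [p]) for p in points]
    while len(trees) > 1:
        best = None  # (distance, i, j) of the closest pair of subtrees
        for i in range(len(trees)):
            for j in range(i + 1, len(trees)):
                d = min(abs(a - b) for a in trees[i][1] for b in trees[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((trees[i][0], trees[j][0]), trees[i][1] + trees[j][1])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0][0]

# the two nearby points merge first, then join the outlier
print(single_linkage([0.0, 0.1, 5.0]))  # (5.0, (0.0, 0.1))
```

Note the first issue on the slide is visible here: the loop consumes a fixed dataset, so seating a new datum means rebuilding the whole tree.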

  20. Stochastic Branching Processes
  • fully Bayesian model
  • data starts at the top and branches based on an arrival process (Dirichlet diffusion trees)
  • can also start at the bottom and merge (Kingman coalescents)

  21. Stochastic Branching Processes
  • many nice properties:
    • infinitely exchangeable
    • complexity of tree grows with the data
  • latent parameters must undergo a continuous-time diffusion process
  • unclear how to construct such a process for models over discrete data

  22. Random Partitions
  • stick-breaking process: a way to partition the unit interval into countably many masses π_1, π_2, ...
  • draw β_k ~ Beta(1, γ)
  • let π_k = β_k (1 − β_1) ··· (1 − β_{k−1})
  • the distribution over the π_k is called a Dirichlet process

  23. Random Partitions
  • suppose {π_k}, k = 1, ..., ∞, are drawn from a Dirichlet process
  • for n = 1, ..., N, let X_n ~ Multinomial({π_k})
  • this induces a distribution over partitions of {1, ..., N}
  • given a partition of {1, ..., N}, add X_{N+1} to a part of size s with probability s/(N + γ), and to a new part with probability γ/(N + γ)
  • this is the Chinese restaurant process
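The sequential seating rule on this slide translates directly into code. A sketch that seats N customers one at a time:

```python
import random

def chinese_restaurant(N, gamma):
    """Chinese restaurant process: seat N customers sequentially.

    With n customers already seated, the next one joins an existing part
    of size s with probability s / (n + gamma), or starts a new part
    with probability gamma / (n + gamma). Returns the list of part sizes.
    """
    sizes = []
    for n in range(N):  # n = number of customers already seated
        r = random.uniform(0.0, n + gamma)
        for i, s in enumerate(sizes):
            if r < s:          # r fell inside part i's slice
                sizes[i] += 1
                break
            r -= s
        else:
            sizes.append(1)    # r fell in the gamma-sized slice: new part
    return sizes

random.seed(0)
parts = chinese_restaurant(N=1000, gamma=1.0)
assert sum(parts) == 1000  # every customer is seated exactly once
```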

  24. Nested Random Partitions
  • a tree is equivalent to a collection of nested partitions
  • nested tree <=> nested random partitions
  • the partition at each node is given by a Chinese restaurant process
  • issue: when to stop recursing?
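A sketch of the nested construction: run a CRP at the root, then recurse into each part. Truncating at a fixed `max_depth` is an assumption made here only so the sketch terminates; the slide's open issue is exactly that an unbounded version needs a principled stopping rule.

```python
import random

def nested_crp(items, gamma, max_depth):
    """Nested CRP sketch: partition items with a CRP, recurse into parts.

    Truncation at max_depth is for illustration only.
    Returns a nested list of lists (the tree).
    """
    if max_depth == 0 or len(items) <= 1:
        return items
    parts = []
    for n, item in enumerate(items):   # CRP seating at this node
        r = random.uniform(0.0, n + gamma)
        for part in parts:
            if r < len(part):
                part.append(item)
                break
            r -= len(part)
        else:
            parts.append([item])       # new part
    if len(parts) == 1:                # degenerate partition: stop recursing
        return parts[0]
    return [nested_crp(part, gamma, max_depth - 1) for part in parts]

random.seed(0)
tree = nested_crp(list(range(10)), gamma=1.0, max_depth=3)
```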

  25. Martingale Property
  • martingale property: E[f(θ_child) | θ_parent] = f(θ_parent)
  • implies E[f(θ_v) | θ_u] = f(θ_u) for any ancestor u of v
  • says that learning about a child does not change beliefs in expectation

  29. Doob’s Theorem
  • Let θ_1, θ_2, ... be a sequence of random variables such that E[f(θ_{n+1}) | θ_n] = f(θ_n) and sup_n E[|f(θ_n)|] < ∞.
  • Then lim_{n→∞} f(θ_n) exists with probability 1.
  • Intuition: each new random variable reveals more information about f(θ) until it is completely determined.
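The convergence the theorem guarantees is easy to see numerically. The Polya urn below is a standard illustration, not a model from the talk: the fraction of red balls is a bounded martingale, so Doob's theorem says it converges almost surely.

```python
import random

def polya_urn(steps):
    """Polya urn: start with one red and one black ball; repeatedly draw
    a ball uniformly at random and return it together with a new ball of
    the same color.

    The fraction of red balls is a martingale bounded in [0, 1], so by
    Doob's theorem it converges with probability 1.
    """
    red, black = 1, 1
    fractions = []
    for _ in range(steps):
        if random.random() < red / (red + black):
            red += 1
        else:
            black += 1
        fractions.append(red / (red + black))
    return fractions

random.seed(0)
fracs = polya_urn(20000)
# late fluctuations are tiny: the sequence has visibly settled
assert abs(fracs[-1] - fracs[-1000]) < 0.02
```

Different seeds settle at different limits; the theorem promises a limit exists, not which one.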

  30. Doob’s Theorem
  • use Doob’s theorem to build an infinitely deep hierarchy
  • data are associated with infinite paths v_1, v_2, ... down the tree
  • each datum is drawn from a distribution parameterized by lim_n f(θ_{v_n})

  31. Doob’s Theorem
  • all data have infinite depth
  • can think of the effective depth of a datum as the first point where it is in a unique subtree
  • effective depth is O(log N)

  32. Letting the Complexity Grow with the Data

  33. Letting the Complexity Grow with the Data
  [Figure: two plots of maximum depth and average depth versus number of data points (0–3000), comparing nCRP against TSSB variants (TSSB-10-0.5, TSSB-20-1.0, TSSB-50-0.5, TSSB-100-0.8)]

  40. Hierarchical Beta Processes
  • θ_v lies in [0, 1]^D
  • θ_{v,d} | θ_{p(v),d} ~ Beta(c θ_{p(v),d}, c(1 − θ_{p(v),d}))
  • martingale property holds for f(θ_v) = θ_v
  • let θ denote the limit
  • X_d | θ_d ~ Bernoulli(θ_d), where θ is the limit
  • note that X_d | θ_{v,d} ~ Bernoulli(θ_{v,d}) as well
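A minimal simulation of one coordinate d along a single root-to-leaf path, under the Beta transition above. The `eps` clamp is a numerical guard I have added (Beta parameters must be strictly positive); it is not part of the model.

```python
import random

def hbp_path(theta0, c, depth, eps=1e-12):
    """Sample theta down one root-to-leaf path of the hierarchical beta
    process, for a single coordinate:

        theta_child ~ Beta(c * theta_parent, c * (1 - theta_parent))

    so E[theta_child | theta_parent] = theta_parent (martingale property).
    The eps clamp keeps both Beta parameters positive near 0 and 1.
    """
    theta = theta0
    path = [theta]
    for _ in range(depth):
        theta = min(max(theta, eps), 1.0 - eps)
        theta = random.betavariate(c * theta, c * (1.0 - theta))
        path.append(theta)
    return path

random.seed(0)
path = hbp_path(theta0=0.5, c=2.0, depth=30)
assert all(0.0 <= t <= 1.0 for t in path)
```

Running this for large depths shows the behavior the next slide complains about: the chain drifts toward 0 or 1.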

  41. Hierarchical Beta Processes

  42. Priors for Deep Hierarchies
  • for the HBP, θ_{v,d} converges to 0 or 1
  • rate of convergence: tower of exponentials e^{e^{e^{···}}}
  • numerical issues + philosophically troubling

  43. Priors for Deep Hierarchies
  • inverse Wishart time series: Σ_{n+1} | Σ_n ~ InvW(Σ_n)
  • converges to 0 with probability 1
  • becomes singular to numerical precision
  • rate also given by a tower of exponentials
