A Stochastic Memoizer for Sequence Data - PowerPoint PPT Presentation

Frank Wood, Cédric Archambeau, Jan Gasthaus, Lancelot James, Yee Whye Teh


  1. A Stochastic Memoizer for Sequence Data
     Frank Wood, Cédric Archambeau, Jan Gasthaus, Lancelot James, Yee Whye Teh
     Gatsby Unit (UCL) and HKUST

  2. • Model
       – Smoothing Markov model of discrete sequences
       – Extension of the hierarchical Pitman-Yor process [Teh 2006]
         • Unbounded depth (context length)
     • Algorithms and estimation
       – Linear-time suffix-tree graphical model identification and construction
       – Standard Chinese restaurant franchise sampler
     • Results
       – Maximum contextual information used during inference
       – Competitive language modelling results
         • Limit of the n-gram language model as n → ∞
         • Same computational cost as a Bayesian interpolating 5-gram language model

  3. • Uses
       – Any situation in which a low-order Markov model of discrete sequences is insufficient
       – Drop-in replacement for a smoothing Markov model
     • Name?
       – "A Stochastic Memoizer for Sequence Data" → Sequence Memoizer (SM)
       – Describes posterior inference [Goodman et al. '08]

  4. • Sequence Markov models are usually constructed by treating a sequence as a set of
       (exchangeable) observations in fixed-length contexts (contexts written
       most-recent-symbol-first):
       – oacac → unigram:  o | [],  a | [],  c | [],  a | [],  c | []
       – oacac → bigram:   a | o,  c | a,  a | c,  c | a
       – oacac → trigram:  c | ao,  a | ca,  c | ac
       – oacac → 4-gram:   a | cao,  c | aca
     • Increasing context length / order of Markov model:
       – Decreasing number of observations
       – Increasing number of conditional distributions to estimate (indexed by context)
       – Increasing power of model
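Below is a minimal sketch (mine, not from the slides) of this construction: it extracts the (symbol | context) observations an order-n Markov model sees in "oacac", with contexts written most-recent-symbol-first as above.

```python
# Sketch: the fixed-length-context observations used by an order-n Markov model.
def markov_observations(seq, n):
    """Return (symbol, context) pairs; contexts are written most-recent-symbol-first."""
    pairs = []
    for i, symbol in enumerate(seq):
        context = seq[max(0, i - (n - 1)):i][::-1]   # up to n-1 preceding symbols, reversed
        if n == 1:
            pairs.append((symbol, "[]"))             # unigram: empty context for every symbol
        elif len(context) == n - 1:                  # keep only full-length contexts
            pairs.append((symbol, context))
    return pairs

for n in (1, 2, 3, 4):
    print(n, markov_observations("oacac", n))
# 1 [('o', '[]'), ('a', '[]'), ('c', '[]'), ('a', '[]'), ('c', '[]')]
# 2 [('a', 'o'), ('c', 'a'), ('a', 'c'), ('c', 'a')]
# 3 [('c', 'ao'), ('a', 'ca'), ('c', 'ac')]
# 4 [('a', 'cao'), ('c', 'aca')]
```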

  5. P(x_{1:N}) = ∏_{i=1}^{N} P(x_i | x_1, …, x_{i−1})
               ≈ ∏_{i=1}^{N} P(x_i | x_{i−n+1}, …, x_{i−1}),   n = 2
               = P(x_1) P(x_2 | x_1) P(x_3 | x_2) P(x_4 | x_3) …
     • Example
       P(oacac) = P(o) P(a | o) P(c | a) P(a | c) P(c | a)
                = G_[](o) G_[o](a) G_[a](c) G_[c](a) G_[a](c)
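A small sketch (mine) that spells out the order-2 factorization symbolically, matching the G_[context](symbol) notation above:

```python
# Sketch: write P(seq) as the product of bigram conditionals G_[context](symbol).
def bigram_factorization(seq):
    terms = []
    for i, symbol in enumerate(seq):
        context = seq[i - 1] if i > 0 else ""        # only the most recent symbol is kept
        terms.append(f"G_[{context}]({symbol})")
    return " * ".join(terms)

print(bigram_factorization("oacac"))
# G_[](o) * G_[o](a) * G_[a](c) * G_[c](a) * G_[a](c)
```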

  6. • Discrete distribution ↔ vector of parameters
       G_[] = [π_1, …, π_K],   K = |Σ|
     • Counting / maximum-likelihood estimation
       – Training sequence x_{1:N}
       – π̂_k = Ĝ_[](X = k) = #{x_i = k} / N
     • Predictive inference
       P(X_{N+1} | x_1 … x_N) = Ĝ_[](X_{N+1})
     • Example
       – Non-smoothed unigram model (x_i ∼ G_[], i = 1 : N)
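A minimal sketch (mine) of the counting estimator and its predictive use; note that unseen symbols get probability zero, which is what motivates the smoothing on the next slide:

```python
# Sketch: maximum-likelihood ("counting") estimate of a unigram distribution.
from collections import Counter

def ml_unigram(training):
    counts = Counter(training)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

G_hat = ml_unigram("oacac")
print(G_hat)                  # {'o': 0.2, 'a': 0.4, 'c': 0.4}
print(G_hat.get("b", 0.0))    # 0.0 -- an unseen symbol gets zero probability
```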

  7. • Estimation
       P(G_[] | x_{1:n}) ∝ P(x_{1:n} | G_[]) P(G_[])
     • Predictive inference
       P(X_{n+1} | x_{1:n}) = ∫ P(X_{n+1} | G_[]) P(G_[] | x_{1:n}) dG_[]
     • Priors over distributions
       G_[] ∼ Dirichlet(U),   G_[] ∼ PY(d, c, U)
     • Net effect
       – Inference is "smoothed" w.r.t. uncertainty about the unknown distribution
     • Example
       – Smoothed unigram (x_i ∼ G_[], i = 1 : N)
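As a concrete illustration of how a prior smooths the predictive, here is the posterior predictive under a symmetric Dirichlet prior, i.e. add-alpha smoothing; the Pitman-Yor case is sketched after slide 9. This is my own minimal example, and the alphabet and alpha value are arbitrary choices:

```python
# Sketch: posterior predictive under a symmetric Dirichlet(alpha) prior.
from collections import Counter

def dirichlet_predictive(training, alphabet, alpha=0.5):
    counts = Counter(training)
    n, K = len(training), len(alphabet)
    # P(X_{n+1} = k | x_{1:n}) = (count_k + alpha) / (n + alpha * K)
    return {k: (counts[k] + alpha) / (n + alpha * K) for k in alphabet}

print(dirichlet_predictive("oacac", alphabet="oacb"))
# every symbol, including the unseen 'b', now has nonzero probability
```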

  8. G_[u] ∼ PY(d, c, G_[σ(u)]),   x_i ∼ G_[u]
       (d: discount, c: concentration, G_[σ(u)]: base distribution)
     • Tool for tying together related distributions in hierarchical models
     • Measure over measures
     • Base measure is the "mean" measure: E[G_[u](dx)] = G_[σ(u)](dx)
     • A distribution drawn from a Pitman-Yor process is related to its base distribution
       – (equal when c = ∞ or d = 1)
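A minimal sketch (mine, not the talk's code) of drawing from a Pitman-Yor process via its Chinese restaurant representation, with a uniform base distribution over {o, a, c}:

```python
# Sketch: sampling from PY(d, c, Uniform(base)) via the Chinese restaurant process.
import random

def pitman_yor_crp(n, d, c, base, seed=0):
    rng = random.Random(seed)
    tables = []                                   # each table: [dish, customer count]
    out = []
    for _ in range(n):
        # existing table k is chosen with weight (count_k - d), a new table with (c + d*K)
        weights = [cnt - d for _, cnt in tables] + [c + d * len(tables)]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append([rng.choice(base), 1])  # new table serves a dish drawn from the base
        else:
            tables[k][1] += 1
        out.append(tables[k][0])
    return out

print("".join(pitman_yor_crp(30, d=0.8, c=1.0, base="oac")))
```

The d·K term in the new-table weight is what produces the power-law behaviour mentioned on the next slide.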

  9. • Generalization of the Dirichlet process (d = 0)
       – Different (power-law) properties
       – Better for text [Teh, 2006] and images [Sudderth and Jordan, 2009]
     • Posterior predictive distribution (note: can't actually do this integral this way)
       P(X_{N+1} | x_{1:N}; c, d) ≈ ∫ P(X_{N+1} | G_[u]) P(G_[u] | x_{1:N}; c, d) dG_[u]
         = ∑_{k=1}^{K} ((m_k − d) / (c + N)) 1[φ_k = X_{N+1}] + ((c + dK) / (c + N)) G_[σ(u)](X_{N+1})
       (m_k: customers at table k, φ_k: the dish served there, K: number of occupied tables
       in the Chinese-restaurant representation)
     • Forms the basis for straightforward, simple samplers
     • Rule for stochastic memoization
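The sum above can be read directly off a seating arrangement; a small sketch (mine), here with one table per dish:

```python
# Sketch: PY posterior predictive given a Chinese-restaurant seating arrangement.
def py_predictive(symbol, tables, c, d, base):
    """tables: list of (dish, m_k) pairs; base: dict of base-distribution probabilities."""
    N = sum(m for _, m in tables)
    K = len(tables)
    p_old = sum((m - d) / (c + N) for dish, m in tables if dish == symbol)
    p_new = (c + d * K) / (c + N) * base[symbol]
    return p_old + p_new

tables = [("o", 1), ("a", 2), ("c", 2)]           # one possible seating for the symbols of "oacac"
base = {"o": 1 / 3, "a": 1 / 3, "c": 1 / 3}
print(py_predictive("a", tables, c=1.0, d=0.5, base=base))
```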

  10. • Estimation
        Θ = {G_[], G_[u], G_[v], …},   [] = σ(u) = σ(v)
        P(Θ | x_{1:N}) ∝ P(x_{1:N} | Θ) P(Θ)
      • Predictive inference
        P(X_{N+1} | x_{1:N}) = ∫ P(X_{N+1} | Θ) P(Θ | x_{1:N}) dΘ
      • Naturally related distributions tied together
        G_[the United States] ∼ PY(d, c, G_[United States])
      • Net effect
        – Observations in one context affect inference in other contexts
        – Statistical strength is shared between similar contexts
      • Example
        – Smoothing bi-gram (x_i ∼ G_[u], u ∈ {ε} ∪ Σ)
      [Figure: graphical model with base U above G_[], which is the base for the
      context-specific G_[u]; observations x_i, i = 1 : N_[u], in each context]
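A minimal sketch (mine) of the two-level case used in the bi-gram example: each context's predictive interpolates its own counts with the shared empty-context predictive, which in turn backs off to a uniform base. For brevity it assumes one table per observed symbol type:

```python
# Sketch: two-level hierarchical Pitman-Yor predictive for a smoothing bigram.
from collections import Counter

def hpy_bigram_predictive(symbol, context, data, c=1.0, d=0.5, alphabet="oac"):
    # parent restaurant: empty-context (unigram) distribution, backed off to uniform U
    uni = Counter(data)
    Nu, Ku = sum(uni.values()), len(uni)
    p_parent = (uni[symbol] - d) / (c + Nu) if symbol in uni else 0.0
    p_parent += (c + d * Ku) / (c + Nu) * (1.0 / len(alphabet))
    # child restaurant: counts of `symbol` immediately following `context`
    bi = Counter(b for a, b in zip(data, data[1:]) if a == context)
    Nc, Kc = sum(bi.values()), len(bi)
    p_child = (bi[symbol] - d) / (c + Nc) if symbol in bi else 0.0
    return p_child + (c + d * Kc) / (c + Nc) * p_parent

print(hpy_bigram_predictive("c", "a", "oacac"))   # P(c | a), sharing strength through G_[]
```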

  11. [Figure: hierarchy U → G_[] → G_[u] → G_[uu′] → …, shown as three columns:
      observations, conditional distributions, posterior predictive probabilities]

  12. [Same figure, with the topmost conditional distribution replaced by its
      Chinese-restaurant-process (CP) representation with base U]

  13. [Same figure, with a further conditional distribution replaced by its
      Chinese-restaurant-process (CP) representation]

  14. • Share statistical strength between sequentially related predictive conditional distributions
        – Estimates of highly specific conditional distributions are coupled with others that are
          related, through a single common, more general shared ancestor
      • Corresponds intuitively to back-off
      [Figure: suffix-trie of context-specific distributions G_[u], each tied to its more
      general ancestor G_[σ(u)]]

  15. G_[] | d_0, U ∼ PY(d_0, 0, U)
      G_[u] | d_{|u|}, G_[σ(u)] ∼ PY(d_{|u|}, 0, G_[σ(u)])
      x_i | x_{i−n+1:i−1} = u ∼ G_[u],   i = 1, …, T,   ∀ u ∈ Σ^{n−1}
      • Bayesian generalization of the smoothing n-gram Markov model
      • Language model: outperforms interpolated Kneser-Ney (KN) smoothing
      • Efficient inference algorithms exist
        – [Goldwater et al. '05; Teh '06; Teh, Kurihara, Welling '08]
      • Sharing between contexts that differ in the most distant symbol only
      • Finite depth
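Generalizing the bigram sketch above, here is a recursive back-off predictive in the same spirit (mine, with simplifying assumptions: a single discount d instead of the per-depth d_{|u|}, concentration fixed at 0 as in the model above, and one table per symbol type). Each context u backs off to its suffix σ(u), obtained by dropping the most distant symbol, until the uniform base U is reached:

```python
# Sketch: recursive HPYP-style predictive P(symbol | context) with suffix back-off.
from collections import Counter

def hpyp_predictive(symbol, context, data, d=0.5, c=0.0, alphabet="oac"):
    """context: preceding symbols in natural order; '' is the root, None the base measure U."""
    if context is None:
        return 1.0 / len(alphabet)                     # uniform base distribution U
    parent = None if context == "" else context[1:]    # sigma(u): drop the most distant symbol
    k = len(context)
    counts = Counter(data[i + k] for i in range(len(data) - k) if data[i:i + k] == context)
    N, K = sum(counts.values()), len(counts)
    p_parent = hpyp_predictive(symbol, parent, data, d, c, alphabet)
    if N == 0:
        return p_parent                                # nothing observed here: pure back-off
    p = (counts[symbol] - d) / (c + N) if symbol in counts else 0.0
    return p + (c + d * K) / (c + N) * p_parent

print(hpyp_predictive("c", "ca", "oacac"))             # order-3 prediction with back-off
```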

  16. • A sequence can be characterized by a set of single observations in unique contexts
        of growing length:
        oacac →  o | [],  a | o,  c | ao,  a | cao,  c | acao
        – Increasing context length
        – Always a single observation
      • Foreshadowing: all suffixes of the string "cacao"
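A small sketch (mine) making the "single observation per context" view and the suffix observation concrete:

```python
# Sketch: each symbol of "oacac" paired with its entire preceding context
# (written most-recent-first); these contexts are suffixes of the reversed string "cacao".
seq = "oacac"
pairs = [(seq[i], seq[:i][::-1] or "[]") for i in range(len(seq))]
print(pairs)
# [('o', '[]'), ('a', 'o'), ('c', 'ao'), ('a', 'cao'), ('c', 'acao')]

print([seq[::-1][j:] for j in range(len(seq) + 1)])
# ['cacao', 'acao', 'cao', 'ao', 'o', ''] -- the suffixes of "cacao"
```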

  17. P(x_{1:N}) = ∏_{i=1}^{N} P(x_i | x_1, …, x_{i−1})
                = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) P(x_4 | x_1, …, x_3) …
      • Example
        P(oacac) = P(o) P(a | o) P(c | oa) P(a | oac) P(c | oaca)
      • Smoothing essential
        – Only one observation in each context!
      • Solution
        – Hierarchical sharing à la HPYP
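As a toy illustration (mine), the hpyp_predictive sketch from the slide-15 notes can evaluate exactly this full-context factorization, predicting each symbol from its entire preceding context using only the prefix seen so far:

```python
# Sketch: sequential evaluation of P(oacac) with unbounded context length.
seq = "oacac"
p = 1.0
for i in range(len(seq)):
    # predict symbol i from its whole preceding context, "trained" on that same prefix
    p *= hpyp_predictive(seq[i], seq[:i], seq[:i], alphabet="oac")
print(p)
```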
