SLIDE 1

Autoregressive Models

Stefano Ermon, Aditya Grover

Stanford University

Lecture 3

SLIDE 2

Learning a generative model

We are given a training set of examples, e.g., images of dogs. We want to learn a probability distribution p(x) over images x such that:

1. Generation: If we sample xnew ∼ p(x), xnew should look like a dog (sampling)

2. Density estimation: p(x) should be high if x looks like a dog, and low otherwise (anomaly detection)

3. Unsupervised representation learning: We should be able to learn what these images have in common, e.g., ears, tail, etc. (features)

First question: how to represent p(x). Second question: how to learn it.

SLIDE 3

Recap: Bayesian networks vs neural models

Using Chain Rule: p(x1, x2, x3, x4) = p(x1) p(x2 | x1) p(x3 | x1, x2) p(x4 | x1, x2, x3). Fully general, no assumptions needed (exponential size, no free lunch).

Bayes Net: p(x1, x2, x3, x4) ≈ pCPT(x1) pCPT(x2 | x1) pCPT(x3 | x2) pCPT(x4 | x1), where some parents are crossed out (x1 is dropped from the third factor; x2 and x3 from the fourth). Assumes conditional independencies; tabular representations via conditional probability tables (CPT).

Neural Models: p(x1, x2, x3, x4) ≈ p(x1) p(x2 | x1) pNeural(x3 | x1, x2) pNeural(x4 | x1, x2, x3). Assumes a specific functional form for the conditionals. A sufficiently deep neural net can approximate any function.

SLIDE 4

Neural Models for classification

Setting: binary classification of Y ∈ {0, 1} given inputs X ∈ {0, 1}^n. For classification, we care about p(Y | x), and assume that p(Y = 1 | x; α) = f(x, α).

Logistic regression: let z(α, x) = α0 + ∑_{i=1}^{n} αi xi. Then

plogit(Y = 1 | x; α) = σ(z(α, x)), where σ(z) = 1/(1 + e^{−z})

Non-linear dependence: let h(A, b, x) = f(Ax + b) be a non-linear transformation of the inputs (features). Then

pNeural(Y = 1 | x; α, A, b) = σ(α0 + ∑_{i=1}^{h} αi hi)

More flexible, but more parameters: A, b, α. Repeat multiple times to get a multilayer perceptron (neural network).

SLIDE 5

Motivating Example: MNIST

Suppose we have a dataset D of handwritten digits (binarized MNIST). Each image has n = 28 × 28 = 784 pixels. Each pixel can either be black (0) or white (1).

We want to learn a probability distribution p(v) = p(v1, · · · , v784) over v ∈ {0, 1}^784 such that when v ∼ p(v), v looks like a digit.

Idea: define a model family {pθ(v), θ ∈ Θ}, then pick a good one based on training data D (more on that later). How to parameterize pθ(v)?

SLIDE 6

Fully Visible Sigmoid Belief Network

We can pick an ordering, i.e., order variables (pixels) from top-left (X1) to bottom-right (Xn=784).

Use chain rule factorization without loss of generality:

p(v1, · · · , v784) = p(v1) p(v2 | v1) p(v3 | v1, v2) · · · p(vn | v1, · · · , vn−1)

Some conditionals are too complex to be stored in tabular form. So we assume

p(v1, · · · , v784) = pCPT(v1; α^1) plogit(v2 | v1; α^2) plogit(v3 | v1, v2; α^3) · · · plogit(vn | v1, · · · , vn−1; α^n)

More explicitly:

pCPT(V1 = 1; α^1) = α^1, p(V1 = 0) = 1 − α^1

plogit(V2 = 1 | v1; α^2) = σ(α^2_0 + α^2_1 v1)

plogit(V3 = 1 | v1, v2; α^3) = σ(α^3_0 + α^3_1 v1 + α^3_2 v2)

Note: This is a modeling assumption. We are using a logistic regression to predict the next pixel based on the previous ones. This is called an autoregressive model.

SLIDE 7

Fully Visible Sigmoid Belief Network

The conditional variables Vi | V1, · · · , Vi−1 are Bernoulli with parameters

v̂i = p(Vi = 1 | v1, · · · , vi−1; α^i) = p(Vi = 1 | v<i; α^i) = σ(α^i_0 + ∑_{j=1}^{i−1} α^i_j vj)

How to evaluate p(v1, · · · , v784)? Multiply all the conditionals (factors). In the above example:

p(V1 = 0, V2 = 1, V3 = 1, V4 = 0) = (1 − v̂1) × v̂2 × v̂3 × (1 − v̂4)
= (1 − v̂1) × v̂2(V1 = 0) × v̂3(V1 = 0, V2 = 1) × (1 − v̂4(V1 = 0, V2 = 1, V3 = 1))

How to sample from p(v1, · · · , v784)?

1. Sample v̄1 ∼ p(v1) (np.random.choice([1,0], p=[v̂1, 1 − v̂1]))

2. Sample v̄2 ∼ p(v2 | v1 = v̄1)

3. Sample v̄3 ∼ p(v3 | v1 = v̄1, v2 = v̄2) · · ·

How many parameters? 1 + 2 + 3 + · · · + n ≈ n²/2
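A minimal numpy sketch of evaluation and ancestral sampling under this model; parameter names and the random initialization are illustrative, and for simplicity the first factor is written as a sigmoid of a lone bias rather than an explicit CPT:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    n = 784
    rng = np.random.default_rng(0)
    # alphas[i] holds (α^i_0, ..., α^i_{i-1}): a bias plus i-1 weights,
    # so the total parameter count is 1 + 2 + ... + n ≈ n²/2
    alphas = [0.01 * rng.standard_normal(i) for i in range(1, n + 1)]

    def log_prob(v, alphas):
        """log p(v): sum the log of each Bernoulli chain-rule factor."""
        lp = 0.0
        for i in range(len(v)):
            vhat = sigmoid(alphas[i][0] + alphas[i][1:] @ v[:i])
            lp += np.log(vhat) if v[i] == 1 else np.log(1.0 - vhat)
        return lp

    def sample(alphas):
        """Ancestral sampling: draw v1, then v2 | v1, and so on."""
        v = np.zeros(len(alphas))
        for i in range(len(alphas)):
            vhat = sigmoid(alphas[i][0] + alphas[i][1:] @ v[:i])
            v[i] = float(rng.random() < vhat)  # Bernoulli(v̂i)
        return v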

SLIDE 8

FVSBN Results

Figure from Learning Deep Sigmoid Belief Networks with Data Augmentation, 2015. Training data on the left (Caltech 101 Silhouettes). Samples from the model on the right. Best performing model they tested on this dataset in 2015 (more on evaluation later).

SLIDE 9

NADE: Neural Autoregressive Density Estimation

To improve the model: use a one-layer neural network instead of logistic regression:

hi = σ(Ai v<i + ci)

v̂i = p(vi | v1, · · · , vi−1; Ai, ci, αi, bi) = σ(αi hi + bi)

Here (Ai, ci, αi, bi) are the parameters. For example,

h2 = σ(A2 v1 + c2)    h3 = σ(A3 (v1, v2)ᵀ + c3)

SLIDE 10

NADE: Neural Autoregressive Density Estimation

Tie weights to reduce the number of parameters and speed up computation (see blue dots in the figure):

hi = σ(W·,<i v<i + c)

v̂i = p(vi | v1, · · · , vi−1) = σ(αi hi + bi)

For example, with wj denoting the j-th column of W:

h2 = σ(w1 v1 + c)    h3 = σ(w1 v1 + w2 v2 + c)    h4 = σ(w1 v1 + w2 v2 + w3 v3 + c)

How many parameters? Linear in n: W ∈ R^{H×n}, plus n logistic-regression coefficient vectors (αi, bi) ∈ R^{H+1}. The probability is evaluated in O(nH).
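A sketch of this O(nH) evaluation in numpy (binary case); the incremental update a ← a + W[:, i] · vi is exactly what weight tying buys us. Shapes follow the slide, variable names are ours:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def nade_log_prob(v, W, c, alpha, b):
        """W: (H, n), c: (H,), alpha: (n, H), b: (n,). Returns log p(v)."""
        a = c.copy()      # pre-activation of h1 (no inputs consumed yet)
        lp = 0.0
        for i in range(len(v)):
            h = sigmoid(a)                        # hi depends on v1..v_{i-1}
            vhat = sigmoid(alpha[i] @ h + b[i])   # v̂i = p(vi = 1 | v_<i)
            lp += np.log(vhat) if v[i] == 1 else np.log(1.0 - vhat)
            a += W[:, i] * v[i]                   # O(H) update reused by h_{i+1}
        return lp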

SLIDE 11

NADE results

Figure from The Neural Autoregressive Distribution Estimator, 2011. Samples on the left. Conditional probabilities v̂i on the right.

SLIDE 12

General discrete distributions

How to model non-binary discrete random variables Vi ∈ {1, · · · , K} (e.g., color images)? Solution: let v̂i parameterize a categorical distribution:

hi = σ(W·,<i v<i + c)

p(vi | v1, · · · , vi−1) = Cat(p^1_i, · · · , p^K_i)

v̂i = (p^1_i, · · · , p^K_i) = softmax(Vi hi + bi)

Softmax generalizes the sigmoid/logistic function σ(·) and transforms a vector of K numbers into a vector of K probabilities (non-negative, summing to 1):

softmax(a) = softmax(a1, · · · , aK) = (exp(a1) / ∑_i exp(ai), · · · , exp(aK) / ∑_i exp(ai))

In numpy: np.exp(a)/np.sum(np.exp(a))
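In practice one subtracts the maximum before exponentiating, since softmax is shift-invariant; a numerically stable variant of the one-liner above (a sketch, not part of the slide):

    import numpy as np

    def softmax(a):
        z = a - np.max(a)   # softmax(a) == softmax(a - c) for any constant c
        e = np.exp(z)
        return e / np.sum(e)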

SLIDE 13

RNADE

How to model continuous random variables Vi ∈ R (e.g., speech signals)? Solution: let v̂i parameterize a continuous distribution (e.g., a mixture of K Gaussians). v̂i needs to specify the mean and variance of each Gaussian.

SLIDE 14

RNADE

How to model continuous random variables Vi ∈ R (e.g., speech signals)? Solution: let v̂i parameterize a continuous distribution, e.g., a mixture of K Gaussians:

p(vi | v1, · · · , vi−1) = ∑_{j=1}^{K} (1/K) N(vi; μ^j_i, σ^j_i)

hi = σ(W·,<i v<i + c)

v̂i = (μ^1_i, · · · , μ^K_i, σ^1_i, · · · , σ^K_i) = f(hi)

v̂i defines the mean and variance of each Gaussian (μ^j_i, σ^j_i). Can use exp(·) to ensure the variance is non-negative.
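A sketch of evaluating this equally-weighted mixture density; parameterizing log σ and exponentiating, as the slide suggests, keeps the standard deviations positive. Names are illustrative:

    import numpy as np

    def mixture_log_density(v_i, mu, log_sigma):
        """log p(vi | v_<i) for an equal-weight mixture of K = len(mu) Gaussians."""
        sigma = np.exp(log_sigma)   # exp(·) ensures σ > 0
        log_comp = (-0.5 * ((v_i - mu) / sigma) ** 2
                    - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
        m = np.max(log_comp)
        # log( (1/K) Σ_j exp(log_comp_j) ), computed stably via log-sum-exp
        return m + np.log(np.sum(np.exp(log_comp - m))) - np.log(len(mu))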

SLIDE 15

Autoregressive models vs. autoencoders

On the surface, FVSBN and NADE look similar to an autoencoder:

an encoder e(·), e.g., e(x) = σ(W^2 (W^1 x + b^1) + b^2)

a decoder such that d(e(x)) ≈ x, e.g., d(h) = σ(V h + c)

We train by minimizing the reconstruction loss over the parameters W^1, W^2, b^1, b^2, V, c:

Binary: min ∑_{x∈D} ∑_i −xi log x̂i − (1 − xi) log(1 − x̂i)

Continuous: min ∑_{x∈D} ∑_i (xi − x̂i)^2

e and d are constrained so that we don’t learn identity mappings. Hope that e(x) is a meaningful, compressed representation of x (feature learning) A vanilla autoencoder is not a generative model: it does not define a distribution over x we can sample from to generate new data points.
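For concreteness, a sketch of the binary objective above with the slide's one-hidden-layer encoder and linear-plus-sigmoid decoder (shapes and names illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def reconstruction_loss(x, W1, b1, W2, b2, V, c):
        h = sigmoid(W2 @ (W1 @ x + b1) + b2)   # encoder e(x)
        xhat = sigmoid(V @ h + c)              # decoder d(e(x))
        # binary cross-entropy summed over components, as in the objective above
        return -np.sum(x * np.log(xhat) + (1.0 - x) * np.log(1.0 - xhat))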

SLIDE 16

Autoregressive vs. autoencoders

On the surface, FVSBN and NADE look similar to an autoencoder. Can we get a generative model from an autoencoder? We need to make sure it corresponds to a valid Bayesian network (DAG structure), i.e., we need an ordering. If the ordering is 1, 2, 3, then:

x̂1 cannot depend on any input x, so at generation time we don't need any input to get started

x̂2 can only depend on x1

· · ·

Bonus: we can use a single neural network (with n outputs) to produce all the parameters. In contrast, NADE requires n passes. Much more efficient on modern hardware.

SLIDE 17

MADE: Masked Autoencoder for Distribution Estimation

1. Parameter sharing: use a single multi-layer neural network

2. Challenge: need to make sure it is autoregressive (DAG structure)

3. Solution: use masks to disallow certain paths (Germain et al., 2015). Suppose the ordering is x2, x3, x1.

   1. The unit producing the parameters for p(x2) is not allowed to depend on any input. The unit for p(x3 | x2) may depend only on x2. And so on...

   2. For each hidden unit, pick a number i in [1, n − 1]. That unit is only allowed to depend on the first i inputs (according to the chosen ordering).

   3. Add masks to preserve this invariant: connect each unit to all units in the previous layer with a smaller or equal assigned number (strictly smaller in the final layer), as in the sketch below.
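A sketch of the mask construction under these rules, assuming the natural ordering 1, …, n (a permuted ordering like x2, x3, x1 just permutes the input degrees). Function and variable names are ours, not the paper's:

    import numpy as np

    def made_masks(n, hidden_sizes, rng):
        """Binary masks to elementwise-multiply into each layer's weight matrix."""
        degrees = [np.arange(1, n + 1)]                  # inputs get degrees 1..n
        for H in hidden_sizes:
            degrees.append(rng.integers(1, n, size=H))   # hidden degrees in [1, n-1]
        # hidden layers: connect to previous units with smaller-or-equal degree
        masks = [(d_out[:, None] >= d_in[None, :]).astype(float)
                 for d_in, d_out in zip(degrees[:-1], degrees[1:])]
        # final layer: output i depends only on units with degree strictly < i
        out_deg = np.arange(1, n + 1)
        masks.append((out_deg[:, None] > degrees[-1][None, :]).astype(float))
        return masks

    # e.g. masks = made_masks(784, [500, 500], np.random.default_rng(0))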

SLIDE 18

RNN: Recurrent Neural Nets

Challenge: model p(xt | x1:t−1; αt). The "history" x1:t−1 keeps getting longer. Idea: keep a summary and recursively update it.

Summary initialization: h0 = b0
Summary update rule: ht+1 = tanh(Whh ht + Wxh xt+1)
Prediction: ot+1 = Why ht+1

1. Hidden layer ht is a summary of the inputs seen up to time t

2. Output layer ot−1 specifies the parameters for the conditional p(xt | x1:t−1)

3. Parameterized by b0 (initialization) and matrices Whh, Wxh, Why

Constant number of parameters w.r.t. n!
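One step of the recurrence, as a minimal numpy sketch (Whh, Wxh, Why and b0 are the slide's parameters; function and variable names are ours):

    import numpy as np

    def rnn_step(h_t, x_next, Whh, Wxh, Why):
        h_next = np.tanh(Whh @ h_t + Wxh @ x_next)  # summary update h_{t+1}
        o_next = Why @ h_next                       # o_{t+1}: parameters of the next conditional
        return h_next, o_next

    # start from h0 = b0 and fold rnn_step over x1, x2, ... to get o1, o2, ...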

SLIDE 19

Example: Character RNN (from Andrej Karpathy)

1. Suppose xi ∈ {h, e, l, o}. Use one-hot encoding: h encoded as [1, 0, 0, 0], e encoded as [0, 1, 0, 0], etc.

2. Autoregressive: p(x = hello) = p(x1 = h) p(x2 = e | x1 = h) p(x3 = l | x1 = h, x2 = e) · · · p(x5 = o | x1 = h, x2 = e, x3 = l, x4 = l)

3. For example, p(x2 = e | x1 = h) is the 'e' entry of softmax(o1), e.g., exp(2.2) / (exp(1.0) + · · · + exp(4.1)), where o1 = Why h1 and h1 = tanh(Whh h0 + Wxh x1)

SLIDE 20

RNN: Recurrent Neural Nets

Pros:

1. Can be applied to sequences of arbitrary length.

2. Very general: for every computable function, there exists a finite RNN that can compute it.

Cons:

1. Still requires an ordering

2. Sequential likelihood evaluation (very slow for training)

3. Sequential generation (unavoidable in an autoregressive model)

4. Can be difficult to train (vanishing/exploding gradients)

SLIDE 21

Example: Character RNN (from Andrej Karpathy)

Train a 3-layer RNN with 512 hidden nodes on all the works of Shakespeare. Then sample from the model:

KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.

Note: generation happens character by character. The model needs to learn valid words, grammar, punctuation, etc.

SLIDE 22

Example: Character RNN (from Andrej Karpathy)

Train on Wikipedia. Then sample from the model:

Naturalism and decision for the majority of Arab countries' capitalide was grounded by the Irish language by [[John Clair]], [[An Imperial Japanese Revolt]], associated with Guangzham's sovereignty. His generals were the powerful ruler of the Portugal in the [[Protestant Immineners]], which could be said to be directly in Cantonese Communication, which followed a ceremony and set inspired prison, training. The emperor travelled back to [[Antioch, Perth, October 25—21]] to note, the Kingdom of Costa Rica, unsuccessful fashioned the [[Thrales]], [[Cynth's Dajoard]], known in western [[Scotland]], near Italy to the conquest of India with the conflict.

Note: correct Markdown syntax. Opening and closing of brackets [[·]]

SLIDE 23

Example: Character RNN (from Andrej Karpathy)

Train on Wikipedia. Then sample from the model:

{ { cite journal — id=Cerling Nonforest Department—format=Newlymeslated—none } } "www.e-complete". "'See also"': [[List of ethical consent processing]] == See also == *[[Iender dome of the ED]] *[[Anti-autism]] == External links== * [http://www.biblegateway.nih.gov/entrepre/ Website of the World Festival. The labour of India-county defeats at the Ripper of California Road.]

SLIDE 24

Example: Character RNN (from Andrej Karpathy)

Train on a data set of baby names. Then sample from the model:

Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy Marylen Hammine Janye Marlise Jacacrie Hendred Romand Charienna Nenotto Ette Dorane Wallen Marly Darine Salina Elvyn Ersia Maralena Minoria Ellia Charmin Antley Nerille Chelon Walmor Evena Jeryly Stachon Charisa Allisa Anatha Cathanie Geetra Alexie Jerin Cassen Herbett Cossie Velen Daurenge Robester Shermond Terisa Licia Roselen Ferine Jayn Lusine Charyanne Sales Sanny Resa Wallon Martine Merus Jelen Candica Wallin Tel Rachene Tarine Ozila Ketia Shanne Arnande Karella Roselina Alessia Chasty Deland Berther Geamar Jackein Mellisand Sagdy Nenc Lessie Rasemy Guen Gavi Milea Anneda Margoris Janin Rodelin Zeanna Elyne Janah Ferzina Susta Pey Castina

SLIDE 25

Pixel RNN (van den Oord, 2016)

1. Model images pixel by pixel using raster scan order

2. Each pixel conditional p(xt | x1:t−1) needs to specify 3 colors:

p(xt | x1:t−1) = p(x^red_t | x1:t−1) p(x^green_t | x1:t−1, x^red_t) p(x^blue_t | x1:t−1, x^red_t, x^green_t)

and each conditional is a categorical random variable with 256 possible values

3. Conditionals modeled using RNN variants. LSTMs + masking (like MADE)

SLIDE 26

Pixel RNN results

Results on downsampled ImageNet. Very slow: sequential likelihood evaluation.

SLIDE 27

Convolutional Architectures

Convolutions are natural for image data, and easy to parallelize on modern hardware

SLIDE 28

PixelCNN

Idea: Use a convolutional architecture to predict the next pixel given its context (a neighborhood of pixels). Challenge: Has to be autoregressive. Masked convolutions preserve the raster-scan order; additional masking enforces the order of the color channels. A sketch of such a mask follows.
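A sketch of the raster-scan kernel mask for a single channel ('A' hides the current pixel, for the first layer; 'B' keeps it, for later layers). The per-color masking mentioned above additionally splits feature maps by channel, which this sketch omits:

    import numpy as np

    def raster_mask(k, mask_type="A"):
        """k×k mask: zero at (type 'A') or right of (type 'B') the center pixel,
        and at every position below the center row."""
        mask = np.ones((k, k))
        mask[k // 2, k // 2 + (mask_type == "B"):] = 0.0   # center row: future pixels
        mask[k // 2 + 1:, :] = 0.0                         # all rows below the center
        return mask

    # apply with: masked_weight = weight * raster_mask(kernel_size)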

SLIDE 29

PixelCNN

Samples from the model trained on ImageNet (32 × 32 pixels). Similar performance to PixelRNN, but much faster.

SLIDE 30

Adversarial Attacks and Anomaly detection

Machine learning methods are vulnerable to adversarial examples. Can we detect them?

SLIDE 31

Pixel Defend

Train a generative model p(x) on clean inputs (PixelCNN). Given a new input x, evaluate p(x). Adversarial examples are significantly less likely under p(x), as sketched below.
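One simple instantiation of this detection rule, with a threshold fit on held-out clean data (log_p stands for the trained model's log-likelihood function; names are illustrative):

    import numpy as np

    def fit_threshold(clean_inputs, log_p, quantile=0.01):
        """Threshold below which, e.g., only 1% of clean inputs fall."""
        scores = np.array([log_p(x) for x in clean_inputs])
        return np.quantile(scores, quantile)

    def looks_adversarial(x, log_p, threshold):
        return log_p(x) < threshold   # unusually unlikely under p(x) -> flag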

SLIDE 32

WaveNet

State-of-the-art model for speech. Dilated convolutions increase the receptive field: the kernel only touches the signal at every 2^d entries.
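A sketch of one dilated causal convolution in 1-D (kernel size 2): with dilation d the kernel touches x[t] and x[t − d], so stacking layers with d = 1, 2, 4, … doubles the receptive field at every layer:

    import numpy as np

    def dilated_causal_conv(x, w, d):
        """y[t] = w[0] * x[t - d] + w[1] * x[t], with zero left-padding (causal)."""
        xp = np.concatenate([np.zeros(d), x])
        return w[0] * xp[:-d] + w[1] * xp[d:]

    # e.g. four stacked layers with d = 1, 2, 4, 8 see 16 past samples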

SLIDE 33

Summary of Autoregressive Models

Easy to sample from:

1. Sample x̄0 ∼ p(x0)

2. Sample x̄1 ∼ p(x1 | x0 = x̄0)

3. · · ·

Easy to compute probability p(x = x̄):

1. Compute p(x0 = x̄0)

2. Compute p(x1 = x̄1 | x0 = x̄0)

3. · · ·

4. Multiply together (sum their logarithms)

Ideally, we can compute all these terms in parallel for fast training.

Easy to extend to continuous variables. For example, we can choose Gaussian conditionals p(xt | x<t) = N(μθ(x<t), Σθ(x<t)) or a mixture of logistics.

No natural way to get features, cluster points, or do unsupervised learning.

Next: learning
