Representation Stefano Ermon, Aditya Grover Stanford University - PowerPoint PPT Presentation

Representation Stefano Ermon, Aditya Grover Stanford University Lecture 2 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 1 / 30

Learning a generative model We are given a training set of examples, e.g., images of dogs We want to learn a probability distribution p ( x ) over images x such that Generation: If we sample x new ∼ p ( x ), x new should look like a dog ( sampling ) Density estimation: p ( x ) should be high if x looks like a dog, and low otherwise ( anomaly detection ) Unsupervised representation learning: We should be able to learn what these images have in common, e.g., ears, tail, etc. ( features ) First question: how to represent p ( x ) Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 2 / 30

Basic discrete distributions Bernoulli distribution: (biased) coin flip D = { Heads , Tails } Specify P ( X = Heads ) = p . Then P ( X = Tails ) = 1 − p . Write: X ∼ Ber ( p ) Sampling: flip a (biased) coin Categorical distribution: (biased) m -sided dice D = { 1 , · · · , m } Specify P ( Y = i ) = p i , such that � p i = 1 Write: Y ∼ Cat ( p 1 , · · · , p m ) Sampling: roll a (biased) die Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 3 / 30

Example of joint distribution Modeling a single pixel’s color. Three discrete random variables: Red Channel R . Val( R ) = { 0 , · · · , 255 } Green Channel G . Val( G ) = { 0 , · · · , 255 } Blue Channel B . Val( B ) = { 0 , · · · , 255 } Sampling from the joint distribution ( r , g , b ) ∼ p ( R , G , B ) randomly generates a color for the pixel. How many parameters do we need to specify the joint distribution p ( R = r , G = g , B = b )? 256 ∗ 256 ∗ 256 − 1 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 4 / 30

Example of joint distribution Suppose X 1 , . . . , X n are binary (Bernoulli) random variables, i.e., Val( X i ) = { 0 , 1 } = { Black , White } . How many possible states? = 2 n 2 × 2 × · · · × 2 � �� n times Sampling from p ( x 1 , . . . , x n ) generates an image How many parameters to specify the joint distribution p ( x 1 , . . . , x n ) over n binary pixels? 2 n − 1 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 5 / 30

Structure through independence If X 1 , . . . , X n are independent, then p ( x 1 , . . . , x n ) = p ( x 1 ) p ( x 2 ) · · · p ( x n ) How many possible states? 2 n How many parameters to specify the joint distribution p ( x 1 , . . . , x n )? How many to specify the marginal distribution p ( x 1 )? 1 2 n entries can be described by just n numbers (if | Val( X i ) | = 2)! Independence assumption is too strong. Model not likely to be useful For example, each pixel chosen independently when we sample from it. Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 6 / 30

Key notion: conditional independence Two events A , B are conditionally independent given event C if p ( A ∩ B | C ) = p ( A | C ) p ( B | C ) Random variables X , Y are conditionally independent given Z if for all values x ∈ Val( X ), y ∈ Val( Y ), z ∈ Val( Z ) p ( X = x ∩ Y = y | Z = z ) = p ( X = x | Z = z ) p ( Y = y | Z = z ) We will also write p ( X , Y | Z ) = p ( X | Z ) p ( Y | Z ). Note the more compact notation. Equivalent definition: p ( X | Y , Z ) = p ( X | Z ). We write X ⊥ Y | Z Similarly for sets of random variables, X ⊥ Y | Z Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 7 / 30

Two important rules 1 Chain rule Let S 1 , . . . S n be events, p ( S i ) > 0. p ( S 1 ∩ S 2 ∩ · · · ∩ S n ) = p ( S 1 ) p ( S 2 | S 1 ) · · · p ( S n | S 1 ∩ . . . ∩ S n − 1 ) 2 Bayes’ rule Let S 1 , S 2 be events, p ( S 1 ) > 0 and p ( S 2 ) > 0. p ( S 1 | S 2 ) = p ( S 1 ∩ S 2 ) = p ( S 2 | S 1 ) p ( S 1 ) p ( S 2 ) p ( S 2 ) Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 8 / 30

Structure through conditional independence Using Chain Rule p ( x 1 , . . . , x n ) = p ( x 1 ) p ( x 2 | x 1 ) p ( x 3 | x 1 , x 2 ) · · · p ( x n | x 1 , · · · , x n − 1 ) How many parameters? 1 + 2 + · · · + 2 n − 1 = 2 n − 1 p ( x 1 ) requires 1 parameter p ( x 2 | x 1 = 0) requires 1 parameter, p ( x 2 | x 1 = 1) requires 1 parameter Total 2 parameters. · · · 2 n − 1 is still exponential, chain rule does not buy us anything. Now suppose X i +1 ⊥ X 1 , . . . , X i − 1 | X i , then ✘ x 1 , x 2 ) · · · p ( x n | ✘✘✘ p ( x 1 , . . . , x n ) = p ( x 1 ) p ( x 2 | x 1 ) p ( x 3 | ✚ x 1 , · · · , x n − 1 ) ✚ = p ( x 1 ) p ( x 2 | x 1 ) p ( x 3 | x 2 ) · · · p ( x n | x n − 1 ) How many parameters? 2 n − 1. Exponential reduction! Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 9 / 30

Structure through conditional independence Suppose we have 4 random variables X 1 , · · · , X 4 Using Chain Rule we can always write p ( x 1 , . . . , x 4 ) = p ( x 1 ) p ( x 2 | x 1 ) p ( x 3 | x 1 , x 2 ) p ( x 4 | x 1 , x 2 , x 3 ) If X 4 ⊥ X 2 | { X 1 , X 3 } , we can simplify as p ( x 1 , . . . , x n ) = p ( x 1 ) p ( x 2 | x 1 ) p ( x 3 | x 1 , x 2 ) p ( x 4 | x 1 , ✚ x 2 , x 3 ) ✚ Using Chain Rule with a different ordering we can always also write p ( x 1 , . . . , x 4 ) = p ( x 4 ) p ( x 3 | x 4 ) p ( x 2 | x 3 , x 4 ) p ( x 1 | x 2 , x 3 , x 4 ) If X 1 ⊥ { X 2 , X 3 } | X 4 , we can simplify as ✘ p ( x 1 , . . . , x 4 ) = p ( x 4 ) p ( x 3 | x 4 ) p ( x 2 | x 3 , x 4 ) p ( x 1 | ✘✘ x 2 , x 3 , x 4 ) Bayesian Networks: assume an ordering and a set of conditional independencies to get compact representation Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 10 / 30

Bayes Network: General Idea Use conditional parameterization (instead of joint parameterization) For each random variable X i , specify p ( x i | x A i ) for set X A i of random variables Then get joint parametrization as � p ( x 1 , . . . , x n ) = p ( x i | x A i ) i Need to guarantee it is a legal probability distribution. It has to correspond to a chain rule factorization, with factors simplified due to assumed conditional independencies Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 11 / 30

Bayesian networks A Bayesian network is specified by a directed acyclic graph G = ( V , E ) with: One node i ∈ V for each random variable X i 1 One conditional probability distribution (CPD) per node, p ( x i | x Pa ( i ) ), 2 specifying the variable’s probability conditioned on its parents’ values Graph G = ( V , E ) is called the structure of the Bayesian Network Defines a joint distribution: � p ( x 1 , . . . x n ) = p ( x i | x Pa ( i ) ) i ∈ V Claim: p ( x 1 , . . . x n ) is a valid probability distribution Economical representation : exponential in | Pa ( i ) | , not | V | Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 12 / 30

Example DAG stands for Directed Acyclic Graph Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 13 / 30

Example Consider the following Bayesian network: What is its joint distribution? � p ( x 1 , . . . x n ) = p ( x i | x Pa ( i ) ) i ∈ V p ( d , i , g , s , l ) = p ( d ) p ( i ) p ( g | i , d ) p ( s | i ) p ( l | g ) Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 14 / 30

Bayesian network structure implies conditional independencies! The joint distribution corresponding to the above BN factors as p ( d , i , g , s , l ) = p ( d ) p ( i ) p ( g | i , d ) p ( s | i ) p ( l | g ) However, by the chain rule, any distribution can be written as p ( d , i , g , s , l ) = p ( d ) p ( i | d ) p ( g | i , d ) p ( s | i , d , g ) p ( l | g , d , i , s ) Thus, we are assuming the following additional independencies: D ⊥ I , S ⊥ { D , G } | I , L ⊥ { I , D , S } | G . Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 15 / 30

Summary Bayesian networks given by ( G , P ) where P is specified as a set of local conditional probability distributions associated with G ’s nodes Efficient representation using a graph-based data structure Computing the probability of any assignment is obtained by multiplying CPDs Can identify some conditional independence properties by looking at graph properties Next: generative vs. discriminative; functional parameterizations Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 16 / 30

Naive Bayes for single label prediction Classify e-mails as spam ( Y = 1) or not spam ( Y = 0) Let 1 : n index the words in our vocabulary (e.g., English) X i = 1 if word i appears in an e-mail, and 0 otherwise E-mails are drawn according to some distribution p ( Y , X 1 , . . . , X n ) Words are conditionally independent given Y : Then n � p ( y , x 1 , . . . x n ) = p ( y ) p ( x i | y ) i =1 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 17 / 30

Representation Stefano Ermon, Aditya Grover Stanford University - PowerPoint PPT Presentation

Representation Stefano Ermon, Aditya Grover Stanford University Lecture 2 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 1 / 30 Learning a generative model We are given a training set of examples, e.g., images of

K K Knowledge Knowledge l d l d Representation Representation Representation

Stable and Efficient Representation Learning with Nonnegativity Constraints Tsung-Han Lin and

Precise and Approximate Representation of Numbers The Cartesian-Lagrangian representation of

Image and Video Coding: Representation, Acquisition, Display ... 10011 ... encoder decoder

Number representation in Java Scientific notation Overview topics Binary representation of

parametric surface patches 1 implicit representation implicit surface representation f ( P ) = 0

What is meant by a flashforward? The mental representation of an The mental

Unit 11 Signed Representation Systems Binary Arithmetic 11.2 BINARY REPRESENTATION SYSTEMS

Unit 11 Signed Representation Systems BINARY REPRESENTATION SYSTEMS Binary Arithmetic REVIEW

Data Representation and Data Representation and Remote Procedure Calls Remote Procedure Calls

Lecture 5: Data Representation 1 / 43 Data Representation Discussion Deep learning job postings

Integer Representation Bits, binary numbers, and bytes Fixed-width representation of integers:

Nameless Representation of Terms CIS500: Software Foundations Nameless Representation of Terms

Boundary representation of objects Smooth surfaces Implicit representation f(x, y, z)

Unit 10 Signed Representation Systems Binary Arithmetic 10.2 BINARY REPRESENTATION SYSTEMS

High Level Synthesis Design Representation Intermediate representation essential for efficient

Functions Announcements Expressions Types of expressions An expression describes a computation

Hindsight The Key to Effective Foresight Edmonton Canadian Club Centennial Presented by John

What theme does Milford convey through the characterization of Natalie? In this lesson you will

James James Author: James, Brother Of Jesus Date Written: 62 Recipient: Jewish Believers

Probability Theory Intro Jonathan Pillow Mathematical Tools for Neuroscience (NEU 314) Spring,

Magnetism and Matter MM-2: Electronic and magnetic properties Stefan Blgel Peter Grnberg

1 Projects Projects Final Reports Final Report Textual description of your system

Sparse dictionary learning in the presence of noise & outliers Rmi Gribonval INRIA Rennes

Representation Stefano Ermon, Aditya Grover Stanford University - PowerPoint PPT Presentation

Representation Stefano Ermon, Aditya Grover Stanford University Lecture 2 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 1 / 30 Learning a generative model We are given a training set of examples, e.g., images of

K K Knowledge Knowledge l d l d Representation Representation Representation

Stable and Efficient Representation Learning with Nonnegativity Constraints Tsung-Han Lin and

Precise and Approximate Representation of Numbers The Cartesian-Lagrangian representation of

Image and Video Coding: Representation, Acquisition, Display ... 10011 ... encoder decoder

Number representation in Java Scientific notation Overview topics Binary representation of

parametric surface patches 1 implicit representation implicit surface representation f ( P ) = 0

What is meant by a flashforward? The mental representation of an The mental

Unit 11 Signed Representation Systems Binary Arithmetic 11.2 BINARY REPRESENTATION SYSTEMS

Unit 11 Signed Representation Systems BINARY REPRESENTATION SYSTEMS Binary Arithmetic REVIEW

Data Representation and Data Representation and Remote Procedure Calls Remote Procedure Calls

Lecture 5: Data Representation 1 / 43 Data Representation Discussion Deep learning job postings

Integer Representation Bits, binary numbers, and bytes Fixed-width representation of integers:

Nameless Representation of Terms CIS500: Software Foundations Nameless Representation of Terms

Boundary representation of objects Smooth surfaces Implicit representation f(x, y, z)

Unit 10 Signed Representation Systems Binary Arithmetic 10.2 BINARY REPRESENTATION SYSTEMS

High Level Synthesis Design Representation Intermediate representation essential for efficient

Functions Announcements Expressions Types of expressions An expression describes a computation

Hindsight The Key to Effective Foresight Edmonton Canadian Club Centennial Presented by John

What theme does Milford convey through the characterization of Natalie? In this lesson you will

James James Author: James, Brother Of Jesus Date Written: 62 Recipient: Jewish Believers

Probability Theory Intro Jonathan Pillow Mathematical Tools for Neuroscience (NEU 314) Spring,

Magnetism and Matter MM-2: Electronic and magnetic properties Stefan Blgel Peter Grnberg

1 Projects Projects Final Reports Final Report Textual description of your system

Sparse dictionary learning in the presence of noise &amp; outliers Rmi Gribonval INRIA Rennes

Sparse dictionary learning in the presence of noise & outliers Rmi Gribonval INRIA Rennes