Probability Density Function (PDF), Joint Probability Distribution - PowerPoint PPT Presentation



SLIDE 1

Fundamentals of AI

Introduction and the most basic concepts

Probability Density Function (PDF)

SLIDE 2

Joint Probability Distribution

  • Probability of any combination of feature values occurring
  • Fundamental assumption: the dataset is an i.i.d. (independent and identically distributed) sample drawn from the PDF
  • If we know the PDF underlying our dataset, then we can predict everything (any dependence, together with its uncertainties)!
  • Moreover, knowing the PDF, we can generate an infinite number of similar datasets with the same or a different number of points
  • A truly Platonic object!

Figure: 'Banana-shaped probability distribution', an example probability density function (PDF)

SLIDE 3

Probability Density Function

  • The PDF is a way to define a joint probability distribution over features with continuous (numerical) values
  • It immediately gives us Bayesian methods that are sensible for real-valued data
  • You'll need to understand PDFs intimately in order to do kernel methods, clustering with mixture models, analysis of variance, time series analysis and many other things
  • It will introduce us to linear and non-linear regression

SLIDE 4

Example of a 1D PDF

SLIDE 5

Example of a 1D PDF

SLIDE 6

What’s the meaning of p(x)?

If p(5.31) = 0.06 and p(5.92) = 0.03 then when a value X is sampled from the distribution, you are 2 times as likely to find that X is “very close to” 5.31 than that X is “very close to” 5.92.
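This reading can be checked by simulation. A minimal sketch, assuming a hypothetical distribution X ~ N(5, 0.5²) chosen purely for illustration (the values 0.06 and 0.03 above belong to the slide's own curve):

```python
import numpy as np

# Hypothetical distribution for illustration: X ~ N(5, 0.5^2).
mu, sigma = 5.0, 0.5
pdf = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=1_000_000)

def freq_near(x, eps=0.01):
    """Fraction of samples landing 'very close to' x."""
    return np.mean(np.abs(samples - x) < eps)

# The empirical frequency ratio matches the density ratio p(5.31)/p(5.92).
ratio_empirical = freq_near(5.31) / freq_near(5.92)
ratio_density = pdf(5.31) / pdf(5.92)
print(ratio_empirical, ratio_density)
```

The two ratios agree up to sampling noise: densities tell you the relative chance of landing near a point, not the probability of hitting it exactly.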

SLIDE 7

True or False?

  ∃x : p(x) > 1   TRUE (a density can exceed 1; only its integral over all x must equal 1)
  ∀x : P(X = x) = 0   TRUE (for a continuous random variable, any single exact value has probability zero)
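The first statement can be seen with a toy density: Uniform(0, 0.5) has p(x) = 2 everywhere on its support, yet still integrates to 1. A minimal numerical check (the helper replaces a library integrator for self-containedness):

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule: approximate the integral of y over the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

# Uniform(0, 0.5): the density is 2 (> 1) everywhere on [0, 0.5] ...
xs = np.linspace(0.0, 0.5, 100_001)
p = np.full_like(xs, 2.0)

# ... yet the total probability is still exactly 1.
area = trapezoid(p, xs)
print(area)   # ≈ 1.0
```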

SLIDE 8

Expectations (aka mean value)

E[X] = the expected value of random variable X = the average value we'd see if we took a very large number of random samples of X

E[X] = ∫ x p(x) dx

SLIDE 9

Expectations

E[X] = the expected value of random variable X = the average value we'd see if we took a very large number of random samples of X

E[X] = ∫ x p(x) dx

= the first moment of the shape formed by the axes and the blue curve
= the best value to choose if you must guess an unknown person's age and you'll be fined the square of your error

E[age] = 35.897
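The "large number of random samples" reading can be sketched directly. The exponential distribution and its mean of 35 below are illustrative stand-ins (the slide's E[age] = 35.897 comes from its own dataset):

```python
import numpy as np

# Assumed distribution for illustration: Exp with mean 35.
rng = np.random.default_rng(0)
samples = rng.exponential(scale=35.0, size=1_000_000)

# The sample average converges to E[X] = ∫ x p(x) dx = 35.
mean_estimate = samples.mean()
print(mean_estimate)   # ≈ 35
```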

SLIDE 10

Variance

σ² = Var[X] = the expected squared difference between X and E[X]

σ² = ∫ (x − μ)² p(x) dx

= the amount you'd expect to lose if you must guess an unknown person's age, you'll be fined the square of your error, and you play optimally

Var[age] = 498.02

SLIDE 11

Standard Deviation

σ² = Var[X] = the expected squared difference between X and E[X]

σ² = ∫ (x − μ)² p(x) dx

= the amount you'd expect to lose if you must guess an unknown person's age, you'll be fined the square of your error, and you play optimally

σ = Standard Deviation = "typical" deviation of X from its mean

Var[age] = 498.02,  σ = √Var[X] = 22.32
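The variance and standard-deviation integrals can be evaluated numerically. The exponential density below (true mean 35, true σ = 35) is an illustrative assumption, not the slide's age data:

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule: approximate the integral of y over the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

# Assumed density for illustration: Exp(rate 1/35), with true mean 35, σ = 35.
lam = 1 / 35.0
xs = np.linspace(0.0, 700.0, 400_001)
p = lam * np.exp(-lam * xs)

mu = trapezoid(xs * p, xs)                  # E[X]   = ∫ x p(x) dx
var = trapezoid((xs - mu) ** 2 * p, xs)     # Var[X] = ∫ (x − μ)² p(x) dx
sigma = np.sqrt(var)                        # standard deviation
print(mu, var, sigma)   # ≈ 35, 1225, 35
```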

SLIDE 12

In 2 dimensions

p(x,y) = probability density of random variables (X,Y) at location (x,y)

SLIDE 13

In 2 dimensions

Let X, Y be a pair of continuous random variables, and let R be some region of (X, Y) space...

P((X, Y) ∈ R) = ∫∫_R p(x, y) dy dx

P(20 < mpg < 30 and 2500 < weight < 3000) = the volume under the 2-d surface within the red rectangle
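The double integral can be approximated on a grid. A sketch with an assumed joint density, two independent standard normals, and R = [0, 1] × [0, 1] (the mpg/weight rectangle belongs to the slide's dataset):

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule along the last axis of y."""
    return np.sum((y[..., 1:] + y[..., :-1]) * np.diff(x), axis=-1) / 2

# Assumed joint density for illustration: X, Y independent N(0, 1);
# the region R is the unit square [0, 1] x [0, 1].
xs = np.linspace(0.0, 1.0, 501)
ys = np.linspace(0.0, 1.0, 501)
X, Y = np.meshgrid(xs, ys, indexing="ij")
p = np.exp(-(X**2 + Y**2) / 2) / (2 * np.pi)   # p(x, y)

inner = trapezoid(p, ys)            # ∫ p(x, y) dy for each x
prob = float(trapezoid(inner, xs))  # then ∫ ... dx
print(prob)   # ≈ (Φ(1) − Φ(0))² ≈ 0.1165
```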

SLIDE 14

Independence

If X and Y are independent then knowing the value of X does not help predict the value of Y

X ⊥ Y iff ∀x, y : p(x, y) = p(x) p(y)

mpg and weight are NOT independent
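Without the slide's dataset, the factorization test can still be sketched on assumed densities: a standard bivariate normal factorizes into its marginals exactly when the correlation ρ is zero, and fails to when it is not:

```python
import numpy as np

def biv_normal_pdf(x, y, rho):
    """Standard bivariate normal density with correlation rho."""
    z = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

def std_normal_pdf(t):
    """N(0, 1) density."""
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

x, y = 1.0, 0.5
factored = std_normal_pdf(x) * std_normal_pdf(y)   # p(x) p(y)
print(biv_normal_pdf(x, y, 0.0), factored)   # equal: independent
print(biv_normal_pdf(x, y, 0.8), factored)   # differ: dependent
```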

SLIDE 15

Independence

If X and Y are independent then knowing the value of X does not help predict the value of Y

X ⊥ Y iff ∀x, y : p(x, y) = p(x) p(y)

The contours say that acceleration and weight are independent

SLIDE 16

Multivariate Expectation

μ_X = E[x] = ∫ x p(x) dx

The centroid of the cloud: E[mpg, weight] = (24.5, 2600)

SLIDE 17

Marginal Distributions

p(x) = ∫ p(x, y) dy
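Marginalization can be checked numerically: integrating y out of an assumed correlated bivariate normal must recover the N(0, 1) marginal of X:

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule: approximate the integral of y over the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

# Assumed joint density for illustration: standard bivariate normal with
# correlation rho, evaluated along the slice x = x0.
rho, x0 = 0.8, 1.0
ys = np.linspace(-10.0, 10.0, 200_001)
z = (x0**2 - 2 * rho * x0 * ys + ys**2) / (1 - rho**2)
joint = np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))   # p(x0, y)

marginal = trapezoid(joint, ys)                          # ∫ p(x0, y) dy
true_marginal = np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi)  # N(0, 1) pdf at x0
print(marginal, true_marginal)   # both ≈ 0.2420
```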

SLIDE 18

Conditional Distributions

p(x | y) = the p.d.f. of X when Y = y

p(mpg | weight = 2000),  p(mpg | weight = 3200),  p(mpg | weight = 4600)

SLIDE 19

Conditional Distributions

p(x | y) = the p.d.f. of X when Y = y

p(mpg | weight = 4600)

p(x | y) = p(x, y) / p(y)

Why?
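A numeric sanity check of p(x | y) = p(x, y) / p(y) on an assumed correlated bivariate normal, whose conditional is known in closed form (X | Y = y ~ N(ρy, 1 − ρ²)):

```python
import numpy as np

# Assumed joint density for illustration: standard bivariate normal.
rho, x0, y0 = 0.8, 0.2, 1.0

def biv(x, y):
    """Standard bivariate normal density with correlation rho."""
    z = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

p_y = np.exp(-y0**2 / 2) / np.sqrt(2 * np.pi)   # marginal p(y0)
cond = biv(x0, y0) / p_y                        # p(x0 | y0) = joint / marginal

# Known closed form: X | Y = y0 is N(rho * y0, 1 - rho^2).
s2 = 1 - rho**2
closed = np.exp(-(x0 - rho * y0) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
print(cond, closed)   # agree
```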

SLIDE 20

Gaussian (normal) distribution

  • The most used PDF
  • Most of classical statistical learning theory is based on Gaussians
  • Connection to the mean-squared loss
  • Connection with linearity
  • Connection with Euclidean space
  • Connection to the mean of (many) independent variables
  • The distribution with the largest entropy among all distributions with unit variance
  • A mixture of Gaussians can approximate (almost) anything
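A sketch of the last point's building block: a two-component Gaussian mixture (weights and parameters are illustrative) is itself a valid, here bimodal, PDF:

```python
import numpy as np

def gauss(x, mu, sigma):
    """N(mu, sigma^2) density."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def trapezoid(y, x):
    """Trapezoidal rule: approximate the integral of y over the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

# Illustrative two-component mixture: the weights 0.3 and 0.7 sum to 1.
xs = np.linspace(-10.0, 10.0, 100_001)
mix = 0.3 * gauss(xs, -2.0, 0.5) + 0.7 * gauss(xs, 3.0, 1.0)
total = trapezoid(mix, xs)
print(total)   # ≈ 1.0: the mixture is still a valid PDF
```

Adding more components with tuned weights, means and widths is how mixtures approximate more complicated shapes.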
SLIDE 22

The dataset is a finite set of points. The PDF is continuous. How is this possible?

SLIDE 23

Learning PDF from data

  • Part of unsupervised machine learning
  • Histograms and multi-dimensional histograms
  • Naïve Bayes: P(X,Y,Z,T) = P(X)P(Y)P(Z)P(T)
  • Bayesian networks, graphical models
  • Kernel density estimate
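The simplest estimator on the list in a minimal sketch (toy N(0, 1) data and a bin count chosen for illustration): `np.histogram` with `density=True` rescales the bar heights so the histogram integrates to 1, i.e. is a piecewise-constant PDF:

```python
import numpy as np

# Toy data, assumed for illustration.
rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, size=10_000)

# density=True rescales counts so the histogram is a piecewise-constant PDF.
heights, edges = np.histogram(data, bins=50, density=True)
total = float((heights * np.diff(edges)).sum())
print(total)   # ≈ 1.0 by construction
```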
SLIDE 24

Estimating PDF from data: Kernel Density Estimate

https://www.youtube.com/watch?v=gPWsDh59zdo
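The idea the video walks through can be sketched in a few lines: place a Gaussian "bump" of width h on every data point and average the bumps. The data and bandwidth below are illustrative assumptions:

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule: approximate the integral of y over the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def kde(x, data, h):
    """Average of Gaussian kernels of bandwidth h centred on the data points."""
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-(u**2) / 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

# Toy sample from N(0, 1); bandwidth h = 0.3 chosen by eye for this sketch.
rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=500)

xs = np.linspace(-5.0, 5.0, 2001)
est = kde(xs, data, h=0.3)
total = trapezoid(est, xs)
print(total)   # ≈ 1.0: a KDE is itself a valid PDF
```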

SLIDE 38

Estimating PDF from data: Kernel Density Estimate

Choice of bandwidth: too narrow vs. too wide
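The trade-off can be sketched numerically: with toy N(0, 1) data (true peak ≈ 0.399), an extremely narrow bandwidth produces spikes far above the truth, while an extremely wide one flattens the estimate far below it. Both h values are deliberately extreme assumptions:

```python
import numpy as np

def kde(x, data, h):
    """Average of Gaussian kernels of bandwidth h centred on the data points."""
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-(u**2) / 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

# Toy sample; the true N(0, 1) density peaks at about 0.399.
rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, size=200)

peak_narrow = kde(data, data, h=0.001).max()   # too narrow: a spike on every point
peak_wide = kde(data, data, h=3.0).max()       # too wide: oversmoothed and flat
print(peak_narrow, peak_wide)   # far above vs. far below the true 0.399
```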

SLIDE 39

d-dimensional case


What to take from this lesson

  • The probability density function (PDF) is the right way to describe the joint probability distribution of continuous numerical features

Good news:

  • Knowing the PDF gives us all the necessary information about the data
  • There are ways to estimate the PDF directly from data in a non-parametric way (KDE)

Bad news:

  • In data spaces with high intrinsic dimension (not equivalent to the number of features!), the PDF cannot be computed from data in any reasonable form