Machine Learning: Overview of Probability (Hamid Beigy, Sharif University of Technology)

  1. Machine Learning: Overview of Probability. Hamid Beigy, Sharif University of Technology, Fall 1396.

  2. Table of contents
     1. Probability
     2. Random variables
     3. Variance and Covariance
     4. Probability distributions: discrete distributions, continuous distributions
     5. Bayes theorem

  3. Outline: Probability, Random variables, Variance and Covariance, Probability distributions (discrete, continuous), Bayes theorem.

  4. Probability
     Probability theory is the study of uncertainty.
     Elements of probability:
     - Sample space $\Omega$: the set of all outcomes of a random experiment.
     - Event space $F$: a set whose elements $A \in F$ (called events) are subsets of $\Omega$.
     - Probability measure: a function $P : F \to \mathbb{R}$ that satisfies the following properties:
       1. $P(A) \geq 0$ for all $A \in F$.
       2. $P(\Omega) = 1$.
       3. If $A_1, A_2, \ldots$ are disjoint events (i.e., $A_i \cap A_j = \emptyset$ whenever $i \neq j$), then $P(\cup_i A_i) = \sum_i P(A_i)$.
     Properties of probability:
     1. If $A \subseteq B$, then $P(A) \leq P(B)$.
     2. $P(A \cap B) \leq \min(P(A), P(B))$.
     3. $P(A \cup B) \leq P(A) + P(B)$. This property is called the union bound.
     4. $P(\Omega \setminus A) = 1 - P(A)$.
     5. If $A_1, A_2, \ldots, A_k$ are disjoint events such that $\cup_{i=1}^{k} A_i = \Omega$, then $\sum_{i=1}^{k} P(A_i) = 1$. This property is called the law of total probability.
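These axioms and properties are easy to check on a small finite sample space. Below is a minimal Python sketch, assuming a fair six-sided die with a uniform measure; the specific events chosen are illustrative, not from the slides:

```python
from fractions import Fraction

# A minimal finite probability space: a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability measure: uniform over the sample space."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4}          # A is a subset of B below
B = {2, 4, 6}       # "even roll"
C = {1, 2}

assert A <= B and P(A) <= P(B)        # monotonicity: A subset of B implies P(A) <= P(B)
assert P(B & C) <= min(P(B), P(C))    # intersection bound
assert P(B | C) <= P(B) + P(C)        # union bound
assert P(omega - B) == 1 - P(B)       # complement rule
# Disjoint events covering omega have probabilities summing to 1
# (law of total probability).
partition = [{1, 2}, {3, 4}, {5, 6}]
assert sum(P(E) for E in partition) == 1
print("all properties verified on the die example")
```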

  5. Probability: conditional probability and independence
     Let $B$ be an event with non-zero probability. The conditional probability of any event $A$ given $B$ is defined as
     $$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
     In other words, $P(A \mid B)$ is the probability measure of the event $A$ after observing the occurrence of event $B$.
     Two events are called independent if and only if $P(A \cap B) = P(A) P(B)$, or equivalently, $P(A \mid B) = P(A)$. Therefore, independence is equivalent to saying that observing $B$ does not have any effect on the probability of $A$.
     The probability of an event is the fraction of times that the event occurs out of some number of trials, as the number of trials approaches infinity.
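The idea of conditioning as "restricting attention to trials where B occurs" can be illustrated with a short Monte Carlo sketch. The events here are illustrative assumptions: A = "die roll is even" and B = "roll greater than 3", so the exact answer is P(A ∩ B)/P(B) = (2/6)/(3/6) = 2/3:

```python
import random

random.seed(0)
n_trials = 100_000
n_B = n_AB = 0
for _ in range(n_trials):
    roll = random.randint(1, 6)
    if roll > 3:            # keep only trials where B occurred
        n_B += 1
        if roll % 2 == 0:   # among those, count occurrences of A
            n_AB += 1

print("estimated P(A|B):", n_AB / n_B)  # close to 0.6667 for large n_trials
```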

  6. Outline: Probability, Random variables, Variance and Covariance, Probability distributions (discrete, continuous), Bayes theorem.

  7. Random variables
     Consider an experiment in which we flip 10 coins, and we want to know the number of coins that come up heads. Here, the elements of the sample space $\Omega$ are length-10 sequences of heads and tails. However, in practice we usually do not care about the probability of obtaining any particular sequence of heads and tails. Instead we usually care about real-valued functions of outcomes, such as the number of heads that appear among our 10 tosses, or the length of the longest run of tails. These functions, under some technical conditions, are known as random variables.
     More formally, a random variable $X$ is a function $X : \Omega \to \mathbb{R}$.
     Typically, we will denote random variables using upper-case letters, $X(\omega)$ or more simply $X$, where $\omega$ is an outcome. We will denote the value that a random variable $X$ may take on using a lower-case letter $x$.
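The coin-flip example maps directly to code: an outcome $\omega$ is a length-10 sequence, and $X(\omega)$ counts its heads. A minimal sketch (representing outcomes as strings of 'H'/'T' is an assumption for illustration):

```python
import random

# The random variable X maps an outcome omega (a 10-flip sequence)
# to a real number: the number of heads in the sequence.
def X(omega):
    return sum(1 for flip in omega if flip == "H")

random.seed(0)
# Draw one outcome from the sample space of length-10 sequences.
omega = tuple(random.choice("HT") for _ in range(10))
print(omega, "->", X(omega))
```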

  8. Random variables
     A random variable can be discrete or continuous. A discrete random variable is described by a probability mass function; a continuous random variable is described by a probability density function.

  9. Discrete random variables
     For a discrete random variable $X$, $p(x)$ denotes the probability that $X$ takes the value $x$, i.e., $p(x) = P(X = x)$. $p(x)$ is called the probability mass function (PMF). This function has the following properties:
     - $0 \leq p(x) \leq 1$
     - $\sum_x p(x) = 1$
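A PMF can be represented as a table from values to probabilities, and both properties checked directly. A small sketch with a made-up loaded four-sided die:

```python
# Hypothetical PMF of a loaded four-sided die, as a dict x -> p(x).
pmf = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

assert all(0 <= p <= 1 for p in pmf.values())  # 0 <= p(x) <= 1 for every x
assert abs(sum(pmf.values()) - 1.0) < 1e-12    # sum_x p(x) = 1
print("valid PMF")
```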

  10. Continuous random variables
      For a continuous random variable $X$, the probability of any single value, $P(X = x)$, is zero. Instead we use $p(x)$ to denote the probability density function (PDF):
      - $p(x) \geq 0$
      - $\int p(x)\, dx = 1$
      The probability that a continuous random variable $X \in (x, x + \delta x)$ is $p(x)\, \delta x$ as $\delta x \to 0$.
      The probability that $X \in (-\infty, z)$ is given by the cumulative distribution function (CDF) $P(z)$, where
      $$P(z) = P(X \leq z) = \int_{-\infty}^{z} p(x)\, dx$$
      and hence
      $$p(x) = \left. \frac{dP(z)}{dz} \right|_{z = x}$$
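Both facts, that the density integrates to 1 and that the PDF is the derivative of the CDF, can be verified numerically. A sketch using SciPy's standard normal (the choice of distribution and the test point x = 0.7 are illustrative):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# The standard normal PDF integrates to 1 over the whole real line.
total, _ = quad(norm.pdf, -np.inf, np.inf)
print("integral of pdf:", total)  # ~1.0

# The PDF equals the derivative of the CDF: p(x) = dP(z)/dz at z = x.
x, h = 0.7, 1e-6
numeric_derivative = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)
print(numeric_derivative, norm.pdf(x))  # the two values agree closely
```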

  11. Joint probability
      The joint probability $p(X, Y)$ models the probability of co-occurrence of two random variables $X$ and $Y$.
      Let $n_{ij}$ be the number of times events $x_i$ and $y_j$ simultaneously occur, and let $N = \sum_i \sum_j n_{ij}$.
      The joint probability is $p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$.
      Let $c_i = \sum_j n_{ij}$ and $r_j = \sum_i n_{ij}$. The probability of $X$ irrespective of $Y$ is $p(X = x_i) = \frac{c_i}{N}$.
      Therefore, we can marginalize or sum over $Y$: $p(X = x_i) = \sum_j p(X = x_i, Y = y_j)$.
      For discrete random variables, we have $\sum_x \sum_y p(X = x, Y = y) = 1$.
      For continuous random variables, we have $\int_x \int_y p(X = x, Y = y)\, dy\, dx = 1$.
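The counting construction of the joint and its marginals translates directly into array operations. A sketch with a hypothetical 2x2 count table $n_{ij}$:

```python
import numpy as np

# Hypothetical count table: n[i, j] = number of times (x_i, y_j) occurred.
n = np.array([[30, 10],
              [20, 40]])
N = n.sum()

joint = n / N             # p(X = x_i, Y = y_j) = n_ij / N
p_x = joint.sum(axis=1)   # marginalize over Y: p(X = x_i) = c_i / N
p_y = joint.sum(axis=0)   # marginalize over X: p(Y = y_j) = r_j / N

assert np.isclose(joint.sum(), 1.0)         # joint sums to 1
assert np.allclose(p_x, n.sum(axis=1) / N)  # matches c_i / N
print("p(X):", p_x, " p(Y):", p_y)
```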

  12. Conditional probability
      Consider only the instances where $X = x_i$; the fraction of those instances for which $Y = y_j$ is the conditional probability, written $p(Y = y_j \mid X = x_i)$, the probability of $Y$ given $X$:
      $$p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}$$
      Now consider
      $$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j \mid X = x_i)\, p(X = x_i)$$
      If two random variables are independent, $p(X, Y) = p(X)\, p(Y)$ and $p(X \mid Y) = p(X)$.
      Sum rule: $p(X) = \sum_Y p(X, Y)$.
      Product rule: $p(X, Y) = p(Y \mid X)\, p(X)$.
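The product rule can be verified on the same kind of count table: dividing each row of counts by its row total $c_i$ gives $p(Y \mid X)$, and multiplying back by $p(X)$ recovers the joint. A sketch (counts are again hypothetical):

```python
import numpy as np

n = np.array([[30, 10],
              [20, 40]])
N = n.sum()

joint = n / N                                        # p(X, Y) = n_ij / N
p_x = joint.sum(axis=1)                              # sum rule: p(X) = sum_Y p(X, Y)
cond_y_given_x = n / n.sum(axis=1, keepdims=True)    # p(Y = y_j | X = x_i) = n_ij / c_i

# Product rule: p(X, Y) = p(Y | X) p(X), checked row by row.
assert np.allclose(joint, cond_y_given_x * p_x[:, None])
print("product rule verified")
```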

  13. Expected value
      The expectation, expected value, or mean of a random variable $X$, denoted by $E[X]$, is the average value of $X$ in a large number of experiments:
      $$E[X] = \sum_x x\, p(x) \quad \text{or} \quad E[X] = \int x\, p(x)\, dx$$
      The definition of expectation also applies to functions of random variables (e.g., $E[f(X)]$).
      Linearity of expectation: $E[\alpha f(X) + \beta g(X)] = \alpha E[f(X)] + \beta E[g(X)]$.
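Expectation over a finite PMF is a weighted sum, so linearity can be checked numerically. A sketch for a fair die, with illustrative functions f(x) = x and g(x) = x squared:

```python
import numpy as np

# PMF of a fair die; E[X] = sum_x x p(x) = 3.5.
xs = np.arange(1, 7)
p = np.full(6, 1 / 6)

def expect(f):
    """E[f(X)] under the PMF p."""
    return np.sum(p * f(xs))

E_X = expect(lambda x: x)
# Linearity: E[a f(X) + b g(X)] = a E[f(X)] + b E[g(X)].
a, b = 2.0, -0.5
lhs = expect(lambda x: a * x + b * x**2)
rhs = a * expect(lambda x: x) + b * expect(lambda x: x**2)
print(E_X, np.isclose(lhs, rhs))  # 3.5 True
```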

  14. Outline: Probability, Random variables, Variance and Covariance, Probability distributions (discrete, continuous), Bayes theorem.

  15. Variance and Covariance
      Variance ($\sigma^2$) measures how much $X$ varies around the expected value and is defined as
      $$\mathrm{Var}(X) = E\left[(X - E[X])^2\right] = E[X^2] - \mu^2$$
      Standard deviation: $\mathrm{std}[X] = \sqrt{\mathrm{Var}[X]} = \sigma$.
      Covariance indicates the relationship between two random variables $X$ and $Y$:
      $$\mathrm{Cov}(X, Y) = E_{X,Y}\left[(X - E[X])^{T} (Y - E[Y])\right]$$
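Both the identity $\mathrm{Var}(X) = E[X^2] - \mu^2$ and the covariance definition can be checked against samples. A sketch with synthetic Gaussian data (all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)
y = x + rng.normal(size=100_000)  # y is correlated with x

var_def = np.mean((x - x.mean()) ** 2)      # E[(X - E[X])^2]
var_alt = np.mean(x ** 2) - x.mean() ** 2   # E[X^2] - mu^2
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

print(var_def, var_alt)            # both near 9 = 3^2
print(cov_xy, np.cov(x, y)[0, 1])  # close (np.cov uses the unbiased estimator)
```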

  16. Outline: Probability, Random variables, Variance and Covariance, Probability distributions (discrete, continuous), Bayes theorem.

  17. Common probability distributions
      We will use these probability distributions extensively to model data as well as parameters.
      Some discrete distributions and what they can model:
      1. Bernoulli: binary numbers, e.g., the outcome (head/tail, 0/1) of a coin toss.
      2. Binomial: bounded non-negative integers, e.g., the number of heads in $n$ coin tosses.
      3. Multinomial: one of $K$ ($> 2$) possibilities, e.g., the outcome of a dice roll.
      4. Poisson: non-negative integers, e.g., the number of words in a document.
      Some continuous distributions and what they can model:
      1. Uniform: numbers defined over a fixed range.
      2. Beta: numbers between 0 and 1, e.g., the probability of heads for a biased coin.
      3. Gamma: positive unbounded real numbers.
      4. Dirichlet: vectors that sum to 1 (e.g., fractions of data points in different clusters).
      5. Gaussian: real-valued numbers or real-valued vectors.
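All of these families are available in scipy.stats. A sketch drawing one sample from each; every parameter value below is an illustrative assumption, not taken from the slides:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

print(stats.bernoulli.rvs(p=0.3, random_state=rng))             # binary outcome
print(stats.binom.rvs(n=10, p=0.5, random_state=rng))           # heads in 10 tosses
print(stats.multinomial.rvs(n=1, p=[1/6] * 6, random_state=rng))  # one dice roll (one-hot)
print(stats.poisson.rvs(mu=4.0, random_state=rng))              # e.g., a word count
print(stats.uniform.rvs(loc=0, scale=2, random_state=rng))      # number in [0, 2]
print(stats.beta.rvs(a=2, b=5, random_state=rng))               # number in (0, 1)
print(stats.gamma.rvs(a=2.0, random_state=rng))                 # positive real
print(stats.dirichlet.rvs(alpha=[1, 1, 1], random_state=rng))   # vector summing to 1
print(stats.norm.rvs(loc=0, scale=1, random_state=rng))         # real-valued number
```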

  18. Outline: Probability, Random variables, Variance and Covariance, Probability distributions (discrete, continuous), Bayes theorem.

  19. Bernoulli distribution
      A distribution over a binary random variable $x \in \{0, 1\}$, such as a coin-toss outcome, defined by a probability parameter $p \in (0, 1)$:
      $$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$
      The distribution is defined as
      $$\mathrm{Bernoulli}(x; p) = p^x (1 - p)^{1 - x}$$
      The expected value and the variance of $X$ are
      $$E[X] = p, \qquad \mathrm{Var}(X) = p(1 - p)$$
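A minimal sketch of the Bernoulli PMF, with a sample-based check of its mean and variance (p = 0.3 is an illustrative choice):

```python
import numpy as np

def bernoulli_pmf(x, p):
    """Bernoulli(x; p) = p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

p = 0.3
assert bernoulli_pmf(1, p) == p and bernoulli_pmf(0, p) == 1 - p

# Sample-based check of E[X] = p and Var(X) = p(1 - p).
rng = np.random.default_rng(0)
samples = rng.binomial(n=1, p=p, size=100_000)
print(samples.mean(), samples.var())  # ~0.3 and ~0.21
```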
