Machine Learning: Overview of Probability (Hamid Beigy, Sharif University of Technology)

  1. Machine Learning: Overview of Probability. Hamid Beigy, Sharif University of Technology, Fall 1396.

  2. Table of contents
     1. Probability
     2. Random variables
     3. Variance and Covariance
     4. Probability distributions: discrete distributions, continuous distributions
     5. Bayes theorem

  3. Outline: Probability, Random variables, Variance and Covariance, Probability distributions (discrete, continuous), Bayes theorem.

  4. Probability
     Probability theory is the study of uncertainty.
     Elements of probability:
     - Sample space $\Omega$: the set of all outcomes of a random experiment.
     - Event space $F$: a set whose elements $A \in F$ (called events) are subsets of $\Omega$.
     - Probability measure: a function $P : F \to \mathbb{R}$ that satisfies the following properties:
       1. $P(A) \geq 0$ for all $A \in F$.
       2. $P(\Omega) = 1$.
       3. If $A_1, A_2, \ldots$ are disjoint events (i.e., $A_i \cap A_j = \emptyset$ whenever $i \neq j$), then $P(\cup_i A_i) = \sum_i P(A_i)$.
     Properties of probability:
     1. If $A \subseteq B$, then $P(A) \leq P(B)$.
     2. $P(A \cap B) \leq \min(P(A), P(B))$.
     3. $P(A \cup B) \leq P(A) + P(B)$. This property is called the union bound.
     4. $P(\Omega \setminus A) = 1 - P(A)$.
     5. If $A_1, A_2, \ldots, A_k$ are disjoint events such that $\cup_{i=1}^{k} A_i = \Omega$, then $\sum_{i=1}^{k} P(A_i) = 1$. This property is called the law of total probability.
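These axioms and properties are easy to check on a small finite sample space. Below is a minimal Python sketch, assuming a fair six-sided die with a uniform measure; the specific events chosen are illustrative, not from the slides:

```python
from fractions import Fraction

# A minimal finite probability space: a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability measure: uniform over the sample space."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4}          # A is a subset of B below
B = {2, 4, 6}       # "even roll"
C = {1, 2}

assert A <= B and P(A) <= P(B)        # monotonicity: A subset of B implies P(A) <= P(B)
assert P(B & C) <= min(P(B), P(C))    # intersection bound
assert P(B | C) <= P(B) + P(C)        # union bound
assert P(omega - B) == 1 - P(B)       # complement rule
# Disjoint events covering omega have probabilities summing to 1
# (law of total probability).
partition = [{1, 2}, {3, 4}, {5, 6}]
assert sum(P(E) for E in partition) == 1
print("all properties verified on the die example")
```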

  5. Probability: conditional probability and independence
     Let $B$ be an event with non-zero probability. The conditional probability of any event $A$ given $B$ is defined as
     $$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
     In other words, $P(A \mid B)$ is the probability measure of the event $A$ after observing the occurrence of event $B$.
     Two events are called independent if and only if $P(A \cap B) = P(A) P(B)$, or equivalently, $P(A \mid B) = P(A)$. Therefore, independence is equivalent to saying that observing $B$ does not have any effect on the probability of $A$.
     The probability of an event is the fraction of times that the event occurs out of some number of trials, as the number of trials approaches infinity.
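The idea of conditioning as "restricting attention to trials where B occurs" can be illustrated with a short Monte Carlo sketch. The events here are illustrative assumptions: A = "die roll is even" and B = "roll greater than 3", so the exact answer is P(A ∩ B)/P(B) = (2/6)/(3/6) = 2/3:

```python
import random

random.seed(0)
n_trials = 100_000
n_B = n_AB = 0
for _ in range(n_trials):
    roll = random.randint(1, 6)
    if roll > 3:            # keep only trials where B occurred
        n_B += 1
        if roll % 2 == 0:   # among those, count occurrences of A
            n_AB += 1

print("estimated P(A|B):", n_AB / n_B)  # close to 0.6667 for large n_trials
```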

  6. Outline: Probability, Random variables, Variance and Covariance, Probability distributions (discrete, continuous), Bayes theorem.

  7. Random variables
     Consider an experiment in which we flip 10 coins, and we want to know the number of coins that come up heads. Here, the elements of the sample space $\Omega$ are length-10 sequences of heads and tails. However, in practice we usually do not care about the probability of obtaining any particular sequence of heads and tails. Instead we usually care about real-valued functions of outcomes, such as the number of heads that appear among our 10 tosses, or the length of the longest run of tails. These functions, under some technical conditions, are known as random variables.
     More formally, a random variable $X$ is a function $X : \Omega \to \mathbb{R}$.
     Typically, we will denote random variables using upper-case letters, $X(\omega)$ or more simply $X$, where $\omega$ is an outcome. We will denote the value that a random variable $X$ may take on using a lower-case letter $x$.
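The coin-flip example maps directly to code: an outcome $\omega$ is a length-10 sequence, and $X(\omega)$ counts its heads. A minimal sketch (representing outcomes as strings of 'H'/'T' is an assumption for illustration):

```python
import random

# The random variable X maps an outcome omega (a 10-flip sequence)
# to a real number: the number of heads in the sequence.
def X(omega):
    return sum(1 for flip in omega if flip == "H")

random.seed(0)
# Draw one outcome from the sample space of length-10 sequences.
omega = tuple(random.choice("HT") for _ in range(10))
print(omega, "->", X(omega))
```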

  8. Random variables
     A random variable can be discrete or continuous. A discrete random variable is described by a probability mass function; a continuous random variable is described by a probability density function.

  9. Discrete random variables
     For a discrete random variable $X$, $p(x)$ denotes the probability that $X$ takes the value $x$, i.e., $p(x) = P(X = x)$. $p(x)$ is called the probability mass function (PMF). This function has the following properties:
     - $0 \leq p(x) \leq 1$
     - $\sum_x p(x) = 1$
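A PMF can be represented as a table from values to probabilities, and both properties checked directly. A small sketch with a made-up loaded four-sided die:

```python
# Hypothetical PMF of a loaded four-sided die, as a dict x -> p(x).
pmf = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

assert all(0 <= p <= 1 for p in pmf.values())  # 0 <= p(x) <= 1 for every x
assert abs(sum(pmf.values()) - 1.0) < 1e-12    # sum_x p(x) = 1
print("valid PMF")
```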

  10. Continuous random variables
      For a continuous random variable $X$, the probability of any single value, $P(X = x)$, is zero. Instead we use $p(x)$ to denote the probability density function (PDF):
      - $p(x) \geq 0$
      - $\int p(x)\, dx = 1$
      The probability that a continuous random variable $X \in (x, x + \delta x)$ is $p(x)\, \delta x$ as $\delta x \to 0$.
      The probability that $X \in (-\infty, z)$ is given by the cumulative distribution function (CDF) $P(z)$, where
      $$P(z) = P(X \leq z) = \int_{-\infty}^{z} p(x)\, dx$$
      and hence
      $$p(x) = \left. \frac{dP(z)}{dz} \right|_{z = x}$$
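Both facts, that the density integrates to 1 and that the PDF is the derivative of the CDF, can be verified numerically. A sketch using SciPy's standard normal (the choice of distribution and the test point x = 0.7 are illustrative):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# The standard normal PDF integrates to 1 over the whole real line.
total, _ = quad(norm.pdf, -np.inf, np.inf)
print("integral of pdf:", total)  # ~1.0

# The PDF equals the derivative of the CDF: p(x) = dP(z)/dz at z = x.
x, h = 0.7, 1e-6
numeric_derivative = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)
print(numeric_derivative, norm.pdf(x))  # the two values agree closely
```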

  11. Joint probability
      The joint probability $p(X, Y)$ models the probability of co-occurrence of two random variables $X$ and $Y$.
      Let $n_{ij}$ be the number of times events $x_i$ and $y_j$ simultaneously occur, and let $N = \sum_i \sum_j n_{ij}$.
      The joint probability is $p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$.
      Let $c_i = \sum_j n_{ij}$ and $r_j = \sum_i n_{ij}$. The probability of $X$ irrespective of $Y$ is $p(X = x_i) = \frac{c_i}{N}$.
      Therefore, we can marginalize or sum over $Y$: $p(X = x_i) = \sum_j p(X = x_i, Y = y_j)$.
      For discrete random variables, we have $\sum_x \sum_y p(X = x, Y = y) = 1$.
      For continuous random variables, we have $\int_x \int_y p(X = x, Y = y)\, dy\, dx = 1$.
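The counting construction of the joint and its marginals translates directly into array operations. A sketch with a hypothetical 2x2 count table $n_{ij}$:

```python
import numpy as np

# Hypothetical count table: n[i, j] = number of times (x_i, y_j) occurred.
n = np.array([[30, 10],
              [20, 40]])
N = n.sum()

joint = n / N             # p(X = x_i, Y = y_j) = n_ij / N
p_x = joint.sum(axis=1)   # marginalize over Y: p(X = x_i) = c_i / N
p_y = joint.sum(axis=0)   # marginalize over X: p(Y = y_j) = r_j / N

assert np.isclose(joint.sum(), 1.0)         # joint sums to 1
assert np.allclose(p_x, n.sum(axis=1) / N)  # matches c_i / N
print("p(X):", p_x, " p(Y):", p_y)
```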

  12. Conditional probability
      Consider only the instances where $X = x_i$; the fraction of those instances for which $Y = y_j$ is the conditional probability, written $p(Y = y_j \mid X = x_i)$, the probability of $Y$ given $X$:
      $$p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}$$
      Now consider
      $$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j \mid X = x_i)\, p(X = x_i)$$
      If two random variables are independent, $p(X, Y) = p(X)\, p(Y)$ and $p(X \mid Y) = p(X)$.
      Sum rule: $p(X) = \sum_Y p(X, Y)$.
      Product rule: $p(X, Y) = p(Y \mid X)\, p(X)$.
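The product rule can be verified on the same kind of count table: dividing each row of counts by its row total $c_i$ gives $p(Y \mid X)$, and multiplying back by $p(X)$ recovers the joint. A sketch (counts are again hypothetical):

```python
import numpy as np

n = np.array([[30, 10],
              [20, 40]])
N = n.sum()

joint = n / N                                        # p(X, Y) = n_ij / N
p_x = joint.sum(axis=1)                              # sum rule: p(X) = sum_Y p(X, Y)
cond_y_given_x = n / n.sum(axis=1, keepdims=True)    # p(Y = y_j | X = x_i) = n_ij / c_i

# Product rule: p(X, Y) = p(Y | X) p(X), checked row by row.
assert np.allclose(joint, cond_y_given_x * p_x[:, None])
print("product rule verified")
```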

  13. Expected value
      The expectation, expected value, or mean of a random variable $X$, denoted by $E[X]$, is the average value of $X$ in a large number of experiments:
      $$E[X] = \sum_x x\, p(x) \quad \text{or} \quad E[X] = \int x\, p(x)\, dx$$
      The definition of expectation also applies to functions of random variables (e.g., $E[f(X)]$).
      Linearity of expectation: $E[\alpha f(X) + \beta g(X)] = \alpha E[f(X)] + \beta E[g(X)]$.
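Expectation over a finite PMF is a weighted sum, so linearity can be checked numerically. A sketch for a fair die, with illustrative functions f(x) = x and g(x) = x squared:

```python
import numpy as np

# PMF of a fair die; E[X] = sum_x x p(x) = 3.5.
xs = np.arange(1, 7)
p = np.full(6, 1 / 6)

def expect(f):
    """E[f(X)] under the PMF p."""
    return np.sum(p * f(xs))

E_X = expect(lambda x: x)
# Linearity: E[a f(X) + b g(X)] = a E[f(X)] + b E[g(X)].
a, b = 2.0, -0.5
lhs = expect(lambda x: a * x + b * x**2)
rhs = a * expect(lambda x: x) + b * expect(lambda x: x**2)
print(E_X, np.isclose(lhs, rhs))  # 3.5 True
```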

  14. Outline: Probability, Random variables, Variance and Covariance, Probability distributions (discrete, continuous), Bayes theorem.

  15. Variance and Covariance
      Variance ($\sigma^2$) measures how much $X$ varies around the expected value and is defined as
      $$\mathrm{Var}(X) = E\left[(X - E[X])^2\right] = E[X^2] - \mu^2$$
      Standard deviation: $\mathrm{std}[X] = \sqrt{\mathrm{Var}[X]} = \sigma$.
      Covariance indicates the relationship between two random variables $X$ and $Y$:
      $$\mathrm{Cov}(X, Y) = E_{X,Y}\left[(X - E[X])^{T} (Y - E[Y])\right]$$
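Both the identity $\mathrm{Var}(X) = E[X^2] - \mu^2$ and the covariance definition can be checked against samples. A sketch with synthetic Gaussian data (all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)
y = x + rng.normal(size=100_000)  # y is correlated with x

var_def = np.mean((x - x.mean()) ** 2)      # E[(X - E[X])^2]
var_alt = np.mean(x ** 2) - x.mean() ** 2   # E[X^2] - mu^2
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

print(var_def, var_alt)            # both near 9 = 3^2
print(cov_xy, np.cov(x, y)[0, 1])  # close (np.cov uses the unbiased estimator)
```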

  16. Outline: Probability, Random variables, Variance and Covariance, Probability distributions (discrete, continuous), Bayes theorem.

  17. Common probability distributions
      We will use these probability distributions extensively to model data as well as parameters.
      Some discrete distributions and what they can model:
      1. Bernoulli: binary numbers, e.g., the outcome (head/tail, 0/1) of a coin toss.
      2. Binomial: bounded non-negative integers, e.g., the number of heads in $n$ coin tosses.
      3. Multinomial: one of $K$ ($> 2$) possibilities, e.g., the outcome of a dice roll.
      4. Poisson: non-negative integers, e.g., the number of words in a document.
      Some continuous distributions and what they can model:
      1. Uniform: numbers defined over a fixed range.
      2. Beta: numbers between 0 and 1, e.g., the probability of heads for a biased coin.
      3. Gamma: positive unbounded real numbers.
      4. Dirichlet: vectors that sum to 1 (e.g., fractions of data points in different clusters).
      5. Gaussian: real-valued numbers or real-valued vectors.
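All of these families are available in scipy.stats. A sketch drawing one sample from each; every parameter value below is an illustrative assumption, not taken from the slides:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

print(stats.bernoulli.rvs(p=0.3, random_state=rng))             # binary outcome
print(stats.binom.rvs(n=10, p=0.5, random_state=rng))           # heads in 10 tosses
print(stats.multinomial.rvs(n=1, p=[1/6] * 6, random_state=rng))  # one dice roll (one-hot)
print(stats.poisson.rvs(mu=4.0, random_state=rng))              # e.g., a word count
print(stats.uniform.rvs(loc=0, scale=2, random_state=rng))      # number in [0, 2]
print(stats.beta.rvs(a=2, b=5, random_state=rng))               # number in (0, 1)
print(stats.gamma.rvs(a=2.0, random_state=rng))                 # positive real
print(stats.dirichlet.rvs(alpha=[1, 1, 1], random_state=rng))   # vector summing to 1
print(stats.norm.rvs(loc=0, scale=1, random_state=rng))         # real-valued number
```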

  18. Outline: Probability, Random variables, Variance and Covariance, Probability distributions (discrete, continuous), Bayes theorem.

  19. Bernoulli distribution
      A distribution over a binary random variable $x \in \{0, 1\}$, such as a coin-toss outcome, defined by a probability parameter $p \in (0, 1)$:
      $$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$
      The distribution is defined as
      $$\mathrm{Bernoulli}(x; p) = p^x (1 - p)^{1 - x}$$
      The expected value and the variance of $X$ are
      $$E[X] = p, \qquad \mathrm{Var}(X) = p(1 - p)$$
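A minimal sketch of the Bernoulli PMF, with a sample-based check of its mean and variance (p = 0.3 is an illustrative choice):

```python
import numpy as np

def bernoulli_pmf(x, p):
    """Bernoulli(x; p) = p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

p = 0.3
assert bernoulli_pmf(1, p) == p and bernoulli_pmf(0, p) == 1 - p

# Sample-based check of E[X] = p and Var(X) = p(1 - p).
rng = np.random.default_rng(0)
samples = rng.binomial(n=1, p=p, size=100_000)
print(samples.mean(), samples.var())  # ~0.3 and ~0.21
```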
