SLIDE 1

Basic Concepts

  • G. Urvoy-Keller

urvoy@unice.fr Probability and Statistics

SLIDE 2

Outline

  • Basic concepts
  • Probability
  • Conditional Probability
  • Moments
  • Common Distributions
  • Binomial
  • Zipf
  • Poisson
  • Uniform
  • Normal
  • Beta
  • Gamma
SLIDE 3

Basic Concepts

  • A random experiment is an experiment whose outcome cannot be predicted with certainty
  • The sample space is the set of all possible outcomes of an experiment
  • The outcomes of random experiments are called random variables and are often represented as uppercase letters (e.g. X)
  • Random variables can be discrete or continuous
  • An event is a subset of outcomes in the sample space
  • Mutually exclusive events: 2 events that cannot occur together
  • Extension: n events that are pairwise mutually exclusive

SLIDE 4

Probability

  • Probability is the measure of the likelihood that some event will occur
  • Historically, there are two ways of computing probabilities
  • Equal likelihood model (classical theory):
  • For an event E we count (no experiment needed) the number n of favorable outcomes
  • We also know the total number of possible outcomes N
  • We then set P=n/N
  • We thus assume that all outcomes are equally likely
  • Works well for coin and die tossing, and cards.
  • Relative frequency methods (see the sketch after this list):
  • Can be used when all outcomes are not equally likely
  • “Active method” where the experiment is carried out n times
  • If the event E occurred f times, then P=f/n
  • The modern theory of probability is based on an axiomatic theory
  • The probability of an event is computed based on:
  • a probability density function in the case of a continuous random variable
  • a probability mass function in the case of a discrete random variable
  • Common convention: use “density” (or pdf) for both discrete and continuous rvs
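
The relative frequency method is easy to mimic numerically. Here is a minimal sketch in Python; the two-dice experiment and the “sum equals 7” event are illustrative assumptions, not taken from the slides:

```python
import random

def relative_frequency(event, experiment, n=100_000):
    """Estimate P(event) as f/n: carry out the experiment n times and
    count the f runs in which the event occurred."""
    f = sum(event(experiment()) for _ in range(n))
    return f / n

# Experiment: toss two dice; event: the sum equals 7.
# The classical (equal-likelihood) answer is 6 favorable / 36 possible = 1/6.
two_dice = lambda: (random.randint(1, 6), random.randint(1, 6))
sum_is_7 = lambda o: o[0] + o[1] == 7

print(relative_frequency(sum_is_7, two_dice))  # ≈ 0.1667
```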
SLIDE 5

Probability in the case of a continuous random variable

  • Let f(x) = P(x < X < x+dx)/dx be the probability density function (pdf)

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

[Figure: pdf f(x), with the area under the curve between a and b shaded]

SLIDE 6

Probability in the case of a discrete random variable

  • Let f(x) be the probability mass function (pmf)

P(a ≤ X ≤ b) = Σ_{a ≤ x ≤ b} f(x)

[Figure: pmf f(x), with the mass points between a and b highlighted]

SLIDE 7

Cumulative Distribution Function

  • The cdf F(x) is the probability that the random variable X is less than or equal to x:

F(x) = ∫_{−∞}^{x} f(u) du (continuous case)
F(x) = Σ_{xi ≤ x} f(xi) (discrete case)

[Figure: cdfs converge to 1]
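
A minimal sketch of the discrete case in Python (the fair-die pmf is an illustrative assumption, not from the slides): build F(x) from a pmf and check that it reaches 1:

```python
# Fair six-sided die: f(x) = 1/6 for x in {1, ..., 6}.
pmf = {x: 1 / 6 for x in range(1, 7)}

def cdf(x, pmf):
    """F(x) = sum of f(xi) over all xi <= x (discrete case)."""
    return sum(p for xi, p in pmf.items() if xi <= x)

def prob_between(a, b, pmf):
    """P(a <= X <= b) as a direct sum of pmf values."""
    return sum(p for xi, p in pmf.items() if a <= xi <= b)

print(cdf(6, pmf))              # 1.0 -- cdfs converge to 1
print(prob_between(2, 4, pmf))  # 0.5
```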
SLIDE 8

Axioms of Probability

  • Let S be the sample space and E be an event (i.e., subset of S)
  • Axiom 1: The probability of event E must be between 0 and 1:

0≤P(E)≤1

  • Axiom 2:

P(S)=1

  • Axiom 3: for mutually exclusive events E1, E2, …, En

P(E1 ∪ E2 ∪ ... ∪ En) = Σ_{i=1}^{n} P(Ei)

SLIDE 9

Axioms of Probability

  • Axiom 1 states that a probability must be between 0 and 1. This means that a pdf or pmf must be non-negative and sum (or integrate) to 1
  • Axiom 2 says that some outcome must occur and that the sample space covers all possible outcomes
  • Axiom 3 lets us compute the probability that at least one of the mutually exclusive events occurs by summing their individual probabilities.

SLIDE 10

Conditional Probability and Independence

  • The conditional probability of event E given event F is defined as:

P(E|F) = P(E∩F) / P(F)

  • P(E∩F) represents the probability that E and F occur together
  • P(F) appears as a “re-normalization” factor
  • Example: for mutually exclusive events E and F, P(E∩F) = 0 and thus P(E|F) = 0. The latter denotes a very strong dependence between the two events!

SLIDE 11

Conditional Probability and Independence

  • Independence: two events E and F are said to be independent if:

P(E|F) = P(E) which is equivalent to: P(E∩F) = P(E)P(F)

  • Definition for the case of n events: E1, …, En are said to be independent if every subset E(1), E(2), …, E(k) satisfies:

P(E(1)∩E(2)∩...∩E(k)) = P(E(1))×P(E(2))×...×P(E(k))

  • Independence is not transitive! If E1 is independent of E2 and E2 of E3, E1 might still depend on E3
  • Independence is symmetric: if E is independent of F, then F is independent of E, since P(F|E) = P(E|F)P(F)/P(E)

SLIDE 12

Conditional Probability- Illustration

  • It has been demonstrated that there are a lot of free-riders in Gnutella networks.
  • Free-riders: clients that retrieve documents but do not provide any data to other peers.
  • A natural question that may arise when studying such systems is: “How many files does a client share with its peers?”
  • Due to free-riding, you will find very low figures. It is thus better to split the above question into two sub-questions:
  • What is the probability that a client is a free-rider?
  • What is the probability that a non free-rider shares n files?
SLIDE 13

Conditional Probability- Illustration

  • Let:
  • Q be the random variable that denotes the number of files offered by a client
  • S be the random variable that denotes the type of client
  • F: free-rider
  • non-F: not a free-rider
  • The previous questions can be formulated as follows:

P(S = F) and P(Q = n | S = non-F)

SLIDE 14

Independence - Illustration

  • A die is tossed twice. Consider the following events:
  • A: the first toss gives an odd number
  • B: the second toss gives an odd number
  • C: the sum of the two tosses is an odd number
  • Any pair of the previous events is independent. Indeed:
  • P(A)=P(B)=P(C)=1/2
  • P(A∩B)=P(A∩C)=P(B∩C)=1/4
  • since the sum is odd iff exactly one of the two tosses is odd
  • Still, P(A∩B∩C)=0 (two odd tosses give an even sum). Hence (A,B,C) are not independent
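
These claims are easy to verify by enumerating the 36 equally likely outcomes; a minimal sketch:

```python
# Enumerate all outcomes of two die tosses and check the probabilities above.
from itertools import product

outcomes = set(product(range(1, 7), repeat=2))  # 36 outcomes
A = {o for o in outcomes if o[0] % 2 == 1}      # first toss odd
B = {o for o in outcomes if o[1] % 2 == 1}      # second toss odd
C = {o for o in outcomes if sum(o) % 2 == 1}    # sum odd

p = lambda e: len(e) / len(outcomes)
print(p(A), p(B), p(C))              # 0.5 0.5 0.5
print(p(A & B), p(A & C), p(B & C))  # 0.25 each: pairwise independent
print(p(A & B & C))                  # 0.0 != 1/8: not mutually independent
```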
SLIDE 15

Total Probability Theorem

  • Theorem: Let E1, E2, …, En be n mutually exclusive events such that ∪i Ei = S (S is the sample space) and P(Ei) ≠ 0. Let B be an event. Then:

P(B) = Σ_{i=1}^{n} P(B|Ei) P(Ei)

  • Proof: write B = ∪i (B∩Ei); the events B∩Ei are mutually exclusive, so Axiom 3 and the definition of conditional probability give the result.
SLIDE 16

Bayes Theorem

  • Bayes' theorem allows us to estimate “a posteriori” probabilities from “a priori” probabilities.
  • Consider the following problem: one wants to evaluate the efficiency of a test for a disease. Let:
  • A = event that the test states that the person is infected
  • B = event that the person is infected
  • Ac = event that the test states that the person is not infected
  • Bc = event that the person is not infected
  • Suppose we have the following a-priori information:
  • P(A|B) = P(Ac|Bc) = 0.95 - obtained from tests on well-defined populations
  • P(B) = 0.005
  • A good measure of the efficiency of the test is the “a posteriori” probability P(B|A)

SLIDE 17

Bayes Theorem

  • Theorem: given an event F and a set of mutually exclusive events E1, E2, …, En whose union makes up the entire sample space:

P(Ei|F) = P(Ei) P(F|Ei) / Σ_{k=1}^{n} P(F|Ek) P(Ek)

  • Derivation of the theorem is straightforward using the definition of conditional probabilities

[Annotation: P(Ei|F) is the “a posteriori” information; P(Ei) is the “a priori” information]
SLIDE 18

Bayes Theorem

  • Applied to the “disease test” problem stated before, we obtain:

P(B|A) = P(B)P(A|B) / [P(B)P(A|B) + P(Bc)P(A|Bc)] = 0.005×0.95 / (0.005×0.95 + 0.995×(1−0.95)) ≈ 0.087

  • Thus, when the test is positive, the person is in fact infected in only 8.7% of the cases! Very bad!!!
  • Conclusion: even if “a priori” tests were correct for 95% of the cases, this was not enough due to the scarcity of the disease
  • For example, with P(B)=0.1 and the same test accuracy, we would have obtained: P(B|A)=68% (not that good either…)
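
A minimal sketch of this computation in Python (the numbers are those given on the slide):

```python
def posterior(p_b, p_a_given_b, p_a_given_bc):
    """P(B|A) = P(B)P(A|B) / [P(B)P(A|B) + P(Bc)P(A|Bc)] (Bayes)."""
    num = p_b * p_a_given_b
    return num / (num + (1 - p_b) * p_a_given_bc)

print(posterior(0.005, 0.95, 0.05))  # ≈ 0.087: rare disease, many false alarms
print(posterior(0.1,   0.95, 0.05))  # ≈ 0.68 with a less rare disease
```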

SLIDE 19

Mean and Variance

  • The mean or average value E[X] = µ of a distribution provides a measure of the central tendency of the distribution:

E[X] = ∫_{−∞}^{+∞} x f(x) dx (continuous case)
E[X] = Σ_i xi f(xi) (discrete case)

  • The variance V(X) = σ^2 of a random variable (r.v.) X measures the average dispersion around the mean μ:

σ^2 = V(X) = E[(X−μ)^2] = ∫_{−∞}^{+∞} (x−μ)^2 f(x) dx (continuous case)
V(X) = Σ_i (xi−μ)^2 f(xi) (discrete case)

SLIDE 20

Mean and Variance

  • E[·] is a linear function (α a scalar, X and Y r.v.):

E[αX] = αE[X]    E[X+Y] = E[X] + E[Y]

  • Practical formula:

V(X) = E[(X−μ)^2] = E[X^2 − 2μX + μ^2] = E[X^2] − 2μE[X] + μ^2 = E[X^2] − μ^2

SLIDE 21

Coefficient of Variation

  • σ = √V(X) is called the standard deviation of the r.v. X
  • C = σ/μ is called the coefficient of variation of the r.v. X
  • Interpretation:
  • “C measures the level of divergence of X with respect to its mean”
  • or
  • “C measures the variation of X in units of its mean”
  • C allows one to compare two distributions with different means
  • C is independent of the chosen unit

SLIDE 22

Coefficient of Variation

To illustrate C, let us consider two sets of values drawn from normal distributions (defined later):

Distribution 1 with µ=1, σ=10 => C=10
Distribution 2 with µ=100, σ=10 => C=0.1

Looking at the pdfs, you might miss how close or far values can be from the means:

Set 1: 11, 9.6, 16, 1.6, 11, 0.59, 10, 12, 1.6, 11
Set 2: 106.2, 108, 109.4, 90.08, 102.1, 102.4, 89.92, 92.58, 110.8, 98.69

SLIDE 23

Moments

  • The r-th moment of a rv X is:

μ'_r = E[X^r]

  • The r-th central moment of a rv X is:

μ_r = E[(X−μ)^r]

  • The mean corresponds to μ'_1 and the variance to μ_2
SLIDE 24

Skewness

  • The third central moment μ3 relates to the skewness or asymmetry of the distribution
  • The skewness is defined so as to be independent of the chosen unit:

γ1 = μ3 / μ2^{3/2}

  • For a normal rv, γ1 = 0
  • For rvs that are skewed to the left, γ1 ≤ 0
  • For rvs that are skewed to the right, γ1 ≥ 0

Remark: γ1 = 0 does not mean that the distribution is symmetric

SLIDE 25

Kurtosis

  • Kurtosis indicates the level of flatness of a rv:

γ2 = μ4 / μ2^2

  • Kurtosis for a normal rv is equal to 3.
  • Since it is the departure from normality that matters, one defines a coefficient of excess kurtosis:

γ'2 = μ4 / μ2^2 − 3

γ'2 ≤ 0: rv is flatter than a normal distribution
γ'2 ≥ 0: rv is more peaked than a normal distribution
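
A minimal sketch estimating γ1 and γ'2 from a sample via the central moments just defined (the normal sample is an illustrative assumption):

```python
import random

def central_moment(data, r):
    """Sample estimate of the r-th central moment μ_r = E[(X-μ)^r]."""
    m = sum(data) / len(data)
    return sum((x - m) ** r for x in data) / len(data)

def skewness(data):        # γ1 = μ3 / μ2^{3/2}
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data): # γ'2 = μ4 / μ2^2 − 3
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

sample = [random.gauss(0, 1) for _ in range(100_000)]
print(skewness(sample), excess_kurtosis(sample))  # both ≈ 0 for a normal rv
```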

SLIDE 26

Quantiles, Percentiles

  • Definition: consider a rv X with pmf or pdf f(x). The n-th percentile (or quantile) of X is the value x such that n percent of the mass is below x and 100−n percent of the mass is above x
  • Median = 50th percentile. For a large set of numbers drawn from the same rv, 50% should be smaller than x and 50% larger

[Figure: cdf with the 80th percentile ≈ 1 marked]

For a continuous rv: n/100 = ∫_{−∞}^{x} f(u) du = F(x) ⇔ x = F^{−1}(n/100)

For a discrete rv: x = min_i { xi | Σ_{j ≤ i} f(xj) ≥ n/100 }

SLIDE 27

Quantiles, Percentiles - Practical Computation

Let p = the percentile of interest and n = the number of data points or observations; then i = (p/100)·n is an index number we will use to find the p-th percentile:
a) If i is not an integer, round up to the next integer. In other words, the next integer higher than i is the position of the p-th percentile.
b) If i is an integer, the p-th percentile is the average of the values in positions i and i + 1.

SLIDE 28

Quantiles, Percentiles - Practical Computation

An example: last 10 golf scores for 18 holes, sorted: 82, 83, 84, 85, 85, 88, 90, 90, 93, 95.
To find the 50th percentile score, first find i: i = (50/100)·10 = 5. Since i is an integer, the 50th percentile is the average of the values in the 5th and 6th positions, 85 and 88, for an average of 86.5.
Thus 86.5 is the 50th percentile. Note the 50th percentile is the median we saw before: half the values are below this value and half are above it.
To find the 25th percentile, we take i = (25/100)·10 = 2.5. Rounding up, the 25th percentile is the value in the 3rd position: 84. 25% of the values are less than or equal to this.
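
A minimal sketch of the rule in Python, checked against the golf-score example above:

```python
import math

def percentile(sorted_data, p):
    """p-th percentile of already-sorted data, using i = (p/100)*n."""
    n = len(sorted_data)
    i = (p / 100) * n
    if i.is_integer():
        i = int(i)                          # average positions i and i+1
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    return sorted_data[math.ceil(i) - 1]    # round up, take that position

scores = [82, 83, 84, 85, 85, 88, 90, 90, 93, 95]
print(percentile(scores, 50))  # 86.5 (the median)
print(percentile(scores, 25))  # 84
```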

SLIDE 29

Common Distributions

SLIDE 30

Binomial

  • Consider an experiment whose outcome can be labeled as “success” or “failure”, corresponding to Y=1 and Y=0 respectively
  • pmf of a single experiment:

f(1) = P(Y=1) = p    f(0) = P(Y=0) = 1−p

  • Suppose we repeat this experiment n times (independent trials)
  • Let X be the discrete rv that denotes the number of successes
  • X follows the binomial distribution with parameters (n,p) if its pmf follows:

f(k; n, p) = P(X=k) = (n choose k) p^k (1−p)^(n−k);  k ∈ [0, n]

E[X] = np    V[X] = np(1−p)

SLIDE 31

Binomial

SLIDE 32

Binomial - Example

  • A computer receives 10 messages
  • Each message might be corrupted with a probability of 0.01
  • Q: what is the probability of at least one error?
  • A:

P(at least 1 error) = 1 − P(0 errors) = 1 − f(0; 10, 0.01) = 1 − (10 choose 0) × 0.99^10 × 0.01^0 ≈ 0.0956
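
A minimal sketch of the example in Python (math.comb provides the binomial coefficient):

```python
from math import comb

def binomial_pmf(k, n, p):
    """f(k; n, p) = C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 10 messages, each corrupted with probability 0.01.
p_at_least_one = 1 - binomial_pmf(0, 10, 0.01)
print(round(p_at_least_one, 4))  # 0.0956
```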

SLIDE 33

Zipf Law

  • Rank is inversely proportional to frequency
  • Example: frequency of English words
SLIDE 34

Zipf Law

P(X=k) = C · 1/k^{1+α},  α > 0,  k = 1, 2, 3, ....  with C = 1 / Σ_{i=1}^{∞} 1/i^{1+α}

  • Modeling application:
  • Popularity of files on a VoD server, popularity of files in Kazaa
  • Interpretation: a few highly popular files, the others are highly unpopular (requested < 1% of the time)

SLIDE 35

Poisson

  • A discrete rv X follows a Poisson law with parameter λ if its pmf is such that:

f(x; λ) = P(X=x) = e^{−λ} λ^x / x!;  x = 0, 1, ...

  • The Poisson law can be interpreted as a limit of the Binomial law with λ = np, for n >> 1 and p << 1 with np finite.
  • In practice, n > 50 and p < 0.1

E[X] = λ    V[X] = λ

SLIDE 36

Poisson

  • Proof: take the binomial pmf with p = λ/n and let n → ∞; x is kept constant!
SLIDE 37

Poisson

SLIDE 38

Poisson - Example

  • Consider an appliance with 10,000 components.
  • Components fail independently
  • Yearly failure probability is 10^−4
  • Q: what is the probability that 10 components fail during 1 year?

  • A: Binomial trial with x=10, n=10,000 and p=10^−4

f_binomial(10; 10,000; 10^−4) = (10,000 choose 10) (10^−4)^10 (1−10^−4)^9990 [Arghh!!!]

f_Poisson(10; 10,000×10^−4) [n>>1, p<<1, np=1] = (1^10/10!) e^−1 ≈ 10^−7
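
A minimal sketch comparing the exact binomial value with the Poisson approximation (the binomial coefficient that is hopeless by hand is harmless to a computer):

```python
from math import comb, exp, factorial

n, p, x = 10_000, 1e-4, 10
lam = n * p  # λ = np = 1

binom = comb(n, x) * p**x * (1 - p)**(n - x)  # exact but unwieldy by hand
poisson = exp(-lam) * lam**x / factorial(x)   # the easy approximation

print(binom, poisson)  # both ≈ 1e-7
```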

SLIDE 39

Uniform

  • A continuous rv is uniformly distributed in the interval (a,b), with −∞ < a < b < ∞, if its pdf is such that:

f(x; a, b) = 1/(b−a)

E[X] = (a+b)/2    V[X] = (b−a)^2/12

SLIDE 40

Normal

  • A rv follows a normal (or Gaussian) law with mean µ and variance σ^2 if its pdf is such that:

f(x; μ, σ^2) = (1/(σ√(2π))) exp{−(x−μ)^2 / (2σ^2)}

  • Common notation: X ~ N(μ, σ^2)
  • Standard normal rv: X ~ N(0, 1)
  • If X ~ N(μ, σ^2) then Z = (X−μ)/σ ~ N(0, 1)

SLIDE 41

Normal

  • The cdf of a standard normal rv is often denoted by:

Φ(x) = (1/√(2π)) ∫_{−∞}^{x} exp{−y^2/2} dy

  • The normal distribution appears extensively in the theory of statistics
  • Here, we state without proof that the Normal distribution is a limit for the binomial and the Poisson distributions:

SLIDE 42

Normal

  • For the previous approximations, we have approximated the two discrete variables by a continuous one
  • If we are to evaluate cdfs, we obtain the following intuitive formulas:

P[α ≤ X_binomial ≤ β] ≈ Φ[(β − np + 0.5)/√(np(1−p))] − Φ[(α − np − 0.5)/√(np(1−p))]

P[α ≤ X_Poisson ≤ β] ≈ Φ[(β − λ + 0.5)/√λ] − Φ[(α − λ − 0.5)/√λ]
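
A minimal sketch of the continuity-corrected binomial approximation, compared with the exact sum (n=100, p=0.4 and the range [35, 45] are illustrative assumptions):

```python
from math import comb, erf, sqrt

def phi(x):
    """Standard normal cdf Φ(x), via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def binomial_range_exact(a, b, n, p):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, b + 1))

def binomial_range_normal(a, b, n, p):
    mu, sd = n * p, sqrt(n * p * (1 - p))
    return phi((b - mu + 0.5) / sd) - phi((a - mu - 0.5) / sd)

n, p = 100, 0.4
print(binomial_range_exact(35, 45, n, p))   # ≈ 0.74
print(binomial_range_normal(35, 45, n, p))  # close to the exact value
```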

SLIDE 43

Normal

SLIDE 44

Normal

  • The normal random variable is such that 95% of the values fall within an interval of 4σ around the mean (µ ± 2σ). Proof (µ=0, σ=1):

Φ(2) − Φ(−2) = (1/√(2π)) ∫_{−2}^{2} exp(−y^2/2) dy ≈ 0.9545

SLIDE 45

Exponential

  • A rv X is exponentially distributed with parameter λ if its pdf is such that:

f(x; λ) = λ e^{−λx} for x ≥ 0; λ > 0

E[X] = 1/λ    V[X] = 1/λ^2

  • Often used to model the time between two events:
  • Phone calls
  • Time until a part fails
SLIDE 46

Exponential

SLIDE 47

Gamma

  • A rv X is Gamma distributed with parameters λ>0 (called the scale parameter) and t>0 (shape parameter) if:

f(x; λ, t) = λ e^{−λx} (λx)^{t−1} / Γ(t);  x > 0   where Γ(t) = ∫_0^∞ e^{−y} y^{t−1} dy

E[X] = t/λ    V[X] = t/λ^2

SLIDE 48

Gamma

SLIDE 49

Chi-Square

  • A gamma distribution with λ=0.5 and t=ν/2, where ν is a positive integer, is called a chi-square distribution with ν degrees of freedom
  • Often denoted as χ²_ν
  • Important for goodness-of-fit tests

f(x; ν) = (1/Γ(ν/2)) (1/2)^{ν/2} x^{ν/2−1} e^{−x/2};  x ≥ 0

E[X] = ν    V[X] = 2ν

SLIDE 50

Beta Distribution

  • Very flexible: wide range of shapes depending on parameters
  • Parameters: α>0 and β>0

f(x; α, β) = (1/B(α,β)) x^{α−1} (1−x)^{β−1};  0 < x < 1
where B(α,β) = ∫_0^1 x^{α−1} (1−x)^{β−1} dx = Γ(α)Γ(β)/Γ(α+β)

E[X] = α/(α+β)    V[X] = αβ / [(α+β)^2 (α+β+1)]

SLIDE 51

Beta Distribution

[Figure: beta pdfs showing a U shape and a J shape]

SLIDE 52

Pearson System of Classification

  • G. Urvoy-Keller

urvoy@unice.fr Probability and Statistics

SLIDE 53

Introduction

  • The first four moments of a distribution relate respectively to:
  • The average value of the distribution (µ)
  • The average variation around the mean (σ)
  • The asymmetry or skewness (γ1)
  • The flatness (γ2)
  • Assume:
  • One considers only unimodal distributions, i.e. distributions with a single mode or anti-mode
  • The shape of the distribution is defined through γ1 and γ2
  • The Pearson system allows you to find the “best” distributions for a given γ1 and γ2

SLIDE 54

Cooking guide

  • Informally, you:

1. Observe a dataset whose histogram looks unimodal
2. Compute γ1 and γ2
3. Use the Pearson graph to decide which type of distribution models your data best

[Figure: Step 1 (histogram) → Step 2 (γ1, γ2) → Step 3 (Pearson graph)]

SLIDE 55

Pearson Diagram

  • The starting point is the distributions (densities) satisfying the following differential equation:

df(x)/dx = −(x+a) / (c0 + c1·x + c2·x^2) · f(x)

  • One restricts to the case where −a is not a root of c0 + c1·x + c2·x^2 = 0
  • A maximum/minimum (the mode) is reached at x = −a
  • f(x) → 0 as |x| → ∞, because f(x) ≥ 0 and ∫_{−∞}^{+∞} f(x) dx < ∞. Hence df(x)/dx → 0 as |x| → ∞
SLIDE 56

Pearson Diagram

  • The shape of the solution of the previous equation depends (considerably) on the values of a, c0, c1 and c2
  • It is possible to express the different solution domains in terms of the skewness and kurtosis parameters
  • More formally, let us define

w = γ1(γ2+3)^2 / [4(4γ2 − 3γ1)(2γ2 − 3γ1 − 6)]

SLIDE 57

Pearson Diagram

  • The distributions are classified as follows:

I (Beta): w < 0
II: γ1 = 0, γ2 < 3
III (Gamma): 2γ2 − 3γ1 − 6 = 0
IV: 0 < w < 1
V: w = 1
VI: w > 1
VII: γ1 = 0, γ2 > 3
N (Normal): γ1 = 0, γ2 = 3 (a single point)

[Figure: the classification regions in the (γ1, γ2) plane]

SLIDE 58

What you have to remember

  • Such a system exists, and it can be used to approximately model a random sample you have obtained
  • Some types do not correspond to “common” distributions. They are referred to in the literature as Pearson type x
  • Type I (Beta) U means U-shape
  • Type I (J) means J-shape
SLIDE 59

What you have to remember

  • We will see later in the course more advanced methods to assess whether a sample “really follows” a given distribution
  • For the moment, we can use the Pearson system to check that 2 (or more) samples come from different distributions
  • This would be the case if their corresponding γ1 and γ2 fall in widely different regions of the Pearson graph (see the sketch below)
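
A minimal sketch placing two samples on the Pearson graph by estimating their (γ1, γ2) coordinates; the normal and exponential samples are illustrative assumptions:

```python
import random

def gamma1_gamma2(data):
    """Sample estimates of skewness γ1 and kurtosis γ2 via central moments."""
    m = sum(data) / len(data)
    mu = lambda r: sum((x - m) ** r for x in data) / len(data)
    return mu(3) / mu(2) ** 1.5, mu(4) / mu(2) ** 2

normal_like = [random.gauss(0, 1) for _ in range(10_000)]
expo_like = [random.expovariate(1) for _ in range(10_000)]
print(gamma1_gamma2(normal_like))  # ≈ (0, 3): the normal point N
print(gamma1_gamma2(expo_like))    # ≈ (2, 9): a widely different region
```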

SLIDE 60

Exploratory Data Analysis

  • G. Urvoy-Keller

urvoy@unice.fr Probability and Statistics

SLIDE 61

Outline

  • Representations of a single random variable
  • Histograms
  • QQ plot
  • Assessing Relations between random variables
  • Quantile-based plots
  • Scatterplot
  • Bivariate Histograms
SLIDE 62

REPRESENTATION OF A SINGLE RANDOM VARIABLE


SLIDE 63

Histograms

  • Frequency histograms: cluster data into sets of non-overlapping bins that cover the whole range of the data
  • Relative frequency histograms: obtained by dividing the height of each bin by the total number of samples
  • Density histogram: a histogram normalized so that it integrates to one, like a density. Let:
  • Bk be the range of x values in bin k = 1, .., n
  • N be the total number of data points
  • h be the size of each bin
  • νk be the number of data points falling in Bk

The density histogram is defined through the following function (see the sketch below): f̂(x) = νk / (N·h);  ∀ x ∈ Bk
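
A minimal sketch of a density histogram in plain Python (equal-width bins; the normal sample is an illustrative assumption):

```python
import random

def density_histogram(data, n_bins):
    """Return (bin width h, list of fhat values), where fhat = νk / (N*h)."""
    lo, hi = min(data), max(data)
    h = (hi - lo) / n_bins
    counts = [0] * n_bins  # νk
    for x in data:
        k = min(int((x - lo) / h), n_bins - 1)  # clamp max(data) to last bin
        counts[k] += 1
    return h, [nu / (len(data) * h) for nu in counts]

sample = [random.gauss(0, 1) for _ in range(1000)]
h, fhat = density_histogram(sample, 20)
print(sum(v * h for v in fhat))  # 1.0: the histogram integrates to one
```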

SLIDE 64

Histograms – Choice of Parameters

  • An important parameter to choose when drawing a histogram is the number of bins n
  • Since we draw (density) histograms to estimate a density, the choice of n will influence the error we make when considering f̂(x) instead of f(x)
  • Intuitively:
  • If n is small, the histogram is smooth and errors are larger but do not vary too much from one x to another
  • If n is large, the histogram is sharp and the error can be smaller for some x and larger for others

SLIDE 65

Histograms – Choice of Parameters

  • Example with a normal random sample of size 1000

[Figure: histograms with different bin counts, comparing the estimate f̂(x) to the true density f(x)]

SLIDE 66

Histograms – Error Metrics

  • Step 1: perform a measurement, i.e. collect a sample
  • Step 2: based on the sample, compute an estimate f̂(x) of the density
  • Step 3: estimate the error when using f̂(x) instead of f(x)
  • We can define the following measures of error:
  • The mean square error:

MSE[f̂(x)] = E[(f̂(x) − f(x))^2]

  • The integrated squared error (for a given trajectory):

ISE = ∫ (f̂(x) − f(x))^2 dx

  • The mean integrated square error, which is the ISE averaged over all possible trajectories:

MISE = E[∫ (f̂(x) − f(x))^2 dx]

SLIDE 67

Histograms – Error Metrics

  • Under some assumptions, Scott (1992) has proved the following upper bound on the MSE:

MSE(f̂(x)) ≤ f(ξk)/(N·h) + γk^2·h^2

where ξk is a point in Bk and the density is assumed to be Lipschitz-continuous: |f(x) − f(y)| < γk·|x − y|;  ∀ x, y in Bk

  • Consequence: h has to be chosen neither too small nor too large
  • Most rules to choose h try to minimize the MSE
SLIDE 68

Histograms – Choice of Parameters

  • Sturges' rule: n = 1 + log2(N)
  • Further assuming the existence of R(f') = ∫ (f'(x))^2 dx, we can obtain an asymptotic value for the MISE:

MISE → 1/(N·h) + (1/12)·h^2·R(f')   (N large)

  • For N large, the optimal bin width h* is obtained as:

h* = (6 / (N·R(f')))^{1/3}

SLIDE 69

Histograms – Choice of Parameters

  • For a normally distributed variable, we obtain:

R(f') = 1 / (4σ^3·√π)

  • This yields the Normal Reference rule that “is used” for all variables (not only normally distributed ones):

h* = (24·σ^3·√π / N)^{1/3} ≈ 3.5·σ·N^{−1/3}

  • Last issue: how to estimate σ:
  • Scott suggested the sample standard deviation s
  • Freedman and Diaconis suggested the IQR (more robust):

IQR = 75th quantile − 25th quantile
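
A minimal sketch applying the Normal Reference rule to pick a bin width (the normal sample is an illustrative assumption; σ is estimated with the sample standard deviation, as Scott suggests):

```python
import math
import random
from statistics import stdev

sample = [random.gauss(0, 1) for _ in range(1000)]
N = len(sample)
h_star = 3.5 * stdev(sample) * N ** (-1 / 3)            # h* ≈ 3.5 σ N^(-1/3)
n_bins = math.ceil((max(sample) - min(sample)) / h_star)  # bins covering range
print(h_star, n_bins)  # roughly h* ≈ 0.35 and about 20 bins for this sample
```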

SLIDE 70

Average Shifted Histograms

  • So far, we have discussed the choice of n or h (number and size of bins) but not the issue of x0, the initial value
  • x0 is obtained as the minimum of the data
  • A simple idea to improve the precision of the histogram w/o reducing the size of the bins is to consider the m following histograms:

f̂1, …, f̂m where:
f̂1 is the histogram that starts at x0
f̂2 is the histogram that starts at x0 + h/m
…
f̂m is the histogram that starts at x0 + (m−1)·h/m

SLIDE 71

Average Shifted Histograms

  • We then average f̂1, …, f̂m to obtain the average shifted histogram (ASH):

f̂_ASH(x) = (1/m) Σ_{i=1}^{m} f̂i(x)

  • Example with a normal sample of size 100
  • On the left side, the default histogram of Matlab (10 bins)
  • On the right side, an ASH with m=5 and 10 bins
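
A minimal sketch of an ASH evaluated pointwise (a naive O(N) evaluation per point, chosen for clarity rather than efficiency; the sample is an illustrative assumption):

```python
import random

def ash(data, n_bins, m):
    """Return f_ASH(x) = (1/m) * sum_i f_i(x), origins shifted by i*h/m."""
    lo, hi = min(data), max(data)
    h = (hi - lo) / n_bins
    N = len(data)

    def density(x, shift):
        # Density-histogram value at x for the histogram starting at lo+shift.
        k = int((x - (lo + shift)) // h)
        count = sum(1 for d in data
                    if k * h <= d - (lo + shift) < (k + 1) * h)
        return count / (N * h)

    return lambda x: sum(density(x, i * h / m) for i in range(m)) / m

sample = [random.gauss(0, 1) for _ in range(100)]
f_ash = ash(sample, 10, 5)
print(f_ash(0.0))  # smoothed estimate near the mode, ≈ 0.4 for this sample
```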

SLIDE 72

Box Plot

  • Idea: obtain a compact representation of distribution samples in order to compare them
  • In practice, the following values are used for a given distribution:
  • The three quartiles (q̂0.25, q̂0.5, q̂0.75)
  • The maximum and minimum values of the samples that fall in the range [q̂0.25 − 1.5×IQ̂R, q̂0.75 + 1.5×IQ̂R]. These values are called the adjacent values
  • The outliers, which lie in the ranges [min(x), q̂0.25 − 1.5×IQ̂R] and [q̂0.75 + 1.5×IQ̂R, max(x)]
  • Outliers might be due to:
  • Errors in measurements
  • The complexity of the distribution we are looking at; in this case outliers are not real outliers

SLIDE 73

Box Plot

  • Example: left side: uniform sample, middle: normal sample, right side: exponential sample

SLIDE 74

REPRESENTATION OF PAIRS OF RANDOM VARIABLES


SLIDE 75

Quantile-Based Plots

  • Used when one needs to compare two distributions. The idea is to plot the quantiles of one distribution against the quantiles of the other.
  • If the two distributions are empirical distributions (i.e. 2 samples), we call the quantile-based plot a Quantile-Quantile plot or Q-Q plot
  • If we compare a theoretical distribution to a random sample, we call the quantile-based plot a Quantile plot

SLIDE 76

Q-Q Plot

  • Consider two samples: (x1, ..., xn) and (y1, ..., ym); m ≤ n
  • Order statistics: x(1) ≤ x(2) ≤ ... ≤ x(n) and y(1) ≤ y(2) ≤ ... ≤ y(m)
  • Let us first assume that m = n
  • The Q-Q plot is obtained by plotting y(i) against x(i)

SLIDE 77

Q-Q Plot - Examples

  • Two standard normal samples of size 100:
SLIDE 78

Q-Q Plot - Examples

  • One standard normal sample against one normal sample with µ=0 and σ=2

SLIDE 79

Q-Q Plot - Examples

One normal sample with µ=1 and σ=1 against one exponential sample with λ=1 (i.e. µ=1 and σ=1)

SLIDE 80

Q-Q Plot

  • Consider now the case where n > m
  • In this case, we plot y(i), i=1,..,m against the (i−0.5)/m quantile of x (see the sketch below)
  • In fact, we could plot against any quantile q such that: (i−1)/m ≤ q ≤ i/m
  • (i−0.5)/m is considered here because it is the middle of this interval
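
A minimal sketch computing the Q-Q plot coordinates for samples of unequal sizes; the empirical quantile of x is taken by position in the sorted sample, a simple choice among several possible ones:

```python
import random

def qq_points(x, y):
    """Return (x-quantile, y order statistic) pairs; requires len(y) <= len(x)."""
    xs, ys = sorted(x), sorted(y)
    m, n = len(ys), len(xs)
    pts = []
    for i in range(1, m + 1):
        q = (i - 0.5) / m                            # mid-interval quantile
        pts.append((xs[min(int(q * n), n - 1)], ys[i - 1]))
    return pts

x = [random.gauss(0, 1) for _ in range(100)]
y = [random.gauss(0, 1) for _ in range(75)]
print(qq_points(x, y)[:3])  # points near the bisector: same distribution
```
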
SLIDE 81

Q-Plot

  • A Q-plot is one where the theoretical quantiles are plotted against the order statistics of a sample x of size n
  • If F is the theoretical cdf, then one plots:

F^{−1}((i − 0.5)/n) against x(i)

SLIDE 82

Quantile-based Plots - Interpretation

  • Quantile-based plots can help reject the hypothesis “the two samples are drawn from the same variable”
  • If the samples are too small, comparison is difficult
  • To help in making a decision, it is possible to add two lines:
  • The bisector
  • The line based on the 25th and 75th quantiles of each data set
SLIDE 83

Quantile-based Plots - Interpretation

  • 2 standard normal samples of size 50 and 75
  • There are maybe too few samples to draw conclusions (if it were a blind test)

SLIDE 84

Scatterplot

  • One often wants to assess the relationship between two variables; e.g. how RTTs and throughputs are related in the Internet
  • A simple graphical way is to plot the scatterplot, which is the plot of all obtained 2-tuples (xi, yi) from the two datasets x and y
  • Example: left: two dependent normal samples; right: two independent normal samples (much more widespread)

SLIDE 85

Scatterplot vs. QQ-plot

  • Scatterplots look for temporal relations between data
  • QQ-plots look for a specific relation (exactly the same distribution) between the two datasets. Time ordering is lost here!
  • Examples:
  • Assume you have two random variables X and Y and there exists a temporal relation Y = aX + b
  • X and Y do not have the same mean and variance, so the qq-plot will diverge from the bisector (unless a is close to 1 and b to 0)
  • The scatterplot will look like a line
  • Assume that X and Y are drawn independently from the same distribution
  • The qq-plot is going to be close to the bisector (if you have enough samples)
  • The scatterplot will present a shaded area

SLIDE 86

Bivariate histograms

  • Let hi be the width of the bin in dimension i
  • The density histogram is the plot of the density estimate (see the sketch below):

f̂(x⃗) = νk / (N·h1·h2);  x⃗ ∈ Bk

  • Example: for a bivariate standard normal random sample of size 1000
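
A minimal sketch of a bivariate density histogram on a fixed grid (the grid range [−4, 4] and the bivariate standard normal sample are illustrative assumptions):

```python
import random

N, n_bins, lo, hi = 1000, 10, -4.0, 4.0
h = (hi - lo) / n_bins                  # h1 = h2 = h on this square grid
counts = [[0] * n_bins for _ in range(n_bins)]  # νk per 2-D bin

for _ in range(N):
    x, y = random.gauss(0, 1), random.gauss(0, 1)
    i = min(max(int((x - lo) / h), 0), n_bins - 1)
    j = min(max(int((y - lo) / h), 0), n_bins - 1)
    counts[i][j] += 1

density = [[c / (N * h * h) for c in row] for row in counts]
print(density[5][5])  # ≈ 0.1-0.16, near the bivariate normal peak 1/(2π)
```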