SLIDE 1

1/4/2007 1

219323 Probability and Statistics for Software and Knowledge Engineers

Lecture 7: Descriptive Statistics

Monchai Sopitkamon, Ph.D.

Outline

Experimentation (6.1)

Data Presentation (6.2)

Sample Statistics (6.3)

Examples (6.4)

SLIDE 2

Experimentation I (6.1)

The relationship between probability theory and statistical inference

Experimentation: Samples I (6.1.1)

Populations and Samples

– Population: all possible observations available from a particular probability distribution

– Sample: a particular subset of the population that an experimenter measures and uses to investigate the unknown probability distribution

– Random sample: a sample whose elements are chosen at random from the population, to help ensure that the sample is representative of the population
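A minimal sketch of drawing a random sample, assuming a finite population stored as a list. The population of 1000 item IDs and the helper name `draw_random_sample` are illustrative assumptions, not part of the lecture; a population defined by a probability distribution would be sampled by generating values from that distribution instead.

```python
import random

# Hypothetical finite population of 1000 item IDs (illustrative, not lecture data).
population = list(range(1000))

def draw_random_sample(items, n, seed=None):
    """Choose n distinct elements uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(items, n)

sample = draw_random_sample(population, 10, seed=1)
print(sample)
```

Every element has the same chance of being selected, which is what makes the sample random in the sense defined above.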

SLIDE 3

Experimentation: Samples II (6.1.1)

Data types:

– Data observations (x1, …, xn) can be divided into two major types: categorical (or nominal) data and numerical data

Experimentation: Examples I (6.1.2)

Categorical observations

Data set of machine breakdowns

SLIDE 4

Experimentation: Examples II (6.1.2)

Numerical observations of the number of defective computer chips in each of 80 randomly sampled boxes

Data summarization

Outline

Experimentation (6.1)

Data Presentation (6.2)

Sample Statistics (6.3)

Examples (6.4)

SLIDE 5

Data Presentation (6.2)

After data are gathered, how do we represent them in a more informative way than just tables of numbers?

By using graphs or charts

Data Presentation: Bar Charts and Pareto Charts I (6.2.1)

Bar charts are generally suitable for illustrating categorical data sets

Bar chart of machine breakdowns data set

SLIDE 6

Data Presentation: Bar Charts and Pareto Charts II (6.2.1)

Pareto charts are bar charts used in quality control where categories are sorted in order of decreasing frequency

Data set and Pareto chart of customer complaints for Internet company

Excel spreadsheet
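The ordering behind a Pareto chart can be sketched as follows. The complaint categories are made up for illustration and are not the lecture's data set; `pareto_counts` is a hypothetical helper name.

```python
from collections import Counter

def pareto_counts(observations):
    """Return (category, count) pairs sorted by decreasing count."""
    counts = Counter(observations)
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up categorical observations (illustrative only).
complaints = ["billing", "speed", "speed", "outage", "speed", "billing"]

for category, count in pareto_counts(complaints):
    print(f"{category:8s} {'#' * count}")
```

The printed bars appear from the most frequent category to the least frequent, which is the only difference from an ordinary bar chart of the same counts.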

Data Presentation: Histograms I (6.2.3)

Look similar to bar charts, but are used to present numerical data instead of categorical data

Data set and histogram of computer chips data set
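A small sketch of how histogram frequencies for count data could be tabulated, assuming integer observations such as the number of defective chips per box. The values below are made up for illustration, and `histogram_counts` is a hypothetical helper name.

```python
from collections import Counter

def histogram_counts(values):
    """Return (value, frequency) pairs for every integer from min to max."""
    counts = Counter(values)
    # Values with zero frequency are kept so gaps in the data stay visible.
    return [(v, counts[v]) for v in range(min(values), max(values) + 1)]

# Made-up defect counts per box (illustrative only).
defects = [0, 1, 1, 2, 2, 2, 3, 5]

for value, freq in histogram_counts(defects):
    print(f"{value:2d} | {'#' * freq}")
```

Unlike a bar chart of categories, the horizontal axis here has a numerical order, so the empty bar at 4 carries information about the shape of the data.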

SLIDE 7

Data Presentation: Histograms II (6.2.3)

A histogram with positive skewness

A histogram with negative skewness

Data Presentation: Histograms III (6.2.3)

A histogram for a bimodal distribution

SLIDE 8

Data Presentation: Outliers I (6.2.4)

Data points that appear to be separate from the rest of the data set

Usually should be removed from the data set before applying statistical inference techniques

Outliers are often misrecorded data observations, which can be corrected

Important issue: whether the outlier represents true variation or whether it is caused by an outside influence

Data Presentation: Outliers II (6.2.4)

Histogram of a data set with a possible outlier

SLIDE 9

Outline

Experimentation (6.1)

Data Presentation (6.2)

Sample Statistics (6.3)

Examples (6.4)

Sample Statistics (6.3)

Data set ↔ Probability distribution

Sample mean ↔ Expectation

Sample median ↔ Median

Sample SD ↔ SD

SLIDE 10

Sample Statistics: Sample Mean I (6.3.1)

Arithmetic average of the data observations

If a data set consists of n observations x1, …, xn, then the sample mean is

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
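As a minimal sketch, the sample mean is the sum of the observations divided by their number. The helper name `sample_mean` and the data values are illustrative assumptions.

```python
def sample_mean(xs):
    """Arithmetic average of the observations."""
    if not xs:
        raise ValueError("sample mean needs at least one observation")
    return sum(xs) / len(xs)

# Made-up observations (illustrative only).
print(sample_mean([1, 2, 3, 4]))  # 2.5
```

Python's standard library offers the same calculation as `statistics.mean`.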

Sample Statistics: Sample Mean II (6.3.1)

Illustrative data set

SLIDE 11

Sample Statistics: Sample Median (6.3.2)

The value in the “middle” of the sorted data points

For an odd number n of observations, the sample median is equal to the sorted data point $x_{\lceil n/2 \rceil}$

For an even number n of observations, the sample median is equal to the average of the two middle values.

A symmetric sample has a sample mean approximately equal to its sample median
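The odd and even cases above can be sketched as follows. The helper name `sample_median` and the data are illustrative assumptions; indices are shifted by one because Python lists are 0-based.

```python
def sample_median(xs):
    """Middle value of the sorted data, or the average of the two middle values."""
    s = sorted(xs)
    n = len(s)
    if n == 0:
        raise ValueError("sample median needs at least one observation")
    mid = n // 2
    if n % 2 == 1:
        return s[mid]  # the ceil(n/2)-th sorted point in 1-based counting
    return (s[mid - 1] + s[mid]) / 2  # average of the two middle values

# Made-up observations (illustrative only).
print(sample_median([3, 1, 2]))     # 2
print(sample_median([4, 1, 3, 2]))  # 2.5
```

The standard library function `statistics.median` follows the same rule.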

Sample Statistics: Sample Trimmed Mean I (6.3.3)

A trimmed mean is obtained by deleting some of the largest and smallest data observations, and by taking the mean of the remaining observations.

For example, a 10% trimmed mean of sorted observations x1, …, x50 is equal to

$$\frac{\sum_{i=6}^{45} x_i}{40}$$
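A minimal sketch of the trimmed mean, assuming the percentage is removed from each end of the sorted data, as in the 50-observation example where 5 points are dropped from each side. The helper name `trimmed_mean` and the data are illustrative assumptions.

```python
def trimmed_mean(xs, percent):
    """Drop percent% of the observations from each end, then average the rest."""
    s = sorted(xs)
    k = len(s) * percent // 100  # number of observations removed from each end
    if 2 * k >= len(s):
        raise ValueError("trimming removes every observation")
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

# Made-up data: the integers 1..50, so the kept values are 6..45.
data = list(range(1, 51))
print(trimmed_mean(data, 10))  # 25.5
```

Replacing the largest value with an extreme one, for example 1000, leaves this result unchanged, which illustrates why the trimmed mean is less sensitive to extreme observations than the sample mean.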

SLIDE 12

Sample Statistics: Sample Trimmed Mean II (6.3.3)

Relationship between the sample mean, median, and trimmed mean for positively and negatively skewed data sets
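The typical ordering can be checked numerically. The sketch below uses made-up data sets, one with a long right tail and its mirror image with a long left tail; the helper names are illustrative assumptions, and the ordering shown is the usual pattern rather than a guarantee for every data set.

```python
import statistics

def trimmed_mean(xs, percent):
    """Drop percent% of the observations from each end, then average the rest."""
    s = sorted(xs)
    k = len(s) * percent // 100
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

def location_summary(xs, percent=10):
    """Return (sample mean, trimmed mean, sample median)."""
    return statistics.mean(xs), trimmed_mean(xs, percent), statistics.median(xs)

# Made-up data (illustrative only).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]  # long right tail
left_skewed = [21 - x for x in right_skewed]    # mirror image, long left tail

print(location_summary(right_skewed))  # mean > trimmed mean > median
print(location_summary(left_skewed))   # mean < trimmed mean < median
```

In both cases the mean is pulled furthest toward the long tail, with the trimmed mean lying between the mean and the median.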

Sample Statistics: Sample Variance (6.3.5)

Sample variance of a set of data observations x1, …, xn is defined as

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Alternate formulas for the sample variance $s^2$ are

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$$
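A short numerical check that the definition and the alternate formula agree. The function names and the data are illustrative assumptions.

```python
import math

def sample_variance(xs):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Alternate form: (sum of squares - (sum)^2 / n) / (n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Made-up observations (illustrative only): mean 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data), sample_variance_shortcut(data))
print(math.sqrt(sample_variance(data)))  # sample standard deviation s
```

The shortcut form needs only running totals of the values and their squares, but it subtracts two large numbers and can lose precision in floating point when the mean is large relative to the spread.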

SLIDE 13

Sample Statistics: Sample Quantiles (6.3.6)

The pth sample quantile is a value that has a proportion p of the sample taking values smaller than it and a proportion 1 – p taking values larger than it.

Sample median = 50th percentile of the sample

Upper and lower sample quartiles = 75th percentile and 25th percentile of the sample

Sample interquartile range = 75th percentile – 25th percentile of the sample

Excel spreadsheet
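Sample quartiles can be sketched with Python's standard library (version 3.8 or later). Several interpolation conventions exist, so the values from `statistics.quantiles`, used here with its default method, may differ slightly from those of a spreadsheet or a textbook; the data are made up for illustration.

```python
import statistics

# Made-up observations (illustrative only): the integers 1..11.
data = list(range(1, 12))

# n=4 asks for the three cut points that split the sample into quarters.
lower_q, median, upper_q = statistics.quantiles(data, n=4)
iqr = upper_q - lower_q  # sample interquartile range

print(lower_q, median, upper_q, iqr)  # 3.0 6.0 9.0 6.0
```

The middle cut point is the 50th percentile, so it coincides with the sample median.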

Sample Statistics: Boxplots I (6.3.7)

Schematic presentation of the sample median, the upper and lower sample quartiles, and the largest and smallest data observations.

Half of observations
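The five values that a boxplot displays can be collected as follows. This is a sketch assuming Python 3.8 or later and the default quartile convention of `statistics.quantiles`, which may differ from the convention used for the lecture's figures; `five_number_summary` is a hypothetical helper name and the data are made up.

```python
import statistics

def five_number_summary(xs):
    """Smallest value, lower quartile, median, upper quartile, largest value."""
    q1, median, q3 = statistics.quantiles(xs, n=4)
    return {"min": min(xs), "q1": q1, "median": median, "q3": q3, "max": max(xs)}

# Made-up observations (illustrative only).
data = list(range(1, 12))
print(five_number_summary(data))
```

The box of the plot spans `q1` to `q3`, the line inside it marks `median`, and the whiskers extend to `min` and `max`.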

SLIDE 14

Sample Statistics: Boxplots II (6.3.7)

Sample mean = 3.725

Sample median = 4.25

Lower sample quartile = 2.65

Upper sample quartile = 4.675

Boxplot for data set in Figure 6.22

Sample Statistics: Coefficient of Variation I (6.3.8)

Measures the spread of the data relative to the middle value

$$CV = \frac{s}{\bar{x}} = \frac{\text{sample standard deviation}}{\text{sample mean}}$$

Large values of CV imply that the variability is large relative to the sample average.

Small values indicate that the variability is small relative to the sample average.

SLIDE 15

Sample Statistics: Coefficient of Variation II (6.3.8)

Ex.42 pg.283: African elephants have average sample weight $\bar{x}_e$ = 4550 kg and a sample SD of $s_e$ = 150 kg, while mice have average sample weight $\bar{x}_m$ = 30 g and a sample SD of $s_m$ = 1.67 g.

$$CV_e = \frac{s_e}{\bar{x}_e} = \frac{150}{4550} = 0.033, \qquad CV_m = \frac{s_m}{\bar{x}_m} = \frac{1.67}{30} = 0.056$$

∴ Mice have more variability in their weights than elephants, relative to their respective average weights
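The arithmetic of this example can be reproduced with a few lines. The helper name `coefficient_of_variation` is an illustrative assumption; the four input numbers are the ones quoted in the example.

```python
def coefficient_of_variation(sd, mean):
    """Sample standard deviation divided by sample mean (unitless)."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return sd / mean

cv_elephants = coefficient_of_variation(150, 4550)  # both in kg
cv_mice = coefficient_of_variation(1.67, 30)        # both in g

print(round(cv_elephants, 3), round(cv_mice, 3))  # 0.033 0.056
```

Because the units cancel within each ratio, the two values can be compared directly even though one data set is in kilograms and the other in grams.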

Outline

Experimentation (6.1)

Data Presentation (6.2)

Sample Statistics (6.3)

Examples (6.4)

SLIDE 16

Examples (6.4)

Ex.44 pg.286:

Excel spreadsheet

Outliers