Humanoid Robotics: Statistical Testing (Maren Bennewitz)



SLIDE 1

Humanoid Robotics Statistical Testing

Maren Bennewitz

SLIDE 2

Motivation

  • Publishing scientific work usually requires comparing the performance of algorithms
  • Typical situation:
  • Existing technique A
  • You developed a new technique B
  • Key question: Can you confidently claim that B is better than A?
  • Run experiments with both algorithms and compare the outcome

SLIDE 3

Evaluating Experiments

  • Define a performance measure such as
  • Run time
  • Error
  • Robustness (e.g., success rate)
  • Design a set of experiments or collect benchmark datasets d
  • Run both techniques on d
  • How to compare the obtained results A(d) and B(d)?

SLIDE 4

Example

Scenario

  • A, B are two path planning techniques
  • Performance measure: planning time
  • Data d is a given map, start and goal pose

Example

  • A(d) = 0.5 s
  • B(d) = 0.6 s

What does that mean?

SLIDE 5

Example: More Data

Same scenario, but four planning instances

Example

  • A(d) = 0.5 s, 0.4 s, 0.6 s, 0.4 s
  • B(d) = 0.4 s, 0.3 s, 0.6 s, 0.5 s

What does that mean?

SLIDE 6

Example: More Data

Same scenario, but four planning instances

Example

  • A(d) = 0.5 s, 0.4 s, 0.6 s, 0.4 s
  • B(d) = 0.4 s, 0.3 s, 0.6 s, 0.5 s

Average of the planning time

  • ȳ_A = 1.9 s / 4 = 0.475 s
  • ȳ_B = 1.8 s / 4 = 0.45 s

Is B really better than A?

SLIDE 7

Is B better than A?

  • ȳ_A = 0.475 s, ȳ_B = 0.45 s
  • ȳ_A > ȳ_B, so B is better than A?
  • We only performed four tests, thus ȳ_A and ȳ_B are only rough estimates
  • We saw too little data to make statements with high confidence
  • How many samples do we need to be confident that B is better than A?

SLIDE 8

Population and Samples

  • The data we observe is often only a small fraction of the possible outcomes
  • Population = set of potential measurements, values, or outcomes
  • Sample = the data we observe
  • Sampling distribution = distribution of possible samples given a fixed sample size

SLIDE 9

Sampling Distribution

  • Distribution of a statistic calculated from all possible samples of a given size, drawn from a given population
  • Example: Toss a fair coin twice and count the heads
  • 0 heads: 0.25, 1 head: 0.5, 2 heads: 0.25
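The coin-toss distribution above can be reproduced by exhaustively enumerating all four equally likely outcomes. A minimal Python sketch (the slides contain no code, so this snippet is purely illustrative):

```python
from itertools import product

# Enumerate all equally likely outcomes of two fair coin tosses
# (0 = tails, 1 = heads) and tabulate the number of heads.
outcomes = list(product([0, 1], repeat=2))
counts = {}
for outcome in outcomes:
    heads = sum(outcome)
    counts[heads] = counts.get(heads, 0) + 1

# Sampling distribution of the statistic "number of heads"
dist = {heads: n / len(outcomes) for heads, n in counts.items()}
print(dist)  # {0: 0.25, 1: 0.5, 2: 0.25}
```

For two tosses the enumeration is trivial; for larger sample sizes the same idea quickly becomes infeasible, which is exactly why the next slide calls sampling distributions "rather theoretical entities".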

SLIDE 10

Sampling Distributions

  • Rather theoretical entities
  • The set of all possible samples is typically very large or infinite
  • Closed-form solutions exist only in very few cases
  • However, one can compute an empirical sampling distribution based on a set of samples

SLIDE 11

Experiment: Error of the Mean

  • We estimate the population mean µ by averaging N samples. How big will the expected error be?

[Figure: samples drawn around the population mean µ]
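This experiment can be run as a small Monte Carlo simulation. The sketch below (illustrative Python, not from the slides) estimates the expected absolute error of the sample mean for growing N, using a Uniform(0, 1) population whose true mean 0.5 is known:

```python
import random

random.seed(0)

def mean_abs_error(mu, n, trials=2000):
    """Monte Carlo estimate of E[|sample mean - mu|] for sample size n."""
    total = 0.0
    for _ in range(trials):
        sample = [random.uniform(0.0, 1.0) for _ in range(n)]
        total += abs(sum(sample) / n - mu)
    return total / trials

# Uniform(0, 1) population with true mean mu = 0.5: the expected
# error of the mean shrinks roughly like 1/sqrt(N)
for n in [4, 16, 64, 256]:
    print(n, round(mean_abs_error(0.5, n), 4))
```

Quadrupling N roughly halves the error, anticipating the σ/√N result of the central limit theorem on the following slides.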

SLIDE 20

Central Limit Theorem

  • The distribution of the average of N samples approaches a normal distribution as N goes to infinity
  • If the samples are drawn from a population with mean µ and standard deviation σ, then the mean of the sampling distribution is µ, with standard deviation σ/√N
  • These statements hold irrespective of the shape of the population distribution from which the samples are drawn

SLIDE 21

Standard Error of the Mean

  • The standard deviation of the sampling distribution of the mean is often called standard error (SE)
  • Central limit theorem: SE = σ/√N
  • The standard error represents the uncertainty about the mean
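The SE = σ/√N relation can be checked empirically. Assuming a Uniform(0, 1) population (which has σ = 1/√12), this illustrative Python snippet compares the standard deviation of many simulated sample means against the formula:

```python
import random
import statistics

random.seed(1)

# Uniform(0, 1) population: sigma = sqrt(1/12)
sigma = (1.0 / 12.0) ** 0.5
N = 25

# Empirical sampling distribution: std of 5000 sample means of size N
means = [statistics.fmean(random.random() for _ in range(N))
         for _ in range(5000)]
empirical_se = statistics.stdev(means)
predicted_se = sigma / N ** 0.5  # central limit theorem: sigma / sqrt(N)

print(round(empirical_se, 4), round(predicted_se, 4))  # both ≈ 0.058
```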

SLIDE 22

Standard Error of the Mean

  • Central limit theorem for N going to infinity: SE = σ/√N
  • Rearranging gives: N = (σ/SE)², i.e., the sample size required for a target standard error
SLIDE 23

The Normal Distribution

SLIDE 24

Confidence Intervals

  • For a normal distribution with known µ and σ, 95% of the samples fall within µ ± 1.96σ
  • Thus, we can state that the interval ȳ ± 1.96·SE contains the mean (for large N) with 95% probability
  • Correct statement: “I am 95% sure that the interval around ȳ contains the mean”
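A hedged sketch of such an interval in Python (illustrative names; the Z-based 1.96 quantile is only valid for large N, so applying it to the four planning times from the earlier example is purely for demonstration):

```python
import math
import statistics

def confidence_interval_95(sample):
    """Z-based 95% CI for the mean: mean +/- 1.96 * s / sqrt(N).
    Only appropriate for large N; for small N use a t quantile instead."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # s / sqrt(N)
    return mean - 1.96 * se, mean + 1.96 * se

# Planning times of technique A from the earlier example (N = 4 is
# really too small for a Z-based interval -- illustration only)
low, high = confidence_interval_95([0.5, 0.4, 0.6, 0.4])
print(round(low, 3), round(high, 3))  # ≈ 0.381, 0.569
```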

SLIDE 25

Hypothesis Testing

  • Question: Is technique B better than A?
  • Scenario:
  • Assume we know the mean and standard deviation of the performance of A
  • We collect N sample outcomes of B from experiments
  • Are the distributions of A and B equal or different?

SLIDE 26

Motivational Example

  • From which distribution have these samples been drawn?

[Figure captions: “All of these populations can explain the samples” / “The samples were probably not drawn from this population”]

SLIDE 27

Hypothesis Testing

  • It is impossible to confirm that a finite set of samples was drawn from a particular distribution
  • But we can confidently rule out some very unlikely distributions
  • We can show that B is better than A by showing that the opposite is very unlikely

SLIDE 28

Hypothesis Testing

  • “Answer a yes-no question about a population and assess the probability that the answer is wrong.” [Cohen, 1995]
  • Example:
  • Assume we know the mean and standard deviation of the performance of A
  • We have N outcomes of B
  • To test that B is different from A, assume they are truly equal
  • Then, assess the probability that they are equal given the data
  • If the probability is small, reject the hypothesis
SLIDE 29

The Null Hypothesis H0

  • The null hypothesis is the hypothesis that one wants to reject given the data (= result of the experiments)
  • A statistical test can never prove H0
  • A statistical test can only reject or fail to reject H0
  • Example: To show that method B is different from A, use H0: A = B

SLIDE 30

Possible Null and Alternative Hypotheses

SLIDE 31

The Normal Distribution

SLIDE 32

Z Score

  • The Z score indicates how many standard deviations the value x is above or below the mean: Z = (x − µ)/σ
  • The Z table provides the probability for this event
  • Z < 3: p = 99.9%
  • Z < 0: p = 50%
  • Z < −1: p = 15.9%
  • −2 < Z < 2: p = 95%
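The table values quoted above can be verified numerically: the standard normal CDF is expressible through the error function. An illustrative Python check:

```python
import math

def phi(z):
    """Standard normal CDF, P(Z < z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The probabilities quoted on the slide
print(round(phi(3.0), 4))               # ≈ 0.9987 (slide rounds to 99.9%)
print(round(phi(0.0), 4))               # 0.5
print(round(phi(-1.0), 4))              # ≈ 0.1587 (15.9%)
print(round(phi(2.0) - phi(-2.0), 4))   # ≈ 0.9545 (about 95%)
```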
SLIDE 33

One Sample Z-Test

  • Test if a sample has a significantly different mean than a given known population
  • Given µ and σ of a population
  • Sample of size N with mean ȳ
  • Compute the Z-score: Z = (ȳ − µ) / (σ/√N)
  • Look up the Z-score in the Z-table to obtain the probability that the sample follows the known population distribution

SLIDE 34

Z-Test Example (1)

  • Scores of all German students in a test
  • In Germany: µ = 100, σ = 12
  • A sample of 55 students in Bonn obtained an average score of 96
  • H1: Students from Bonn are worse than the average German students
  • H0: Students from Bonn are at least as good as the average German students
  • Z = (96 − 100) / (12/√55) ≈ −2.47
SLIDE 35

Z-Test Example (2)

  • Z-table: the probability of observing a value smaller than −2.47 is 0.68%
  • Reject H0
  • H1 is true with high probability
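The whole Bonn example fits in a few lines of illustrative Python (function names are assumptions, not from the slides); it reproduces Z ≈ −2.47 and a one-tailed p-value below 1%:

```python
import math

def phi(z):
    """Standard normal CDF, P(Z < z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def one_sample_z_test(sample_mean, mu, sigma, n):
    """Z-score and one-tailed p-value P(Z < z), for H1: mean below mu."""
    z = (sample_mean - mu) / (sigma / math.sqrt(n))
    return z, phi(z)

# Bonn students: sample mean 96, population mu = 100, sigma = 12, N = 55
z, p = one_sample_z_test(96.0, mu=100.0, sigma=12.0, n=55)
print(round(z, 2), round(p, 4))  # Z ≈ -2.47, p ≈ 0.7% -> reject H0 at 5%
```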
SLIDE 36

Z-Test: Assumptions

  • Independently generated samples
  • Mean and variance of the population distribution are known
  • Sampling distribution is approximately normal
  • The sample set is sufficiently large (N > ~30)

Comments

  • Often, σ can be approximated using the variance in the sample set
  • In practice, the size of the sample set is often too small for the Z-test

SLIDE 37

When N is Small: t-Test

  • Variant of the Z-test for N < ~30
  • Instead of the normal distribution, use the t-distribution
  • t-distribution: distribution of the mean for small N under the assumption that the population is normally distributed
  • The t-distribution is similar to a normal distribution but has heavier tails

SLIDE 38

t-Distribution

  • The t-distribution depends on N
  • For large N, it approaches a normal distribution

[Figure: t-distributions for several degrees of freedom; source: Wikipedia]

SLIDE 39

One Sample t-Test

  • The t-value is similar to the Z-value
  • It defines the allowed distance to the mean and is used to reject H0
  • t = (ȳ − µ) / (s/√N), where s is the standard deviation estimated from the sample and N is the sample size
  • To be compared to the values in the t-table
  • The t-table depends on the degrees of freedom (DoF), which are closely related to the sample size (here: DoF = N − 1)

SLIDE 40

t-Table 1/2

[Table: critical t-values by degrees of freedom and confidence level]

SLIDE 41

t-Table 2/2

https://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values

SLIDE 42

One Sample t-Test: Example (1)

  • The average price of a car in the city is $12k
  • Five cars park in front of a house, with an average price of $20,270 and standard deviation of $5,811
  • H1: The cars are more expensive than in the rest of the city
  • H0: The cars are no more expensive than in the rest of the city
  • t = (20,270 − 12,000) / (5,811/√5) ≈ 3.18
  • DoF = 4 (for the one sample t-test: sample size − 1)
  • Set confidence level to 95% (5% error probability)

SLIDE 43

t-Table 1/2

[Table: critical t-values by degrees of freedom and confidence level]

SLIDE 44

One Sample t-Test: Example (2)

  • Critical value from the t-table for DoF = 4 at 95% confidence: 2.132
  • Since t = 3.18 > 2.132 (see t-table), reject H0
  • H1 is probably true, i.e., the cars are more expensive (with 5% error probability)
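The car example as an illustrative Python sketch (the helper name is an assumption); the critical value 2.132 comes from the t-table for DoF = 4 at 95% confidence, one-tailed:

```python
import math

def one_sample_t(sample_mean, s, n, mu):
    """t statistic: (sample mean - mu) / (s / sqrt(n)); DoF = n - 1."""
    return (sample_mean - mu) / (s / math.sqrt(n))

# Cars: mean $20,270, sample std s = $5,811, n = 5; city mean mu = $12,000
t = one_sample_t(20270.0, s=5811.0, n=5, mu=12000.0)
t_crit = 2.132  # from the t-table: DoF = 4, 95% confidence, one-tailed
print(round(t, 2), t > t_crit)  # t ≈ 3.18 > 2.132 -> reject H0
```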

SLIDE 45

One Sample t-Test: Assumptions

  • Independently generated samples
  • The population distribution is Gaussian (otherwise the t-distribution is not the correct sampling distribution since N is small)
  • The mean is known

Comments

  • The t-test is quite robust under non-Gaussian distributions
  • Often a 95% or 99% confidence (= 5% or 1% significance) level is used

SLIDE 46

Two Sample t-Test (See Exercise Sheet 9)

  • Compare the means of two samples to see if both are drawn from populations with equal means
  • Example: Compare the outcome of two pose estimation methods
  • Compute:
  • Pooled, estimated variance of the differences of the means
  • Pooled, estimated SE of the sampling distribution of the difference of means
  • t-value, compare to values in t-table
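A sketch of the pooled two-sample t statistic in Python (illustrative; assumes both populations have equal variance, as the pooled variant requires), applied to the planning-time samples from the earlier example:

```python
import math
import statistics

def two_sample_t(sample_a, sample_b):
    """Pooled two-sample t statistic and degrees of freedom.
    Assumes both populations have equal variance."""
    na, nb = len(sample_a), len(sample_b)
    ma, mb = statistics.fmean(sample_a), statistics.fmean(sample_b)
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    # Pooled variance of the two samples
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    # Standard error of the difference of the means
    se = math.sqrt(sp2 * (1.0 / na + 1.0 / nb))
    return (ma - mb) / se, na + nb - 2

# Planning times of A and B from the earlier example
t, dof = two_sample_t([0.5, 0.4, 0.6, 0.4], [0.4, 0.3, 0.6, 0.5])
print(round(t, 3), dof)  # t ≈ 0.311, DoF = 6
```

With t ≈ 0.31 at DoF = 6, the statistic is far below any common critical value, so H0 cannot be rejected: four planning instances are simply too few to show a significant difference between A and B.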
SLIDE 47

What Happens for Large N?

  • The larger the sample size, the easier it is to show differences…
  • …but for large sample sizes, any difference becomes statistically significant, no matter how small it is
  • A statistically significant difference does not tell anything about whether the difference is meaningful!

SLIDE 48

Conclusion

  • To support the claim that algorithm B is better than algorithm A, use a statistical test
  • Formulate a hypothesis H1 and try to reject the null hypothesis H0, or come to the conclusion that H0 cannot be rejected
  • In the first case, H1 is true with high probability; in the second case, no conclusion can be drawn
  • The t-test is a frequently used test in science

SLIDE 49

Literature

  • Empirical Methods for AI, Ch. 4, P.R. Cohen, 1995
  • Wikipedia
SLIDE 50

Acknowledgement

  • A previous version of these slides was created by C. Stachniss