Statistics and Hypothesis Testing



SLIDE 1

Statistics and Hypothesis Testing

NENS 230: Data Analysis for the Biosciences using MATLAB
Eddy Albarran
November 3, 2015

SLIDE 2

Analysis Methodology

Data → Exploratory Data Analysis → (generate hypotheses) → Hypothesis Testing → Reject null / Fail to reject null

Exploratory Data Analysis:

  • Summary Statistics
  • Dimensionality Reduction / PCA
  • Visualization
  • Histogram
  • Scatterplots
  • Boxplots
  • etc.

Hypothesis Testing:

  • T-Test
  • Z-test
  • Chi-Square
  • etc.

SLIDE 3

Outline

Summary statistics functions

Random Variables

– Random variables, PDF, CDFs
– Estimates of central tendency and dispersion
– Standard error of the mean, confidence intervals

Statistical Hypothesis Testing

– Tests and significance
– Student's t test walkthrough
– Other commonly used tests

Analysis of Variance

Homework

SLIDE 4

Summary Statistics

Commonly used functions:

– mean()
– std()
– var()
– sum()
– min()
– max()

SLIDE 5

mean() function

mean() computes the average (sample mean) of a vector. With matrices, you need to specify which dimension to average along.

mean(X, 1) returns the average row: it averages across the rows, giving one mean per column. This is the default if you only specify one argument.

mean(X, 2) returns the average column: it averages across the columns, giving one mean per row.

SLIDE 6

mean() function

mean() computes the average (sample mean) of a vector. When dealing with matrices, you need to specify which dimension to average along.

X =
    26    15     1    2.4
     0    15     1    0

(Dim 1 runs down the rows; Dim 2 runs across the columns.)

mean(X, 1), which is the same as mean(X), evaluates to

    13    15     1    1.2

mean(X, 2) evaluates to

    11.1
     4
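The dimension argument can be checked directly. This is a minimal sketch, assuming the slide's matrix is X = [26 15 1 2.4; 0 15 1 0]; the two zeros are inferred from the results shown on the slide, and the variable names are illustrative only.

```matlab
% Assumed example matrix: 2 rows (dim 1) by 4 columns (dim 2).
X = [26 15 1 2.4;
      0 15 1 0];

colMeans     = mean(X, 1);  % average down the rows: one value per column
rowMeans     = mean(X, 2);  % average across the columns: one value per row
defaultMeans = mean(X);     % for a matrix this behaves like mean(X, 1)

disp(colMeans);
disp(rowMeans);
```

Here colMeans is a 1-by-4 row vector and rowMeans is a 2-by-1 column vector, which is a quick way to see which dimension was collapsed.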

SLIDE 7

mean() function

mean() operates on its first argument. When averaging two things together, be careful to pack them into a vector using [ ].

  • mean(1, 5) evaluates to 1: "take the mean of [1] along the 5th dimension"
  • mean([1 5]) evaluates to 3
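A short sketch of this pitfall, with illustrative variable names:

```matlab
% The second argument of mean() is a dimension, not a second value.
a = mean(1, 5);     % mean of the scalar 1 along dimension 5
b = mean([1 5]);    % mean of the two values packed into one vector

fprintf('mean(1, 5) = %g, mean([1 5]) = %g\n', a, b);
```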
SLIDE 8

std() function

std() computes the standard deviation of a list of numbers.

– When dealing with matrices, you need to specify which dimension to operate along, as the third argument.
– The second argument should be 0 if you want the unbiased estimator that normalizes by n-1, where n is the number of samples. (This is the default.)

X =
    26    15     1    2.4
     0    15     1    0

(Dim 1 runs down the rows; Dim 2 runs across the columns.)

std(X, 0, 1), which is the same as std(X), evaluates to

    18.3848    0    0    1.6971

std(X, 0, 2) evaluates to

    11.7604
     7.3485

SLIDE 9

var() function

var() computes the sample variance of a list of numbers.

– When dealing with matrices, you need to specify which dimension to operate along, as the third argument.
– The second argument should be 0 if you want the unbiased estimator that normalizes by n-1, where n is the number of samples. (This is the default.)

X =
    26    15     1    2.4
     0    15     1    0

(Dim 1 runs down the rows; Dim 2 runs across the columns.)

var(X, 0, 1), which is the same as var(X), evaluates to

    338    0    0    2.88

var(X, 0, 2) evaluates to

    138.31
     54
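The std() and var() calls can be sketched together. The matrix below is an assumption: X = [26 15 1 2.4; 0 15 1 0], with the zeros inferred from the results on the slides. The last line shows what the second argument changes.

```matlab
% Assumed example matrix (zeros inferred from the slide results).
X = [26 15 1 2.4;
      0 15 1 0];

sCols = std(X, 0, 1);   % per-column standard deviation, normalized by n-1
sRows = std(X, 0, 2);   % per-row standard deviation, normalized by n-1
vCols = var(X, 0, 1);   % per-column variance, normalized by n-1
vRows = var(X, 0, 2);   % per-row variance, normalized by n-1

sColsN = std(X, 1, 1);  % second argument 1: normalize by n instead of n-1

disp(sCols); disp(sRows); disp(vCols); disp(vRows); disp(sColsN);
```

With only two rows, the choice between n-1 and n matters a lot: the first column gives about 18.38 with n-1 but 13 with n.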

SLIDE 10

sum() function

sum() computes the sum of a vector. When dealing with matrices, you should specify which dimension to sum along.

sum(X, 1) returns the sum over rows (sum over rows within each column). This is the default if you only specify one argument.

sum(X, 2) returns the sum over columns (sum over columns within each row).

SLIDE 11

min() function

min() computes the minimum of a vector. When dealing with matrices, you should specify which dimension to find the minimum along.

min(X, Y) returns an array the same size as X and Y consisting of the smaller of the elements in X and Y at each location.

min(X, [], 1) returns the minimum value in each column. This is the default if you only specify one argument.

min(X, [], 2) returns the minimum in each row.

SLIDE 12

max() function

max() computes the maximum of a vector. When dealing with matrices, you should specify which dimension to find the maximum along.

max(X, Y) returns an array the same size as X and Y consisting of the larger of the elements in X and Y at each location.

max(X, [], 1) returns the maximum value in each column. This is the default if you only specify one argument.

max(X, [], 2) returns the maximum in each row.
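A small sketch of sum(), min(), and max() along each dimension. The matrices X and Y are hypothetical values chosen for illustration.

```matlab
% Hypothetical 2-by-4 matrices for illustration.
X = [26 15 1 2.4;
      0 15 1 0];
Y = [10 20 0 5;
      1  1 1 1];

colSums = sum(X, 1);      % sum over rows within each column
rowSums = sum(X, 2);      % sum over columns within each row
colMin  = min(X, [], 1);  % minimum of each column (the [] is required)
rowMax  = max(X, [], 2);  % maximum of each row
elemMin = min(X, Y);      % element-wise smaller of X and Y

disp(colSums); disp(rowSums); disp(colMin); disp(rowMax); disp(elemMin);
```

Note the empty [] placeholder: min(X, 2) would compare every element of X against the number 2 rather than take the minimum along dimension 2.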

SLIDE 13

Outline (current section: Random Variables)

Summary statistics functions

Random Variables

– Random variables, PDF, CDFs
– Estimates of central tendency and dispersion
– Standard error of the mean, confidence intervals

Statistical Hypothesis Testing

– Tests and significance
– Student's t test walkthrough
– Other commonly used tests

Analysis of Variance

Homework

SLIDE 14

Discrete random variables

Suppose we have a random variable X.

Discrete random variables take one value within a set of k possible values.

Probability mass function: for a given value $x_i$, returns the probability $p_i$ of X taking that value.

$$\Pr[X = x_i] = p_i$$

  • The sum of these probabilities must be 1.

$$p_1 + p_2 + \cdots + p_k = 1$$
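As an illustration, a probability mass function can be stored as two vectors. This sketch assumes a fair six-sided die, which is a hypothetical example and not taken from the slides.

```matlab
% Hypothetical example: a fair six-sided die.
x = 1:6;               % possible values x_i
p = ones(1, 6) / 6;    % Pr[X = x_i] = p_i

totalProb = sum(p);                 % should be 1
pEven = sum(p(mod(x, 2) == 0));     % Pr[X is even] = p_2 + p_4 + p_6

fprintf('sum of p = %g, Pr[X even] = %g\n', totalProb, pEven);
```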

SLIDE 15

Probability Mass Function

SLIDE 16

Continuous random variables

Suppose we have a random variable X.

Continuous random variables take values within some continuous range of values.

Probability density function (PDF): integrating this function over some interval gives you the probability that X lies in that interval.

$$\Pr[a \le X \le b] = \int_a^b f(x)\,dx$$

Therefore, the integral of this function over the whole real line is 1.

$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$

SLIDE 17

Normal distribution

Normal or Gaussian distributions describe many naturally occurring phenomena, due to the central limit theorem.

Specified by two parameters:

– Location parameter: the mean (μ)
– Scale parameter: the standard deviation (σ)

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Source: wikipedia.org
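The normal density can be written as an anonymous function and integrated numerically. This is a sketch using only base MATLAB's integral(); the parameter values are hypothetical.

```matlab
% Hypothetical parameters for illustration.
mu = 0;
sigma = 1;

% Normal PDF written out from its formula.
f = @(x) exp(-(x - mu).^2 ./ (2 * sigma^2)) ./ (sigma * sqrt(2 * pi));

total   = integral(f, -Inf, Inf);               % area under the whole PDF
within1 = integral(f, mu - sigma, mu + sigma);  % Pr[mu - sigma <= X <= mu + sigma]

fprintf('total area = %.6f, within one sigma = %.4f\n', total, within1);
```

The second integral should come out near 0.683, the familiar share of a normal distribution within one standard deviation of the mean.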

SLIDE 18

PDF for normal distribution

SLIDE 19

Cumulative distribution function

Cumulative distribution function (CDF): the probability that X is less than or equal to a particular value.

$$\Pr[X \le x] = F(x)$$

  • The CDF is the integral of the PDF.

The PDF is the derivative of the CDF. Therefore, the parts of the CDF with the steepest slope are the highest points of the PDF, i.e. where most of the values lie.
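The relationship between the two functions can be checked numerically. This sketch assumes a normal distribution, whose CDF can be written with base MATLAB's erf(); the parameters and the evaluation point x0 are hypothetical.

```matlab
% Hypothetical parameters and evaluation point.
mu = 0;
sigma = 1;
x0 = 0.7;

pdfN = @(x) exp(-(x - mu).^2 ./ (2 * sigma^2)) ./ (sigma * sqrt(2 * pi));
cdfN = @(x) 0.5 * (1 + erf((x - mu) ./ (sigma * sqrt(2))));

% The CDF is the integral of the PDF up to x0.
area = integral(pdfN, -Inf, x0);

% The PDF is the derivative of the CDF (central finite difference).
h = 1e-5;
slope = (cdfN(x0 + h) - cdfN(x0 - h)) / (2 * h);

fprintf('F(x0) = %.6f, integral of PDF = %.6f\n', cdfN(x0), area);
fprintf('f(x0) = %.6f, slope of CDF   = %.6f\n', pdfN(x0), slope);
```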

SLIDE 20

CDF for normal distribution

SLIDE 21

Expected Value

The expected value of a random variable is its mean. You can calculate the expected value of a random variable X by taking the weighted average of all its possible values. The weights are the probabilities of X taking each value.

Discrete RV:

$$E[X] = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k$$

Continuous RV:

$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$$
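Both forms of the expected value can be sketched in a few lines. The fair die and the normal distribution with mean 2 are hypothetical examples.

```matlab
% Discrete case (hypothetical fair die): weighted average of the values.
x = 1:6;
p = ones(1, 6) / 6;
EX_discrete = sum(x .* p);

% Continuous case (hypothetical normal with mean 2, standard deviation 1):
% integrate x * f(x) over the real line.
mu = 2;
sigma = 1;
f = @(x) exp(-(x - mu).^2 ./ (2 * sigma^2)) ./ (sigma * sqrt(2 * pi));
EX_continuous = integral(@(x) x .* f(x), -Inf, Inf);

fprintf('E[X] discrete = %g, continuous = %.6f\n', EX_discrete, EX_continuous);
```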

SLIDE 22

Sample mean

Sampling: When we measure some quantity in an experiment, we think of it as taking samples from a distribution.

Sample mean: By taking the average, we are estimating the mean or expected value of the underlying distribution which generated these quantities.

A central problem in statistics: How close is this estimate of the mean (the average of our samples) to the true, underlying mean?

SLIDE 23

Standard Error of the Mean

Suppose we make N measurements of X, sampling from a normal distribution with mean μ and standard deviation σ.

If we take the average of these N samples, our estimate of the mean is itself normally distributed.

The mean of this sampling distribution is μ.

The standard error is σ / sqrt(N).

This means that on average, our estimate will be correct. The spread around the true mean shrinks as 1/sqrt(N).

SLIDE 24

Standard Error of the Mean

Suppose we make N measurements of X which may or may not be normally distributed.

If we take the average of these N samples, the distribution of our estimate of the mean approaches a normal distribution as N gets larger (central limit theorem).

The mean of this sampling distribution is μ.

The standard error is σ / sqrt(N).
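The σ / sqrt(N) rule can be checked with a simulation. This is a sketch under stated assumptions: the true mean, standard deviation, sample size, number of repeated experiments, and random seed are all hypothetical choices.

```matlab
rng(0);                 % arbitrary fixed seed so the run is reproducible

mu = 5;                 % hypothetical true mean
sigma = 2;              % hypothetical true standard deviation
N = 25;                 % measurements per experiment
nExperiments = 20000;   % number of repeated experiments

% Each column is one experiment of N normally distributed measurements.
samples = mu + sigma * randn(N, nExperiments);
sampleMeans = mean(samples, 1);

empiricalSE   = std(sampleMeans);   % spread of the estimates of the mean
theoreticalSE = sigma / sqrt(N);    % predicted standard error

fprintf('empirical SE = %.4f, sigma/sqrt(N) = %.4f\n', empiricalSE, theoreticalSE);
```

Replacing randn with a non-normal generator such as rand (and adjusting mu and sigma to match) is one way to see the central limit theorem version of the same statement.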

SLIDE 25

Confidence intervals

Based on the data you've collected, you can estimate the true value of some quantity, e.g. the true mean.

This estimate of the quantity isn't perfect.

Confidence intervals give a range of values that contains the true value with some stated level of confidence.

A 95% confidence interval is computed by a procedure that, over many repeated experiments, produces a range containing the true value of the quantity 95% of the time.
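One way to see what the 95% refers to is to simulate many experiments and count how often the interval covers the true mean. This sketch makes a simplifying assumption that σ is known, so the interval is the sample mean plus or minus 1.96 σ / sqrt(N); all parameter values and the seed are hypothetical.

```matlab
rng(1);                 % arbitrary fixed seed so the run is reproducible

mu = 10;                % hypothetical true mean
sigma = 3;              % hypothetical standard deviation, assumed known here
N = 16;                 % measurements per experiment
nExperiments = 20000;

samples = mu + sigma * randn(N, nExperiments);
xbar = mean(samples, 1);

% 1.96 is the 97.5th percentile of the standard normal distribution.
halfWidth = 1.96 * sigma / sqrt(N);

% For each experiment, does the interval contain the true mean?
covered = (xbar - halfWidth <= mu) & (mu <= xbar + halfWidth);
coverage = mean(covered);

fprintf('fraction of intervals containing the true mean: %.3f\n', coverage);
```

When σ has to be estimated from the data, the 1.96 is replaced by a critical value from the t distribution, which is what ttest() reports in its ci output.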

SLIDE 26

Outline (current section: Statistical Hypothesis Testing)

Summary statistics functions

Random Variables

– Random variables, PDF, CDFs
– Estimates of central tendency and dispersion
– Standard error of the mean, confidence intervals

Statistical Hypothesis Testing

– Tests and significance
– Student's t test walkthrough
– Other commonly used tests

Analysis of Variance

Dimensionality Reduction

– PCA

Final Project

SLIDE 27

Statistical hypothesis testing

The point of statistical tests is to cast doubt on the veracity of a null hypothesis.

The argument: if the null hypothesis were true, it would be very unlikely to observe the given data.

Statistical tests reject the null hypothesis if the unlikelihood of the data crosses a threshold.

This threshold, or significance level, is expressed as a probability and compared against the p-value: it is the likelihood of false rejections

– i.e. the likelihood that the null hypothesis would be rejected if it were true.

SLIDE 28

Statistical hypothesis testing

  1. State the null (H0) and alternative (H1) hypotheses
  2. State the assumptions: Independence of samples? Normality?
  3. Determine an appropriate test statistic
  4. Derive the distribution of the test statistic under the null hypothesis
  5. Determine the critical region for the test statistic
  6. Compute the observed value of the test statistic
  7. Reject or fail to reject the null hypothesis (H0)
  8. i.e.: Compute the smallest (most stringent) significance level at which the null hypothesis would be rejected (p-value)

SLIDE 29

Student's t-test example

Suppose we monitor scores on some behavioral assay before and after treatment. We take the difference of the scores for each subject. Each subject's change in scores is $x_i$.

  • 1) State the null (and alternative) hypotheses:

Null hypothesis: the $x_i$ are drawn from a normal distribution with zero mean.

Alternative hypothesis: the $x_i$ are drawn from a normal distribution with non-zero mean.

SLIDE 30

Student's t-test example

State the assumptions:

  • All samples are independent.
  • Changes in scores have a normal distribution. This follows from scores on the test before and after having normal distributions.

SLIDE 31

Test Statistics

Determine an appropriate test statistic.

A test statistic is a numerical summary of data that reduces the information needed to perform a hypothesis test to a single value (or a small number of values).

The important point is that we know what this quantity's distribution would look like under the null hypothesis. If the test statistic computed from the data is very unlikely to be drawn from that distribution, we can reject the null hypothesis.

SLIDE 32

Student's t-test example

Since the set of values is normally distributed, the Student's t-test is appropriate. Therefore, the test statistic is:

$$T = \frac{\bar{X}_n - \mu}{S_n / \sqrt{n}}$$

where

$$\bar{X}_n = \frac{1}{n}(X_1 + \cdots + X_n) \quad \text{(sample mean)}$$

and

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 \quad \text{(sample variance)}$$

μ is zero, the mean for the null hypothesis.

This T value has a Student's t distribution with n-1 degrees of freedom.
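The test statistic can be computed by hand from the sample mean and sample standard deviation. This is a minimal sketch; the six score changes in x are made-up values for illustration.

```matlab
% Hypothetical changes in score for n = 6 subjects.
x = [1.2 -0.4 2.1 0.8 1.5 0.3];

n    = numel(x);
xbar = mean(x);      % sample mean
s    = std(x);       % sample standard deviation (normalizes by n-1)
mu0  = 0;            % mean under the null hypothesis

T  = (xbar - mu0) / (s / sqrt(n));   % Student's t statistic
df = n - 1;                          % degrees of freedom

fprintf('T = %.4f with %d degrees of freedom\n', T, df);
```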

SLIDE 33

Student's t-test example

Derive the distribution of the test statistic under the null hypothesis: Student's t-distribution, n-1 df

Source: wikipedia.org

SLIDE 34

Student's t-test example

Determine the critical region for the test statistic.

Suppose N = 6. Then we have 5 degrees of freedom.

At the 5% significance level (95% confidence), the two-tailed critical value for T is 2.571.

Thus, if |T| ≥ 2.571, we reject the null hypothesis at the 5% significance level.

In other words, if the mean were really zero, |T| would be larger than 2.571 only 5% of the time.

SLIDE 35

Student's t-test example

Determine the critical region for the test statistic.

[Figure: t-distribution with regions labeled "Reject null hypothesis" and "Fail to reject null hypothesis"]

Source: wikipedia.org

SLIDE 36

Student's t-test example

Determine the critical region for the test statistic.

[Figure: t-distribution with regions labeled "Reject null hypothesis" and "Fail to reject null hypothesis"]

Source: wikipedia.org

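SLIDE 37

Student's t-test example

Compute the observed value of the test statistic.

Suppose we calculated T = 3.1.

Decide to fail to reject or to reject the null hypothesis.

98.66% of the t-distribution with 5 df lies to the left of 3.1, so 1.34% lies in the upper tail (one-tailed p = 0.0134).

For the two-tailed test, both tails count, so the p-value is about 0.027.

Since this is below 0.05, we can reject the null hypothesis at the 5% significance level.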
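The tail probabilities for T = 3.1 with 5 degrees of freedom can be reproduced without the Statistics Toolbox. This sketch relies on a standard identity: the two-tailed p-value of a t statistic equals the regularized incomplete beta function evaluated at df / (df + T^2) with parameters df/2 and 1/2, which base MATLAB provides as betainc().

```matlab
T  = 3.1;   % observed test statistic from the example
df = 5;     % degrees of freedom (N = 6)

% Two-tailed p-value via the regularized incomplete beta function.
pTwoTailed = betainc(df / (df + T^2), df / 2, 1 / 2);

pOneTailed = pTwoTailed / 2;   % area in the upper tail only
cdfAtT     = 1 - pOneTailed;   % fraction of the distribution left of T

fprintf('left of T: %.4f, one-tailed p: %.4f, two-tailed p: %.4f\n', ...
        cdfAtT, pOneTailed, pTwoTailed);
```

With the Statistics and Machine Learning Toolbox installed, tcdf(T, df) should give the same left-of-T fraction.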

SLIDE 38

ttest() function

[h,p,ci,stats] = ttest(X)

Tests against the null hypothesis that the values in vector X are drawn from a normal distribution with zero mean.

h: true if the null hypothesis is rejected
p: p-value associated with the result
ci: 95% confidence interval for the true value of the mean
stats: a structure holding the t statistic (tstat), the degrees of freedom (df), and the sample standard deviation (sd)

SLIDE 39

ttest() function

[h,p,ci,stats] = ttest(X, nullMean)

Tests against the null hypothesis that the values in vector X are drawn from a normal distribution with mean nullMean.

h: true if the null hypothesis is rejected
p: p-value associated with the result
ci: 95% confidence interval for the true value of the mean
stats: a structure holding the t statistic (tstat), the degrees of freedom (df), and the sample standard deviation (sd)

SLIDE 40

ttest() function

[h,p,ci,stats] = ttest(X, nullMean, thresh)

Tests against the null hypothesis that the values in vector X are drawn from a normal distribution with mean nullMean.

h: true if the null hypothesis is rejected at threshold thresh (default is 0.05)
p: p-value associated with the result
ci: confidence interval for the true value of the mean (95% when thresh is 0.05)
stats: a structure holding the t statistic (tstat), the degrees of freedom (df), and the sample standard deviation (sd)

SLIDE 41

Paired t-test

In a paired t-test you make two measurements on the same subject, usually before and after. The null hypothesis is that the two measurements are drawn from distributions with equal means.

Internally, all this does is take the after-before difference for each subject and test against the null hypothesis that these differences have zero mean.

[h,p,ci,stats] = ttest(X, Y)
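The equivalence between the paired test and a one-sample test on the differences can be sketched as follows. ttest() requires the Statistics and Machine Learning Toolbox, and the before/after scores are made-up values.

```matlab
% Hypothetical scores for six subjects (requires the Statistics and
% Machine Learning Toolbox for ttest).
before = [12 15 11 14 13 16];
after  = [14 18 11 17 15 19];

[hPaired, pPaired] = ttest(after, before);   % paired t-test
[hDiff,   pDiff]   = ttest(after - before);  % one-sample test on differences

fprintf('paired p = %.6f, difference p = %.6f\n', pPaired, pDiff);
```

The two calls should return the same h and p, since the paired form works on the per-subject differences.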

SLIDE 42

Outline (current section: Analysis of Variance)

Summary statistics functions

Random Variables

– Random variables, PDF, CDFs
– Estimates of central tendency and dispersion
– Standard error of the mean, confidence intervals

Statistical Hypothesis Testing

– Tests and significance
– Student's t test walkthrough
– Other commonly used tests

Analysis of Variance

Homework

SLIDE 43

Analysis of Variance

Analysis of variance (ANOVA) is a set of statistical models and methods for partitioning variance in some quantity into components attributable to different sources.

ANOVA extends the t-test to multiple groups and allows you to test against the null hypothesis that some quantity measured from all of these groups has the same mean.

SLIDE 44

Analysis of Variance

[p, table, stats] = anova1(X, groupNames)

Performs a one-way balanced ANOVA comparing the means of the columns of X.

Each column is a group, each row in that column is a data point. Thus the number of subjects in each group must be equal (i.e. balanced design).

p is the p-value for the null hypothesis that all means are the same.

SLIDE 45

Analysis of Variance

The ANOVA test makes the following assumptions about the data:

  • All sample populations are normally distributed.
  • All sample populations have equal variance.
  • All observations are mutually independent.

The ANOVA test is known to be robust with respect to modest violations of the first two assumptions.

Source: mathworks.com

SLIDE 46

Multiple comparisons

[c,m] = multcompare(stats);

When you have many groups, allows testing for differences between pairs of groups, without the rate of false positives increasing with each comparison.
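A one-way ANOVA followed by pairwise comparisons might look like the sketch below. Both functions require the Statistics and Machine Learning Toolbox; the data matrix and group names are made up, and the display options are used only to suppress the figure windows.

```matlab
% Hypothetical balanced design: rows are subjects, columns are groups.
X = [10 14 20;
     11 15 22;
      9 13 19;
     10 16 21];
groupNames = {'A', 'B', 'C'};

% One-way ANOVA; 'off' suppresses the table and box plot figures.
[p, tbl, stats] = anova1(X, groupNames, 'off');

% Pairwise comparisons using the stats structure from anova1.
c = multcompare(stats, 'Display', 'off');

fprintf('ANOVA p-value: %.3g\n', p);
disp(c);   % one row per pair of groups
```

Each row of c describes one pair of groups; the first two columns hold the indices of the groups being compared.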

SLIDE 47

N-way analysis of variance

In an N-way analysis of variance, you have a single output measurement for each subject, and each subject is described by N factors.

The aim is to determine which of these N factors (or interactions among these factors) affect the output quantity.

Works fine with unbalanced designs as well, meaning some groups can have more data points than others.

SLIDE 48

N-way analysis of variance

Each subject/trial/data point etc. is described by a single output measurement Y. It also belongs to one of several groups within each factor.
SLIDE 49

N-way analysis of variance

Example: Each data point represents one mouse.

– Output quantity: score on some behavioral assay
– Factor 1: Age group (Young, Old)
– Factor 2: Drug treatment (Control, Drug A, Drug B)

Main effects:

– Does the age group affect the score?
– Does the drug treatment group affect the score?

Interaction effects:

– Does the age group affect the score differently depending on what drug group a mouse is in? (Equivalently, vice versa?)

SLIDE 50

N-way analysis of variance

assayScore = [24 101 56 ... ]
ageGroup = {'young', 'old', 'young', ... }
treatmentGroup = {'control', 'drugA', 'drugA' ...}

[p, t, stats, terms] = anovan(assayScore, ...
    {ageGroup treatmentGroup}, ...
    'varnames', {'Age Group', 'Treatment Group'}, ...
    'model', 'interaction');

SLIDE 51

Outline

Summary statistics functions

Random Variables

– Random variables, PDF, CDFs
– Estimates of central tendency and dispersion
– Standard error of the mean, confidence intervals

Statistical Hypothesis Testing

– Tests and significance
– Student's t test walkthrough
– Other commonly used tests

Analysis of Variance

Homework

SLIDE 52

Homework

SLIDE 53

Statistical Test Reference

SLIDE 54

Two sample t-tests: ttest2()

You measure some quantity for two separate groups of subjects, and you want to know whether the means are different between the two groups.

You need to decide whether the two groups of measurements have equal variances. Is one group more variable than the other?

[h,p,ci,stats] = ttest2(X, Y);

– Assumes equal variances

[h,p,ci,stats] = ttest2(X, Y, thresh, tail, 'unequal');

– Assumes unequal variances

SLIDE 55

Two-tailed versus one-tailed

Two tailed t-test: tests the alternative hypothesis that the mean is different from zero, in either direction. You almost always want this one.

One tailed t-test: tests the alternative hypothesis that the mean is different from zero in a particular direction (i.e. greater than OR less than zero). When the effect lies in the hypothesized direction, this effectively halves your p-value, so people are often skeptical when you use this.
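The effect of the tail choice on the p-value can be sketched with ttest() and its 'Tail' option (Statistics and Machine Learning Toolbox). The data are made-up values with a positive mean.

```matlab
% Hypothetical per-subject changes in score (positive on average).
x = [2 3 0 3 2 3];

[~, pBoth]  = ttest(x);                      % two-tailed (default)
[~, pRight] = ttest(x, 0, 'Tail', 'right');  % alternative: mean > 0
[~, pLeft]  = ttest(x, 0, 'Tail', 'left');   % alternative: mean < 0

fprintf('two-tailed: %.4f, right: %.4f, left: %.4f\n', pBoth, pRight, pLeft);
```

Because the sample mean is positive here, the right-tailed p-value is half the two-tailed one, while the left-tailed p-value is close to 1. Choosing the tail after looking at the data is what makes readers skeptical.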

SLIDE 56

Tests for normality: Chi-square

h = chi2gof(x)

Performs a chi-square goodness-of-fit test of the default null hypothesis that the data in vector x are a random sample from a normal distribution with mean and variance estimated from x, against the alternative that the data are not normally distributed with the estimated mean and variance.

Source: mathworks.com

SLIDE 57

Lilliefors test for normality

h = lillietest(x)

Performs a Lilliefors test of the default null hypothesis that the sample in vector x comes from a distribution in the normal family, against the alternative that it does not come from a normal distribution.

Source: mathworks.com

SLIDE 58

Rank-sum tests

Known as Mann-Whitney U or Wilcoxon rank sum.

Assesses whether quantities in one group tend to be higher than the other.

Doesn't require data to be normal, unlike the t-test.

  • p = ranksum(x,y)
SLIDE 59

Sign-rank test

Operates analogously to the t-test or paired t-test, assessing whether a set of numbers has a median different from zero or whether there is a difference in medians between paired measurements.

Doesn't require data to be normal.

  • p = signrank(x)
  • p = signrank(x,y)

SLIDE 60

Chi-square variance test

[h, p] = vartest(X,v)

Performs a chi-square test of the null hypothesis that the sample in vector X comes from a normal distribution with variance v, against the alternative that X comes from a normal distribution with a different variance.

Data must be normal.

Source: mathworks.com

SLIDE 61

Compare variances of two groups

[h, p, ci] = vartest2(X,Y)

Performs an F test of the hypothesis that two independent samples, in the vectors X and Y, come from normal distributions with the same variance, against the alternative that they come from normal distributions with different variances.

ci is a 95% confidence interval for the true variance ratio var(X) / var(Y).

Data must be normal.

Source: mathworks.com