SLIDE 1 Statistics and Hypothesis Testing
NENS 230: Data Analysis for the Biosciences using MATLAB. Eddy Albarran, November 3, 2015
SLIDE 2 Analysis Methodology
Workflow: Data -> Exploratory Data Analysis -> Generate Hypotheses -> Hypothesis Testing -> Reject null / Fail to reject null
Exploratory Data Analysis
- Summary Statistics
- Dimensionality Reduction / PCA
- Visualization
  – Histogram
  – Scatterplots
  – Boxplots
  – etc.
Hypothesis Testing
- T-test
- Z-test
- Chi-square
- etc.
SLIDE 3 Outline
Summary statistics functions
Random Variables
– Random variables, PDF, CDFs
– Estimates of central tendency and dispersion
– Standard error of the mean, confidence intervals
Statistical Hypothesis Testing
– Tests and significance
– Student's t test walkthrough
– Other commonly used tests
Analysis of Variance
Homework
SLIDE 4 Summary Statistics
Commonly used functions:
– mean()
– std()
– var()
– sum()
– min()
– max()
SLIDE 5 mean() function
mean() computes the average (sample mean) of a vector. With matrices, you need to specify which dimension to average along.
mean(X, 1) returns the average row (averaging across the rows, giving one value per column). This is the default if you only specify one argument.
mean(X, 2) returns the average column (averaging across the columns, giving one value per row).
SLIDE 6 mean() function
mean() computes the average (sample mean) of a vector. When dealing with matrices, you need to specify which dimension to average along.
X =
    26    15     1     2.4
     0    15     1     0
mean(X, 1) evaluates to (Dim 1, averaging down each column; same as mean(X)):
    13    15     1     1.2
mean(X, 2) evaluates to (Dim 2, averaging along each row):
    11.1
     4
SLIDE 7 mean() function
mean() operates on its first argument. When averaging two values together, be careful to pack them into a vector using [ ].
A call such as mean(1, 5) is read as "take the mean of [1] along the 5th dimension".
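The pitfall above can be checked directly. This is a minimal sketch; the variable names a, b, wrong, and right are illustrative and not from the slides.

```matlab
a = 1;
b = 5;

% The second argument is interpreted as a dimension, not as a value:
% this takes the mean of the scalar 1 along dimension 5 and returns 1.
wrong = mean(a, b);

% Packing both values into one vector gives the intended average.
right = mean([a b]);
```

Here wrong is 1 and right is 3.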
SLIDE 8 std() function
std() computes the standard deviation of a list of numbers.
– When dealing with matrices, you need to specify which dimension to operate along, as the third argument.
– The second argument should be 0 if you want the estimator that normalizes by n-1 (the square root of the unbiased variance estimate), where n is the number of samples.
X =
    26    15     1     2.4
     0    15     1     0
std(X, 0, 1) evaluates to (Dim 1; same as std(X)):
    18.3848    0    0    1.6971
std(X, 0, 2) evaluates to (Dim 2):
    11.7604
     7.3485
SLIDE 9 var() function
var() computes the sample variance of a list of numbers.
– When dealing with matrices, you need to specify which dimension to operate along, as the third argument.
– The second argument should be 0 if you want the unbiased estimator that normalizes by n-1, where n is the number of samples. (This is the default.)
X =
    26    15     1     2.4
     0    15     1     0
var(X, 0, 1) evaluates to (Dim 1; same as var(X)):
    338    0    0    2.88
var(X, 0, 2) evaluates to (Dim 2):
    138.31
     54
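The dimension arguments for mean(), std(), and var() can be tried on one small matrix. The sketch below assumes the slide's example matrix is the 2-by-4 array shown in the code. The zero entries are a reconstruction, chosen because they reproduce every value printed on the slides.

```matlab
% Assumed reconstruction of the example matrix (2 rows, 4 columns)
X = [26 15 1 2.4;
      0 15 1 0  ];

colMean = mean(X, 1);      % one value per column: [13 15 1 1.2]
rowMean = mean(X, 2);      % one value per row:    [11.1; 4]

colStd  = std(X, 0, 1);    % n-1 normalization, per column
rowStd  = std(X, 0, 2);    % n-1 normalization, per row

colVar  = var(X, 0, 1);    % [338 0 0 2.88]
rowVar  = var(X, 0, 2);    % approximately [138.31; 54]
```

With one argument, all three functions operate along the first dimension of a matrix, so mean(X) matches colMean.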
SLIDE 10 sum() function
sum() computes the sum of a vector. When dealing with matrices, you should specify which dimension to sum along.
sum(X, 1) returns the sum over rows (summing over the rows within each column). This is the default if you only specify one argument.
sum(X, 2) returns the sum over columns (summing over the columns within each row).
SLIDE 11 min() function
min() computes the minimum of a vector. When dealing with matrices, you should specify which dimension to find the minimum along.
min(X, Y) returns an array the same size as X and Y consisting of the smaller of the elements in X and Y at each location.
min(X, [], 1) returns the minimum value in each column. This is the default if you only specify one argument.
min(X, [], 2) returns the minimum in each row.
SLIDE 12 max() function
max() computes the maximum of a vector. When dealing with matrices, you should specify which dimension to find the maximum along.
max(X, Y) returns an array the same size as X and Y consisting of the larger of the elements in X and Y at each location.
max(X, [], 1) returns the maximum value in each column. This is the default if you only specify one argument.
max(X, [], 2) returns the maximum in each row.
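The three calling forms of min() and max() are easy to confuse, so here is a short sketch on made-up matrices. X and Y are illustrative values, not data from the course.

```matlab
X = [3 9 1;
     7 2 8];
Y = [5 5 5;
     5 5 5];

colMin  = min(X, [], 1);    % minimum of each column: [3 2 1]
rowMax  = max(X, [], 2);    % maximum of each row:    [9; 8]
elemMin = min(X, Y);        % element-wise smaller value: [3 5 1; 5 2 5]

% A second output gives the position of each extreme value
[~, idx] = max(X, [], 2);   % column index of each row's maximum: [2; 3]
```

The empty [] in the second position is what distinguishes "along a dimension" from "compare with a second array".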
SLIDE 13 Outline
Next section: Random Variables
– Random variables, PDF, CDFs
– Estimates of central tendency and dispersion
– Standard error of the mean, confidence intervals
SLIDE 14 Discrete random variables
Suppose we have a random variable X.
Discrete random variables take one value within a set of k possible values.
Probability mass function: for a given value x_i, returns the probability p_i of X taking that value.
$\Pr[X = x_i] = p_i$
- The sum of these probabilities must be 1.
$p_1 + p_2 + \cdots + p_k = 1$
SLIDE 15 Probability Mass Function
SLIDE 16 Continuous random variables
Suppose we have a random variable X.
Continuous random variables take values within some continuous range of values.
Probability density function (PDF): integrating this function over some interval gives you the probability that X lies in that interval.
$\Pr[a \le X \le b] = \int_a^b f(x)\,dx$
Therefore, the integral of this function over the whole real line is 1.
$\int_{-\infty}^{\infty} f(x)\,dx = 1$
SLIDE 17 Normal distribution
Normal or Gaussian distributions describe many naturally occurring phenomena, due to the central limit theorem.
Specified by two parameters:
– Location parameter: the mean (μ)
– Scale parameter: the standard deviation (σ)
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Source: wikipedia.org
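The density formula can be evaluated numerically to confirm that it integrates to 1. This sketch uses only base MATLAB. The parameter values and the grid are arbitrary choices made for illustration.

```matlab
mu    = 2;      % illustrative mean (assumed)
sigma = 0.5;    % illustrative standard deviation (assumed)

% Evaluate the normal PDF on a fine grid covering +/- 8 standard deviations
x = linspace(mu - 8*sigma, mu + 8*sigma, 20001);
f = exp(-(x - mu).^2 ./ (2*sigma^2)) ./ (sigma * sqrt(2*pi));

% Numerical integral over the whole grid: should be very close to 1
totalArea = trapz(x, f);

% Probability of landing within one standard deviation of the mean (about 0.68)
mask = abs(x - mu) <= sigma;
withinOneSigma = trapz(x(mask), f(mask));
```

The second integral is the Pr[a ≤ X ≤ b] computation from the continuous random variable slide, with a = μ - σ and b = μ + σ.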
SLIDE 18 PDF for normal distribution
SLIDE 19 Cumulative distribution function
Cumulative distribution function (CDF): how likely X is to be less than or equal to a particular value.
$\Pr[X \le x] = F(x)$
- The CDF is the integral of the PDF.
- The PDF is the derivative of the CDF. Therefore, the parts of the CDF with the steepest slope correspond to the highest points of the PDF, i.e. where most of the values lie.
SLIDE 20 CDF for normal distribution
SLIDE 21 Expected Value
The expected value of a random variable is its mean. You can calculate the expected value of a random variable X by taking the weighted average of all its possible values. The weights are the probabilities of X taking each value.
Discrete RV: $E[X] = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k$
Continuous RV: $E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$
SLIDE 22 Sample mean
Sampling: When we measure some quantity in an experiment, we think of it as taking samples from a distribution.
Sample mean: By taking the average, we are estimating the mean or expected value of the underlying distribution which generated these quantities.
A central problem in statistics: How close is this estimate of the mean (the average of our samples) to the true, underlying mean?
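To see how a sample mean estimates an expected value, the sketch below uses a fair six-sided die. The die is an illustrative example, not one from the slides. The expected value comes from the weighted-average formula for a discrete random variable, and the sample mean comes from simulated rolls.

```matlab
rng(0);                          % fix the random seed so the run is repeatable

x = 1:6;                         % possible values of the die
p = ones(1, 6) / 6;              % probability of each value (sums to 1)
expectedValue = sum(x .* p);     % weighted average: 3.5

rolls = randi(6, 1, 1e5);        % 100,000 simulated rolls
sampleMean = mean(rolls);        % estimate of the expected value
```

The sample mean lands close to 3.5 but not exactly on it. How close it lands is the question addressed by the standard error of the mean.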
SLIDE 23 Standard Error of the Mean
Suppose we make N measurements of X, sampling from a normal distribution with mean μ and standard deviation σ.
If we take the average of these N samples, our estimate of the mean is itself normally distributed.
- The mean of this sampling distribution is μ.
- The standard error is σ / sqrt(N).
This means that on average, our estimate will be correct. The spread around the true mean shrinks as 1/sqrt(N).
SLIDE 24 Standard Error of the Mean
Suppose we make N measurements of X, which may or may not be normally distributed.
If we take the average of these N samples, our estimate of the mean approaches a normal distribution as N gets larger (central limit theorem).
- The mean of this sampling distribution is μ.
- The standard error is σ / sqrt(N).
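The σ / sqrt(N) rule can be checked by simulation. This is a sketch under assumed values: mu, sigma, N, and the number of repeated experiments are all arbitrary illustrative choices.

```matlab
rng(0);                                   % repeatable random numbers

mu = 10; sigma = 2;                       % true mean and standard deviation (assumed)
N = 25;                                   % measurements per experiment
nExperiments = 5000;                      % number of repeated experiments

% Each column is one experiment of N normally distributed measurements
samples = mu + sigma * randn(N, nExperiments);
sampleMeans = mean(samples, 1);           % one sample mean per experiment

empiricalSEM   = std(sampleMeans);        % spread of the sample means
theoreticalSEM = sigma / sqrt(N);         % 2 / 5 = 0.4
```

The two values should agree to within a few percent, and mean(sampleMeans) should sit close to mu.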
SLIDE 25 Confidence intervals
Based on the data you've collected, you can estimate the true value of some quantity, e.g. the true mean.
This estimate of the quantity isn't perfect.
Confidence intervals give a range of values that is expected to contain the true value with a stated level of confidence.
A 95% confidence interval is built so that, across repeated experiments, about 95% of the intervals constructed this way contain the true value of the quantity.
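One common way to build a 95% confidence interval for a mean is from the standard error and a t-distribution quantile. The sketch below assumes the Statistics and Machine Learning Toolbox is available (for tinv and ttest). The measurements are made up.

```matlab
x = [4.1 5.3 3.8 6.0 5.5 4.7];       % made-up measurements
n = numel(x);

sem   = std(x) / sqrt(n);            % estimated standard error of the mean
tCrit = tinv(0.975, n - 1);          % two-sided 95% quantile, n-1 degrees of freedom
ci    = mean(x) + [-1 1] * tCrit * sem;

% ttest returns the same interval as its third output
[~, ~, ciFromTtest] = ttest(x);
```

Because σ is estimated from the data, the t quantile is used rather than the normal quantile of 1.96.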
SLIDE 26 Outline
Next section: Statistical Hypothesis Testing
– Tests and significance
– Student's t test walkthrough
– Other commonly used tests
SLIDE 27 Statistical hypothesis testing
The point of statistical tests is to cast doubt on the veracity of the null hypothesis.
If the null hypothesis were true, it would be very unlikely to observe the given data.
Statistical tests reject the null hypothesis if the unlikelihood of the data crosses a threshold.
This threshold, or significance level, is the probability that the p-value is compared against: the likelihood of false rejections
– i.e. the likelihood that the null hypothesis would be rejected if it were true.
SLIDE 28 Statistical hypothesis testing
1. State the null (H0) and alternative (H1) hypotheses
2. State the assumptions
   – Independence of samples?
   – Normality?
3. Determine an appropriate test statistic
4. Derive the distribution of the test statistic under the null hypothesis
5. Determine the critical region for the test statistic
6. Compute the observed value of the test statistic
7. Reject or fail to reject the null hypothesis (H0)
8. i.e.: compute the smallest (most stringent) significance level at which the null hypothesis would be rejected (p-value)
SLIDE 29 Student's t-test example
Suppose we monitor scores on some behavioral assay before and after treatment. We take the difference of the scores for each subject.
Each subject's change in scores is x_i.
1) State the null (and alternative) hypotheses:
Null hypothesis: the x_i are drawn from a normal distribution with zero mean.
Alternative hypothesis: the x_i are drawn from a normal distribution with non-zero mean.
SLIDE 30 Student's t-test example
State the assumptions:
- All samples are independent.
- Changes in scores have a normal distribution. This follows from scores on the test before and after having normal distributions.
SLIDE 31 Test Statistics
Determine an appropriate test statistic.
A test statistic is a numerical summary of data that reduces the information needed to perform a hypothesis test to a single value (or a small number of values).
The important point is that we know what this quantity's distribution would look like under the null hypothesis. If the test statistic computed from the data is very unlikely to be drawn from that distribution, we can reject the null hypothesis.
SLIDE 32 Student's t-test example
Since the values are normally distributed, the Student's t-test is appropriate. Therefore, the test statistic is:
$T = \frac{\bar{X}_n - \mu}{S_n / \sqrt{n}}$
where
$\bar{X}_n = \frac{1}{n}(X_1 + \cdots + X_n)$ (sample mean)
and
$S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$ (sample variance)
μ is zero, the mean under the null hypothesis.
This T value has a Student's t distribution with n-1 degrees of freedom.
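The test statistic can be computed by hand and compared with what ttest reports. This sketch assumes the Statistics and Machine Learning Toolbox. The per-subject changes are made-up numbers.

```matlab
x = [1.2 -0.4 2.1 0.8 1.9 0.3];      % made-up per-subject changes in score
n = numel(x);

xbar = mean(x);                      % sample mean
s    = std(x);                       % square root of the sample variance (n-1 normalization)
mu0  = 0;                            % mean under the null hypothesis

T = (xbar - mu0) / (s / sqrt(n));    % t statistic with n-1 degrees of freedom

% ttest computes the same statistic and stores it in stats.tstat
[h, p, ~, stats] = ttest(x);
```

stats.tstat should match T, and stats.df should equal n - 1.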
SLIDE 33 Student's t-test example
Derive the distribution of the test statistic under the null hypothesis: Student's t-distribution, n-1 df
Source: wikipedia.org
SLIDE 34 Student's t-test example
Determine the critical region for the test statistic.
Suppose N = 6. Then we have 5 degrees of freedom.
At the 5% significance level (95% confidence), the two-tailed critical value for T is 2.571.
Thus, if |T| ≥ 2.571, we reject the null hypothesis at the 5% significance level.
In other words, if the mean were really zero, |T| would be larger than 2.571 only 5% of the time.
SLIDE 35 Student's t-test example
Determine the critical region for the test statistic.
[Figure: t-distribution with regions labeled "Reject null hypothesis" and "Fail to reject null hypothesis"]
Source: wikipedia.org
SLIDE 36 Student's t-test example
Determine the critical region for the test statistic.
[Figure: t-distribution with regions labeled "Reject null hypothesis" and "Fail to reject null hypothesis"]
Source: wikipedia.org
SLIDE 37 Student's t-test example
Compute the observed value of the test statistic.
Suppose we calculated T = 3.1.
Decide to fail to reject or to reject the null hypothesis.
98.66% of the t-distribution with 5 df lies to the left of 3.1, so 1.34% lies to the right.
Therefore we can reject the null hypothesis at the 5% significance level.
Our one-tailed p-value is 0.0134; the two-tailed p-value is twice that, about 0.0268.
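The critical value and p-value for this walkthrough can be checked numerically with tinv and tcdf. This is a sketch that assumes the Statistics and Machine Learning Toolbox and a two-tailed test at the 5% level.

```matlab
df = 5;                          % N = 6 subjects, so 5 degrees of freedom
T  = 3.1;                        % observed test statistic from the example

tCrit = tinv(0.975, df);         % two-tailed 5% critical value, about 2.571

leftArea  = tcdf(T, df);         % fraction of the distribution below T, about 0.9866
pOneTail  = 1 - leftArea;        % about 0.0134
pTwoTail  = 2 * pOneTail;        % about 0.0268

rejectNull = abs(T) >= tCrit;    % true: reject at the 5% level
```

The factor of 2 appears because the two-tailed test counts values as extreme as T in either direction.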
SLIDE 38 ttest() function
[h,p,ci,stats] = ttest(X)
Tests against the null hypothesis that the values in vector X are drawn from a normal distribution with zero mean.
h: true if the null hypothesis is rejected
p: p-value associated with the result
ci: 95% confidence interval for the true value of the mean
stats: a structure holding the t statistic (tstat), the degrees of freedom (df), and the sample standard deviation (sd)
SLIDE 39 ttest() function
[h,p,ci,stats] = ttest(X, nullMean)
Tests against the null hypothesis that the values in vector X are drawn from a normal distribution with mean nullMean.
The outputs h, p, ci, and stats are the same as for ttest(X).
SLIDE 40 ttest() function
[h,p,ci,stats] = ttest(X, nullMean, thresh)
Tests against the null hypothesis that the values in vector X are drawn from a normal distribution with mean nullMean.
h: true if the null hypothesis is rejected at threshold thresh (default is 0.05)
p: p-value associated with the result
ci: 100*(1-thresh)% confidence interval for the true value of the mean (95% by default)
stats: a structure holding tstat, df, and sd
SLIDE 41 Paired t-test
In a paired t-test you make two measurements on the same subject, usually before and after. The null hypothesis is that the two measurements are drawn from distributions with equal means.
Internally, all this does is take the difference between the two measurements for each subject and test against the null hypothesis that these differences have zero mean.
[h,p,ci,stats] = ttest(X, Y)
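The equivalence between the paired form and a one-sample test on the differences can be confirmed directly. This sketch assumes the Statistics and Machine Learning Toolbox. The before and after scores are made up.

```matlab
before = [12 15 11 14 13 16];        % made-up scores before treatment
after  = [14 18 12 17 13 19];        % made-up scores after treatment

% Paired t-test on the two sets of measurements
[hPaired, pPaired] = ttest(after, before);

% One-sample t-test on the per-subject differences
[hDiff, pDiff] = ttest(after - before);
```

Both calls give the same p-value, so the paired test is only a convenience over computing the differences yourself.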
SLIDE 42 Outline
Next section: Analysis of Variance
SLIDE 43 Analysis of Variance
Analysis of variance (ANOVA) is a set of statistical models and methods for partitioning variance in some quantity into components attributable to different sources.
ANOVA extends the t-test to multiple groups and allows you to test against the null hypothesis that some quantity measured from all of these groups has the same mean.
SLIDE 44 Analysis of Variance
[p, table, stats] = anova1(X, groupNames)
Performs a one-way balanced ANOVA comparing the means of the columns of X.
Each column is a group, and each row in that column is a data point. Thus the number of subjects in each group must be equal (i.e. balanced design).
p is the p-value for the null hypothesis that all means are the same.
SLIDE 45 Analysis of Variance
The ANOVA test makes the following assumptions about the data:
- All sample populations are normally distributed.
- All sample populations have equal variance.
- All observations are mutually independent.
The ANOVA test is known to be robust with respect to modest violations of the first two assumptions.
Source: mathworks.com
SLIDE 46 Multiple comparisons
[c,m] = multcompare(stats);
When you have many groups, this allows testing for differences between pairs of groups without the rate of false positives increasing with each comparison.
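A one-way ANOVA followed by pairwise comparisons can be sketched as below. The sketch assumes the Statistics and Machine Learning Toolbox. The data and group names are made up, and the 'off' and 'Display' options are only there to suppress the figure windows.

```matlab
% Made-up balanced data: each column is one group, each row one subject
X = [4.1 5.0 8.2;
     3.8 5.4 7.9;
     4.5 4.7 8.8;
     4.0 5.2 8.1];
groupNames = {'A', 'B', 'C'};

% One-way ANOVA: p tests the null hypothesis that all group means are equal
[p, tbl, stats] = anova1(X, groupNames, 'off');

% Pairwise comparisons with a correction for multiple testing.
% Each row of c is one pair: columns 1-2 hold the group indices,
% followed by a confidence interval for the difference in means.
c = multcompare(stats, 'Display', 'off');
```

Passing the stats output from anova1 is what links the two steps. multcompare then reports which specific pairs of groups differ.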
SLIDE 47 N-way analysis of variance
In an N-way analysis of variance, you have a single output measurement for each subject, and each subject is described by N factors.
The aim is to determine which of these N factors (or interactions among these factors) affect the output measurement.
Works fine with unbalanced designs as well, meaning some groups can have more data points than others.
SLIDE 48 N-way analysis of variance
Each subject/trial/data point etc. is described by a single output measurement Y. It also belongs to one of several groups within each factor.
SLIDE 49 N-way analysis of variance
Example: Each data point represents one mouse.
– Output quantity: score on some behavioral assay
– Factor 1: Age group (Young, Old)
– Factor 2: Drug treatment (Control, Drug A, Drug B)
Main effects:
– Does the age group affect the score?
– Does the drug treatment group affect the score?
Interaction effects:
– Does the age group affect the score differently depending on what drug group a mouse is in? (Equivalently, vice versa?)
SLIDE 50 N-way analysis of variance
assayScore = [24 101 56 ... ];
ageGroup = {'young', 'old', 'young', ... };
treatmentGroup = {'control', 'drugA', 'drugA', ... };
[p, t, stats, terms] = anovan(assayScore, ...
    {ageGroup, treatmentGroup}, ...
    'varnames', {'Age Group', 'Treatment Group'}, ...
    'model', 'interaction');
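The call above elides its data, so here is a self-contained sketch with synthetic values. It assumes the Statistics and Machine Learning Toolbox. Every score is made up (2 age groups x 3 treatments x 2 mice per cell), and 'display', 'off' is added only to suppress the ANOVA table window.

```matlab
% Synthetic scores, one entry per mouse (all values made up)
assayScore = [24 26 40 43 41 39 30 28 60 63 45 47]';

% Group labels, one per mouse, in the same order as assayScore
ageGroup = [repmat({'young'}, 6, 1); repmat({'old'}, 6, 1)];
oneAgeBlock = {'control'; 'control'; 'drugA'; 'drugA'; 'drugB'; 'drugB'};
treatmentGroup = [oneAgeBlock; oneAgeBlock];

[p, tbl, stats, terms] = anovan(assayScore, {ageGroup, treatmentGroup}, ...
    'varnames', {'Age Group', 'Treatment Group'}, ...
    'model', 'interaction', ...
    'display', 'off');

% p(1): main effect of age
% p(2): main effect of treatment
% p(3): age x treatment interaction
```

The terms output lists which factors enter each row of p: [1 0] for age, [0 1] for treatment, and [1 1] for their interaction.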
SLIDE 51 Outline
Next section: Homework
SLIDE 52 Homework
SLIDE 53 Statistical Test Reference
SLIDE 54 Two sample t-tests: ttest2()
You measure some quantity for two separate groups of subjects, and you want to know whether the means are different between the two groups.
You need to decide whether the two groups of measurements have equal variances. Is one group more variable than the other?
[h,p,ci,stats] = ttest2(X, Y);
– Assumes equal variances
[h,p,ci,stats] = ttest2(X, Y, thresh, tail, 'unequal');
– Assumes unequal variances
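The choice between the two forms can be sketched as below, using vartest2 to compare the group variances first. The sketch assumes the Statistics and Machine Learning Toolbox. The data are made up, and the unequal-variance test is written with the name-value form ('Vartype', 'unequal') rather than the positional form shown on the slide.

```matlab
x = [5.1 4.8 5.6 5.0 4.9 5.3];            % made-up group 1 (low variability)
y = [6.9 4.2 8.1 5.5 9.0 3.8 7.4];        % made-up group 2 (high variability)

% F test of the null hypothesis that both groups have the same variance
[hVar, pVar] = vartest2(x, y);

% Two-sample t-test assuming equal variances (pooled estimate)
[hEq, pEq, ~, statsEq] = ttest2(x, y);

% Two-sample t-test without the equal-variance assumption
[hUneq, pUneq, ~, statsUneq] = ttest2(x, y, 'Vartype', 'unequal');
```

With unequal variances the test uses fewer effective degrees of freedom, so statsUneq.df is smaller than the pooled value of numel(x) + numel(y) - 2.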
SLIDE 55 Two-tailed versus one-tailed
Two-tailed t-test: tests the alternative hypothesis that the mean is different from zero, in either direction. You almost always want this one.
One-tailed t-test: tests the alternative hypothesis that the mean is different from zero in a particular direction (i.e. greater than OR less than zero). This effectively halves your p-value, so people are often skeptical when you use this.
SLIDE 56 Tests for normality: Chi-square
h = chi2gof(x)
Performs a chi-square goodness-of-fit test of the default null hypothesis that the data in vector x are a random sample from a normal distribution with mean and variance estimated from x, against the alternative that the data are not normally distributed with the estimated mean and variance.
Source: mathworks.com
SLIDE 57 Lilliefors test for normality
h = lillietest(x)
Performs a Lilliefors test of the default null hypothesis that the sample in vector x comes from a distribution in the normal family, against the alternative that it does not come from a normal distribution.
Source: mathworks.com
SLIDE 58 Rank-sum tests
Known as Mann-Whitney U or Wilcoxon rank sum.
Assesses whether quantities in one group tend to be higher than in the other.
Doesn't require data to be normal, unlike the t-test.
SLIDE 59 Sign-rank test
Operates analogously to the t-test or paired t-test, assessing whether a set of numbers has a median different from zero or whether there is a difference in medians between paired measurements.
Doesn't require data to be normal.
p = signrank(x,y)
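The two nonparametric tests can be sketched side by side: signrank for paired measurements and ranksum for two independent groups. ranksum is MATLAB's function for the rank-sum test, which the slides describe without naming. The sketch assumes the Statistics and Machine Learning Toolbox, and all data are made up.

```matlab
% Paired measurements on the same subjects (made up)
before = [12 15 11 14 13 16 10 17];
after  = [14 18 12 17 15 19 13 18];
pPaired = signrank(after, before);       % Wilcoxon signed-rank test

% Two independent groups (made up)
groupA = [1.1 2.3 1.9 2.8 1.5 2.0];
groupB = [3.9 4.4 3.1 5.0 4.7 3.6];
pGroups = ranksum(groupA, groupB);       % Wilcoxon rank-sum (Mann-Whitney U) test
```

Both tests work on ranks rather than raw values, which is why neither needs the data to be normally distributed.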
SLIDE 60 Chi-square variance test
[h, p] = vartest(X, v)
Performs a chi-square test of the null hypothesis that the samples in vector X come from a normal distribution with variance v, against the alternative that X comes from a normal distribution with a different variance.
Data must be normal.
Source: mathworks.com
SLIDE 61 Compare variances of two groups
[h, p, ci] = vartest2(X, Y)
Performs an F test of the hypothesis that two independent samples, in the vectors X and Y, come from normal distributions with the same variance, against the alternative that they come from normal distributions with different variances.
ci is a 95% confidence interval for the true variance ratio var(X) / var(Y).
Data must be normal.
Source: mathworks.com