Empirical problem solving Statistical method R.W. Oldford - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Empirical problem solving

Statistical method R.W. Oldford

slide-2
SLIDE 2

Empirical problem solving - PPDAC

The reasoning chain of any empirical study has five essential links or stages: Problem, Plan, Data, Analysis, Conclusions.

◮ Each stage has its own concerns to be dealt with.
◮ Each stage depends on those stages which went before.
◮ None of them can be overlooked.

Reference: Scientific Method, Statistical Method, and the Speed of Light

slide-3
SLIDE 3

PPDAC - Problem

◮ Target population/process (units and collection)
◮ Variates (explanatory and response)
◮ Population attribute(s) of interest
◮ Problem aspect(s)
  ◮ Causative, descriptive, predictive

slide-4
SLIDE 4

PPDAC - Problem: Target Population

Target population/process PTarget

◮ the collection of units that we want to learn about
◮ sometimes a process which produces units over time is easier to define than a fixed population
  ◮ e.g. stock trading prices, production lines, streaming data, . . .
◮ carefully define what constitutes an individual unit of PTarget
◮ PTarget often includes inaccessible units, for example
  ◮ future units (especially if PTarget is a process),
  ◮ units that cannot be studied for ethical reasons, . . .
◮ write all this information down, keep notes

slide-5
SLIDE 5

PPDAC - Problem: Variates

Variates x(u), y(u), z(u), . . . , ∀u ∈ PTarget

◮ brainstorm all characteristics that might be attached to any individual unit u (i.e. variates)
  ◮ err on the side of too many variates
  ◮ critical review can come later
◮ for each variate, identify the kind of values it might take
  ◮ discrete, continuous,
  ◮ finite, practically infinite,
  ◮ categorical, ordinal, interval, ratio scale
◮ arrange variates on a fishbone diagram (or possibly several diagrams)
  ◮ distinguish response variates from explanatory variates
  ◮ use fishbone to help elicit possible variates
  ◮ use fishbone to help group and organize possible variates
  ◮ understanding the variates helps define PTarget
◮ write all this information down, keep notes
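
One way to "write all this information down" is a small machine-readable catalogue of variates. The sketch below is only an illustration: the `Variate` class, its field names, and the example variates for a made-up study of machined parts are all assumptions, not part of the slides.

```python
from dataclasses import dataclass

# The four measurement scales named on the slide; the tuple itself is a hypothetical helper.
SCALES = ("categorical", "ordinal", "interval", "ratio")


@dataclass(frozen=True)
class Variate:
    name: str
    scale: str        # one of SCALES
    continuous: bool  # False means discrete
    response: bool    # False means explanatory

    def __post_init__(self):
        # Reject scales that are not in the agreed list.
        if self.scale not in SCALES:
            raise ValueError(f"unknown scale: {self.scale!r}")


# Made-up variates, purely for illustration.
variates = [
    Variate("diameter_mm", "ratio", continuous=True, response=True),
    Variate("machine", "categorical", continuous=False, response=False),
    Variate("operator_experience", "ordinal", continuous=False, response=False),
]

# Distinguish response variates from explanatory variates.
responses = [v.name for v in variates if v.response]
explanatory = [v.name for v in variates if not v.response]
```

Keeping the catalogue as data makes it easy to review critically later and to group variates in the same way a fishbone diagram would.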

slide-6
SLIDE 6

PPDAC - Problem: Variates: Fishbone diagram

◮ place every variate on the diagram
◮ use branches (e.g. "6 Ms") to organize
◮ sub-branches organize further
  ◮ e.g. measurement: gauge, person, method
◮ might have more than one fishbone diagram

slide-7
SLIDE 7

PPDAC - Problem: Population attributes

Population attributes a1(PTarget), a2(PTarget), a3(PTarget), . . .

◮ numerical: counts, locations, scales, correlations, other coefficients, . . .
◮ functions: regressions (parametric and nonparametric), density functions, prediction functions, dependence graphs, . . .
◮ graphical: barplots, scatterplots, density estimates, contour plots, heatmaps, . . .
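
A population attribute can be thought of as a function that takes the whole population and returns a summary. The following minimal sketch computes a few numerical attributes for a tiny made-up population; the data, the dictionary layout, and the function names are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

# A made-up finite population: each unit u carries variates x(u) and y(u).
population = [
    {"u": 1, "x": 1.0, "y": 2.0},
    {"u": 2, "x": 2.0, "y": 4.0},
    {"u": 3, "x": 3.0, "y": 6.0},
    {"u": 4, "x": 4.0, "y": 8.0},
]


def count(P):
    """A count: the number of units in P."""
    return len(P)


def location_y(P):
    """A location: the average of y(u) over all units in P."""
    return mean([unit["y"] for unit in P])


def scale_y(P):
    """A scale: the population (not sample) standard deviation of y(u)."""
    return pstdev([unit["y"] for unit in P])


def slope_y_on_x(P):
    """A coefficient: the least-squares slope of y on x over P."""
    xbar = mean([unit["x"] for unit in P])
    ybar = mean([unit["y"] for unit in P])
    sxy = sum((unit["x"] - xbar) * (unit["y"] - ybar) for unit in P)
    sxx = sum((unit["x"] - xbar) ** 2 for unit in P)
    return sxy / sxx
```

Each function takes the population P as its only argument, mirroring the notation a(P).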

slide-8
SLIDE 8

PPDAC - Problem: Aspect

Problem aspect

◮ Descriptive:
  ◮ attributes are really population/process summaries
  ◮ interest lies in learning their values
  ◮ often interest lies in relating variate values
◮ Predictive:
  ◮ attributes relate variate values of units
  ◮ interest lies in predicting the values of some variates from those of others
  ◮ e.g. predicting y(u) for some u in the future, perhaps given x(u)
◮ Causative:
  ◮ interest lies in discovering a causal relation
  ◮ interest lies in changes of attributes

slide-9
SLIDE 9

PPDAC - Problem: Aspect: Causation

It is useful to have a general working definition of causation.

Have:

◮ an explanatory variate whose value x(u) can be set for all u ∈ PTarget
◮ an attribute of interest a(P)

Of interest:

◮ when x(u) is changed to x⋆(u) for every u, and
◮ no other changes are made to any other variate z(u),
◮ does the attribute of interest a(P) change in response?

If so, then we say that the change in x caused the change in a(P).

Note that this defines the causal effect at the level of the population attribute a(P), not at the level of an individual unit u.

slide-10
SLIDE 10

PPDAC - Problem: Aspect: Causation

Begin with a population P and set the value of x(u) for all u ∈ P (denoted x(u) ← “value”).

Now, if a(P | x ← “red”) ≠ a(P | x ← “blue”), then we say that the change in x caused the change in a(P) and write Δx ⟹ Δa(P).

slide-11
SLIDE 11

PPDAC - Problem: Aspect: Causation

In contrast, suppose we only observe x(u) for all u ∈ P (i.e. that x(u) = “value”).

Then a(P | x = “red”) and a(P | x = “blue”) are simply different attributes. This says nothing about a causal relation between x and a(P). We simply observe whatever differences exist between the two attributes.
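
The difference between setting x and merely observing x can be illustrated with a toy population. Everything in this sketch is made up: the six units, the extra variate z, and especially the `response` function, which stands in for a mechanism that is never known in a real study. Here y depends on z only, so setting x changes nothing, yet the observed attributes still differ because x and z happen to be associated.

```python
from statistics import mean

# Made-up population: z(u) is another variate, x(u) is the observed colour.
units = [
    {"z": "old", "x": "red"},
    {"z": "old", "x": "red"},
    {"z": "old", "x": "blue"},
    {"z": "new", "x": "red"},
    {"z": "new", "x": "blue"},
    {"z": "new", "x": "blue"},
]


def response(unit, x):
    """Hypothetical mechanism giving y(u) when the colour is x: y depends on z only."""
    return 10.0 if unit["z"] == "old" else 0.0


def a_observed(P, value):
    """a(P | x = value): average y over the units observed to have x == value."""
    return mean([response(u, u["x"]) for u in P if u["x"] == value])


def a_set(P, value):
    """a(P | x <- value): average y after setting x to value for every unit."""
    return mean([response(u, value) for u in P])


# Observed attributes differ, but only because "old" units are more often red.
observed_difference = a_observed(units, "red") - a_observed(units, "blue")

# Attributes under setting x are equal: changing x causes no change in a(P).
set_difference = a_set(units, "red") - a_set(units, "blue")
```

In this toy case `observed_difference` is nonzero while `set_difference` is zero, which is the gap between an observational relation and a causal one.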

slide-12
SLIDE 12

PPDAC - Problem: Aspect: Causation

Note that

◮ if we cannot set the values of x(u), we cannot assert a causal relation,
◮ the changes in x(u) need not all be to the same value as in the above example; more likely are changes such as
  ◮ x(u) → x(u) + δ, or
  ◮ x(u) → x(u) + δu, or
  ◮ x(u) → (1 + δu) × x(u), and
  ◮ any x(u) could be vector valued, or more complex.
◮ causal effect is at the population level
◮ this causal definition is an idealization,
◮ but shows the difference between a causal relation and an observational one,
◮ and suggests that establishing causation is likely to be a challenge.
◮ the challenge includes variations on
  ◮ study error
  ◮ sample error
  ◮ measurement error

slide-13
SLIDE 13

PPDAC - Plan

◮ Study population/process, Variates, Attributes
  ◮ experimental or observational
◮ Develop sampling protocol
◮ Dealing with variates
  ◮ fishbone diagram
  ◮ selecting response variate
  ◮ controlling explanatory variates
  ◮ experimental variates (causative aspect)
◮ measuring process(es)
◮ data collection protocol(s)

slide-14
SLIDE 14

PPDAC - Plan: Study population

Here we determine the study population/process PStudy,

◮ the collection of units that we have access to
◮ should resemble PTarget as much as possible, especially in the population attributes of interest
◮ could also be thought of as a process which produces units over time, but
  ◮ it will have a finite (possibly large) number of units,
  ◮ either because it is a population, or
  ◮ because the study must be done within a finite fixed time period.
◮ again, carefully define what constitutes an individual unit of PStudy
◮ PStudy includes only units which are available and accessible for study, during the time of the study.
◮ write all this information down, keep notes

slide-15
SLIDE 15

PPDAC - Plan: Sampling protocol

Here we determine a sampling protocol,

◮ how should units from PStudy be selected to be part of the sample S, for example
  ◮ possibly rule out some samples as possibilities
  ◮ determine how samples are to be selected
  ◮ deterministically or by some probability mechanism?
  ◮ with what probabilities of selection?
◮ there are numerous sampling plans to choose from, depending on the problem
◮ determine how sample selection (including how to randomly select) will be implemented
◮ the sampling protocol might depend on how particular variates are dealt with
◮ write down all procedures and instructions, keep notes
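
As one possible concrete protocol, the sketch below implements simple random sampling without replacement, which is only one of the many sampling plans the slide alludes to. The function names, the unit labels 1 to 20, and the seed are assumptions for illustration; recording the seed is one way to document how the random selection was implemented.

```python
import random
from itertools import combinations


def simple_random_sample(unit_ids, n, seed):
    """Select n units without replacement; the recorded seed makes the selection reproducible."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(unit_ids), n))


def inclusion_probability(N, n):
    """Under simple random sampling each unit is selected with probability n / N."""
    return n / N


def inclusion_probability_by_enumeration(unit_ids, n, unit):
    """Check the selection probability by listing every possible sample of size n."""
    samples = list(combinations(unit_ids, n))
    return sum(unit in s for s in samples) / len(samples)


study_population = range(1, 21)  # hypothetical unit labels 1..20
sample = simple_random_sample(study_population, n=5, seed=2024)
```

Enumerating all samples is feasible only for very small populations, but it makes the phrase "with what probabilities of selection" explicit.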

slide-16
SLIDE 16

PPDAC - Plan: Variates

Dealing with variates

◮ determine response and explanatory variates using another fishbone diagram,
◮ control explanatory variates, and on the fishbone diagram mark explanatory variates as
  ◮ (B) for “blocking” variates, if their values are fixed, or otherwise severely constrained by design,
  ◮ (R) for “randomized” if the values are assigned by deliberate randomization,
  ◮ (E) for “experimental” if the variate will be set purposefully and differently to evaluate its causal effect (if any),
  ◮ (O) for “observed” if the value will simply be measured and observed,
  ◮ values of remaining variates are not to be recorded
◮ write all this information down, keep notes

If there is an experimental variate, i.e. one whose value is deliberately set, then this is an experimental study; otherwise it is an observational study.
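
The (B), (R), (E), (O) marking and the rule separating experimental from observational studies can be written down directly. In this sketch the `Role` enum, the `study_type` function, and the variate names in `plan` are hypothetical; only the four labels and the rule itself come from the slide.

```python
from enum import Enum


class Role(Enum):
    BLOCKING = "B"      # fixed or severely constrained by design
    RANDOMIZED = "R"    # assigned by deliberate randomization
    EXPERIMENTAL = "E"  # set purposefully to evaluate its causal effect
    OBSERVED = "O"      # simply measured and observed


def study_type(roles):
    """Experimental if any variate is marked (E), otherwise observational."""
    return "experimental" if Role.EXPERIMENTAL in roles.values() else "observational"


# Made-up plan for illustration.
plan = {
    "batch": Role.BLOCKING,
    "run_order": Role.RANDOMIZED,
    "temperature": Role.EXPERIMENTAL,
    "humidity": Role.OBSERVED,
}

# The same plan with the experimental variate only observed instead of set.
observational_plan = {**plan, "temperature": Role.OBSERVED}
```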

slide-17
SLIDE 17

PPDAC - Plan: Population attributes

Population attributes a1(PStudy), a2(PStudy), a3(PStudy), . . .

◮ numerical: counts, locations, scales, correlations, other coefficients, . . .
◮ functions: regressions (parametric and nonparametric), density functions, prediction functions, dependence graphs, . . .
◮ graphical: barplots, scatterplots, density estimates, contour plots, heatmaps, . . .

These should, as much as possible, match those for the target population.

slide-18
SLIDE 18

PPDAC - Plan: Measuring process(es)

For every variate whose value is to be determined in the study, there will be an associated measuring system, or process, used to determine that value. For each variate, identify

◮ the gauge(s) or instrument(s) to be used to determine the value
◮ the person, or persons, involved in determining the value
◮ the method to be followed by the person, or persons, using the gauge(s)

Some effort will be required to ensure that sources of measuring variability and bias are identified and made as small as practically possible and scientifically necessary. This could involve a separate study (and PPDAC) to make the measuring systems sufficiently reliable (e.g. a gauge repeatability and reproducibility study).
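
A very simple check on a measuring system is to measure one unit repeatedly when a trusted reference value is available. The sketch below assumes such a reference exists, which is often not the case; the readings, the reference value, and the function names are made up, and this is far short of a full gauge repeatability and reproducibility study.

```python
from statistics import mean, stdev


def measuring_bias(measurements, reference_value):
    """Average measured value minus the (assumed known) reference value."""
    return mean(measurements) - reference_value


def measuring_variability(measurements):
    """Standard deviation of repeated measurements of the same unit."""
    return stdev(measurements)


# Made-up repeated readings of one unit by one person with one gauge.
readings = [10.2, 10.1, 10.3, 10.2, 10.2]
reference = 10.0  # hypothetical trusted reference value

bias = measuring_bias(readings, reference)
variability = measuring_variability(readings)
```

Repeating the exercise with a different gauge, person, or method would show how each source contributes to the variability.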

slide-19
SLIDE 19

PPDAC - Plan: Data collection protocols

Every variate value will need to be recorded and stored for subsequent analysis. The data collection protocol should give instructions on

◮ what to record: the value and units of measurement (e.g. metres, count, newtons, . . . ), the variate itself, and the sample unit whose variate was being determined,
◮ what range of values might be expected for each variate,
◮ how and where the results are to be recorded,
◮ what edits and other quality checks might be conducted as the data are being collected,
◮ how to deal with exceptions.

The larger and more involved the study, and the more complex the data, the more detailed these (and possibly other) instructions will need to be to minimize opportunities for error.
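
Part of such a protocol can be encoded as a record layout plus an edit check applied as each value arrives. This is a minimal sketch under stated assumptions: the `Record` fields, the `EXPECTED_RANGE` table, and the variate "length" are all hypothetical, and a real protocol would cover many more checks and exceptions.

```python
from dataclasses import dataclass


@dataclass
class Record:
    unit_id: str  # the sample unit whose variate was determined
    variate: str  # which variate was measured
    value: float  # the recorded value
    units: str    # units of measurement, e.g. "metres"


# Hypothetical expected ranges, keyed by variate name.
EXPECTED_RANGE = {"length": (0.0, 2.5)}


def check(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if record.variate not in EXPECTED_RANGE:
        problems.append("unknown variate")
    else:
        low, high = EXPECTED_RANGE[record.variate]
        if not (low <= record.value <= high):
            problems.append("value outside expected range")
    if not record.units:
        problems.append("missing units of measurement")
    return problems


good = Record("u1", "length", 1.2, "metres")
bad = Record("u2", "length", 12.0, "")
```

Returning a list of problems, rather than raising on the first one, lets the collector log every exception for a record in one pass.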

slide-20
SLIDE 20

PPDAC - The inductive path

Throughout the problem solving process, but especially at the Plan stage, serious consideration of, and reflection upon, the inductive path should be given.

Think about sources of Study, Sample, and Measurement errors.

Use sampling and measuring protocols having low (absolute) bias and low variability.
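
For a small study population, the bias and variability of a sampling protocol can be computed exactly by listing every possible sample. The sketch below assumes simple random sampling of n = 2 units, takes the average of y as the attribute, and takes the sample error to be a(S) minus a(PStudy); that definition, the data, and all names are assumptions, since the slides name the errors without defining them.

```python
from itertools import combinations
from statistics import mean, pstdev

# Made-up values y(u) for a study population of four units.
y = [2.0, 4.0, 6.0, 8.0]
n = 2

# The attribute of interest: the average of y over the whole study population.
attribute = mean(y)

# Sample error for every equally likely sample of size n (assumed definition).
sample_errors = [mean(s) - attribute for s in combinations(y, n)]

# Sampling bias: the average sample error over all possible samples.
sampling_bias = mean(sample_errors)

# Sampling variability: the standard deviation of the sample errors.
sampling_variability = pstdev(sample_errors)
```

Here the bias is zero because every sample is equally likely and the attribute is an average; a protocol that rules out some samples would generally shift it.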

slide-21
SLIDE 21

PPDAC - Data

The “do” stage

◮ Execute the plan
  ◮ select study, sample
  ◮ all plan protocols
  ◮ record departures
◮ Data monitoring
  ◮ as collected
  ◮ data validity
  ◮ data cleaning
◮ Data examination
  ◮ internal consistency
  ◮ explore patterns
◮ Data storage
  ◮ media
  ◮ data structures
  ◮ meta-data

slide-22
SLIDE 22

PPDAC - Analysis

◮ Data summary
  ◮ numerical and graphical
◮ Model construction
  ◮ build, fit, criticize cycle
◮ Formal analysis
  ◮ inference

slide-23
SLIDE 23

PPDAC - Conclusions

◮ synthesis
  ◮ plain language, effective presentation graphics
◮ limitations
  ◮ discussion of potential errors

slide-24
SLIDE 24

PPDAC - Our focus has been

◮ Execute the plan
  ◮ select study, sampling
  ◮ all plan protocols
  ◮ record departures
◮ Data monitoring
  ◮ as collected
  ◮ data validity
  ◮ data cleaning
◮ Data examination
  ◮ internal consistency
  ◮ explore patterns
◮ Data storage
  ◮ media
  ◮ data structures
  ◮ meta-data
◮ Data summary
  ◮ numerical and graphical
◮ Model construction (some)
  ◮ build, fit, criticize cycle
◮ Formal analysis (some)

slide-25
SLIDE 25

PPDAC - Our focus

Activities in both stages might create, grow, and refine the data available to our study.

How might each of these change the repository? Examples?

Upstream activities might have to be revisited in light of what might be learned downstream.

slide-26
SLIDE 26

PPDAC - The data analysis environment

There is a common pattern for data analyses (Grolemund & Wickham with their “tidy”). While obviously applicable to the analysis stage, it could (at least in part) appear wherever there is interaction with the data, including any step in the creation, growth, or refinement of the data repository.

Most of this will be done within an analysis environment (e.g. R).

Curated analyses should appear in a reproducible form such as a “notebook” environment using RMarkdown or Jupyter, etc., possibly interactive such as with RStudio’s Shiny.

slide-27
SLIDE 27

But, wait, there’s more . . .

There is, in fact, much, much, much more.