Empirical problem solving - Statistical method - R.W. Oldford
Empirical problem solving - PPDAC
The reasoning chain of any empirical study has five essential links or stages: Problem, Plan, Data, Analysis, Conclusions.
◮ Each stage has its own concerns to be dealt with.
◮ Each stage depends on those stages which went before.
◮ None of the stages can be overlooked.
Reference: Scientific Method, Statistical Method, and the Speed of Light
PPDAC - Problem
◮ Target population/process (units and collection)
◮ Variates (explanatory and response)
◮ Population attribute(s) of interest
◮ Problem aspect(s)
  ◮ Causative, descriptive, predictive
PPDAC - Problem: Target Population
Target population/process PTarget
◮ the collection of units that we want to learn about
◮ sometimes a process which produces units over time is easier to define than a fixed population
  ◮ e.g. stock trading prices, production lines, streaming data, . . .
◮ carefully define what constitutes an individual unit of PTarget
◮ PTarget often includes inaccessible units, for example
  ◮ future units (especially if PTarget is a process),
  ◮ units that cannot be studied for ethical reasons, . . .
◮ write all this information down, keep notes
PPDAC - Problem: Variates
Variates x(u), y(u), z(u), . . . , ∀u ∈ PTarget
◮ brainstorm all characteristics that might be attached to any individual unit u (i.e. variates)
  ◮ err on the side of too many variates
  ◮ critical review can come later
◮ for each variate, identify the kind of values it might take
  ◮ discrete, continuous,
  ◮ finite, practically infinite,
  ◮ categorical, ordinal, interval, ratio scale
◮ arrange variates on a fishbone diagram (or possibly several diagrams)
  ◮ distinguish response variates from explanatory variates
  ◮ use fishbone to help elicit possible variates
  ◮ use fishbone to help group and organize possible variates
  ◮ understanding the variates helps define PTarget
◮ write all this information down, keep notes
PPDAC - Problem: Variates: Fishbone diagram
◮ place every variate on the diagram
◮ use branches (e.g. "6 Ms") to organize
◮ sub-branches organize further
  ◮ e.g. measurement: gauge, person, method
◮ might have more than one fishbone diagram
PPDAC - Problem: Population attributes
Population attributes a1(PTarget), a2(PTarget), a3(PTarget), . . .
◮ numerical: counts, locations, scales, correlations, other coefficients, . . .
◮ functions: regressions (parametric and nonparametric), density functions, prediction functions, dependence graphs, . . .
◮ graphical: barplots, scatterplots, density estimates, contour plots, heatmaps, . . .
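As a minimal sketch of what "attribute" means here, each numerical attribute can be written as a function that takes a whole population and returns a summary. The toy population, the variate names x and y, and the function names below are all hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev

# Hypothetical toy population: each unit u carries variates x(u) and y(u).
population = [
    {"u": 1, "x": 1.0, "y": 2.1},
    {"u": 2, "x": 2.0, "y": 3.9},
    {"u": 3, "x": 3.0, "y": 6.2},
    {"u": 4, "x": 4.0, "y": 7.8},
]

def a_count(pop):
    """Numerical attribute: the number of units in the population."""
    return len(pop)

def a_location(pop, variate):
    """Numerical attribute: the population average of one variate."""
    return mean([unit[variate] for unit in pop])

def a_scale(pop, variate):
    """Numerical attribute: the population standard deviation of one variate."""
    return pstdev([unit[variate] for unit in pop])

def a_slope(pop, x="x", y="y"):
    """Coefficient attribute: least-squares slope of y on x over the population."""
    xbar, ybar = a_location(pop, x), a_location(pop, y)
    sxy = sum((unit[x] - xbar) * (unit[y] - ybar) for unit in pop)
    sxx = sum((unit[x] - xbar) ** 2 for unit in pop)
    return sxy / sxx
```

Because every attribute takes the population as its argument, the same functions can later be applied to a study population or a sample.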
PPDAC - Problem: Aspect
Problem aspect
◮ Descriptive:
  ◮ attributes are really population/process summaries
  ◮ interest lies in learning their values
  ◮ often interest lies in relating variate values
◮ Predictive:
  ◮ attributes relate variate values of units
  ◮ interest lies in predicting the values of some variates from those of others
  ◮ e.g. predicting y(u) for some u in the future, perhaps given x(u)
◮ Causative:
  ◮ interest lies in discovering a causal relation
  ◮ interest lies in changes of attributes
PPDAC - Problem: Aspect: Causation
It is useful to have a general working definition of causation. Suppose we have:
◮ an explanatory variate whose value x(u) can be set for all u ∈ PTarget
◮ an attribute of interest a(P)
Of interest:
◮ when x(u) is changed to x⋆(u) for every u, and
◮ no other changes are made to any other variate z(u),
◮ does the attribute of interest a(P) change in response?
If so, then we say that the change in x caused the change in a(P). Note that this defines the causal effect at the level of the population attribute a(P), not at the level of an individual unit u.
PPDAC - Problem: Aspect: Causation
Begin with a population P and set the value of x(u) for all u ∈ P (denoted x(u) ← “value”). Now, if a(P | x ← “red”) ≠ a(P | x ← “blue”), then we say that the change in x caused the change in a(P) and write Δx ⟹ Δa(P).
PPDAC - Problem: Aspect: Causation
In contrast, suppose we only observe x(u) for all u ∈ P (i.e. that x(u) = “value”). Then a(P | x = “red”) and a(P | x = “blue”) are simply different attributes. This says nothing about a causal relation between x and a(P). We simply observe whatever differences exist between the two attributes.
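The distinction between setting x and observing x can be sketched in code. This is an illustration under stated assumptions, not part of the original slides: the toy population, the extra variate z, and the response function are hypothetical, and the response is deliberately built so that colour has no effect on it.

```python
from statistics import mean

# Hypothetical population: each unit has a variate z and an observed colour x.
# In this toy example colour tends to go with z, but the response depends on z only.
population = [
    {"z": 0, "x": "red"}, {"z": 0, "x": "red"}, {"z": 0, "x": "blue"},
    {"z": 1, "x": "blue"}, {"z": 1, "x": "blue"}, {"z": 1, "x": "red"},
]

def response(unit, x):
    """y(u) when the unit's colour is x; by construction x has no effect."""
    return 10 * unit["z"]

def a_set(pop, value):
    """a(P | x <- value): set x to `value` for every unit, then average y."""
    return mean([response(unit, value) for unit in pop])

def a_observed(pop, value):
    """a(P | x = value): average y over only those units observed with x == value."""
    return mean([response(unit, unit["x"]) for unit in pop if unit["x"] == value])

# Setting x changes nothing, so there is no causal effect on a(P) ...
print(a_set(population, "red"), a_set(population, "blue"))
# ... yet the two observed sub-population attributes differ.
print(a_observed(population, "red"), a_observed(population, "blue"))
```

Here the observed attributes differ only because the red and blue units differ in z, which is exactly the situation in which an observed difference says nothing about causation.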
PPDAC - Problem: Aspect: Causation
Note that
◮ if we cannot set the values of x(u) we cannot assert a causal relation,
◮ the changes in x(u) need not all be to the same value as in the above example; more likely are changes
  ◮ x(u) → x(u) + δ, or
  ◮ x(u) → x(u) + δu, or
  ◮ x(u) → (1 + δu) × x(u), and
  ◮ any x(u) could be vector valued, or more complex.
◮ the causal effect is at the population level
◮ this causal definition is an idealization,
◮ but shows the difference between a causal relation and an observational one,
◮ and suggests that establishing causation is likely to be a challenge.
◮ the challenge includes variations on
  ◮ study error
  ◮ sample error
  ◮ measurement error
PPDAC - Plan
◮ Study population/process, Variates, Attributes
  ◮ experimental or observational
◮ Develop sampling protocol
◮ Dealing with variates
  ◮ fishbone diagram
  ◮ selecting response variate
  ◮ controlling explanatory variates
  ◮ experimental variates (causative aspect)
◮ measuring process(es)
◮ data collection protocol(s)
PPDAC - Plan: Study population
Here we determine the study population/process PStudy,
◮ the collection of units that we have access to
◮ should resemble PTarget as much as possible, especially in the population attributes of interest
◮ could also be thought of as a process which produces units over time, but
  ◮ it will have a finite (possibly large) number of units,
  ◮ either because it is a population, or
  ◮ because the study must be done within a finite fixed time period.
◮ again, carefully define what constitutes an individual unit of PStudy
◮ PStudy includes only units which are available and accessible for study, during the time of the study.
◮ write all this information down, keep notes
PPDAC - Plan: Sampling protocol
Here we determine a sampling protocol,
◮ how should units from PStudy be selected to be part of the sample S, for example
  ◮ possibly rule out some samples as possibilities
  ◮ determine how samples are to be selected
  ◮ deterministically or by some probability mechanism?
  ◮ with what probabilities of selection?
◮ there are numerous sampling plans to choose from, depending on the problem
◮ determine how sample selection (including how to randomly select) will be implemented
◮ the sampling protocol might depend on how particular variates are dealt with
◮ write down all procedures and instructions, keep notes
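One of the many possible sampling plans is simple random sampling without replacement. The sketch below shows how such a selection might be implemented reproducibly. The unit labels, the sample size, the seed, and the function names are assumptions made for illustration.

```python
import random

def simple_random_sample(study_units, n, seed):
    """Select n units without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)  # writing down the seed makes the selection repeatable
    return sorted(rng.sample(study_units, n))

def inclusion_probability(population_size, n):
    """Under simple random sampling each unit is selected with probability n / N."""
    return n / population_size

# Hypothetical study population: units labelled 1..100.
study_units = list(range(1, 101))
sample = simple_random_sample(study_units, n=10, seed=2024)
print(sample)
print(inclusion_probability(len(study_units), 10))
```

Recording the seed alongside the procedure is one way to meet the requirement that all procedures and instructions be written down.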
PPDAC - Plan: Variates
Dealing with variates
◮ determine response and explanatory variates using another fishbone diagram,
◮ control explanatory variates, and on the fishbone diagram mark each explanatory variate as
  ◮ (B) for “blocking” if its value is fixed, or otherwise severely constrained, by design,
  ◮ (R) for “randomized” if its value is assigned by deliberate randomization,
  ◮ (E) for “experimental” if the variate will be set purposefully and differently to evaluate its causal effect (if any),
  ◮ (O) for “observed” if the value will simply be measured and observed,
  ◮ values of remaining variates are not to be recorded
◮ write all this information down, keep notes
If there is an experimental variate, i.e. one whose value is deliberately set, then this is an experimental study; otherwise it is an observational study.
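The (B)/(R)/(E)/(O) marking and the resulting classification of the study can be sketched as a small data structure. The variate names in the example plan are hypothetical, and the code is only one possible way to keep these notes.

```python
from enum import Enum

class Role(Enum):
    """How an explanatory variate is dealt with in the Plan."""
    BLOCKING = "B"       # value fixed or severely constrained by design
    RANDOMIZED = "R"     # value assigned by deliberate randomization
    EXPERIMENTAL = "E"   # value set purposefully to evaluate its causal effect
    OBSERVED = "O"       # value simply measured and recorded

def study_type(variate_roles):
    """A study is experimental if at least one variate is marked (E)."""
    if any(role is Role.EXPERIMENTAL for role in variate_roles.values()):
        return "experimental"
    return "observational"

# Hypothetical plan for illustration only.
plan = {
    "temperature": Role.EXPERIMENTAL,
    "operator": Role.BLOCKING,
    "run_order": Role.RANDOMIZED,
    "humidity": Role.OBSERVED,
}
print(study_type(plan))
```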
PPDAC - Plan: Population attributes
Population attributes a1(PStudy), a2(PStudy), a3(PStudy), . . .
◮ numerical: counts, locations, scales, correlations, other coefficients, . . .
◮ functions: regressions (parametric and nonparametric), density functions, prediction functions, dependence graphs, . . .
◮ graphical: barplots, scatterplots, density estimates, contour plots, heatmaps, . . .
These should, as much as possible, match those for the target population.
PPDAC - Plan: Measuring process(es)
For every variate whose value is to be determined in the study, there will be an associated measuring system, or process, used to determine that value. For each variate, identify
◮ the gauge(s) or instrument(s) to be used to determine the value
◮ the person, or persons, involved in determining the value
◮ the method to be followed by the person, or persons, using the gauge(s)
Some effort will be required to ensure that sources of measuring variability and bias are identified and made as small as practically possible and scientifically necessary. This could involve a separate study (and PPDAC) to make the measuring systems sufficiently reliable (e.g. a gauge repeatability and reproducibility study).
PPDAC - Plan: Data collection protocols
Every variate value will need to be recorded and stored for subsequent analysis. The data collection protocol should specify
◮ that each record include the value and units of measurement (e.g. metres, count, Newtons, . . . ), the variate itself, and the sample unit on which the variate was determined,
◮ what range of values might be expected for each variate,
◮ how and where the results are to be recorded,
◮ what edits and other quality checks might be conducted as the data are being collected,
◮ how to deal with exceptions.
The larger and more involved the study, and the more complex the data, the more detailed these (and possibly other) instructions will need to be to minimize opportunities for error.
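A record structure and a simple range check along these lines might look as follows. This is a minimal sketch: the field names, the variate "length", and its expected range are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    unit_id: str   # the sample unit whose variate was determined
    variate: str   # which variate was measured
    value: float   # the recorded value
    units: str     # units of measurement, e.g. "metres"

# Hypothetical expected units and ranges, as written down in the protocol.
EXPECTED = {"length": ("metres", 0.0, 2.5)}

def quality_check(record):
    """Return a list of problems; an empty list means the record passes."""
    if record.variate not in EXPECTED:
        return ["unknown variate"]
    units, low, high = EXPECTED[record.variate]
    problems = []
    if record.units != units:
        problems.append("unexpected units")
    if not low <= record.value <= high:
        problems.append("value outside expected range")
    return problems

print(quality_check(Record("u1", "length", 1.2, "metres")))
print(quality_check(Record("u2", "length", 12.0, "metres")))
```

Returning a list of problems rather than raising an error lets the protocol decide separately how each kind of exception is to be dealt with.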
PPDAC - The inductive path
Throughout the problem solving process, but especially at the Plan stage, serious consideration of, and reflection upon, the inductive path should be given. Think about sources of Study, Sample, and Measurement errors. Use sampling and measuring protocols having low (absolute) bias and low variability.
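The bias and variability of a sampling protocol can be explored by simulation before any data are collected. The sketch below assumes that sample error means a(S) − a(PStudy) with the average as the attribute, which the slides do not spell out. The study population values, sample size, number of draws, and seed are all hypothetical.

```python
import random
from statistics import mean, pstdev

def sample_errors(study_values, n, draws, seed):
    """Repeat the sampling protocol and record a(S) - a(PStudy) for each draw."""
    rng = random.Random(seed)
    a_study = mean(study_values)
    return [mean(rng.sample(study_values, n)) - a_study for _ in range(draws)]

# Hypothetical study population of 100 variate values.
study_values = [float(v) for v in range(1, 101)]

errors = sample_errors(study_values, n=10, draws=2000, seed=1)
sampling_bias = mean(errors)            # average sample error over all draws
sampling_variability = pstdev(errors)   # spread of the sample errors

print(sampling_bias, sampling_variability)
```

For simple random sampling the average error should be close to zero, while its spread shrinks as the sample size n grows. The same loop could be rerun with a different protocol to compare the two.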
PPDAC - Data
The “do” stage
◮ Execute the plan
  ◮ select study population, sample
  ◮ all plan protocols
  ◮ record departures
◮ Data monitoring
  ◮ as collected
  ◮ data validity
  ◮ data cleaning
◮ Data examination
  ◮ internal consistency
  ◮ explore patterns
◮ Data storage
  ◮ media
  ◮ data structures
  ◮ meta-data
PPDAC - Analysis
◮ Data summary
  ◮ numerical and graphical
◮ Model construction
  ◮ build, fit, criticize cycle
◮ Formal analysis
  ◮ inference
PPDAC - Conclusions
◮ synthesis
  ◮ plain language, effective presentation graphics
◮ limitations
  ◮ discussion of potential errors
PPDAC - Our focus has been
◮ Execute the plan
  ◮ select study population, sample
  ◮ all plan protocols
  ◮ record departures
◮ Data monitoring
  ◮ as collected
  ◮ data validity
  ◮ data cleaning
◮ Data examination
  ◮ internal consistency
  ◮ explore patterns
◮ Data storage
  ◮ media
  ◮ data structures
  ◮ meta-data
◮ Data summary
  ◮ numerical and graphical
◮ Model construction (some)
  ◮ build, fit, criticize cycle