

SLIDE 1

The Automatic Statistician

an AI for Data Science

Zoubin Ghahramani Department of Engineering University of Cambridge

zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Intelligent Machines, Nijmegen, 2015

James Robert Lloyd (Cambridge), David Duvenaud (Cambridge → Harvard), Roger Grosse (MIT → Toronto), Josh Tenenbaum (MIT)

SLIDE 2

THERE IS A GROWING NEED FOR DATA ANALYSIS

◮ We live in an era of abundant data
◮ The McKinsey Global Institute claims: “The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.”
◮ Diverse fields increasingly rely on expert statisticians, machine learning researchers and data scientists, e.g.
  ◮ Computational sciences (e.g. biology, astronomy, . . . )
  ◮ Online advertising
  ◮ Quantitative finance
  ◮ . . .

James Robert Lloyd and Zoubin Ghahramani 2 / 43

SLIDE 3

WHAT WOULD AN AUTOMATIC STATISTICIAN DO?

[Pipeline: Data → Search (through the language of models) → Evaluation → Model → Prediction / Translation / Checking → Report]

SLIDE 4

GOALS OF THE AUTOMATIC STATISTICIAN PROJECT

◮ Provide a set of tools for understanding data that require minimal expert input
◮ Uncover challenging research problems in e.g.
  ◮ Automated inference
  ◮ Model construction and comparison
  ◮ Data visualisation and interpretation
◮ Advance the field of machine learning in general

SLIDE 5

INGREDIENTS OF AN AUTOMATIC STATISTICIAN


◮ An open-ended language of models
  ◮ Expressive enough to capture real-world phenomena. . .
  ◮ . . . and the techniques used by human statisticians
◮ A search procedure
  ◮ To efficiently explore the language of models
◮ A principled method of evaluating models
  ◮ Trading off complexity and fit to data
◮ A procedure to automatically explain the models
  ◮ Making the assumptions of the models explicit. . .
  ◮ . . . in a way that is intelligible to non-experts

SLIDE 6

PREVIEW: AN ENTIRELY AUTOMATIC ANALYSIS

[Figure: raw data and full model posterior with extrapolations, 1950–1962]

Four additive components have been identified in the data

◮ A linearly increasing function.
◮ An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude.
◮ A smooth function.
◮ Uncorrelated noise with linearly increasing standard deviation.

SLIDE 7

DEFINING A LANGUAGE OF MODELS


SLIDE 8

DEFINING A LANGUAGE OF REGRESSION MODELS

◮ Regression consists of learning a function f : X → Y from inputs to outputs, from example input / output pairs
◮ The language should include simple parametric forms. . .
  ◮ e.g. linear functions, polynomials, exponential functions
◮ . . . as well as functions specified by high-level properties
  ◮ e.g. smoothness, periodicity
◮ Inference should be tractable for all models in the language

SLIDE 9

WE CAN BUILD REGRESSION MODELS WITH GAUSSIAN PROCESSES

◮ GPs are distributions over functions such that any finite subset of function evaluations, (f(x1), f(x2), . . . , f(xN)), has a joint Gaussian distribution
◮ A GP is completely specified by
  ◮ a mean function, µ(x) = E(f(x))
  ◮ a covariance / kernel function, k(x, x′) = Cov(f(x), f(x′))
◮ Denoted f ∼ GP(µ, k)

[Figure: four panels showing a GP posterior mean and uncertainty band being updated as successive data points are observed]
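The posterior shown in the figure can be written down directly. Below is a minimal numpy sketch (not the project's code) of GP posterior inference with an SE kernel; the data points and hyperparameter values are illustrative assumptions:

```python
import numpy as np

def se_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel: k(x, x') = s^2 exp(-(x - x')^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean and pointwise variance of a zero-mean GP with an SE kernel."""
    K = se_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = se_kernel(x_train, x_test)
    K_ss = se_kernel(x_test, x_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)

# Toy data: three noisy observations of sin(x).
x = np.array([-1.0, 0.0, 1.0])
y = np.sin(x)
# Query at an observed location (0.0) and an unobserved one (2.0):
mean, var = gp_posterior(x, y, np.array([0.0, 2.0]))
```

As in the figure, the posterior uncertainty is small near the observed points and grows away from them.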

SLIDE 10

THE ATOMS OF OUR LANGUAGE

Five base kernels, and the types of functions they encode:

  Squared exp. (SE)  →  smooth functions
  Periodic (PER)     →  periodic functions
  Linear (LIN)       →  linear functions
  Constant (C)       →  constant functions
  White noise (WN)   →  Gaussian noise
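As a hypothetical illustration (standard textbook kernel forms, with all hyperparameters fixed to 1; not the system's parameterisation), the five base kernels can be written as plain functions of scalar inputs:

```python
import numpy as np

def SE(x, y):                     # smooth functions
    return np.exp(-0.5 * (x - y) ** 2)

def PER(x, y):                    # periodic functions (period 1)
    return np.exp(-2 * np.sin(np.pi * (x - y)) ** 2)

def LIN(x, y):                    # linear functions
    return x * y

def C(x, y):                      # constant functions
    return 1.0

def WN(x, y):                     # Gaussian (white) noise
    return 1.0 if x == y else 0.0
```

For example, PER(0.0, 1.0) is (numerically) 1: inputs one period apart are perfectly correlated.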

SLIDE 11

THE COMPOSITION RULES OF OUR LANGUAGE

◮ Two main operations: addition, multiplication

  LIN × LIN  →  quadratic functions
  SE × PER   →  locally periodic
  LIN + PER  →  periodic plus linear trend
  SE + PER   →  periodic plus smooth trend
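Because sums and products of kernels are again kernels, the two composition rules can be sketched as higher-order functions; the kernel forms below are simplified illustrations with hyperparameters fixed to 1:

```python
import numpy as np

def SE(x, y):  return np.exp(-0.5 * (x - y) ** 2)
def PER(x, y): return np.exp(-2 * np.sin(np.pi * (x - y)) ** 2)
def LIN(x, y): return x * y

def add(k1, k2):
    # A sum of kernels is a kernel.
    return lambda x, y: k1(x, y) + k2(x, y)

def mul(k1, k2):
    # A product of kernels is a kernel.
    return lambda x, y: k1(x, y) * k2(x, y)

quadratic = mul(LIN, LIN)            # LIN × LIN: quadratic functions
locally_periodic = mul(SE, PER)      # SE × PER: locally periodic
periodic_plus_trend = add(LIN, PER)  # LIN + PER: periodic plus linear trend
```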

SLIDE 12

AN EXPRESSIVE LANGUAGE OF MODELS

  Regression model            Kernel
  GP smoothing                SE + WN
  Linear regression           C + LIN + WN
  Multiple kernel learning    ∑ SE + WN
  Trend, cyclical, irregular  ∑ SE + ∑ PER + WN
  Fourier decomposition       C + ∑ cos + WN
  Sparse spectrum GPs         ∑ cos + WN
  Spectral mixture            ∑ SE × cos + WN
  Changepoints                e.g. CP(SE, SE) + WN
  Heteroscedasticity          e.g. SE + LIN × WN

Note: cos is a special case of our version of PER

SLIDE 13

DISCOVERING A GOOD MODEL VIA SEARCH


SLIDE 14

DISCOVERING A GOOD MODEL VIA SEARCH

◮ The language is defined as the arbitrary composition of five base kernels (WN, C, LIN, SE, PER) via three operators (+, ×, CP)
◮ The space spanned by this language is open-ended and can have a high branching factor, requiring a judicious search
◮ We propose a greedy search for its simplicity and similarity to human model-building
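The greedy search can be sketched as follows. The string-based expression syntax and the toy scoring function are illustrative stand-ins; the real system scores candidates with the BIC-penalised marginal likelihood covered under model evaluation:

```python
BASE = ["WN", "C", "LIN", "SE", "PER"]

def expand(expr):
    """One-step expansions: combine expr with each base kernel via +, *, CP."""
    out = []
    for b in BASE:
        out.append(f"({expr} + {b})")
        out.append(f"({expr} * {b})")
        out.append(f"CP({expr}, {b})")
    return out

def greedy_search(score, start="WN", depth=3):
    """Keep only the single best-scoring expansion at each level."""
    best = start
    for _ in range(depth):
        challenger = max(expand(best), key=score)
        if score(challenger) <= score(best):
            break  # no improvement: stop early
        best = challenger
    return best

# A toy score standing in for the penalised marginal likelihood:
# reward SE and PER components, penalise expression length.
toy_score = lambda e: 5 * e.count("SE") + 5 * e.count("PER") - 0.1 * len(e)
result = greedy_search(toy_score, start="LIN", depth=1)
# → "(LIN + SE)"
```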

SLIDE 15

EXAMPLE: MAUNA LOA KEELING CURVE

[Figure: RQ kernel fit to the data, 2000–2010, with the search tree: Start → {SE, RQ, LIN, PER} → {SE + RQ, PER + RQ, PER × RQ, . . . } → {SE + PER + RQ, SE × (PER + RQ), . . . }]

SLIDE 16

EXAMPLE: MAUNA LOA KEELING CURVE

[Figure: (PER + RQ) fit to the data, 2000–2010, with the same search tree]

SLIDE 17

EXAMPLE: MAUNA LOA KEELING CURVE

[Figure: SE × (PER + RQ) fit to the data, 2000–2010, with the same search tree]

SLIDE 18

EXAMPLE: MAUNA LOA KEELING CURVE

[Figure: (SE + SE × (PER + RQ)) fit to the data, 2000–2010, with the same search tree]

SLIDE 19

MODEL EVALUATION


SLIDE 20

MODEL EVALUATION

◮ After proposing a new model, its kernel parameters are optimised by conjugate gradients
◮ We evaluate each optimised model, M, using the model evidence (marginal likelihood), which can be computed analytically for GPs
◮ We penalise the marginal likelihood for the optimised kernel parameters using the Bayesian Information Criterion (BIC):

  −0.5 × BIC(M) = log p(D | M) − (p/2) log n

  where p is the number of kernel parameters, D represents the data, and n is the number of data points.
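The evaluation criterion is easy to state in code. This small helper is an illustration, not the system's implementation, and the evidence values below are made up:

```python
import math

def neg_half_bic(log_evidence, num_params, num_data):
    """-0.5 * BIC(M) = log p(D | M) - (p / 2) * log n, as on the slide."""
    return log_evidence - 0.5 * num_params * math.log(num_data)

# A model with slightly higher evidence but many more parameters can
# still lose to a simpler one under this criterion:
score_simple = neg_half_bic(-100.0, num_params=2, num_data=1000)
score_complex = neg_half_bic(-98.0, num_params=10, num_data=1000)
```

Here the complex model's 8 extra parameters cost more than its 2 nats of extra evidence, so the simple model wins.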

SLIDE 21

AUTOMATIC TRANSLATION OF MODELS


SLIDE 22

AUTOMATIC TRANSLATION OF MODELS

◮ Search can produce arbitrarily complicated models from an open-ended language, but two main properties allow description to be automated
◮ Kernels can be decomposed into a sum of products
  ◮ A sum of kernels corresponds to a sum of functions
  ◮ Therefore, we can describe each product of kernels separately
◮ Each kernel in a product modifies a model in a consistent way
  ◮ Each kernel roughly corresponds to an adjective
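The kernel-to-adjective idea can be sketched with lookup tables. The phrase tables and the heuristic for choosing the head noun below are hypothetical, simplified from the behaviour shown on the worked slides:

```python
# Hypothetical phrase tables: one kernel in a product supplies the noun
# phrase, the others act as modifiers (adjectives / qualifier phrases).
NOUN = {
    "PER": "periodic function",
    "LIN": "linear function",
    "SE": "smooth function",
    "WN": "uncorrelated noise",
    "C": "constant function",
}
PRE = {"SE": "approximately"}                      # modifiers placed before
POST = {"LIN": "with linearly growing amplitude"}  # modifiers placed after

def describe_product(kernels):
    """Describe a product of base kernels as an English noun phrase."""
    kernels = list(kernels)
    noun = "PER" if "PER" in kernels else kernels[0]  # heuristic head choice
    kernels.remove(noun)
    words = [PRE[k] for k in kernels if k in PRE]
    words.append(NOUN[noun])
    words += [POST[k] for k in kernels if k in POST]
    return " ".join(words)
```

With these tables, describe_product(["SE", "PER", "LIN"]) yields "approximately periodic function with linearly growing amplitude", mirroring the build-up on the following slides.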

SLIDE 23

SUM OF PRODUCTS NORMAL FORM

Suppose the search finds the following kernel

  SE × (WN × LIN + CP(C, PER))

SLIDE 24

SUM OF PRODUCTS NORMAL FORM

Suppose the search finds the following kernel

  SE × (WN × LIN + CP(C, PER))

The changepoint can be converted into a sum of products

  SE × (WN × LIN + C × σ + PER × σ̄)

SLIDE 25

SUM OF PRODUCTS NORMAL FORM

Suppose the search finds the following kernel

  SE × (WN × LIN + CP(C, PER))

The changepoint can be converted into a sum of products

  SE × (WN × LIN + C × σ + PER × σ̄)

Multiplication can be distributed over addition

  SE × WN × LIN + SE × C × σ + SE × PER × σ̄

SLIDE 26

SUM OF PRODUCTS NORMAL FORM

Suppose the search finds the following kernel

  SE × (WN × LIN + CP(C, PER))

The changepoint can be converted into a sum of products

  SE × (WN × LIN + C × σ + PER × σ̄)

Multiplication can be distributed over addition

  SE × WN × LIN + SE × C × σ + SE × PER × σ̄

Simplification rules are applied

  WN × LIN + SE × σ + SE × PER × σ̄
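The distribution step can be automated symbolically. In this sketch, kernel expressions are nested tuples, and changepoints are assumed to have already been rewritten into sigmoids (σ written as "sigma", σ̄ as "sigma_bar"); the representation is an illustration, not the system's:

```python
from itertools import product

def to_sum_of_products(expr):
    """Flatten a kernel expression into a list of products of base kernels.

    Expressions are either a base-kernel name (a string) or a tuple
    ("+", [args]) / ("*", [args])."""
    if isinstance(expr, str):
        return [[expr]]
    op, args = expr
    parts = [to_sum_of_products(a) for a in args]
    if op == "+":
        return [term for part in parts for term in part]
    if op == "*":
        # Distribute: (a1 + a2) * (b1 + b2) = a1*b1 + a1*b2 + a2*b1 + a2*b2
        return [sum(combo, []) for combo in product(*parts)]
    raise ValueError(f"unknown operator {op!r}")

# SE × (WN × LIN + C × σ + PER × σ̄):
expr = ("*", ["SE", ("+", [("*", ["WN", "LIN"]),
                           ("*", ["C", "sigma"]),
                           ("*", ["PER", "sigma_bar"])])])
terms = to_sum_of_products(expr)
# → [['SE', 'WN', 'LIN'], ['SE', 'C', 'sigma'], ['SE', 'PER', 'sigma_bar']]
```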

SLIDE 27

SUMS OF KERNELS ARE SUMS OF FUNCTIONS

If f1 ∼ GP(0, k1) and independently f2 ∼ GP(0, k2), then

  f1 + f2 ∼ GP(0, k1 + k2)

e.g.

[Figure: two examples, each showing a posterior decomposed into the sum of three additive component functions]

We can therefore describe each component separately
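The additivity property is easy to exercise numerically. A small sketch with two SE components whose lengthscales are chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 5)

def se(x, ell):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

K1 = se(x, ell=0.1)   # fast-varying component
K2 = se(x, ell=2.0)   # slow trend component
jitter = 1e-9 * np.eye(len(x))  # for numerical stability

# Independent draws f1 ~ GP(0, k1) and f2 ~ GP(0, k2); because they are
# independent, Cov(f1 + f2) = K1 + K2, i.e. f1 + f2 ~ GP(0, k1 + k2).
f1 = rng.multivariate_normal(np.zeros(len(x)), K1 + jitter)
f2 = rng.multivariate_normal(np.zeros(len(x)), K2 + jitter)
f = f1 + f2
K_sum = K1 + K2
```

Each SE kernel has unit variance on the diagonal, so the summed process has variance 2 at every input.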

SLIDE 28

PRODUCTS OF KERNELS

PER: a periodic function

On its own, each kernel is described by a standard noun phrase

SLIDE 29

PRODUCTS OF KERNELS - SE

SE × PER: an approximately periodic function

Multiplication by SE removes long-range correlations from a model, since SE(x, x′) decreases monotonically to 0 as |x − x′| increases.

SLIDE 30

PRODUCTS OF KERNELS - LIN

SE × PER × LIN: an approximately periodic function with linearly growing amplitude

Multiplication by LIN is equivalent to multiplying the function being modeled by a linear function. If f(x) ∼ GP(0, k), then x f(x) ∼ GP(0, k × LIN). This causes the standard deviation of the model to vary linearly without affecting the correlation.
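The effect of multiplying by LIN can be checked directly on kernel matrices; a sketch with illustrative positive inputs (so that correlations are unchanged):

```python
import numpy as np

xs = np.array([0.5, 1.0, 2.0])
d = xs[:, None] - xs[None, :]
K = np.exp(-0.5 * d ** 2)            # SE kernel matrix, k(x, x')

# If f ~ GP(0, k), then g(x) = x * f(x) has covariance
# Cov(g(x), g(x')) = x * x' * k(x, x'), i.e. the kernel k × LIN
# with LIN(x, x') = x * x'.
K_g = np.outer(xs, xs) * K

# The marginal standard deviation grows linearly with |x| ...
std_g = np.sqrt(np.diag(K_g))        # equals |x| here since k(x, x) = 1
# ... while the correlation structure is unchanged (for positive x):
corr_g = K_g / np.outer(std_g, std_g)
```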

SLIDE 31

PRODUCTS OF KERNELS - CHANGEPOINTS

SE × PER × LIN × σ: an approximately periodic function with linearly growing amplitude, until 1700

Multiplication by σ is equivalent to multiplying the function being modeled by a sigmoid.

SLIDE 32

AUTOMATICALLY GENERATED REPORTS


SLIDE 33

EXAMPLE: AIRLINE PASSENGER VOLUME

[Figure: raw data and full model posterior with extrapolations, 1950–1962]

Four additive components have been identified in the data

◮ A linearly increasing function.
◮ An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude.
◮ A smooth function.
◮ Uncorrelated noise with linearly increasing standard deviation.

SLIDE 34

EXAMPLE: AIRLINE PASSENGER VOLUME

This component is linearly increasing.

[Figure: posterior of component 1 and sum of components up to component 1, 1950–1960]

SLIDE 35

EXAMPLE: AIRLINE PASSENGER VOLUME

This component is approximately periodic with a period of 1.0 years and varying amplitude. Across periods the shape of this function varies very smoothly. The amplitude of the function increases linearly. The shape of this function within each period has a typical lengthscale of 6.0 weeks.

[Figure: posterior of component 2 and sum of components up to component 2, 1950–1960]

SLIDE 36

EXAMPLE: AIRLINE PASSENGER VOLUME

This component is a smooth function with a typical lengthscale of 8.1 months.

[Figure: posterior of component 3 and sum of components up to component 3, 1950–1960]

SLIDE 37

EXAMPLE: AIRLINE PASSENGER VOLUME

This component models uncorrelated noise. The standard deviation of the noise increases linearly.

[Figure: posterior of component 4 and sum of components up to component 4, 1950–1960]

SLIDE 38

OTHER EXAMPLES

See http://www.automaticstatistician.com


SLIDE 39

GOOD PREDICTIVE PERFORMANCE AS WELL

Standardised RMSE over 13 data sets

[Bar chart: standardised RMSE for ABCD accuracy, ABCD interpretability, spectral kernels, trend/cyclical/irregular, Bayesian MKL, Eureqa, changepoints, squared exponential and linear regression]

◮ Tweaks can be made to the algorithm to improve accuracy or interpretability of the models produced. . .
◮ . . . but both methods are highly competitive at extrapolation (shown above) and interpolation

SLIDE 40

MODEL CHECKING AND CRITICISM

◮ Good statistical modelling should include model criticism:
  ◮ Does the data match the assumptions of the model?
  ◮ For example, if the model assumed Gaussian noise, does a Q-Q plot reveal non-Gaussian residuals?
◮ Our automatic statistician does posterior predictive checks, dependence tests and residual tests
◮ We have also been developing more systematic nonparametric approaches to model criticism using kernel two-sample testing with MMD

Lloyd, J. R., and Ghahramani, Z. (2014) Statistical Model Criticism using Kernel Two Sample Tests. http://mlg.eng.cam.ac.uk/Lloyd/papers/kernel-model-checking.pdf

SLIDE 41

CHALLENGES

◮ Interpretability / accuracy
◮ Increasing the expressivity of the language
  ◮ e.g. monotonicity, positive functions, symmetries
◮ Computational complexity of searching through a huge space of models
◮ Extending the automatic reports to multidimensional datasets
  ◮ Search and descriptions naturally extend to multiple dimensions, but automatically generating relevant visual summaries is harder

SLIDE 42

CURRENT AND FUTURE DIRECTIONS

◮ Automatic statistician for:
  * One-dimensional time series
  * Linear regression (classical)
◮ Multivariate nonlinear regression (c.f. Duvenaud, Lloyd et al., ICML 2013)
◮ Multivariate classification (w/ Mrksic), clustering and learning graphical models (w/ Smith, Lloyd)
◮ Single-cell transcriptomics (gene expression) data?? (w/ Lloyd, Stegle, Buettner)
◮ Probabilistic programming
◮ Bayesian optimisation, and the rational allocation of computational resources

SLIDE 43

PROBABILISTIC PROGRAMMING

Problem: Probabilistic model development and the derivation of inference algorithms is time-consuming and error-prone.

Solution:
◮ Develop Turing-complete Probabilistic Programming Languages for expressing probabilistic models as computer programs that generate data (i.e. simulators)
◮ Derive Universal Inference Engines for these languages that sample over program traces given observed data
◮ This can be used to implement search over the model space in an automatic statistician

Example languages: Church, Venture, Anglican, Stochastic Python*, ones based on Haskell*, Julia*

Example inference algorithms: Metropolis-Hastings MCMC, variational inference, particle filtering, slice sampling*, particle MCMC, nested particle inference*, austerity MCMC*

SLIDE 44

PROBABILISTIC PROGRAMMING

    sProbs = (1.0/3, 1.0/3, 1.0/3)
    tProbs = {0: (0.1, 0.5, 0.4),
              1: (0.2, 0.2, 0.6),
              2: (0.15, 0.15, 0.7)}
    eMeans = (-1, 1, 0)

    def hmm():
        states = []
        states.append(stocPy.categorical(sProbs, obs=True))
        for ind, ob in stocPy.readCSV("hmm-data.csv"):
            states.append(stocPy.categorical(tProbs[states[ind]], obs=True))
            stocPy.normal(eMeans[states[ind]], 1, cond=ob)

An example probabilistic program in StocPy implementing a 3-state hidden Markov model (HMM).

SLIDE 45

BAYESIAN OPTIMISATION

[Figure: GP posterior and acquisition function at t = 3; the maximum of the acquisition function selects the next point, and the posterior is updated with the new observation at t = 4]

Problem: Global optimisation of black-box functions that are expensive to evaluate.

Solution: Treat this as a problem of sequential decision-making, and model uncertainty in the function. This can speed up model search in the automatic statistician.
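A toy sketch of the loop in the figure: a GP surrogate with an SE kernel and an upper-confidence-bound acquisition. The objective, acquisition rule and all constants are illustrative assumptions, not the talk's method:

```python
import numpy as np

def f(x):                                  # toy expensive black-box function
    return -(x - 0.3) ** 2

def se(a, b, ell=0.1):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_post(X, y, grid, noise=1e-6):
    """GP posterior mean and variance on a grid (zero mean, SE kernel)."""
    K = se(X, X) + noise * np.eye(len(X))
    Ks = se(X, grid)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.maximum(var, 0.0)

grid = np.linspace(0.0, 1.0, 101)
X = np.array([0.0, 1.0])                   # initial observations
y = f(X)
for _ in range(5):
    mu, var = gp_post(X, y, grid)
    acq = mu + 2.0 * np.sqrt(var)          # upper confidence bound
    x_next = grid[np.argmax(acq)]          # next point: acquisition maximum
    X = np.append(X, x_next)
    y = np.append(y, f(x_next))

best = X[np.argmax(y)]                     # best input found so far
```

Each round the acquisition trades off exploitation (high posterior mean) against exploration (high posterior uncertainty), exactly the behaviour the two panels of the figure depict.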

SLIDE 46

SUMMARY

◮ We have presented the beginnings of an automatic statistician
◮ Our system
  ◮ Defines an open-ended language of models
  ◮ Searches greedily through this space
  ◮ Produces detailed reports describing patterns in data
  ◮ Performs automatic model criticism
◮ Extrapolation and interpolation performance is highly competitive
◮ We believe this line of research has the potential to make powerful statistical model-building techniques accessible to non-experts

SLIDE 47

REFERENCES

Website: http://www.automaticstatistician.com

Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2013) Structure Discovery in Nonparametric Regression through Compositional Kernel Search. ICML 2013.

Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014) Predictive entropy search for efficient global optimization of black-box functions. NIPS 2014.

Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2014) Automatic Construction and Natural-language Description of Nonparametric Regression Models. AAAI 2014. http://arxiv.org/pdf/1402.4304v2.pdf

Lloyd, J. R., and Ghahramani, Z. (2014) Statistical Model Criticism using Kernel Two Sample Tests. http://mlg.eng.cam.ac.uk/Lloyd/papers/kernel-model-checking.pdf

Ghahramani, Z. (2013) Bayesian nonparametrics and the probabilistic approach to modelling. Philosophical Trans. Royal Society A 371: 20110553.

Ranca, R. (2015) StocPy. https://github.com/RazvanRanca/StocPy

Ranca, R. and Ghahramani, Z. (2015) Slice sampling for probabilistic programming. http://arxiv.org/abs/1501.04684

Valera, I. and Ghahramani, Z. (2014) General Table Completion using a Bayesian Nonparametric Model. NIPS 2014.