
Fisher Memorial Lecture 27 Oct 2016

Statistical Science and Data Science

Nancy Reid 27 October 2016


Fisher Number

Selected Correspondence of R. A. Fisher, edited by J.H. Bennett

“Do not forget to look up Walter Bodmer, who also has some experience being ‘bawled down’ by the Neymanians” 11 Jan 1962


“Some aspect of big data”

= Big Machines = Lots of Computing = Complex Architectures = Computer Science


Small data

= equations and formulas = mathematical modelling = a little computing = Statistical Science


$p(v, h; \eta) \propto \frac{1}{Z(\eta)} \exp\{a^{T} v + b^{T} h + v^{T} W h\}, \quad \eta = (a, b, W)$

Big Data

  • Interesting
  • Detailed
  • Informative
  • Fun


Small Data


So yesterday


Small Data


Big Data 2013


Gartner Hype Cycle


Big Data 2014


2015

Machine Learning

The push back


The push back


“Big data” has arrived, but big insights have not

The push back


How big data threatens democracy and increases inequality

“if the assessment never asks about race, how could the algorithm throw up racially biased results?”

“Credit scores are used by nearly half of American employers to screen potential employees”


Canadian Statistical Sciences Institute

Pacific Institute for the Mathematical Sciences
Centre de Recherches Mathématiques
Fields Institute for Research in the Mathematical Sciences

Workshops

  • Opening Conference and Bootcamp
  • Statistical Machine Learning
  • Optimization and Matrix Methods
  • Visualization: Strategies and Principles
  • Big Data in Health Policy
  • Big Data for Social Policy
  • Networks, Web mining, and Cyber-security
  • Statistical Theory for Large-scale Data
  • Challenges in Environmental Science
  • Complex Spatio-temporal Data
  • Commercial and Retail Banking


Opening Conference and Bootcamp

  • Introduction to topics at following workshops
  • One day on each topic
  • Many speakers started by trying to define big data

“I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description, and perhaps I could never succeed in intelligibly doing so. But I know it when I see it … ”

Justice Potter Stewart; Jacobellis v. Ohio, 22 June 1964

Robert Bell, Google, Plenary Opening Lecture


Some highlights

  • Statistical Machine Learning
  • Optimization
  • Visualization
  • Health Policy
  • Social Policy


Some highlights

  • Statistical Machine Learning


Statistical Machine Learning


Restricted Boltzmann machine

$f(v, h; \eta) \propto \frac{1}{Z(\eta)} \exp\{a^{T} v + b^{T} h + v^{T} W h\}, \quad \eta = (a, b, W)$


  • natural gradient ascent
  • uses Fisher information as metric tensor

Girolami and Calderhead (2011); Amari (1987); Rao (1945)

  • Gaussian graphical model approximation to force sparse inverse

Grosse and Salakhutdinov (2016) 32nd Internat. Conf. on Machine Learning

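$\eta \leftarrow \eta + \epsilon\, i(\eta)^{-1} \nabla_{\eta} \ell(\eta; v, h), \quad i(\eta) = E(-\ell'')$

$\ell = \log f, \quad f(v, h; \eta) \propto \frac{1}{Z(\eta)} \exp\{a^{T} v + b^{T} h + v^{T} W h\}$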
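The natural gradient update on this slide can be tried on a much smaller model than a restricted Boltzmann machine. The sketch below is only an illustration under stated assumptions: it uses a two-parameter logistic regression (not the RBM), toy data invented for the example, and hypothetical function names. The step moves along the inverse Fisher information times the score, as in the update rule; with step size 1 this is Fisher scoring.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_and_fisher(eta, xs, ys):
    """Score (gradient of the log-likelihood) and Fisher information for
    the toy model P(y = 1 | x) = sigmoid(eta[0] + eta[1] * x)."""
    g0 = g1 = 0.0
    i00 = i01 = i11 = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(eta[0] + eta[1] * x)
        g0 += y - p
        g1 += (y - p) * x
        w = p * (1.0 - p)          # variance of y given x
        i00 += w
        i01 += w * x
        i11 += w * x * x
    return (g0, g1), (i00, i01, i11)

def natural_gradient_step(eta, xs, ys, eps):
    """eta <- eta + eps * i(eta)^{-1} * score, with a closed-form 2x2 inverse."""
    (g0, g1), (i00, i01, i11) = score_and_fisher(eta, xs, ys)
    det = i00 * i11 - i01 * i01
    d0 = (i11 * g0 - i01 * g1) / det
    d1 = (-i01 * g0 + i00 * g1) / det
    return (eta[0] + eps * d0, eta[1] + eps * d1)

# Toy data (an assumption for illustration; the classes overlap, so the MLE is finite).
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

eta = (0.0, 0.0)
for _ in range(100):
    eta = natural_gradient_step(eta, xs, ys, eps=0.5)

final_score, _ = score_and_fisher(eta, xs, ys)
print("eta:", eta, "score at eta:", final_score)
```

In a model this small the Fisher information can be inverted exactly; the sparse-inverse approximation mentioned on the slide matters when the parameter has far too many components for that.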


Restricted Boltzmann machine


  • if just one binary top node, model for $h \mid v$ is a logistic regression
  • with several binary top nodes, model for $h_t \mid v, h_{-t}$ is also a logistic regression, with odds ratio depending only on $v$
  • deep learning has ~10 layers, with millions of units in each layer
  • estimating parameters is an optimization problem

$f(v, h; \eta) \propto \frac{1}{Z(\eta)} \exp\{a^{T} v + b^{T} h + v^{T} W h\}$
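The logistic-regression claim can be checked numerically on a tiny model. The sketch below assumes a hypothetical RBM with three binary visible units and two binary hidden units; the parameter values are made up for illustration. It compares the conditional probability of one hidden unit, computed directly from the joint density, with the logistic form $\mathrm{sigmoid}(b_t + \sum_i v_i W_{it})$.

```python
import itertools
import math

# Hypothetical small RBM: 3 visible units v, 2 hidden units h, all binary.
a = [0.2, -0.4, 0.1]            # visible biases
b = [0.3, -0.7]                 # hidden biases
W = [[0.5, -1.0],               # W[i][j] couples v_i and h_j
     [1.5, 0.25],
     [-0.75, 0.8]]

def unnormalized(v, h):
    """exp{a'v + b'h + v'Wh}; the constant Z(eta) is omitted."""
    exponent = sum(ai * vi for ai, vi in zip(a, v))
    exponent += sum(bj * hj for bj, hj in zip(b, h))
    exponent += sum(v[i] * W[i][j] * h[j]
                    for i in range(len(v)) for j in range(len(h)))
    return math.exp(exponent)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_form(t, v):
    """Logistic regression of h_t on v: intercept b_t, coefficients W[., t]."""
    return sigmoid(b[t] + sum(v[i] * W[i][t] for i in range(len(v))))

def from_joint(t, v, h_rest):
    """P(h_t = 1 | v, h_{-t}) from the joint; Z(eta) cancels in the ratio."""
    h1 = list(h_rest[:t]) + [1] + list(h_rest[t:])
    h0 = list(h_rest[:t]) + [0] + list(h_rest[t:])
    p1, p0 = unnormalized(v, h1), unnormalized(v, h0)
    return p1 / (p1 + p0)

# Largest discrepancy over every v, every hidden unit t, and every h_{-t}.
max_gap = 0.0
for v in itertools.product([0, 1], repeat=len(a)):
    for t in range(len(b)):
        for h_rest in itertools.product([0, 1], repeat=len(b) - 1):
            gap = abs(from_joint(t, v, h_rest) - logistic_form(t, v))
            max_gap = max(max_gap, gap)

print("largest gap between joint-based and logistic form:", max_gap)
```

The gap is at rounding-error level for every value of $h_{-t}$, which is the sense in which the odds depend only on $v$.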

Restricted Boltzmann machine


Leung et al Bioinformatics 2014

Brendan Frey, Infinite Genomes Project

FieldsLive January 27 2015

Some highlights

  • Statistical Machine Learning
  • Optimization
  • Visualization
  • Health Policy
  • Social Policy


Some highlights

  • Optimization


$\max_{\theta} \left\{ \frac{1}{n} \sum_{i=1}^{n} \log f(y_i \mid x_i; \theta) - P_{\lambda}(\theta) \right\}$


Optimization


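$\max_{\theta} \left\{ \frac{1}{n} \sum_{i=1}^{n} \log f(y_i \mid x_i; \theta) - P_{\lambda}(\theta) \right\}$

  • lasso penalty $P_{\lambda}(\theta) = \lambda \|\theta\|_1 = \lambda \sum_j |\theta_j|$
  • $\|\theta\|_1$ is convex relaxation of $\|\theta\|_0$
  • many interesting penalties are non-convex
  • optimization routines may not find global optimum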
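The difference between the $\ell_1$ and $\ell_0$ penalties is easiest to see in one dimension. The sketch below assumes a single Gaussian observation $z$ with unit variance, so the penalized criterion reduces to minimizing $\frac{1}{2}(z - \theta)^2$ plus the penalty; this simplification and the function names are assumptions for illustration, not part of the slides. In that case the lasso solution is soft thresholding and the $\ell_0$ solution is hard thresholding.

```python
import math

def soft_threshold(z, lam):
    """Minimizer of 0.5 * (z - theta)**2 + lam * |theta| (convex, l1 penalty)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def hard_threshold(z, lam):
    """Minimizer of 0.5 * (z - theta)**2 + lam * 1{theta != 0} (non-convex, l0 penalty)."""
    return z if abs(z) > math.sqrt(2.0 * lam) else 0.0

def lasso_objective(theta, z, lam):
    return 0.5 * (z - theta) ** 2 + lam * abs(theta)

def l0_objective(theta, z, lam):
    return 0.5 * (z - theta) ** 2 + (lam if theta != 0 else 0.0)

for z in [0.5, 1.2, 3.0]:
    print(z, soft_threshold(z, 1.0), hard_threshold(z, 1.0))
```

Soft thresholding shrinks every surviving estimate toward zero by $\lambda$, while hard thresholding leaves it untouched; the price of the latter is a discontinuous, non-convex problem.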

Optimization


$\max_{\theta} \left\{ \frac{1}{n} \sum_{i=1}^{n} \log f(y_i \mid x_i; \theta) - P_{\lambda}(\theta) \right\}$

  • statistical error $\hat\theta - \theta^{*}$: neighbourhood of true value
  • approximation error $\theta^{t} - \hat\theta$: iterating over $t$

Wainwright, FieldsLive, Jan 16 2015
Loh and Wainwright, JMLR 2015
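The two error terms can be watched separately in a small simulation. The sketch below is an illustration under several assumptions: a Gaussian linear model (so the criterion becomes least squares plus a lasso penalty), simulated data with a made-up sparse $\theta^{*}$, proximal gradient descent (ISTA) as the optimizer, and the final iterate used as a stand-in for $\hat\theta$. None of these choices comes from the slides.

```python
import math
import random

def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def ista_path(X, y, lam, n_iter):
    """Proximal gradient iterates theta^0, theta^1, ... for
    (1 / 2n) * ||y - X theta||^2 + lam * ||theta||_1."""
    n, p = len(X), len(X[0])
    # trace(X'X)/n bounds the largest eigenvalue of X'X/n, so this step is safe.
    step = n / sum(x * x for row in X for x in row)
    theta = [0.0] * p
    path = [theta]
    for _ in range(n_iter):
        resid = [sum(xi * tj for xi, tj in zip(row, theta)) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(row[j] * r for row, r in zip(X, resid)) / n
                for j in range(p)]
        theta = [soft_threshold(tj - step * gj, step * lam)
                 for tj, gj in zip(theta, grad)]
        path.append(theta)
    return path

def dist(u, w):
    return math.sqrt(sum((ui - wi) ** 2 for ui, wi in zip(u, w)))

rng = random.Random(0)
theta_star = [2.0, -1.5, 0.0, 0.0, 0.0]        # hypothetical sparse truth
X = [[rng.gauss(0, 1) for _ in theta_star] for _ in range(100)]
y = [sum(xi * ti for xi, ti in zip(row, theta_star)) + rng.gauss(0, 0.5)
     for row in X]

path = ista_path(X, y, lam=0.1, n_iter=600)
theta_hat = path[-1]                            # proxy for the penalized estimate
stat_err = dist(theta_hat, theta_star)          # ||theta_hat - theta*||
opt_err = [dist(theta_t, theta_hat) for theta_t in path]  # ||theta^t - theta_hat||

print("statistical error:", stat_err)
print("optimization error at t = 0, 5, 50:", opt_err[0], opt_err[5], opt_err[50])
```

Once the optimization error falls below the statistical error, further iterations refine digits that the data cannot distinguish anyway.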

Some highlights

  • Statistical Machine Learning
  • Optimization
  • Visualization
  • Health Policy
  • Social Policy


Some highlights

  • Visualization


Innovis.cpsc.ucalgary.ca


Visualization

  • statistical graphics
      – data representation
      – data exploration
      – filtering, sampling, aggregation

  • information visualization
  • scientific visualization
  • cognitive science and design


Visualization


KPMG Data Observatory, IC


Visualization


fivethirtyeight.com


Visualization


New York Times “The duty of beauty”

Some highlights

  • Statistical Machine Learning
  • Optimization
  • Visualization
  • Health Policy
  • Social Policy


Some highlights

  • Health Policy


Health Policy Administrative Databases

Institute for Clinical Evaluative Sciences

Thérèse Stukel, ICES

Some highlights

  • Statistical Machine Learning
  • Optimization
  • Visualization
  • Health Policy
  • Social Policy


Some highlights

  • Social Policy


Thérèse Stukel, ICES

Privacy

  • “Big Data and Innovation, Setting the Record Straight: De-identification Does Work”
    Privacy Commissioner of Ontario, July 2014
  • “No silver bullet: De-identification still doesn’t work”
    Narayanan & Felten, July 2014

  • Statistical Disclosure Limitation
  • Differential Privacy
  • Multi-party Communication
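As a concrete instance of differential privacy, the sketch below shows the standard Laplace mechanism applied to a counting query. The records, the predicate, and the function names are hypothetical and chosen only for illustration; the slides name the topic but do not prescribe a mechanism.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) draw: the difference of two independent
    exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical records: (age, has_condition)
records = [(34, True), (51, False), (47, True), (29, False), (62, True)]
rng = random.Random(0)

noisy = private_count(records, lambda r: r[1], epsilon=1.0, rng=rng)
print("noisy count of records with the condition:", noisy)
```

Smaller values of `epsilon` give stronger privacy and noisier answers, which is the trade-off at issue in the de-identification debate above.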


Some highlights

  • Statistical Machine Learning
  • Optimization
  • Visualization
  • Health Policy
  • Social Policy
  • inference, environmental science, networks, genomics, finance, physical sciences, software infrastructure, …


What did we learn?

  1. Statistical models for big data are complex, high-dimensional
      – inference is well-studied, but difficult
  2. Computational challenges include size and speed
      – ideas of statistical inference get lost in the machine
  3. Data owners understand 2., but not 1.
  4. Data science may be the best way to combine these


What is data science?

  • a course?
  • a set of courses?
  • a job?
  • a technology?
  • a new field of research?
  • a collaboration?


Data Science Program(s)

  • mathematical reasoning
  • statistical theory
  • statistical and machine learning methods
  • programming and software development
  • algorithms and data structures
  • communicating results and limitations


Data Science Research

  • data collection and data quality
  • large N, small p
      – computational strategies, e.g. Spark, Hadoop
      – divide and conquer
  • small n, large p
      – inferential and computational strategies
      – dimension reduction
      – post-selection inference
      – inference for extremes
  • ‘new’ types of data: networks, graphs, text, images, …
      – “alternative sources”
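For the large N, small p setting, one simple form of divide and conquer is to compute sufficient statistics on each block of data and combine them. The sketch below assumes simple linear regression on simulated data, with four in-memory blocks standing in for machines; the data, block count, and function names are assumptions for illustration, and it does not use Spark or Hadoop.

```python
import random

def block_summary(block):
    """Sufficient statistics for simple linear regression on one block."""
    n = len(block)
    sx = sum(x for x, _ in block)
    sy = sum(y for _, y in block)
    sxx = sum(x * x for x, _ in block)
    sxy = sum(x * y for x, y in block)
    return (n, sx, sy, sxx, sxy)

def combine(summaries):
    """Sufficient statistics add across blocks."""
    return tuple(sum(s[k] for s in summaries) for k in range(5))

def fit_from_summary(summary):
    """Least-squares intercept and slope from the combined statistics."""
    n, sx, sy, sxx, sxy = summary
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

rng = random.Random(1)
data = []
for _ in range(10000):
    x = rng.uniform(0, 10)
    data.append((x, 1.0 + 2.0 * x + rng.gauss(0, 0.3)))   # simulated truth: 1 + 2x

blocks = [data[i::4] for i in range(4)]                     # four "machines"
dc_fit = fit_from_summary(combine([block_summary(b) for b in blocks]))
full_fit = fit_from_summary(block_summary(data))

print("divide and conquer:", dc_fit)
print("full data:         ", full_fit)
```

Here the combined fit matches the full-data fit up to rounding because least squares depends on the data only through sums; for estimators without additive sufficient statistics, averaging block estimates is only an approximation.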


… Data Science Research

  • collaboration and communication
  • data wrangling, database development, record linkage
  • replicability, reproducibility, new workflows
  • visualization
  • outside the ivory tower: industry, government, media, public


http://arxiv.org/abs/1609.00037v1

… Good Enough

  • Data Management – from raw to ‘analysable’
  • Software – programming
  • Collaboration
  • Project Organization
  • Keeping Track
  • Writing

knitr, tidyr, dplyr, ggplot2, ggvis, Github


“How do you see your area developing in the future?”

  • I suspect that the new data scientists will discover that the old core is important
  • and that theoretical statisticians may be in short supply
  • even within statistical science we are going to need a lot of translation
  • as the discipline expands it will be increasingly difficult to be a ‘polystat’
  • we’ll still have lots of small data, but its analysis will be influenced by the trend to massive data


“A range of other problems”


Michael Jordan, UC Berkeley

“while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to that tool when I’m consulting out in industry. I find that industry people are often looking to solve a range of other problems, often not involving “pattern recognition” problems”

accurate answers quickly; meaningful error bars; merge various data sources; visualize and present conclusions; diagnostics; non-stationarity; targeted experiments within databases

Caution can be a good thing


“Digital Hippocratic Oath”

Caution can be a good thing


“…from data we will get the cure for cancer as well as better hospitals; schools that adapt to children’s needs making them happier and smarter; better policing and safer homes; and of course jobs.”

Guardian 2 July 2016


Gartner Hype Cycle 2016

Smart Data Discovery


Thank You!