slide-1
SLIDE 1

On Computational Thinking, Inferential Thinking and “Big Data”

Michael I. Jordan

University of California, Berkeley

November 16, 2015

slide-7
SLIDE 7

What Is the Big Data Phenomenon?

  • Science in confirmatory mode (e.g., particle physics)

– inferential issue: massive number of nuisance variables

  • Science in exploratory mode (e.g., astronomy, genomics)

– inferential issue: massive number of hypotheses

  • Measurement of human activity, particularly online activity, is generating massive datasets that can be used (e.g.) for personalization and for creating markets

– inferential issues: many, including heterogeneity, unknown sampling frames, compound loss function

  • And then there are the computational issues

– and, most notably, the interactions of computational and inferential issues

slide-12
SLIDE 12

A Job Description, circa 2015

  • Your Boss: “I need a Big Data system that will replace our classic service with a personalized service”
  • “It should work reasonably well for anyone and everyone; I can tolerate a few errors but not too many dumb ones that will embarrass us”
  • “It should run just as fast as our classic service”
  • “It should only improve as we collect more data; in particular it shouldn’t slow down”
  • “There are serious privacy concerns of course, and they vary across the clients”

slide-16
SLIDE 16

Some Challenges Driven by Big Data

  • Big Data analysis requires a thorough blending of computational thinking and inferential thinking

  • What I mean by computational thinking

– abstraction, modularity, scalability, robustness, etc.

  • Inferential thinking means (1) considering the real-world phenomenon behind the data, (2) considering the sampling pattern that gave rise to the data, and (3) developing procedures that will go “backwards” from the data to the underlying phenomenon

– merely computing “statistics” or running machine-learning algorithms generally isn’t inferential thinking

– a focus on confidence intervals, not just “outputs”
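
One way to read the point about confidence intervals: a procedure should report how uncertain its answer is, not only the answer. The sketch below is a minimal illustration, assuming a percentile bootstrap for the mean; the bootstrap is chosen here only for illustration and is not taken from the slides, and `bootstrap_ci` is a hypothetical helper name.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for `stat` (illustrative helper)."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on resamples drawn with replacement.
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return stat(data), (lower, upper)

# Synthetic data, assumed only for the demonstration.
data_rng = random.Random(1)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

point, (lo, hi) = bootstrap_ci(sample)
print(f"estimate={point:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

The point estimate alone is the “output”; the interval is what lets a reader judge whether a difference between two such outputs means anything.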

slide-17
SLIDE 17

The Challenges are Daunting

  • The core theories in computer science and statistics developed separately and there is an oil and water problem
  • Core statistical theory doesn’t have a place for runtime and other computational resources
  • Core computational theory doesn’t have a place for statistical risk

slide-18
SLIDE 18
Outline

  • Inference under privacy constraints
  • Inference under communication constraints
  • Inference (confidence intervals) and parallel, distributed computing

slide-19
SLIDE 19

Part I: Inference and Privacy

with John Duchi and Martin Wainwright

slide-20
SLIDE 20
Privacy and Data Analysis

  • Individuals are not generally willing to allow their personal data to be used without control on how it will be used and how much privacy loss they will incur
  • “Privacy loss” can be quantified via differential privacy
  • We want to trade privacy loss against the value we obtain from “data analysis”
  • The question becomes that of quantifying such value and juxtaposing it with privacy loss
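
For reference, the statement that privacy loss can be quantified via differential privacy can be made concrete with the standard definition. This is the usual (central) form of $\varepsilon$-differential privacy; the slides do not say which variant the talk uses, so the symbols $M$, $D$, $D'$, $A$, and $\varepsilon$ are notation assumed here for illustration.

```latex
% A randomized mechanism M is \varepsilon-differentially private if, for all
% datasets D and D' that differ in one individual's record, and for all
% measurable sets A of outputs,
\[
  \Pr[\, M(D) \in A \,] \;\le\; e^{\varepsilon} \, \Pr[\, M(D') \in A \,].
\]
```

A smaller $\varepsilon$ means less privacy loss; at $\varepsilon = 0$ the output distribution cannot depend on any single record.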

slide-26
SLIDE 26

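Privacy

[Diagram: a query on the database gives $\hat{\theta}$; the database passes through $Q$ to give a privatized database, and a query on that gives $\tilde{\theta}$]

Classical problem in differential privacy: show that $\hat{\theta}$ and $\tilde{\theta}$ are close under constraints on $Q$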
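
The closeness of the two query answers can be illustrated numerically. The sketch below is an assumption-laden toy, not the method from the talk: it takes $Q$ to be randomized response on a database of bits (each bit is kept with probability $e^{\varepsilon}/(1+e^{\varepsilon})$ and flipped otherwise), and takes the query to be the mean. The function names and the synthetic database are hypothetical.

```python
import math
import random

def randomized_response(bits, epsilon, rng):
    """Channel Q: keep each bit with probability e^eps / (1 + e^eps), else flip it."""
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    private = [b if rng.random() < p_keep else 1 - b for b in bits]
    return private, p_keep

def debiased_mean(private_bits, p_keep):
    """Invert the known flipping bias to get an unbiased estimate of the mean."""
    raw = sum(private_bits) / len(private_bits)
    return (raw - (1.0 - p_keep)) / (2.0 * p_keep - 1.0)

rng = random.Random(0)
# Synthetic database of bits, assumed only for the demonstration.
database = [1 if rng.random() < 0.3 else 0 for _ in range(100_000)]

theta_hat = sum(database) / len(database)        # query on the raw database
private_db, p_keep = randomized_response(database, epsilon=1.0, rng=rng)
theta_tilde = debiased_mean(private_db, p_keep)  # query on the privatized database

print(f"theta_hat={theta_hat:.4f}  theta_tilde={theta_tilde:.4f}")
```

Shrinking `epsilon` pushes `p_keep` toward 1/2, which tightens the constraint on $Q$ and widens the typical gap between the two answers.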

slide-30
SLIDE 30

Inference

[Diagram: $P$, $S$, database, query; query outputs $\theta$ and $\tilde{\theta}$]

Classical problem in statistical theory: show that $\theta$ and $\tilde{\theta}$ are close under constraints on $S$
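
The statistical version of the problem can be sketched the same way. The example below assumes, for illustration only, that $P$ is a finite synthetic population, that $S$ is simple random sampling without replacement, and that the query is the mean; none of these specifics come from the slides, and `query_on_sample` is a hypothetical helper.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical finite population P; theta is the query answered on P itself.
population = [rng.gauss(50.0, 10.0) for _ in range(200_000)]
theta = statistics.fmean(population)

def query_on_sample(n):
    """S: draw a simple random sample of size n, then run the same query on it."""
    database = rng.sample(population, n)
    return statistics.fmean(database)

# Compare the sample-based answer with theta at two sample sizes.
errors = {n: abs(query_on_sample(n) - theta) for n in (100, 10_000)}
for n, err in errors.items():
    print(f"n={n:>6}: |estimate - theta| = {err:.4f}")
```

Here the constraint on $S$ is the sample size and the sampling scheme; with an unknown or biased sampling frame the same query need not land near $\theta$ at any $n$.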

slide-31
SLIDE 31

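Privacy and Inference

[Diagram: $P$, $S$, database, $Q$, privatized database; the query gives $\theta$ on $P$, $\hat{\theta}$ on the database, and $\tilde{\theta}$ on the privatized database]

The privacy-meets-inference problem: show that $\theta$ and $\tilde{\theta}$ are close under constraints on $Q$ and on $S$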