SLIDE 1

Low-Cost Learning via Active Data Procurement

September 2015. Jacob Abernethy, Yiling Chen, Chien-Ju Ho, Bo Waggoner

SLIDE 2

Coming soon to a society near you

(Diagram: data-holders, ex: medical data, on one side; data-needers, ex: pharmaceutical co., on the other.)

SLIDE 3

Classic ML problem

(Diagram: data source → data z1, z2, … → learning alg → hypothesis h → data-needer.)

Goal: use a small amount of data, output a “good” h.

SLIDE 4

Example learning task: classification

  • Data: (point, label) pairs, where the label is one of two types
  • Hypothesis h: a hyperplane separating the two types

SLIDE 5

Twist: data is now held by individuals

(Diagram: data-holders, each with data zt and cost ct → mechanism → hypothesis h → data-needer.)

“Cost of revealing data” (formal model later…)

Goal: spend a small budget, output a “good” h.

SLIDE 6

Why is this difficult?

  • 1. (Relatively) few data are useful

Example: studying the ACTN-3 mutation and endurance running. (Venn diagram: “have mutation” vs. “runners”; only the overlap is useful.)

SLIDE 7

Why is this difficult?

  • 2. Utility may be correlated with cost (causing bias)

Paying $10 for data (to study HIV). Would sell?
  HIV-negative: yes, yes, no, yes, yes
  HIV-positive: no, no, yes

SLIDE 8

Why is this difficult?

  • 2. Utility may be correlated with cost (causing bias)

(Same HIV example as the previous slide.)

Machine Learning roadblock: how to deal with biases?

SLIDE 9

Why is this difficult?

  • 3. Utility (ML) and cost (econ) live in different worlds

  learning alg: entropies, gradients, loss functions, divergences
  mechanism: auctions, budgets, value distributions, reserve prices

SLIDE 10

Why is this difficult?

  • 3. Utility (ML) and cost (econ) live in different worlds

(Same contrast as the previous slide.)

Econ roadblock: how to assign value to data?

SLIDE 11

Broad research challenge:

  • 1. How to assign value (prices) to pieces of data?
  • 2. How to design mechanisms for procuring and learning from data?
  • 3. Develop a theory of budget-constrained learning: what is (im)possible to learn given budget B and the parameters of the problem?

SLIDE 12

Outline

  • 1. Overview of literature, our contributions
  • 2. Online learning model/results
  • 3. “Statistical learning” result, conclusion

SLIDE 13

Related work

(Chart arranging prior work by model: how are agents strategic?)
  • can fabricate data (like in peer-prediction)
  • principal-agent style, data depends on effort
  • agents cannot fabricate data, have costs ← this work

Papers placed on the chart: Meir, Procaccia, Rosenschein 2012; Cummings, Ligett, Roth, Wu, Ziani 2015; Dekel, Fisher, Procaccia 2008; Ghosh, Ligett, Roth, Schoenebeck 2014; Horel, Ioannidis, Muthukrishnan 2014; Roth, Schoenebeck 2012; Ligett, Roth 2012; Cai, Daskalakis, Papadimitriou 2015.

SLIDE 14

Related work

(Same chart; second axis, type of goal:)
  • minimize variance, or a related goal
  • risk/regret bounds ← this work

SLIDE 15

Conducting Truthful Surveys, Cheaply
e.g. Roth-Schoenebeck, EC 2012

  • Each datapoint is a number. Task is to estimate the mean
  • Approach: offer each agent a price drawn i.i.d.
  • Idea: obtains cheap but biased data; can de-bias it
  • Result: derives the price distribution that minimizes the variance of the estimate

(Diagram: agents with i.i.d. data and costs c1, c2, … → mechanism → estimate h.)
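To make the de-bias step concrete, here is a minimal sketch (not from the talk) of the random-price, inverse-propensity estimator; the agent model and the price/probability inputs are illustrative assumptions.

```python
import random

def survey_estimate(agents, prices, probs):
    """Unbiased mean estimate from data that is sold only when the
    posted price covers the holder's cost (a minimal sketch).

    agents: list of (value, cost); an agent sells iff price >= cost.
    prices, probs: the posted-price distribution (choosing it to
    minimize variance is the Roth-Schoenebeck result; here it's input).
    """
    total = 0.0
    for value, cost in agents:
        price = random.choices(prices, weights=probs)[0]
        if price >= cost:
            # q = Pr[posted price >= this agent's cost]; the cost is
            # revealed on acceptance, so q is computable here.
            q = sum(pr for p, pr in zip(prices, probs) if p >= cost)
            # Inverse-propensity weighting: E[value * 1{sold} / q] = value,
            # which removes the bias from cost-correlated participation.
            total += value / q
    return total / len(agents)
```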

SLIDE 16

What we wanted to do differently

  • 1. Prove ML-style risk or regret bounds, rather than “minimize the variance” type goals.
    Why: understand the error rate as a function of the budget and problem characteristics (as in ML)

SLIDE 17

What we wanted to do differently

  • 1. Prove ML-style risk or regret bounds (as above).
  • 2. Interface with existing ML algorithms.
    Why: understand how value derives from the learning alg. Toward black-box use of learners in mechanisms.

SLIDE 18

Related work

(Same chart; third axis, type of learning problem:)
  • population average; classification; regression
  • “general” learning problems ← this work

SLIDE 19

What we wanted to do differently

  • 1. Prove ML-style risk or regret bounds (as above).
  • 2. Interface with existing ML algorithms (as above).
  • 3. Online data arrival, rather than a “batch” setting.
    Why: allows an “active learning” approach; nice model

SLIDE 20

Related work

(Same chart; fourth axis, data arrival:)
  • “batch”
  • online, active ← this work

SLIDE 21

Related work

(Same chart; “this work” here also includes Abernethy, Frongillo, W., NIPS 2015.)

SLIDE 22

Overview of our contributions

  • Propose a model of online learning with purchased data: T arriving data points and budget B.
  • Convert any “FTRL” algorithm into a mechanism.
  • Show regret on the order of T / √B, and lower bounds of the same order.

SLIDE 23

Overview of our contributions

  • Propose a model of online learning with purchased data (as above).
  • Extend the model to the case where data is drawn i.i.d. (“statistical learning”).
  • Extend the result to a “risk” bound on the order of 1 / √B.

SLIDE 24

Outline

  • 1. Overview of literature, our contributions
  • 2. Online learning model/results
  • 3. “Statistical learning” result, conclusion

SLIDE 25

Online learning with purchased data

  • a. Review of online learning
  • b. Our model: adding $$
  • c. Deriving our mechanism and results

SLIDE 26

Standard online learning model

For t = 1, …, T:
  • algorithm posts a hypothesis ht
  • data point zt arrives
  • algorithm sees zt and updates to ht+1

Loss = ∑t ℓ(ht, zt)
Regret = Loss − ∑t ℓ(h*, zt), where h* minimizes the sum

SLIDE 27

Follow-the-Regularized-Leader (FTRL)

Assume: the loss function is convex and Lipschitz, the hypothesis space is a Hilbert space, etc.

Algorithm: ht = argmin_h ∑s<t ℓ(h, zs) + R(h)/η

SLIDE 28

Follow-the-Regularized-Leader (FTRL)

Assume: the loss function is convex and Lipschitz, the hypothesis space is a Hilbert space, etc.

Algorithm: ht = argmin_h ∑s<t ℓ(h, zs) + R(h)/η

Example 1 (Euclidean norm): R(h) = ǁhǁ₂²
  ⇒ ht+1 = ht − η ∇ℓ(ht, zt)   (online gradient descent)
SLIDE 29

Follow-the-Regularized-Leader (FTRL)

Example 1 (Euclidean norm): R(h) = ǁhǁ₂²
  ⇒ ht+1 = ht − η ∇ℓ(ht, zt)   (online gradient descent)

Example 2 (negative entropy): R(h) = ∑j h(j) ln h(j)
  ⇒ ht+1(j) ∝ ht(j) exp[ −η ∇ℓ(ht, zt)(j) ]   (multiplicative weights)
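To ground Example 1, a minimal sketch of the online-gradient-descent instance on a toy stream; the squared loss and the (x, y) data format are illustrative assumptions, not the talk's setting:

```python
import numpy as np

def ogd(stream, dim, eta):
    """FTRL with R(h) = ||h||_2^2 run as online gradient descent.

    stream: iterable of (x, y) pairs; here the loss is the squared
    error l(h, (x, y)) = (h.x - y)^2, an illustrative choice.
    """
    h = np.zeros(dim)                 # h_1: the first posted hypothesis
    loss = 0.0
    for x, y in stream:
        err = h @ x - y               # suffer loss on the posted h_t
        loss += err ** 2
        h = h - eta * (2 * err * x)   # h_{t+1} = h_t - eta * grad
    return h, loss
```

With gradients bounded by 1, eta = 1/√T recovers the O(√T) regret shown on the next slides.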

SLIDE 30

Regret Bound for FTRL

Fact: the regret of FTRL is bounded by O( 1/η + η ∑t Δt² ), where Δt = ǁ ∇ℓ(ht, zt) ǁ.

SLIDE 31

Regret Bound for FTRL

Fact: the regret of FTRL is bounded by O( 1/η + η ∑t Δt² ), where Δt = ǁ ∇ℓ(ht, zt) ǁ.

We know Δt ≤ 1 by assumption, so we can choose η = 1/√T and get Regret ≤ O(√T). “No regret”: average regret → 0.
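The choice of η is the usual one-line balancing step, reconstructed here for completeness:

```latex
\min_{\eta > 0}\; \frac{1}{\eta} + \eta \sum_t \Delta_t^2
\;\le\; \min_{\eta > 0}\; \frac{1}{\eta} + \eta T
\;=\; 2\sqrt{T}
\quad \text{at } \eta = \tfrac{1}{\sqrt{T}},
```

using Δt ≤ 1; the two terms are balanced when 1/η = ηT.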

SLIDE 32

Online learning with purchased data

  • a. Review of online learning
  • b. Our model: adding $$
  • c. Deriving our mechanism and results

SLIDE 33

Model of strategic data-holder

Model of agent:
  • holds data zt and cost ct
  • cost is a threshold price
    ○ agent agrees to sell data iff price ≥ ct
    ○ interpretations: privacy, transaction cost, …
  • Assume: all costs ≤ 1

SLIDE 34

Model of agent-mechanism interaction

  • Mechanism posts a menu of prices offered, e.g.:
      data:  (32,12)  (20,18)  (32,12)
      price: $0.22    $0.41    $0.88
  • agent t arrives
  • If ct ≤ price(zt), agent accepts:
    ○ agent reveals (zt, ct)
    ○ mechanism pays the agent price(zt)
  • Otherwise, agent rejects:
    ○ mechanism learns that the agent rejected, pays nothing

SLIDE 35

Recall: standard online learning model

For t = 1, …, T:
  • algorithm posts a hypothesis ht
  • data point zt arrives
  • algorithm sees zt and updates to ht+1
SLIDE 36

Our model: online learning with $$

For t = 1, …, T:
  • mechanism posts a hypothesis ht and a menu of prices
  • data point zt arrives with cost ct
  • If ct ≤ the menu price of zt: mechanism pays the price, learns zt
  • else: mechanism pays nothing

Loss = ∑t ℓ(ht, zt)
Regret = Loss − ∑t ℓ(h*, zt), where h* minimizes the sum
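As a sketch, one round of this protocol might look like the following; the menu representation (a price function over data points) and the budget check are illustrative assumptions:

```python
def run_round(h, menu_price, update, agent, budget):
    """One round of online learning with purchased data (a sketch).

    menu_price(h, z): posted price for data point z given hypothesis h.
    update(h, z): the learning algorithm's update on purchased data.
    agent: (z, c) pair; the agent accepts iff the posted price covers c.
    """
    z, c = agent
    price = menu_price(h, z)
    if c <= price and price <= budget:
        budget -= price       # pay the posted price, not the cost
        h = update(h, z)      # only purchased data reaches the learner
    # On rejection the mechanism observes nothing but the rejection.
    return h, budget
```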

SLIDE 37

Online learning with purchased data

  • a. Review of online learning
  • b. Our model: adding $$
  • c. Deriving our mechanism and results

SLIDE 38

Start easy

Suppose all costs are 1. ⇒ Determine which data points to sample.

  data:  (32,12)  (20,18)  (32,12)
  price: $1       $0       $0

SLIDE 39

Start easy

Suppose all costs are 1. ⇒ Determine which data points to sample. Examples:
  • B = T/2
  • B = √T
  • B = log(T)

SLIDE 40

Key idea #1: randomly sample

Can purchase each data point zt with probability qt(zt). The menu is now randomly chosen:

  data:         (32,12)  (20,18)  (32,12)
  Pr[price=1]:  0.3      0.06     0.41

The regret bound becomes O( 1/η + η E[ ∑t Δt² / qt ] ).

SLIDE 41

Key idea #1: randomly sample

Can purchase each data point zt with probability qt(zt); the menu is randomly chosen (as above).

Lemma (importance-weighted regret bound): For any qt’s, the regret of (modified) FTRL is O( 1/η + η E[ ∑t Δt² / qt ] ).

See also: Importance-Weighted Active Learning, Beygelzimer et al., ICML 2009.
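The “modified” FTRL is the importance-weighted one: purchased gradients are scaled by 1/qt so their expectation matches the full-information gradient. A minimal sketch, with the sampling rule q left abstract:

```python
import random

def iw_round(h, z, q_of, grad, eta):
    """One importance-weighted online-gradient-descent round (a sketch).

    q_of(h, z): probability of purchasing data point z this round.
    grad(h, z): gradient of the loss at the posted hypothesis h.
    E[(grad/q) * 1{purchased}] = grad, so the update is unbiased.
    """
    q = q_of(h, z)
    if random.random() < q:
        h = h - eta * grad(h, z) / q   # reweighted update on purchase
    return h                            # otherwise: no update
```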

SLIDE 42

Result for easy case

Lemma (importance-weighted regret bound): For any qt’s, the regret of (modified) FTRL is O( 1/η + η E[ ∑t Δt² / qt ] ).

Corollary: Setting all qt = B/T and choosing η = √B / T yields regret ≤ T / √B.

“No data, no regret”: the average amount of data → 0 and the average regret → 0.

SLIDE 43

Result for easy case

Corollary (recall): setting all qt = B/T and η = √B / T yields regret ≤ T / √B.

Theorem: This is tight.
(Predict a repeated coin toss whose bias is either 1/2 + 1/√B or 1/2 − 1/√B.)

SLIDE 44

Now a bit harder…

Costs can be arbitrary, but agents are nonstrategic: they will accept a payment of exactly ct. At each time step, randomly choose which (data, cost) pairs to purchase.

Question: how to set the probabilities of purchase qt?

  data, cost:    (32,12), c=0.3   (20,18), c=0.8
  Pr[purchase]:  0.12             0.08

SLIDE 45

Key idea #2: sample proportional to…

Imagine we knew the arrivals in advance. Optimization problem:

  minimize ∑t Δt² / qt
  s.t. ∑t qt ct ≤ B,  qt ≤ 1

Solution: qt = Δt / (K √ct)   (K a normalizing constant).

SLIDE 46

Key idea #2: sample proportional to…

(Same optimization as above; solution qt = Δt / (K √ct).)

The point: we only need advance knowledge of K to implement the “optimal” sampling strategy!
Turns out: K = γT / B, where γ ∈ [0,1] (discussed later).
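The solution is a short Lagrangian calculation, reconstructed here (ignoring the cap qt ≤ 1 for the sketch):

```latex
\mathcal{L} = \sum_t \frac{\Delta_t^2}{q_t} + \lambda\Big(\sum_t q_t c_t - B\Big),
\qquad
0 = \frac{\partial \mathcal{L}}{\partial q_t} = -\frac{\Delta_t^2}{q_t^2} + \lambda c_t
\;\Rightarrow\;
q_t = \frac{\Delta_t}{K\sqrt{c_t}} \quad (K = \sqrt{\lambda}).
```

Making the budget constraint tight gives K = (∑t Δt √ct)/B = γT/B with γ = (1/T) ∑t Δt √ct, the quantity defined on slide 48.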

SLIDE 47

Result for this “at-cost” setting

Theorem: Given a rough advance estimate of γ, can achieve regret ≤ γ T / √B.

Theorem: This is tight (in a reasonable sense).
(Same bad instance, but with “useless” free data points sprinkled in.)

Implication: γ is capturing the “difficulty of the problem”.

SLIDE 48

Discussion

γ = (1/T) ∑t Δt √ct = average sqrt(difficulty × cost).

SLIDE 49

Discussion

γ = (1/T) ∑t Δt √ct = average sqrt(difficulty × cost).

  • Low avg cost ⇒ low regret
  • Low avg difficulty ⇒ low regret
  • Good correlations ⇒ low regret

Example simplified corollary: Given a rough advance estimate of the avg cost μ, regret ≤ √μ T / √B.
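The corollary follows because γ never exceeds √μ; a short reconstruction of that step:

```latex
\gamma = \frac{1}{T}\sum_t \Delta_t \sqrt{c_t}
\;\le\; \frac{1}{T}\sum_t \sqrt{c_t}
\;\le\; \sqrt{\frac{1}{T}\sum_t c_t}
= \sqrt{\mu},
```

using Δt ≤ 1 and then Jensen's inequality (concavity of the square root), so the at-cost bound γT/√B is at most √μ · T/√B.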

SLIDE 50

Finally, the “full” problem

Now agents are strategic and we must post prices.

Recall: we had sampling probability qt = Δt / (K √ct). But: we don’t know ct.

SLIDE 51

Finally, the “full” problem

Now agents are strategic and we must post prices. Recall: the sampling probability was qt = Δt / (K √ct), but we don’t know ct.

Key idea #3: randomly draw the price from the distribution such that Pr[ price ≥ c ] = Δt / (K √c).

⇒ achieve the “right” probability for every ct simultaneously!
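One way to realize such a price distribution is inverse-CDF sampling; a minimal sketch (the closed form is forced by the target tail probability, and the cap at 1 uses the assumption that all costs are ≤ 1):

```python
import random

def sample_price(delta, K):
    """Draw a posted price P with Pr[P >= c] = min(1, delta / (K sqrt(c)))
    simultaneously for every cost c in (0, 1]."""
    u = 1.0 - random.random()          # uniform on (0, 1]
    price = (delta / (K * u)) ** 2     # inverse of u = delta / (K sqrt(p))
    return min(1.0, price)             # costs are <= 1, so capping is safe
```

Check: P ≥ c iff (Δ/(Ku))² ≥ c iff u ≤ Δ/(K√c), which happens with probability min(1, Δ/(K√c)).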

SLIDE 52

Description of final mechanism

Input: estimate of γ. At each time t:
  • post hypothesis ht ← FTRL
  • for each data point zt, compute Δt = ǁ ∇ℓ(ht, zt) ǁ and post a random price from the distribution above
  • If the arriving agent accepts, send the “re-weighted” zt → FTRL
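Putting the three key ideas together, a compact end-to-end sketch; the OGD instance of FTRL, the gradient oracle, and the budget check are illustrative assumptions:

```python
import random
import numpy as np

def mechanism(agents, grad, B, T, gamma_est, eta, dim):
    """Online learning with purchased data: a sketch of the mechanism.

    agents: sequence of (z, c) pairs with cost c <= 1.
    grad(h, z): gradient of the loss at hypothesis h.
    gamma_est: rough advance estimate of gamma = (1/T) sum_t Delta_t sqrt(c_t).
    """
    K = gamma_est * T / B              # normalizer from key idea #2
    h = np.zeros(dim)                  # posted hypothesis (FTRL as OGD)
    spent = 0.0
    for z, c in agents:
        delta = np.linalg.norm(grad(h, z))
        # Key idea #3: posted price with Pr[price >= c] = delta / (K sqrt(c)).
        u = 1.0 - random.random()
        price = min(1.0, (delta / (K * u)) ** 2)
        if c <= price and spent + price <= B:
            spent += price
            # Probability this cost was covered, for importance weighting.
            q = min(1.0, delta / (K * np.sqrt(c))) if c > 0 else 1.0
            # Key idea #1: re-weighted update keeps the gradient unbiased.
            h = h - eta * grad(h, z) / q
    return h
```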

SLIDE 53

Main result for online learning setting

Theorem: Given a rough advance estimate of γ, can achieve regret ≤ √γ T / √B.

Theorem (recall): No mechanism for the easier, “at-cost” setting can beat regret γ T / √B.

Note: we lost a √ factor compared to the easier setting, due to paying our posted price rather than the agent’s cost (the “cost of strategic behavior”).

SLIDE 54

Outline

  • 1. Overview of literature, our contributions
  • 2. Online learning model/results
  • 3. “Statistical learning” result, conclusion

SLIDE 55

Recalling contributions

  • Propose a model of online learning with purchased data: T arriving data points and budget B. Convert any “FTRL” algorithm into a mechanism. Show regret on the order of T / √B and lower bounds of the same order.
  • Extend the model to the case where data is drawn i.i.d. (“statistical learning”). Extend the result to a “risk” bound on the order of 1 / √B.

SLIDE 56

Classic statistical learning model

(Diagram: data source → i.i.d. data z1, z2, … → learning alg → hypothesis h.)

For classification:
  E loss( h ) ≤ E loss( h* ) + O( √( VC-dim / T ) )

SLIDE 57

Our statistical learning model

(Diagram: data-holders with i.i.d. data z1, z2, … and costs c1, c2, … → mechanism with budget B → hypothesis h.)

Costs (still) may be adversarially chosen.

SLIDE 58

Our statistical learning model

Theorem: Given a rough advance estimate of γ, can achieve
  E loss( h ) ≤ E loss( h* ) + O( √( γ / B ) )
SLIDE 59

Our statistical learning model

Theorem: Given a rough advance estimate of γ, can achieve
  E loss( h ) ≤ E loss( h* ) + O( √( γ / B ) )

Proof: known “online-to-batch conversion”: regret R ⇒ risk R/T.
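For completeness, the standard conversion step (for convex loss, average the posted hypotheses):

```latex
\bar h = \frac{1}{T}\sum_{t=1}^{T} h_t,
\qquad
\mathbb{E}\,\mathrm{loss}(\bar h)
\;\le\; \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\,\mathrm{loss}(h_t)
\;\le\; \mathbb{E}\,\mathrm{loss}(h^*) + \frac{R}{T},
```

so the online regret √γ · T/√B becomes a risk gap of √(γ/B).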

SLIDE 60

Summary

Model:
  • online arrival of agents
  • post prices to procure data
  • adversarial costs and data (online learning setting)
  • adversarial costs, i.i.d. data (statistical learning setting)

SLIDE 61

Summary

Results:
  • upper/lower bounds on regret (online learning setting)
  • upper bound on risk (statistical learning setting)

SLIDE 62

Summary

Big picture:
  • design mechanisms to interface with existing learning algs
  • prove ML-style bounds: risk and regret
  • toward a “theory of the learnable… on a budget”
SLIDE 63

Future work

  • Improve bounds (!)
  • Propose a “universal quantity” to replace γ in the bounds (analogue of VC-dimension?)
  • Explore models for purchasing data
SLIDE 64

Future work

(Same list as the previous slide.)

Thanks!

SLIDE 65

Additional slides

SLIDE 66

Simulation results

MNIST dataset: handwritten digit classification. Toy problem: classify (1 or 4) vs (9 or 8). Brighter green = higher cost.

SLIDE 67

Simulation results

  • T = 8503
  • train on half, test on half
  • Alg: Online Gradient Descent
  • Naive: pay 1 until the budget is exhausted, then run the alg
  • Baseline: run the alg on all data points (no budget)
  • Large γ: bad correlations; small γ: independent cost/data

SLIDE 68

Pricing distribution