

Active Regression via Linear-Sample Sparsification

Xue Chen, Eric Price

UT Austin

Agnostic learning

See pairs (x, y) sampled from an unknown distribution.

Guaranteed y ≈ f(x) for some f ∈ F.

Want to find f̂ so that y ≈ f̂(x) on fresh samples.

This work: adversarial error measured in ℓ2. Guaranteed

    E_{x,y}[(y − f(x))²] ≤ σ²

and want

    E_{x,y}[(y − f̂(x))²] ≤ Cσ²

or (equivalently, up to constants in C)

    ‖f − f̂‖²_D := E_x[(f(x) − f̂(x))²] ≤ Cσ²,

where D is the marginal distribution on x.

Agnostic learning of linear spaces

Suppose F is a linear space of functions:

◮ f(x) = αᵀφ(x) for some φ : X → R^d.
◮ Example: univariate degree d − 1 polynomials.

Inner product: ⟨f, g⟩_D := E_x[f(x) g(x)].

[Figure: samples y of a function, the space F, the projection f∗ of y onto F, and the learned f̂.]

Ideal: f∗ = arg min_{f∈F} ‖y − f‖²_D.

Settle for the empirical risk minimizer (ERM):

    f̂ = arg min_{f∈F} ‖y − f‖²_S := (1/m) Σ_{i=1}^m (y_i − f(x_i))².

Idea: with enough samples, the empirical norm ≈ the true norm under D.

◮ Will get ‖f̂ − f∗‖²_D ≤ ε ‖f∗ − y‖²_D.
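To make the ERM concrete, here is a minimal numpy sketch (our illustration, not code from the talk) for the running example of univariate degree-5 polynomials; the noise level and all variable names are assumptions of the demo.

    # Minimal ERM sketch: f_hat = argmin_{f in F} ||y - f||_S^2 over the linear
    # space F = {x -> alpha . phi(x)} with phi(x) = (1, x, ..., x^{d-1}).
    import numpy as np

    rng = np.random.default_rng(0)
    d, m = 6, 200                            # dim(F) (degree-5 polynomials), #samples

    alpha_true = rng.standard_normal(d)      # unknown ground-truth coefficients
    x = rng.uniform(-1.0, 1.0, size=m)       # x_i ~ D (uniform on [-1, 1])
    Phi = np.vander(x, d, increasing=True)   # rows phi(x_i) = (1, x_i, ..., x_i^{d-1})
    y = Phi @ alpha_true + rng.standard_normal(m)   # adversary replaced by sigma = 1 noise

    # ERM = ordinary least squares
    alpha_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    # Monte Carlo estimate of ||f_hat - f||_D^2 on fresh samples from D
    x_test = rng.uniform(-1.0, 1.0, size=100_000)
    err = np.mean((np.vander(x_test, d, increasing=True) @ (alpha_hat - alpha_true)) ** 2)
    print(f"||f_hat - f||_D^2 ~ {err:.4f}")  # roughly sigma^2 * d/m = 0.03 on average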

Agnostic learning of linear spaces: results

[Figure: a degree-5 polynomial fit to noisy samples, σ = 1, x ∈ [−1, 1].]

The (matrix) Chernoff bound depends on

    K := sup_x sup_{f∈F, ‖f‖_D=1} f(x)².

O(K log d + K/ε) samples suffice for agnostic learning [Cohen-Davenport-Leviatan '13, Hsu-Sabato '14]:

◮ Mean-zero noise: ‖f̂ − f∗‖²_D ≤ ε ‖f∗ − y‖²_D
◮ Generic noise: ‖f̂ − f‖²_D ≤ (1 + ε) ‖f − y‖²_D

Also necessary (coupon collector).

How can we avoid the dependence on K?
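For the running example, K can be computed directly. A quick numeric check (ours), using the identity sup_{‖f‖_D=1} f(x)² = Σ_j φ_j(x)² for an orthonormal basis (derived later in this deck), with scaled Legendre polynomials as the orthonormal basis under uniform D on [−1, 1]:

    # K for degree-5 polynomials (d = 6) under uniform D on [-1, 1].
    # phi_j = sqrt(2j+1) * P_j is orthonormal: E_x[P_j(x)^2] = 1/(2j+1).
    import numpy as np
    from numpy.polynomial import legendre

    d = 6
    xs = np.linspace(-1, 1, 100_001)
    Phi = np.stack([np.sqrt(2 * j + 1) * legendre.Legendre.basis(j)(xs)
                    for j in range(d)], axis=1)
    K = (Phi ** 2).sum(axis=1).max()         # sup_x sup_{||f||_D=1} f(x)^2
    print(f"K ~ {K:.1f} = d^2 = {d ** 2}")   # worst points are x = +-1, where
                                             # sum_j (2j+1) P_j(1)^2 = d^2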

Our result: avoid K with more powerful access patterns

With more powerful access models, we can replace

    K := sup_x sup_{f∈F, ‖f‖_D=1} f(x)²    with    κ := E_x sup_{f∈F, ‖f‖_D=1} f(x)².

For linear spaces of functions, κ = d.

Query model:

◮ Can pick x_i of our choice and see y_i ∼ (Y | X = x_i).
◮ Know D (which just defines ‖f − f̂‖_D).

Active learning model:

◮ Receive x_1, . . . , x_m ∼ D.
◮ Pick S ⊂ [m] of size s.
◮ See y_i for i ∈ S.

Some results for non-linear spaces.

Query model: basic approach

ERM needs the empirical norm ‖f‖_S to approximate ‖f‖_D for all f ∈ F.

This takes O(K log d) samples from D.

Improve by biasing samples towards high-variance points:

    D′(x) = (1/κ) · D(x) · sup_{f∈F, ‖f‖_D=1} f(x)².

Estimate the norm via

    ‖f‖²_{S,D′} := (1/m) Σ_{i=1}^m (D(x_i)/D′(x_i)) f(x_i)².

This still equals ‖f‖²_D in expectation, but now the max contribution of any term is κ.

◮ This gives O(κ log d) sample complexity by matrix Chernoff.
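A small self-contained sketch (our illustration) of this estimator, with a fine grid standing in for the continuous domain; the grid size, seed, and basis are assumptions of the demo, not details from the talk.

    # Biased-sampling estimate of ||f||_D^2 for F = degree-5 polynomials,
    # D = uniform on [-1, 1], discretized to a grid of N points.
    import numpy as np
    from numpy.polynomial import legendre

    rng = np.random.default_rng(1)
    d, m, N = 6, 50, 100_000

    grid = np.linspace(-1, 1, N)
    Phi = np.stack([np.sqrt(2 * j + 1) * legendre.Legendre.basis(j)(grid)
                    for j in range(d)], axis=1)   # orthonormal basis under D
    w = (Phi ** 2).sum(axis=1)      # sup_{||f||_D=1} f(x)^2 at each grid point
    kappa = w.mean()                # ~ d
    Dprime = w / w.sum()            # D'(x) = D(x) * w(x) / kappa

    alpha = rng.standard_normal(d)
    f_vals = Phi @ alpha            # some f in F; ||f||_D^2 = ||alpha||_2^2

    idx = rng.choice(N, size=m, p=Dprime)         # x_i ~ D'
    # Reweight by D(x_i)/D'(x_i) = kappa / w(x_i); each term is then at most
    # kappa * ||f||_D^2, which is what matrix Chernoff exploits.
    est = np.mean(kappa / w[idx] * f_vals[idx] ** 2)
    print(f"kappa ~ {kappa:.2f} (= d = {d})")
    print(f"true ||f||_D^2 = {alpha @ alpha:.3f}, estimate = {est:.3f}")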

Bounding κ for linear function spaces

    κ = E_x sup_{f∈F, ‖f‖_D=1} f(x)².

Express f ∈ F via an orthonormal basis: f(x) = Σ_j α_j φ_j(x). Then

    sup_{‖f‖_D=1} f(x)² = sup_{‖α‖₂=1} ⟨α, {φ_j(x)}_{j=1}^d⟩² = Σ_{j=1}^d φ_j(x)².

Hence

    κ = Σ_{j=1}^d E_x[φ_j(x)²] = d.
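Both steps are easy to check numerically (our sketch, using the scaled Legendre basis, which is orthonormal under uniform D on [−1, 1]):

    # (1) sup_{||alpha||_2=1} <alpha, phi(x)>^2 = ||phi(x)||^2 (Cauchy-Schwarz,
    #     witnessed by alpha = phi(x)/||phi(x)||); (2) kappa = E_x ||phi(x)||^2 = d.
    import numpy as np
    from numpy.polynomial import legendre

    rng = np.random.default_rng(2)
    d = 6

    def phi(x):
        return np.stack([np.sqrt(2 * j + 1) * legendre.Legendre.basis(j)(x)
                         for j in range(d)], axis=-1)

    v = phi(0.3)                    # phi(x) at an arbitrary point x = 0.3
    dirs = rng.standard_normal((10_000, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    print((dirs @ v).max() ** 2, "<=", v @ v)   # no unit alpha beats ||phi(x)||^2

    x = rng.uniform(-1, 1, size=200_000)
    print("kappa ~", (phi(x) ** 2).sum(axis=1).mean(), "= d =", d)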

Query model: so far

Upsampling x proportional to sup_f f(x)² gets O(d log d) sample complexity.

◮ Essentially the same as leverage score sampling.
◮ Also analogous to Spielman-Srivastava graph sparsification.

Can we bring this down to O(d)?

◮ Not with independent sampling (coupon collector).
◮ Analogous to Batson-Spielman-Srivastava linear-size sparsification.
◮ Yes: using Lee-Sun sparsification.

Mean-zero noise: E[(f̂(x) − f(x))²] ≤ ε E[(y − f(x))²].
Generic noise: E[(f̂(x) − f(x))²] ≤ (1 + ε) E[(y − f(x))²].

Active learning

The query model supposes we know D and can query any point. Active learning:

◮ Get x_1, . . . , x_m ∼ D.
◮ Pick S ⊆ [m] of size s.
◮ Learn y_i for i ∈ S.

Minimize s:

◮ m → ∞ ⇒ we learn D and can query any point ⇒ query model.
◮ Hence s = Θ(d) is optimal.

Minimize m:

◮ Label every point ⇒ agnostic learning.
◮ Hence m = Θ(K log d + K/ε) is optimal.

Our result: both at the same time.

◮ In this talk: mostly the s = O(d log d) version.
◮ Prior work: s = O((d log d)^{5/4}) [Sabato-Munos '14], s = O(d log d) via "volume sampling" [Derezinski-Warmuth-Hsu '18].

Active learning

Warmup: suppose we know D.

We can simulate the query algorithm via rejection sampling:

    Pr[label x_i] = (1/K) · sup_{f∈F, ‖f‖_D=1} f(x_i)².

This just needs s = O(d log d).

The chance each sample gets labeled is

    E_x[Pr[label x_i]] = κ/K = d/K.

Gives m = O(K log d) unlabeled samples and s = O(d log d) labeled samples.
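A sketch (ours) of this warmup for the degree-5 polynomial example. Note the 1/Pr[label] weights in the least-squares fit are our choice of estimator, added so the subsampled objective stays unbiased for ‖y − f‖²_D; the slide itself does not spell this out.

    # Rejection-sampling warmup: label x_i with probability w(x_i)/K, then fit
    # by inverse-probability-weighted least squares.
    import numpy as np
    from numpy.polynomial import legendre

    rng = np.random.default_rng(3)
    d, m = 6, 4000

    def phi(x):   # orthonormal basis under uniform D on [-1, 1]
        return np.stack([np.sqrt(2 * j + 1) * legendre.Legendre.basis(j)(x)
                         for j in range(d)], axis=-1)

    alpha_true = rng.standard_normal(d)
    x = rng.uniform(-1, 1, size=m)               # unlabeled stream, x_i ~ D
    P = phi(x)
    w = (P ** 2).sum(axis=1)                     # sup_{||f||_D=1} f(x_i)^2
    K = d ** 2                                   # K = d^2 for this F and D
    labeled = rng.random(m) < w / K              # Pr[label x_i] = w(x_i)/K
    y = P[labeled] @ alpha_true + rng.standard_normal(labeled.sum())

    sw = np.sqrt(K / w[labeled])                 # sqrt of 1/Pr[label]
    alpha_hat, *_ = np.linalg.lstsq(sw[:, None] * P[labeled], sw * y, rcond=None)

    print(f"labeled {labeled.sum()} of {m} (expect ~ m*kappa/K = {m * d / K:.0f})")
    print(f"||f_hat - f||_D^2 ~ {np.sum((alpha_hat - alpha_true) ** 2):.4f}")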

Active learning without knowing D

We want to perform rejection sampling with

    Pr[label x_i] = (1/K) · sup_{f∈F, ‖f‖_D=1} f(x_i)²,

but we don't know D.

We just need to estimate ‖f‖_D for all f ∈ F.

Matrix Chernoff gets this with m = O(K log d) unlabeled samples.

Gives m = O(K log d) unlabeled samples and s = O(d log d) labeled samples.

Can improve to m = O(K log d), s = O(d).
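For a linear space, "estimate ‖f‖_D for all f ∈ F" concretely means estimating the Gram matrix E_x[φ(x)φ(x)ᵀ] from unlabeled samples; its spectral error bounds the norm distortion for every f simultaneously, which is exactly what matrix Chernoff controls. A minimal sketch (ours), reusing the Legendre setup from above:

    # Empirical Gram matrix from unlabeled samples; the true G = I_d here
    # because the basis is orthonormal under D.
    import numpy as np
    from numpy.polynomial import legendre

    rng = np.random.default_rng(4)
    d, m = 6, 20_000

    x = rng.uniform(-1, 1, size=m)               # unlabeled samples from D
    P = np.stack([np.sqrt(2 * j + 1) * legendre.Legendre.basis(j)(x)
                  for j in range(d)], axis=1)
    G_hat = P.T @ P / m                          # estimates G = E[phi phi^T] = I_d

    # For f = alpha . phi: ||f||_S^2 / ||f||_D^2 = (alpha^T G_hat alpha) / (alpha^T alpha),
    # so the eigenvalues of G_hat bound the distortion uniformly over F.
    eigs = np.linalg.eigvalsh(G_hat)
    print(f"empirical/true norm^2 ratio in [{eigs.min():.3f}, {eigs.max():.3f}]")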

Getting to s = O(d)

Based on Lee-Sun '15.

The O(d log d) comes from coupon collector. Change to non-independent sampling:

◮ x_i ∼ D_i, where D_i depends on x_1, . . . , x_{i−1}.
◮ D_1 = D′, D_2 avoids points near x_1, etc.

Need two properties:

◮ Norms preserved for all functions in the class:

    E_{x∼D}[f(x)²] ≈ Σ_{i=1}^s α_i (D(x_i)/D_i(x_i)) f(x_i)²

◮ Noise variance bounded for every sample:

    α_i · sup_{f∈F, x} ( f(x)² / E_{x′∼D_i}[f(x′)²] ) =: α_i K_{D_i} ≤ ε

Both properties are achievable with Lee-Sun sparsification.

Nonlinear spaces

Consider functions with sparse Fourier representations:

    f(x) = Σ_{j=1}^d v_j e^{2πi f_j x}.

We can pick sample points x ∈ [0, 1], and want to minimize E_{x∈[0,1]}[(f̂(x) − f(x))²].

For noise tolerance, we need the empirical norm ≈ the actual norm.
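A tiny sketch (ours) of such a signal and of the quantity we need to control, the empirical vs. actual norm; the frequency range and sparsity are arbitrary choices for the demo.

    # A d-Fourier-sparse signal with off-grid frequencies, and its empirical
    # energy under uniform samples from [0, 1].
    import numpy as np

    rng = np.random.default_rng(5)
    d, m = 6, 2000
    freqs = rng.uniform(0, 50, size=d)           # arbitrary real frequencies f_j
    v = rng.standard_normal(d) + 1j * rng.standard_normal(d)

    def f(x):
        return np.exp(2j * np.pi * np.outer(np.atleast_1d(x), freqs)) @ v

    x = rng.uniform(0, 1, size=m)
    print("empirical E|f|^2 :", np.mean(np.abs(f(x)) ** 2))
    x_dense = np.linspace(0, 1, 200_000)
    print("actual    E|f|^2 ~", np.mean(np.abs(f(x_dense)) ** 2))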

Estimating the norm in nonlinear spaces

Uniform sampling depends on K = sup_x sup_{f∈F, ‖f‖_D=1} f(x)².

◮ It is unknown exactly what this is for Fourier-sparse signals.
◮ d² ≲ K ≲ d⁴ log³ d. [Chen-Kane-Price-Song '16]

Biasing the samples lets us reduce this to κ = E_x sup_{f∈F, ‖f‖_D=1} f(x)².

◮ d ≲ κ ≲ d log² d.

Analogous to the distinction between the Markov Brothers' inequality and Bernstein's inequality for polynomials.

Proof sketch: κ for Fourier-sparse functions

For any ∆ > 0, consider the degree-d polynomial p(z) = Σ_{i=0}^d β_i zⁱ with roots at e^{2πi f_j ∆} for all j. For any x,

    Σ_{i=0}^d β_i f(x + i∆) = Σ_{j=1}^d v_j e^{2πi f_j x} p(e^{2πi f_j ∆}) = 0.

In particular, for i∗ = arg max_i |β_i|,

    |f(x + i∗∆)| ≤ Σ_{i≠i∗} |f(x + i∆)|.

Hence (with a little more care)

    |f(x)|² ≤ 3 Σ_{i=−2d,…,2d; i≠0} |f(x + i∆)|².

Proof sketch: κ for Fourier-sparse functions

Lemma. If f is d-Fourier-sparse, then for all x and ∆ we have

    |f(x)|² ≤ 3 Σ_{i=−2d,…,2d; i≠0} |f(x + i∆)|².

Suppose D is uniform on [−1, 1]. Then for all x ∈ [−1, 1],

    |f(x)|² ≲ (d log d / (1 − |x|)) E_{x′}[f(x′)²],

by integrating ∆ from 0 to 1 − |x|.

Hence κ = E_x sup_{f∈F, ‖f‖_D=1} |f(x)|² ≲ d log² d.
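The lemma is easy to probe numerically. A sketch (ours) testing it on random d-Fourier-sparse signals; d, the frequency range, and the (x, ∆) distribution are arbitrary demo choices.

    # Check |f(x)|^2 <= 3 * sum_{i=-2d..2d, i!=0} |f(x + i*Delta)|^2 on random
    # Fourier-sparse signals and random (x, Delta).
    import numpy as np

    rng = np.random.default_rng(6)
    d = 4
    shifts = np.concatenate([np.arange(-2 * d, 0), np.arange(1, 2 * d + 1)])

    worst = 0.0
    for _ in range(1000):
        freqs = rng.uniform(-20, 20, size=d)
        v = rng.standard_normal(d) + 1j * rng.standard_normal(d)
        f = lambda t: np.exp(2j * np.pi * np.outer(np.atleast_1d(t), freqs)) @ v
        x, delta = rng.uniform(-1, 1), rng.uniform(1e-3, 0.1)
        rhs = 3 * np.sum(np.abs(f(x + shifts * delta)) ** 2)
        worst = max(worst, np.abs(f(x))[0] ** 2 / rhs)
    print(f"max |f(x)|^2 / RHS over trials: {worst:.3f} (lemma says <= 1)")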

Query/active learning in nonlinear spaces

Biased sampling means ‖f‖_S ≈ ‖f‖_D in O(κ) samples for any single f ∈ F.

Linear spaces: O(κ log d) samples for every f ∈ F by matrix Chernoff.

Sparse Fourier: need to union bound over a net.

◮ The known net size is 2^{Õ(d³)}.
◮ Gives Õ(d³ κ) = Õ(d⁴) queries/labeled samples.
◮ Gives Õ(d³ K) = Õ(d⁷) unlabeled samples.

Conclusions and open questions

Active learning can be optimal in both criteria simultaneously:

◮ O(K log d + K/ε) unlabeled examples.
◮ O(d/ε) labeled examples.

Gets some improvement for Fourier-sparse signals.

◮ Tight results via chaining and/or a better net?

Can we go beyond ℓ2 and linear spaces?

◮ Logistic regression?

Better theory for active learning?

◮ Choose sample points sequentially.
◮ Dynamically changing functions.

Thank You
