TIME/ACCURACY TRADEOFFS FOR LEARNING A RELU WITH RESPECT TO GAUSSIAN MARGINALS - PowerPoint PPT Presentation



SLIDE 1

TIME/ACCURACY TRADEOFFS FOR LEARNING A RELU WITH RESPECT TO GAUSSIAN MARGINALS

Surbhi Goel, Sushrut Karmalkar, Adam Klivans
The University of Texas at Austin

SLIDE 2

WHAT IS RELU REGRESSION?

Given: Samples (x, y) drawn from a distribution 𝒟 with arbitrary labels

Output: ŵ ∈ ℝ^d such that the test error satisfies

𝔼_𝒟[(relu(ŵ ⋅ x) − y)²] ≤ opt + ϵ

where opt := min_w 𝔼_𝒟[(relu(w ⋅ x) − y)²] is the loss of the best-fitting ReLU, and relu(a) = max(0, a).

SLIDE 3

WHAT IS RELU REGRESSION?

Given: Samples (x, y) drawn from a distribution 𝒟 with arbitrary labels

Output: ŵ ∈ ℝ^d such that the test error satisfies

𝔼_𝒟[(relu(ŵ ⋅ x) − y)²] ≤ opt + ϵ

where opt := min_w 𝔼_𝒟[(relu(w ⋅ x) − y)²] is the loss of the best-fitting ReLU, and relu(a) = max(0, a).

The underlying optimization problem is non-convex!
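As a minimal illustration of that non-convexity claim (my own example, not taken from the slides), a single one-dimensional sample already breaks convexity of the ReLU squared loss:

```python
# Minimal sketch: the ReLU squared loss f(w) = (relu(w*x) - y)^2 is non-convex
# even for a single sample (x, y) = (1, 1) in one dimension.
def f(w: float) -> float:
    return (max(0.0, w) - 1.0) ** 2

# Convexity would require f(0) <= (f(-1) + f(1)) / 2, but here
# f(-1) = 1, f(0) = 1, f(1) = 0, so the midpoint inequality fails.
assert f(0.0) > 0.5 * (f(-1.0) + f(1.0))
print(f(-1.0), f(0.0), f(1.0))  # 1.0 1.0 0.0
```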

SLIDE 4

PRIOR WORK - POSITIVE

Mean-zero noise: Isotonic regression over the sphere [Kalai-Sastry’08, Kakade-Kalai-Kanade-Shamir’11]

Noiseless: Gradient descent over Gaussian input [Soltanolkotabi’17]

SLIDE 5

PRIOR WORK - POSITIVE

Mean-zero noise: Isotonic regression over the sphere [Kalai-Sastry’08, Kakade-Kalai-Kanade-Shamir’11]

Noiseless: Gradient descent over Gaussian input [Soltanolkotabi’17]

These results require strong restrictions on the input or the labels

SLIDE 6

PRIOR WORK - NEGATIVE

Minimizing training loss is NP-hard [Manurangsi-Reichman’18]

Hardness over the uniform distribution on the Boolean cube [G-Kanade-K-Thaler’17]

SLIDE 7

PRIOR WORK - NEGATIVE

Minimizing training loss is NP-hard [Manurangsi-Reichman’18]

Hardness over the uniform distribution on the Boolean cube [G-Kanade-K-Thaler’17]

These results use special discrete distributions to prove hardness

SLIDE 8

DISTRIBUTION ASSUMPTION

Assumption: For all (x, y) ∼ 𝒟, x ∼ 𝒩(0, I_d) and y ∈ [0, 1]

SLIDE 9

DISTRIBUTION ASSUMPTION

Assumption: For all (x, y) ∼ 𝒟, x ∼ 𝒩(0, I_d) and y ∈ [0, 1]

Gaussian input allows for positive results in the noiseless setting
[Tian’17, Soltanolkotabi’17, Li-Yuan’17, Zhong-Song-Jain-Bartlett-Dhillon’17, Brutzkus-Globerson’17, Zhong-Song-Dhillon’17, Du-Lee-Tian-Poczos-Singh’18, Zhang-Yu-Wang-Gu’19, Fu-Chi-Liang’19, …]

SLIDE 10

DISTRIBUTION ASSUMPTION

Assumption: For all (x, y) ∼ 𝒟, x ∼ 𝒩(0, I_d) and y ∈ [0, 1]

Gaussian input allows for positive results in the noiseless setting
[Tian’17, Soltanolkotabi’17, Li-Yuan’17, Zhong-Song-Jain-Bartlett-Dhillon’17, Brutzkus-Globerson’17, Zhong-Song-Dhillon’17, Du-Lee-Tian-Poczos-Singh’18, Zhang-Yu-Wang-Gu’19, Fu-Chi-Liang’19, …]

Gaussian input lets us explicitly compute closed-form expressions for the loss/gradient
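As an illustration of why the Gaussian assumption helps (my own sanity check, not formulas quoted from the paper), two standard identities of the kind such closed-form computations rely on can be verified numerically: for x ∼ 𝒩(0, I_d), 𝔼[relu(w ⋅ x)²] = ‖w‖²/2 and 𝔼[relu(w ⋅ x) x] = w/2.

```python
# Monte Carlo check of two standard Gaussian/ReLU identities (illustrative only):
#   E_{x ~ N(0, I_d)}[relu(w . x)^2] = ||w||^2 / 2
#   E_{x ~ N(0, I_d)}[relu(w . x) x] = w / 2
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 2_000_000          # dimension and sample size chosen for illustration
w = rng.normal(size=d)

x = rng.normal(size=(n, d))  # x ~ N(0, I_d)
r = np.maximum(0.0, x @ w)   # relu(w . x)

print(np.mean(r ** 2), np.dot(w, w) / 2)      # the two numbers should agree
print((r[:, None] * x).mean(axis=0), w / 2)   # coordinate-wise agreement
```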

SLIDE 11

HARDNESS RESULT

There exists NO algorithm for ReLU regression up to error ϵ in time d^{o(log(1/ϵ))} under standard computational hardness assumptions.

SLIDE 12

HARDNESS RESULT

There exists NO algorithm for ReLU regression up to error ϵ in time d^{o(log(1/ϵ))} under standard computational hardness assumptions.

The problem is as hard as learning sparse parities with noise!

SLIDE 13

HARDNESS RESULT

There exists NO algorithm for ReLU regression up to error ϵ in time d^{o(log(1/ϵ))} under standard computational hardness assumptions.

The problem is as hard as learning sparse parities with noise!

First hardness result under the Gaussian assumption!
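To unpack that last connection, here is a rough sketch of the standard reduction template (my paraphrase; the exact parameters are in the paper). Learning a k-sparse parity with noise over d variables is conjectured to require d^{Ω(k)} time, and the reduction embeds such an instance into ReLU regression with k growing like log(1/ϵ):

```latex
% Hedged sketch: the k = \Theta(\log(1/\epsilon)) scaling is my paraphrase of
% the standard reduction template, not a verbatim statement from the paper.
\[
  \text{ReLU regression to error } \epsilon \text{ in time } d^{\,o(\log(1/\epsilon))}
  \;\Longrightarrow\;
  \text{learning } k\text{-sparse parities with noise in time } d^{\,o(k)}
  \quad \text{for } k = \Theta(\log(1/\epsilon)),
\]
\[
  \text{contradicting the conjectured } d^{\,\Omega(k)} \text{ lower bound for sparse parities.}
\]
```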

SLIDE 14

HARDNESS FOR GRADIENT DESCENT

Unconditionally, NO statistical query (SQ) algorithm with bounded-norm queries can perform ReLU regression up to error ϵ with fewer than d^{o(log(1/ϵ))} queries.

SLIDE 15

HARDNESS FOR GRADIENT DESCENT

Unconditionally, NO statistical query (SQ) algorithm with bounded-norm queries can perform ReLU regression up to error ϵ with fewer than d^{o(log(1/ϵ))} queries.

Gradient Descent (GD) is well known to be an SQ algorithm

SLIDE 16

HARDNESS FOR GRADIENT DESCENT

Unconditionally, NO statistical query (SQ) algorithm with bounded-norm queries can perform ReLU regression up to error ϵ with fewer than d^{o(log(1/ϵ))} queries.

Gradient Descent (GD) is well known to be an SQ algorithm

GD can NOT solve ReLU regression in polynomial time

SLIDE 17

HARDNESS FOR GRADIENT DESCENT

Unconditionally, NO statistical query (SQ) algorithm with bounded-norm queries can perform ReLU regression up to error ϵ with fewer than d^{o(log(1/ϵ))} queries.

Gradient Descent (GD) is well known to be an SQ algorithm

GD can NOT solve ReLU regression in polynomial time

Recall that GD works in the noiseless setting [Soltanolkotabi’17]
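To make the "GD is an SQ algorithm" point concrete, here is a minimal sketch (my illustration, with hypothetical parameter choices) of gradient descent on the ReLU squared loss in which every gradient is obtained only as a noisy expectation, i.e., through a simulated statistical query oracle:

```python
# Minimal sketch: gradient descent on the ReLU squared loss run purely through
# a (simulated) statistical query oracle. Each query asks for an expectation
# E_{(x,y)}[g(x, y)] and gets it back only up to an additive tolerance tau.
# All constants (d, tau, step size, iteration count) are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d, tau = 5, 1e-3
w_true = rng.normal(size=d)          # planted ReLU used only to simulate labels

def sq_oracle(g, n=100_000):
    """Answer the statistical query E[g(x, y)] up to roughly tau, by averaging
    over fresh samples with x ~ N(0, I_d) and noiseless labels y = relu(w_true . x)."""
    x = rng.normal(size=(n, d))
    y = np.maximum(0.0, x @ w_true)
    return g(x, y).mean(axis=0) + rng.uniform(-tau, tau, size=d)

w = rng.normal(size=d)               # random initialization
for _ in range(200):
    # gradient of 0.5 * (relu(w . x) - y)^2 in w, phrased as one vector-valued query
    grad = sq_oracle(lambda x, y, w=w:
                     ((np.maximum(0.0, x @ w) - y) * (x @ w > 0.0))[:, None] * x)
    w -= 0.1 * grad

print("distance to the planted w:", np.linalg.norm(w - w_true))
```

Because each gradient coordinate is just a statistical query (of bounded norm, in the sense of the slide), the SQ lower bound above applies directly to gradient descent.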

SLIDE 18

APPROXIMATION RESULT

There exists an algorithm for ReLU regression with error O(opt^{2/3}) + ϵ in time poly(d, 1/ϵ).

SLIDE 19

APPROXIMATION RESULT

There exists an algorithm for ReLU regression with error O(opt^{2/3}) + ϵ in time poly(d, 1/ϵ).

Can get O(opt) + ϵ in time poly(d, 1/ϵ) [Diakonikolas-G-K-K-Soltanolkotabi’TBD]

SLIDE 20

APPROXIMATION RESULT

There exists an algorithm for ReLU regression with error O(opt^{2/3}) + ϵ in time poly(d, 1/ϵ).

Finding approximate solutions is tractable!

Can get O(opt) + ϵ in time poly(d, 1/ϵ) [Diakonikolas-G-K-K-Soltanolkotabi’TBD]
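A quick numerical illustration (my own example) of how much the improvement from O(opt^{2/3}) + ϵ to O(opt) + ϵ can matter when the best-fitting ReLU is very accurate:

```python
# Illustrative only: comparing the two error scales for a small value of opt.
opt = 1e-6
print(opt ** (2 / 3))   # opt^{2/3} = 1e-4, the scale of the first guarantee
print(opt)              # opt       = 1e-6, the scale of the improved guarantee
```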

SLIDE 21

THANK YOU!

Poster @ East Exhibition Hall B + C #235