SLIDE 1

An Investigation of Why Overparameterization Exacerbates Spurious Correlations

Shiori Sagawa*, Aditi Raghunathan*, Pang Wei Koh*, Percy Liang

SLIDE 2

Models can latch onto spurious correlations

Input: bird image → ML model → label: bird type (waterbird vs. landbird)

Misleading heuristics may work on most training examples but do not always hold.

Sagawa et al. (2020), Wah et al. (2011), Zhou et al. (2017)

SLIDE 3

Models can latch onto spurious correlations

Input: bird image → ML model → prediction: waterbird
True label: waterbird
Spurious correlation: water background

Misleading heuristics may work on most training examples but do not always hold.

Sagawa et al. (2020), Wah et al. (2011), Zhou et al. (2017)

SLIDE 4

Models can latch onto spurious correlations

Input: bird image → ML model → prediction: landbird
True label: waterbird
Spurious correlation: land background

Misleading heuristics may work on most training examples but do not always hold.

Sagawa et al. (2020), Wah et al. (2011), Zhou et al. (2017)

SLIDE 5

Models can latch onto spurious correlations

Input: face image → ML model → label: hair color (blond vs. dark hair)

Sagawa et al. (2020), Liu et al. (2015)

SLIDE 6

Models can latch onto spurious correlations

Input: face image → ML model → prediction: dark hair
True label: blond hair
Spurious correlation: gender

Sagawa et al. (2020), Liu et al. (2015)

SLIDE 7

Models can latch onto spurious correlations

label: object (waterbird vs. landbird); spurious attribute: background (water vs. land)

             water background   land background
waterbird    majority           minority
landbird     minority           majority

Sagawa et al. (2020)

SLIDE 8

Models perform well on average

label: object (waterbird vs. landbird); spurious attribute: background (water vs. land)

Average error: 0.03
[Figure: per-group errors of 0.004, 0.05, 0.21, and 0.40, alongside the average error]

Sagawa et al. (2020)

SLIDE 9

But models can have high worst-group error

label: object (waterbird vs. landbird); spurious attribute: background (water vs. land)

Worst-group error: 0.40
[Figure: per-group errors of 0.004, 0.05, 0.21, and 0.40, alongside the average and worst-group errors]

Sagawa et al. (2020)
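The two metrics differ only in how per-example mistakes are aggregated. A minimal sketch (hypothetical helper name; assumes numpy arrays of labels, predictions, and group ids):

```python
import numpy as np

def average_and_worst_group_error(y_true, y_pred, groups):
    """Average error weights every example equally; worst-group error
    is the maximum error over the (label, attribute) groups."""
    errors = (y_true != y_pred).astype(float)
    avg = errors.mean()
    per_group = {g: errors[groups == g].mean() for g in np.unique(groups)}
    return avg, max(per_group.values())
```

With group errors like those above (0.004, 0.05, 0.21, 0.40), the average can sit near 0.03 while the worst group sits at 0.40.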

SLIDE 10

Approaches for improving worst-group error fail on high-capacity models

  • Upweight minority groups

Low-capacity models: perform well on all four (y, a) groups
  • More robust to the spurious correlation
  • Low worst-group error

High-capacity models: perform well only on the majority groups (y = a), not the minority groups (y = −a)
  • Rely on the spurious correlation
  • High worst-group error

[Figure: average vs. worst-group error for low- and high-capacity models]

Sagawa et al. (2020)
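For concreteness, one common way to implement the upweighting idea (a sketch; inverse group frequency weighting is an assumption here, not necessarily the talk's exact objective):

```python
import numpy as np

def reweighted_loss(losses, groups):
    """Importance-weight each example so every group contributes
    equally to the objective, regardless of its size."""
    weights = np.zeros_like(losses)
    n, uniq = len(losses), np.unique(groups)
    for g in uniq:
        mask = groups == g
        weights[mask] = n / (len(uniq) * mask.sum())
    return (weights * losses).mean()
```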
SLIDE 11

Overparameterization hurts worst-group error for models trained with the reweighted objective

[Figure: average error and worst-group error vs. model size]
  • Average error: overparameterized is better than underparameterized
  • Worst-group error: overparameterized is worse than underparameterized

Our work: why does overparameterization exacerbate worst-group error?
SLIDE 12

Overview

  • 1. Empirical results
  • 2. Analytical model and theoretical results
  • 3. Subsampling
SLIDE 13

Overparameterization exacerbates worst-group error

[Figures: worst-group error vs. model size, for a ResNet10 and for logistic regression on random features]

SLIDE 14

Intuition: overparameterized models learn the spurious attribute and memorize minority groups

Majority groups: (y = 1, a = 1) and (y = −1, a = −1); minority groups: (y = 1, a = −1) and (y = −1, a = 1)

Overparameterized models combine a generalizable component (the spurious attribute) with a non-generalizable, "memorizing" component (the minority examples).

SLIDE 15

Overview

  • 1. Empirical results
  • 2. Analytical model and theoretical results
  • 3. Subsampling
SLIDE 16

Toy example: data

Each example has a label y ∈ {1, −1} and a spurious attribute a ∈ {1, −1}.
Majority groups: a = y; minority groups: a = −y.
Majority fraction: the fraction of training examples in the majority groups.

SLIDE 17

Toy example: data

Features x = (x_core, x_spu): x_core is a noisy encoding of the label y, and x_spu is a noisy encoding of the spurious attribute a.

Spurious-to-core information ratio (SCR): how informative the spurious feature is relative to the core feature; higher SCR means the core feature is noisier.

SLIDE 18

Toy example: data

Features x = (x_core, x_spu, x_noise). The N-dimensional noise component carries no signal, but for large N >> n (more noise dimensions than training points) it lets individual examples be "memorized".
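A minimal sketch of this generative process (the function name and parameter values are hypothetical; the paper's exact constants may differ):

```python
import numpy as np

def make_toy_data(n=1000, p_maj=0.9, N=3000,
                  var_core=1.0, var_spu=0.1, var_noise=1.0, rng=None):
    """Label y, spurious attribute a (a = y with probability p_maj),
    core feature ~ N(y, var_core), spurious feature ~ N(a, var_spu),
    plus N pure-noise coordinates enabling memorization when N >> n."""
    rng = rng or np.random.default_rng(0)
    y = rng.choice([-1, 1], size=n)
    flip = rng.random(n) > p_maj               # minority examples: a = -y
    a = np.where(flip, -y, y)
    x_core = y + np.sqrt(var_core) * rng.standard_normal(n)
    x_spu = a + np.sqrt(var_spu) * rng.standard_normal(n)
    x_noise = np.sqrt(var_noise / N) * rng.standard_normal((n, N))
    return np.column_stack([x_core, x_spu, x_noise]), y, a
```

Here var_spu < var_core, i.e. a high SCR: the spurious feature is the cleaner signal.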

SLIDE 19

Toy example: linear classifier

  • Logistic regression
  • In the overparameterized regime, equivalent to the max-margin classifier

[Figure: model weights over the core, spurious, and noise features]
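Tying the pieces together, a sketch (assumes scikit-learn and the make_toy_data helper sketched above; nearly unregularized plain logistic regression standing in for the max-margin limit) that measures worst-group error in the overparameterized regime:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Overparameterized: noise dimension N far exceeds the n training points.
X, y, a = make_toy_data(n=500, N=2000)
clf = LogisticRegression(C=1e6, max_iter=10_000).fit(X, y)

# Evaluate per group on fresh data from the same distribution.
X_te, y_te, a_te = make_toy_data(n=5000, N=2000, rng=np.random.default_rng(1))
err = clf.predict(X_te) != y_te
worst = max(err[(y_te == s) & (a_te == t)].mean()
            for s in (-1, 1) for t in (-1, 1))
print(f"worst-group error: {worst:.2f}")
```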

SLIDE 20

Worst-group error is provably higher in the overparameterized regime

Theorem (informal). Under a sufficiently high majority fraction and a sufficiently high SCR, with high probability the overparameterized (max-margin) classifier incurs high worst-group error, whereas the underparameterized model, in the asymptotic regime, achieves low worst-group error.

Conditions: high majority fraction, high SCR

SLIDE 21

Underparameterized models need to learn the core feature to achieve low reweighted loss

[Figure: learning the spurious feature → ✗ high reweighted loss; learning the core feature → ✓ low reweighted loss]

SLIDE 22

In the overparameterized regime, the minimum-norm inductive bias favors less memorization

Learning spurious + memorizing the minority groups: few examples memorized → ✓ low norm
Learning core + memorizing outliers: many examples memorized → ✗ high norm

Norm scales with the number of points "memorized".
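A small numerical illustration of the norm-scaling claim (a sketch, not the paper's argument; parameters are made up): interpolating k arbitrary labels with pure noise features requires a weight norm that grows with k:

```python
import numpy as np

# Minimum-norm interpolation of k random +/-1 targets using pure noise
# features: the required weight norm grows with k, the number of points
# "memorized" (here roughly sqrt(k)).
rng = np.random.default_rng(0)
N = 5000                                   # noise dimension, N >> k
for k in (10, 100, 1000):
    Z = rng.standard_normal((k, N)) / np.sqrt(N)
    targets = rng.choice([-1.0, 1.0], size=k)
    w, *_ = np.linalg.lstsq(Z, targets, rcond=None)   # min-norm solution
    print(k, round(np.linalg.norm(w), 1))
```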

SLIDE 23

Intuition: memorizing as few examples as possible under the min-norm inductive bias

[Figure: a candidate model, with train error broken down over the four (y, a) groups]

SLIDE 24

Learn spurious → memorize minority, low norm

[Figure: a model that uses the spurious feature; train error is 0 on the majority groups (a = y) and 1 on the minority groups (a = −y)]

SLIDE 25

Learn spurious → memorize minority, low norm

[Figure: the minority-group examples are the points to memorize]

SLIDE 26

Learn spurious → memorize minority, low norm

[Figure: after memorizing the few minority points, train error is 0 on all groups] → ✓ low norm

SLIDE 27

Learn core → memorize more, high norm

[Figure: a model that uses the core feature; train error is > 0 on all four (y, a) groups]

SLIDE 28

Learn core → memorize more, high norm

[Figure: the outliers in every group are the points to memorize]

SLIDE 29

Learn core → memorize more, high norm

[Figure: memorizing the many outliers] → ✗ high norm
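As a back-of-the-envelope check of "memorize more" (a sketch under the toy model's Gaussian assumptions, with hypothetical parameter values; assumes scipy), compare how many points each strategy must memorize:

```python
from scipy.stats import norm

n, p_maj = 1000, 0.9
var_core = 1.0        # high SCR: the core feature is noisy

# Learn spurious: memorize exactly the minority examples.
memorize_spurious = (1 - p_maj) * n

# Learn core: memorize the examples whose core feature lands on the
# wrong side of zero, i.e. P(N(1, var_core) < 0) per example.
memorize_core = norm.cdf(-1 / var_core ** 0.5) * n

print(memorize_spurious, memorize_core)   # ~100 vs. ~159: core memorizes more
```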

SLIDE 30

Overview

  • 1. Empirical results
  • 2. Simulations on synthetic data
  • 3. Subsampling
SLIDE 31

Reweighting vs. subsampling

[Figure: number of examples per (y, a) group. Upweighting keeps all examples and reweights the minority groups (y = 1, a = −1) and (y = −1, a = 1); subsampling discards majority-group examples until all four groups are equally sized]

Subsampling:
  • Reduces the majority fraction
  • Lowers the memorization cost of learning the core feature

Chawla et al. (2011)
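A minimal sketch of the subsampling side (hypothetical helper; group-balanced subsampling as in the slide, assuming numpy):

```python
import numpy as np

def subsample_to_balance(y, a, rng=None):
    """Keep every (y, a) group at the size of the smallest group,
    discarding surplus majority-group examples at random."""
    rng = rng or np.random.default_rng(0)
    groups = [np.flatnonzero((y == s) & (a == t))
              for s in (-1, 1) for t in (-1, 1)]
    m = min(len(g) for g in groups)
    keep = np.concatenate([rng.choice(g, size=m, replace=False)
                           for g in groups])
    return np.sort(keep)
```

Training on X[idx], y[idx] with idx = subsample_to_balance(y, a) shrinks the majority fraction to 1/2, which removes the incentive to memorize the minority groups.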

SLIDE 32

Reweighting vs. subsampling

[Figure: number of examples per (y, a) group under upweighting and under subsampling]

Chawla et al. (2011)

SLIDE 33

Subsampling the majority group → overparameterization helps worst-group error

[Figure: worst-group error vs. model size under upweighting and under subsampling]

Potential tension: using all of the data and using large overparameterized models both help average error, but we cannot have both while keeping worst-group error low.

SLIDE 34

Thanks!

Thank you to Yair Carmon, John Duchi, Tatsunori Hashimoto, Ananya Kumar, Yiping Lu, Tengyu Ma, and Jacob Steinhardt. Funded by an Open Philanthropy Project Award, the Stanford Graduate Fellowship, the Google PhD Fellowship, the Open Philanthropy Project AI Fellowship, and the Facebook Fellowship Program.

Shiori Sagawa*, Aditi Raghunathan*, Pang Wei Koh*, Percy Liang
