An Investigation of Why Overparameterization Exacerbates Spurious Correlations
Shiori Sagawa*, Aditi Raghunathan*, Pang Wei Koh*, Percy Liang
Models can latch onto spurious correlations
- Misleading heuristics: they might work on most training examples but may not always hold up
- Task: input = bird image → ML model → label = bird type (waterbird vs. landbird)
- Spurious correlation: background (water vs. land)
- Water background: true label = waterbird, prediction = waterbird ✓
- Land background: true label = waterbird, prediction = landbird ✕
Sagawa et al. (2020), Wah et al. (2011), Zhou et al. (2017)
Models can latch onto spurious correlations
- Task: input = face image → ML model → label = hair color (blonde hair vs. dark hair)
- Spurious correlation: gender
- True label = blonde hair, prediction = dark hair ✕
Sagawa et al. (2020), Liu et al. (2015)
Models can latch onto spurious correlations
- Label: object (waterbird vs. landbird); spurious attribute: background (water vs. land)
- Groups: waterbird on water background (majority), waterbird on land background (minority), landbird on water background (minority), landbird on land background (majority)
Sagawa et al. (2020)
Models perform well on average
- Average error: 0.03
- Per-group errors: 0.004, 0.05, 0.21, 0.40
Sagawa et al. (2020)
But models can have high worst-group error
- Worst-group error: 0.40, even though the average error is 0.03
- Per-group errors: 0.004, 0.05, 0.21, 0.40
Sagawa et al. (2020)
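A minimal sketch of the gap between the two metrics. The group sizes and error rates below are assumed purely for illustration (they are not the actual dataset statistics), but they show how a small average error can coexist with a large worst-group error:

```python
import numpy as np

# Hypothetical per-group sizes and error rates (illustrative assumptions only).
group_sizes = np.array([3000, 200, 60, 1000])
group_errors = np.array([0.004, 0.21, 0.40, 0.05])

# Average error weights each group by how common it is, so the large majority
# groups dominate and failures on small minority groups are hidden.
average_error = np.sum(group_sizes * group_errors) / np.sum(group_sizes)

# Worst-group error looks only at the hardest group.
worst_group_error = np.max(group_errors)

print(f"average error:     {average_error:.3f}")    # small (~0.03)
print(f"worst-group error: {worst_group_error:.2f}")  # large (0.40)
```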
Approaches for improving worst-group error fail on high-capacity models
- Approach: upweight minority groups in the training objective
- Low-capacity models: correct on all four groups (y = ±1, a = ±1) → more robust to the spurious correlation, low worst-group error
- High-capacity models: correct on the majority groups (a = y) but wrong on the minority groups (a = −y) → relies on the spurious correlation, high worst-group error
Sagawa et al. (2020)
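A minimal sketch of the upweighting approach in a linear setting, assuming group labels are available at training time and using inverse group frequency as the weight (one common choice; not necessarily the exact scheme from the talk):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_upweighted(X, y, a):
    """Fit logistic regression with each example weighted by the inverse
    frequency of its (label, attribute) group, so minority groups count
    as much as majority groups in the training objective."""
    group_id = (y == 1).astype(int) * 2 + (a == 1).astype(int)  # 4 groups from (y, a)
    counts = np.bincount(group_id, minlength=4)
    weights = 1.0 / counts[group_id]
    weights = weights * len(weights) / weights.sum()  # keep weights on an O(1) scale
    clf = LogisticRegression(max_iter=10_000)
    return clf.fit(X, y, sample_weight=weights)
```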
Overparameterization hurts worst-group error for models trained with the reweighted objective
- Average error: overparameterized is better than underparameterized
- Worst-group error: overparameterized is worse than underparameterized
Our work: why does overparameterization exacerbate worst-group error?
Overview
- 1. Empirical results
- 2. Analytical model and theoretical results
- 3. Subsampling
Overparameterization exacerbates worst-group error
[Figure: worst-group error vs. model size, for a ResNet10 and for logistic regression on random features]
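A minimal sketch of a random-features setup in this spirit: fixed random ReLU features followed by (effectively unregularized) logistic regression, where the number of random features controls overparameterization. The feature distribution and constants are assumptions for illustration, not taken from the talk:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def random_relu_features(X, n_features, rng):
    """Project inputs through a fixed random layer with a ReLU nonlinearity.
    Increasing n_features increases model capacity."""
    W = rng.normal(size=(X.shape[1], n_features)) / np.sqrt(X.shape[1])
    return np.maximum(X @ W, 0.0)

# Hypothetical usage: sweep n_features and track per-group error.
# rng = np.random.default_rng(0)
# Z_train = random_relu_features(X_train, n_features=10_000, rng=rng)
# clf = LogisticRegression(C=1e6, max_iter=10_000).fit(Z_train, y_train)
```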
Intuition: overparameterized models learn the spurious attribute and memorize minority groups
- Majority groups: y = 1, a = 1 and y = −1, a = −1; minority groups: y = 1, a = −1 and y = −1, a = 1
- Overparameterized models fit the majority groups via the generalizable spurious attribute, and fit the minority groups via non-generalizable "memorizing"
Overview
- 1. Empirical results
- 2. Analytical model and theoretical results
- 3. Subsampling
Toy example: data
- Label y ∈ {1, −1}, spurious attribute a ∈ {1, −1}
- Majority groups: a = y; minority groups: a = −y
- Majority fraction: the fraction of training examples that fall in the majority groups
Toy example: data
- Inputs contain core features and spurious features
- Spurious-to-core information ratio (SCR): how informative the spurious feature is relative to the core feature
- …
Toy example: data
- Inputs contain core features, spurious features, and N-dimensional noise features
- For large N ≫ n (many more noise dimensions than training points), individual training points can be "memorized" via the noise features
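A minimal sketch of a data generator in this spirit. Gaussian core/spurious/noise features and the particular parameterization of the SCR are assumptions made for illustration; the exact distributions and constants in the talk may differ:

```python
import numpy as np

def make_toy_data(n, p_maj, N, scr=2.0, sigma_core=1.0, rng=None):
    """Generate (X, y, a): label y, spurious attribute a, and features
    [core, spurious, noise]. Majority examples have a = y, minority a = -y.
    Higher SCR makes the spurious feature less noisy than the core feature."""
    rng = rng or np.random.default_rng(0)
    y = rng.choice([1, -1], size=n)
    is_majority = rng.random(n) < p_maj
    a = np.where(is_majority, y, -y)

    sigma_spu = sigma_core / np.sqrt(scr)           # assumed parameterization of SCR
    x_core = y + sigma_core * rng.normal(size=n)    # noisy view of the label
    x_spu = a + sigma_spu * rng.normal(size=n)      # cleaner view of the spurious attribute
    x_noise = rng.normal(size=(n, N)) / np.sqrt(N)  # high-dimensional noise; with N >> n,
                                                    # each point gets a nearly unique direction
    X = np.column_stack([x_core, x_spu, x_noise])
    return X, y, a

# X, y, a = make_toy_data(n=500, p_maj=0.9, N=5000)
```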
Toy example: linear classifier
- Logistic regression on the [core, spurious, noise] features
- In the overparameterized regime, equivalent to the max-margin (minimum-norm) classifier
- …
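A minimal sketch of fitting and evaluating this classifier, using very weakly regularized logistic regression as a stand-in for the max-margin solution (an assumption; one could instead run gradient descent on the logistic loss to convergence):

```python
from sklearn.linear_model import LogisticRegression

def fit_max_margin_like(X, y):
    """Weakly regularized logistic regression: on separable (overparameterized)
    data its direction approximates the max-margin / min-norm classifier."""
    clf = LogisticRegression(C=1e8, max_iter=100_000, fit_intercept=False)
    return clf.fit(X, y)

def worst_group_error(clf, X, y, a):
    """Evaluate error separately on each (y, a) group and report the worst."""
    errors = []
    for yy in (1, -1):
        for aa in (1, -1):
            mask = (y == yy) & (a == aa)
            if mask.any():
                errors.append((clf.predict(X[mask]) != y[mask]).mean())
    return max(errors)

# X, y, a = make_toy_data(n=500, p_maj=0.9, N=5000)  # from the earlier sketch
# clf = fit_max_margin_like(X, y)
# print(worst_group_error(clf, X, y, a))
```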
Worst-group error is provably higher in the overparameterized regime
Theorem (informal). With a sufficiently high majority fraction and sufficiently high SCR, with high probability the overparameterized model has high worst-group error, whereas the underparameterized model, in the asymptotic regime, attains low worst-group error.
- Intuition: the overparameterized model ends up learning the spurious feature; the underparameterized model learns the core feature.
Underparameterized models need to learn the core feature to achieve low reweighted loss
- Using the spurious feature → high reweighted loss ✕; using the core feature → low reweighted loss ✓
In the overparameterized regime, the minimum-norm inductive bias favors less memorization
- The norm scales with the number of points "memorized"
- Learning the spurious feature → memorize only the minority points → few examples memorized → ✓ low norm
- Learning the core feature → memorize the noisy outliers → many examples memorized → ✕ high norm
Intuition: memorizing as few examples as possible under the min-norm inductive bias
Learn spurious → memorize minority, low norm
- Model that uses the spurious feature: train error 0 on the majority groups (a = y), train error 1 on the minority groups (a = −y)
- Points to memorize: only the minority points → ✓ low norm
Learn core → memorize more, high norm
- Model that uses the (noisier) core feature: train error > 0 in every group (y = ±1, a = ±1)
- Points to memorize: outliers from all groups → ✕ high norm
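A rough back-of-the-envelope illustration of the norm comparison. Assuming each "memorized" point adds roughly a constant amount to the squared norm (so the norm grows with the number of memorized points), the constants below are made up purely to show why the min-norm bias can prefer the spurious solution when the majority fraction is high:

```python
# All numbers are illustrative assumptions, not quantities from the talk.
n = 1000            # training points
p_maj = 0.9         # majority fraction
outlier_rate = 0.3  # assumed fraction of points where the noisy core feature is misleading
c_mem = 1.0         # assumed squared-norm cost of memorizing one point via its noise direction

# "Learn spurious": generalize to the majority groups, memorize the minority points.
norm_sq_spurious = 1.0 + c_mem * (1 - p_maj) * n   # ~100 memorized points

# "Learn core": generalize via the core feature, memorize the misfit outliers everywhere.
norm_sq_core = 1.0 + c_mem * outlier_rate * n      # ~300 memorized points

print(norm_sq_spurious < norm_sq_core)  # True: min-norm bias prefers the spurious solution
```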
Overview
- 1. Empirical results
- 2. Simulations on synthetic data
- 3. Subsampling
Reweighting vs. subsampling
- Upweighting keeps all examples and raises the weight of minority examples; subsampling discards majority examples
- [Figure: number of examples per group (y = ±1, a = ±1) under upweighting vs. subsampling]
- Subsampling reduces the majority fraction, which lowers the memorization cost of learning the core feature
Chawla et al. (2011)
Subsampling the majority group → overparameterization helps worst-group error
- [Figure: worst-group error vs. model size under upweighting vs. subsampling]
- Potential tension between using all of the data vs. using large overparameterized models: both help average error, but we can't have both for good worst-group error.
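A minimal sketch of subsampling the majority groups so that every group ends up with the same number of examples (a hypothetical helper; the exact balancing rule used in the talk may differ):

```python
import numpy as np

def subsample_to_smallest_group(X, y, a, rng=None):
    """Downsample each (label, attribute) group to the size of the smallest
    group, discarding extra majority examples instead of reweighting them."""
    rng = rng or np.random.default_rng(0)
    groups = [np.flatnonzero((y == yy) & (a == aa)) for yy in (1, -1) for aa in (1, -1)]
    groups = [g for g in groups if len(g) > 0]
    n_min = min(len(g) for g in groups)
    keep = np.concatenate([rng.choice(g, size=n_min, replace=False) for g in groups])
    rng.shuffle(keep)
    return X[keep], y[keep], a[keep]

# X_sub, y_sub, a_sub = subsample_to_smallest_group(X, y, a)
```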
Thanks!
Thank you to Yair Carmon, John Duchi, Tatsunori Hashimoto, Ananya Kumar, Yiping Lu, Tengyu Ma, and Jacob Steinhardt Funded by Open Philanthropy Project Award, Stanford Graduate Fellowship, Google PhD Fellowship, Open Philanthropy Project AI Fellowship, and Facebook Fellowship Program.
Shiori Sagawa*, Pang Wei Koh*, Percy Liang, Aditi Raghunathan*