Learning with Pairwise Losses: Problems, Algorithms and Analysis - PowerPoint PPT Presentation



SLIDE 1

Learning with Pairwise Losses

Problems, Algorithms and Analysis

Purushottam Kar

Microsoft Research India

SLIDE 2

Outline

  β€’ Part I: Introduction to pairwise loss functions
    β€’ Example applications
  β€’ Part II: Batch learning with pairwise loss functions
    β€’ Learning formulation: no algorithmic details
    β€’ Generalization bounds
    β€’ The coupling phenomenon
    β€’ Decoupling techniques
  β€’ Part III: Online learning with pairwise loss functions
    β€’ A generic online algorithm
    β€’ Regret analysis
    β€’ Online-to-batch conversion bounds
    β€’ A decoupling technique for online-to-batch conversions

SLIDE 3

Part I: Introduction

SLIDE 4

What is a loss function?

  β€’ A loss function maps hypotheses to non-negative reals: $\ell : \mathcal{H} \to \mathbb{R}_+$
  β€’ We observe empirical losses on data $S = (x_1, \dots, x_n)$

$\ell_{x_j}(h) = \ell(h, x_j)$

  β€’ … and try to minimize them, e.g. in classification and regression (see the sketch below)

$\hat{h} = \arg\inf_{h \in \mathcal{H}} \hat{\mathcal{L}}(h), \quad \text{where } \hat{\mathcal{L}}(h) = \frac{1}{n} \sum_j \ell_{x_j}(h)$

  β€’ … in the hope that

$\Big\| \frac{1}{n} \sum_j \ell_{x_j}(\cdot) - \mathbb{E}\, \ell_x(\cdot) \Big\|_\infty \le \epsilon$

  β€’ … so that

$\mathcal{L}(\hat{h}) \le \mathcal{L}(h^*) + \epsilon, \quad \text{where } \mathcal{L}(h) = \mathbb{E}\, \ell_x(h)$

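To make the ERM pipeline on this slide concrete, here is a minimal sketch (not from the deck) on a toy problem: a finite class of constant predictors with the squared loss, where the empirical risk minimizer lands close to the population optimum. The distribution, hypothesis class and loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: hypotheses are constant predictions, data are scalars,
# and the unary loss is the squared error l_x(h) = (h - x)^2.
hypotheses = np.linspace(-2.0, 2.0, 81)            # a finite hypothesis class H
sample = rng.normal(loc=0.5, scale=1.0, size=200)  # training data x_1, ..., x_n

def empirical_risk(h, xs):
    """L_hat(h) = (1/n) * sum_j l_{x_j}(h) with l_x(h) = (h - x)^2."""
    return float(np.mean((h - xs) ** 2))

# Empirical risk minimization: pick h_hat minimizing L_hat over H.
risks = np.array([empirical_risk(h, sample) for h in hypotheses])
h_hat = hypotheses[np.argmin(risks)]

# For x ~ N(0.5, 1) the population risk is L(h) = (h - 0.5)^2 + 1, so uniform
# convergence predicts L(h_hat) is close to inf_h L(h) = 1.
print(h_hat, risks.min(), (h_hat - 0.5) ** 2 + 1.0)
```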
SLIDE 5

Metric Learning

  β€’ Penalize the metric for bringing blue and red points close
  β€’ The loss function needs to consider two points at a time!
  β€’ … in other words, a pairwise loss function
  β€’ E.g. (a coded-up version of this loss appears below)

$\ell_{(x_1, x_2)}(M) = \begin{cases} 1, & y_1 \ne y_2 \text{ and } M(x_1, x_2) < \delta_1 \\ 1, & y_1 = y_2 \text{ and } M(x_1, x_2) > \delta_2 \\ 0, & \text{otherwise} \end{cases}$

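A literal transcription of the 0/1 loss above, as a hedged sketch: the Mahalanobis-style form of $M$ and the threshold values are illustrative assumptions, not part of the slide.

```python
import numpy as np

def pairwise_metric_loss(M, x1, y1, x2, y2, delta1=0.5, delta2=2.0):
    """0/1 pairwise loss from the slide: penalize a metric M that puts
    differently-labelled points closer than delta1, or identically-labelled
    points farther apart than delta2 (delta1 < delta2 are illustrative)."""
    diff = x1 - x2
    dist = float(diff @ M @ diff)          # M(x1, x2) as a Mahalanobis-style distance
    if y1 != y2 and dist < delta1:
        return 1.0
    if y1 == y2 and dist > delta2:
        return 1.0
    return 0.0

# Example: the identity metric on two nearby, differently-labelled points is penalized.
x_a, x_b = np.array([0.0, 0.0]), np.array([0.1, 0.2])
print(pairwise_metric_loss(np.eye(2), x_a, -1, x_b, +1))   # -> 1.0
```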
SLIDE 6

Pairwise Loss Functions

  β€’ Typically, loss functions are based on the ground truth

$\ell_x(h) = \ell\big(h(x), y(x)\big)$

  β€’ Thus, for metric learning, loss functions look like

$\ell_{(x_1, x_2)}(h) = \ell\big(h(x_1, x_2), y(x_1, x_2)\big)$

  β€’ In the previous example, we had

$h(x_1, x_2) = M(x_1, x_2)$ and $y(x_1, x_2) = y_1 y_2$

  β€’ Useful to learn patterns that capture data interactions

SLIDE 7

Pairwise Loss Functions

Examples: ($\phi$ is any margin loss function, e.g. the hinge loss; hinge-based versions are sketched below)

  β€’ Metric learning [Jin et al NIPS β€˜09]

$\ell_{(x_1, x_2)}(M) = \phi\big(y_1 y_2 (1 - M(x_1, x_2))\big)$

  β€’ Preference learning [Xing et al NIPS β€˜02]
  β€’ S-goodness [Balcan-Blum ICML β€˜06]

$\ell_{(x_1, x_2)}(K) = \phi\big(y_1 y_2\, K(x_1, x_2)\big)$

  β€’ Kernel-target alignment [Cortes et al ICML β€˜10]
  β€’ Bipartite ranking, (p)AUC [Narasimhan-Agarwal ICML β€˜13]

$\ell_{(x_1, x_2)}(f) = \phi\big((f(x_1) - f(x_2))(y_1 - y_2)\big)$

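The following sketch instantiates the three pairwise losses above with the hinge loss as $\phi$. The Mahalanobis form of $M$ and the particular function signatures are illustrative assumptions.

```python
import numpy as np

def hinge(z):
    """Margin loss phi(z) = max(0, 1 - z)."""
    return max(0.0, 1.0 - z)

def metric_learning_loss(M, x1, y1, x2, y2):
    """phi( y1*y2 * (1 - M(x1, x2)) ), with M(x1, x2) a Mahalanobis-style distance."""
    diff = x1 - x2
    return hinge(y1 * y2 * (1.0 - float(diff @ M @ diff)))

def similarity_loss(K, x1, y1, x2, y2):
    """phi( y1*y2 * K(x1, x2) ) for a similarity/kernel function K."""
    return hinge(y1 * y2 * K(x1, x2))

def ranking_loss(f, x1, y1, x2, y2):
    """phi( (f(x1) - f(x2)) * (y1 - y2) ), the pairwise loss behind AUC-style objectives."""
    return hinge((f(x1) - f(x2)) * (y1 - y2))
```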
SLIDE 8

Learning Objectives in Pairwise Learning

  β€’ Given training data $x_1, x_2, \dots, x_n$
  β€’ Learn $h : \mathcal{X} \times \mathcal{X} \to \mathcal{Y}$ such that $\mathcal{L}(\hat{h}) \le \mathcal{L}(h^*) + \epsilon$ (will define $\mathcal{L}(\cdot)$ and $\hat{\mathcal{L}}(\cdot)$ shortly)

Challenges:

  β€’ Training data given as singletons, not pairs
  β€’ Algorithmic efficiency
  β€’ Generalization error bounds

SLIDE 9

Part II: Batch Learning

SLIDE 10

Part II: Batch Learning

Batch Learning for Unary Losses

SLIDE 11

Training with Unary Loss Functions

  β€’ Notion of empirical loss

$\hat{\mathcal{L}} : \mathcal{H} \to \mathbb{R}_+$

  β€’ Given training data $S = (x_1, \dots, x_n)$, the natural notion is

$\hat{\mathcal{L}}(\cdot) = \frac{1}{n} \sum_j \ell(\cdot, x_j)$

  β€’ Empirical risk minimization dictates that we find

$\hat{h}$ such that $\hat{\mathcal{L}}(\hat{h}) \le \inf_{h \in \mathcal{H}} \hat{\mathcal{L}}(h)$

  β€’ Note that $\hat{\mathcal{L}}(\cdot)$ is a U-statistic
  β€’ U-statistic: a notion of β€œtraining loss”

$\hat{\mathcal{L}} : \mathcal{H} \to \mathbb{R}_+ \;\text{ s.t. }\; \forall h \in \mathcal{H},\ \mathbb{E}\big[\hat{\mathcal{L}}(h)\big] = \mathcal{L}(h)$

SLIDE 12

Generalization Bounds for Unary Loss Functions

  β€’ Step 1: Bound the excess risk by the supremum excess risk

$\mathcal{L}(\hat{h}) - \hat{\mathcal{L}}(\hat{h}) \le \sup_{h \in \mathcal{H}} \big[\mathcal{L}(h) - \hat{\mathcal{L}}(h)\big]$

  β€’ Step 2: Apply McDiarmid’s inequality

$\hat{\mathcal{L}}(h)$ changes by at most $\mathcal{O}(1/n)$ when any single $x_j$ is changed, so with high probability

$\mathcal{L}(\hat{h}) - \hat{\mathcal{L}}(\hat{h}) \le \mathbb{E} \sup_{h \in \mathcal{H}} \big[\mathcal{L}(h) - \hat{\mathcal{L}}(h)\big] + \mathcal{O}\!\left(\tfrac{1}{\sqrt{n}}\right)$

  β€’ Step 3: Analyze the expected supremum excess risk

$\mathbb{E} \sup_{h \in \mathcal{H}} \big[\mathcal{L}(h) - \hat{\mathcal{L}}(h)\big] = \mathbb{E} \sup_{h \in \mathcal{H}} \big[\mathbb{E}\,\hat{\mathcal{L}}_{S'}(h) - \hat{\mathcal{L}}_S(h)\big] \le \mathbb{E} \sup_{h \in \mathcal{H}} \big[\hat{\mathcal{L}}_{S'}(h) - \hat{\mathcal{L}}_S(h)\big] \quad \text{(Jensen's inequality; } S' \text{ is an independent ghost sample)}$

SLIDE 13

Analyzing the Expected Supremum Excess Risk

  β€’ For unary losses

$\hat{\mathcal{L}}(\cdot) = \frac{1}{n} \sum_j \ell_{x_j}(\cdot)$

  β€’ Analyzing this term through symmetrization is easy:

$\mathbb{E} \sup_{h \in \mathcal{H}} \big[\hat{\mathcal{L}}_{S'}(h) - \hat{\mathcal{L}}_S(h)\big] = \frac{1}{n}\, \mathbb{E} \sup_{h \in \mathcal{H}} \sum_j \big[\ell_{x'_j}(h) - \ell_{x_j}(h)\big] \le \frac{2}{n}\, \mathbb{E} \sup_{h \in \mathcal{H}} \sum_j \epsilon_j\, \ell_{x_j}(h) \le \frac{2L}{n}\, \mathbb{E} \sup_{h \in \mathcal{H}} \sum_j \epsilon_j\, h(x_j) \approx \mathcal{O}\!\left(\tfrac{1}{\sqrt{n}}\right)$

(here the $\epsilon_j$ are i.i.d. Rademacher signs and $L$ is the Lipschitz constant of $\ell$)

SLIDE 14

Part II: Batch Learning

Batch Learning for Pairwise Loss Functions

SLIDE 15

Training with Pairwise Loss Functions

  β€’ Given training data $x_1, x_2, \dots, x_n$, choose a U-statistic
  β€’ The U-statistic should use terms like $\ell_{(x_j, x_k)}(h)$ (the kernel of the U-statistic)
  β€’ Population risk defined as $\mathcal{L}(\cdot) = \mathbb{E}_{x, x'}\, \ell_{(x, x')}(\cdot)$

Examples:

  β€’ For any index set $\Omega \subset [n] \times [n]$, define

$\hat{\mathcal{L}}(\cdot\,; \Omega) = \frac{1}{|\Omega|} \sum_{(j,k) \in \Omega} \ell_{(x_j, x_k)}(\cdot)$

  β€’ The choice $\Omega = \{(j,k) : j \ne k\}$ maximizes data utilization (computed in the sketch below)
  β€’ Various ways of optimizing $\inf_{h \in \mathcal{H}} \hat{\mathcal{L}}(h)$ (e.g. stochastic subgradient methods, SSG)

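A small sketch of the empirical U-statistic risk over $\Omega = \{(j,k) : j \ne k\}$; the linear ranking scorer used in the example is an illustrative assumption.

```python
import numpy as np
from itertools import permutations

def pairwise_empirical_risk(pair_loss, h, xs, ys):
    """U-statistic empirical risk over Omega = {(j, k) : j != k}:
    (1 / |Omega|) * sum of pair_loss(h, x_j, y_j, x_k, y_k)."""
    idx = list(permutations(range(len(xs)), 2))   # all ordered pairs with j != k
    total = sum(pair_loss(h, xs[j], ys[j], xs[k], ys[k]) for j, k in idx)
    return total / len(idx)

# Example with the ranking loss phi((f(x1)-f(x2))(y1-y2)) for a linear scorer f(x) = <w, x>.
def ranking_pair_loss(w, x1, y1, x2, y2):
    return max(0.0, 1.0 - float(w @ (x1 - x2)) * (y1 - y2))

rng = np.random.default_rng(0)
xs = rng.normal(size=(20, 3))
ys = np.where(xs[:, 0] > 0, 1.0, -1.0)            # labels in {-1, +1}
w = np.array([1.0, 0.0, 0.0])
print(pairwise_empirical_risk(ranking_pair_loss, w, xs, ys))
```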
SLIDE 16

Generalization Bounds for Pairwise Loss Functions

  β€’ Step 1: Bound the excess risk by the supremum excess risk

$\mathcal{L}(\hat{h}) - \hat{\mathcal{L}}(\hat{h}) \le \sup_{h \in \mathcal{H}} \big[\mathcal{L}(h) - \hat{\mathcal{L}}(h)\big]$

  β€’ Step 2: Apply McDiarmid’s inequality

Check that $\hat{\mathcal{L}}(h)$ changes by at most $\mathcal{O}(1/n)$ when any single $x_j$ is changed, so that with high probability

$\mathcal{L}(\hat{h}) - \hat{\mathcal{L}}(\hat{h}) \le \mathbb{E} \sup_{h \in \mathcal{H}} \big[\mathcal{L}(h) - \hat{\mathcal{L}}(h)\big] + \mathcal{O}\!\left(\tfrac{1}{\sqrt{n}}\right)$

  β€’ Step 3: Analyze the expected supremum excess risk

$\mathbb{E} \sup_{h \in \mathcal{H}} \big[\mathcal{L}(h) - \hat{\mathcal{L}}(h)\big] = \mathbb{E} \sup_{h \in \mathcal{H}} \big[\mathbb{E}\,\hat{\mathcal{L}}_{S'}(h) - \hat{\mathcal{L}}_S(h)\big] \le \mathbb{E} \sup_{h \in \mathcal{H}} \big[\hat{\mathcal{L}}_{S'}(h) - \hat{\mathcal{L}}_S(h)\big] \quad \text{(Jensen's inequality)}$

SLIDE 17

Analyzing the Expected Supremum Excess Risk

  β€’ For pairwise losses

$\hat{\mathcal{L}}(\cdot) = \frac{1}{n(n-1)} \sum_{j \ne k} \ell_{(x_j, x_k)}(\cdot)$

  β€’ A clean symmetrization of $\mathbb{E} \sup_{h \in \mathcal{H}} \big[\hat{\mathcal{L}}_{S'}(h) - \hat{\mathcal{L}}_S(h)\big]$ is not possible due to coupling: in the term

$\mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n(n-1)} \sum_{j \ne k} \big[\ell_{(x'_j, x'_k)}(h) - \ell_{(x_j, x_k)}(h)\big]$

each $x_j$ appears in $2(n-1)$ of the summands, so the summands are not independent and the Rademacher trick cannot be applied directly.

  β€’ Solutions [see ClΓ©menΓ§on et al Ann. Stat. β€˜08]
    β€’ Alternate representation of U-statistics
    β€’ Hoeffding decomposition

SLIDE 18

Part III: Online Learning

SLIDE 19

Part III: Online Learning

A Whirlwind Tour of Online Learning for Unary Losses

SLIDE 20

Model for Online Learning with Unary Losses

At each round $t = 1, \dots, T$:

  β€’ Propose hypothesis $h_{t-1} \in \mathcal{H}$
  β€’ Receive loss $\ell_t(\cdot) = \ell_{x_t}(\cdot)$
  β€’ Update $h_{t-1} \to h_t$

  β€’ Regret

$\mathfrak{R}_T = \sum_{t=1}^{T} \ell_t(h_{t-1}) - \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell_t(h)$

SLIDE 21

Online Learning Algorithms

  β€’ Generalized Infinitesimal Gradient Ascent (GIGA) [Zinkevich ’03] (a sketch follows below)

$h_t = h_{t-1} - \eta_t \nabla_h \ell_t(h_{t-1})$

  β€’ Follow the Regularized Leader (FTRL) [Hazan et al β€˜06]

$h_t = \arg\min_{h \in \mathcal{H}} \sum_{\tau=1}^{t-1} \ell_\tau(h) + \lambda_t \|h\|^2$

  β€’ Under some conditions

$\mathfrak{R}_T \le \mathcal{O}\big(\sqrt{T}\big)$

  β€’ Under stronger conditions

$\mathfrak{R}_T \le \mathcal{O}(\log T)$

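A minimal sketch of the GIGA update (online gradient descent with projection onto a Euclidean ball), run on a toy online regression stream. The step sizes $\eta_t = 1/\sqrt{t}$, the projection radius and the data are illustrative assumptions.

```python
import numpy as np

def giga_step(h, grad, eta, radius=None):
    """One online gradient step h_t = h_{t-1} - eta_t * grad, optionally followed
    by projection back onto an L2 ball of the given radius."""
    h_new = h - eta * grad
    if radius is not None:
        norm = np.linalg.norm(h_new)
        if norm > radius:
            h_new *= radius / norm
    return h_new

# Toy run: online linear regression with squared loss l_t(h) = 0.5 * (<h, x_t> - y_t)^2.
rng = np.random.default_rng(0)
h = np.zeros(3)
for t in range(1, 201):
    x_t = rng.normal(size=3)
    y_t = x_t @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal()
    grad = (h @ x_t - y_t) * x_t                 # gradient of the instantaneous loss at h_{t-1}
    h = giga_step(h, grad, eta=1.0 / np.sqrt(t), radius=10.0)
print(h)   # roughly approaches [1, -2, 0.5]
```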
SLIDE 22

Online-to-Batch Conversion for Unary Losses

  β€’ Key insight: $h_{t-1}$ is evaluated on an unseen point [Cesa-Bianchi et al β€˜01]

$\mathbb{E}\big[\ell_t(h_{t-1}) \,\big|\, \sigma(x_1, \dots, x_{t-1})\big] = \mathbb{E}_x\, \ell_x(h_{t-1}) = \mathcal{L}(h_{t-1})$

  β€’ Set up a martingale difference sequence

$V_t = \mathcal{L}(h_{t-1}) - \ell_t(h_{t-1}), \qquad \mathbb{E}\big[V_t \,\big|\, \sigma(x_1, \dots, x_{t-1})\big] = 0$

  β€’ Azuma-Hoeffding gives us (with high probability)

$\sum_t \mathcal{L}(h_{t-1}) \le \sum_t \ell_t(h_{t-1}) + \mathcal{O}\big(\sqrt{T}\big), \qquad \sum_t \ell_t(h^*) \ge T\,\mathcal{L}(h^*) - \mathcal{O}\big(\sqrt{T}\big)$

  β€’ Together we get

$\sum_t \mathcal{L}(h_{t-1}) - T\,\mathcal{L}(h^*) \le \mathfrak{R}_T + \mathcal{O}\big(\sqrt{T}\big)$

SLIDE 23

Online-to-Batch Conversion for Unary Losses

  β€’ Hypothesis selection: for a convex loss function, average the iterates (see the sketch below)

$\bar{h} = \frac{1}{T} \sum_t h_{t-1}, \qquad \mathcal{L}(\bar{h}) \le \frac{1}{T} \sum_t \mathcal{L}(h_{t-1}) \le \mathcal{L}(h^*) + \frac{\mathfrak{R}_T}{T} + \mathcal{O}\!\left(\tfrac{1}{\sqrt{T}}\right)$

  β€’ More involved for non-convex losses
  β€’ Better results possible [Tewari-Kakade β€˜08]: assume strongly convex loss functions

$\sum_t \mathcal{L}(h_{t-1}) \le T\,\mathcal{L}(h^*) + \mathfrak{R}_T + \mathcal{O}\big(\sqrt{\mathfrak{R}_T}\big)$

  β€’ For $\mathfrak{R}_T = \mathcal{O}(\log T)$, this reduces to

$\mathcal{L}(\bar{h}) \le \frac{1}{T} \sum_t \mathcal{L}(h_{t-1}) \le \mathcal{L}(h^*) + \mathcal{O}\!\left(\tfrac{\log T}{T}\right)$

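A tiny sketch of the convex-case conversion: the deployed hypothesis is simply the average of the online iterates, and the bound above follows by Jensen's inequality.

```python
import numpy as np

def online_to_batch_average(hypotheses):
    """Online-to-batch conversion for convex losses: return the averaged iterate
    h_bar = (1/T) * sum_t h_t, whose population risk is bounded via Jensen's inequality."""
    return np.mean(np.stack(hypotheses), axis=0)

# Example: average the iterates produced by an online learner (e.g. the GIGA sketch above).
iterates = [np.array([0.0, 0.0]), np.array([0.5, -1.0]), np.array([0.9, -1.8])]
print(online_to_batch_average(iterates))   # -> [0.4667, -0.9333]
```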
SLIDE 24

Part III: Online Learning

Online Learning for Pairwise Loss Functions

SLIDE 25

Model for Online Learning with Pairwise Losses

At each round $t = 1, \dots, T$:

  β€’ Propose hypothesis $h_{t-1} \in \mathcal{H}$
  β€’ Receive loss $\ell_t(\cdot) = \;?$
  β€’ Update $h_{t-1} \to h_t$

  β€’ Regret

$\mathfrak{R}_T = \;?$

SLIDE 26

Defining Instantaneous Loss and Regret

  β€’ At time $t$, we receive the point $x_t$
  β€’ Natural definition of instantaneous loss: all the pairwise interactions $x_t$ has with previous points (computed in the sketch below)

$\ell_t(\cdot) = \sum_{\tau=1}^{t-1} \ell_{(x_t, x_\tau)}(\cdot)$

  β€’ Corresponding notion of regret

$\mathfrak{R}_T = \sum_{t=1}^{T} \ell_t(h_{t-1}) - \inf_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell_t(h)$

  β€’ Note that this notion of instantaneous loss satisfies

$\forall h \in \mathcal{H}, \quad \sum_{t=1}^{T} \ell_t(h) = \sum_{j < k} \ell_{(x_j, x_k)}(h) = \binom{T}{2}\, \hat{\mathcal{L}}(h)$

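A sketch of the (unnormalized) instantaneous loss $\ell_t$ from this slide; the score-gap pair loss is an illustrative assumption. Summing $\ell_t$ over the stream touches every pair exactly once, which is the identity in the last bullet.

```python
import numpy as np

def instantaneous_pairwise_loss(pair_loss, h, stream, t):
    """l_t(h) = sum over tau < t of pair_loss(h, x_t, x_tau): all pairwise
    interactions of the point arriving at time t (1-indexed) with earlier points."""
    x_t = stream[t - 1]
    return sum(pair_loss(h, x_t, x_tau) for x_tau in stream[:t - 1])

# Example with a toy symmetric pair loss: squared difference of the scores <h, x>.
def score_gap_loss(h, x1, x2):
    return float((h @ x1 - h @ x2) ** 2)

rng = np.random.default_rng(0)
stream = rng.normal(size=(6, 3))
h = np.array([1.0, 0.0, -1.0])
# Summing l_t over the stream covers every unordered pair exactly once.
print(sum(instantaneous_pairwise_loss(score_gap_loss, h, stream, t)
          for t in range(1, len(stream) + 1)))
```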
SLIDE 27

Online Learning Algorithm with Pairwise Losses

  β€’ For regularity, we use a normalized loss

$\ell_t(\cdot) = \frac{1}{t-1} \sum_{\tau=1}^{t-1} \ell_{(x_t, x_\tau)}(\cdot)$

  β€’ Note that $\ell_t(\cdot)$ is convex, bounded and Lipschitz if $\ell$ is so
  β€’ It turns out that GIGA works just fine

$h_t = h_{t-1} - \eta_t \nabla_h \ell_t(h_{t-1})$

  β€’ … and guarantees similar regret bounds

$\mathfrak{R}_T \le \mathcal{O}\big(\sqrt{T}\big)$

SLIDE 28

Online Learning Algorithm with Pairwise Losses

  β€’ Implementing GIGA requires storing the entire previous history

$\nabla_h \ell_t(\cdot) = \frac{1}{t-1} \sum_{\tau=1}^{t-1} \nabla_h \ell_{(x_t, x_\tau)}(\cdot)$

  β€’ To reduce memory usage, keep a snapshot of the history
  β€’ Limited-memory buffer $B = \{b_1, b_2, \dots, b_s\}$
  β€’ Modified instantaneous loss

$\ell_t^{\mathrm{buf}}(\cdot) = \frac{1}{s} \sum_{x \in B_{t-1}} \ell_{(x_t, x)}(\cdot)$

  β€’ Responsibilities at each time step $t$ (see the sketch below):
    β€’ Update the hypothesis $h_{t-1} \to h_t$ (same as GIGA, but with $\ell_t^{\mathrm{buf}}(\cdot)$)
    β€’ Update the buffer: $\mathrm{UPDATE}(B_{t-1}, x_t) \to B_t$

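A sketch of the buffered online learner described above: the hypothesis update averages pair gradients against the current buffer, and the buffer update is pluggable (a FIFO placeholder here; an RS-x-style update is sketched after the next slide). The pair loss, step sizes and buffer size are illustrative assumptions.

```python
import numpy as np

def buffered_online_learner(pair_grad, update_buffer, stream, dim, s=8):
    """GIGA-style loop with a finite buffer: the gradient of l_t^buf is the average
    of pairwise-loss gradients taken against the current buffer contents."""
    h, buffer = np.ones(dim), []                      # start away from the minimiser
    for t, x_t in enumerate(stream, start=1):
        if buffer:                                    # hypothesis update with l_t^buf
            grad = np.mean([pair_grad(h, x_t, b) for b in buffer], axis=0)
            h = h - (1.0 / np.sqrt(t)) * grad         # step size eta_t = 1 / sqrt(t)
        buffer = update_buffer(buffer, x_t, t, s)     # buffer update, e.g. RS-x
    return h

# Minimal usage with a FIFO buffer update and the gradient of a score-gap pair loss.
def fifo_update(buffer, x_t, t, s):
    return (buffer + [x_t])[-s:]

def score_gap_grad(h, x1, x2):                        # gradient of 0.5 * (h.(x1 - x2))^2
    d = x1 - x2
    return float(h @ d) * d

rng = np.random.default_rng(0)
print(buffered_online_learner(score_gap_grad, fifo_update, rng.normal(size=(100, 3)), dim=3))
```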
SLIDE 29

Buffer Update Algorithm

  β€’ Online sampling algorithm for i.i.d. samples [K. et al β€˜13]
  β€’ RS-x: Reservoir Sampling with replacement (a sketch of one such update follows below)

[Figure: schematic of the RS-x buffer being updated as the stream point $x_t$ arrives]

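The deck does not spell out the update rule, so the following is a hedged sketch of one standard way to maintain $s$ i.i.d.-with-replacement samples of the stream: treat each buffer slot as an independent size-1 reservoir that the new point overwrites with probability $1/t$. This matches the stated guarantee (each slot is uniform over the history seen so far, independently of the others), but the exact RS-x procedure of [K. et al '13] may differ in details.

```python
import random

def rsx_update(buffer, x_t, t, s):
    """Reservoir-sampling-with-replacement style update: each of the s slots is an
    independent size-1 reservoir, so after round t every slot holds a point drawn
    uniformly (with replacement) from {x_1, ..., x_t}."""
    if t == 1:
        return [x_t] * s                    # initialise every slot with the first point
    return [x_t if random.random() < 1.0 / t else b for b in buffer]

# Usage: plug rsx_update into the buffered online learner from the previous sketch.
```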
SLIDE 30

Regret Analysis for GIGA with RS-x

  β€’ RS-x gives the following guarantee:

At any fixed time $t$, the buffer $B$ contains $s$ i.i.d. samples from the previous history $Z_t = \{x_1, \dots, x_{t-1}\}$

  β€’ Use this to prove a regret conversion bound
  β€’ Basic idea: first prove a finite-buffer regret bound

$\frac{1}{T} \sum_t \ell_t^{\mathrm{buf}}(h_{t-1}) \le \inf_{h \in \mathcal{H}} \frac{1}{T} \sum_t \ell_t^{\mathrm{buf}}(h) + \mathcal{O}\!\left(\tfrac{1}{\sqrt{T}}\right)$

  β€’ … then use uniform-convergence-style bounds to show

$\ell_t(h_{t-1}) \approx \ell_t^{\mathrm{buf}}(h_{t-1}) \pm \mathcal{O}\!\left(\tfrac{1}{\sqrt{s}}\right)$

SLIDE 31

Regret Analysis for GIGA with RS-x: Step 1

Finite-Buffer Regret

  β€’ The modified algorithm uses $\ell_t^{\mathrm{buf}}(\cdot)$ to update the hypothesis
  β€’ $\ell_t^{\mathrm{buf}}(\cdot)$ is also convex, bounded and Lipschitz given $B$
  β€’ Standard GIGA analysis gives us

$\frac{1}{T} \sum_t \ell_t^{\mathrm{buf}}(h_{t-1}) \le \inf_{h \in \mathcal{H}} \frac{1}{T} \sum_t \ell_t^{\mathrm{buf}}(h) + \mathcal{O}\!\left(\tfrac{\mathfrak{R}_T^{\mathrm{buf}}}{T}\right), \quad \text{where } \mathfrak{R}_T^{\mathrm{buf}} = \mathcal{O}\big(\sqrt{T}\big)$

SLIDE 32

Regret Analysis for GIGA with RS-x: Step 2

Uniform Convergence

  β€’ Think of $Z_t$ as the population and of $B$ as an i.i.d. sample of size $s$
  β€’ Define $g_x(\cdot) = \ell_{(x_t, x)}(\cdot)$ and put the uniform distribution over $Z_t$
  β€’ Population risk analysis

$G(\cdot) = \mathbb{E}\, g_x(\cdot) = \frac{1}{t-1} \sum_{\tau=1}^{t-1} \ell_{(x_t, x_\tau)}(\cdot) = \ell_t(\cdot)$

  β€’ Empirical risk analysis

$\hat{G}(\cdot) = \frac{1}{s} \sum_{x \in B_{t-1}} g_x(\cdot) = \frac{1}{s} \sum_{x \in B_{t-1}} \ell_{(x_t, x)}(\cdot) = \ell_t^{\mathrm{buf}}(\cdot)$

  β€’ Finish off using

$\big\| G(\cdot) - \hat{G}(\cdot) \big\|_\infty \le \mathcal{O}\!\left(\tfrac{1}{\sqrt{s}}\right)$

SLIDE 33

Regret Analysis for GIGA with RS-x: Wrapping Up

  β€’ Convert the finite-buffer regret into true regret
  β€’ Three results:

$\forall t: \quad \ell_t(h_{t-1}) \le \ell_t^{\mathrm{buf}}(h_{t-1}) + \mathcal{O}\!\left(\tfrac{1}{\sqrt{s}}\right)$

$\forall h, \forall t: \quad \ell_t^{\mathrm{buf}}(h) \le \ell_t(h) + \mathcal{O}\!\left(\tfrac{1}{\sqrt{s}}\right)$

$\forall h: \quad \frac{1}{T} \sum_t \ell_t^{\mathrm{buf}}(h_{t-1}) \le \frac{1}{T} \sum_t \ell_t^{\mathrm{buf}}(h) + \frac{\mathfrak{R}_T^{\mathrm{buf}}}{T}$

  β€’ Combine to get

$\frac{1}{T} \sum_t \ell_t(h_{t-1}) \le \inf_{h \in \mathcal{H}} \frac{1}{T} \sum_t \ell_t(h) + \mathcal{O}\!\left(\tfrac{1}{\sqrt{s}}\right) + \frac{\mathfrak{R}_T^{\mathrm{buf}}}{T}, \qquad \text{i.e.} \quad \mathfrak{R}_T \le \mathfrak{R}_T^{\mathrm{buf}} + \mathcal{O}\!\left(\tfrac{T}{\sqrt{s}}\right) = \mathcal{O}\!\left(\tfrac{T}{\sqrt{s}}\right)$

SLIDE 34

Regret Analysis for GIGA with RS-x

  β€’ Better results are possible for strongly convex losses
  β€’ For any $\epsilon > 0$, we can show

$\frac{1}{T} \sum_t \ell_t(h_{t-1}) \le (1 + \epsilon) \inf_{h \in \mathcal{H}} \frac{1}{T} \sum_t \ell_t(h) + \frac{\mathfrak{R}_T}{T} + \mathcal{O}\!\left(\tfrac{1}{\epsilon s}\right)$

  β€’ For realizable cases (i.e. $\mathcal{L}(h^*) = 0$), we can also show

$\frac{1}{T} \sum_t \ell_t(h_{t-1}) \le \inf_{h \in \mathcal{H}} \frac{1}{T} \sum_t \ell_t(h) + \frac{\mathfrak{R}_T}{T} + \mathcal{O}\!\left(\tfrac{\mathfrak{R}_T}{s}\right)$

SLIDE 35

Online-to-Batch Conversion for Pairwise Losses

  β€’ Recall that in the unary case, we had an MDS

$V_t = \mathcal{L}(h_{t-1}) - \ell_t(h_{t-1})$

  β€’ Recall that in the pairwise case, we have

$\mathcal{L}(\cdot) = \mathbb{E}_{x, x'}\, \ell_{(x, x')}(\cdot), \qquad \ell_t(\cdot) = \frac{1}{t-1} \sum_{\tau=1}^{t-1} \ell_{(x_t, x_\tau)}(\cdot)$

  β€’ This is no longer an MDS, since $V_t$ and $V_\tau$, $\tau < t$, are coupled through the shared history

$\mathbb{E}\big[V_t \,\big|\, \sigma(Z_t)\big] = \mathcal{L}(h_{t-1}) - \mathbb{E}\big[\ell_t(h_{t-1}) \,\big|\, \sigma(Z_t)\big] \ne 0$

SLIDE 36

Online-to-Batch Conversion for Pairwise Losses

Solution:

  β€’ Martingale creation: let

$\tilde{\ell}_t(\cdot) = \mathbb{E}\big[\ell_t(\cdot) \,\big|\, \sigma(Z_t)\big]$

$V_t = \underbrace{\mathcal{L}(h_{t-1}) - \tilde{\ell}_t(h_{t-1})}_{Q_t} + \underbrace{\tilde{\ell}_t(h_{t-1}) - \ell_t(h_{t-1})}_{R_t} = Q_t + R_t$

  β€’ The sequence $R_t$ is an MDS by construction: Azuma-Hoeffding bounds apply
  β€’ Bound $Q_t$ using uniform convergence (be careful during the symmetrization step)
  β€’ End result

$\frac{1}{T} \sum_t \mathcal{L}(h_{t-1}) \le \mathcal{L}(h^*) + \frac{\mathfrak{R}_T}{T} + \mathcal{O}\!\left(\tfrac{1}{\sqrt{T}}\right)$

SLIDE 37

Faster Rates for Strongly Convex Losses

  β€’ Have to use fast-rate results to bound both $Q_t$ and $R_t$
  β€’ Fast rates for $Q_t$: for strongly convex unary loss functions $\ell_x(\cdot)$, we have

$\mathcal{L}(\hat{h}) - \mathcal{L}(h^*) \le (1 + \epsilon)\big(\hat{\mathcal{L}}(\hat{h}) - \hat{\mathcal{L}}(h^*)\big) + \mathcal{O}\!\left(\tfrac{1}{\epsilon n}\right)$

  β€’ Fast rates for $R_t$: use the Bernstein inequality for martingales
  β€’ End result

$\frac{1}{T} \sum_t \mathcal{L}(h_{t-1}) \le \mathcal{L}(h^*) + \frac{\mathfrak{R}_T}{T} + \mathcal{O}\!\left(\tfrac{\sqrt{\mathfrak{R}_T}}{T}\right)$

SLIDE 38

Hidden Constants

  β€’ All our analyses involved Rademacher averages
    β€’ Even for the regret analysis and for bounding $Q_t$ for slow/fast rates
  β€’ Get dimension-independent bounds for regularized classes
  β€’ Weak dependence on dimensionality for sparse formulations
  β€’ Earlier work [Wang et al β€˜12] used covering number methods
  • If constants not imp. then can try analyzing π‘Š

𝑒 directly

  • Use covering number arguments to get linear dep. on 𝑒

SLIDE 39

Some Interesting Projects

  β€’ Regret bounds require $s = \Omega(\log T)$
    β€’ Is this necessary? A regret lower bound would settle it
  β€’ Learning higher-order tensors
    β€’ Scalability issues
  β€’ RS-x is a data-oblivious sampling algorithm
    β€’ It can throw away useful points by chance
    β€’ Data-aware sampling methods + corresponding regret bounds

SLIDE 40

That’s all!

Get slides from the following URL http://research.microsoft.com/en-us/people/t-purkar/

SLIDE 41

References

  • Balcan, Maria-Florina and Blum, Avrim. On a Theory of Learning

with Similarity Functions. In ICML, pp. 73-80, 2006.

  • Cesa-Bianchi, Nicolo, Conconi, Alex, and Gentile, Claudio. On the

Generalization Ability of On-Line Learning Algorithms. In NIPS, pp. 359-366, 2001.

  • Clemencon, Stephan, Lugosi, Gabor, and Vayatis, Nicolas. Ranking

and empirical minimization of Ustatistics. Annals of Statistics, 36:844-874, 2008.

  • Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Two-

Stage Learning Kernel Algorithms. In ICML, pp. 239-246, 2010.

  • Hazan, Elad, Kalai, Adam, Kale, Satyen, and Agarwal, Amit.

Logarithmic Regret Algorithms for Online Convex Optimization. In COLT, pp. 499-513,2006.

SLIDE 42

References

  • Jin, Rong, Wang, Shijun, and Zhou, Yang. Regularized Distance

Metric Learning: Theory and Algorithm. In NIPS, pp. 862-870, 2009.

  • Kakade, Sham M. and Tewari, Ambuj. On the Generalization

Ability of Online Strongly Convex Programming Algorithms. In NIPS, pp. 801-808, 2008.

  • Kar, Purushottam, Sriperumbudur, Bharath, Jain, Prateek, and

Karnick, Harish, On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions. In ICML, 2013.

  • Narasimhan, Harikrishna and Agarwal, Shivani, A Structural SVM

Based Approach for Optimizing Partial AUC, In ICML, 2013.

  • De la PenΜ„a, Victor H. and GinΓ©, Evariste, Decoupling: From

Dependence to Independence. Springer, New York, 1999.

SLIDE 43

References

  • Sridharan, Karthik, Shalev-Shwartz, Shai, and Srebro, Nathan. Fast

Rates for Regularized Objectives. In NIPS, pp. 1545-1552, 2008.

  • Wang, Yuyang, Khardon, Roni, Pechyony, Dmitry, and Jones,
  • Rosie. Generalization Bounds for Online Learning Algorithms with

Pairwise Loss Functions. In COLT 2012.

  • Zhao, Peilin, Hoi, Steven C. H., Jin, Rong, and Yang, Tianbao.

Online AUC Maximization. In ICML, pp. 233-240, 2011.

  • Xing, Eric P., Ng, Andrew Y., Jordan, Michael I., and Russell, Stuart
  • J. Distance Metric Learning with Application to Clustering with

Side-Information. In NIPS, pp. 505-512, 2002.

  • Zinkevich, Martin. Online Convex Programming and Generalized

Infinitesimal Gradient Ascent. In ICML, pp. 928-936, 2003.
