

SLIDE 1

A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis

Zhe Li∗, Tianbao Yang∗, Lijun Zhang♮, Rong Jin†

∗The University of Iowa, ♮Nanjing University, †Alibaba Group

December 10, 2015


SLIDE 2

Problem

Let $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$ denote an input-output pair, and let $w_*$ be an optimal model that minimizes the expected error:

$$w_* = \arg\min_{\|w\|_1 \le B} \frac{1}{2}\,\mathbb{E}_P\big[(w^\top x - y)^2\big]$$

Key problem: $w_*$ is not necessarily sparse.

The goal: learn a sparse model $w$ that achieves a small excess risk:

$$\mathrm{ER}(w, w_*) = \mathbb{E}_P\big[(w^\top x - y)^2\big] - \mathbb{E}_P\big[(w_*^\top x - y)^2\big] \le \epsilon$$
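As a concrete illustration (not from the slides), here is a minimal Python sketch of how the excess risk could be estimated by Monte Carlo, replacing the expectation over $P$ with an average over held-out samples; the function and argument names are hypothetical:

```python
import numpy as np

def excess_risk(w, w_star, X, y):
    """Monte Carlo estimate of ER(w, w*) = E_P[(w^T x - y)^2] - E_P[(w*^T x - y)^2],
    using held-out samples (X, y) in place of the expectation over P."""
    risk_w = np.mean((X @ w - y) ** 2)          # empirical risk of the learned model
    risk_star = np.mean((X @ w_star - y) ** 2)  # empirical risk of the optimal model
    return risk_w - risk_star
```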


SLIDE 3

The challenges

$L(w) = \mathbb{E}_P[(w^\top x - y)^2]$ is not necessarily strongly convex.

Stochastic optimization: $O(1/\epsilon^2)$ sample complexity and no sparsity guarantee. Empirical risk minimization + $\ell_1$ penalty: $O(1/\epsilon^2)$ sample complexity and no sparsity guarantee.

Challenges:

Can we reduce the sample complexity (e.g., to $O(1/\epsilon)$)? Can we also guarantee the sparsity of the model?

Our solution: a two-stage approach, presented on the following slides.


SLIDE 4

The first stage

Our first-stage algorithm is motivated by the EPOCH-GD algorithm [Hazan & Kale, 2011], which assumes a strongly convex objective. How can we avoid the strong-convexity assumption?

$L(w) = \mathbb{E}_P[(w^\top x - y)^2] = h(Aw) + b^\top w + c$, where $h(\cdot)$ is a strongly convex function.

The optimal solution set is a polyhedron. By Hoffman's bound we have

$$2\big(L(w) - L_*\big) \ge \frac{1}{\kappa}\,\|w - w^+\|_2^2$$

where $w^+$ is the closest solution to $w$ in the optimal solution set.
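For intuition, a minimal Python sketch of the epoch structure the first stage inherits from EPOCH-GD: projected SGD on the $\ell_1$ ball, with the step size halved and the epoch length doubled across epochs. This is a simplification under assumed defaults (a hypothetical `sample` oracle and a generic schedule), and it omits the paper's use of the Hoffman constant $\kappa$ in place of strong convexity; the projection is the standard sorting-based $\ell_1$-ball projection:

```python
import numpy as np

def project_l1_ball(w, B):
    """Euclidean projection onto {w : ||w||_1 <= B} (standard sorting-based method)."""
    if np.abs(w).sum() <= B:
        return w
    u = np.sort(np.abs(w))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(w) + 1) > css - B)[0][-1]
    theta = (css[rho] - B) / (rho + 1.0)
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

def epoch_sgd(sample, d, B, eta1=0.1, T1=100, n_epochs=5):
    """Epoch-style SGD sketch: within each epoch run projected SGD and average
    the iterates; across epochs halve the step size and double the epoch length."""
    w = np.zeros(d)
    eta, T = eta1, T1
    for _ in range(n_epochs):
        avg = np.zeros(d)
        for _ in range(T):
            x, y = sample()                      # draw (x, y) ~ P
            g = (w @ x - y) * x                  # stochastic gradient of (1/2)(w^T x - y)^2
            w = project_l1_ball(w - eta * g, B)  # projected SGD step on the l1 ball
            avg += w / T
        w = avg                                  # restart next epoch from the average iterate
        eta, T = eta / 2.0, 2 * T
    return w
```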

[1] Elad Hazan, Satyen Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. COLT 2011.

SLIDE 5

The second stage

Our second-stage algorithm: Randomized Sparsification.

For $k = 1, \ldots, K$:
  Sample $i_k \in [d]$ according to $\Pr(i_k = j) = p_j$
  Update $[\tilde{w}_k]_{i_k} = [\tilde{w}_{k-1}]_{i_k} + \dfrac{\hat{w}_{i_k}}{K\, p_{i_k}}$
End For

We use

$$p_j = \frac{\sqrt{\hat{w}_j^2\, \mathbb{E}[x_j^2]}}{\sum_{j'=1}^{d} \sqrt{\hat{w}_{j'}^2\, \mathbb{E}[x_{j'}^2]}} \quad \text{instead of} \quad p_j = \frac{|\hat{w}_j|}{\|\hat{w}\|_1} \;\text{[Shalev-Shwartz et al., 2010]},$$

which reduces the constant in the $O(1/\epsilon)$ bound on sparsity.
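A minimal Python sketch of this sparsification step; `Ex2` stands for a vector of (estimated) second moments $\mathbb{E}[x_j^2]$, the names are hypothetical, and the $1/K$ normalization (which makes the output an unbiased estimate of $\hat{w}$) reflects one reading of the update above:

```python
import numpy as np

def randomized_sparsify(w_hat, Ex2, K, rng=None):
    """Randomized sparsification sketch: draw K coordinates with probability
    p_j proportional to sqrt(w_hat_j^2 * E[x_j^2]) and form an unbiased,
    at-most-K-sparse estimate of w_hat."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.sqrt(w_hat ** 2 * Ex2)
    p /= p.sum()
    w_tilde = np.zeros_like(w_hat)
    idx = rng.choice(len(w_hat), size=K, p=p)  # i_1, ..., i_K drawn i.i.d. from p
    for i in idx:
        w_tilde[i] += w_hat[i] / (K * p[i])    # so that E[w_tilde] = w_hat
    return w_tilde
```

Since at most $K$ distinct coordinates are ever touched, the output has at most $K$ non-zero entries regardless of the dimension $d$.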

[2] Shai Shalev-Shwartz, Nathan Srebro, Tong Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 2010.

SLIDE 6

Experimental Results

[Figure: results on the E2006-tfidf dataset. Left panel (1st stage): RMSE vs. epoch $k$, comparing Epoch-SGD and SGD. Middle panel (2nd stage): RMSE vs. $K$, comparing MG-Sparsification, DD-Sparsification, and the full model. Right panel (RMSE vs. Sparsity): RMSE vs. sparsity (%), comparing SpT ($K = 500$) and SpS with $B = 1, \ldots, 5$.]