

SLIDE 1

Adaptive Regularization Algorithms in Learning Theory – Case Study: Prediction of Blood Glucose Level

Sergei V. Pereverzev, Sivananthan Sampath, Huajun Wang

RICAM, Austria

Joint research with E. De Vito (Univ. Genova) and L. Rosasco (MIT, Boston). Workshop "Inverse and Partial Information Problems", RICAM, Linz, October 2008.

SLIDE 2

Learning from examples

Vapnik (1995), Evgeniou, Pontil, Poggio (2000), Cucker, Smale (2001):
1) Two sets of variables X ⊂ R^d, Y ⊂ R are related by a probabilistic relationship: each x ∈ X is associated with ρ(·|x), an (unknown) probability distribution on Y.
2) Training data: z = {(x_1, y_1), ..., (x_n, y_n)} ∈ (X × Y)^n.
The goal: provide an estimator f = f_z : X → Y to predict y ∈ Y for any given x ∈ X.

SLIDE 3

EU-project ”DIAdvisor – diabetes adviser”: Glucose Prediction using patient vital data.

1) Input: x = x_i = (t_i, x^i_1, x^i_2, ..., x^i_{d−1}) ∈ R^d, where x^i_k, k = 1, 2, ..., d−1, are the measurements of vital signs (e.g. glucose concentration, blood pH, temperature, ...) taken at the time t = t_i, i = 1, 2, ..., n.
2) Output: y is the blood glucose concentration at a time t > t_n in the future.
State of the art (R. Gillis et al., Abstract 0415-P, 2007, Santa Barbara, CA): "With the estimator blinded to meals one can accurately (i.e. with an error less than 2 mmol/l) predict glucose levels 45 minutes into the future. This is a promising result..."

SLIDE 4

"The Uncertainty ... It is rather a matter of Efficiency" (David Mumford, "The Mathematics of Perception")

If the blood glucose concentration is assumed to be a function y = y(t, x_1, x_2, ..., x_{d−1}, x_d, ...), then the training data are
(t_i, x^i_1, x^i_2, ..., x^i_{d−1}, y_i = y(t_i, x^i_1, x^i_2, ..., x^i_{d−1}, x^i_d, ...)), i = 1, 2, ..., n.
In the first phase of "DIAdvisor" only the data (t_i, y_i), i = 1, 2, ..., n, are available. The goal is to predict the value y_m = y(t_m, ...) for t_m > t_n, t_m − t_n > 45 minutes.

SLIDE 5

Statistical framework

1) ρ_X(·) is the (marginal) probability distribution on X (also unknown).
2) Expected risk of an estimator f : X → Y:
E(f) = ∫_X ∫_Y (f(x) − y)² ρ(y|x) ρ_X(x) dy dx.
3) Regression function:
f_ρ(x) = argmin{E(f), f ∈ L₂(X, ρ_X dx)} = ∫_Y y ρ(y|x) dy.

SLIDE 6

Hypothesis space and Target function

1) H is a Hilbert space and J : H ↪ L₂(X, ρ_X dx) is a compact embedding.
2) Target function: f_H = argmin{E(f), f ∈ H} = argmin{‖f − f_ρ‖_ρ, f ∈ H}, since
E(f) = ‖f − f_ρ‖²_ρ + E(f_ρ), where ‖·‖_ρ = ‖·‖_{L₂(X, ρ_X dx)} and ‖f − f_ρ‖_ρ = ‖Jf − f_ρ‖_ρ for all f ∈ H.
Thus f_H solves the normal equation J*J f = J* f_ρ, with J* : L₂(X, ρ_X dx) → H.

SLIDE 7

Picard criterion and Source conditions

T = J*J = Σ_i t_i ⟨·, e_i⟩_H e_i ;  L = J J* = Σ_i t_i ⟨·, l_i⟩_ρ l_i.
Picard criterion:
f_H = Σ_i (⟨l_i, f_ρ⟩_ρ / √t_i) e_i ∈ H  ⇔  Σ_i ⟨l_i, f_ρ⟩²_ρ / t_i < ∞.
Source condition: there exists φ : [0, t_1] → R₊, φ(0) = 0, φ increasing, such that
Σ_i ⟨l_i, f_ρ⟩²_ρ / (t_i φ²(t_i)) < ∞,
i.e. v = Σ_i (⟨l_i, f_ρ⟩_ρ / (√t_i φ(t_i))) e_i ∈ H, which gives f_H = φ(T)v.
H^φ = {f ∈ H : f = φ(T)v, v ∈ H}.

SLIDE 8

Reproducing Kernel Hilbert Space H = HK

1) K : X × X → R is continuous, symmetric, positive semidefinite; K_x = K(x, ·).
2) H⁰_K = {f : f = Σ_{j=1}^r c_j K_{x_j}}, K_{x_j} = K(x_j, ·).
3) ⟨f, g⟩_K = ⟨Σ_{j=1}^r c_j K_{x_j}, Σ_{i=1}^s d_i K_{t_i}⟩_K := Σ_{j=1}^r Σ_{i=1}^s c_j d_i K(x_j, t_i).
4) H_K is the completion of H⁰_K w.r.t. ‖·‖_K; for every f ∈ H_K, f(x) = ⟨K_x, f⟩_K (reproducing property).
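
The definitions above can be made concrete in a few lines. The sketch below is an illustration (not part of the talk): it evaluates a finite expansion f = Σ_j c_j K_{x_j}, computes the inner product ⟨f, g⟩_K, and checks the reproducing property, using the kernel that appears later in Numerical Test 1.

```python
import numpy as np

def kernel(x, t):
    # Kernel from the talk's numerical tests: K(x, t) = x*t + exp(-8(t - x)^2)
    return x * t + np.exp(-8.0 * (t - x) ** 2)

def make_f(nodes, coeffs):
    """A finite kernel expansion f = sum_j c_j K_{x_j} in H_K."""
    return lambda x: sum(c * kernel(xj, x) for xj, c in zip(nodes, coeffs))

def inner_K(nodes_f, c, nodes_g, d):
    """<f, g>_K = sum_j sum_i c_j d_i K(x_j, t_i)."""
    return sum(cj * di * kernel(xj, ti)
               for xj, cj in zip(nodes_f, c)
               for ti, di in zip(nodes_g, d))

# Reproducing property: f(x) = <K_x, f>_K, since K_x is itself an expansion
nodes, coeffs = [0.5, 1.0, 2.0], [1.0, -0.5, 0.25]
f = make_f(nodes, coeffs)
x0 = 1.3
assert abs(f(x0) - inner_K([x0], [1.0], nodes, coeffs)) < 1e-12
```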

SLIDE 9

Discrete version of the equation J*J f = J* f_ρ for J = J_{H_K}

z = {(x_i, y_i)}_{i=1}^n,  x = (x_i)_{i=1}^n,  y = (y_i)_{i=1}^n ∈ R^n;  ⟨u, v⟩_{R^n} = (1/n) Σ_{i=1}^n u_i v_i.
Sampling operator: S_x : H_K → R^n, S_x f = (f(x_i))_{i=1}^n; its adjoint S*_x : R^n → H_K,
S*_x y = (1/n) Σ_{i=1}^n y_i K_{x_i};  T_x = S*_x S_x = (1/n) Σ_{i=1}^n ⟨·, K_{x_i}⟩_K K_{x_i}.
J_{H_K} : H_K ↪ L₂(X, ρ_X dx), J_{H_K} f = f; its discretization is S_x f = y.
T = J*_{H_K} J_{H_K} : H_K → H_K; the equation T f = J*_{H_K} f_ρ is discretized as T_x f = S*_x y.

SLIDE 10

Regularization of T_x f = S*_x y

Poggio et al. (2000), ..., Smale, Zhou (2005): Tikhonov regularization
f^λ_z = argmin{(1/n) Σ_{i=1}^n (f(x_i) − y_i)² + λ‖f‖²_K} = (λI + T_x)^{−1} S*_x y.
General regularization scheme: f^λ_z = g_λ(T_x) S*_x y = Σ_{i=1}^n γ_i K_{x_i}, where g_λ : [0, ‖T_x‖] → R satisfies
1) sup_t |g_λ(t)| ≤ c₀/λ;
2) ∃p : ∀ν ∈ [0, p], sup_t |(1 − g_λ(t)t) t^ν| ≤ c_p λ^ν.
For Tikhonov regularization g_λ(t) = (λ + t)^{−1} and p = 1.
Remark: De Vore et al. (2006), Maiorov (2006): λ = 0, H a finite ball in a finite-dimensional space. Cortes, Vapnik (1995): other forms of the loss function V(y_i, f(x_i)).
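
In matrix form the Tikhonov estimator is straightforward to compute. A minimal sketch (my illustration, assuming the normalized inner product ⟨u, v⟩ = (1/n) Σ u_i v_i above, under which (λI + T_x)^{−1} S*_x y reduces to coefficients γ = (nλI + K)^{−1} y with Gram matrix K_ij = K(x_i, x_j)):

```python
import numpy as np

def kernel(x, t):
    # Kernel from Numerical Test 1: K(x, t) = x*t + exp(-8(t - x)^2)
    return x * t + np.exp(-8.0 * (t - x) ** 2)

def tikhonov(x, y, lam):
    """f^lam_z = (lam*I + T_x)^{-1} S_x^* y, written via the expansion
    f^lam_z = sum_i gamma_i K_{x_i}. With <u, v> = (1/n) sum_i u_i v_i,
    the coefficients solve (n*lam*I + K) gamma = y, K_ij = K(x_i, x_j)."""
    n = len(x)
    K = kernel(x[:, None], x[None, :])
    gamma = np.linalg.solve(n * lam * np.eye(n) + K, y)
    return lambda t: kernel(np.asarray(t)[:, None], x[None, :]) @ gamma

# Smoke check: for tiny lambda the estimator nearly interpolates the data
x = np.linspace(0.3, 2 * np.pi, 8)
y = np.sin(x)
f = tikhonov(x, y, 1e-8)
assert np.max(np.abs(f(x) - y)) < 1e-3
```

Larger λ trades this near-interpolation for stability, which is exactly what the parameter-choice rules below try to balance.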

SLIDE 11

Basic Theorem

Assume:
1) f_{H_K} ∈ H^φ_K with φ ∈ F_p, and κ > sup_{x∈X} K(x, x);
2) g_λ satisfies, for all λ, sup_t |(1 − g_λ(t)t) t^q| ≤ c λ^q with q ≥ p + 1/2.
Then for f^λ_z = g_λ(T_x) S*_x y and λ > (2√2/√n) κ log(4/h), with probability 1 − h:
‖f_{H_K} − f^λ_z‖_ρ ≤ (c₁ φ(λ)√λ + c₂/√(λn)) log(1/h),
‖f_{H_K} − f^λ_z‖_K ≤ (c₃ φ(λ) + c₄/(λ√n)) log(1/h).

SLIDE 12

A priori parameter choice

Th.1. Let θ(t) = φ(t)t and f_{H_K} ∈ H^φ_K. Under the assumptions of the Basic Theorem, for λ_n = θ^{−1}(n^{−1/2}), with probability 1 − h:
‖f_{H_K} − f^{λ_n}_z‖_ρ ≤ c φ(θ^{−1}(n^{−1/2})) √(θ^{−1}(n^{−1/2})) log(1/h),
‖f_{H_K} − f^{λ_n}_z‖_K ≤ c φ(θ^{−1}(n^{−1/2})) log(1/h).
Remark 1: For φ(t) = t^r:  ‖·‖_ρ ∼ n^{−(2r+1)/(4(r+1))},  ‖·‖_K ∼ n^{−r/(2(r+1))}.
Remark 2: Smale, Zhou (2005): 0 < r ≤ 1/2; Caponnetto et al. (2005): r > 1/2, ‖·‖_ρ ≤ n^{−(2r+1)/(4(r+3/2))}.
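
Remark 1 follows by direct substitution. As a check (a standard calculation, not on the original slide):

```latex
\varphi(t) = t^r \;\Rightarrow\; \theta(t) = \varphi(t)\,t = t^{r+1},
\qquad
\lambda_n = \theta^{-1}(n^{-1/2}) = n^{-\frac{1}{2(r+1)}},
```
```latex
\|\cdot\|_K \sim \varphi(\lambda_n) = n^{-\frac{r}{2(r+1)}},
\qquad
\|\cdot\|_\rho \sim \varphi(\lambda_n)\sqrt{\lambda_n}
  = n^{-\frac{r}{2(r+1)} - \frac{1}{4(r+1)}}
  = n^{-\frac{2r+1}{4(r+1)}}.
```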

SLIDE 13

Regularization in the empirical norm

Empirical norm: ‖f‖²_{{x_i}} := (1/n) Σ_{i=1}^n f²(x_i).
Th.2. For f ∈ H_K, with probability 1 − h:
|‖f‖²_ρ − ‖f‖²_{{x_i}}| ≤ c₁ (log(1/h)/√n) ‖f‖²_K.
Moreover, under the assumptions of the Basic Theorem, with the same probability
‖f_{H_K} − f^λ_z‖_{{x_i}} ≤ (c₅ φ(λ)√λ + c₆/√(λn)) log(1/h).

SLIDE 14

Balancing Principle for Learning Theory

Candidates: {f^{λ_i}_z}, λ_i = λ₀ q^i, i = 0, 1, ..., M;  λ₀ = (2√2/√n) κ log(4/h), q > 1.
λ_emp = max{λ_k : ‖f^{λ_k}_z − f^{λ_j}_z‖_{{x_i}} ≤ 4c₆ log(1/h)/√(λ_j n), j = 0, 1, ..., k−1},
λ_{H_K} = max{λ_k : ‖f^{λ_k}_z − f^{λ_j}_z‖_K ≤ 4c₄ log(1/h)/(λ_j √n), j = 0, 1, ..., k−1}.
Th.3. Under the assumptions of the Basic Theorem, the choice λ₊ = min{λ_emp, λ_{H_K}} guarantees the optimal order of the risk without knowledge of the function φ generating the source condition.
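
The selection rule itself is a short loop. A generic sketch (my illustration under assumed interfaces: `estimators[i]` holds f^{λ_i}_z in some vector representation, `norm` is the chosen norm, and `rhs(λ_j)` is the right-hand side, e.g. 4c₆ log(1/h)/√(λ_j n)):

```python
import numpy as np

def balancing_lambda(lambdas, estimators, norm, rhs):
    """Largest lambda_k whose estimator stays within the prescribed distance
    of every estimator with smaller regularization:
        ||f^{lambda_k} - f^{lambda_j}|| <= rhs(lambda_j),  j = 0, ..., k-1."""
    best = lambdas[0]
    for k in range(1, len(lambdas)):
        if all(norm(estimators[k] - estimators[j]) <= rhs(lambdas[j])
               for j in range(k)):
            best = lambdas[k]
        else:
            break
    return best

# Toy check: the jump at the last estimator disqualifies its lambda
lams = [1.0, 2.0, 4.0, 8.0]
fs = [np.array([0.0]), np.array([0.1]), np.array([0.2]), np.array([5.0])]
lam = balancing_lambda(lams, fs, norm=np.linalg.norm, rhs=lambda l: 1.0)
assert lam == 4.0
```

Th.3's λ₊ would then be the minimum of two such calls, one with the empirical norm and one with the RKHS norm.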

SLIDE 15

Adaptive scheme

‖f_{H_K} − f^λ_z‖ ≤ φ(λ) + c/(λ^v √n),  v = 1/2 or v = 1 (empirical and RKHS norm, respectively).
Hence, for λ_j ≤ λ_k with φ(λ_j) ≤ c/(λ_j^v √n) (which holds below the balancing value):
‖f^{λ_k}_z − f^{λ_j}_z‖ ≤ ‖f_{H_K} − f^{λ_k}_z‖ + ‖f_{H_K} − f^{λ_j}_z‖ ≤ 4c/(λ_j^v √n).

SLIDE 16

Heuristic Counterpart of the Balancing Principle for Learning Theory

Mathé, Pereverzev (2003), Pereverzev, Schock (2005): in the balancing principle it is enough to compare only consecutive differences ‖f^{λ_ν}_z − f^{λ_{ν−1}}_z‖, ν = 1, 2, ..., M, instead of all pairs ‖f^{λ_ν}_z − f^{λ_μ}_z‖, ν = 1, ..., M, μ = 0, 1, ..., ν−1.
Tikhonov, Glasko (1965): quasi-optimality criterion
σ(ν) = ‖f^{λ_ν}_z − f^{λ_{ν−1}}_z‖,  λ_QO = λ_k,  k = argmin{σ(ν), ν = 1, 2, ..., M}.
Quasi-balancing principle: λ^{QB}₊ = min{λ^{QO}_emp, λ^{QO}_{H_K}}, where
λ^{QO}_emp = λ_k,  k = argmin{σ_emp(ν) = ‖f^{λ_ν}_z − f^{λ_{ν−1}}_z‖_{{x_i}}, ν = 1, 2, ..., M},
λ^{QO}_{H_K} = λ_l,  l = argmin{σ_{H_K}(ν) = ‖f^{λ_ν}_z − f^{λ_{ν−1}}_z‖_K, ν = 1, 2, ..., M}.
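
Because only consecutive differences are needed, the quasi-balancing rule is cheap to implement. A minimal sketch (my illustration; `estimators` and the norm callables are assumed interfaces):

```python
import numpy as np

def quasi_optimality(lambdas, estimators, norm):
    """sigma(nu) = ||f^{lambda_nu} - f^{lambda_{nu-1}}||; return the lambda
    at the minimizing index (Tikhonov-Glasko quasi-optimality criterion)."""
    sigma = [norm(estimators[v] - estimators[v - 1])
             for v in range(1, len(estimators))]
    return lambdas[int(np.argmin(sigma)) + 1]

def quasi_balancing(lambdas, estimators, norm_emp, norm_K):
    """lambda^{QB}_+ = min{lambda^{QO}_emp, lambda^{QO}_{H_K}}."""
    return min(quasi_optimality(lambdas, estimators, norm_emp),
               quasi_optimality(lambdas, estimators, norm_K))

# Toy check: sigma = [3.0, 0.1, 1.9] is minimized at nu = 2
lams = [1.0, 2.0, 3.0, 4.0]
fs = [np.array([0.0]), np.array([3.0]), np.array([3.1]), np.array([5.0])]
assert quasi_optimality(lams, fs, np.linalg.norm) == 3.0
```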

SLIDE 17

Numerical Test 1.

Test example by Micchelli and Pontil (2005):
f_{H_K}(x) = (1/10)(x + 2(e^{−8(4π/3 − x)²} − e^{−8(π/2 − x)²} − e^{−8(3π/2 − x)²})),  x ∈ [0, 2π].
f_{H_K} ∈ H_K with K(x, t) = xt + e^{−8(t−x)²}.
z = {(x_i, y_i)}_{i=1}^n,  x_i = 2πi/m,  y_i = f_{H_K}(x_i) + ζ,  ζ uniformly sampled in [−0.02, 0.02].

Prediction of interpolation type: n = m. Prediction of extrapolation type: n < m.
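
This test setup is easy to reproduce. A sketch (the random generator and seed are my choices; the target function and sampling scheme follow the slide):

```python
import numpy as np

def f_target(x):
    """Micchelli-Pontil test function:
    f(x) = (1/10)(x + 2(e^{-8(4pi/3-x)^2} - e^{-8(pi/2-x)^2} - e^{-8(3pi/2-x)^2}))."""
    return 0.1 * (x + 2.0 * (np.exp(-8.0 * (4.0 * np.pi / 3.0 - x) ** 2)
                             - np.exp(-8.0 * (np.pi / 2.0 - x) ** 2)
                             - np.exp(-8.0 * (1.5 * np.pi - x) ** 2)))

def training_data(n, m, rng):
    """x_i = 2*pi*i/m, y_i = f(x_i) + zeta, zeta ~ U[-0.02, 0.02].
    n = m: interpolation-type prediction; n < m: extrapolation-type."""
    x = 2.0 * np.pi * np.arange(1, n + 1) / m
    y = f_target(x) + rng.uniform(-0.02, 0.02, size=n)
    return x, y

rng = np.random.default_rng(0)
x, y = training_data(20, 20, rng)               # Test 1.1 setting
assert np.max(np.abs(y - f_target(x))) <= 0.02  # noise stays in the stated band
```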

SLIDE 18

Numerical Test 1 (continuation).

f^λ_z = argmin{(1/n) Σ_{i=1}^n (f(x_i) − y_i)² + λ‖f‖²_K} = (λI + T_x)^{−1} S*_x y,
λ ∈ {λ_i = λ₀ q^i, i = 0, 1, ..., 20},  λ₀ = 10^{−6},  q = 1.5.
Test 1.1: n = m = 20,  λ^{QB}₊ = 1.5 · 10^{−6}.
Test 1.2: n = m = 50,  λ^{QB}₊ = 0.0033.
Test 1.3: n = m = 50,  λ = 10^{−5} (for comparison).

SLIDE 19

Numerical Test 1.1.

Figure: σ(ν) values in different norms when sampling frequency is T/20. The X-axis stands for the values of ν, the Y-axis stands for the values of σ(ν). The blue dots are empirical norms; the green crosses are RKHS norms.

SLIDE 20

Numerical Test 1.1 (continuation).

Figure: Approximation using λ^{QB}₊ = 1.5 · 10^{−6} for training data generated at sampling frequency T/20. The green line is the real signal; the blue dots are the training set {(t_i, f(t_i))}, i = 0, 1, ..., 20; the red line is the approximation obtained by the proposed method.

SLIDE 21

Numerical Test 1.2.

Figure: σ(ν) values in different norms when sampling frequency is T/50.

SLIDE 22

Numerical Test 1.2 (continuation).

Figure: λ^{QB}₊ = 0.0033 with sampling frequency T/50.

SLIDE 23

Numerical Test 1.3.

Figure: λ = 0.00001.

SLIDE 24

Application to the problem of adaptive kernel choice.

Micchelli and Pontil (2005): K ∈ 𝒦.  K_opt = K_MP(λ) = argmin{Q_λ(K), K ∈ 𝒦},
Q_λ(K) = inf{(1/|z|) Σ_{(x_i,y_i)∈z} (f(x_i) − y_i)² + λ‖f‖²_K,  f ∈ H_K}.
Note that K_MP is λ-dependent, so this criterion can be used only for an a priori given regularization parameter λ.

SLIDE 25

Combination with the (Quasi-)Balancing principle.

– Consider the map λ₊ : 𝒦 → R₊, K ↦ λ₊(K), where λ₊(K) is the parameter chosen for K ∈ 𝒦 in accordance with the (quasi-)balancing principle.
– Assume that λ* is a fixed point of the map λ ↦ λ₊(K_MP(λ)). Then K = K_MP(λ*) ∈ 𝒦 is optimal in the sense of Micchelli and Pontil.

SLIDE 26

Numerical Test 2. Prediction of an interpolation type.

The same test as in Micchelli and Pontil (2005): f ∈ H_K, K(x, t) = xt + e^{−8(t−x)²};
z = {(x_i, y_i)}_{i=1}^n,  x_i = 2πi/m,  y_i = f(x_i) + ζ,  ζ ∈ [−0.02, 0.02],  m = n = 20.
𝒦 = {K(x, t) = (xt)^β + e^{−j(x−t)²},  β ∈ {0.5, 1, 2, 3, 4},  j ∈ {1, 2, ..., 10}}.
λ* solves λ = λ₊(K_MP(λ)):  λ* = 0.0012,  K_MP(λ*) = xt + e^{−10(x−t)²}.

SLIDE 27

Numerical Test 2. Prediction of an interpolation type. (continuation).

Figure: Approximation using the kernel K(t₁, t₂) = t₁t₂ + e^{−10(t₁−t₂)²} and λ = 0.0012 for training data generated at sampling frequency T/20.

SLIDE 28

Numerical Test 2. Warning example.

Prediction of extrapolation type: f, f^λ_z ∈ H_K, K(x, t) = xt + e^{−8(t−x)²}, as in Micchelli and Pontil (2005);
z = {(x_i, y_i)}_{i=1}^n,  x_i = 2πi/m,  y_i = f(x_i) + ζ,  ζ ∈ [−0.02, 0.02],  m = 20, n = 15.

SLIDE 29

Possible remedy. New criterion

Let z_s ⊂ z, and let λ₊ = λ₊(K) be chosen in accordance with the (quasi-)balancing principle for K ∈ 𝒦 and z_s.
f^{λ₊}_{z_s}(K, ·) = argmin{(1/|z_s|) Σ_{(x_i,y_i)∈z_s} (f(x_i) − y_i)² + λ₊‖f‖²_K,  f ∈ H_K}.
K_opt = argmin{(1/|z\z_s|) Σ_{(x_i,y_i)∈z\z_s} (f^{λ₊}_{z_s}(K, x_i) − y_i)²,  K ∈ 𝒦}.
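
This criterion amounts to hold-out validation over the kernel family. A generic sketch (my illustration; the `select_lambda` and `fit` callables are assumed interfaces standing in for the (quasi-)balancing rule on z_s and the regularized estimator):

```python
import numpy as np

def holdout_kernel_choice(kernels, z_s, z_rest, select_lambda, fit):
    """For each K in the family: choose lambda_+ = lambda_+(K) on z_s,
    fit f^{lambda_+}_{z_s}(K, .), and keep the kernel minimizing the
    hold-out squared error on z \\ z_s."""
    xs, ys = z_s
    xv, yv = z_rest
    best_K, best_err = None, np.inf
    for K in kernels:
        lam = select_lambda(K, xs, ys)       # parameter choice on z_s only
        f = fit(K, xs, ys, lam)              # regularized estimator on z_s
        err = np.mean((f(xv) - yv) ** 2)     # empirical risk on z \ z_s
        if err < best_err:
            best_K, best_err = K, err
    return best_K, best_err

# Toy check with stand-in fits: the kernel tracking the data wins
xv = np.linspace(0.0, 1.0, 5)
fits = {"good": np.sin, "bad": np.zeros_like}
K_opt, _ = holdout_kernel_choice(
    ["good", "bad"], (None, None), (xv, np.sin(xv)),
    select_lambda=lambda K, xs, ys: 0.0,
    fit=lambda K, xs, ys, lam: fits[K])
assert K_opt == "good"
```

Unlike the Micchelli-Pontil criterion, the hold-out score is evaluated on data not used for fitting, which is what rescues the extrapolation-type warning example.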

SLIDE 30

Warning example revisited.

𝒦 = {K(x, t) = (xt)^β + e^{−j(x−t)²},  β ∈ [0.5, 4],  j ∈ [1, 10]};
f ∈ H_K with K(x, t) = xt + e^{−8(t−x)²};
f^λ_z ∈ H_{K_opt} with K_opt(x, t) = (xt)^{1.2} + e^{−2.6(t−x)²}.

SLIDE 31

Application to ”DIAdvisor”

Patient: ANA1235, Subject ID - 47

Day 1 (specification): β = 2.5, j = 0, λ₊ = 0.005, n = 10, m = 14.
Day 2 (prediction 1): λ₊ = 0.005, n = 10, m = 13.
Day 3 (prediction 2): λ₊ = 0.005, n = 10, m = 13.

SLIDE 32

Application to ”DIAdvisor”. (continuation)

Patient: ANA1235, Subject ID - 47

Day 1 (specification): β = 1.5, j = 0, λ₊ = 0.005, n = 6, m = 14.
Day 2 (prediction 1): λ₊ = 0.005, n = 6, m = 13.
Day 3 (prediction 2): λ₊ = 0.005, n = 6, m = 13.

SLIDE 33

Patient: ANA1235, Subject ID - 10

Figure: Day 1, Day 2, and Day 3 glucose profiles (plot data omitted).

Kernel K(x, t) = (xt)^{0.35} + 0.005 e^{−0.001(x−t)²} on the interval [15, 105]; λ₀ = 0.0001; for Day 2 and Day 3, λ₊ = 1.01 × 10^{−4}.
Kernel K(x, t) = (xt)^{0.01} + 0.1 e^{−0.00013(x−t)²} on the interval [75, 165]; λ₀ = 0.0001; for Day 2 and Day 3, λ₊ = 1.01 × 10^{−4}.