SLIDE 1

Nonparametric prediction

László Györfi

Budapest University of Technology and Economics
Department of Computer Science and Information Theory
e-mail: gyorfi@szit.bme.hu
www.szit.bme.hu/∼gyorfi
SLIDE 2

Universal prediction: squared loss

$y_i$ real valued, $x_i$ vector valued. At time instant $i$ the predictor is asked to guess $y_i$ with knowledge of the past $(x_1, \dots, x_i, y_1, \dots, y_{i-1}) = (x_1^i, y_1^{i-1})$.

The predictor is a sequence of functions $g = \{g_i\}_{i=1}^{\infty}$, where $g_i(x_1^i, y_1^{i-1})$ is the estimate of $y_i$.

After $n$ time instants the empirical squared error for the sequence $x_1^n, y_1^n$ is
$$L_n(g) = \frac{1}{n} \sum_{i=1}^{n} \left( g_i(x_1^i, y_1^{i-1}) - y_i \right)^2.$$
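A minimal Python sketch of this running squared loss (illustrative only; the predictor interface and the toy data are assumptions, not part of the slides):

```python
import numpy as np

def empirical_squared_loss(predictors, x, y):
    """Average squared error of a prediction sequence g = {g_i}.

    predictors[i](x[:i+1], y[:i]) is the guess for y[i] from the past.
    """
    losses = []
    for i in range(len(y)):
        guess = predictors[i](x[:i + 1], y[:i])
        losses.append((guess - y[i]) ** 2)
    return np.mean(losses)

# toy usage: every predictor simply repeats the previous y (0 at the start)
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = rng.normal(size=100)
naive = [lambda xs, ys: ys[-1] if len(ys) else 0.0] * 100
print(empirical_squared_loss(naive, x, y))
```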

SLIDE 3

Regression function estimation

$Y$ real valued, $X$ observation vector.

Regression problem: $\min_f E\{(Y - f(X))^2\}$.

Regression function: $m(x) = E\{Y \mid X = x\}$.

For each function $f$ one has
$$E\{(f(X) - Y)^2\} = E\{(m(X) - Y)^2\} + E\{(m(X) - f(X))^2\}.$$
SLIDE 4

Data: $D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$. Regression function estimate: $m_n(x) = m_n(x, D_n)$. Usual consistency conditions:

  • m(x) is smooth
  • X has a density
  • Y is bounded

Nonparametric features:

  • construction of the estimate
  • consistency

SLIDE 5

Universal consistency

Definition 1. The estimator $m_n$ is called (weakly) universally consistent if $E\{(m(X) - m_n(X))^2\} \to 0$ for all distributions of $(X, Y)$ with $E Y^2 < \infty$.

SLIDE 6

Local averaging estimates

Stone (1977):
$$m_n(x) = \sum_{i=1}^{n} W_{ni}(x; X_1, \dots, X_n) Y_i.$$

SLIDE 7

k-nearest neighbor estimate

$W_{ni}$ is $1/k$ if $X_i$ is one of the $k$ nearest neighbors of $x$ among $X_1, \dots, X_n$, and $W_{ni}$ is $0$ otherwise.

Theorem 1. If $k_n \to \infty$ and $k_n/n \to 0$, then the $k$-nearest neighbor estimate is weakly universally consistent.
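A minimal Python sketch of the $k$-NN regression estimate (illustrative; the data and the choice $k_n \approx \sqrt{n}$ are assumptions):

```python
import numpy as np

def knn_regression(x, X, Y, k):
    """m_n(x): average of the Y_i whose X_i are the k nearest neighbors of x."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Y[nearest].mean()

# toy usage with k_n growing like sqrt(n)
rng = np.random.default_rng(1)
n = 500
X = rng.uniform(-1, 1, size=(n, 2))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)
print(knn_regression(np.array([0.2, -0.3]), X, Y, k=int(np.sqrt(n))))
```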

SLIDE 8

Partitioning estimate

Partition $\mathcal{P}_n = \{A_{n,1}, A_{n,2}, \dots\}$,
$$m_n(x) = \frac{\sum_{i=1}^{n} Y_i K_n(x, X_i)}{\sum_{i=1}^{n} K_n(x, X_i)}, \quad \text{where} \quad K_n(x, u) = \sum_{j} I_{[x \in A_{n,j},\, u \in A_{n,j}]}.$$

SLIDE 9

Theorem 2. If for every sphere $S$ centered at the origin
$$\lim_{n \to \infty} \sup_{j:\, A_{n,j} \cap S \neq \emptyset} \operatorname{diam}(A_{n,j}) = 0 \quad \text{and} \quad \lim_{n \to \infty} \frac{|\{j;\, A_{n,j} \cap S \neq \emptyset\}|}{n} = 0,$$
then the partitioning estimate is weakly universally consistent.

Example: $A_{n,j}$ are cubes with volume $h_n^d$, $h_n \to 0$, $n h_n^d \to \infty$.
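A minimal Python sketch of the cubic-partition (histogram) estimate from the example (illustrative; the bandwidth schedule is an assumption):

```python
import numpy as np
from collections import defaultdict

def partition_regression(x, X, Y, h):
    """m_n(x) for the cubic partition: average of the Y_i falling in the cube of x."""
    cell = tuple(np.floor(x / h).astype(int))
    sums = defaultdict(float)
    counts = defaultdict(int)
    for xi, yi in zip(X, Y):
        c = tuple(np.floor(xi / h).astype(int))
        sums[c] += yi
        counts[c] += 1
    return sums[cell] / counts[cell] if counts[cell] else 0.0

# toy usage: h_n -> 0 while n * h_n^d -> infinity, e.g. h_n = n**(-1/(d+2))
rng = np.random.default_rng(2)
n, d = 2000, 2
X = rng.uniform(-1, 1, size=(n, d))
Y = X[:, 0] ** 2 + 0.1 * rng.normal(size=n)
print(partition_regression(np.array([0.5, 0.5]), X, Y, h=n ** (-1 / (d + 2))))
```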

SLIDE 10

Kernel estimate

Kernel function $K(x) \geq 0$, bandwidth $h_n > 0$,
$$m_n(x) = \frac{\sum_{i=1}^{n} Y_i K\!\left(\frac{x - X_i}{h_n}\right)}{\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h_n}\right)}.$$

Theorem 3. If $h_n \to 0$ and $n h_n^d \to \infty$, then under some conditions on $K$ the kernel estimate is weakly universally consistent.
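A minimal Python sketch of the kernel (Nadaraya–Watson) estimate with a Gaussian kernel (illustrative; the particular kernel and bandwidth are assumptions):

```python
import numpy as np

def kernel_regression(x, X, Y, h):
    """m_n(x) = sum_i Y_i K((x-X_i)/h) / sum_i K((x-X_i)/h) with a Gaussian K."""
    u = np.linalg.norm((X - x) / h, axis=1)
    w = np.exp(-0.5 * u ** 2)          # Gaussian kernel, K >= 0
    return np.dot(w, Y) / w.sum() if w.sum() > 0 else 0.0

rng = np.random.default_rng(3)
n, d = 1000, 2
X = rng.uniform(-1, 1, size=(n, d))
Y = np.cos(2 * X[:, 1]) + 0.1 * rng.normal(size=n)
print(kernel_regression(np.array([0.0, 0.3]), X, Y, h=n ** (-1 / (d + 4))))
```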

SLIDE 11

Least squares estimates

Empirical $L_2$ error:
$$\frac{1}{n} \sum_{j=1}^{n} |f(X_j) - Y_j|^2.$$

Class of functions $\mathcal{F}_n$. Select a function from $\mathcal{F}_n$ which minimizes the empirical error: $m_n \in \mathcal{F}_n$ and
$$\frac{1}{n} \sum_{j=1}^{n} |m_n(X_j) - Y_j|^2 = \min_{f \in \mathcal{F}_n} \frac{1}{n} \sum_{j=1}^{n} |f(X_j) - Y_j|^2.$$

The class $\mathcal{F}_n$ grows slowly as $n$ grows.
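A minimal Python sketch of empirical least squares over a small polynomial class (illustrative; the class and the degree schedule are assumptions):

```python
import numpy as np

def least_squares_polynomial(X, Y, degree):
    """Pick the polynomial of the given degree minimizing the empirical L2 error."""
    coeffs = np.polyfit(X, Y, degree)       # ordinary least squares fit
    return np.poly1d(coeffs)

rng = np.random.default_rng(4)
n = 300
X = rng.uniform(-1, 1, size=n)
Y = np.sin(np.pi * X) + 0.1 * rng.normal(size=n)
# let the class F_n grow slowly with n, e.g. degree ~ log n
m_n = least_squares_polynomial(X, Y, degree=int(np.log(n)))
print(np.mean((m_n(X) - Y) ** 2))           # empirical L2 error of the minimizer
```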

SLIDE 12

Examples for Fn:

  • polynomials
  • splines
  • neural networks
  • radial basis functions

SLIDE 13

Dependent data: time series

The data $D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ are dependent, possibly long-range dependent, and form a stationary and ergodic process.

For given $n$, the problem is the following minimization:
$$\min_{g} E\{(g(X_{n+1}, D_n) - Y_{n+1})^2\}.$$

The best predictor is the conditional expectation $E\{Y_{n+1} \mid X_{n+1}, D_n\}$, which cannot be learned from data.

SLIDE 14

There is no prediction sequence with
$$\lim_{n \to \infty} \left( g_n(X_{n+1}, D_n) - E\{Y_{n+1} \mid X_{n+1}, D_n\} \right) = 0$$
a.s. for all stationary and ergodic sequences.

Our aim is to achieve the optimum
$$L^* = \lim_{n \to \infty} \min_{g} E\{(g(X_{n+1}, D_n) - Y_{n+1})^2\},$$
which is impossible.

SLIDE 15

Universal consistency

There are universally Cesàro consistent prediction sequences $g_n$:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} (g_i(X_{i+1}, D_i) - Y_{i+1})^2 = L^* \quad \text{a.s.}$$
for all stationary and ergodic sequences. Such a prediction sequence is called universally consistent.

We show a construction of a universally consistent predictor by combining predictors (experts).

SLIDE 16

Lemma

Let $\tilde{h}_1, \tilde{h}_2, \dots$ be a sequence of prediction strategies (experts), and let $\{q_k\}$ be a probability distribution on the set of positive integers. Assume that $\tilde{h}_i(y_1^{n-1}) \in [-B, B]$ and $y_1^n \in [-B, B]^n$. Define
$$w_{t,k} = q_k e^{-(t-1) L_{t-1}(\tilde{h}_k)/c}$$
with $c \geq 8B^2$, and
$$v_{t,k} = \frac{w_{t,k}}{\sum_{i=1}^{\infty} w_{t,i}}.$$

Reminder: $L_n(g) = \frac{1}{n} \sum_{i=1}^{n} (g_i(x_1^i, y_1^{i-1}) - y_i)^2$.

SLIDE 17

If the prediction strategy $\tilde{g}$ is defined by
$$\tilde{g}_t(y_1^{t-1}) = \sum_{k=1}^{\infty} v_{t,k}\, \tilde{h}_k(y_1^{t-1}), \quad t = 1, 2, \dots$$
then for every $n \geq 1$,
$$L_n(\tilde{g}) \leq \inf_{k} \left( L_n(\tilde{h}_k) - \frac{c \ln q_k}{n} \right).$$
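A minimal Python sketch of this exponentially weighted combination for finitely many experts (illustrative; the experts, the bound $B$, and the data are assumptions):

```python
import numpy as np

def combine_experts(expert_preds, y, B=1.0, q=None):
    """Exponentially weighted average forecaster.

    expert_preds[k][t] is expert k's prediction of y[t]; returns the combined
    predictions g~_t = sum_k v_{t,k} h~_k with weights from cumulative squared loss.
    """
    N, n = expert_preds.shape
    c = 8 * B ** 2
    q = np.full(N, 1.0 / N) if q is None else q
    cum_loss = np.zeros(N)                    # (t-1) * L_{t-1}(h~_k)
    combined = np.zeros(n)
    for t in range(n):
        w = q * np.exp(-cum_loss / c)
        v = w / w.sum()
        combined[t] = np.dot(v, expert_preds[:, t])
        cum_loss += (expert_preds[:, t] - y[t]) ** 2
    return combined

# toy usage: three constant experts predicting -0.5, 0.0 and 0.5
rng = np.random.default_rng(5)
y = np.clip(0.4 + 0.1 * rng.normal(size=200), -1, 1)
experts = np.vstack([np.full(200, v) for v in (-0.5, 0.0, 0.5)])
g = combine_experts(experts, y)
print(np.mean((g - y) ** 2))
```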

SLIDE 18

Special case: $N$ predictors and $\{q_k\}$ the uniform distribution; then
$$L_n(\tilde{g}) \leq \min_{k} L_n(\tilde{h}_k) + \frac{c \ln N}{n}.$$

SLIDE 19

Dependent data: time series

Stationary and ergodic data $(X_1, Y_1), \dots, (X_n, Y_n)$. Assume that $|Y_0| \leq B$.

An elementary predictor (expert) is denoted by $h^{(k,\ell)}$, $k, \ell = 1, 2, \dots$. Let $G_\ell$ be a quantizer of $\mathbb{R}^d$ and $H_\ell$ a quantizer of $\mathbb{R}$.

For given $k, \ell$, let $I_n$ be the set of time instants $k < i < n$ for which there is a match of the $k$-length quantized sequences:
$$G_\ell(x_{i-k}^{i}) = G_\ell(x_{n-k}^{n}) \quad \text{and} \quad H_\ell(y_{i-k}^{i-1}) = H_\ell(y_{n-k}^{n-1}).$$

SLIDE 20

Then the prediction of this expert is the average of the $y_i$'s with $i \in I_n$:
$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = \frac{\sum_{i \in I_n} y_i}{|I_n|}.$$

These predictors are not universally consistent: for small $k$ the bias is large, and for large $k$ the variance is large because of the few matchings. The same is true for the quantizers.

The problem is how to choose $k, \ell$ in a data-dependent way. The solution is the combination of experts.
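A minimal Python sketch of one such elementary expert (illustrative; the quantizers here are simple rounding/sign maps, which is an assumption):

```python
import numpy as np

def elementary_expert(x, y, k, quantize_x, quantize_y):
    """h^{(k,l)}: average the past y_i whose quantized k-length context matches
    the current one.  x = (x_1,...,x_n) with shape (n, d), y = (y_1,...,y_{n-1})."""
    n = len(x)

    def context(i):  # quantized context ending at 1-based time i
        xs = tuple(quantize_x(x[j]) for j in range(i - k - 1, i))      # x_{i-k..i}
        ys = tuple(quantize_y(y[j]) for j in range(i - k - 1, i - 1))  # y_{i-k..i-1}
        return xs, ys

    target = context(n)
    matches = [y[i - 1] for i in range(k + 1, n) if context(i) == target]
    return np.mean(matches) if matches else 0.0

# toy usage with coarse rounding as the quantizers G_l, H_l
rng = np.random.default_rng(6)
n = 400
x = rng.normal(size=(n, 1))
y = np.sign(x[:-1, 0]) + 0.1 * rng.normal(size=n - 1)
print(elementary_expert(x, y, k=2,
                        quantize_x=lambda v: tuple(np.round(v)),
                        quantize_y=lambda v: round(v)))
```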

SLIDE 21

The combination of predictors can be derived according to the previous lemma. Let $\{q_{k,\ell}\}$ be a probability distribution over the pairs $(k, \ell)$, and for $c = 8B^2$ put
$$w_{t,k,\ell} = q_{k,\ell}\, e^{-(t-1) L_{t-1}(h^{(k,\ell)})/c} \quad \text{and} \quad v_{t,k,\ell} = \frac{w_{t,k,\ell}}{\sum_{i,j=1}^{\infty} w_{t,i,j}}.$$
Then the combined prediction is
$$g_t(x_1^t, y_1^{t-1}) = \sum_{k,\ell=1}^{\infty} v_{t,k,\ell}\, h^{(k,\ell)}(x_1^t, y_1^{t-1}).$$

SLIDE 22

Theorem

If the quantizers $G_\ell$ and $H_\ell$ "are asymptotically fine" and $P\{Y_i \in [-B, B]\} = 1$, then the combined predictor $g$ is universally consistent.

L. Györfi, G. Lugosi (2001) "Strategies for sequential prediction of stationary time series", in Modelling Uncertainty: An Examination of its Theory, Methods and Applications, M. Dror, P. L'Ecuyer, F. Szidarovszky (Eds.), pp. 225-248, Kluwer Academic Publishers.

SLIDE 23

0 − 1 loss

$y_i$ takes values in the finite set $\{1, 2, \dots, M\}$. At time instant $i$ the classifier decides on $y_i$ based on the past observations $(x_1^i, y_1^{i-1})$. After $n$ rounds the empirical error for $x_1^n, y_1^n$ is
$$L_n(g) = \frac{1}{n} \sum_{i=1}^{n} I_{\{g(x_1^i, y_1^{i-1}) \neq y_i\}},$$
i.e., the loss is the $0-1$ loss, and $L_n(g)$ is the relative frequency of errors.

SLIDE 24

Pattern recognition

$Y$ is $\{1, 2, \dots, M\}$-valued, $X$ is the feature vector. Classifier: $g : \mathbb{R}^d \to \{1, 2, \dots, M\}$.

Probability of error: $L_g = P(g(X) \neq Y)$.

A posteriori probabilities: $P_i(x) = P\{Y = i \mid X = x\}$.

Bayes decision: $g^*(x) = \arg\max_i P_i(x)$. $L^*$ is the Bayes error.

SLIDE 25

Universal consistency

Data: $(X_1, Y_1), \dots, (X_n, Y_n)$, $g_n(x) = g_n((X_1, Y_1), \dots, (X_n, Y_n), x)$.

Definition 2. The classifier $g_n$ is called (weakly) universally consistent if $P(g_n(X) \neq Y) \to L^*$ for all distributions of $(X, Y)$.

SLIDE 26

Local majority voting

$k$-nearest neighbor rule:
$$g_n(x) = \arg\max_{j} \sum_{i=1}^{n} W_{n,i}(x) I_{\{Y_i = j\}}.$$
Partitioning rule:
$$g_n(x) = \arg\max_{j} \sum_{i=1}^{n} I_{\{X_i \in A_n(x)\}} I_{\{Y_i = j\}}.$$
Kernel rule:
$$g_n(x) = \arg\max_{j} \sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right) I_{\{Y_i = j\}}.$$

The $k$-NN rule, the partitioning rule and the kernel rule are strongly universally consistent.
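A minimal Python sketch of the $k$-NN majority-vote rule (illustrative; the data and the choice of $k$ are assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X, Y, k):
    """g_n(x): the label j with the most votes among the k nearest neighbors of x."""
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return Counter(Y[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(7)
n = 500
X = rng.normal(size=(n, 2))
Y = (X[:, 0] + X[:, 1] > 0).astype(int) + 1     # labels in {1, 2}
print(knn_classify(np.array([0.5, 0.5]), X, Y, k=int(np.sqrt(n))))
```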

SLIDE 27

Empirical error minimization

Empirical error:
$$\frac{1}{n} \sum_{j=1}^{n} I_{\{g(X_j) \neq Y_j\}}.$$

Class of classifiers $\mathcal{G}_n$. Select a classifier from $\mathcal{G}_n$ which minimizes the empirical error: $g_n \in \mathcal{G}_n$ and
$$\frac{1}{n} \sum_{j=1}^{n} I_{\{g_n(X_j) \neq Y_j\}} = \min_{g \in \mathcal{G}_n} \frac{1}{n} \sum_{j=1}^{n} I_{\{g(X_j) \neq Y_j\}}.$$

The VC dimension of $\mathcal{G}_n$ grows slowly as $n$ grows.

SLIDE 28

Examples for Gn:

  • polynomial classifiers
  • tree classifiers
  • neural network classifiers
  • radial basis function classifiers

SLIDE 29

Dependent data: time series

Data $D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ form a stationary and ergodic process. For given $n$, the problem is the following minimization:
$$\min_{g} P\{g(X_{n+1}, D_n) \neq Y_{n+1}\},$$
which cannot be learned from data.

Our aim is to achieve the optimum
$$R^* = \lim_{n \to \infty} \min_{g} P\{g(X_{n+1}, D_n) \neq Y_{n+1}\},$$
which is impossible.

SLIDE 30

There are universally Cesàro consistent classifier sequences $g_n$:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I_{\{g_i(X_{i+1}, D_i) \neq Y_{i+1}\}} = R^* \quad \text{a.s.}$$
for all stationary and ergodic sequences. Such a classifier sequence is called universally consistent.

SLIDE 31

Lemma

Let $\tilde{h}^{(1)}, \dots, \tilde{h}^{(N)}$ be a finite collection of classifier strategies (experts), and let $u$ denote the realization of a randomizing variable $U_t$, uniform on $[0, 1]$. If the classifier strategy $\tilde{g}$ is defined by
$$\tilde{g}_t(y_1^{t-1}, u) = \begin{cases} 0 & \text{if } u > \dfrac{\sum_{k=1}^{N} P\{\tilde{h}^{(k)}(y_1^{t-1}, U_t) = 1\}\, \tilde{w}_t(k)}{\sum_{k=1}^{N} \tilde{w}_t(k)}, \\ 1 & \text{otherwise,} \end{cases}$$
$t = 1, 2, \dots, n$, where for all $k = 1, \dots, N$,
$$\tilde{w}_1(k) = 1 \quad \text{and} \quad \tilde{w}_t(k) = e^{-\eta L_1^{t-1}(\tilde{h}^{(k)})}, \quad t > 1$$

SLIDE 32

with $\eta = \sqrt{8 \ln N / n}$, then for every $y_1^n \in \{0, 1\}^n$,
$$L_1^n(\tilde{g}) \leq \min_{k=1,\dots,N} L_1^n(\tilde{h}^{(k)}) + \sqrt{\frac{\ln N}{2n}}.$$

In

L. Györfi, G. Lugosi, G. Morvai (1999) "A simple randomized algorithm for consistent sequential prediction of ergodic time series", IEEE Trans. Information Theory, 45, pp. 2642-2650

there is a construction of a universally consistent classifier by randomized combination of classifiers (experts).

SLIDE 33

log utility: portfolio selection

Investment in the stock market. The return vector $x = (x^{(1)}, \dots, x^{(d)})$: the $j$-th component $x^{(j)}$ is the factor by which capital invested in stock $j$ grows during the market period.

A portfolio vector $b = (b^{(1)}, \dots, b^{(d)})$: the $j$-th component $b^{(j)}$ gives the proportion of the investor's capital invested in stock $j$.

$S_0$ denotes the initial capital, so
$$S_1 = S_0 \sum_{j=1}^{d} b^{(j)} x^{(j)} = S_0 \langle b, x \rangle.$$

SLIDE 34

Long run investment with initial capital $S_0$; $x_i$ is the return vector on day $i$.

$b = b_1$ is the portfolio vector for the first day: $S_1 = S_0 \cdot \langle b_1, x_1 \rangle$.

For the second day, $S_1$ is the new initial capital and the portfolio vector is $b_2 = b(x_1)$:
$$S_2 = S_0 \cdot \langle b_1, x_1 \rangle \cdot \langle b(x_1), x_2 \rangle.$$
On the $n$-th day the portfolio strategy is $b_n = b(x_1^{n-1})$, and
$$S_n = S_0 \prod_{i=1}^{n} \langle b(x_1^{i-1}), x_i \rangle = S_0\, e^{n W_n(B)}$$
with the average growth rate
$$W_n(B) = \frac{1}{n} \sum_{i=1}^{n} \log \langle b(x_1^{i-1}), x_i \rangle.$$
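A minimal Python sketch of the wealth recursion and average growth rate for a given strategy (illustrative; the constantly rebalanced uniform strategy and the simulated returns are assumptions):

```python
import numpy as np

def wealth_and_growth(strategy, returns, s0=1.0):
    """S_n = S_0 * prod_i <b(x_1^{i-1}), x_i> and W_n(B) = (1/n) sum_i log<.,.>.

    strategy(past) returns a portfolio vector b summing to 1; returns[i] is x_{i+1}.
    """
    log_factors = []
    for i in range(len(returns)):
        b = strategy(returns[:i])
        log_factors.append(np.log(np.dot(b, returns[i])))
    n = len(returns)
    growth = np.mean(log_factors)
    return s0 * np.exp(n * growth), growth

# toy usage: two assets, constantly rebalanced uniform portfolio
rng = np.random.default_rng(9)
x = np.exp(0.0005 + 0.01 * rng.normal(size=(250, 2)))   # daily return factors
uniform = lambda past: np.array([0.5, 0.5])
print(wealth_and_growth(uniform, x))
```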

SLIDE 35

log-optimum portfolio

$X_1, X_2, \dots$ are drawn from a vector valued stationary and ergodic process. The log-optimum portfolio $B^* = \{b^*(\cdot)\}$ satisfies
$$E\{\log \langle b^*(X_1^{n-1}), X_n \rangle \mid X_1^{n-1}\} = E\{\max_{b(\cdot)} \log \langle b(X_1^{n-1}), X_n \rangle \mid X_1^{n-1}\}.$$

SLIDE 36

If $S_n^* = S_n(B^*)$ denotes the capital after day $n$ achieved by a log-optimum portfolio strategy $B^*$, then for any portfolio strategy $B$ with capital $S_n = S_n(B)$ and for any stationary and ergodic process $\{X_n\}_{-\infty}^{\infty}$,
$$\limsup_{n \to \infty} \frac{1}{n} \log \frac{S_n}{S_n^*} \leq 0 \quad \text{almost surely}$$
and
$$\lim_{n \to \infty} \frac{1}{n} \log S_n^* = W^* \quad \text{almost surely},$$
where
$$W^* = E\left\{ \max_{b(\cdot)} E\{\log \langle b(X_{-\infty}^{-1}), X_0 \rangle \mid X_{-\infty}^{-1}\} \right\}$$
is the maximal growth rate of any portfolio.

SLIDE 37

Universal portfolio

These limit relations give rise to the following definition:

Definition 3. A portfolio strategy $B$ is called universal with respect to a class $\mathcal{C}$ of stationary and ergodic processes $\{X_n\}_{-\infty}^{\infty}$ if for each process in the class,
$$\lim_{n \to \infty} \frac{1}{n} \log S_n(B) = W^* \quad \text{almost surely}.$$

SLIDE 38

Elementary portfolio

$H^{(k,\ell)} = \{h^{(k,\ell)}(\cdot)\}$, $k, \ell = 1, 2, \dots$. Let $\mathcal{P}_\ell = \{A_{\ell,j},\ j = 1, 2, \dots, m_\ell\}$ be finite partitions of $\mathbb{R}^d$, and let $G_\ell$ be the corresponding quantizer: $G_\ell(x) = j$ if $x \in A_{\ell,j}$. For $x_1^n \in \mathbb{R}^{dn}$, write $G_\ell(x_1^n)$ for the quantized sequence.

Fix $k, \ell$. For each $k$-long string $s$ of positive integers, define the partitioning portfolio
$$b^{(k,\ell)}(x_1^{n-1}, s) = \arg\max_{b} \prod_{\{k < i < n:\ G_\ell(x_{i-k}^{i-1}) = s\}} \langle b, x_i \rangle.$$

SLIDE 39

The elementary portfolio $h^{(k,\ell)}$ is then
$$h^{(k,\ell)}(x_1^{n-1}) = b^{(k,\ell)}(x_1^{n-1}, G_\ell(x_{n-k}^{n-1})).$$
That is, $h_n^{(k,\ell)}$ quantizes the sequence $x_1^{n-1}$ according to the partition $\mathcal{P}_\ell$ and browses through all past appearances of the last seen quantized string $G_\ell(x_{n-k}^{n-1})$ of length $k$. Then it designs a fixed portfolio vector according to the returns on the days following the occurrences of the string.
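A minimal Python sketch of this elementary portfolio (illustrative; the sign-based quantizer and the numerical maximization via SciPy are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def log_optimal_portfolio(day_returns):
    """arg max_b prod_i <b, x_i> over the simplex, via maximizing sum_i log<b, x_i>."""
    d = day_returns.shape[1]
    objective = lambda b: -np.sum(np.log(day_returns @ b))
    result = minimize(objective, np.full(d, 1.0 / d), method="SLSQP",
                      bounds=[(0.0, 1.0)] * d,
                      constraints={"type": "eq", "fun": lambda b: b.sum() - 1.0})
    return result.x

def elementary_portfolio(x, k, quantize):
    """h^{(k,l)}: match the last k quantized return vectors against the past and
    fit a fixed portfolio to the returns of the days following each match."""
    n = len(x)                                   # x = known past return vectors
    context = tuple(quantize(v) for v in x[n - k:])
    follow = [x[i] for i in range(k, n)
              if tuple(quantize(v) for v in x[i - k:i]) == context]
    if not follow:
        return np.full(x.shape[1], 1.0 / x.shape[1])   # no match: uniform portfolio
    return log_optimal_portfolio(np.array(follow))

# toy usage with a coarse sign-based quantization as G_l
rng = np.random.default_rng(10)
x = np.exp(0.001 + 0.02 * rng.normal(size=(300, 3)))
print(elementary_portfolio(x, k=1, quantize=lambda v: tuple(v > 1.0)))
```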

SLIDE 40

Combining elementary portfolios

Finally, let $\{q_{k,\ell}\}$ be a probability distribution on the set of all pairs $(k, \ell)$ of positive integers such that $q_{k,\ell} > 0$ for all $k, \ell$. The strategy $B$ arises from weighting the elementary portfolio strategies $H^{(k,\ell)}$ according to their past performances and $\{q_{k,\ell}\}$, so that after day $n$ the investor's capital becomes
$$S_n(B) = \sum_{k,\ell} q_{k,\ell}\, S_n(H^{(k,\ell)}).$$
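A minimal Python sketch of this wealth-weighted combination (illustrative; truncating the pairs $(k, \ell)$ to a finite grid is an assumption):

```python
import numpy as np

def combined_wealth(elementary_wealths, q, s0=1.0):
    """S_n(B) = sum_{k,l} q_{k,l} S_n(H^{(k,l)}) over a finite grid of experts.

    elementary_wealths[m] is S_n(H^{(k,l)}) / S_0 for the m-th pair (k, l).
    """
    return s0 * np.dot(q, elementary_wealths)

# toy usage: three elementary strategies and a uniform prior q
wealths = np.array([1.10, 0.95, 1.25])
q = np.full(3, 1.0 / 3.0)
print(combined_wealth(wealths, q))
```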

SLIDE 41

Theorem

Assume that

(a) the sequence of partitions is nested, that is, any cell of $\mathcal{P}_{\ell+1}$ is a subset of a cell of $\mathcal{P}_\ell$, $\ell = 1, 2, \dots$;

(b) if $\operatorname{diam}(A) = \sup_{x,y \in A} \|x - y\|$ denotes the diameter of a set, then for any sphere $S$ centered at the origin
$$\lim_{\ell \to \infty} \max_{j:\ A_{\ell,j} \cap S \neq \emptyset} \operatorname{diam}(A_{\ell,j}) = 0.$$

Then the portfolio scheme $B$ defined above is universal with respect to the class of all ergodic processes such that $E\{|\log X^{(j)}|\} < \infty$ for $j = 1, 2, \dots, d$.

SLIDE 42

L. Györfi, D. Schäfer (2003) "Nonparametric prediction", in Advances in Learning Theory: Methods, Models and Applications, J. A. K. Suykens, G. Horváth, S. Basu, C. Micchelli, J. Vandewalle (Eds.), IOS Press, NATO Science Series, pp. 341-356. www.szit.bme.hu/∼gyorfi/histog.ps

L. Györfi, G. Lugosi, F. Udina (2005) "Nonparametric kernel-based sequential investment strategies", Mathematical Finance, .., pp. ...-... www.szit.bme.hu/∼gyorfi/kernel.ps