Hilbert's 13th Problem: Great Theorem; Shame about the Algorithm (Bill Moran) - PowerPoint PPT Presentation



SLIDE 1

Hilbert’s 13th Problem Great Theorem; Shame about the Algorithm

Bill Moran

SLIDE 2

Structure of Talk

Solving Polynomial Equations
Hilbert's 13th Problem
Kolmogorov-Arnold Theorem
Neural Networks

SLIDE 3

Quadratic Equations

ax² + bx + c = 0

x = (−b ± √(b² − 4ac)) / (2a)

How do we do it? Eliminate the x term by replacing x with y = x + b/(2a):

ay² + c − b²/(4a) = 0
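The depression trick on this slide can be sketched numerically; `solve_quadratic` is an illustrative name, and the sketch assumes real coefficients with b² ≥ 4ac:

```python
import math

def solve_quadratic(a, b, c):
    # Depress the quadratic: substituting x = y - b/(2a) kills the linear
    # term, leaving a*y^2 + c - b^2/(4a) = 0, i.e. y^2 = (b^2/(4a) - c)/a.
    y2 = (b * b / (4 * a) - c) / a
    y = math.sqrt(y2)          # assumes b^2 >= 4ac, i.e. real roots
    shift = b / (2 * a)        # undo the substitution: x = y - b/(2a)
    return (y - shift, -y - shift)
```

For example, `solve_quadratic(2, 3, -2)` returns (0.5, -2.0), matching x = (−b ± √(b² − 4ac))/(2a).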

SLIDE 4

What about Cubics?

ax³ + bx² + cx + d = 0 (1)

Eliminate the x² term — replace x by y = x + b/(3a):

y³ + c′y + d′ = 0

Write y = u + v:

u³ + v³ + (3uv + c′)(u + v) + d′ = 0

Set 3uv + c′ = 0:

u³ − (c′/(3u))³ + d′ = 0

This is a quadratic in u³ — solve the quadratic and take cube roots. This gives u, then v, then y, and finally x.

del Ferro, Tartaglia, Cardano, 1530
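The recipe above runs end to end in a few lines; a minimal sketch (`solve_cubic` is an illustrative name, and `cmath` is used because the intermediate cube roots are complex in general):

```python
import cmath

def solve_cubic(a, b, c, d):
    # Depress: x = y - b/(3a) removes the x^2 term, giving y^3 + p*y + q = 0.
    p = c / a - b * b / (3 * a * a)
    q = 2 * b**3 / (27 * a**3) - b * c / (3 * a * a) + d / a
    # With y = u + v and 3uv + p = 0, u^3 is a root of z^2 + q*z - p^3/27 = 0.
    disc = cmath.sqrt(q * q / 4 + p**3 / 27)
    u3 = -q / 2 + disc
    if abs(u3) < 1e-12:        # avoid u = 0: take the other quadratic root
        u3 = -q / 2 - disc
    u = u3 ** (1 / 3)                    # principal complex cube root
    w = cmath.exp(2j * cmath.pi / 3)     # primitive cube root of unity
    roots = []
    for k in range(3):
        uk = u * w**k                    # the three cube roots of u^3
        vk = -p / (3 * uk) if abs(uk) > 1e-12 else 0.0
        roots.append(uk + vk - b / (3 * a))   # y = u + v, then x = y - b/(3a)
    return roots
```

For x³ − 6x² + 11x − 6 = 0 this recovers the roots 1, 2, 3 (up to floating-point noise in the imaginary parts).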

SLIDE 5

Let’s be a little more adventurous

ax⁴ + bx³ + cx² + dx + e = 0

Similar trick to the cubic case to remove the cubic term:

y⁴ + py² + qy + r = 0

Complete the square:

(y² + p/2)² = p²/4 − qy − r

Introduce a new variable z: (y² + p/2 + z)² — this is

(y² + p/2)² + pz + 2y²z + z²

Then

(y² + p/2 + z)² = 2zy² − qy + (z² + zp + p²/4 − r)

SLIDE 6

Quartic Continued

Choose z to make the RHS a perfect square in y — so its discriminant is 0:

q² = 8z (z² + zp + p²/4 − r)

Solve this cubic for z; then we have A² = B² where

A = y² + p/2 + z and B² = 2zy² − qy + (z² + zp + p²/4 − r)

A = ±B gives two quadratics in y

Lodovico de Ferrari, Cardano

SLIDE 7

Quintic

ax⁵ + bx⁴ + cx³ + dx² + ex + f = 0 (2)

Tschirnhaus transformations:

y = g(x)/h(x), with g and h polynomials and h non-vanishing at the roots of the quintic

Tschirnhaus transformations can reduce (2) to the Bring-Jerrard form:

x⁵ − x + q = 0 (3)

where q is some rational function of the coefficients in (2). Solutions of (2) can be obtained as rational functions of the roots of (3). (Hermite) Elliptic modular functions involving q are used to solve (3).

SLIDE 8

Lest you think this is useless nonsense!

SLIDE 9

Sextic

ax⁶ + bx⁵ + cx⁴ + dx³ + ex² + fx + g = 0 (4)

Tschirnhaus transformations reduce this to:

x⁶ + px² + qx + 1 = 0 (5)

Its solution is φ(p, q). The solution uses derivatives of generalized hypergeometric functions with respect to their parameters, called Kampé de Fériet functions.

SLIDE 10

Septic

ax⁷ + bx⁶ + cx⁵ + dx⁴ + ex³ + fx² + gx + h = 0 (6)

Tschirnhaus transformations reduce this to:

x⁷ + px³ + qx² + rx + 1 = 0 (7)

Its solution is φ(p, q, r). Hilbert: can we express φ(p, q, r) in terms of functions of 2 variables? This is a measure of the complexity of the problem.

SLIDE 11

What this means

A function f(x1, x2, . . . , xn) of n variables is a superposition of functions gk(yk,1, yk,2, . . . , yk,rk) (k = 1, 2, . . . , m) if each yk,i is one of the variables xj and there is a function h so that

f(x1, x2, . . . , xn) = h(g1(y1,1, y1,2, . . . , y1,r1), g2(y2,1, y2,2, . . . , y2,r2), . . . , gm(ym,1, ym,2, . . . , ym,rm))
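A toy instance of this definition, with illustrative names: f(x1, x2, x3) = x1·x2 + sin(x3) is a superposition with inner functions g1, g2 and outer function h:

```python
import math

g1 = lambda x1, x2: x1 * x2        # inner function of 2 variables
g2 = lambda x3: math.sin(x3)       # inner function of 1 variable
h  = lambda u, v: u + v            # outer function of 2 variables

# f is then a superposition of functions of at most 2 variables
f = lambda x1, x2, x3: h(g1(x1, x2), g2(x3))
```

Here f(2, 3, 0) = 2·3 + sin 0 = 6.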

SLIDE 12

Solutions of Polynomial Equations and Superposition

Every solution of a polynomial equation of degree < 7 can be written as a superposition of functions of ≤ 2 variables

Every solution of a polynomial equation of degree n can be written as a superposition of functions of ≤ n − 4 variables

What about degree 7?

SLIDE 13

Hilbert’s 13th Problem


A solution of the general equation of degree 7 cannot be represented as a superposition of continuous functions of two variables

What he meant to say was "algebraic" or "analytic" instead of "continuous", as we shall see!

SLIDE 14

Why this might be a useful idea

Most functions we want to compute are composed of functions of at most two variables:

(x, y) → x + y, (x, y) → x·y, x → 1/x, (x, y) → y/√x, x → eˣ, x → log x, x → sin x, etc.

To compute gradients of such functions one can use the chain rule. This approach computes partial derivatives of functions of n variables more efficiently.

Kim, Nesterov, and Cherkasskii (1984): given such a computable function of n variables, one can compute the function and its gradient in only 4 times as many operations — for large n.
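The Kim-Nesterov-Cherkasskii observation is what is now called reverse-mode automatic differentiation: record how a value was built from small pieces, then push derivatives back through the chain rule. A minimal sketch (class and names are illustrative; the naive recursion recomputes shared subexpressions, trading the 4x efficiency for brevity):

```python
import math

class Var:
    """A value plus a record of (parent, local derivative) pairs."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Accumulate d(output)/d(self), then apply the chain rule upstream.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

def sin(x):
    # Unary primitive: local derivative is cos.
    return Var(math.sin(x.value), [(x, math.cos(x.value))])
```

For z = x·y + x with x = 2, y = 3, a single `z.backward()` yields ∂z/∂x = y + 1 = 4 and ∂z/∂y = x = 2 alongside the value z = 8.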

SLIDE 15

Enter Kolmogorov

Every continuous function of n variables on the unit cube is a superposition of continuous functions of 3 variables

SLIDE 16

Enter Kolmogorov

Every continuous function of n variables on the unit cube is a superposition of continuous functions of 3 variables

And Arnold: every continuous function of n variables on the unit cube is a superposition of continuous functions of 2 variables (resolving Hilbert's 13th Problem)

SLIDE 17

Sprecher’s Version

Sprecher: For each N ≥ 2 there is a Lipschitz function ψ in Lip(log 2 / log(2N + 2))(I) with the following property: for each δ > 0, there is a rational ǫ in the interval (0, δ) s.t. for all integers n (2 ≤ n ≤ N), and for every continuous function f(x1, x2, . . . , xn) on Iⁿ,

f(x1, x2, . . . , xn) = Σ_{0≤q≤2n} g( Σ_{p=1}^{n} λ^p ψ(x_p + ǫq) + q )   (8)

where g is continuous and λ > 0 is independent of f.
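The shape of representation (8) is easy to mirror in code; a structural sketch with illustrative stand-ins for ψ and g (constructing the actual ψ of the theorem is the hard part):

```python
def sprecher_form(x, psi, g, lam, eps):
    # f(x1..xn) = sum over q = 0..2n of
    #     g( sum over p = 1..n of lam**p * psi(x_p + eps*q)  +  q )
    n = len(x)
    return sum(
        g(sum(lam**p * psi(x[p - 1] + eps * q) for p in range(1, n + 1)) + q)
        for q in range(2 * n + 1))
```

With psi and g both the identity, lam = 1 and eps = 0, `sprecher_form([0.5], ...)` just sums 0.5 + q over q = 0, 1, 2, giving 4.5 — a check of the structure, not a use of the theorem.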

SLIDE 18

Idea of Proof — First use discontinuous functions

τk(x) is the kth decimal place of x, so x = Σ_{k=1}^{∞} τk(x)/10^k (assume no expansion ends 00000 . . ., except 0 itself)

Write ψr(x) = Σ_{k=1}^{∞} τk(x)/10^{kn+r} for r = 0, 1, . . . , n − 1

Now (x1, x2, . . . , xn) → Σ_{r=0}^{n−1} ψr(xr+1) = κ(x1, x2, . . . , xn) is 1–1 and onto [0, 1] but not continuous! Interlacing decimals.

Define g = f ∘ κ⁻¹. Then

f(x1, x2, . . . , xn) = g( Σ_{r=0}^{n−1} ψr(xr+1) )   (9)
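The digit-interlacing map can be demonstrated directly; a sketch truncated to finitely many digits (`interlace` is an illustrative name and uses the natural digit position (k − 1)n + r + 1 rather than the slide's exact convention):

```python
def interlace(xs, digits=8):
    # Weave the decimal digits of x1..xn in [0, 1) into a single number:
    # digit k of coordinate r goes to decimal position (k - 1)*n + r + 1.
    n = len(xs)
    out = 0.0
    for k in range(1, digits + 1):
        for r, x in enumerate(xs):
            d = int(x * 10**k) % 10            # kth decimal digit of x
            out += d / 10**((k - 1) * n + r + 1)
    return out
```

For example interlace([0.12, 0.34], digits=2) weaves the digits 1, 2 and 3, 4 into 0.1324 (up to float rounding). The map is 1-1 given the no-trailing-zeros convention but, as the slide says, not continuous.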
SLIDE 19

How does it work?

Two ideas:

The map (x1, x2, . . . , xn) → Σ_{r=0}^{n−1} ψr(xr+1) is 1–1 — ontoness is not needed — but we will need the ψr continuous

Then use g to "approximate" values of f on the inverse of that map

Key issue: a continuous version of 1–1-ness — one cannot map Iⁿ in a 1–1 continuous way into one dimension

SLIDE 20

Continuous Version

Divide I = [0, 1] into 10 equal intervals and then shrink them slightly from their centres — call these E1(j) (j = 0, 1, . . . , 9)

Repeat this construction 2n + 1 times (n is the number of variables in the function) — call them Ek(j)

Shift the new Ek(j) (k > 1) along so that every x in I appears in all but at most one Ek

[Diagram: the shifted interval families E0, E1, E2, E3, E4]

SLIDE 21

Done in Two Dimensions

Take two copies E(1)k(j) and E(2)k(j) and consider E(1)k(j1) × E(2)k(j2)

For each fixed k we can find increasing continuous functions ψk,1 and ψk,2 on I such that the sums ψk,1(E(1)k(j1)) + ψk,2(E(2)k(j2)) are all disjoint for each fixed k — and in 1 dimension

Note: it is enough to do this for one k and then shift to cover all of the square — the square is covered in 2n + 1 shifts

SLIDE 22

Refine this

Now divide I into 100 equal pieces, shrink slightly (less this time) from the centres to form E2(j)

We can adjust the old ψk,1 and ψk,2 so that in the refined version ψk,1(E(1)k(j1)) + ψk,2(E(2)k(j2)) are all disjoint — moreover, the adjustment needs only to be small because the variation over the Ek(j)s is small!

Keep going . . . We end up with a sequence of compact sets Ek on each axis and functions ψk,i so that (x1, x2) → ψk,1(x1) + ψk,2(x2) is 1–1 on each member of the sequence, and Ek is most of the interval

The union of 5 shifts of the Eks covers I²

SLIDE 23

Approximate

Fix a continuous function f on I²

Approximate it by a function of the form g(ψk,1(x1) + ψk,2(x2)) over most of I². Using shifted forms of the ψs we can cover all of the square I²

Given f continuous on I², there exists g1 continuous on R with ‖g1‖∞ ≤ ‖f‖∞ s.t.

‖ f(x1, x2) − Σ_{k=1}^{5} g1(ψk,1(x1) + ψk,2(x2)) ‖∞ < (1 − ǫ)‖f‖∞

Induct — f1 = f and

fr+1(x1, x2) = fr(x1, x2) − Σ_{k=1}^{5} gr(ψk,1(x1) + ψk,2(x2))

The partial sums of the gr converge to some g and fr → 0 uniformly, so

f(x1, x2) = Σ_{k=1}^{5} g(ψk,1(x1) + ψk,2(x2))
SLIDE 24

But what about differentiable?

f(x1, x2, . . . , xn) = Σ_{0≤q≤2n} g( Σ_{p=1}^{n} ψp,q(xp) )   (∗)

(Hilbert) There is an analytic function of three variables that cannot be expressed as a superposition of analytic functions of 2 variables

(Konrad, 1954) There is a continuously differentiable function of 3 variables that cannot be expressed as a superposition of continuously differentiable functions of 2 variables

(Fridman, 1967) The ψs can be replaced by Lipschitz functions of exponent 1

(Vitushkin, 1964) There exist analytic functions not expressible by (∗) when the ψs are chosen continuously differentiable

SLIDE 25

Neural Networks

A neuron is a node that takes as input a vector (y1, y2, . . . , yM) and outputs the value h( Σ_{m=1}^{M} wm ym − w0 ), where the wm are called weights

(Hecht-Nielsen, 1987) Kolmogorov-Arnold can be seen as a 3-layer neural network

[Diagram: a 3-layer network — an input layer (Inputs #1-#4), one hidden layer, and an output layer with a single output]
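The neuron on this slide is a one-liner; a sketch with illustrative names (h here defaults to tanh, one common choice of activation, which the slide does not specify):

```python
import math

def neuron(y, w, w0, h=math.tanh):
    # Output h( sum_m w_m * y_m - w0 ), as defined on the slide.
    return h(sum(wm * ym for wm, ym in zip(w, y)) - w0)
```

For instance neuron([1.0, 2.0], [0.5, 0.25], 1.0) computes tanh(0.5 + 0.5 − 1.0) = 0.0.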

SLIDE 26

Algorithmic Issues

The functions involved are highly non-smooth and cannot be made smooth

Equality in (∗) is only obtained by letting the iteration go to ∞

SLIDE 27

Making it Computationally Feasible

We can live with ǫ rather than equality, provided we know how many iterations are needed for a given level of accuracy

We can use Lipschitz functions! (Kurkova, "Kolmogorov's Theorem is Relevant", 1991-2)

The number of iterations can be specified in terms of ǫ

SLIDE 28

Making it Computationally Feasible II

(Nakamura, Mines, Kreinovich, ∼ 1995) There is an algorithm U that, for every N ≥ 2, generates an increasing function ψ ∈ Lip(log 2 / log(2N + 2))(I), with the following property:

For all δ > 0, there exist a real number λ > 0 and a rational number ǫ ∈ (0, δ) (both computable from δ), s.t., for 2 ≤ n ≤ N, every continuous function f : Iⁿ → R has a representation

f(x1, x2, . . . , xn) = Σ_{q=0}^{2n} g( Σ_{p=1}^{n} λ^p ψ(x_p + qǫ) + q )

for some continuous function g that is computable from f
SLIDE 29

But ...

This is not how NNs are used: we train to find weights that fit (perhaps approximately) a finite set of input/output data

Data often has uncertainties

Overfitting is a serious problem

Evans and Jones — the γ-test: http://users.cs.cf.ac.uk/O.F.Rana/Antonia.J.Jones/GammaArchive/Theses/DEvansThesis.pdf

SLIDE 30

But continued ...

Continuity is about "robustness" but is not enough — we really want something like Lipschitz continuity for computability

If we only want approximation rather than equality then there are simpler approaches — Projection Pursuit:

f(x1, x2, . . . , xn) ≈ Σ_q gq( Σ_{p=1}^{n} wp xp )

This performs well with noisy high-dimensional data and nonparametric regression techniques

Diaconis and Shahshahani (1984) discuss projection pursuit (as equality)
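The projection-pursuit form is a sum of ridge functions, each seeing only a one-dimensional projection of the input; a structural sketch with illustrative names:

```python
def projection_pursuit(x, directions, ridges):
    # f(x) ~= sum_q g_q( w_q . x ): project x onto each direction w_q,
    # then apply that direction's one-dimensional ridge function g_q.
    return sum(g(sum(wp * xp for wp, xp in zip(w, x)))
               for w, g in zip(directions, ridges))
```

In practice the directions and ridge functions are fitted to the data (e.g. stagewise on residuals) rather than given in advance.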

SLIDE 31

The End

Questions