SLIDE 1
Hilbert's 13th Problem: Great Theorem; Shame about the Algorithm
Bill Moran
SLIDE 2
Structure of Talk: Solving Polynomial Equations; Hilbert's 13th Problem; Kolmogorov-Arnold Theorem; Neural Networks
SLIDE 3
Quadratic Equations
$ax^2 + bx + c = 0$, so $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$

How do we do it? Eliminate the $x$ term by replacing $x$ with $y = x + \frac{b}{2a}$:

$ay^2 + c - \frac{b^2}{4a} = 0$
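As a sanity check, the formula transcribes directly into a few lines of Python (a minimal sketch; cmath keeps the complex case honest):

```python
import cmath

def solve_quadratic(a, b, c):
    """Roots of a*x^2 + b*x + c = 0 (a != 0) via the quadratic formula."""
    disc = cmath.sqrt(b * b - 4 * a * c)   # complex sqrt handles b^2 - 4ac < 0
    return (-b + disc) / (2 * a), (-b - disc) / (2 * a)

print(solve_quadratic(1, -3, 2))   # ((2+0j), (1+0j))
```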
SLIDE 4
What about Cubics?
$ax^3 + bx^2 + cx + d = 0 \quad (1)$

Eliminate the $x^2$ term by replacing $x$ with $y = x + \frac{b}{3a}$:

$y^3 + c'y + d' = 0$

Write $y = u + v$:

$u^3 + v^3 + (3uv + c')(u + v) + d' = 0$

Set $3uv + c' = 0$, i.e. $v = -\frac{c'}{3u}$, so that

$u^3 - \left(\frac{c'}{3u}\right)^3 + d' = 0$

This is a quadratic in $u^3$: solve the quadratic and take cube roots. This gives $u$, then $v$, then $y$ and finally $x$. (del Ferro, Tartaglia, Cardano, c. 1530)
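Cardano's substitution is equally mechanical in code. A minimal sketch for the depressed cubic $y^3 + c'y + d' = 0$ (complex arithmetic throughout, since real roots can arise via complex $u$, the casus irreducibilis):

```python
import cmath

def solve_depressed_cubic(cp, dp):
    """One root of y^3 + cp*y + dp = 0 via y = u + v, 3uv + cp = 0."""
    # u^3 and v^3 are the roots of the quadratic t^2 + dp*t - (cp/3)^3 = 0
    t = (-dp + cmath.sqrt(dp * dp + 4 * (cp / 3) ** 3)) / 2
    if abs(t) < 1e-12:                 # degenerate case u = 0: then y^3 = -dp
        return complex(-dp) ** (1 / 3)
    u = t ** (1 / 3)                   # a (principal) complex cube root
    v = -cp / (3 * u)                  # from the side condition 3uv + cp = 0
    return u + v

print(solve_depressed_cubic(-7, 6))   # ~2, a root of y^3 - 7y + 6 (others: 1, -3)
```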
SLIDE 5
Let’s be a little more adventurous
$ax^4 + bx^3 + cx^2 + dx + e = 0$

A similar trick to the cubic case removes the cubic term:

$y^4 + py^2 + qy + r = 0$

Complete the square:

$\left(y^2 + \frac{p}{2}\right)^2 = \frac{p^2}{4} - qy - r$

Introduce a new variable $z$. Since

$\left(y^2 + \frac{p}{2} + z\right)^2 = \left(y^2 + \frac{p}{2}\right)^2 + 2zy^2 + pz + z^2$

we get

$\left(y^2 + \frac{p}{2} + z\right)^2 = 2zy^2 - qy + \left(z^2 + zp + \frac{p^2}{4} - r\right)$
SLIDE 6
Quartic Continued
Choose $z$ to make the RHS a perfect square in $y$, i.e. set its discriminant to 0:

$q^2 = 8z\left(z^2 + zp + \frac{p^2}{4} - r\right)$

Solve this cubic for $z$; then we have $A^2 = B^2$, where

$A = y^2 + \frac{p}{2} + z \qquad B^2 = 2zy^2 - qy + \left(z^2 + zp + \frac{p^2}{4} - r\right)$

$A = \pm B$ gives two quadratics in $y$.
Lodovico Ferrari, Cardano
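A quick numerical check of Ferrari's construction (a sketch: $p, q, r$ are arbitrary test values, and any root of the resolvent cubic will do):

```python
import numpy as np

p, q, r = 2.0, -3.0, -5.0   # test the depressed quartic y^4 + p y^2 + q y + r = 0

# q^2 = 8z(z^2 + zp + p^2/4 - r), i.e. 8z^3 + 8p z^2 + (2p^2 - 8r) z - q^2 = 0
z = np.roots([8, 8 * p, 2 * p**2 - 8 * r, -q**2])[0]

# With this z, B^2 = 2z y^2 - q y + (...) is a perfect square: B = s*y - q/(2s)
s = np.sqrt(2 * z + 0j)
roots = []
for sign in (+1, -1):       # A = +B and A = -B, two quadratics in y
    roots.extend(np.roots([1, -sign * s, p / 2 + z + sign * q / (2 * s)]))

print(np.sort_complex(np.array(roots)))
print(np.sort_complex(np.roots([1, 0, p, q, r])))   # should agree up to rounding
```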
SLIDE 7
Quintic
$ax^5 + bx^4 + cx^3 + dx^2 + ex + f = 0 \quad (2)$

Tschirnhaus transformations: $y = \frac{g(x)}{h(x)}$, with $g$ and $h$ polynomials and $h$ non-vanishing at the roots of the quintic.

Tschirnhaus transformations can be used to reduce (2) to the Bring-Jerrard form:

$x^5 - x + q = 0 \quad (3)$

where $q$ is some rational function of the coefficients in (2). Solutions of (2) can then be obtained as rational functions of the roots of (3). (Hermite) Elliptic modular functions involving $q$ are used to solve (3).
SLIDE 8
Lest you think this is useless nonsense!
SLIDE 9
Sextic
$ax^6 + bx^5 + cx^4 + dx^3 + ex^2 + fx + g = 0 \quad (4)$

Tschirnhaus transformations give

$x^6 + px^2 + qx + 1 = 0 \quad (5)$

Its solution is φ(p, q). The solution uses derivatives of generalized hypergeometric functions with respect to their parameters, called Kampé de Fériet functions.
SLIDE 10
Septic
$ax^7 + bx^6 + cx^5 + dx^4 + ex^3 + fx^2 + gx + h = 0 \quad (6)$

Tschirnhaus transformations give

$x^7 + px^3 + qx^2 + rx + 1 = 0 \quad (7)$

Its solution is φ(p, q, r). Hilbert: can we express φ(p, q, r) in terms of functions of 2 variables? This is a measure of the complexity of the problem.
SLIDE 11
What this means
A function $f(x_1, x_2, \ldots, x_n)$ of $n$ variables is a superposition of functions $g_k(y_{k,1}, y_{k,2}, \ldots, y_{k,r_k})$ $(k = 1, \ldots, m)$ if each $y_{k,i}$ is one of the variables $x_j$ and there is a function $h$ so that

$f(x_1, x_2, \ldots, x_n) = h\big(g_1(y_{1,1}, \ldots, y_{1,r_1}),\ g_2(y_{2,1}, \ldots, y_{2,r_2}),\ \ldots,\ g_m(y_{m,1}, \ldots, y_{m,r_m})\big)$
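For example (an illustrative case): $f(x_1, x_2, x_3) = x_1 x_2 + \sin x_3$ is a superposition of functions of at most two variables; take $g_1(x_1, x_2) = x_1 x_2$, $g_2(x_3) = \sin x_3$ and $h(u, v) = u + v$, so that $f(x_1, x_2, x_3) = h\big(g_1(x_1, x_2), g_2(x_3)\big)$.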
SLIDE 12
Solutions of Polynomial Equations and Superposition
Every solution of a polynomial equation of degree < 7 can be written as a superposition of functions of ≤ 2 variables. Every solution of a polynomial equation of degree n can be written as a superposition of functions of ≤ n − 4 variables. What about degree 7?
SLIDE 13
Hilbert’s 13th Problem
A solution of the general equation of degree 7 cannot be represented as a superposition of continuous functions of two variables.

What he meant to say was "algebraic" or "analytic" instead of "continuous", as we shall see!
SLIDE 14
Why this might be a useful idea
Most functions we want to compute are composed of functions of at most two variables:

$(x, y) \mapsto x + y,\quad (x, y) \mapsto x \cdot y,\quad x \mapsto \frac{1}{x},\quad (x, y) \mapsto \sqrt[y]{x},\quad x \mapsto e^x,\quad x \mapsto \log x,\quad x \mapsto \sin x,$ etc.

To compute gradients of such functions one can use the chain rule. This approach computes partial derivatives of functions of n variables more efficiently. Kim, Nesterov, and Cherkasskii (1984): given such a computable function of n variables, one can compute the function and its gradient in only 4 times as many operations as the function alone, for large n.
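The mechanism behind such a bound is reverse-mode differentiation: record the one- and two-variable operations used to compute f, then push sensitivities back through them once. A minimal sketch (illustrative code, not the Kim-Nesterov-Cherkasskii construction; the increment-propagating backward pass is fine for small expressions):

```python
import math

class Var:
    """A value plus the local derivatives of the op that produced it."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, o):
        return Var(self.value + o.value, [(self, 1.0), (o, 1.0)])

    def __mul__(self, o):
        return Var(self.value * o.value, [(self, o.value), (o, self.value)])

    def sin(self):
        return Var(math.sin(self.value), [(self, math.cos(self.value))])

def backward(out):
    """One reverse sweep: accumulate d(out)/d(node) by the chain rule."""
    stack = [(out, 1.0)]
    while stack:
        node, g = stack.pop()
        node.grad += g
        for parent, local in node.parents:
            stack.append((parent, local * g))

x, y = Var(2.0), Var(3.0)
f = x * y + x.sin()      # f(x, y) = x*y + sin(x)
backward(f)
print(x.grad, y.grad)    # df/dx = y + cos(x) = 2.5838..., df/dy = x = 2.0
```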
SLIDE 15
Enter Kolmogorov
Every continuous function of n variables on the unit cube is a superposition of continuous functions of 3 variables.
SLIDE 16
Enter Kolmogorov
Every continuous function of n variables on the unit cube is a superposition of continuous functions of 3 variables. And Arnold: every continuous function of n variables on the unit cube is a superposition of continuous functions of 2 variables. (This resolves Hilbert's 13th Problem.)
SLIDE 17
Sprecher’s Version
Sprecher: For each N ≥ 2 there is a Lipschitz function $\psi \in \mathrm{Lip}\left(\frac{\log 2}{\log(2N+2)}\right)(I)$ with the following property: for each δ > 0, there is a rational ε in the interval (0, δ) s.t. for all integers n (2 ≤ n ≤ N), and for every continuous function $f(x_1, x_2, \ldots, x_n)$ on $I^n$,

$f(x_1, x_2, \ldots, x_n) = \sum_{q=0}^{2n} g\Big( \sum_{p=1}^{n} \lambda^p \psi(x_p + \epsilon q) + q \Big) \quad (8)$

where $g$ is continuous and λ > 0 is independent of f.
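The formula has a very concrete computational shape. The theorem is what guarantees that suitable g, ψ, λ, ε exist; the sketch below only shows how the superposition would be evaluated once they are in hand:

```python
def sprecher_eval(g, psi, lam, eps, xs):
    """Evaluate sum_{q=0}^{2n} g( sum_{p=1}^{n} lam**p * psi(x_p + eps*q) + q )."""
    n = len(xs)
    return sum(
        g(sum(lam**p * psi(x + eps * q) for p, x in enumerate(xs, start=1)) + q)
        for q in range(2 * n + 1)
    )

# Structure check with stand-in functions (NOT the theorem's psi and g):
print(sprecher_eval(g=lambda t: t**2, psi=abs, lam=0.5, eps=0.01, xs=[0.2, 0.7]))
```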
SLIDE 18
Idea of Proof — First use discontinuous functions
$\tau_k(x)$ is the kth decimal place of x, so $x = \sum_{k=1}^{\infty} \frac{\tau_k(x)}{10^k}$ (assume no expansion ends 0000…, except 0 itself).

Write $\psi_r(x) = \sum_{k=1}^{\infty} \frac{\tau_k(x)}{10^{kn+r}}$ for r = 0, 1, …, n − 1.

Now $(x_1, x_2, \ldots, x_n) \mapsto \sum_{r=0}^{n-1} \psi_r(x_{r+1}) = \kappa(x_1, x_2, \ldots, x_n)$ is 1-1 and onto [0, 1], but not continuous! (Interlacing decimals.)

Define $g(y) = f\big(\kappa^{-1}(y)\big)$. Then

$f(x_1, x_2, \ldots, x_n) = g\Big( \sum_{r=0}^{n-1} \psi_r(x_{r+1}) \Big) \quad (9)$
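A truncated-digit demo of the interlacing map for n = 2 (a sketch: the real κ uses full decimal expansions, which is exactly where the discontinuity lives; string formatting just dodges floating-point noise):

```python
def digits_of(x, k):
    """First k decimal digits of x in [0, 1), as a string."""
    return f"{x:.{k}f}".split(".")[1]

def kappa(x1, x2, k=6):
    """Interlace the decimal digits of x1 and x2 into one number."""
    inter = "".join(a + b for a, b in zip(digits_of(x1, k), digits_of(x2, k)))
    return float("0." + inter)

def kappa_inv(y, k=6):
    """Undo the interlacing: odd-position digits give x1, even give x2."""
    d = digits_of(y, 2 * k)
    return float("0." + d[0::2]), float("0." + d[1::2])

y = kappa(0.123, 0.456)
print(y)                # 0.142536
print(kappa_inv(y))     # (0.123, 0.456), up to the k-digit truncation
```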
SLIDE 19
How does it work?
Two ideas:
The map $(x_1, x_2, \ldots, x_n) \mapsto \sum_{r=0}^{n-1} \psi_r(x_{r+1})$ is 1-1; ontoness is not needed, but we will need the $\psi_r$ continuous.

Then use g to "approximate" values of f on the inverse of that map.

Key issue: a continuous version of 1-1-ness. One cannot map $I^n$ in a 1-1 continuous way into one dimension.
SLIDE 20
Continuous Version
Divide I = [0, 1] into 10 equal intervals and then shrink them slightly towards their centres; call these $E_1(j)$ (j = 0, 1, …, 9). Repeat this construction 2n + 1 times (n is the number of variables in the function); call the families $E_k(j)$. Shift the new $E_k(j)$ (k > 1) along so that every x in I appears in all but at most one family $E_k$.
[Figure: the 2n + 1 = 5 shifted interval families E0, …, E4]
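The covering property is easy to test numerically. A sketch with illustrative choices (shrink 0.01 and five shifts 0.02 apart; not Kolmogorov's actual constants):

```python
def family(shift, shrink=0.01):
    """Tenths of [0, 1], shrunk by `shrink` at each end, then shifted."""
    return [(j / 10 + shrink + shift, (j + 1) / 10 - shrink + shift)
            for j in range(-1, 10)]          # j = -1 handles the wrapped edge

def missed(x, shift, tol=1e-12):
    return not any(a - tol <= x <= b + tol for a, b in family(shift))

shifts = [0.02 * k for k in range(5)]        # 2n + 1 = 5 families for n = 2
worst = max(sum(missed(i / 1000, s) for s in shifts) for i in range(1000))
print(worst)   # 1: every x lies in all but at most one family
```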
SLIDE 21
Done in Two Dimensions
Take two copies $E^{(i)}_k(j)$ and consider the products $E^{(1)}_k(j_1) \times E^{(2)}_k(j_2)$.

For each fixed k we can find increasing continuous functions $\psi_{k,1}$ and $\psi_{k,2}$ on I such that the sets $\psi_{k,1}\big(E^{(1)}_k(j_1)\big) + \psi_{k,2}\big(E^{(2)}_k(j_2)\big)$ are all disjoint for that fixed k, and live in one dimension.

Note: it is enough to do this for one k and then shift to cover all of the square; the square is covered in 2n + 1 shifts.
SLIDE 22
Refine this
Now divide I into 100 equal pieces and shrink them slightly (less this time) from their centres to form the second-stage sets $E^2(j)$. We can adjust the old $\psi_{k,1}$ and $\psi_{k,2}$ so that in the refined version the sets $\psi_{k,1}\big(E^{(1)}_k(j_1)\big) + \psi_{k,2}\big(E^{(2)}_k(j_2)\big)$ are again all disjoint; moreover, the adjustment needs only to be small, because the variation over the $E_k(j)$s is small!

Keep going … We end up with a sequence of compact sets $E_k$ on each axis and functions $\psi_{k,i}$ such that $(x_1, x_2) \mapsto \psi_{k,1}(x_1) + \psi_{k,2}(x_2)$ is 1-1 on each member of the sequence, and each $E_k$ is most of the interval. The union of the 5 shifts of the $E_k$s covers $I^2$.
SLIDE 23
Approximate
Fix a continuous function f on $I^2$. Approximate it by a function of the form $g\big(\psi_{k,1}(x_1) + \psi_{k,2}(x_2)\big)$ over most of $I^2$. Using the shifted forms of the ψs we can cover all of the square $I^2$.

Given f continuous on $I^2$, there exists $g_1$ continuous on $\mathbb{R}$ with $\|g_1\|_\infty \le \|f\|_\infty$ s.t.

$\Big\| f(x_1, x_2) - \sum_{k=1}^{5} g_1\big(\psi_{k,1}(x_1) + \psi_{k,2}(x_2)\big) \Big\|_\infty < (1 - \epsilon)\,\|f\|_\infty$

Induct: $f_1 = f$ and

$f_{r+1}(x_1, x_2) = f_r(x_1, x_2) - \sum_{k=1}^{5} g_r\big(\psi_{k,1}(x_1) + \psi_{k,2}(x_2)\big)$

The partial sums $\sum_r g_r$ converge uniformly to some $g$, and $f_r \to 0$ uniformly, so

$f(x_1, x_2) = \sum_{k=1}^{5} g\big(\psi_{k,1}(x_1) + \psi_{k,2}(x_2)\big)$
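Why the induction converges (the geometric decay implicit in the displayed bound):

$\|f_{r+1}\|_\infty < (1-\epsilon)\|f_r\|_\infty \le \cdots < (1-\epsilon)^r \|f\|_\infty \to 0, \qquad \|g_r\|_\infty \le \|f_r\|_\infty,$

so $\sum_r g_r$ converges uniformly by comparison with a geometric series, and its sum is the g above.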
SLIDE 24
But what about differentiable?
$f(x_1, x_2, \ldots, x_n) = \sum_{q=0}^{2n} g\Big( \sum_{p=1}^{n} \psi_{p,q}(x_p) \Big) \quad (*)$

(Hilbert) There is an analytic function of three variables that cannot be expressed as a superposition of analytic functions of 2 variables.

(Konrad, 1954) There is a continuously differentiable function of 3 variables that cannot be expressed as a superposition of continuously differentiable functions of 2 variables.

(Fridman, 1967) One can replace the ψs by Lipschitz functions of exponent 1.

(Vitushkin, 1964) There exist analytic functions not expressible by (*) when the ψs are chosen continuously differentiable.
SLIDE 25
Neural Networks
A neuron is a node that takes as input a vector $(y_1, y_2, \ldots, y_M)$ and outputs the value $h\Big(\sum_{m=1}^{M} w_m y_m - w_0\Big)$, where the $w_m$ are called weights.

(Hecht-Nielsen, 1987) The Kolmogorov-Arnold theorem can be seen as a 3-layer neural network.
[Figure: a feed-forward network with four inputs, one hidden layer, and one output]
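A minimal sketch of this reading (illustrative choices: tanh activation and random weights; Hecht-Nielsen's correspondence puts the Kolmogorov inner functions in the hidden layer and g in the output layer, with hidden width 2n + 1):

```python
import numpy as np

def neuron(y, w, w0, h=np.tanh):
    """One node: h(sum_m w_m * y_m - w0). tanh is an illustrative choice."""
    return h(np.dot(w, y) - w0)

def three_layer_net(x, W1, b1, w2, b2):
    """Input layer -> hidden layer of neurons -> output neuron: the shape
    Hecht-Nielsen reads off the Kolmogorov-Arnold representation."""
    hidden = np.array([neuron(x, w, w0) for w, w0 in zip(W1, b1)])
    return neuron(hidden, w2, b2, h=lambda t: t)   # linear output node

rng = np.random.default_rng(0)
n = 4                                              # four inputs, as in the figure
W1, b1 = rng.normal(size=(2 * n + 1, n)), rng.normal(size=2 * n + 1)
w2, b2 = rng.normal(size=2 * n + 1), rng.normal()
print(three_layer_net(rng.random(n), W1, b1, w2, b2))
```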
SLIDE 26
Algorithmic Issues
The functions involved are highly non-smooth and cannot be made smooth. We only get equality in (∗) by letting the iteration go to ∞.
SLIDE 27
Making it Computationally Feasible
We can live with ε rather than equality, provided we know how many iterations are needed for a given level of accuracy. We can use Lipschitz functions! (Kůrková, "Kolmogorov's Theorem is Relevant", 1991-2) The number of iterations can be specified in terms of ε.
SLIDE 28
Making it Computationally Feasible II
(Nakamura, Mines, Kreinovich, ~1995) There is an algorithm U that, for every N ≥ 2, generates an increasing function $\psi \in \mathrm{Lip}\left(\frac{\log 2}{\log(2N+2)}\right)(I)$ with the following property:

For all δ > 0, there exist a real number λ > 0 and a rational number ε ∈ (0, δ) (both computable from δ), s.t., for 2 ≤ n ≤ N, every continuous function $f : I^n \to \mathbb{R}$ has a representation as

$f(x_1, x_2, \ldots, x_n) = \sum_{q=0}^{2n} g\Big( \sum_{p=1}^{n} \lambda^p \psi(x_p + q\epsilon) + q \Big)$

for some continuous function g that is computable from f.
SLIDE 29
But ...
Not how NNs are used: they are trained to find weights that fit (perhaps approximately) a finite set of input/output data.

Data often has uncertainties. Overfitting is a serious problem.

Evans and Jones: the Gamma test. http://users.cs.cf.ac.uk/O.F.Rana/Antonia.J.Jones/GammaArchive/Theses/DEvansThesis.pdf
SLIDE 30
But continued ...
Continuity is about "robustness" but is not enough; we really want something like Lipschitz continuity for computability.

If we only want approximation rather than equality, then simpler approaches work, e.g. Projection Pursuit:

$f(x_1, x_2, \ldots, x_n) \approx \sum_{q} g_q\Big( \sum_{p=1}^{n} w_{q,p}\, x_p \Big)$

This performs well with noisy high-dimensional data and nonparametric regression techniques. Diaconis and Shahshahani (1984) discuss projection pursuit (as an equality).
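A crude but runnable sketch of the idea (illustrative choices throughout: random candidate directions and binned-mean ridge functions stand in for proper optimization and smoothing):

```python
import numpy as np

def fit_ridge_term(X, r, n_dirs=200, bins=20, rng=None):
    """Pick, from random candidates, the direction w whose binned-mean ridge
    function best fits the residual r; return (w, bin_centers, bin_means)."""
    rng = rng or np.random.default_rng(0)
    best = None
    for _ in range(n_dirs):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        t = X @ w
        edges = np.linspace(t.min(), t.max(), bins + 1)
        idx = np.clip(np.digitize(t, edges) - 1, 0, bins - 1)
        means = np.array([r[idx == b].mean() if (idx == b).any() else 0.0
                          for b in range(bins)])
        sse = np.sum((r - means[idx]) ** 2)
        if best is None or sse < best[0]:
            best = (sse, w, (edges[:-1] + edges[1:]) / 2, means)
    return best[1:]

def projection_pursuit(X, y, n_terms=5):
    """Fit y ~ sum_q g_q(w_q . x) greedily on the residuals."""
    terms, r = [], y.astype(float).copy()
    for _ in range(n_terms):
        w, centers, means = fit_ridge_term(X, r)
        g = lambda t, c=centers, m=means: np.interp(t, c, m)
        r = r - g(X @ w)
        terms.append((w, g))
    return lambda Xnew: sum(g(Xnew @ w) for w, g in terms)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 3))
y = np.sin(X @ np.array([1.0, -2.0, 0.5])) + 0.05 * rng.normal(size=500)
model = projection_pursuit(X, y)
print(np.mean((model(X) - y) ** 2))   # small training error on this toy target
```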