SLIDE 1
Discrete Geometry meets Machine Learning
Amitabh Basu, Johns Hopkins University
22nd Combinatorial Optimization Workshop at Aussois, January 11, 2018
Joint work with Anirbit Mukherjee, Raman Arora, Poorya Mianjy
SLIDE 2
Two Problems in Discrete Geometry
SLIDE 3
Two Problems in Discrete Geometry
Problem 1: Given two polytopes P and Q, do there exist simplices A1, …, Ap and B1, …, Bq such that P + A1 + … + Ap = Q + B1 + … + Bq?
Problem 2: For a natural number k, define a k-zonotope as a Minkowski sum of finitely many polytopes, each the convex hull of k points [2-zonotope = regular zonotope]. Given two 2^n-zonotopes P and Q, do there exist two 2^(n+1)-zonotopes A and B such that conv(P ∪ Q) + A = B?
SLIDE 4
What is a Deep Neural Network (DNN) ?
SLIDE 5
What is a Deep Neural Network (DNN) ?
- Directed Acyclic Graph (Network Architecture)
SLIDE 6
What is a Deep Neural Network (DNN) ?
- Directed Acyclic Graph (Network Architecture)
- Weights on every edge and
every vertex
SLIDE 7
What is a Deep Neural Network (DNN) ?
- Directed Acyclic Graph (Network Architecture)
- Weights on every edge and
every vertex
[Figure: network with example weights on its edges and vertices]
SLIDE 8
What is a Deep Neural Network (DNN) ?
- Directed Acyclic Graph (Network Architecture)
- Weights on every edge and
every vertex
- Activation function f : R -> R at each vertex
Examples: f(x) = max{0, x} (Rectified Linear Unit, ReLU); f(x) = e^x / (1 + e^x) (Sigmoid)
SLIDE 9
What is a Deep Neural Network (DNN) ?
- Directed Acyclic Graph (Network Architecture)
- Weights on every edge and
every vertex
- Activation function f : R -> R at each vertex
Examples: f(x) = max{0, x} (Rectified Linear Unit, ReLU); f(x) = e^x / (1 + e^x) (Sigmoid)
- Sources = input, Sinks = output
SLIDE 10
What is a Deep Neural Network (DNN) ?
- Directed Acyclic Graph (Network Architecture)
- Weights on every edge and
every vertex
- Activation function f : R -> R at each vertex
Examples: f(x) = max{0, x} (Rectified Linear Unit, ReLU); f(x) = e^x / (1 + e^x) (Sigmoid)
- Sources = input, Sinks = output
[Figure: network with example weights on its edges and vertices; inputs x1, x2, x3; outputs y1, y2]
SLIDE 11
What is a Deep Neural Network (DNN) ?
- Directed Acyclic Graph (Network Architecture)
- Weights on every edge and every vertex
- Activation function f : R -> R at each vertex
Examples: f(x) = max{0, x} (Rectified Linear Unit, ReLU); f(x) = e^x / (1 + e^x) (Sigmoid)
- Sources = input, Sinks = output
[Figure: a single node with incoming values u1, …, uk, edge weights a1, …, ak, and vertex weight (bias) b]
Output of the node = f(a1 u1 + a2 u2 + … + ak uk + b)
SLIDE 12
What is a Deep Neural Network (DNN) ?
- Directed Acyclic Graph (Network Architecture)
- Weights on every edge and every vertex
- Activation function f : R -> R at each vertex
Examples: f(x) = max{0, x} (Rectified Linear Unit, ReLU); f(x) = e^x / (1 + e^x) (Sigmoid)
- Sources = input, Sinks = output
[Figure: the same node with the ReLU activation]
Output of the node = max{0, a1 u1 + a2 u2 + … + ak uk + b}
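To make the node computation concrete, here is a minimal numpy sketch (my own illustration, not part of the talk; all weights are made-up examples) of a single node and of a full layered forward pass:

```python
# Minimal sketch (illustration only): evaluating a small feedforward ReLU network.
# A node with incoming values u_1..u_k, edge weights a_1..a_k, and vertex weight b
# outputs f(a_1 u_1 + ... + a_k u_k + b), exactly as on the slide.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def node(u, a, b, f=relu):
    """One node: weighted sum of incoming values u with weights a, plus bias b, then activation f."""
    return f(np.dot(a, u) + b)

def forward(x, layers):
    """Evaluate a layered network; `layers` is a list of (W, b) pairs with made-up example weights."""
    h = np.asarray(x, dtype=float)
    for W, b in layers[:-1]:
        h = relu(W @ h + b)   # hidden layers apply the ReLU activation
    W, b = layers[-1]
    return W @ h + b          # output nodes taken to be affine (a common convention)

# Hypothetical example: 3 inputs -> 2 hidden ReLU nodes -> 2 outputs.
layers = [(np.array([[2.0, 1.65, -6.8], [3.0, -1.0, 0.53]]), np.array([0.0, 2.45])),
          (np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, 0.0]))]
print(forward([1.0, 0.0, -1.0], layers))
```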
SLIDE 13
Problems of interest for DNNs
- Expressiveness: What family of functions can one represent using DNNs?
- Efficiency: How many layers (depth) and vertices (size) are needed to represent functions in the family?
- Training the network: Given the architecture and data points (x, y), find weights for the "best fit" function.
- Generalization error: Rademacher complexity, VC dimension
SLIDE 14
Problems of interest for DNNs
- Expressiveness: What family of functions can one represent using DNNs?
- Training the network: Given the architecture and data points (x, y), find weights for the "best fit" function.
- Generalization error: Rademacher complexity, VC dimension
- Efficiency: How many layers (depth) and vertices (size) are needed to represent functions in the family?
SLIDE 15
Calculus of DNN functions
SLIDE 16
Calculus of DNN functions
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 + f2 in DNN(max{k1,k2}, s1+s2)
SLIDE 17
Calculus of DNN functions
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 + f2 in DNN(max{k1,k2}, s1+s2)
[Figure: networks for f1 and f2 run in parallel on the input x; their outputs are summed to give y = f1 + f2]
SLIDE 18
Calculus of DNN functions
- f in DNN(k,s), c in R => cf in DNN(k,s)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 + f2 in DNN(max{k1,k2}, s1+s2)
SLIDE 19
Calculus of DNN functions
- f in DNN(k,s), c in R => cf in DNN(k,s)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 + f2 in DNN(max{k1,k2}, s1+s2)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 o f2 in DNN(k1+k2, s1+s2)
SLIDE 20
Calculus of DNN functions
- f in DNN(k,s), c in R => cf in DNN(k,s)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 + f2 in DNN(max{k1,k2}, s1+s2)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 o f2 in DNN(k1+k2, s1+s2)
[Figure: the network for f2 feeds into the network for f1, computing f1 o f2]
SLIDE 21
Calculus of DNN functions
- f in DNN(k,s), c in R => cf in DNN(k,s)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 + f2 in DNN(max{k1,k2}, s1+s2)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 o f2 in DNN(k1+k2, s1+s2)
- f1 in ReLU-DNN(k1,s1), f2 in ReLU-DNN(k2,s2) =>
max{f1 , f2} in ReLU-DNN(max{k1,k2}+1, s1+s2+4)
SLIDE 22
Calculus of DNN functions
- f in DNN(k,s), c in R => cf in DNN(k,s)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 + f2 in DNN(max{k1,k2}, s1+s2)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 o f2 in DNN(k1+k2, s1+s2)
F : R^n -> R^2, F(x) = (f1(x), f2(x));  G : R^2 -> R, G(z1, z2) = max{z1, z2};  max{f1, f2} = G o F
- f1 in ReLU-DNN(k1,s1), f2 in ReLU-DNN(k2,s2) =>
max{f1 , f2} in ReLU-DNN(max{k1,k2}+1, s1+s2+4)
SLIDE 23
Calculus of DNN functions
- f in DNN(k,s), c in R => cf in DNN(k,s)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 + f2 in DNN(max{k1,k2}, s1+s2)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 o f2 in DNN(k1+k2, s1+s2)
G : R^2 -> R, G(z1, z2) = max{z1, z2};  max{z1, z2} = (z1 + z2)/2 + |z1 - z2|/2
- f1 in ReLU-DNN(k1,s1), f2 in ReLU-DNN(k2,s2) =>
max{f1 , f2} in ReLU-DNN(max{k1,k2}+1, s1+s2+4)
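A quick numerical check (my own sketch, not from the slides) of this identity and of the four-ReLU gadget it gives: the identity map is t = relu(t) - relu(-t) and the absolute value is |t| = relu(t) + relu(-t), so max{z1, z2} costs four extra gates, matching the s1 + s2 + 4 size bound above.

```python
# Sketch: max{z1, z2} = (z1 + z2)/2 + |z1 - z2|/2 built from four ReLU gates.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def max_via_relu(z1, z2):
    s = z1 + z2                      # the "sum" input
    d = z1 - z2                      # the "difference" input
    return 0.5 * (relu(s) - relu(-s)) + 0.5 * (relu(d) + relu(-d))

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 2))
assert np.allclose(max_via_relu(z[:, 0], z[:, 1]), np.max(z, axis=1))
print("four-ReLU max gadget agrees with np.max on random inputs")
```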
SLIDE 24
Calculus of DNN functions
- f in DNN(k,s), c in R => cf in DNN(k,s)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 + f2 in DNN(max{k1,k2}, s1+s2)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 o f2 in DNN(k1+k2, s1+s2)
[Figure: a four-ReLU network on inputs x1, x2 computing (x1 + x2)/2 + |x1 - x2|/2 = max{x1, x2}, with edge weights ±1 into the ReLU gates and ±1/2 out of them]
- f1 in ReLU-DNN(k1,s1), f2 in ReLU-DNN(k2,s2) =>
max{f1 , f2} in ReLU-DNN(max{k1,k2}+1, s1+s2+4)
SLIDE 25
Calculus of DNN functions
- f in DNN(k,s), c in R => cf in DNN(k,s)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 + f2 in DNN(max{k1,k2}, s1+s2)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 o f2 in DNN(k1+k2, s1+s2)
- f1 in ReLU-DNN(k1,s1), f2 in ReLU-DNN(k2,s2) =>
max{f1 , f2} in ReLU-DNN(max{k1,k2}+1, s1+s2+4)
- Affine functions can be implemented in ReLU-DNN(1,2n)
SLIDE 26
Problems of interest for DNNs
- Expressiveness: What family of functions can one represent using DNNs?
- Training the network: Given the architecture and data points (x, y), find weights for the "best fit" function.
- Generalization error: Rademacher complexity, VC dimension
- Efficiency: How many layers (depth) and vertices (size) are needed to represent functions in the family?
SLIDE 27
Expressiveness of ReLU DNNs
Theorem (Arora, Basu, Mianjy, Mukherjee 2016): Any ReLU DNN with n inputs implements a continuous piecewise affine function on R^n. Conversely, any continuous piecewise affine function on R^n can be implemented by some ReLU DNN. Moreover, at most ⌈log_2(n+1)⌉ hidden layers are needed.
SLIDE 28
Expressiveness of ReLU DNNs
Theorem (Arora, Basu, Mianjy, Mukherjee 2016): Any ReLU DNN with n inputs implements a continuous piecewise affine function on R^n. Conversely, any continuous piecewise affine function on R^n can be implemented by some ReLU DNN. Moreover, at most ⌈log_2(n+1)⌉ hidden layers are needed.
Proof: A result from tropical geometry [Ovchinnikov 2002] says any continuous piecewise affine function can be written as max_{i=1,…,k} min_{j in S_i} l_j for suitable affine functions l_j and index sets S_i.
SLIDE 29
SLIDE 30
SLIDE 31
SLIDE 32
Expressiveness of ReLU DNNs
Theorem (Arora, Basu, Mianjy, Mukherjee 2016): Any ReLU DNN with n inputs implements a continuous piecewise affine function on R^n. Conversely, any continuous piecewise affine function on R^n can be implemented by some ReLU DNN. Moreover, at most ⌈log_2(n+1)⌉ hidden layers are needed.
Proof (Take 2): Any continuous PWL function can be written as a difference of two convex PWL functions:
max{l^1_1, l^1_2, …, l^1_m} - max{l^2_1, l^2_2, …, l^2_s}
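A tiny one-dimensional instance of this decomposition (my own example, not from the slides): a non-convex tent-shaped function written as a difference of two convex max-of-affine functions.

```python
# Sketch: h(x) = max{0, x} - max{0, 2x - 1} equals 0 for x <= 0, x on [0, 1/2],
# and 1 - x for x >= 1/2. h is not convex, yet each term is a convex PWL function
# (a maximum of two affine functions), illustrating the difference-of-convex form.
import numpy as np

def h_direct(x):
    return np.where(x <= 0, 0.0, np.where(x <= 0.5, x, 1.0 - x))

def h_as_difference(x):
    return np.maximum(0.0, x) - np.maximum(0.0, 2.0 * x - 1.0)

xs = np.linspace(-2, 2, 401)
assert np.allclose(h_direct(xs), h_as_difference(xs))
print("difference-of-convex decomposition verified on a grid")
```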
SLIDE 33
Expressiveness of ReLU DNNs
Proof (Take 2): Any continuous PWL function can be written as a difference of two convex PWL functions:
max{l^1_1, l^1_2, …, l^1_m} - max{l^2_1, l^2_2, …, l^2_s}
Want to show: this can be rewritten as
max{a^1_1, a^1_2, …, a^1_{n+1}} + … + max{a^q_1, a^q_2, …, a^q_{n+1}} - max{c^1_1, c^1_2, …, c^1_{n+1}} - … - max{c^p_1, c^p_2, …, c^p_{n+1}}
SLIDE 34
Expressiveness of ReLU DNNs
Proof (Take 2): Any continuous PWL function can be written as a difference of two convex PWL functions:
max{l^1_1, l^1_2, …, l^1_m} - max{l^2_1, l^2_2, …, l^2_s}
Want to show: this can be rewritten as
max{a^1_1, a^1_2, …, a^1_{n+1}} + … + max{a^q_1, a^q_2, …, a^q_{n+1}} - max{c^1_1, c^1_2, …, c^1_{n+1}} - … - max{c^p_1, c^p_2, …, c^p_{n+1}}
We can assume without loss of generality that the l^i_j are linear.
SLIDE 35
Expressiveness of ReLU DNNs
Proof (Take 2): Any continuous PWL function can be written as a difference of two convex PWL functions:
max{l^1_1, l^1_2, …, l^1_m} - max{l^2_1, l^2_2, …, l^2_s}
Want to show: this can be rewritten as
max{a^1_1, a^1_2, …, a^1_{n+1}} + … + max{a^q_1, a^q_2, …, a^q_{n+1}} - max{c^1_1, c^1_2, …, c^1_{n+1}} - … - max{c^p_1, c^p_2, …, c^p_{n+1}}
We can assume without loss of generality that the l^i_j are linear.
max{l^1_1, l^1_2, …, l^1_m} = max{<a^1_1, x>, <a^1_2, x>, …, <a^1_m, x>} = support function of conv({a^1_1, a^1_2, …, a^1_m})
SLIDE 36
Expressiveness of ReLU DNNs
Proof (Take 2): Any continuous PWL function can be written as a difference of two convex PWL functions:
max{l^1_1, l^1_2, …, l^1_m} - max{l^2_1, l^2_2, …, l^2_s}
Want to show: this can be rewritten as
max{a^1_1, a^1_2, …, a^1_{n+1}} + … + max{a^q_1, a^q_2, …, a^q_{n+1}} - max{c^1_1, c^1_2, …, c^1_{n+1}} - … - max{c^p_1, c^p_2, …, c^p_{n+1}}
Equivalent formulation [h_K denotes the support function of K]:
h_P - h_Q = h_{B_1} + … + h_{B_q} - h_{A_1} - … - h_{A_p}
SLIDE 37
Expressiveness of ReLU DNNs
Proof (Take 2): Equivalent formulation [h_K denotes the support function of K]:
h_P - h_Q = h_{B_1} + … + h_{B_q} - h_{A_1} - … - h_{A_p}
iff h_P + h_{A_1} + … + h_{A_p} = h_Q + h_{B_1} + … + h_{B_q}
iff P + A_1 + … + A_p = Q + B_1 + … + B_q
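The step from max-of-linear expressions to Minkowski sums rests on two standard facts: for a polytope K = conv(V), the support function h_K(x) = max over v in V of <v, x> is precisely a maximum of linear functions, and support functions add under Minkowski sums. A small numerical check of the second fact (my own sketch, with made-up example polytopes):

```python
# Sketch: h_{P+Q} = h_P + h_Q for polytopes given by (possibly redundant) point lists.
import numpy as np
from itertools import product

def support(V, x):
    """Support function of conv(V) at direction x, where V is an (m, d) array of points."""
    return np.max(V @ x)

P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # a triangle (2-simplex)
Q = np.array([[0.0, 0.0], [2.0, 1.0]])                # a segment (1-simplex)
PQ = np.array([p + q for p, q in product(P, Q)])      # points whose convex hull is P + Q

rng = np.random.default_rng(1)
for _ in range(100):
    x = rng.normal(size=2)
    assert np.isclose(support(PQ, x), support(P, x) + support(Q, x))
print("h_{P+Q} = h_P + h_Q verified on random directions")
```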
SLIDE 38
Expressiveness of ReLU DNNs
Proof (Take 2): A result from the circuits literature [Wang and Sun 2006] says any continuous piecewise affine function on R^n can be written as
c_1 max{l^1_1, l^1_2, …, l^1_{n+1}} + … + c_k max{l^k_1, l^k_2, …, l^k_{n+1}}
SLIDE 39
Expressiveness of ReLU DNNs
Theorem (Arora, Basu, Mianjy, Mukherjee 2016): Any ReLU DNN with n inputs implements a continuous piecewise affine function on R^n. Conversely, any continuous piecewise affine function on R^n can be implemented by some ReLU DNN. Moreover, at most ⌈log_2(n+1)⌉ hidden layers are needed.
SLIDE 40
Expressiveness of ReLU DNNs
Theorem (Arora, Basu, Mianjy, Mukherjee 2016): Any ReLU DNN with n inputs implements a continuous piecewise affine function on R^n. Conversely, any continuous piecewise affine function on R^n can be implemented by some ReLU DNN. Moreover, at most ⌈log_2(n+1)⌉ hidden layers are needed.
ReLU-DNN(1, *) ⊆ ReLU-DNN(2, *) ⊆ ReLU-DNN(3, *) ⊆ … ⊆ ReLU-DNN(⌈log_2(n+1)⌉, *)
Open Question: are these inclusions strict?
SLIDE 41
Expressiveness of ReLU DNNs
Open Question: ReLU(1, *) ⊊ ReLU(2, *) ⊊ ReLU(3, *) ⊊ … ?
Proof Strategy/Plan:
Claim 1: Let k, d be natural numbers such that 2^k <= d. Any function in ReLU(k, *) on R^d is a linear combination of 2^k-max functions.
SLIDE 42
Expressiveness of ReLU DNNs
Open Question: ReLU(1, *) ⊊ ReLU(2, *) ⊊ ReLU(3, *) ⊊ … ?
Proof Strategy/Plan:
Definition: Let n, d be natural numbers. The set of functions on R^d that can be written as linear combinations of n-max functions will be denoted by (d,n)-HH.
Claim 1: Let k, d be natural numbers such that 2^k <= d. Any function in ReLU(k, *) on R^d is a linear combination of 2^k-max functions.
SLIDE 43
Expressiveness of ReLU DNNs
Open Question: ReLU(1, *) ⊊ ReLU(2, *) ⊊ ReLU(3, *) ⊊ … ?
Proof Strategy/Plan:
Definition: Let n, d be natural numbers. The set of functions on R^d that can be written as linear combinations of n-max functions will be denoted by (d,n)-HH.
Claim 1: Let k, d be natural numbers such that 2^k <= d. Any function in ReLU(k, *) on R^d is a linear combination of 2^k-max functions.
Claim 2: Let n, d be natural numbers such that n <= d+1. Then (d,n)-HH ⊊ (d,n+1)-HH.
SLIDE 44
Expressiveness of ReLU DNNs
Claim 1: Any function in ReLU(k, *) is a linear combination of 2^k-max functions.
Equivalent to Problem 2: For a natural number k, define a k-zonotope as a Minkowski sum of finitely many k-simplices [1-zonotope = regular zonotope]. Given two 2^n-zonotopes P and Q, do there exist two 2^(n+1)-zonotopes A and B such that conv(P ∪ Q) + A = B?
SLIDE 45
Problems of interest for DNNs
- Expressiveness: What family of functions can one represent using DNNs?
- Training the network: Given the architecture and data points (x, y), find weights for the "best fit" function.
- Generalization error: Rademacher complexity, VC dimension
- Efficiency: How many layers (depth) and vertices (size) are needed to represent functions in the family?
SLIDE 46
Depth v/s size tradeoffs for ReLU DNNs Theorem (Arora, Basu, Mianjy, Mukherjee 2016): For every natural number N, there exists a family of R -> R functions such that for any function f in this family, we have:
- 1. f is in ReLU-DNN(N^2, N^3).
- 2. f is NOT in ReLU-DNN(N, (1/2)N^N - 1).
Moreover, this family is in one-to-one correspondence with the torus in N dimensions. Remark: More general versions, Approximation versions. n>=2 version using zonotopal norms.
SLIDE 47
Depth v/s size tradeoffs for ReLU DNNs Theorem (Arora, Basu, Mianjy, Mukherjee 2016): For every natural number N, there exists a family of R -> R functions such that for any function f in this family, we have:
- 1. f is in ReLU-DNN(N^2, N^3).
- 2. f is NOT in ReLU-DNN(N, (1/2)N^N - 1).
Moreover, this family is in one-to-one correspondence with the torus in N dimensions. Remark: More general versions, Approximation versions. n>=2 version using zonotopal norms. Fact: Any R -> R PWL function with p pieces is in ReLU-DNN(1,p+1)
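This fact comes from the usual breakpoint construction: write the function as its leftmost affine piece plus one ReLU term per breakpoint, with the slope change at that breakpoint as the coefficient (two more gates turn the leading affine piece itself into ReLUs, giving p + 1 gates in all). A rough numpy sketch of the construction (my own; the example function is made up):

```python
# Sketch of the breakpoint construction behind "p pieces => ReLU-DNN(1, p+1)":
#   g(x) = g(b_1) + m_0 (x - b_1) + sum_i (m_i - m_{i-1}) * relu(x - b_i)
# for breakpoints b_1 < ... < b_{p-1} and slopes m_0, ..., m_{p-1}.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def pwl_from_breakpoints(breaks, slopes, value_at_first_break):
    """Callable for the continuous PWL function with the given breakpoints and slopes."""
    breaks = np.asarray(breaks, dtype=float)
    slopes = np.asarray(slopes, dtype=float)
    slope_changes = np.diff(slopes)          # one coefficient per ReLU gate

    def g(x):
        x = np.asarray(x, dtype=float)
        affine = value_at_first_break + slopes[0] * (x - breaks[0])
        bumps = sum(c * relu(x - b) for c, b in zip(slope_changes, breaks))
        return affine + bumps

    return g

# Made-up example: the tent with breakpoints 0 and 1/2 and slopes 0, 1, -1 (three pieces).
g = pwl_from_breakpoints(breaks=[0.0, 0.5], slopes=[0.0, 1.0, -1.0], value_at_first_break=0.0)
print(g(np.array([-1.0, 0.25, 0.5, 1.0])))   # expect [0, 0.25, 0.5, 0]
```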
SLIDE 48
Depth v/s size tradeoffs for ReLU DNNs Theorem (Arora, Basu, Mianjy, Mukherjee 2016): For every natural number N, there exists a family of R -> R functions such that for any function f in this family, we have:
- 1. f is in ReLU-DNN(N^2, N^3).
- 2. f is NOT in ReLU-DNN(N, (1/2)N^N - 1).
Moreover, this family is in one-to-one correspondence with the torus in N dimensions. Remark: More general versions, Approximation versions. n>=2 version using zonotopal norms. Fact: Any R -> R PWL function with p pieces is in ReLU-DNN(1,p+1)
SLIDE 49
Calculus of DNN functions
- f in DNN(k,s), c in R => cf in DNN(k,s)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 + f2 in DNN(max{k1,k2}, s1+s2)
- f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 o f2 in DNN(k1+k2, s1+s2)
[Figure: the network for f2 feeds into the network for f1, computing f1 o f2]
SLIDE 50
Depth v/s size tradeoffs for ReLU DNNs Theorem (Arora, Basu, Mianjy, Mukherjee 2016): For every natural number N, there exists a family of R -> R functions such that for any function f in this family, we have:
- 1. f is in ReLU-DNN(N^2, N^3).
- 2. f is NOT in ReLU-DNN(N, (1/2)N^N - 1).
Moreover, this family is in one-to-one correspondence with the torus in N dimensions. Remark: More general versions, Approximation versions. n>=2 version using zonotopal norms. Fact: Any R -> R PWL function with p pieces is in ReLU-DNN(1,p+1)
SLIDE 51
Depth v/s size tradeoffs for ReLU DNNs Theorem (Arora, Basu, Mianjy, Mukherjee 2016): For every natural number N, there exists a family of R -> R functions such that for any function f in this family, we have:
- 1. f is in ReLU-DNN(N^2, N^3).
- 2. f is NOT in ReLU-DNN(N, (1/2)N^N - 1).
Moreover, this family is in one-to-one correspondence with the torus in N dimensions. Remark: More general versions, Approximation versions. n>=2 version using zonotopal norms. Fact: Any R -> R function in ReLU(k, w) has at most O(w^k) pieces
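The counting behind the tradeoff can be seen by composing a fixed two-gate "tent" map with itself: every extra layer multiplies the number of linear pieces, while extra width in a single layer only adds pieces. A small numerical illustration in the spirit of this fact (my own sketch; the piece counter is a crude numerical estimate):

```python
# Sketch: iterating a 2-piece ReLU "tent" map doubles the number of linear pieces per layer,
# so k layers of constant width already produce exponentially many pieces.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # Tent map on [0, 1] via two ReLU gates: 2x on [0, 1/2] and 2 - 2x on [1/2, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_pieces(f, grid):
    """Crude piece count: number of distinct consecutive slopes along a fine grid."""
    y = f(grid)
    slopes = np.round(np.diff(y) / np.diff(grid), 3)
    return 1 + int(np.sum(slopes[1:] != slopes[:-1]))

grid = np.linspace(0.0, 1.0, 200001)
f = lambda x: x
for depth in range(1, 5):
    f = (lambda g: (lambda x: tent(g(x))))(f)   # stack one more tent layer
    print(depth, count_pieces(f, grid))         # expect pieces to double: 2, 4, 8, 16
```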
SLIDE 52
Depth v/s size tradeoffs for ReLU DNNs Theorem (Arora, Basu, Mianjy, Mukherjee 2016): For every natural number N, there exists a family of R -> R functions such that for any function f in this family, we have:
- 1. f is in ReLU-DNN(N^2, N^3).
- 2. f is NOT in ReLU-DNN(N, (1/2)N^N - 1).
Moreover, this family is in one-to-one correspondence with the torus in N dimensions. Remark: More general versions, Approximation versions. n>=2 version using zonotopal norms. Fact: Any R -> R function in ReLU(k, w) has at most O(w^k) pieces
SLIDE 53
Depth v/s size tradeoffs for ReLU DNNs Theorem (Arora, Basu, Mianjy, Mukherjee 2016): For every natural number N, there exists a family of R -> R functions such that for any function f in this family, we have:
- 1. f is in ReLU-DNN(N^2, N^3).
- 2. f is NOT in ReLU-DNN(N, (1/2)N^N - 1).
Moreover, this family is in one-to-one correspondence with the torus in N dimensions. Remark: More general versions, Approximation versions. n>=2 version using zonotopal norms.
SLIDE 54
Depth v/s size tradeoffs for ReLU DNNs Theorem (Arora, Basu, Mianjy, Mukherjee 2016): For every natural number N, there exists a family of R -> R functions such that for any function f in this family, we have:
- 1. f is in ReLU-DNN(N^2, N^3).
- 2. f is NOT in ReLU-DNN(N, (1/2)N^N - 1).
Moreover, this family is in one-to-one correspondence with the torus in N dimensions.
Remark: More general versions, approximation versions; n >= 2 version using zonotopal norms.
Open Question: Finer gaps, and n >= 2. A recent result of Eldan and Shamir shows an exponential-in-n gap between 1 and 2 hidden layers. Extend to k vs. k+1 layers? k = O(1) vs. k = log(n)?
SLIDE 55
Depth v/s size tradeoffs for ReLU DNNs Restricting inputs to Boolean Hypercube (Mukherjee, Basu 2017):
- 1. Two hidden layers always suffice: any function on the Boolean hypercube is a linear combination of vertex-indicator functions, and each vertex-indicator function can be implemented by a single ReLU gate.
- 2. Exponential lower bounds on ReLU DNNs of O(n^c) depth (for c < 1/8) implementing certain Boolean functions. Also implies some new Boolean circuit complexity results with LTF gates. Discrete geometry techniques: method of sign-rank and random restrictions.
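To illustrate the first bullet (my own sketch, not from the talk): on {0,1}^n the indicator of a vertex v is a single ReLU gate, relu(1 - hamming(x, v)) = relu(<2v - 1, x> + 1 - sum(v)); any f on the hypercube is then the linear combination of these indicators weighted by the values f(v), which is why a shallow ReLU network always suffices here.

```python
# Sketch: a single ReLU gate implementing the indicator of a vertex v of the Boolean hypercube.
import numpy as np
from itertools import product

def relu(z):
    return np.maximum(0.0, z)

def vertex_indicator(v):
    v = np.asarray(v, dtype=float)
    a, b = 2.0 * v - 1.0, 1.0 - v.sum()         # <a, x> + b = 1 - hamming(x, v) on {0,1}^n
    return lambda x: relu(np.asarray(x, dtype=float) @ a + b)

ind = vertex_indicator([1, 0, 1])
for x in product([0, 1], repeat=3):
    assert ind(np.array(x)) == (1.0 if x == (1, 0, 1) else 0.0)
print("single-ReLU vertex indicator verified on {0,1}^3")
```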
SLIDE 56
Problems of interest for DNNs
- Expressiveness: What family of functions can one represent using DNNs?
- Training the network: Given the architecture and data points (x, y), find weights for the "best fit" function.
- Generalization error: Rademacher complexity, VC dimension
- Efficiency: How many layers (depth) and vertices (size) are needed to represent functions in the family?
SLIDE 57
Training Algorithm for ReLU-DNN(1,w)
Theorem (Arora, Basu, Mianjy, Mukherjee 2016): Let n, w be natural numbers, and let (x1, y1), …, (xD, yD) be a set of D data points in R^n × R. There exists an algorithm that solves the following training problem to global optimality:
min{ |F(x1) - y1| + … + |F(xD) - yD| : F in ReLU-DNN(1,w) }
The running time of the algorithm is 2^w D^{nw} poly(D, n, w).
Remark: More general convex loss functions can be handled.
SLIDE 58
Training Algorithm for ReLU-DNN(1,w) Characterization of ReLU(1,w) functions: max{0, < p1, x > + q1} + … + max{0, < pk, x > + qk}
- max{0, < n1, x > + h1} - … - max{0, < ns, x > + hs}
Equivalently: There is a hyperplane arrangement such that the function is affine linear in each cell of the arrangement, and whenever we "cross" a hyperplane of the arrangement, the value changes by the same linear function.
SLIDE 59
Training Algorithm for ReLU-DNN(1,w)
Characterization of ReLU(1,w) functions:
max{0, < p1, x > + q1} + … + max{0, < pk, x > + qk} - max{0, < n1, x > + h1} - … - max{0, < ns, x > + hs}
Equivalently: There is a hyperplane arrangement such that the function is affine linear in each cell of the arrangement, and whenever we "cross" a hyperplane of the arrangement, the value changes by the same linear function.
[Figure: a one-hidden-layer network on inputs x1, x2 with hidden weights a_j, biases b_j, and output weights c_j]
y = c1 max{0, <a1, x> + b1} + c2 max{0, <a2, x> + b2} + c3 max{0, <a3, x> + b3}
SLIDE 60
Training Algorithm for ReLU-DNN(1,w) Characterization of ReLU(1,w) functions: max{0, < p1, x > + q1} + … + max{0, < pk, x > + qk}
- max{0, < n1, x > + h1} - … - max{0, < ns, x > + hs}
Equivalently: There is a hyperplane arrangement such that the function is affine linear in each cell of the arrangement, and whenever we "cross" a hyperplane of the arrangement, the value changes by the same linear function.
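This characterization suggests the enumeration idea behind the training theorem: fix the combinatorial data (which points each gate is active on, and each gate's output sign), and the remaining fit is a convex problem; enumerating all combinatorial patterns then yields a globally optimal solution. Below is a heavily simplified toy variant of that idea (my own sketch, NOT the paper's algorithm): one gate, one input dimension, and an ordinary least-squares fit per pattern instead of the l1 linear program.

```python
# Toy sketch of "enumerate activation patterns, then solve a convex fit per pattern"
# for a single ReLU unit f(x) = s * relu(a x + b) on 1-D data.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def train_single_relu(xs, ys):
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    order = np.argsort(xs)
    best = (np.inf, None)
    for cut in range(len(xs) + 1):                 # a ReLU unit is active on a half-line,
        for side in (0, 1):                        # so patterns are prefixes/suffixes of sorted data
            for s in (-1.0, 1.0):                  # output sign of the unit
                active = np.zeros(len(xs), bool)
                active[order[:cut] if side == 0 else order[cut:]] = True
                if active.any():
                    A = np.column_stack([xs[active], np.ones(active.sum())])
                    coef, *_ = np.linalg.lstsq(A, ys[active] / s, rcond=None)
                else:
                    coef = np.array([0.0, -1.0])   # unit inactive everywhere
                a, b = coef
                loss = np.abs(s * relu(a * xs + b) - ys).sum()   # true l1 training loss
                if loss < best[0]:
                    best = (loss, (s, a, b))
    return best

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
ys = relu(1.5 * xs - 0.5)                          # data generated by a ReLU unit
print(train_single_relu(xs, ys))                   # expect near-zero loss
```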
SLIDE 61
Training Algorithm for ReLU-DNN(1,w)
SLIDE 62
Training Algorithm for ReLU-DNN(1,w)
SLIDE 63
Training Algorithm for ReLU-DNN(1,w)
SLIDE 64
Training Algorithm for ReLU-DNN(1,w)
SLIDE 65
Training Algorithm for ReLU-DNN(1,w)
Theorem (Arora, Basu, Mianjy, Mukherjee 2016): Let n, w be natural numbers, and let (x1, y1), …, (xD, yD) be a set of D data points in R^n × R. There exists an algorithm that solves the following training problem to global optimality:
min{ |F(x1) - y1| + … + |F(xD) - yD| : F in ReLU-DNN(1,w) }
The running time of the algorithm is 2^w D^{nw} poly(D, n, w).
Remark: More general convex loss functions can be handled.
SLIDE 66
Training Algorithm for ReLU-DNN(1,w)
Theorem (Arora, Basu, Mianjy, Mukherjee 2016): Let n, w be natural numbers, and let (x1, y1), …, (xD, yD) be a set of D data points in R^n × R. There exists an algorithm that solves the following training problem to global optimality:
min{ |f(x1) - y1| + … + |f(xD) - yD| : f in ReLU-DNN(1,w) }
The running time of the algorithm is 2^w D^{nw} poly(D, n, w).
Remark: More general convex loss functions can be handled.
Open Questions:
- 1. Is the exponential dependence on the size w necessary?
- 2. Training with 2 or more hidden layers.
SLIDE 67