Discrete Geometry meets Machine Learning

SLIDE 1

Amitabh Basu

Johns Hopkins University

Discrete Geometry meets Machine Learning

22nd Combinatorial Optimization Workshop at Aussois, January 11, 2018. Joint work with Anirbit Mukherjee, Raman Arora, and Poorya Mianjy.

SLIDE 2

Two Problems in Discrete Geometry

Problem 1: Given two polytopes P and Q, do there exist simplices A_1, …, A_p and B_1, …, B_q such that (as Minkowski sums)

  P + A_1 + … + A_p = Q + B_1 + … + B_q ?

SLIDE 3

Two Problems in Discrete Geometry

Problem 1: Given two polytopes P and Q, do there exist simplices A_1, …, A_p and B_1, …, B_q such that P + A_1 + … + A_p = Q + B_1 + … + B_q ?

Problem 2: For a natural number k, define a k-zonotope as the Minkowski sum of a finite set of polytopes, each of which is the convex hull of k points [so a 2-zonotope is a regular zonotope]. Given two 2^n-zonotopes P and Q, do there exist two 2^(n+1)-zonotopes A and B such that conv(P ∪ Q) + A = B ?

SLIDES 4–12

What is a Deep Neural Network (DNN)?

  • Directed acyclic graph (the network architecture)
  • Weights on every edge and every vertex
  • An "activation function" f: R → R. Examples: f(x) = max{0, x} — Rectified Linear Unit (ReLU); f(x) = e^x/(1 + e^x) — sigmoid
  • Sources = inputs, sinks = outputs

[Figure: an example network with inputs x_1, x_2, x_3, outputs y_1, y_2, and edge/vertex weights such as 2, 1.65, −6.8, 3, −1, 0.53, 2.45]

A vertex with vertex weight b, whose incoming edges carry values u_1, u_2, …, u_k and weights a_1, a_2, …, a_k, outputs

  f(a_1 u_1 + a_2 u_2 + … + a_k u_k + b),

which for the ReLU activation is

  max{0, a_1 u_1 + a_2 u_2 + … + a_k u_k + b}.
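
To make the vertex computation concrete, here is a minimal NumPy sketch (added here, not from the talk) of a fully connected ReLU network, a special case of the DAG picture; it assumes the common convention that output vertices are affine, with no activation:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, layers):
    # layers: list of (W, b); W holds the edge weights into a layer,
    # b holds the vertex weights (biases). Each hidden vertex computes
    # max{0, a_1 u_1 + ... + a_k u_k + b}.
    for W, b in layers[:-1]:
        x = relu(W @ x + b)
    W, b = layers[-1]
    return W @ x + b  # output vertices: affine, no activation

# Toy network: 3 inputs, one hidden layer with 2 vertices, 2 outputs.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(2, 3)), rng.normal(size=2)),
          (rng.normal(size=(2, 2)), rng.normal(size=2))]
print(forward(np.array([1.0, -2.0, 0.5]), layers))
```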

SLIDE 13

Problems of interest for DNNs

  • Expressiveness: What family of functions can one represent using DNNs?
  • Efficiency: How many layers (depth) and vertices (size) are needed to represent the functions in the family?
  • Training the network: Given the architecture and data points (x, y), find weights for the "best fit" function.
  • Generalization error: Rademacher complexity, VC dimension

SLIDE 14

Problems of interest for DNNs

  • Expressiveness: What family of functions can one represent using DNNs?
  • Training the network: Given the architecture and data points (x, y), find weights for the "best fit" function.
  • Generalization error: Rademacher complexity, VC dimension
  • Efficiency: How many layers (depth) and vertices (size) are needed to represent the functions in the family?

SLIDES 15–25

Calculus of DNN functions

[DNN(k, s) denotes functions implementable by a network with k hidden layers and size s; ReLU-DNN(k, s) is the same class with ReLU activations.]

  • f in DNN(k,s), c in R => cf in DNN(k,s)
  • f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 + f2 in DNN(max{k1,k2}, s1+s2) [run the two networks in parallel on the same input x and add the outputs]
  • f1 in DNN(k1,s1), f2 in DNN(k2,s2) => f1 ∘ f2 in DNN(k1+k2, s1+s2) [feed the output of the network for f2 into the network for f1]
  • f1 in ReLU-DNN(k1,s1), f2 in ReLU-DNN(k2,s2) => max{f1, f2} in ReLU-DNN(max{k1,k2}+1, s1+s2+4)
  • Affine functions can be implemented in ReLU-DNN(1, 2n)

For the max rule, write F: R^n → R^2, F(x) = (f1(x), f2(x)) and G: R^2 → R, G(z1, z2) = max{z1, z2}, so that max{f1, f2} = G ∘ F. The gate G costs one extra hidden layer and 4 extra ReLU gates, via the identity

  max{z1, z2} = (z1 + z2)/2 + |z1 − z2|/2.

[Figure: a one-hidden-layer ReLU network with inputs x1, x2 computing (x1 + x2)/2 + |x1 − x2|/2, with hidden gates fed by ±x1 ± x2 and output weights 1/2, 1/2, 1/2, −1/2]
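
As a sanity check on the max rule, here is a minimal NumPy sketch (added here) of the 4-gate, one-hidden-layer ReLU implementation of max{z1, z2}, using the identity above together with |a| = max{0, a} + max{0, −a} and a = max{0, a} − max{0, −a}:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def max_gate(z1, z2):
    # One hidden ReLU layer with 4 gates implementing
    # max{z1, z2} = (z1 + z2)/2 + |z1 - z2|/2.
    h = np.array([relu(z1 - z2),    # these two gates sum to |z1 - z2|
                  relu(z2 - z1),
                  relu(z1 + z2),    # these two gates differ by z1 + z2
                  relu(-z1 - z2)])
    w = np.array([0.5, 0.5, 0.5, -0.5])  # output edge weights
    return w @ h

rng = np.random.default_rng(0)
for z1, z2 in rng.normal(size=(5, 2)):
    assert np.isclose(max_gate(z1, z2), max(z1, z2))
print("4-gate ReLU max matches max{z1, z2}")
```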

SLIDE 26

Problems of interest for DNNs

  • Expressiveness: What family of functions can one represent using DNNs?
  • Training the network: Given the architecture and data points (x, y), find weights for the "best fit" function.
  • Generalization error: Rademacher complexity, VC dimension
  • Efficiency: How many layers (depth) and vertices (size) are needed to represent the functions in the family?

SLIDES 27–31

Expressiveness of ReLU DNNs

Theorem (Arora, Basu, Mianjy, Mukherjee 2016): Any ReLU DNN with n inputs implements a continuous piecewise affine function on R^n. Conversely, any continuous piecewise affine function on R^n can be implemented by some ReLU DNN; moreover, at most log(n+1) hidden layers are needed.

Proof: A result from tropical geometry [Ovchinnikov 2002] says that any continuous piecewise affine function can be written as

  max_{i=1,…,k} min_{j in S_i} { l_j },

where the l_j are affine functions and the S_i are index sets. (For example, the hat function on R equals max{ min{x, 1−x}, 0 }.)

SLIDES 32–37

Expressiveness of ReLU DNNs

Proof (Take 2): Any continuous PWL function can be written as the difference of two convex PWL functions:

  max{l^1_1, l^1_2, …, l^1_m} − max{l^2_1, l^2_2, …, l^2_s}.

(For example, the hat function max{min{x, 1−x}, 0} equals max{1, x, 1−x} − max{x, 1−x}.)

Want to show: this can be rewritten as

  max{a^1_1, a^1_2, …, a^1_{n+1}} + … + max{a^q_1, a^q_2, …, a^q_{n+1}} − max{c^1_1, c^1_2, …, c^1_{n+1}} − … − max{c^p_1, c^p_2, …, c^p_{n+1}},

i.e., as a sum and difference of maxima of only n+1 affine functions each.

Can assume without loss of generality that the l^i_j are linear. Then

  max{l^1_1, l^1_2, …, l^1_m} = max{<a^1_1, x>, <a^1_2, x>, …, <a^1_m, x>} = support function of conv({a^1_1, a^1_2, …, a^1_m}).

Equivalent formulation [h_K denotes the support function of K]:

  h_P − h_Q = h_{B_1} + … + h_{B_q} − h_{A_1} − … − h_{A_p}

  iff h_P + h_{A_1} + … + h_{A_p} = h_Q + h_{B_1} + … + h_{B_q}

  iff P + A_1 + … + A_p = Q + B_1 + … + B_q.

Here the A_i and B_j range over simplices (convex hulls of n+1 points), so this is exactly Problem 1.
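
This chain of equivalences rests on two standard facts about support functions, recorded here for completeness: support functions add under Minkowski sums, and a compact convex body is determined by its support function. In LaTeX:

```latex
% Standard support-function facts:
%   h_{P+Q}(u) = h_P(u) + h_Q(u)  for all u   (Minkowski sum)
%   h_P = h_Q  \iff  P = Q                     (h determines the body)
\[
  h_P + h_{A_1} + \cdots + h_{A_p} = h_Q + h_{B_1} + \cdots + h_{B_q}
  \;\iff\; h_{P + A_1 + \cdots + A_p} = h_{Q + B_1 + \cdots + B_q}
  \;\iff\; P + A_1 + \cdots + A_p = Q + B_1 + \cdots + B_q .
\]
```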

SLIDE 38

Expressiveness of ReLU DNNs

Proof (Take 2): A result from the circuits literature [Wang and Sun 2006] says that any continuous piecewise affine function can be written as

  c_1 max{l^1_1, l^1_2, …, l^1_{n+1}} + … + c_k max{l^k_1, l^k_2, …, l^k_{n+1}}

for affine functions l^i_j and reals c_i. Since each term is a maximum of only n+1 affine functions, repeatedly applying the max rule from the calculus slides yields the log(n+1) bound on hidden layers in the theorem.

SLIDES 39–40

Expressiveness of ReLU DNNs

The theorem yields the hierarchy

  ReLU-DNN(1, *) ⊆ ReLU-DNN(2, *) ⊆ ReLU-DNN(3, *) ⊆ … ⊆ ReLU-DNN(log(n+1), *),

where the last class already contains all continuous piecewise affine functions on R^n.

Open Question: which of these inclusions are strict?

SLIDES 41–43

Expressiveness of ReLU DNNs

Open Question: Is the hierarchy ReLU(1, *) ⊆ ReLU(2, *) ⊆ ReLU(3, *) ⊆ … strict?

Proof Strategy/Plan:

Definition: Let n, d be natural numbers. The set of functions on R^d that can be written as linear combinations of n-max functions (maxima of n affine functions) will be denoted by (d,n)-HH.

Claim 1: Let k, d be natural numbers such that 2^k ≤ d. Any function in ReLU(k, *) on R^d is a linear combination of 2^k-max functions.

Claim 2: Let n, d be natural numbers such that n ≤ d+1. Then (d,n)-HH ⊊ (d,n+1)-HH.

SLIDE 44

Expressiveness of ReLU DNNs

Claim 1: Any function in ReLU(k, *) is a linear combination of 2^k-max functions.

This is equivalent to Problem 2: For a natural number k, define a k-zonotope as the Minkowski sum of a finite set of polytopes, each of which is the convex hull of k points [so a 2-zonotope is a regular zonotope]. Given two 2^n-zonotopes P and Q, do there exist two 2^(n+1)-zonotopes A and B such that conv(P ∪ Q) + A = B ?
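
A quick numerical illustration (added here) of the support-function identities behind these polytope reformulations: for a finite set S, h_{conv(S)}(u) = max_{a in S} <a, u>; support functions add under Minkowski sum; and taking the convex hull of a union takes the max of support functions.

```python
import numpy as np
from itertools import product

def support(S, u):
    # Support function of conv(S) for a finite point set S.
    return max(np.dot(a, u) for a in S)

def minkowski(S, T):
    # Generating points of the Minkowski sum conv(S) + conv(T).
    return [np.asarray(a) + np.asarray(b) for a, b in product(S, T)]

P = [np.array(p) for p in [(0, 0), (1, 0), (0, 1)]]  # a 2-simplex
Q = [np.array(q) for q in [(0, 0), (2, 1)]]          # a segment

rng = np.random.default_rng(1)
for _ in range(5):
    u = rng.normal(size=2)
    # h_{P+Q} = h_P + h_Q
    assert np.isclose(support(minkowski(P, Q), u),
                      support(P, u) + support(Q, u))
    # h_{conv(P u Q)} = max{h_P, h_Q}  (P + Q concatenates the point lists)
    assert np.isclose(support(P + Q, u),
                      max(support(P, u), support(Q, u)))
print("support-function identities verified on random directions")
```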

SLIDE 45

Problems of interest for DNNs

  • Expressiveness: What family of functions can one represent using DNNs?
  • Training the network: Given the architecture and data points (x, y), find weights for the "best fit" function.
  • Generalization error: Rademacher complexity, VC dimension
  • Efficiency: How many layers (depth) and vertices (size) are needed to represent the functions in the family?

SLIDES 46–48

Depth vs. size tradeoffs for ReLU DNNs

Theorem (Arora, Basu, Mianjy, Mukherjee 2016): For every natural number N, there exists a family of R → R functions such that for any function f in this family:

  1. f is in ReLU-DNN(N^2, N^3).
  2. f is NOT in ReLU-DNN(N, (1/2)N^N − 1).

Moreover, this family is in one-to-one correspondence with the N-dimensional torus.

Remarks: more general versions; approximation versions; an n ≥ 2 version using zonotopal norms.

Fact: Any R → R PWL function with p pieces is in ReLU-DNN(1, p+1).

[Figures: plots of piecewise linear functions on [0, 1.2] with increasing numbers of pieces]
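
The Fact has a direct construction; here is a minimal sketch (added here, under the simplifying assumption 0 < b_1, so that f(0) = f0) building a one-hidden-layer net with p+1 ReLU gates for a continuous PWL function on R with p pieces:

```python
import numpy as np

def pwl_as_relu_net(breakpoints, slopes, f0):
    # f has p pieces: slopes s_0, ..., s_{p-1} separated by breakpoints
    # b_1 < ... < b_{p-1}, with f(0) = f0 and 0 < b_1.
    # Gates: relu(x) and relu(-x) give the leading linear part via
    # x = relu(x) - relu(-x); one gate relu(x - b_i) per breakpoint
    # adds the slope change there. Total: 2 + (p-1) = p+1 gates.
    b, s = np.asarray(breakpoints, float), np.asarray(slopes, float)
    def f(x):
        out = f0 + s[0] * (np.maximum(0, x) - np.maximum(0, -x))
        for bi, ds in zip(b, np.diff(s)):
            out = out + ds * np.maximum(0, x - bi)
        return out
    return f

# Hat-like example: slopes 0, 1, -1, 0 with breakpoints 1, 2, 3.
f = pwl_as_relu_net([1, 2, 3], [0, 1, -1, 0], f0=0.0)
print([f(x) for x in (0.0, 1.0, 2.0, 3.0, 4.0)])  # [0, 0, 1, 0, 0]
```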

SLIDES 49–53

Depth vs. size tradeoffs for ReLU DNNs

(Recall the composition rule from the calculus of DNN functions: f1 ∘ f2 in DNN(k1+k2, s1+s2). Composing shallow sawtooth functions multiplies their numbers of pieces; this is the mechanism behind the family in the theorem: deep but small networks with very many pieces.)

Fact: Any R → R function in ReLU(k, w) has at most O(w^k) pieces. This piece-counting bound gives the lower bound in the theorem: a shallow network can match a function with that many pieces only if its size is enormous.

[Figures: plots of piecewise linear functions with increasing numbers of pieces]
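
To see the composition mechanism concretely, here is a minimal sketch of the standard construction along these lines (added here; not necessarily the exact family from the talk) composing a 2-gate ReLU "tent" map with itself, so that depth k yields 2^k affine pieces from only 2k gates:

```python
import numpy as np

def tent(x):
    # tent(x) = max{0, 2x} - max{0, 4x - 2}: two ReLU gates,
    # mapping [0, 1] onto [0, 1] with 2 affine pieces.
    return np.maximum(0, 2 * x) - np.maximum(0, 4 * x - 2)

def sawtooth(x, k):
    # k-fold composition: a depth-k, width-2 ReLU network whose
    # graph on [0, 1] is a sawtooth with 2^k affine pieces.
    for _ in range(k):
        x = tent(x)
    return x

xs = np.linspace(0, 1, 9)
print(sawtooth(xs, 3))  # oscillates: 2^3 = 8 pieces on [0, 1]
```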

SLIDE 54

Depth vs. size tradeoffs for ReLU DNNs

Open Question: Finer gaps, and the case n ≥ 2. A recent result of Eldan and Shamir shows an exponential-in-n gap between 1 and 2 hidden layers. Extend to k vs. k+1 layers? k = O(1) vs. k = log(n)?

Remarks: more general versions; approximation versions; an n ≥ 2 version using zonotopal norms.

SLIDE 55

Depth vs. size tradeoffs for ReLU DNNs — restricting inputs to the Boolean hypercube (Mukherjee, Basu 2017):

  1. Two hidden layers always suffice: any function on the Boolean hypercube is a linear combination of the vertex-indicator functions, and each vertex-indicator function can be implemented by a single ReLU gate.
  2. Exponential lower bounds on ReLU DNNs of O(n^c) depth implementing certain Boolean functions (for c < 1/8). This also implies some new Boolean circuit complexity results with LTF gates.

Discrete geometry techniques: the method of sign-rank and random restrictions.
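
One standard way to realize a vertex indicator with a single ReLU gate, as in item 1 (a sketch added here, not taken verbatim from the talk): for a vertex v of {0,1}^n with |v| ones, the gate max{0, <2v − 1, x> − |v| + 1} equals 1 at x = v and 0 at every other hypercube vertex.

```python
import numpy as np
from itertools import product

def vertex_indicator(v):
    # Single ReLU gate max{0, <2v - 1, x> - |v| + 1}: on the hypercube,
    # <2v - 1, x> equals |v| at x = v and is <= |v| - 1 elsewhere.
    v = np.asarray(v, dtype=float)
    a, b = 2 * v - 1, 1 - v.sum()
    return lambda x: np.maximum(0.0, a @ np.asarray(x, dtype=float) + b)

# Check on the 3-cube: the gate for v fires exactly at v.
v = (1, 0, 1)
g = vertex_indicator(v)
for x in product([0, 1], repeat=3):
    assert g(x) == (1.0 if x == v else 0.0)
print("single-ReLU vertex indicator verified on {0,1}^3")
```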

SLIDE 56

Problems of interest for DNNs

  • Expressiveness: What family of functions can one represent using DNNs?
  • Training the network: Given the architecture and data points (x, y), find weights for the "best fit" function.
  • Generalization error: Rademacher complexity, VC dimension
  • Efficiency: How many layers (depth) and vertices (size) are needed to represent the functions in the family?

SLIDE 57

Training Algorithm for ReLU-DNN(1,w)

Theorem (Arora, Basu, Mianjy, Mukherjee 2016): Let n, w be natural numbers and (x_1, y_1), …, (x_D, y_D) a set of D data points in R^n × R. There exists an algorithm that solves the following training problem to global optimality:

  min{ |F(x_1) − y_1| + … + |F(x_D) − y_D| : F in ReLU-DNN(1,w) }.

The running time of the algorithm is 2^w · D^{nw} · poly(D, n, w).

Remark: more general convex loss functions can be handled.

SLIDES 58–60

Training Algorithm for ReLU-DNN(1,w)

Characterization of ReLU(1,w) functions:

  max{0, <p_1, x> + q_1} + … + max{0, <p_k, x> + q_k} − max{0, <n_1, x> + h_1} − … − max{0, <n_s, x> + h_s}

Equivalently: there is a hyperplane arrangement such that the function is affine in each cell of the arrangement, and whenever we "cross" a hyperplane of the arrangement, the value changes by the same linear function (one fixed linear function per hyperplane).

[Figure: a width-3 ReLU network with inputs x_1, x_2, gate weights a_j, biases b_j, and output weights c_j, computing y = c_1 max{0, <a_1, x> + b_1} + c_2 max{0, <a_2, x> + b_2} + c_3 max{0, <a_3, x> + b_3}]
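
The theorem's algorithm exploits this characterization: fix the sign of each gate's output weight (its magnitude can be absorbed into the gate) and the on/off status of each gate on each data point; the remaining fit is a linear program. The hedged brute-force sketch below (added here; all names illustrative) enumerates 2^w · 2^(Dw) cases rather than the D^(nw) geometrically realizable activation patterns of the actual algorithm, so it is only meant for toy sizes:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def train_relu_1w(X, y, w):
    # Globally minimize sum_d |F(x_d) - y_d| over F in ReLU-DNN(1, w),
    # where F(x) = sum_i sign_i * max{0, <a_i, x> + b_i}.
    D, n = X.shape
    m = w * (n + 1)                    # variables: (a_i, b_i) per gate
    best = (np.inf, None)
    def gate_row(i, x):                # coefficient row of <a_i, x> + b_i
        r = np.zeros(m + D)
        r[i*(n+1):i*(n+1)+n], r[i*(n+1)+n] = x, 1.0
        return r
    for signs in itertools.product([1, -1], repeat=w):
        for pat in itertools.product([0, 1], repeat=D * w):
            P = np.array(pat).reshape(D, w)   # gate i active on point d?
            A_ub, b_ub = [], []
            for d in range(D):
                # With the pattern fixed, F(x_d) is LINEAR in (a, b):
                f = sum(signs[i] * P[d, i] * gate_row(i, X[d]) for i in range(w))
                t = np.zeros(m + D); t[m + d] = -1.0
                A_ub += [f + t, -f + t]; b_ub += [y[d], -y[d]]  # |F - y_d| <= t_d
                for i in range(w):     # pattern consistency constraints
                    g = gate_row(i, X[d])
                    A_ub.append(-g if P[d, i] else g); b_ub.append(0.0)
            res = linprog(np.r_[np.zeros(m), np.ones(D)],   # minimize sum t_d
                          A_ub=np.array(A_ub), b_ub=b_ub,
                          bounds=[(None, None)] * m + [(0, None)] * D)
            if res.success and res.fun < best[0]:
                best = (res.fun, (signs, res.x[:m]))
    return best   # (optimal l1 training loss, parameters)
```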

SLIDES 61–64

Training Algorithm for ReLU-DNN(1,w)

[Figures]

SLIDES 65–66

Training Algorithm for ReLU-DNN(1,w)

Open Questions:

  1. Is the exponential dependence on the size w necessary?
  2. Training with 2 or more hidden layers.

SLIDE 67

Thank you! Questions/Comments/Answers?