SLIDE 1

Partial Differential Equations Approaches to Optimization and Regularization of Deep Neural Networks

Celebrating 75 Years of Mathematics of Computation ICERM

Nov 2, 2018 Adam Oberman McGill Dept of Math and Stats

supported by NSERC, Simons Fellowship, AFOSR FA9550-18-1-0167

SLIDE 2

Background: AI

  • Artificial Intelligence is loosely defined as intelligence exhibited by machines.
  • Operationally: R&D in CS academic sub-disciplines: Computer Vision, Natural Language Processing (NLP), Robotics, etc.

AlphaGo used DL to beat the world champion at Go.

SLIDE 3
Artificial General Intelligence (AGI)

  • AI: specific tasks; AGI: general cognitive abilities.
  • AGI is a small research area within AI: build machines that can successfully perform any task that a human might do.
  • So far, no progress on AGI.

SLIDE 4

Deep Learning vs. traditional Machine Learning

  • Machine Learning (ML) has been around for some time.
  • Deep Learning is a newer branch of ML which uses Deep Neural Networks.
  • ML has theory: error estimates and convergence proofs.
  • DL has less theory, but it can effectively solve substantially larger-scale problems.

SLIDE 5
  • ImageNet: total number of classes: m = 21,841
  • Total number of images: n = 14,197,122
  • Color images: d = 3 × 256 × 256 = 196,608

Facebook used 256 GPUs, working in parallel, to train on ImageNet. It is still an academic dataset: the total number of images on Facebook is much larger.

What are DNNs (in Math Language)?

SLIDE 6

What is the function? Looking for a map from images to labels

[Diagram] x in M = manifold of images → f(x) in the list of word labels.

SLIDE 7

Doing the impossible?

In theory, due to the curse of dimensionality, it is impossible to accurately interpolate a high-dimensional function. In practice, it is possible using the Deep Neural Network architecture, training to fit the data with SGD. However, we don't know why it works.

Can train a computer to caption images more accurately than human performance.

SLIDE 8

Loss Functions versus Error

Classification problem: map an image to a discrete label space {1, 2, 3, …, 10}. In practice: map to a probability vector, then assign the label of the arg max. Classification is not differentiable, so, in order to train, use a loss function as a surrogate.
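A minimal numpy illustration of the surrogate idea (all values illustrative, not from the talk): the 0-1 classification error is piecewise constant in the scores, while the cross-entropy loss on the probability vector is smooth and can be differentiated for training.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, -0.5, 2.0, 1.9])    # network scores for 4 classes
y = 3                                       # correct label

p = softmax(scores)
zero_one_error = float(np.argmax(p) != y)   # not differentiable in the scores
cross_entropy = -np.log(p[y])               # smooth surrogate used for training
print(zero_one_error, cross_entropy)
```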

SLIDE 9

DNNs in Math Language: high dimensional function fitting

Data fitting problem: f is a parameterized map from images to probability vectors on labels; y is the correct label. Try to fit the data by minimizing the loss. Training: minimize the expected loss by taking stochastic (approximate) gradients. Note: we train on an empirical distribution sampled from the density rho.

$$\min_w \ \mathbb{E}_{x\sim\rho_n}\,\ell\big(f(x;w),\,y(x)\big) \;=\; \min_w \ \frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x_i;w),\,y(x_i)\big)$$
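A minimal sketch of this training procedure on a toy least-squares problem (assumptions mine; the deck's setting is a DNN, not least squares): stochastic gradients over mini-batches sampled from the empirical distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
h = 0.05                                    # learning rate
for k in range(2000):
    idx = rng.choice(n, size=32)            # mini-batch from the empirical distribution
    g = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # stochastic gradient of the loss
    w -= h * g

print(np.linalg.norm(w - w_true))           # approaches 0 (up to noise)
```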
SLIDE 10
Generalization: training set and test set

Goal: generalization: the hope is that the training error is a good estimate of the generalization loss, which is the expected loss on unseen images drawn from the same distribution.

$$\mathbb{E}_{x\sim\rho}\,\ell\big(f(x;w),\,y(x)\big) \;=\; \int \ell\big(f(x;w),\,y(x)\big)\,d\rho(x)$$

Testing: reserve some data and approximate the generalization loss/error on the test data, which is a surrogate for the true expected error on the full density.

$$\mathbb{E}_{x\sim\rho_{\mathrm{test}}}\,\ell\big(f(x;w),\,y(x)\big) \;=\; \frac{1}{n_{\mathrm{test}}}\sum_{i\in\mathrm{test}}\ell\big(f(x_i;w),\,y(x_i)\big)$$

Orange curve: overtrained. Green curve: better generalization.
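The overtraining phenomenon in the figure can be reproduced in a toy setting; a short illustrative experiment (polynomial regression, assumptions mine, not from the talk): the higher-capacity fit drives training error down while test error grows.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + 0.2 * rng.standard_normal(40)
x_tr, y_tr, x_te, y_te = x[:30], y[:30], x[30:], y[30:]   # train/test split

for degree in (3, 15):                     # moderate vs. high capacity
    c = np.polyfit(x_tr, y_tr, degree)
    train_err = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
    test_err = np.mean((np.polyval(c, x_te) - y_te) ** 2)
    print(degree, train_err, test_err)     # degree 15: typically smaller train, larger test error
```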

SLIDE 11

Challenges for deep learning

“It is not clear that the existing AI paradigm is immediately amenable to any sort of software engineering validation and verification. This is a serious issue, and is a potential roadblock to DoD’s use of these modern AI systems, especially when considering the liability and accountability of using AI.” (JASON report)

SLIDE 12

Mary Shaw’s evolution of software engineering discipline

Better theory improves reliability, and the discipline evolves.

SLIDE 13

Entropy-SGD

Pratik Chaudhari UCLA (now Amazon/ U Penn)

Stefano Soatto, UCLA Comp Sci; Stanley Osher, UCLA Math; Guillaume Carlier, CEREMADE, U. Paris IX Dauphine

  • 2017 UCLA PhD student (at time of research)
  • 2018 (present) Amazon research
  • Fall 2019 Faculty in ESE at U Penn

Deep Relaxation: partial differential equations for optimizing deep neural networks. Pratik Chaudhari, Adam M. Oberman, Stanley Osher, Stefano Soatto, Guillaume Carlier, 2017

SLIDE 14

Entropy SGD results in Deep Neural Networks (Pratik)

Visualization of the improvement in training loss (left) and the improvement in validation error (right); dimension = 1.67 million.

SLIDE 15

Different Interpretation: Regularization using Viscous Hamilton-Jacobi PDE

Solution of the PDE in one dimension. Cartoon: the algorithm only solves the PDE for a time depending on Hf(x).
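The one-dimensional picture is easy to reproduce; a minimal finite-difference sketch, under the assumption that the PDE in question is the viscous Hamilton-Jacobi equation u_t + ½|u_x|² = ε u_xx with u(x, 0) = f(x): solving it for a short time smooths a nonconvex objective.

```python
import numpy as np

# Explicit finite differences for u_t + (1/2)|u_x|^2 = eps * u_xx, u(x,0) = f(x).
N, eps = 400, 0.1
x = np.linspace(-2.0, 2.0, N)
dx = x[1] - x[0]
u = x ** 2 + 0.3 * np.cos(8 * x)          # nonconvex initial objective f(x)

dt = 0.2 * dx ** 2 / eps                  # conservative explicit time step
for _ in range(500):
    ux = (u[2:] - u[:-2]) / (2 * dx)      # centered first derivative
    uxx = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx ** 2
    u[1:-1] += dt * (eps * uxx - 0.5 * ux ** 2)

# The smoothed surrogate has fewer spurious local minima than f.
print(x[np.argmin(u)], u.min())
```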

SLIDE 16

Expected Improvement Theorem in continuous time

SLIDE 17

Adaptive-SGD

joint with PhD Student Mariana Prazeres

SLIDE 18

Model for mini-batch gradients: k = 1 means.

[Equations: full gradient vs. mini-batch gradient]

For k > 1 means, the same calculation applies if we restrict to the active indices. This leads to smaller active batch sizes, and higher variance.
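A small numpy check of this model (illustrative, not from the talk): the mini-batch gradient is an unbiased estimate of the full gradient, with variance decaying roughly like 1/(batch size); smaller active batches therefore mean higher variance.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5000, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)
w = rng.standard_normal(d)

def grad(idx):
    # least-squares gradient restricted to the sampled indices
    return X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

g_full = grad(np.arange(n))
for m in (10, 100, 1000):
    devs = [np.sum((grad(rng.choice(n, size=m)) - g_full) ** 2) for _ in range(200)]
    print(m, np.mean(devs))     # mean squared deviation shrinks roughly like 1/m
```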
SLIDE 19

Motivation: quality of gMB depends on x

[Figure: mini-batch gradient directions far from and near the minimizer x*]

  • Far from x*, mb gradients point in a good direction.
  • Near x*, require more samples, or small steps (so that directions average in time).

SLIDE 20

Adapt in Space instead of Time

[Figure: two regions of the same objective; one where MB = 10 suffices, another where MB = 60 is needed]

The ideal learning rate/batch size combination should depend on x (space) rather than k (time).

SLIDE 21

Adaptive SGD

  • Adaptively, depending on x, decide on the MB size, or the learning rate.
  • Use the following formula (derived later); a hedged sketch follows below.
  • f large: learning rate large (OK to use a small MB).
  • g small: var(MB) restricts the learning rate (so use a large MB).
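The formula itself did not survive extraction; the sketch below is a Polyak-type rule consistent with the two bullets (an assumption, not the slide's formula): the step is large when the objective gap is large, and is throttled when the mini-batch variance dominates.

```python
import numpy as np

def adaptive_learning_rate(f_x, f_star, g_mb, var_mb, eps=1e-12):
    # Large gap f(x) - f* -> large step; large mini-batch variance -> smaller step.
    return (f_x - f_star) / (np.sum(g_mb ** 2) + var_mb + eps)

h = adaptive_learning_rate(f_x=1.0, f_star=0.0,
                           g_mb=np.array([0.3, -0.4]), var_mb=0.05)
print(h)
```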
SLIDE 22

Benchmarks: Fix MB and adapt h

[Figure, left: f(x) - f* for a typical run; right: f(x) - f* averaged over 40 runs; curves: Scheduled, Scheduled 1/t, Adaptive]

Left: it is not too clear what is happening with one path. Right: average over several runs to see the trend.

SLIDE 23
Paths of Scheduled and Adaptive SGD

[Figure: sample paths of Scheduled SGD and Adaptive SGD]

The variance of the paths is clear from this figure.

SLIDE 24

Proof of Convergence with Rate

The rate is the same order as for SGD, but with a better constant.

SLIDE 25

Proof of Convergence and Generalization for Lipschitz Regularized DNNs

joint with Jeff Calder

Lipschitz regularized Deep Neural Networks converge and generalize. O. and Jeff Calder, 2018.

SLIDE 26

Background

  • On the generalization/convergence result
SLIDE 27

Problem: traditional ML theory does not apply to Deep Learning

A new idea is needed to make Deep Learning more reliable.

“Understanding Deep Learning requires rethinking generalization,” Zhang et al. (2016)

SLIDE 28

Inspiration: an old idea: Total Variation Denoising [Rudin-Osher-Fatemi, 1992]

Used in early, high-profile image reconstruction of video images.

Stanley Osher

  • Minimize a variational functional (written below): a combination of a loss term, fitting the original noisy image, and a regularization term.
  • Regularization is large on noise, small on images.
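For reference, the Rudin-Osher-Fatemi functional has the standard form (a fact about the 1992 model, with f the noisy image and λ the regularization weight):

$$\min_u \ \int_\Omega \big(u(x) - f(x)\big)^2\,dx \;+\; \lambda \int_\Omega |\nabla u(x)|\,dx$$

The total variation term is large on noise and small on piecewise-smooth images, which is exactly the property the bullet above describes.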
SLIDE 29

Regularization: from images to maps

[Diagram] x in M = manifold of images → f(x) = word labels. The learned map is well-behaved on the data manifold, but very bad off the manifold (without regularization).

SLIDE 30

What is new in our result?

  • Bartlett proved generalization under the assumption of Lipschitz regularity.
  • However, DNNs are not uniformly Lipschitz.
  • By adding regularization to the objective function in the training, we obtain the uniform Lipschitz bounds.
SLIDE 31

Approaches to regularization

  • A. Machine Learning: learn the data using an appropriate (smooth) parameterized class of functions.
  • B. Algorithmic: use an algorithm which selects the best solution (e.g. Stochastic Gradient Descent as a regularizer, adversarial training).
  • C. Inverse problems: allow for a broad class of functions, but modify the loss to choose the right one.

A. $$\min_w \ \mathbb{E}_{x\sim\rho}\,\ell\big(f(x;w),\,y(x)\big)$$

B. $$w_{k+1} \;=\; w_k \;-\; h_k\,\nabla_{\mathrm{mb}}\,\ell(\dots,\,w_k)$$

C. $$\min_w \ \mathbb{E}_{x\sim\rho}\,\ell\big(f(x;w),\,y(x)\big) \;+\; \big\|\nabla_x f\big\|_{L^p(X,\rho(x))}$$
SLIDE 32

Comment for math audience

  • Our result may not be surprising to math experts.
  • However, it is a new approach to generalization theory.
  • Speaking personally, the hard work was giving a math interpretation of the problem (1.5 years).
  • Once the model was set up correctly, and we realized we could implement it in a DNN, the math was relatively easier.
  • The paper and proof were done in about 6 weeks.
SLIDE 33

Convergence: two cases

Clean labels:

  • relevant in benchmark data sets and applications,
  • simpler proof, since the clean label function is a minimizer,
  • regime of perfect data interpolation possible with DNNs.

Noisy labels:

  • relevant in applications,
  • familiar setting for calculus of variations.

SLIDE 34

Statement and proof sketch of the generalization/convergence result

SLIDE 35

Lipschitz Regularization of DNNs

Data functional: augment the expected loss function on the data with a Lipschitz regularization term, where L0 is the Lipschitz constant of the data and n is the number of data points. We are interested in the limit as we sample more points. The limiting functional is given below.
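A plausible form of these functionals, under the assumption that the penalty is the excess Lipschitz constant over L0 (λ a regularization weight; an assumption consistent with the description above, not legible in the slides):

$$J_n[f] \;=\; \frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x_i),\,y(x_i)\big) \;+\; \lambda\,\big(\mathrm{Lip}(f) - L_0\big)^{+},$$

with limiting functional

$$J[f] \;=\; \mathbb{E}_{x\sim\rho}\,\ell\big(f(x),\,y(x)\big) \;+\; \lambda\,\big(\mathrm{Lip}(f) - L_0\big)^{+}.$$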

SLIDE 36

Convergence theorem for Noisy Labels

Convergence on the data manifold; Lipschitz off the manifold.

SLIDE 37

Convergence for clean labels (with a rate)

  • Rate of convergence, on the data manifold, of the minimizers.
  • The rate depends on n, the number of data points sampled, and m, the number of labels.
  • Probabilistic bound, where we obtain a given error with high probability.
  • Uniform sampling vs. random sampling: the log term and the probability go away.

SLIDE 38

Proof

SLIDE 39

Generalization follows

SLIDE 40

Lipschitz Regularization improves Adversarial Robustness

Chris Finlay (current PhD student)

Improved robustness to adversarial examples using Lipschitz regularization of the loss. Chris Finlay, O., Bilal Abbasi; Oct 2018; arXiv.

Bilal Abbasi (former PhD now working in AI)

SLIDE 41

Adversarial Attacks

SLIDE 42

Adversarial Attacks on the loss

SLIDE 43

Scale measures visible attacks

DNNs are vulnerable to attacks which are invisible to the human eye. Undefended networks have a 100% error rate at perturbation size 0.1 (in the max norm).
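A generic sketch of such an attack (FGSM, named later in the deck; a standard implementation, not the authors' code): a signed-gradient step of max-norm size eps.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.1):
    # One signed-gradient step: the perturbation satisfies ||delta||_inf = eps.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

model = torch.nn.Linear(784, 10)            # toy model for illustration
x = torch.randn(8, 784)
y = torch.randint(0, 10, (8,))
x_adv = fgsm_attack(model, x, y, eps=0.1)
```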

SLIDE 44

Implementation of Lipschitz Regularization of the Loss
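A hedged sketch of one standard way to implement this, penalizing the input-gradient norm of the loss via double backpropagation (an illustration of the idea named in the title, not necessarily the authors' exact method):

```python
import torch
import torch.nn.functional as F

def lipschitz_regularized_loss(model, x, y, lam=0.1):
    # Penalize the norm of the gradient of the loss with respect to the input.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (grad_x,) = torch.autograd.grad(loss, x, create_graph=True)
    penalty = grad_x.flatten(1).norm(dim=1).pow(2).mean()
    return loss + lam * penalty

model = torch.nn.Linear(784, 10)            # toy model for illustration
x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))
loss = lipschitz_regularized_loss(model, x, y)
loss.backward()                             # trains through the gradient penalty
```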

SLIDE 45

Robustness bounds follow from the Lipschitz constant of the loss, so training the model to have a better Lipschitz constant will improve the adversarial robustness bounds.
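The bound follows directly from the definition of the Lipschitz constant: writing L for the Lipschitz constant of the map x ↦ ℓ(f(x), y),

$$\big|\ell\big(f(x+\delta),\,y\big) - \ell\big(f(x),\,y\big)\big| \;\le\; L\,\|\delta\|,$$

so an adversarial perturbation of size ε can change the loss by at most Lε; a smaller L gives a stronger robustness guarantee.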

SLIDE 46

Arms race of attack methods and defences

We tested against a toolbox of attacks, and plotted the error curve as a function of the adversarial size.

Strongest attacks:

  1. Iterative l2-projected gradient
  2. Iterative Fast Gradient Signed Method (FGSM)

SLIDE 47

Adversarial Training: interpretation as regularization

SLIDE 48

Adversarial Training augmented with Lipschitz Regularization

SLIDE 49

AT + Tulip Results (2-norm)

A significant improvement over state-of-the-art results comes from augmenting AT with Lipschitz regularization.

SLIDE 50

AT + Tulip Results (2-norm vs max-norm)

2-Lip > AT

  • 2 > AT
  • 1 > baseline (for all noise levels on both datasets)
SLIDE 51

Other current areas of interest in AI with connections to mathematics

We are looking for collaborators: these are possible new projects

SLIDE 52

Reinforcement Learning

  • Related to dynamic programming
  • Computationally intensive and unstable

Related math: dynamic programming, optimal control

SLIDE 53

Recurrent NN

SLIDE 54

Generative Networks (GANs)

Wasserstein GANs: optimal transportation (OT) mapping between random noise (Gaussians) and the target distribution of images. Related math: optimal transportation algorithms and convergence (Peyré-Cuturi).
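As a pointer to that related math, a minimal entropic-OT (Sinkhorn) iteration in the Peyré-Cuturi style (an illustrative sketch with a toy cost matrix, not from the talk):

```python
import numpy as np

def sinkhorn(mu, nu, C, reg=0.1, iters=200):
    # Entropic optimal transport: alternate scaling of the Gibbs kernel.
    K = np.exp(-C / reg)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]     # transport plan between mu and nu

mu = np.full(5, 0.2)                       # source distribution
nu = np.full(5, 0.2)                       # target distribution
C = (np.arange(5)[:, None] - np.arange(5)[None, :]) ** 2.0   # squared-distance cost
P = sinkhorn(mu, nu, C)
print(P.sum(), np.round(P, 3))
```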

SLIDE 55

Squeeze Nets

Inference (evaluating the data and assigning a label) is costly (typically 0.1 seconds on a power-hungry, high-memory GPU) in terms of:

  • Memory (to store the weights)
  • Computation (multiplying the matrices times the vectors)
  • Power (the energy in Joules used by the chip)
  • Time

Research effort to make lean NNs. How?

  • Quantization: low-bit number representation and arithmetic. (Related math: non-smooth optimization, when the ReLUs are also quantized.)
  • Pruning: trim off the small weights, and retrain (see the sketch below).
  • Hyperparameter optimization: train over multiple architectures and params.

Mostly an engineering effort, but could be combined with more math on the training.
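A minimal magnitude-pruning sketch (illustrative, not from the talk): keep the largest weights by absolute value and zero out the rest; in practice the pruned network is then retrained.

```python
import numpy as np

def prune_by_magnitude(weights, keep=0.1):
    # Zero all but the largest `keep` fraction of weights (by absolute value).
    threshold = np.quantile(np.abs(weights), 1.0 - keep)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

w = np.random.default_rng(3).standard_normal(1000)
w_pruned = prune_by_magnitude(w, keep=0.1)
print(np.mean(w_pruned != 0.0))            # about 10% of weights survive
```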