FRACTIONAL UNDERDAMPED LANGEVIN DYNAMICS: Umut Şimşekli, LTCI, Télécom Paris

SLIDE 1

Umut Şimşekli

LTCI, Télécom Paris, Institut Polytechnique de Paris

FRACTIONAL UNDERDAMPED LANGEVIN DYNAMICS: RETARGETING SGD WITH MOMENTUM UNDER HEAVY-TAILED GRADIENT NOISE

Umut Şimşekli*, Lingjiong Zhu* (Florida State University), Yee Whye Teh (University of Oxford), Mert Gürbüzbalaban (Rutgers University)

*equal contribution

International Conference on Machine Learning, 2020

SLIDE 2

DEEP LEARNING & SGD-MOMENTUM

§ Deep learning (in general):

$$x^\star = \arg\min_{x \in \mathbb{R}^d} \Big\{ f(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f^{(i)}(x) \Big\}$$

where $x$ denotes the network weights, $f$ the non-convex cost function, and $i = 1, \dots, n$ indexes the data points.

§ Optimization algorithm: Stochastic Gradient Descent with momentum (SGD-m)

$$\tilde v_{k+1} = \tilde\gamma\, \tilde v_k - \tilde\eta\, \nabla \tilde f_{k+1}(x_k), \qquad x_{k+1} = x_k + \tilde v_{k+1}$$

$$\nabla \tilde f_k(x) \triangleq \frac{1}{b} \sum_{i \in \Omega_k} \nabla f^{(i)}(x)$$

where $\tilde v$ is the velocity (momentum), $\tilde\gamma$ the momentum decay, $\tilde\eta$ the step-size (learning rate), $\nabla \tilde f_k$ the stochastic gradient, $\Omega_k$ the minibatch, and $b$ the minibatch size.
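The SGD-m recursion on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sgd_momentum`, the scalar parameter, the toy quadratic losses and all default values are hypothetical choices, not taken from the talk.

```python
import random

def sgd_momentum(grad_i, n, x0, step=0.05, decay=0.9, batch=4, iters=500, seed=0):
    """Illustrative SGD with momentum on a scalar parameter.

    grad_i(i, x) returns the gradient of the i-th per-sample loss at x.
    step, decay and batch play the roles of the step-size, the momentum
    decay and the minibatch size b (the values are arbitrary assumptions).
    """
    rng = random.Random(seed)
    x, v = x0, 0.0
    for _ in range(iters):
        omega = rng.sample(range(n), batch)            # minibatch Omega_k
        g = sum(grad_i(i, x) for i in omega) / batch   # stochastic gradient
        v = decay * v - step * g                       # velocity update
        x = x + v                                      # weight update
    return x

# Toy problem (an assumption): f^(i)(x) = 0.5 * (x - a_i)^2, minimised at mean(a).
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x_hat = sgd_momentum(lambda i, x: x - data[i], len(data), x0=0.0)
```

With `batch=len(data)` the gradient is exact and the iterate settles at the mean of `data`; with a smaller minibatch it keeps fluctuating around that value, which is the gradient noise the following slides study.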

SLIDE 3

UNDERSTANDING SGD-M

§ Theory: better established for convex problems, still in an early phase for non-convex problems
§ Useful approach for analysis → Stochastic Differential Equations (SDE)

Gradient noise: $$U_k(x) \triangleq \nabla \tilde f_k(x) - \nabla f(x)$$

§ If we assume $\mathbb{E}\|U_k(x)\|^2 < \infty$ and invoke the CLT: $U_k \sim$ Gaussian
§ SGD-m → Euler-Maruyama discretization of the SDE:

$$dv_t = -\big(\gamma v_t + \nabla f(x_t)\big)\,dt + \sqrt{\frac{2\gamma}{\beta}}\,dB_t, \qquad dx_t = v_t\,dt$$

where $B_t$ is Brownian motion, $\beta$ the inverse temperature, and $\gamma$ the friction.

Underdamped (a.k.a. Kinetic) Langevin Dynamics (Dalalyan & Riou-Durand ’18; Gao et al. ’18)
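A minimal sketch of the Euler-Maruyama discretization of the underdamped Langevin SDE, assuming a scalar state and a constant step `eta`; the function name and default values are hypothetical. Rescaling the velocity as ṽ = η v turns the update into the SGD-m form with momentum decay 1 − γη and step-size η², which is one way to read the "SGD-m → Euler-Maruyama" bullet.

```python
import math
import random

def uld_euler_maruyama(grad_f, x0, gamma=1.0, beta=10.0, eta=0.01, iters=1000, seed=0):
    """Illustrative Euler-Maruyama scheme for
    dv = -(gamma*v + grad f(x)) dt + sqrt(2*gamma/beta) dB,  dx = v dt.
    All default values are arbitrary assumptions."""
    rng = random.Random(seed)
    x, v = x0, 0.0
    for _ in range(iters):
        # Brownian increment over a step of length eta has variance eta
        noise = math.sqrt(2.0 * gamma * eta / beta) * rng.gauss(0.0, 1.0)
        v = v - eta * (gamma * v + grad_f(x)) + noise   # (1 - gamma*eta)*v - eta*grad + noise
        x = x + eta * v                                  # position uses the new velocity
    return x, v

# Toy potential (an assumption): f(x) = x^2 / 2, so grad f(x) = x.
x_T, v_T = uld_euler_maruyama(lambda x: x, x0=2.0)
```

For very large `beta` the noise vanishes and the scheme behaves like a damped oscillator that relaxes to the minimum of the toy potential.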

SLIDE 4

UNDERDAMPED LANGEVIN DYNAMICS

$$dv_t = -\big(\gamma v_t + \nabla f(x_t)\big)\,dt + \sqrt{\frac{2\gamma}{\beta}}\,dB_t, \qquad dx_t = v_t\,dt$$

§ Favorable properties:
Targets the Boltzmann-Gibbs measure → marginal density $\propto \exp(-\beta f(x))$
Minima of $f$ ⇔ maxima of the target measure
As $\beta \to \infty$, the target concentrates on the global minimum of $f$
§ Problem: gradient noise → non-Gaussian, heavy-tailed in deep nets

[Figure: α-stable densities p(x) against x on a logarithmic vertical axis, for α=1.5, α=1.7, α=2.0]

Gaussian when α=2. Infinite variance when α≠2 (Simsekli et al. ’19)
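To make the heavy-tail claim concrete, the sketch below draws symmetric α-stable samples with the Chambers-Mallows-Stuck method. That sampler is not mentioned in the talk; it is swapped in here as a standard way to simulate such variables. The function name `sym_alpha_stable` and the sample sizes are assumptions.

```python
import math
import random

def sym_alpha_stable(alpha, rng):
    """One draw from the standard symmetric alpha-stable law
    (characteristic function exp(-|w|^alpha)), 0 < alpha <= 2,
    via the Chambers-Mallows-Stuck construction."""
    u = rng.uniform(-math.pi / 2.0, math.pi / 2.0)
    w = rng.expovariate(1.0)
    if alpha == 1.0:
        return math.tan(u)                               # Cauchy case
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))

rng = random.Random(0)
gauss_like = [sym_alpha_stable(2.0, rng) for _ in range(20000)]   # alpha = 2: Gaussian, variance 2
heavy = [sym_alpha_stable(1.5, rng) for _ in range(20000)]        # alpha < 2: infinite variance

largest_gauss = max(abs(s) for s in gauss_like)
largest_heavy = max(abs(s) for s in heavy)
```

Comparing `largest_gauss` with `largest_heavy` shows the effect: for α=2 the extremes stay within a few standard deviations, while for α=1.5 occasional draws are orders of magnitude larger.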

SLIDE 5

DYNAMICS WITH HEAVY-TAILED NOISE

§ Gaussian increments → Brownian motion
α-stable increments → α-stable Lévy motion

$$dv_t = -\big(\gamma v_{t-} + \nabla f(x_t)\big)\,dt + \sqrt{\frac{2\gamma}{\beta}}\,dL^\alpha_t, \qquad dx_t = v_t\,dt$$

§ Problem: doesn’t target the Gibbs measure!
The modes are shifted → bias
Analytical results for bias when $\gamma \to \infty$ (Sliusarenko et al. ’13) (Capała & Dybiec ’19)

[Figure: plot against t for α=1.5, α=1.7, α=2.0]

SLIDE 6

MAIN QUESTION & GOALS

Can we design an SDE that targets the Gibbs measure under α-stable gradient noise?

§ In other words: “retarget SGD-m towards the Gibbs measure”
§ Make sure → the modes match the minima of $f$
Make sure → the dynamics are computationally tractable
Overdamped case → retargeting possible but not tractable (Simsekli ’17)
§ Relation to gradient clipping and differential geometric approaches
§ Asymptotic analysis of the discretization scheme
SLIDE 7

PROPOSED DYNAMICS

Theorem 1 (Fractional Underdamped Langevin Dynamics – FULD)

Consider the dynamics

$$dv_t = -\big(\gamma\, c(v_{t-}, \alpha) + \nabla f(x_t)\big)\,dt + \Big(\frac{\gamma}{\beta}\Big)^{1/\alpha} dL^\alpha_t, \qquad dx_t = \nabla g(v_t)\,dt$$

with drift $c$ and kinetic energy $g$ related by

$$(c(v, \alpha))_i := \frac{\mathcal{D}^{\alpha-2}_{v_i}\big(\psi(v)\,\partial_{v_i} g(v)\big)}{\psi(v)}, \qquad \psi(v) := e^{-g(v)}.$$

Then, the Boltzmann-Gibbs measure is an invariant measure of this SDE.

§ $\mathcal{D}$: fractional Riesz derivative → non-local → often no analytical expression, hard to approximate (Simsekli ’17)
§ General recipe: specify the kinetic energy $g$ → imposes a drift $c$
§ Two choices of $g$: 1) Gaussian 2) α-stable → analytical Riesz derivatives!
slide-8
SLIDE 8

Gaussian

GAUSSIAN KINETIC ENERGY

§ α=2, recovers standard ULD § Non-Lipschitz, explosive § Uniqueness not guaranteed § Doesn’t have much practical value

8

Gamma fn. Kummer confluent hypergeometric fn.

Let . Then, Theorem 2 (FULD with Gaussian Kinetic energy)

(c(v, α))i = 2

α 2 vi

√π Γ ✓α + 1 2 ◆

1F 1

✓2 − α 2 ; 3 2; v2

i

2 ◆

<latexit sha1_base64="GgE7lQOMAdkIiE9dA+pw8aRhJcQ=">AEbHicfZLrbtMwGIaTrsAIp43DrwkpogK1A6amQ4A0IU1jBSYNMQ47SHVbOY7TWrOTYLulm+XL4TfXw01wDThNqYpwlKlN9/zfv5c+/UTSoRsNn/blZXqlavXVq87N27eun1nbf3uiYhHOFjFNOYn/lQYEoifCyJpPgs4Rgyn+JT/xtyk/HmAsSR9/kRYK7DA4iEhIEpSn137VUR2M/WcA0mQIG42+Itp5A0IOkWr1VCYyqFVL63Fq0AqI71wqkBgN3kPGIKA4lPWi/amXNgBOBkPZcFTfe6f7ytNFY+v5fOedrLRd0NZvXTqfKP+Wq251Zwud1l4uahZ+Trqr1d+giBGI4YjiSgUouM1E9lVkEuCKNYOGAmcQHQOB7hjZAQZFl01vVntPjaVwA1jbn6RdKfVYoeCTIgL5hsng3Ioyiwt/ot1RjJ83VUkSkYSRygbFI6oK2M3fSY3IBwjS+MgIgTc1YXDaG5FWke0wER/oFic+1RoMDEN+RTvBDNdF6EY4LcFyGlwV4WYZcYDnDvpSxu05a5fZ/rRVAQSpe7YEk8jASXkaC/INOVNBuWePzeFeGR4W4KHu5ZFa9BwE8396UN5AQDE7Vedtc+tOAfWxyw/FH0/0pwRzKmG+ajPMBI5E2ORqAVPzPBye5zwhnISLpkWQcU6FNsL1yjJfFSWvL295qfX5R232ZR3zV2rAeWXLs15Zu9YH68g6tpD9wN6x9+32yp/q/epG9WFmrdh5z1rYVWf/AXcn4jH</latexit>

g(v) = 1 2kvk2

<latexit sha1_base64="tVShlIeF5aU4boxaGbJDcLQ38e8=">AD4XicfZLjtMwFIY9DZch3DqwZBNRkAYWVMQsEaDTMSIw1iuLRTqe5UjuOk1thOZDulHSsPACvEljV7tvAmvA1Om6ptijhSpD/nO7/PsXWClFGlW60/WzXn0uUrV7evudv3Lx1u75zp6uSTGLSwQlLZC9AijAqSEdTzUgvlQTxgJHT4PxVwU/HRCqaiI96mpIBR7GgEcVI29Sw/iDehePg0UsYSYSNn5t2DrtEas9mZ+LMZob1RqvZmoW3KfxSNEAZJ8Od2g8YJjRGjMkFJ9v5XqgUFSU8xI7sJMkRThcxSTvpUCcaIGZnad3HtoM6EXJdJ+Qnuz7KrDIK7UlAe2kiM9UlVWJP/F+pmOXgwMFWmicDzRlHGPJ14xdt4IZUEaza1AmFJ7aweHiH7MNq+oAsF+YQTzpEIDZwEuYFhyAykzxfh+MVOK7CixV4UYVSEb3AgXlfxYdLdlhlBzOrgRgxr7cBU2HhpNqNh+WBkpuw6tnS7hfhcr8Dg/g4ilI1SpOQqXNz2qHqCQWkz7IZ/bF3+568IDYvdGkjfW/TYlEulEPjYQyZhTkds9imEh/leHJmWdFe7aihQj6SRhqlhsv7rGm6LbvpPmu13Txt7z8oV3wb3wH2wC3zwHOyB1+AEdAGX8BP8Av8drDz2fnqfJuX1rZKz12wFs73v7fDVc=</latexit>
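The drift of Theorem 2 can be evaluated numerically to check the two remarks on this slide: it reduces to c(v) = v at α=2, and it grows explosively for α<2. The sketch below is an illustration only; the helper names are hypothetical, and the confluent hypergeometric function is computed from its power series rather than from any particular library.

```python
import math

def hyp1f1_series(a, b, z, terms=200):
    """Kummer function 1F1(a; b; z) from its power series
    sum_n (a)_n / (b)_n * z^n / n!  (adequate for moderate |z|)."""
    total, term = 1.0, 1.0
    for n in range(terms):
        term *= (a + n) / (b + n) * z / (n + 1)
        total += term
    return total

def drift_gaussian(v, alpha):
    """One coordinate of the drift c(v, alpha) for the Gaussian kinetic energy."""
    coef = 2.0 ** (alpha / 2.0) / math.sqrt(math.pi) * math.gamma((alpha + 1.0) / 2.0)
    return coef * v * hyp1f1_series((2.0 - alpha) / 2.0, 1.5, v * v / 2.0)

at_alpha_2 = drift_gaussian(1.3, 2.0)     # equals v: the standard ULD friction term
at_alpha_15 = drift_gaussian(6.0, 1.5)    # far larger than 6: super-linear growth
```

At α=2 the first argument of the series is zero, so every term after the first vanishes and the prefactor equals one. For α<2 the series grows roughly like exp(v²/2), which is the non-Lipschitz behaviour noted above.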
SLIDE 9

ALPHA-STABLE KINETIC ENERGY

Theorem 3 (FULD with α-Stable Kinetic energy)

Let $e^{-g_\alpha(v)}$ be the pdf of the α-stable distribution. Define the kinetic energy

$$G_\alpha(v) = \sum_{i=1}^{d} g_\alpha(v_i), \qquad \text{with } \psi(v) = e^{-G_\alpha(v)}.$$

Then, $(c(v, \alpha))_i = v_i$.

§ Resulting SDE:

$$dv_t = -\gamma v_{t-}\,dt - \nabla f(x_t)\,dt + \gamma^{1/\alpha}\,dL^\alpha_t, \qquad dx_t = \nabla G_\alpha(v_t)\,dt$$

§ Proposition: $\nabla G_\alpha$ is Lipschitz for all α.
§ Uniqueness: with standard conditions on $f$
SLIDE 10

DISCRETIZATION

$$v_{k+1} = \tilde\gamma_k v_k - \eta_k \nabla f(x_k) + \Big(\frac{\eta_k \gamma}{\beta}\Big)^{1/\alpha} s_{k+1}, \qquad x_{k+1} = x_k + \eta_k \nabla G_\alpha(v_{k+1})$$

where $\tilde\gamma_k = 1 - \gamma \eta_k$ and $s_{k+1}$ is a standard α-stable r.v.

§ Analysis: weak convergence of the Euler-Maruyama discretization
§ Our interest: $\pi(h) := \mathbb{E}_{X \sim \pi}[h(X)]$ via

$$\bar\pi_K(h) := \frac{\sum_{k=1}^{K} \eta_k\, h(x_k)}{\sum_{k=1}^{K} \eta_k}$$

Corollary 1 (Weak Convergence of Sample Averages)

Assume: $\lim_{k \to \infty} \eta_k = 0$, $\lim_{k \to \infty} \sum_{i=1}^{k} \eta_i = \infty$
Assume: $\nabla f$ is Lipschitz + linear growth + standard Lyapunov condition
Then, $\bar\pi_K(h) \to \pi(h)$ almost surely, as $K \to \infty$.
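The discretized scheme and the weighted sample average can be sketched for the special case α=1. There the standard α-stable law is the standard Cauchy law, so both the noise and the kinetic-energy gradient 2v/(1+v²) are available in closed form. Everything below is an assumption-laden illustration: scalar state, hypothetical function names, and a step schedule η_k = η₀/√k chosen only because it tends to zero while its sum diverges.

```python
import math
import random

def grad_G1(v):
    """Derivative of g_1(v) = log(pi * (1 + v^2)), the negative log standard Cauchy pdf."""
    return 2.0 * v / (1.0 + v * v)

def fuld_average(grad_f, h, x0, gamma=1.0, beta=1.0, eta0=0.1, K=2000, seed=0):
    """Illustrative alpha = 1 version of the discretized dynamics.
    Returns the step-size-weighted average of h along the path."""
    alpha = 1.0
    rng = random.Random(seed)
    x, v = x0, 0.0
    num = den = 0.0
    for k in range(1, K + 1):
        eta = eta0 / math.sqrt(k)                                  # eta_k -> 0, sum diverges
        s = math.tan(rng.uniform(-math.pi / 2.0, math.pi / 2.0))   # standard Cauchy draw
        v = (1.0 - gamma * eta) * v - eta * grad_f(x) + (eta * gamma / beta) ** (1.0 / alpha) * s
        x = x + eta * grad_G1(v)                                   # |grad_G1| <= 1, so steps stay bounded
        num += eta * h(x)
        den += eta
    return num / den

# Toy potential (an assumption): f(x) = x^2 / 2; estimate the mean of x under the target.
estimate = fuld_average(lambda x: x, lambda x: x, x0=1.0)
```

Because the position update passes the heavy-tailed velocity through a bounded map, each move of `x` is at most `eta` in size even when a Cauchy draw is enormous.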
SLIDE 11

CONNECTIONS

§ Gradient Clipping: used when “exploding gradients”
Heavy-tails → exploding grads → clipping improves convergence
§ Naturally arises in our scheme
§ Natural gradient
§ Precondition by Fisher Information
§ The “Composite” gradient (α=1):

$$\nabla G_1\big(\nabla \tilde f_k(x)\big) = M_k(x)^{-1}\, \nabla \tilde f_k(x) \qquad \text{with } m_{ii} = \frac{\big(\nabla \tilde f_k(x)\big)_i^2 + 1}{2}$$

§ Can be seen as a (biased) estimator of the diagonal of the FIM $\mathbb{E}[\nabla f(x) \nabla f(x)^\top]$

(Amari ’98) (Pascanu et al. ’13) (Zhang et al. ’19)
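The α=1 preconditioner can be checked by a short calculation, assuming the standard Cauchy density for the one-dimensional 1-stable law (this normalisation is an assumption of the sketch):

```latex
\begin{aligned}
p_1(v) &= \frac{1}{\pi (1 + v^2)}, \qquad
g_1(v) = -\log p_1(v) = \log \pi + \log(1 + v^2), \\
g_1'(v) &= \frac{2v}{1 + v^2} = \frac{v}{(v^2 + 1)/2}, \qquad
|g_1'(v)| \le 1 \ \text{for all } v .
\end{aligned}
```

Applied coordinate-wise to a stochastic gradient, this gives the diagonal scaling $m_{ii}$ stated on the slide. The map is close to $2v$ for small $|v|$ and decays like $2/v$ for large $|v|$, so large gradient coordinates are damped, which is the sense in which a clipping-like operation arises.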

SLIDE 12

EXPERIMENTS – SYNTHETIC

§ f: “Double-well” (4th-order polynomial)

[Figure: panels for α=1.0 and α=1.9]
slide-13
SLIDE 13

EXPERIMENTS – NEURAL NETWORKS

§ Fully connected networks – ReLU activations – Cross Entropy loss § No additional noise – true gradient noise

13

MNIST CIFAR10

slide-14
SLIDE 14

CONCLUSION

§ Gradient noise in deep networks can be heavy-tailed § Heavy-tails à might shift the modes § Proposed dynamics à targets the Gibbs measure à no shift § Brings theoretical understanding for gradient clipping

14

SLIDE 15

THANK YOU FOR YOUR ATTENTION!