Probability and Statistics for Computer Science (PowerPoint PPT Presentation)



SLIDE 1

Probability and Statistics for Computer Science

"All models are wrong, but some models are useful" (George Box)

Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 04.30.2020 (image credit: Wikipedia)

SLIDE 2

Contents

Markov chain
- Motivation
- Definition of Markov model
- Graph representation: Markov chain
- Transition probability matrix
- The stationary Markov chain
- The PageRank algorithm

SLIDE 3

Project review:
- Why do we have the exercises in Part I?
- What is expected for each exercise?
- What do the notations mean?
SLIDE 4

CS 361 SP 2020 project

(I) Stochastic First Order Optimization (65 pts)
- stochastic first order approximation: h(x), x* = ?
- What is this task? What does this have to do with optimization?

SLIDE 5

h(x) = 0  ⇒  root finding
f'(x*) = 0  ⇒  optimization

In the context of optimization, x are the parameters, e.g. the hyperplane of an SVM classifier, a^T x - b = 0.

SLIDE 6

If we don't know h(x), but we know g(x) = h(x) + z, where z is a random noise independent of x and E[g(x)] = h(x):
- Given x, is h(x) random?
- What is E[z]?

  E[g(x)] = E[h(x)] + E[z] = h(x) + E[z], so E[z] = 0.

SLIDE 7

CS 361: Probability and Statistics for Computer Science

(Spring 2020)

Stochastic First Order Optimization

1 Stochastic Approximation

Root-finding is simply the process of finding where h(x) = 0. For simple polynomials (e.g. h(x) = (x - 5)(x + 3)), this is very easy. However, this is not easy for all functions. For instance, say we want to optimize a machine learning algorithm. We can define f(x) to be the error function for the algorithm, which we want to be as small as possible. In order to do so, we would need to find the root of the derivative of the error function (i.e., h(x) = f'(x)), since this is where the minimum of the error function might be. Additionally, we may have to worry about noise. Say we want to find where h(x) = 0, but finding the true value of h(x) at some x is extremely expensive or impossible. On the bright side, we have access to a "noisy" version of h that we call g(x). In other words, g(x) = h(x) + z. You cannot control the additive noise z or predict it, but you can assume that it is independent of x and that E[z] = 0. Stochastic approximation (SA) is the process of root-finding on a noisy function g(x).

1.1 Stochastic Approximation in a simple setting

For stochastic approximation to be effective, we need a sequence of positive learning rates that we denote as {η_n}_{n≥1}. In the following exercises, we will perform stochastic approximation on h(x), and have access to a noisy version y = h(x) + z. In order to find a good sequence of learning rates, we need to make the following assumptions:

1. The function h has a unique root x* (i.e., h(x*) = 0 for a unique x*). This unique root is a positive zero crossing of h. In other words: h(x*) = 0; x > x* ⇒ h(x) > 0; x < x* ⇒ h(x) < 0.

2. y has a finite upper and lower bound. In other words, we have bounded noise:

     P(|y| < c) = 1 where c > 0   (1.1)

3. The noise is independent of x:

     ∀x: E[y] = h(x), P(z|x) = P(z)   (1.2)

4. The learning rates η_n do not approach 0 too quickly or too slowly. More formally:

     Σ_{n=1}^∞ η_n = ∞,  Σ_{n=1}^∞ η_n² = c   (1.3)

   for some positive c. In other words, the sum of the learning rates is unbounded, but the sum of their squares is bounded.

Exercise 1. (4 points) Propose a family of learning rates that satisfies assumption 4 (a formal proof is not needed). Hint: try providing a range of values for α in n^(-α) that would satisfy the constraints.

Now that we have a sequence of learning rates, we can move on to stochastic approximation. The algorithm is defined as follows, where X_n is the nth approximation of x*:

- Let X_1 be some initial value or guess

(Handwritten notes: check η_n = 1/n against assumption 4: Σ_{n=1}^∞ 1/n = ∞ and Σ_{n=1}^∞ 1/n² < ∞, so it is a good choice. Update with the learning rate: X_{n+1} = X_n - η_n Y_n.)
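The handwritten check above (rates of the form η_n = n^(-α), as in Exercise 1's hint) can be probed numerically. A minimal illustrative sketch, not a proof: for α = 1 the partial sums of 1/n keep growing like ln N, while the partial sums of 1/n² level off near π²/6.

```python
import math

def partial_sums(alpha, N):
    """Partial sums of eta_n = n**(-alpha) and of eta_n**2, up to N terms."""
    s1 = sum(n ** (-alpha) for n in range(1, N + 1))
    s2 = sum(n ** (-2 * alpha) for n in range(1, N + 1))
    return s1, s2

# For alpha = 1: sum of 1/n grows like ln(N) (unbounded),
# while sum of 1/n**2 converges to pi**2 / 6.
s1_small, s2_small = partial_sums(1.0, 1_000)
s1_large, s2_large = partial_sums(1.0, 1_000_000)
```

Between N = 10³ and N = 10⁶ the first sum gains about ln(1000) ≈ 6.9, while the second barely moves and sits near π²/6 ≈ 1.6449.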

SLIDE 8

X_n is the nth approximation of x*.

  X_{n+1} = X_n - η_n Y_n

As n → ∞, does X_n → x*? Will this happen stochastically?

  lim_{n→∞} E[(X_n - x*)²] = 0

SLIDE 9

1.2 Stochastic approximation convergence: the theorem statement

- There are elaborate steps which are too complicated for the project. We selected part of them to use as exercises for you.
- Some of the intermediate results are provided.
- We'd like you to learn about conditional expectation!

SLIDE 10

- For n = 1, 2, ..., perform the following iteration:

    X_{n+1} = X_n - η_n Y_n   (1.4)

  where Y_n = h(X_n) + Z_n (i.e., the noisy version of h(X_n)), just as mentioned previously.
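The iteration (1.4) can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical target h(x) = x - 2 (root x* = 2) with bounded uniform noise and η_n = 1/n, not part of the project starter code:

```python
import random

random.seed(0)

def stochastic_approximation(x1, steps):
    """Robbins-Monro iteration X_{n+1} = X_n - eta_n * Y_n with eta_n = 1/n."""
    x = x1
    for n in range(1, steps + 1):
        z = random.uniform(-0.5, 0.5)  # bounded noise, E[z] = 0
        y = (x - 2.0) + z              # noisy observation of h(x) = x - 2
        x = x - (1.0 / n) * y
    return x

x_approx = stochastic_approximation(x1=0.0, steps=5000)
```

Despite never observing h(x) exactly, the iterate drifts to the root x* = 2; the noise is averaged out by the decaying learning rates.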

1.2 Convergence proof of SA

Now that SA is defined, we want to show that it actually works. To do so, we can define an expression for the error, and show that the expression converges to 0. Define the mean squared error at step n as follows:

  e_n² = E[(X_n - x*)²]   (1.5)

Exercise 2.

1. (4 points) Prove the following relationship for any two random variables u, v:

     E_u[f(u)] = E_v[E_{u|v}[f(u)|v]]   (1.6)

   Do not assume any kind of independence. We can summarize this relation as E[A] = E[E[A|B]]. Hint: it requires the notion of conditional expectation (E_{u|v}[f(u)|v]). Here is a resource to learn about conditional expectation. You are free to find and use others.

Ultimately, we want to show that the mean squared error will converge to 0 as the number of steps approaches ∞. To do so, we'll need the following relationships:¹

  e_{n+1}² = e_n² - 2 η_n ρ_n + η_n² E[Y_n²]   (1.7)

where ρ_n = E[(X_n - x*) h(X_n)], and Y_n is still the noisy version of h(X_n). This shows us the relationship between two subsequent iterations of the mean squared error.

  e_{n+1}² = e_1² - 2 Σ_{i=1}^n η_i ρ_i + Σ_{i=1}^n η_i² E[Y_i²]   (1.8)

2. (3 points) Knowing that the noise is bounded (1.1), show that E[Y_n²] is also bounded.

3. (3 points) Given that E[Y_n²] is bounded, show that Σ_{i=1}^n η_i² E[Y_i²] is bounded. Hint: use (1.3).

4. (2 points) Let b_n := |X_1| + c Σ_{i=1}^n η_i and d_n = min_{|x|<b_n} h(x)/(x - x*). For this problem, the actual values of d_n and b_n are unimportant; we can show that Σ_{i=1}^∞ η_i d_i = ∞ and Σ_{i=1}^∞ η_i d_i e_i² < ∞. Using these two facts, and assuming e_n² converges, finalize the proof by proving the following:

     lim_{n→∞} e_n² = 0

2 Stochastic First Order Optimization²

2.1 Review

The goal of optimization is to find the x* that minimizes f(x). However, f is again either unknown or very expensive to collect, but we have access to the noisy version g(x):

  E[g(x)] = f(x)   (2.1)

We also assume that we have access to the gradient of g, which is also noisy:

  E[∇g(x)] = ∇f(x)   (2.2)

¹ Extra Credit Ex. 1 asks you to prove the statements we provided without proof in Exercise 2 and may help increase your mathematical understanding of error bounding.

² Before continuing, you may consider attempting Extra Credit Ex. 2, 3, and 4. These exercises ask you to analyze some properties of SA and the order of convergence under SA settings. Again, these are not required.

(Handwritten notes: Y_n is a sequence of random variables; X_n is another. X_{n+1} = X_n - η_n Y_n. E[E[f(u)|v]] also applies to continuous RVs. Bounded noise: P(|z| ≤ c) = 1, c > 0. Example: Lec 8, pg 23, and the notes. In one dimension, ∇f = f'(x); η > 0.)
SLIDE 11

Suppose lim_{n→∞} e_n² = c with c ≠ 0 (so c > 0). Then for every ε > 0 there is an N such that for all n > N, |e_n² - c| < ε, i.e. -ε < e_n² - c < ε, so e_n² > c - ε. With ε = c/2, e_n² > c/2.

SLIDE 12

From (1.8): e_{n+1}² = e_1² - 2 Σ_{i=1}^n η_i ρ_i + Σ_{i=1}^n η_i² E[Y_i²], with ρ_i ≥ d_i e_i².

If e_i² > c/2 for all large i and Σ_{i=1}^∞ η_i d_i = ∞, then Σ η_i d_i e_i² ≥ (c/2) Σ η_i d_i = ∞, a contradiction, since Σ_{i=1}^∞ η_i d_i e_i² < ∞. Hence c = 0.
SLIDE 13

e_n² = E[(X_n - x*)²]

e_{n+1}² = E[(X_{n+1} - x*)²]
         = E[(X_n - η_n Y_n - x*)²]
         = E[ E[(X_n - η_n Y_n - x*)² | X_n] ]

to relate e_{n+1}² with e_n².

SLIDE 14

Conditional expectation

We have seen this for a discrete RV:

  E[X] = Σ_x x p(x)   →   E[X | Y = y] = Σ_x x P(X = x | Y = y)

SLIDE 15

The mean of E[X|Y]: law of iterated expectations

g(y) = E[X | Y = y] = Σ_x x p(x|y)

E[E[X|Y]] = E[g(Y)] = Σ_y g(y) p(y)
          = Σ_y Σ_x x p(x|y) p(y)      (using p(x|y) = p(x, y)/p(y))
          = Σ_x x Σ_y p(x, y)
          = Σ_x x p(x)
          = E[X]
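The chain of sums above can be checked numerically on any small discrete joint distribution; a minimal sketch (the joint table below is made up purely for illustration):

```python
# Joint pmf p(x, y) over x in {0, 1, 2} and y in {0, 1}; entries sum to 1.
p = {(0, 0): 0.10, (0, 1): 0.15,
     (1, 0): 0.20, (1, 1): 0.05,
     (2, 0): 0.30, (2, 1): 0.20}

# Direct computation: E[X] = sum_x x p(x)
e_x = sum(x * pxy for (x, y), pxy in p.items())

# Iterated computation: E[E[X|Y]] = sum_y p(y) * sum_x x p(x|y)
p_y = {y: sum(pxy for (x2, y2), pxy in p.items() if y2 == y) for y in (0, 1)}
e_e = sum(p_y[y] * sum(x * p[(x, y)] / p_y[y] for x in (0, 1, 2))
          for y in (0, 1))
```

Both routes give the same number, as the derivation on this slide promises.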

SLIDE 16

What about E[X|Y] for continuous random variables?

  E[X|Y] = ∫ x p(x|y) dx       (p(x|y) is a density)

  E[f(X)|Y] = ∫ f(x) p(x|y) dx

Ex. 2.1: for E[E[X|Y]] when X, Y are continuous RVs, use ∫ p(x, y) dx = p(y).
SLIDE 17

Stick-breaking example

Take a stick of length ℓ. Break it at a uniformly chosen point Y; then break what is left at a uniformly chosen point X.

  E[X | Y = y] = ∫₀^y x · (1/y) dx = y/2,   so E[X|Y] = Y/2 (a random variable)

  E[X] = E[E[X|Y]] = E[Y/2] = (ℓ/2)/2 = ℓ/4

Does it matter whether the break is from the left or the right?
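The stick example can be checked by simulation; a minimal sketch with ℓ = 1 (the sample size and seed are arbitrary choices for illustration):

```python
import random

random.seed(0)
L = 1.0        # stick length (the l on the slide)
n = 200_000

total = 0.0
for _ in range(n):
    y = random.uniform(0.0, L)  # first break point: Y ~ Uniform(0, L)
    x = random.uniform(0.0, y)  # second break point: X ~ Uniform(0, Y)
    total += x

mean_x = total / n              # should be close to L/4
```

The empirical mean lands near ℓ/4 = 0.25, matching E[X] = E[E[X|Y]] = E[Y/2] = ℓ/4.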

SLIDE 18

CS 361 SP 2020 project

(I) Stochastic First Order Optimization (65 pts)
- stochastic first order approximation
- stochastic first order optimization: GD and SGD
SLIDE 19

First order approximation (textbook, p. 163)

Handwritten notes: for the SVM cost S(a, b), the parameters are θ = (a, b); let's ignore b for now. g is any cost function, η is the learning rate, and η > 0 is small. Update:

  u^(n+1) = u^(n) - η ∇g(u^(n))
SLIDE 20

The difference between GD and SGD for SVM

  GD:   S = (1/N) Σ_{i=1}^N S_i + penalty(||a||)

  SGD:  S ≈ (1/N_b) Σ_i S_i + penalty(||a||), where i is a random sample from (1, ..., N)

The batch size N_b can be just 1! Why would it work?
SLIDE 21

  GD:   X_{n+1} = X_n - η ∇f

  SGD:  X_{n+1} = X_n - η ∇g

  lim_{n→∞} E[(X_n - x*)²] = 0

SLIDE 22

2.1.1 Gradient Descent Algorithm

- Initialize X_1
- For n = 1, 2, ..., do the following update:

    X_{n+1} = X_n - η_n ∇f(X_n)   (2.3)

Exercise 3. (2 points) Recall that root-finding algorithms find the value of x where a function h(x) = 0, and that a gradient descent algorithm finds the minimum of a function. Are these algorithms accomplishing the same goal? Briefly explain why or why not. Your answer should be limited to three lines.

2.1.2 Stochastic Gradient Descent Algorithm

- Initialize X_1
- For n = 1, 2, ..., obtain a noisy version of the gradient (i.e. ∇g(X_n)), then do the following update:

    X_{n+1} = X_n - η_n ∇g(X_n)   (2.4)

Exercise 4. (2 points) Do stochastic gradient descent and stochastic approximation (from section 1) accomplish the same goal? Briefly explain why or why not. Your answer should be limited to three lines.

2.1.3 Empirical Risk Minimization

In a lot of machine learning problems, the training problem boils down to the following format: we are trying to minimize a function f(x),

  f(x) = (1/k) Σ_{j=1}^k Q_j(x)   (2.5)

where Q_j(x) is the loss function for the jth data point and we have k training data points.

Exercise 5. (4 points) Define g(x) to be Q_i(x), and ∇g(x) to be ∇Q_i(x). If there is some noise z = Q_i(x) - f(x), then g is the noisy version of f, so g(x) = Q_i(x) = f(x) + z. Here i is an index chosen randomly and uniformly, with replacement, from 1 to k. Given this definition of f and g, show that equations 2.2, 2.1, and E[z] = 0 are satisfied.
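Exercise 5's claim, that a uniformly sampled Q_i is an unbiased "noisy version" of f, can be verified by direct enumeration on a toy loss set (the Q_j values below are made up purely for illustration):

```python
# Toy per-example losses Q_j(x) evaluated at one fixed x, for k = 4 data points.
Q = [0.5, 1.5, 2.0, 4.0]
k = len(Q)

# f(x) = (1/k) * sum_j Q_j(x)
f = sum(Q) / k

# g(x) = Q_i(x) with i uniform on {1, ..., k}, so E[g(x)] = sum_i (1/k) Q_i(x)
e_g = sum((1.0 / k) * q for q in Q)

# The noise z = Q_i(x) - f(x) then has E[z] = 0 by construction.
e_z = sum((1.0 / k) * (q - f) for q in Q)
```

Enumerating over the uniform index i recovers E[g(x)] = f(x) and E[z] = 0 exactly, which is the content of equations (2.1) and E[z] = 0; the same cancellation argument applies coordinate-wise to the gradients in (2.2).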

2.2 Convergence rates for SGD and GD

Again, we want to optimize a function f with learning rate sequence η_n. We will mention convexity, and we will assume that our functions are convex so we can use the following theorems:

Theorem 2.1 Assume ∇f has a unique root x*. If f is twice continuously differentiable and strongly convex, and η_n = O(n⁻¹), then to reach an approximation error ε = |E[f(X_n)] - f(x*)|, stochastic gradient descent requires n = O(1/ε) updates.

Theorem 2.2 Assume ∇f has a unique root x*. If either the smoothness assumptions about f in Theorem 2.1 or η_n = O(n⁻¹) are not met, then to reach an approximation error ε = |E[f(X_n)] - f(x*)|, stochastic gradient descent requires n = O(1/ε²) updates.

Theorem 2.3 Assume ∇f has a unique root x*. To reach an approximation error ε = |E[f(X_n)] - f(x*)|, gradient descent requires n = O(ln(1/ε)) updates.

(Handwritten notes: E[f(X)] = Σ_x f(x) P(X = x). With i uniform over the k data points, E[g(x)] = E[Q_i(x)] = Σ_i P(i) Q_i(x) = (1/k) Σ_i Q_i(x) = f(x); here x is the parameter vector. Similarly, E[∇g(x)] = E[∇Q_i(x)] = ∇f(x).)
SLIDE 23

Exercise 6. Consider the strongly convex, one-dimensional function f(x) = (1/2)x². Also, assume that g(x) = (1/2)(x + z)² - 1/2, where z is a bounded random noise with mean zero and unit variance.

1. (2 points) Find (d/dx) g(x).
2. (2 points) Does ∇f have a unique root?
3. (2 points) Verify that E[g(x)] = f(x).
4. (2 points) Verify that E[(d/dx) g(x)] = (d/dx) f(x).
5. (2 points) Using one of the theorems above, find the order of updates required by SGD to reach approximation error ε = |E[f(X_n)] - f(x*)| when using g as the noisy version. Consider using the learning rate sequence η_n = 1/n.

3 Comparing SGD vs GD for Empirical Risk Minimization

Let us consider the f minimization task discussed earlier for the ERM task.

- f(x) and g are as described in Exercise 5.
- Let's assume that f is strongly convex and twice continuously differentiable.
- Let's consider using the η_n = 1/n learning rate sequence.
- We have k data points.
- In this part, we are determined to achieve a training precision of ρ = |E[f(X_n)] - f(x*)|.
- Finding ∇Q_i(x) takes c time, and finding ∇f(x) takes kc time.

Exercise 7. (15 points) Fill in Table 1 with complexity orders in terms of k and ρ.

Table 1: Comparing SGD vs GD in terms of training precision

                                   SGD     GD
  Computational cost per update    O(1)
  Number of updates to reach ρ
  Total cost

Explain your answer for each element of the second row by providing a reference to the theorem mentioned in this document. For every other element, explain your answer.

4 Comparing SGD vs GD with respect to test loss

The last exercise was about finding the computational complexity required to reach a specific precision with respect to the optimal training loss. However, a specific precision of training loss does not translate to the same precision of test loss. Here are a couple of interesting facts:

- To achieve a specific test loss precision ε, you need a large enough number of training samples k. The relation ε = O(k^(-β)) must hold for most functions, where 0 < β ≤ 1 is an unknown constant.
- Let's assume ρ = O(ε).


SLIDE 24

Exercise 8. (16 points) Fill in Table 2 with complexity orders in terms of ε and β. You can refer to the previous table, and replace the elements with appropriate values. Make sure you state the reason for each element even if it looks obvious.

Table 2: Comparing SGD vs GD in terms of testing precision

                                   SGD     GD
  Computational cost per update    O(1)
  Number of updates to reach ε
  Total cost

For a typical β = 0.2, explain why choosing GD over SGD can be very unwise.

SLIDE 25

CS 361 SP 2020 project

(I) Stochastic First Order Optimization (65 pts)
(II) Stochastic Optimization Implementation (40 pts)

Remember what a convex function is; SGD doesn't work well on non-convex functions!!

SLIDE 26

Convex set and convex function

If a set is convex, any line connecting two points in the set is completely included in the set.

A convex function: the area above the curve is convex.

  f(λx + (1 - λ)y) < λf(x) + (1 - λ)f(y)

Credit: Dr. Kevin Murphy
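The inequality on this slide can be spot-checked numerically for a known convex function such as f(x) = x². This is an illustrative check at random points, not a proof of convexity (and it uses ≤ with a small tolerance, since equality holds when x = y):

```python
import random

random.seed(1)

def f(t):
    return t * t  # x**2 is a convex function

for _ in range(1000):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    lam = random.random()
    # Convexity: f(lam*x + (1 - lam)*y) <= lam*f(x) + (1 - lam)*f(y)
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-9
```

For example, with x = 2, y = 4, λ = 1/2: f(3) = 9 ≤ (1/2)f(2) + (1/2)f(4) = 10.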

SLIDE 27

Implement a state-of-the-art stochastic optimization algorithm: ADAM (Adaptive Momentum Estimation), a gradient descent algorithm.

- Works well for non-convex problems.
- Try the ADAM algorithm on linear regression problems and neural-network classification → discussion session.
SLIDE 28

The goals are:

- Understand convexity, learning rates, and convergence in practice.
- Understand the difference between methods.
- Critical thinking based on your understanding of the method and problem.
- Coding is minimal given the starter.
SLIDE 29

CS 361: Probability and Statistics for Computer Science

(Spring 2020)

Stochastic Optimization Implementation

To find the starter code, please take a look at the 361 Project Github. There is also a helpful 'CS 361 Final Project Coding Instructions.pdf' that will help you get your environment set up. (Note: if issues arise with a local environment setup, they will be much harder to resolve this semester given the unique remote situation.) Objective: implement state-of-the-art stochastic optimization algorithms for machine learning problems, linear regression and classification (using neural networks).

4.1 Adaptive Momentum Estimation (ADAM) Gradient Descent Algorithm

SGD can be ineffective when the function is highly non-convex. Unfortunately, most modern applications of machine learning involve non-convex optimization problems. One stochastic optimization algorithm that works well even under non-convexity is ADAM [KB14]. ADAM uses past gradient information to "speed up" optimization by smoothing, and the algorithm has been further improved [SSS18]. You will implement the ADAM stochastic optimization algorithm for a linear regression problem.

The pseudo-code for ADAM has been reproduced here from this paper. Credits to [KB14]. Disclaimer: the textbook, in Chapter 13, uses β for parameters, but we will be using θ.

Algorithm 1: g_t² indicates the elementwise square g_t ⊙ g_t. Good default settings for the tested machine learning problems are α = 0.001, β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸. All operations on vectors are element-wise. With β₁ᵗ and β₂ᵗ, we denote β₁ and β₂ to the power t.

  Require: α: stepsize
  Require: ε: division-by-zero control parameter
  Require: β₁, β₂ ∈ [0, 1): exponential decay rates for the moment estimates
  Require: f(θ): stochastic objective function with parameters θ
  Require: θ₀: initial parameter vector
  m₀ ← 0 (initialize 1st moment vector)
  v₀ ← 0 (initialize 2nd moment vector)
  t ← 0 (initialize timestep)
  while θ_t not converged do
    t ← t + 1
    g_t ← ∇_θ f_t(θ_{t-1}) (get gradients w.r.t. stochastic objective at timestep t)
    m_t ← β₁ · m_{t-1} + (1 - β₁) · g_t (update biased first moment estimate)
    v_t ← β₂ · v_{t-1} + (1 - β₂) · g_t² (update biased second raw moment estimate)
    m̂_t ← m_t / (1 - β₁ᵗ) (compute bias-corrected first moment estimate)
    v̂_t ← v_t / (1 - β₂ᵗ) (compute bias-corrected second raw moment estimate)
    θ_t ← θ_{t-1} - α · m̂_t / (√v̂_t + ε) (update parameters)
  end while
  return θ_t (resulting parameters)

Exercise 9. Consider the following problem setting:

- The number of training data points is k = 1000. The number of test data points is 50.
- The ith data point is represented by x_i and is a 10-dimensional vector. Each element of x_i is generated from a uniform distribution over the interval [-1, 1].
- θ_true (i.e. the true parameter set) and θ (i.e. the variable indicating a candidate parameter set) are also 10-dimensional vectors. Assume that θ_true is all ones. However, we will pretend that we do not know it and we want to estimate it using the training data.

SLIDE 30

- Data is generated according to y_i = x_iᵀ θ_true + 0.1 · ξ (where ξ ~ N(0, 1) is a sample from the normal distribution). The label y_i is a scalar.
- Let us assume that all the algorithm variants start from the same initial θ, whose elements are picked uniformly in the interval [0, 0.1].
- Use 1000 updates.
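Algorithm 1 from the previous page can be sketched in plain Python. This is a minimal one-dimensional illustration on a hypothetical quadratic objective f(θ) = θ²/2 with an exact gradient, not the starter code or the full Exercise 9 setup:

```python
import math

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=5000):
    """One-dimensional ADAM loop following Algorithm 1."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)                       # gradient at current parameters
        m = beta1 * m + (1 - beta1) * g       # biased first moment estimate
        v = beta2 * v + (1 - beta2) * g * g   # biased second raw moment estimate
        m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Minimize f(theta) = theta**2 / 2, whose gradient is theta; the minimum is at 0.
theta_star = adam(lambda th: th, theta0=1.0)
```

With the default α = 0.001, the bias-corrected ratio m̂/√v̂ is close to the sign of the gradient while far from the optimum, so the iterate moves roughly α per step and then settles near 0.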

Since this is a regression problem, we need to define a loss function. Let's use the following format:

  Q(θ) = (1/k) Σ_{j=1}^k Q_j(θ)   (4.1)

  Q_j(θ) = |x_jᵀ θ - y_j|^γ   (4.2)

where γ is a hyper-parameter that we control and that defines the objective. When you answer the following questions, snippets of code are not necessary. You should state your findings, provide analysis, and substantiate them with necessary plots in a clean way.

1. (3 points) For γ = 2, the problem becomes the least squares regression that you learned from Chapter 13. State the closed form solution (i.e. θ = ...) in your report, and then implement the solution to solve for θ. Provide the results of the experiment and state whether it is close to the true value.

2. (3 points) For Q_j(θ), find an expression that gives you the gradient ∇Q_j(θ). Report this expression, and implement it in the appropriate function. Hint: for r(θ) = h(e(θ)) = h(x_jᵀ θ - y_j), the gradient can be written as ∇r(θ) = (∂h/∂e) · ∇e(θ) = (∂h/∂e) · x_j according to the chain rule. Hint: the sign function, sgn(x), may be useful.

3. (3 points) Code the ADAM optimization algorithm (with default hyper-parameters such as the learning rate as mentioned in the pseudocode above) and SGD to find the best parameter θ. Use a batch size of 1 for this problem, and perform 1000 parameter updates. Report the final set of parameters.

4. (3 points) Update your code to compute the average test loss at every 50th update, and plot these values. You might notice that the error bars of ADAM and SGD overlap. This is due to the inherent randomness from sampling values. To avoid this probabilistic overlap, increase the number of replicates (num_rep in the starter code) until the error bars between ADAM and SGD do not overlap. Report this curve.

5. (9 points) Run the ADAM method for each of γ = 0.4, 0.7, 1, 2, 3, and 5. Report your observations clearly, and analyze the trends you are seeing. State whether ADAM consistently outperforms SGD. Your analysis should include the reason why one method outperforms the other under each setting.

6. (8 points) Another SGD variant is called ADAGRAD [DHS11]. We define the following for ADAGRAD:

   - Given the noisy gradient g_t at step t, construct the running squared sum G_t = Σ_{n=1}^t g_n ⊙ g_n, where ⊙ represents element-wise multiplication.
   - The update rule at time step t is:

       θ_t = θ_{t-1} - (α / (√G_t + ε)) ⊙ g_t   (4.3)

     where α is the learning rate and ε is the division-by-zero control hyper-parameter.

   (a) Implement ADAGRAD, and show a training plot for the default setting where γ = 2 and α and ε are the same as mentioned in Algorithm 1. How does ADAGRAD do relative to SGD and ADAM? Why?
   (b) A student says "ADAGRAD is better at handling larger learning rates". Use a learning rate of α = 0.1 only for ADAGRAD, and leave the ADAM and SGD learning rates the same as before. How do the results change? Does this confirm the student's claim? Is this a fair experiment? Also provide the training plot.
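The ADAGRAD update (4.3) can likewise be sketched in plain Python. This is a minimal one-dimensional illustration on the same hypothetical quadratic f(θ) = θ²/2, using the α = 0.1 mentioned in part (b); it is not the starter code:

```python
import math

def adagrad(grad, theta0, alpha=0.1, eps=1e-8, steps=500):
    """One-dimensional ADAGRAD loop following update rule (4.3)."""
    theta, G = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        G += g * g                                  # running squared-gradient sum
        theta -= alpha * g / (math.sqrt(G) + eps)   # per-coordinate scaled step
    return theta

# Minimize f(theta) = theta**2 / 2 (gradient: theta), minimum at 0.
theta_star = adagrad(lambda th: th, theta0=1.0)
```

Because G_t only grows, the effective step size α/√G_t shrinks monotonically, which is the "theoretical issue" part (c) hints at: with a small α the steps can decay before the iterate gets anywhere near the optimum.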

SLIDE 31

(c) Keep using α = 0.1 only for ADAGRAD. If you think about the underlying math of ADAGRAD, you may notice some theoretical issues. Tweak the problem hyper-parameters so that ADAGRAD performs worse than only 1 of the other 2 methods (i.e. worse than either ADAM or SGD). Report the plot and explain why ADAGRAD performs better than one method and worse than the other method.³

You are free to tweak things like the loss function, adding artificial noise to the gradient, starting from an unusual initial θ, changing the data generating model, changing the data dimension/size, etc. However, you should justify the changes based on some insight gained from theory. (Randomly changing the setting until a desirable outcome appears is not accepted.)

4.2 Classifying Handwritten Digits with Neural Networks

Next, we'll use neural networks to classify handwritten digits from the MNIST dataset (a dataset of handwritten digits). The objective is to use the different stochastic optimization algorithms that you have seen so far and compare their performances (GD, SGD, ADAM). You will train the model and then classify handwritten digits. A self-contained starter code in Python has been provided for your reference. You need to change a few lines of code to answer the exercise questions.

Fun fact: one of the co-creators of the MNIST dataset (Dr. Yann LeCun) is also the co-recipient of the 2018 ACM Turing award for his work on neural networks.

Exercise 10. We will run the starter code provided, understand different blocks of the code, and then run different gradient descent algorithms.

1. To run the Gradient Descent (GD) algorithm: set the (mini) batch size to 60000 (the size of the MNIST dataset). Run the SGD optimizer with a learning rate of 0.003.
   (1 point) Why is this the same as the GD algorithm if we are using the SGD optimizer?
   (2 points) What do you observe? Report the accuracy.
   (2 points) List at least 2 ways to improve the accuracy.

2. To run the Stochastic Gradient Descent (SGD) algorithm: set the (mini) batch size to 64. Run the SGD optimizer with a learning rate of 0.003.
   (2 points) What do you observe? Report the accuracy.
   (1 point) List at least one way to improve accuracy further.

3. To run the ADAM algorithm: set the (mini) batch size to 64. Run the ADAM optimizer with default learning rates (Algorithm 1). Hint: you can use PyTorch (or any other library in any language) for setting up the ADAM optimizer.
   (2 points) What do you observe? Report the accuracy.
   (1 point) Why does ADAM converge faster than SGD?

References

[RM51] Herbert Robbins and Sutton Monro. "A stochastic approximation method". In: The Annals of Mathematical Statistics (1951), pp. 400–407.
[Sac58] Jerome Sacks. "Asymptotic distribution of stochastic approximation procedures". In: The Annals of Mathematical Statistics 29.2 (1958), pp. 373–405.
[NY83] Arkadi Semenovich Nemirovsky and David Borisovich Yudin. "Problem complexity and method efficiency in optimization." In: Wiley-Interscience Series in Discrete Mathematics (1983).
[CZ07] Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Vol. 24. Cambridge University Press, 2007.
[Nem+09] Arkadi Nemirovski et al. "Robust stochastic approximation approach to stochastic programming". In: SIAM Journal on Optimization 19.4 (2009), pp. 1574–1609.

³ Extra Credit Ex. 5 asks you to implement other optimization algorithms in a similar fashion.
SLIDE 32

[Bot10] Léon Bottou. "Large-scale machine learning with stochastic gradient descent". In: Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177–186.
[DHS11] John Duchi, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization". In: Journal of Machine Learning Research 12.Jul (2011), pp. 2121–2159.
[Bot12] Léon Bottou. "Stochastic gradient descent tricks". In: Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
[DGL13] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Vol. 31. Springer Science & Business Media, 2013.
[JZ13] Rie Johnson and Tong Zhang. "Accelerating stochastic gradient descent using predictive variance reduction". In: Advances in Neural Information Processing Systems. 2013, pp. 315–323.
[Vap13] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2014. arXiv: 1412.6980 [cs.LG].
[MV18] Pierre Moulin and Venugopal Veeravalli. Statistical Inference for Engineers and Data Scientists. Cambridge University Press, 2018.
[SSS18] Sashank Reddi, Satyen Kale, and Sanjiv Kumar. On the Convergence of Adam and Beyond. 2018.

Acknowledgements

Ehsan Saleh and Anay Pattanik created the first draft of the project outline. Ehsan Saleh, Hongye Liu, Ajay Fewell, Muhammed Imran, Brian Yang, and Jinglin contributed to the new edition. This document was compiled and inspired by the work and ideas shown in [RM51; Sac58; NY83; Nem+09; Bot10; JZ13; Bot12; Vap13; DGL13; CZ07; MV18].

SLIDE 33

Motivation

So far, the processes we learned, such as the Bernoulli and Poisson processes, are sequences of independent trials.

There are a lot of real-world situations where sequences of events are NOT independent, in comparison.

A Markov chain is one type of characterization of a series of dependent trials.
SLIDE 34

An example of dependent events in a sequence

I had a glass of wine with my grilled ...

SLIDE 35

An example of dependent events in a sequence

SLIDE 36

Markov chain

A Markov chain is a process in which the outcome of any trial in a sequence is conditioned by the outcome of the trial immediately preceding it, but not by earlier ones.

Such dependence is called chain dependence.

Andrey Markov (1856-1922)

Handwritten: X_1, ..., X_{n-1}, X_n;  P(X_n | X_1, ..., X_{n-1}) = P(X_n | X_{n-1})

SLIDE 37

Markov chain in terms of probability

Let X_0, X_1, ... be a sequence of discrete finite-valued random variables.

The sequence is a Markov chain if the probability distribution of X_t only depends on the distribution of the immediately preceding random variable:

  P(X_t | X_0, ..., X_{t-1}) = P(X_t | X_{t-1})

If the conditional probabilities (transition probabilities) do NOT change with time, it's called a constant Markov chain:

  P(X_t | X_{t-1}) = P(X_{t-1} | X_{t-2}) = ... = P(X_1 | X_0)
SLIDE 38

Coin example

Toss a fair coin until you see two heads in a row and then stop. What is the probability of stopping after exactly n flips?

Use a state diagram, which is a directed graph. Circles are the states of likely outcomes. Arrow directions show the direction of transitions. Numbers over the arrows show transition probabilities.

States:
1) Start, or just had a tail (restart)
2) Had one head after start/restart
3) Two heads in a row (stop)

Each transition out of states 1 and 2 (H or T) has probability 1/2.
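The state diagram above can be turned into a transition matrix, and the stopping-time distribution computed by propagating the state probabilities; a minimal sketch:

```python
# States: 0 = start/just had a tail, 1 = one head, 2 = two heads in a row (stop).
# P[i][j] = probability of moving from state i to state j on one flip.
P = [
    [0.5, 0.5, 0.0],   # from start/tail: T -> restart, H -> one head
    [0.5, 0.0, 0.5],   # from one head:   T -> restart, H -> stop
    [0.0, 0.0, 1.0],   # stop is absorbing
]

def prob_stop_at(n):
    """P(stop after exactly n flips): absorbed at step n but not at step n - 1."""
    dist = [1.0, 0.0, 0.0]   # start in state 0 with probability 1
    absorbed_prev = 0.0
    for _ in range(n):
        absorbed_prev = dist[2]
        dist = [sum(dist[i] * P[i][j] for i in range(3)) for j in range(3)]
    return dist[2] - absorbed_prev
```

For example, stopping after exactly 2 flips requires HH, with probability 1/4, and stopping after exactly 3 flips requires THH, with probability 1/8; the matrix computation reproduces both.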
SLIDE 39

Is this a Markov chain? And why?

SLIDE 40

Is this a Markov chain? And why?

Yes. Because for each trial, the probability distribution of the outcomes is only conditioned on the previous trial.

SLIDE 41

Final Exam

- Time: 7-10pm, 5/12, Central Time. Conflicts need to be requested 1 week ahead with the graduate assistant.
- Duration: 3 hrs
- Content coverage: Ch 1-14, except 8, evenly distributed
- Open book and lecture notes
- Format: 50 multiple choice questions

SLIDE 42

Additional References

- Robert V. Hogg, Elliot A. Tanis and Dale L. Zimmerman. "Probability and Statistical Inference"
- Kevin Murphy, "Machine Learning: A Probabilistic Perspective"

SLIDE 43

Acknowledgement

Thank You!