Probability and Statistics for Computer Science

"All models are wrong, but some models are useful" (George Box)

Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 04.30.2020. Credit: wikipedia

Contents: Markov chain
- Motivation
- Definition of Markov model
- Graph representation – Markov chain
- Transition probability matrix
- The stationary Markov chain
- The PageRank algorithm
Project review:
- Why do we have the exercises in parts I and II?
- What is expected for each exercise?
- What do the notations mean?

CS 361 SP 2020 Project
(I) Stochastic First Order Optimization (65 pts)
* Stochastic (first-order) approximation: what is x*? What is this task? What does this have to do with optimization?
h(x) = 0 ⇒ root finding. In this context, x are the parameters, e.g. the hyperplane a·x + b = 0 of a classifier. We don't know h(x), but we know g(x) = h(x) + z, where z is random noise, independent of x, and E[g(x)] = h(x).

Is h(x) random? What is E[z]?

    E[g(x)] = E[h(x)] + E[z] = h(x) + E[z]  ⇒  E[z] = 0
CS 361: Probability and Statistics for Computer Science
(Spring 2020)
Stochastic First Order Optimization

1 Stochastic Approximation
Root-finding is simply the process of finding where h(x) = 0. For simple polynomials (e.g. h(x) = (x − 5)(x + 3)), this is very easy. However, this is not easy for all functions. For instance, say we want to optimize a machine learning algorithm. We can define f(x) to be the error function for the algorithm, which we want as small as possible. We can then look for the roots of h(x) = f′(x), since this is where the minimum of the error function might be. Additionally, we may have to worry about noise. Say we want to find where h(x) = 0, but finding the true value of h(x) at some x is extremely expensive or impossible. On the bright side, we have access to a "noisy" version of h that we call g(x). In other words, g(x) = h(x) + z. You cannot control the additive noise z or predict it, but you can assume that it is independent of x and that E[z] = 0. Stochastic approximation (SA) is the process of root-finding on a noisy function g(x).
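The definition above can be sanity-checked numerically: averaging many noisy evaluations g(x) at a fixed x recovers h(x), because E[z] = 0. This is a minimal sketch with a made-up h (the specific h and noise distribution are illustrative assumptions, not part of the project):

```python
import random

# Hypothetical target: h(x) = x - 3, with root x* = 3.
def h(x):
    return x - 3.0

# Noisy version g(x) = h(x) + z, with z uniform on [-1, 1], so E[z] = 0.
def g(x, rng):
    return h(x) + rng.uniform(-1.0, 1.0)

rng = random.Random(0)
x = 5.0
# Averaging many noisy evaluations at a fixed x recovers h(x).
avg = sum(g(x, rng) for _ in range(100_000)) / 100_000
print(avg)  # close to h(5.0) = 2.0
```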
1.1 Stochastic Approximation in a simple setting
For stochastic approximation to be effective, we need a sequence of positive learning rates that we denote as {η_n}, n ≥ 1. In the following exercises, we will perform stochastic approximation on h(x), having access to a noisy version y = h(x) + z. In order to find a good sequence of learning rates, we need to make the following assumptions.

We assume x* is the unique zero crossing of h. In other words:

    h(x*) = 0,  x > x* ⇒ h(x) > 0,  x < x* ⇒ h(x) < 0

    P(|y| < c) = 1 for some c ≥ 0    (1.1)

    ∀x : E[y] = h(x),  P(z|x) = P(z)    (1.2)

    Σ_{n=1}^∞ η_n = ∞,  Σ_{n=1}^∞ η_n² = c    (1.3)

for some positive c. In other words, the sum of the learning rates is unbounded, but the sum of their squares is bounded.

Exercise 1. (4 points) Propose a family of learning rates that satisfies the learning-rate assumption (1.3) (a formal proof is not needed). Hint: try providing a range of values for α in n^(−α) that would satisfy the constraints.

Now that we have a sequence of learning rates, we can move on to stochastic approximation. The algorithm is defined as follows, where X_n is the nth approximation of x*:
Is η_n = 1/n a good learning rate? Is η_n = 1/n² good? (Σ 1/n diverges, while Σ 1/n² converges.)

    X_{n+1} = X_n − η_n Y_n

η_n is the learning rate; X_n is the nth approximation of x*. As n → ∞, does X_n → x*? Will this happen stochastically, i.e.

    lim_{n→∞} E[(X_n − x*)²] = 0 ?

There are elaborate steps in the full proof which are too complicated for the project. We selected part of them to use as exercises for you. Some of the intermediate results are provided. We'd like you to learn about conditional expectation!
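Assumption (1.3) can be checked numerically for the family η_n = n^(−α). The sketch below uses α = 1 (an illustrative choice): the partial sums of η_n grow without bound like ln(N), while the partial sums of η_n² approach π²/6.

```python
import math

def eta(n, alpha=1.0):
    # Learning-rate family eta_n = n**(-alpha); alpha = 1 satisfies (1.3).
    return n ** (-alpha)

N = 100_000
partial_sum = sum(eta(n) for n in range(1, N + 1))          # grows like ln(N)
partial_sum_sq = sum(eta(n) ** 2 for n in range(1, N + 1))  # approaches pi^2/6

print(partial_sum)                                  # ≈ 12.09 for N = 1e5
print(abs(partial_sum_sq - math.pi ** 2 / 6))       # tiny tail remainder
```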
    X_{n+1} = X_n − η_n Y_n    (1.4)

where Y_n = h(X_n) + Z_n (i.e., the noisy version of h(X_n)), just as mentioned previously.
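The iteration above can be sketched directly. The target function and noise below are illustrative assumptions (h(x) = x − 3 with bounded zero-mean noise), not the project's functions:

```python
import random

def stochastic_approximation(g, x0, n_steps, rng):
    """Robbins-Monro style iteration X_{n+1} = X_n - eta_n * Y_n,
    with eta_n = 1/n and Y_n = g(X_n), a noisy evaluation of h."""
    x = x0
    for n in range(1, n_steps + 1):
        y = g(x, rng)
        x = x - (1.0 / n) * y
    return x

# Hypothetical example: h(x) = x - 3 (unique zero crossing at x* = 3),
# observed through bounded zero-mean noise.
def g(x, rng):
    return (x - 3.0) + rng.uniform(-0.5, 0.5)

rng = random.Random(1)
x_final = stochastic_approximation(g, x0=0.0, n_steps=20_000, rng=rng)
print(x_final)  # close to 3
```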
1.2 Convergence proof of SA
Now that SA is defined, we want to show that it actually works. To do so, we can define an expression for the error and show that this expression converges to 0. Define the mean squared error at step n as follows:

    e_n² = E[(X_n − x*)²]    (1.5)

Exercise 2. Prove the following relationship:

    E_u[f(u)] = E_v[ E_{u|v}[f(u)|v] ]    (1.6)

Do not assume any kind of independence. We can summarize this relation as E[A] = E[E[A|B]]. Hint: it requires the notion of conditional expectation (E_{u|v}[f(u)|v]). Here is a resource to learn about conditional expectation; you are free to find and use others. Ultimately, we want to show that the mean squared error converges to 0 as the number of steps approaches ∞; the relationships (1.7) and (1.8) below will be needed.¹
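The identity in Exercise 2 can be sanity-checked exactly on a small discrete joint distribution (the joint pmf and f below are made-up illustrative values):

```python
# Verify E[f(U)] = E_V[ E[f(U)|V] ] on a small discrete joint pmf.
joint = {  # p(u, v)
    (0, 0): 0.1, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.4,
}
f = lambda u: u ** 2 + 1

# Left-hand side: E[f(U)]
lhs = sum(p * f(u) for (u, v), p in joint.items())

# Right-hand side: E_V[ E[f(U)|V] ]
p_v = {}
for (u, v), p in joint.items():
    p_v[v] = p_v.get(v, 0.0) + p
rhs = 0.0
for v0, pv in p_v.items():
    cond = sum(p * f(u) for (u, v), p in joint.items() if v == v0) / pv
    rhs += pv * cond

print(lhs, rhs)  # both equal 1.7
```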
    e_{n+1}² = e_n² − 2η_nρ_n + η_n²E[Y_n²]    (1.7)

where ρ_n = E[(X_n − x*)h(X_n)], and Y_n is still the noisy version of h(X_n). This shows us the relationship between two subsequent iterations of the mean squared error.

    e_{n+1}² = e_1² − 2 Σ_{i=1}^n η_iρ_i + Σ_{i=1}^n η_i²E[Y_i²]    (1.8)

d_n and b_n are unimportant; we can show that Σ_{i=1}^∞ η_id_i = ∞ and Σ_{i=1}^∞ η_id_ie_i² < ∞. Using these two facts, and assuming e_n² converges, finalize the proof by proving the following:²

    lim_{n→∞} e_n² = 0

2 Stochastic First Order Optimization
2.1 Review
The goal of optimization is to find the x* that minimizes f(x). However, f is again either unknown or very expensive to collect, but we have access to the noisy version g(x):

    E[g(x)] = f(x)    (2.1)

We also assume that we have access to the gradient of g, which is also noisy:

    E[∇g(x)] = ∇f(x)    (2.2)
¹ Extra Credit Ex. 1 asks you to prove the statements we provided without proof in Exercise 2 and may help increase your mathematical understanding of error bounding.
² Before continuing, you may consider attempting Extra Credit Ex. 2, 3, and 4. These exercises ask you to analyze some properties of SA and the order of convergence under SA settings. Again, these are not required.
Y_n is a sequence of random variables, and X_n is another. X_{n+1} = X_n − η_nY_n is a continuous RV in this context, and E[E[f(u)|v]] = E[f(u)] is used to relate e_{n+1}² to e_n².

P(|z| ≤ c) = 1 for some c > 0 means the noise is bounded (see for example Lec 8, pg 23, and the notes).

Sketch of the final contradiction argument: suppose lim_{n→∞} e_n² = c for some c > 0. Then for every ε > 0 there is an N such that for all n > N, e_n² > c − ε; in particular e_n² > c/2 for large enough n. But then

    Σ_i η_id_ie_i²  ≥  (c/2) Σ_i η_id_i  =  ∞,

contradicting Σ_i η_id_ie_i² < ∞. Hence c = 0.

To relate e_{n+1}² with e_n², use conditional expectation:

    e_{n+1}² = E[(X_{n+1} − x*)²] = E[ E[(X_{n+1} − x*)² | X_n] ]

Conditional expectation. We have seen this for a discrete RV:

    E[X] = Σ_x x p(x)

E[X|Y] is itself a random variable, a function g(Y) of Y: the mean of X given Y. Law of iterated expectations:

    E[E[X|Y]] = E[g(Y)] = Σ_y g(y) p(y) = Σ_y Σ_x x p(x|y) p(y) = Σ_x x Σ_y p(x, y) = Σ_x x p(x) = E[X]

What about E[X|Y] for a continuous random variable? With p(x|y) the conditional density,

    E[X | Y = y] = ∫ x p(x|y) dx
    E[f(X) | Y = y] = ∫ f(x) p(x|y) dx

For Ex. 2, E[E[f(X)|Y]] when X and Y are continuous RVs is ∫ (∫ f(x) p(x|y) dx) p(y) dy.

Stick-breaking example: break a unit stick at a uniformly chosen point Y, then break what is left at a uniformly chosen point X. Then X given Y = y is uniform on [0, y], so

    E[X | Y = y] = ∫_0^y x (1/y) dx = y/2,
    E[X] = E[E[X|Y]] = E[Y/2] = 1/4.

Does it matter whether we break from the left or from the right?
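The stick-breaking example can be checked by simulation: draw Y uniform on [0, 1], then X uniform on [0, Y], and the sample mean of X should be near 1/4, matching the law of iterated expectations.

```python
import random

# Monte Carlo check of the stick-breaking example: Y ~ Uniform(0, 1),
# then X ~ Uniform(0, Y). The law of iterated expectations gives
# E[X] = E[E[X|Y]] = E[Y/2] = 1/4.
rng = random.Random(42)
n = 200_000
total = 0.0
for _ in range(n):
    y = rng.random()            # first break point
    total += rng.uniform(0, y)  # second break, on the left piece
est = total / n
print(est)  # ≈ 0.25
```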
CS 361 SP 2020 Project
(I) Stochastic First Order Optimization (65 pts)
* Stochastic (first-order) approximation
* Stochastic first-order GD and SGD

Example: for a classifier (e.g. an SVM), let the cost be

    f = S(a, b) = (1/N) Σ_{i=1}^N g_i(a, b) + penalty(||a||),

where g_i is the cost on the ith training example and the learning rate η > 0 is small.

GD:  use the full average  S = (1/N) Σ_{i=1}^N g_i + penalty(||a||).
SGD: use  S ≈ g_i + penalty(||a||),  where i is a random sample from {1, ..., N}; the batch can be just 1! Would it work?

GD:   X_{n+1} = X_n − η_n ∇f(X_n)
SGD:  X_{n+1} = X_n − η_n ∇g(X_n),  with  lim_{n→∞} E[(X_n − x*)²] = 0.
2.1.1 Gradient Descent Algorithm

    X_{n+1} = X_n − η_n∇f(X_n)    (2.3)

Exercise 3. (2 points) Recall that root-finding algorithms find the value of x where a function h(x) = 0, and that a gradient descent algorithm finds the minimum of a function. Are these algorithms accomplishing the same goal? Briefly explain why or why not. Your answer should be limited to three lines.

2.1.2 Stochastic Gradient Descent Algorithm

    X_{n+1} = X_n − η_n∇g(X_n)    (2.4)

Exercise 4. (2 points) Do stochastic gradient descent and stochastic approximation (from section 1) accomplish the same goal? Briefly explain why or why not. Your answer should be limited to three lines.

2.1.3 Empirical Risk Minimization

In a lot of machine learning problems, the training problem boils down to the following format: we are trying to minimize a function f(x),

    f(x) = (1/k) Σ_{j=1}^k Q_j(x)    (2.5)

where Q_j(x) is the loss function for the jth data point and we have k training data points.

Exercise 5. (4 points) Define g(x) to be Q_i(x), and ∇g(x) to be ∇Q_i(x). If there is some noise z = Q_i(x) − f(x), then g is the noisy version of f, so g(x) = Q_i(x) = f(x) + z. Here i is an index chosen randomly and uniformly, with replacement, from 1 to k. Given this definition of f and g, show that equations 2.1, 2.2, and E[z] = 0 are satisfied.
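The zero-mean property claimed in Exercise 5 can be checked numerically. The per-point losses below are made-up quadratics, used only to illustrate that E[Q_i(x)] over a uniform random index i equals f(x):

```python
import random

# ERM setting: f(x) = (1/k) * sum_j Q_j(x); g(x) = Q_i(x) for a uniformly
# random index i. Check numerically that E[g(x)] = f(x), i.e. E[z] = 0.
k = 5
centers = [0.0, 1.0, -2.0, 3.0, 0.5]          # illustrative data
Q = [lambda x, c=c: (x - c) ** 2 for c in centers]

def f(x):
    return sum(q(x) for q in Q) / k

rng = random.Random(0)
x = 1.5
samples = [Q[rng.randrange(k)](x) for _ in range(200_000)]
mean_g = sum(samples) / len(samples)
print(mean_g, f(x))  # nearly equal
```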
2.2 Convergence rates for SGD and GD
Again, we want to optimize function f with learning rate sequence ⌘n. We will mention convexity, and we will assume that our functions are convex so we can use the following theorems: Theorem 2.1 Assume rf has a unique root x∗. If f is twice continuously differentiable and strongly convex, and ⌘n = O(n−1), then to reach an approximation error ✏ = |E[f(Xn)] f(x∗)|, stochastic gradient descent requires n = O( 1
✏ ) updates.Theorem 2.2 Assume rf has a unique root x∗. If either the smoothness assumptions about f in theorem 2.1
descent requires n = O( 1
✏2 ) updates.Theorem 2.3 Assume rf has a unique root x∗. To reach an approximation error ✏ = |E[f(Xn)] f(x∗)|, gradient descent requires n = O(ln( 1
✏ )) updates.E- ( fcxl )
For a discrete RV, E[f(X)] = Σ_x f(x) P(X = x). In the ERM context, Q_i(x) is the loss for a randomly and uniformly chosen index i, and x is the parameter vector, so

    E[g(x)] = E[Q_i(x)] = Σ_{j=1}^k Q_j(x) P(i = j) = (1/k) Σ_{j=1}^k Q_j(x) = f(x),

and similarly E[∇g(x)] = E[∇Q_i(x)] = ∇f(x).
Exercise 6. Consider the strongly convex and one-dimensional function f(x) = 1…, where z is a bounded random noise with mean of zero and a unit variance. … the approximation error ε = |E[f(X_n)] − f(x*)| when using g as the noisy version. Consider using the learning rate sequence η_n = 1/n.

3 Comparing SGD vs GD for Empirical Risk Minimization
Let us consider the minimization of f discussed earlier for the ERM task.

Exercise 7. (15 points) Fill in table 1 with complexity orders in terms of k and ρ.

Table 1: Comparing SGD vs GD in terms of training precision

                                    SGD     GD
    Computational cost per update   O(1)
    Number of updates to reach ρ
    Total cost

Explain your answer for each element of the second row by providing a reference to the theorem mentioned in this document. For every other element, explain your answer.
4 Comparing SGD vs GD with respect to test loss
The last exercise was about finding the computational complexity required to reach a specific precision with respect to the optimal training loss. However, a specific precision of training loss does not translate to the same precision of test loss. Here are a couple of interesting facts:
relation ✏ = O(k−) must hold for most functions where 0 < 1 is an unknown constant.
Exercise 8. (16 points) Fill in table 2 with complexity orders in terms of ε and β. You can refer to the previous table, and replace the elements with appropriate values. Make sure you state the reason for each element even if it looks obvious.

Table 2: Comparing SGD vs GD in terms of testing precision

                                    SGD     GD
    Computational cost per update   O(1)
    Number of updates to reach ε
    Total cost

For a typical β = 0.2, explain why choosing GD over SGD can be very unwise.
CS 361 SP 2020 Project
(I) Stochastic First Order Optimization (65 pts)
(II) Stochastic Optimization Implementation (40 pts)

Remember what a convex function is; SGD doesn't work well on non-convex functions!
If a set is convex, any line connecting two points in the set is completely included in the set.

A convex function: the area above the curve is convex.

    f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)

Credit: Dr. Kelvin Murphy
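The convexity inequality above can be checked numerically for a particular convex function. The sketch below uses f(x) = x² (an illustrative choice) and random points and mixing weights:

```python
import random

# Numeric check of the convexity inequality
#   f(l*x + (1-l)*y) <= l*f(x) + (1-l)*f(y)
# for the convex function f(x) = x**2.
f = lambda x: x * x

rng = random.Random(0)
ok = all(
    f(l * x + (1 - l) * y) <= l * f(x) + (1 - l) * f(y) + 1e-9
    for x, y, l in (
        (rng.uniform(-10, 10), rng.uniform(-10, 10), rng.random())
        for _ in range(10_000)
    )
)
print(ok)  # True: the inequality holds at every sampled triple
```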
Implement a state-of-the-art stochastic gradient descent algorithm that works well for non-convex problems:
* Try the ADAM algorithm on linear regression problems and neural-network classification (→ discussion session).

The goals are:
* Understand convexity, learning rates, and convergence in practice.
* Understand the differences between the methods.
* Critical thinking based on understanding of the method and the problem.
* Coding is minimal given the starter.
CS 361: Probability and Statistics for Computer Science
(Spring 2020)
Stochastic Optimization Implementation
To find the starter code, please take a look at the 361 Project Github. There is also a helpful 'CS 361 Final Project Coding Instructions.pdf' that will help you get your environment set up. (Note: if issues arise with a local environment setup, they will be much harder to resolve this semester given the unique remote situation.) Objective: implement state-of-the-art stochastic optimization algorithms for Machine Learning problems, Linear Regression and Classification (using Neural Networks).
4.1 Adaptive Momentum Estimation (ADAM) Gradient Descent Algorithm
SGD can be ineffective when the function is highly non-convex. Unfortunately, most modern applications of Machine Learning involve non-convex optimization problems. One stochastic optimization algorithm that works well even for non-convexity is ADAM [KB14]. ADAM uses past gradient information to "speed up" optimization by smoothing, and the algorithm has been further improved [SSS18]. You will implement the ADAM stochastic optimization algorithm.
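As a preview, the ADAM update can be sketched in scalar Python on a toy quadratic objective. The objective and hyper-parameters below are illustrative only; they are not the project's starter-code functions:

```python
# Minimal scalar sketch of the ADAM update rule on a toy quadratic.
def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
         n_steps=5000):
    theta = theta0
    m = 0.0  # 1st moment estimate
    v = 0.0  # 2nd raw moment estimate
    for t in range(1, n_steps + 1):
        g = grad(theta)                      # gradient at current parameters
        m = beta1 * m + (1 - beta1) * g      # biased first moment
        v = beta2 * v + (1 - beta2) * g ** 2 # biased second raw moment
        m_hat = m / (1 - beta1 ** t)         # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)         # bias-corrected second moment
        theta = theta - alpha * m_hat / (v_hat ** 0.5 + eps)
    return theta

# f(theta) = (theta - 2)^2 has its minimum at theta = 2.
theta_final = adam(grad=lambda th: 2 * (th - 2.0), theta0=0.0, alpha=0.05,
                   n_steps=5000)
print(theta_final)  # close to 2
```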
The pseudo-code for ADAM has been reproduced here from this paper. Credits to [KB14]. Disclaimer: the textbook, in Chapter 13, uses a different symbol for the parameters, but we will be using θ.

Algorithm 1: g_t² indicates the elementwise square g_t ⊙ g_t. Good default settings for the tested machine learning problems are α = 0.001, β1 = 0.9, β2 = 0.999 and ε = 10^(−8). All operations on vectors are element-wise. With β1^t and β2^t, we denote β1 and β2 to the power t.

Require: α: Stepsize
Require: ε: Division-by-zero control parameter
Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
  m0 ← 0 (Initialize 1st moment vector)
  v0 ← 0 (Initialize 2nd moment vector)
  t ← 0 (Initialize timestep)
  while θt not converged do
    t ← t + 1
    gt ← ∇θ ft(θt−1) (Get gradients w.r.t. stochastic objective at timestep t)
    mt ← β1 · mt−1 + (1 − β1) · gt (Update biased first moment estimate)
    vt ← β2 · vt−1 + (1 − β2) · gt² (Update biased second raw moment estimate)
    m̂t ← mt/(1 − β1^t) (Compute bias-corrected first moment estimate)
    v̂t ← vt/(1 − β2^t) (Compute bias-corrected second raw moment estimate)
    θt ← θt−1 − α · m̂t/(√v̂t + ε) (Update parameters)
  end while
  return θt (Resulting parameters)

Exercise 9. Consider the following problem setting:
… generated from a uniform distribution over the interval [−1, 1].

… 10-dimensional vectors. Assume that θtrue is all ones. However, we will pretend that we do not know it and we want to estimate it using the training data.
… distribution). The label yi is a scalar.

… uniformly in the interval [0, 0.1].
Since this is a classification problem, we need to define a loss function. Let's use the following format:

    Q(θ) = (1/k) Σ_{j=1}^k Q_j(θ)    (4.1)

    Q_j(θ) = …    (4.2)

where … is a hyper-parameter that we control and defines the objective. When you answer the following questions, snippets of code are not necessary. You should state your findings, provide analysis, and substantiate them with necessary plots in a clean way.
… for θ. Provide the results of the experiment and state whether it is close to the true value.

… and implement it in the appropriate function. Hint: for r(θ) = h(e(θ)) = h(x_j^T θ − y), the gradient can be written as ∇r(θ) = (∂h/∂e) · ∇e(θ) = (∂h/∂e) · x_j according to the chain rule. Hint: the sign function, sgn(x), may be useful.

… rate as mentioned in the pseudocode above) and SGD to find the best parameter θ. Use a batch size of 1 for this problem, and perform 1000 parameter updates. Report the final set of parameters.
You might notice that the error bars of ADAM and SGD overlap. This is due to the inherent randomness from sampling values. To avoid this probabilistic overlap, increase the number of replicates (num rep in the starter code) until the error bars between ADAM and SGD do not overlap. Report this curve.
… and analyze the trends you are seeing. State whether ADAM consistently outperforms SGD. Your analysis should include the reason why one method outperforms the other under each setting.
Another SGD variant is called ADAGRAD [DHS11]. We define the following for ADAGRAD:

… represents the element-wise multiplication.

    θt = θt−1 − α · gt/(√Gt + ε)    (4.3)

where α is the learning rate and ε is the division-by-zero control hyper-parameter.

(a) Implement ADAGRAD, and show a training plot for the default setting where … = 2 and α and ε are the same as mentioned in Algorithm 1. How does ADAGRAD do relative to SGD and ADAM? Why?

(b) A student says "ADAGRAD is better at handling larger learning rates". Use a learning rate of α = 0.1 only for ADAGRAD, and leave the ADAM and SGD learning rates the same as before. How do the results change? Does this confirm the student's claim? Is this a fair experiment? Also provide the training plot.
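A minimal sketch of the ADAGRAD update (eq. 4.3) on the same kind of toy quadratic. The accumulator definition is elided in the text above; the sketch assumes the standard rule from [DHS11], Gt = Gt−1 + gt ⊙ gt:

```python
# Scalar sketch of ADAGRAD on a toy quadratic (illustrative objective).
def adagrad(grad, theta0, alpha=0.1, eps=1e-8, n_steps=2000):
    theta = theta0
    G = 0.0  # running sum of squared gradients (assumed accumulator rule)
    for _ in range(n_steps):
        g = grad(theta)
        G += g * g
        theta = theta - alpha * g / (G ** 0.5 + eps)
    return theta

# f(theta) = (theta - 2)^2 has its minimum at theta = 2.
theta_final = adagrad(grad=lambda th: 2 * (th - 2.0), theta0=0.0)
print(theta_final)  # close to 2
```

Note how the effective step size α/√Gt shrinks as squared gradients accumulate, which is the "theoretical issue" part (c) below hints at.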
(c) Keep using α = 0.1 only for ADAGRAD. If you think about the underlying math of ADAGRAD, you may notice some theoretical issues. Tweak the problem hyper-parameters so that ADAGRAD performs worse than only one of the other two methods (i.e. worse than either ADAM or SGD). Report the plot and explain why ADAGRAD performs better than one method and worse than the other.
You are free to tweak things like the loss function, adding artificial noise to the gradient, starting from unusual initial θ, changing the data generating model, changing the data dimension/size, etc. However, you should justify the changes based on some insight gained from theory. (Randomly changing the setting until a desirable outcome appears is not accepted.)
4.2 Classifying Handwritten Digits with Neural Networks
Next, we'll use Neural Networks to classify handwritten digits from the MNIST dataset (a dataset of handwritten digits). The objective is to use the different stochastic optimization algorithms that you have seen so far and compare their performances (GD, SGD, ADAM). You will train the model and then classify handwritten digits. A self-contained starter code in Python has been provided for your reference. You only need to change a few lines.
Fun fact: one of the co-creators of the MNIST dataset (Dr. Yann LeCun) is also the co-recipient of the 2018 ACM Turing award for his work on Neural Networks. Exercise 10. We will run the starter code provided, understand different blocks of the code, and then run different gradient descent algorithms.
… dataset). Run the SGD optimizer with a learning rate of 0.003. (1 point) Why is this the same as the GD algorithm if we are using the SGD optimizer? (2 points) What do you observe? Report the accuracy. (2 points) List at least 2 ways to improve the accuracy.

… (2 points) What do you observe? Report the accuracy. (1 point) List at least one way to improve the accuracy further.

… learning rates (Algorithm 1). Hint: you can use Pytorch (or any other library in any language) for setting up the ADAM optimizer. (2 points) What do you observe? Report the accuracy. (1 point) Why does ADAM converge faster than SGD?
References
³ Extra Credit Ex. 5 asks you to implement other optimization algorithms in a similar fashion.

[RM51] Herbert Robbins and Sutton Monro. "A stochastic approximation method". In: The Annals of Mathematical Statistics (1951), pp. 400–407.
[Sac58] Jerome Sacks. "Asymptotic distribution of stochastic approximation procedures". In: The Annals of Mathematical Statistics (1958).
[NY83] Arkadi Semenovich Nemirovsky and David Borisovich Yudin. "Problem complexity and method efficiency in optimization". In: Wiley-Interscience Series in Discrete Mathematics (1983).
[CZ07] Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Vol. 24. Cambridge University Press, 2007.
[Nem+09] Arkadi Nemirovski et al. "Robust stochastic approximation approach to stochastic programming". In: SIAM Journal on Optimization 19.4 (2009), pp. 1574–1609.
[Bot10] Léon Bottou. "Large-scale machine learning with stochastic gradient descent". In: Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177–186.
[DHS11] John Duchi, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization". In: Journal of Machine Learning Research 12.Jul (2011), pp. 2121–2159.
[Bot12] Léon Bottou. "Stochastic gradient descent tricks". In: Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
[DGL13] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Vol. 31. Springer Science & Business Media, 2013.
[JZ13] Rie Johnson and Tong Zhang. "Accelerating stochastic gradient descent using predictive variance reduction". In: Advances in Neural Information Processing Systems. 2013, pp. 315–323.
[Vap13] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2014. arXiv: 1412.6980 [cs.LG].
[MV18] Pierre Moulin and Venugopal Veeravalli. Statistical Inference for Engineers and Data Scientists. Cambridge University Press, 2018.
[SSS18] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the Convergence of Adam and Beyond. 2018.
Acknowledgements
Ehsan Saleh and Anay Pattanik created the first draft of the project outline. Ehsan Saleh, Hongye Liu, Ajay Fewell, Muhammed Imran, Brian Yang, and Jinglin contributed to the new edition. This document was compiled and inspired by the work and ideas shown in [RM51; Sac58; NY83; Nem+09; Bot10; JZ13; Bot12; Vap13; DGL13; CZ07; MV18].
So far, the processes we learned such as the Bernoulli and Poisson process are sequences of independent random variables. There are a lot of real-world situations where sequences of events are NOT independent, in comparison. The Markov chain is one type of characterization of such dependent sequences.
I had a glass of wine with my grilled …
A Markov chain is a process in which the outcome of any trial in a sequence is conditioned by the immediately preceding one, but not by earlier ones.

Such dependence is called chain dependence.

Andrey Markov (1856–1922)
Let X0, X1, … be a sequence of discrete finite-valued random variables. The sequence is a Markov chain if the probability distribution of Xt only depends on the distribution of the immediately preceding random variable:

    P(Xt | X0, ..., Xt−1) = P(Xt | Xt−1)

If the conditional probabilities (transition probabilities) do NOT change with time, it's called a constant Markov chain:

    P(Xt | Xt−1) = P(Xt−1 | Xt−2) = ... = P(X1 | X0)
Toss a fair coin until you see two heads in a row and then stop. What is the probability of stopping after exactly n flips?

Use a state diagram, which is a directed graph. Circles are the states of likely outcomes. Arrow directions show the direction of transitions. Numbers over the arrows show transition probabilities.

States:
1 -> Start, or just had a tail (restart)
2 -> Had one head after start/restart
3 -> Two heads in a row (Stop)

Each transition (H or T) has probability 1/2.

Is this a Markov chain? Yes, because for each trial, the probability distribution of the current trial depends only on the outcome of the previous trial.
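The coin example above can be worked out both by simulating the three-state chain and by an exact recursion over the non-stopped states. The state encoding follows the diagram (1 = start/tail, 2 = one head, 3 = stop):

```python
import random

def simulate_stop_time(rng):
    """Simulate flips of a fair coin until two heads in a row; return the
    number of flips taken, walking the three-state chain."""
    state, flips = 1, 0
    while state != 3:
        flips += 1
        heads = rng.random() < 0.5
        if state == 1:
            state = 2 if heads else 1
        else:  # state == 2
            state = 3 if heads else 1
    return flips

def exact_stop_prob(n):
    """Exact P(stop after exactly n flips) via the transition structure.
    p1, p2 track P(in state 1 / state 2) among unstopped paths."""
    p1, p2 = 1.0, 0.0
    for _ in range(n - 1):
        p1, p2 = 0.5 * p1 + 0.5 * p2, 0.5 * p1
    return 0.5 * p2  # stop on the nth flip, from state 2

rng = random.Random(0)
trials = 100_000
est3 = sum(simulate_stop_time(rng) == 3 for _ in range(trials)) / trials
print(exact_stop_prob(3), est3)  # exact value for n = 3 is 1/8 (only THH)
```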
Final exam
Time: 7-10pm 5/12, Central Time. Conflicts need to be requested 1 week ahead to the graduate assistant.
Duration: 3 hrs
Content coverage: Ch 1-14, except 8, evenly distributed
Open book and lecture notes
Format: 50 multiple choices
✺ Robert V. Hogg, Elliot A. Tanis and Dale L. Zimmerman. "Probability and Statistical Inference"
✺ Kelvin Murphy. "Machine Learning, A Probabilistic Perspective"