Probability and Statistics for Computer Science (PowerPoint PPT Presentation)



SLIDE 1

Probability and Statistics for Computer Science

"All models are wrong, but some models are useful" (George Box)

Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 04.30.2020 (image credit: Wikipedia)

SLIDE 2

Contents

Markov chain
- Motivation
- Definition of Markov model
- Graph representation: Markov chain
- Transition probability matrix
- The stationary Markov chain
- The PageRank algorithm

SLIDE 3

Project review:
- Why do we have the exercises in Part I?
- What is expected for each exercise?
- What do the notations mean?
SLIDE 4

CS 361 SP 2020 project

(I) Stochastic First Order Optimization (65 pts)
- stochastic first order approximation: h(x), x* = ?
- What is this task? What does this have to do with optimization?

SLIDE 5

h(x) = 0  ⇒  root finding
f'(x*) = 0  ⇒  optimization

In the context of optimization, x are the parameters, e.g. the hyperplane of an SVM classifier, a^T x - b = 0.

SLIDE 6

If we don't know h(x), but we know g(x) = h(x) + z, where z is a random noise independent of x and E[g(x)] = h(x):
- Given x, is h(x) random?
- What is E[z]?

  E[g(x)] = E[h(x)] + E[z] = h(x) + E[z], so E[z] = 0.

SLIDE 7

CS 361: Probability and Statistics for Computer Science

(Spring 2020)

Stochastic First Order Optimization

1 Stochastic Approximation

Root-finding is simply the process of finding where h(x) = 0. For simple polynomials (e.g. h(x) = (x - 5)(x + 3)), this is very easy. However, this is not easy for all functions. For instance, say we want to optimize a machine learning algorithm. We can define f(x) to be the error function for the algorithm, which we want to be as small as possible. In order to do so, we would need to find the root of the derivative of the error function (i.e., h(x) = f'(x)), since this is where the minimum of the error function might be. Additionally, we may have to worry about noise. Say we want to find where h(x) = 0, but finding the true value of h(x) at some x is extremely expensive or impossible. On the bright side, we have access to a "noisy" version of h that we call g(x). In other words, g(x) = h(x) + z. You cannot control the additive noise z or predict it, but you can assume that it is independent of x and that E[z] = 0. Stochastic approximation (SA) is the process of root-finding on a noisy function g(x).

1.1 Stochastic Approximation in a simple setting

For stochastic approximation to be effective, we need a sequence of positive learning rates that we denote as {η_n}_{n≥1}. In the following exercises, we will perform stochastic approximation on h(x), and have access to a noisy version y = h(x) + z. In order to find a good sequence of learning rates, we need to make the following assumptions:

1. The function h has a unique root x* (i.e., h(x*) = 0 for a unique x*). This unique root is a positive zero crossing of h. In other words: h(x*) = 0; x > x* ⇒ h(x) > 0; x < x* ⇒ h(x) < 0.

2. y has a finite upper and lower bound. In other words, we have bounded noise:

     P(|y| < c) = 1 where c > 0   (1.1)

3. The noise is independent of x:

     ∀x: E[y] = h(x), P(z|x) = P(z)   (1.2)

4. The learning rates η_n do not approach 0 too quickly or too slowly. More formally:

     Σ_{n=1}^∞ η_n = ∞,  Σ_{n=1}^∞ η_n² = c   (1.3)

   for some positive c. In other words, the sum of the learning rates is unbounded, but the sum of their squares is bounded.

Exercise 1. (4 points) Propose a family of learning rates that satisfies assumption 4 (a formal proof is not needed). Hint: try providing a range of values for α in n^(-α) that would satisfy the constraints.

Now that we have a sequence of learning rates, we can move on to stochastic approximation. The algorithm is defined as follows, where X_n is the nth approximation of x*:

- Let X_1 be some initial value or guess

(Handwritten notes: check η_n = 1/n against assumption 4: Σ_{n=1}^∞ 1/n = ∞ and Σ_{n=1}^∞ 1/n² < ∞, so it is a good choice. Update with the learning rate: X_{n+1} = X_n - η_n Y_n.)
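The handwritten check above (rates of the form η_n = n^(-α), as in Exercise 1's hint) can be probed numerically. A minimal illustrative sketch, not a proof: for α = 1 the partial sums of 1/n keep growing like ln N, while the partial sums of 1/n² level off near π²/6.

```python
import math

def partial_sums(alpha, N):
    """Partial sums of eta_n = n**(-alpha) and of eta_n**2, up to N terms."""
    s1 = sum(n ** (-alpha) for n in range(1, N + 1))
    s2 = sum(n ** (-2 * alpha) for n in range(1, N + 1))
    return s1, s2

# For alpha = 1: sum of 1/n grows like ln(N) (unbounded),
# while sum of 1/n**2 converges to pi**2 / 6.
s1_small, s2_small = partial_sums(1.0, 1_000)
s1_large, s2_large = partial_sums(1.0, 1_000_000)
```

Between N = 10³ and N = 10⁶ the first sum gains about ln(1000) ≈ 6.9, while the second barely moves and sits near π²/6 ≈ 1.6449.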

SLIDE 8

X_n is the nth approximation of x*.

  X_{n+1} = X_n - η_n Y_n

As n → ∞, does X_n → x*? Will this happen stochastically?

  lim_{n→∞} E[(X_n - x*)²] = 0

SLIDE 9

1.2 Stochastic approximation convergence: the theorem statement

- There are elaborate steps which are too complicated for the project. We selected part of them to use as exercises for you.
- Some of the intermediate results are provided.
- We'd like you to learn about conditional expectation!

SLIDE 10

- For n = 1, 2, ..., perform the following iteration:

    X_{n+1} = X_n - η_n Y_n   (1.4)

  where Y_n = h(X_n) + Z_n (i.e., the noisy version of h(X_n)), just as mentioned previously.
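The iteration (1.4) can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical target h(x) = x - 2 (root x* = 2) with bounded uniform noise and η_n = 1/n, not part of the project starter code:

```python
import random

random.seed(0)

def stochastic_approximation(x1, steps):
    """Robbins-Monro iteration X_{n+1} = X_n - eta_n * Y_n with eta_n = 1/n."""
    x = x1
    for n in range(1, steps + 1):
        z = random.uniform(-0.5, 0.5)  # bounded noise, E[z] = 0
        y = (x - 2.0) + z              # noisy observation of h(x) = x - 2
        x = x - (1.0 / n) * y
    return x

x_approx = stochastic_approximation(x1=0.0, steps=5000)
```

Despite never observing h(x) exactly, the iterate drifts to the root x* = 2; the noise is averaged out by the decaying learning rates.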

1.2 Convergence proof of SA

Now that SA is defined, we want to show that it actually works. To do so, we can define an expression for the error, and show that the expression converges to 0. Define the mean squared error at step n as follows:

  e_n² = E[(X_n - x*)²]   (1.5)

Exercise 2.

1. (4 points) Prove the following relationship for any two random variables u, v:

     E_u[f(u)] = E_v[E_{u|v}[f(u)|v]]   (1.6)

   Do not assume any kind of independence. We can summarize this relation as E[A] = E[E[A|B]]. Hint: it requires the notion of conditional expectation (E_{u|v}[f(u)|v]). Here is a resource to learn about conditional expectation. You are free to find and use others.

Ultimately, we want to show that the mean squared error will converge to 0 as the number of steps approaches ∞. To do so, we'll need the following relationships:¹

  e_{n+1}² = e_n² - 2 η_n ρ_n + η_n² E[Y_n²]   (1.7)

where ρ_n = E[(X_n - x*) h(X_n)], and Y_n is still the noisy version of h(X_n). This shows us the relationship between two subsequent iterations of the mean squared error.

  e_{n+1}² = e_1² - 2 Σ_{i=1}^n η_i ρ_i + Σ_{i=1}^n η_i² E[Y_i²]   (1.8)

2. (3 points) Knowing that the noise is bounded (1.1), show that E[Y_n²] is also bounded.

3. (3 points) Given that E[Y_n²] is bounded, show that Σ_{i=1}^n η_i² E[Y_i²] is bounded. Hint: use (1.3).

4. (2 points) Let b_n := |X_1| + c Σ_{i=1}^n η_i and d_n = min_{|x|<b_n} h(x)/(x - x*). For this problem, the actual values of d_n and b_n are unimportant; we can show that Σ_{i=1}^∞ η_i d_i = ∞ and Σ_{i=1}^∞ η_i d_i e_i² < ∞. Using these two facts, and assuming e_n² converges, finalize the proof by proving the following:

     lim_{n→∞} e_n² = 0

2 Stochastic First Order Optimization²

2.1 Review

The goal of optimization is to find the x* that minimizes f(x). However, f is again either unknown or very expensive to collect, but we have access to the noisy version g(x):

  E[g(x)] = f(x)   (2.1)

We also assume that we have access to the gradient of g, which is also noisy:

  E[∇g(x)] = ∇f(x)   (2.2)

¹ Extra Credit Ex. 1 asks you to prove the statements we provided without proof in Exercise 2 and may help increase your mathematical understanding of error bounding.

² Before continuing, you may consider attempting Extra Credit Ex. 2, 3, and 4. These exercises ask you to analyze some properties of SA and the order of convergence under SA settings. Again, these are not required.

(Handwritten notes: Y_n is a sequence of random variables; X_n is another. X_{n+1} = X_n - η_n Y_n. E[E[f(u)|v]] also applies to continuous RVs. Bounded noise: P(|z| ≤ c) = 1, c > 0. Example: Lec 8, pg 23, and the notes. In one dimension, ∇f = f'(x); η > 0.)
SLIDE 11

Suppose lim_{n→∞} e_n² = c with c ≠ 0 (so c > 0). Then for every ε > 0 there is an N such that for all n > N, |e_n² - c| < ε, i.e. -ε < e_n² - c < ε, so e_n² > c - ε. With ε = c/2, e_n² > c/2.

SLIDE 12

From (1.8): e_{n+1}² = e_1² - 2 Σ_{i=1}^n η_i ρ_i + Σ_{i=1}^n η_i² E[Y_i²], with ρ_i ≥ d_i e_i².

If e_i² > c/2 for all large i and Σ_{i=1}^∞ η_i d_i = ∞, then Σ η_i d_i e_i² ≥ (c/2) Σ η_i d_i = ∞, a contradiction, since Σ_{i=1}^∞ η_i d_i e_i² < ∞. Hence c = 0.
SLIDE 13

e_n² = E[(X_n - x*)²]

e_{n+1}² = E[(X_{n+1} - x*)²]
         = E[(X_n - η_n Y_n - x*)²]
         = E[ E[(X_n - η_n Y_n - x*)² | X_n] ]

to relate e_{n+1}² with e_n².

SLIDE 14

Conditional expectation

We have seen this for a discrete RV:

  E[X] = Σ_x x p(x)   →   E[X | Y = y] = Σ_x x P(X = x | Y = y)

SLIDE 15

The mean of E[X|Y]: law of iterated expectations

g(y) = E[X | Y = y] = Σ_x x p(x|y)

E[E[X|Y]] = E[g(Y)] = Σ_y g(y) p(y)
          = Σ_y Σ_x x p(x|y) p(y)      (using p(x|y) = p(x, y)/p(y))
          = Σ_x x Σ_y p(x, y)
          = Σ_x x p(x)
          = E[X]
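The chain of sums above can be checked numerically on any small discrete joint distribution; a minimal sketch (the joint table below is made up purely for illustration):

```python
# Joint pmf p(x, y) over x in {0, 1, 2} and y in {0, 1}; entries sum to 1.
p = {(0, 0): 0.10, (0, 1): 0.15,
     (1, 0): 0.20, (1, 1): 0.05,
     (2, 0): 0.30, (2, 1): 0.20}

# Direct computation: E[X] = sum_x x p(x)
e_x = sum(x * pxy for (x, y), pxy in p.items())

# Iterated computation: E[E[X|Y]] = sum_y p(y) * sum_x x p(x|y)
p_y = {y: sum(pxy for (x2, y2), pxy in p.items() if y2 == y) for y in (0, 1)}
e_e = sum(p_y[y] * sum(x * p[(x, y)] / p_y[y] for x in (0, 1, 2))
          for y in (0, 1))
```

Both routes give the same number, as the derivation on this slide promises.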

SLIDE 16

What about E[X|Y] for continuous random variables?

  E[X|Y] = ∫ x p(x|y) dx       (p(x|y) is a density)

  E[f(X)|Y] = ∫ f(x) p(x|y) dx

Ex. 2.1: for E[E[X|Y]] when X, Y are continuous RVs, use ∫ p(x, y) dx = p(y).
SLIDE 17

Stick-breaking example

Take a stick of length ℓ. Break it at a uniformly chosen point Y; then break what is left at a uniformly chosen point X.

  E[X | Y = y] = ∫₀^y x · (1/y) dx = y/2,   so E[X|Y] = Y/2 (a random variable)

  E[X] = E[E[X|Y]] = E[Y/2] = (ℓ/2)/2 = ℓ/4

Does it matter whether the break is from the left or the right?
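The stick example can be checked by simulation; a minimal sketch with ℓ = 1 (the sample size and seed are arbitrary choices for illustration):

```python
import random

random.seed(0)
L = 1.0        # stick length (the l on the slide)
n = 200_000

total = 0.0
for _ in range(n):
    y = random.uniform(0.0, L)  # first break point: Y ~ Uniform(0, L)
    x = random.uniform(0.0, y)  # second break point: X ~ Uniform(0, Y)
    total += x

mean_x = total / n              # should be close to L/4
```

The empirical mean lands near ℓ/4 = 0.25, matching E[X] = E[E[X|Y]] = E[Y/2] = ℓ/4.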

SLIDE 18

CS 361 SP 2020 project

(I) Stochastic First Order Optimization (65 pts)
- stochastic first order approximation
- stochastic first order optimization: GD and SGD
SLIDE 19

First order approximation (textbook, p. 163)

Handwritten notes: for the SVM cost S(a, b), the parameters are θ = (a, b); let's ignore b for now. g is any cost function, η is the learning rate, and η > 0 is small. Update:

  u^(n+1) = u^(n) - η ∇g(u^(n))
SLIDE 20

The difference between GD and SGD for SVM

  GD:   S = (1/N) Σ_{i=1}^N S_i + penalty(||a||)

  SGD:  S ≈ (1/N_b) Σ_i S_i + penalty(||a||), where i is a random sample from (1, ..., N)

The batch size N_b can be just 1! Why would it work?
SLIDE 21

  GD:   X_{n+1} = X_n - η ∇f

  SGD:  X_{n+1} = X_n - η ∇g

  lim_{n→∞} E[(X_n - x*)²] = 0

SLIDE 22

2.1.1 Gradient Descent Algorithm

- Initialize X_1
- For n = 1, 2, ..., do the following update:

    X_{n+1} = X_n - η_n ∇f(X_n)   (2.3)

Exercise 3. (2 points) Recall that root-finding algorithms find the value of x where a function h(x) = 0, and that a gradient descent algorithm finds the minimum of a function. Are these algorithms accomplishing the same goal? Briefly explain why or why not. Your answer should be limited to three lines.

2.1.2 Stochastic Gradient Descent Algorithm

- Initialize X_1
- For n = 1, 2, ..., obtain a noisy version of the gradient (i.e. ∇g(X_n)), then do the following update:

    X_{n+1} = X_n - η_n ∇g(X_n)   (2.4)

Exercise 4. (2 points) Do stochastic gradient descent and stochastic approximation (from section 1) accomplish the same goal? Briefly explain why or why not. Your answer should be limited to three lines.

2.1.3 Empirical Risk Minimization

In a lot of machine learning problems, the training problem boils down to the following format: we are trying to minimize a function f(x),

  f(x) = (1/k) Σ_{j=1}^k Q_j(x)   (2.5)

where Q_j(x) is the loss function for the jth data point and we have k training data points.

Exercise 5. (4 points) Define g(x) to be Q_i(x), and ∇g(x) to be ∇Q_i(x). If there is some noise z = Q_i(x) - f(x), then g is the noisy version of f, so g(x) = Q_i(x) = f(x) + z. Here i is an index chosen randomly and uniformly, with replacement, from 1 to k. Given this definition of f and g, show that equations 2.2, 2.1, and E[z] = 0 are satisfied.
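Exercise 5's claim, that a uniformly sampled Q_i is an unbiased "noisy version" of f, can be verified by direct enumeration on a toy loss set (the Q_j values below are made up purely for illustration):

```python
# Toy per-example losses Q_j(x) evaluated at one fixed x, for k = 4 data points.
Q = [0.5, 1.5, 2.0, 4.0]
k = len(Q)

# f(x) = (1/k) * sum_j Q_j(x)
f = sum(Q) / k

# g(x) = Q_i(x) with i uniform on {1, ..., k}, so E[g(x)] = sum_i (1/k) Q_i(x)
e_g = sum((1.0 / k) * q for q in Q)

# The noise z = Q_i(x) - f(x) then has E[z] = 0 by construction.
e_z = sum((1.0 / k) * (q - f) for q in Q)
```

Enumerating over the uniform index i recovers E[g(x)] = f(x) and E[z] = 0 exactly, which is the content of equations (2.1) and E[z] = 0; the same cancellation argument applies coordinate-wise to the gradients in (2.2).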

2.2 Convergence rates for SGD and GD

Again, we want to optimize a function f with learning rate sequence η_n. We will mention convexity, and we will assume that our functions are convex so we can use the following theorems:

Theorem 2.1 Assume ∇f has a unique root x*. If f is twice continuously differentiable and strongly convex, and η_n = O(n⁻¹), then to reach an approximation error ε = |E[f(X_n)] - f(x*)|, stochastic gradient descent requires n = O(1/ε) updates.

Theorem 2.2 Assume ∇f has a unique root x*. If either the smoothness assumptions about f in Theorem 2.1 or η_n = O(n⁻¹) are not met, then to reach an approximation error ε = |E[f(X_n)] - f(x*)|, stochastic gradient descent requires n = O(1/ε²) updates.

Theorem 2.3 Assume ∇f has a unique root x*. To reach an approximation error ε = |E[f(X_n)] - f(x*)|, gradient descent requires n = O(ln(1/ε)) updates.

(Handwritten notes: E[f(X)] = Σ_x f(x) P(X = x). With i uniform over the k data points, E[g(x)] = E[Q_i(x)] = Σ_i P(i) Q_i(x) = (1/k) Σ_i Q_i(x) = f(x); here x is the parameter vector. Similarly, E[∇g(x)] = E[∇Q_i(x)] = ∇f(x).)
SLIDE 23

Exercise 6. Consider the strongly convex, one-dimensional function f(x) = (1/2)x². Also, assume that g(x) = (1/2)(x + z)² - 1/2, where z is a bounded random noise with mean zero and unit variance.

1. (2 points) Find (d/dx) g(x).
2. (2 points) Does ∇f have a unique root?
3. (2 points) Verify that E[g(x)] = f(x).
4. (2 points) Verify that E[(d/dx) g(x)] = (d/dx) f(x).
5. (2 points) Using one of the theorems above, find the order of updates required by SGD to reach approximation error ε = |E[f(X_n)] - f(x*)| when using g as the noisy version. Consider using the learning rate sequence η_n = 1/n.

3 Comparing SGD vs GD for Empirical Risk Minimization

Let us consider the f minimization task discussed earlier for the ERM task.

- f(x) and g are as described in Exercise 5.
- Let's assume that f is strongly convex and twice continuously differentiable.
- Let's consider using the η_n = 1/n learning rate sequence.
- We have k data points.
- In this part, we are determined to achieve a training precision of ρ = |E[f(X_n)] - f(x*)|.
- Finding ∇Q_i(x) takes c time, and finding ∇f(x) takes kc time.

Exercise 7. (15 points) Fill in Table 1 with complexity orders in terms of k and ρ.

Table 1: Comparing SGD vs GD in terms of training precision

                                   SGD     GD
  Computational cost per update    O(1)
  Number of updates to reach ρ
  Total cost

Explain your answer for each element of the second row by providing a reference to the theorem mentioned in this document. For every other element, explain your answer.

4 Comparing SGD vs GD with respect to test loss

The last exercise was about finding the computational complexity required to reach a specific precision with respect to the optimal training loss. However, a specific precision of training loss does not translate to the same precision of test loss. Here are a couple of interesting facts:

- To achieve a specific test loss precision ε, you need a large enough number of training samples k. The relation ε = O(k^(-β)) must hold for most functions, where 0 < β ≤ 1 is an unknown constant.
- Let's assume ρ = O(ε).


SLIDE 24

Exercise 8. (16 points) Fill in Table 2 with complexity orders in terms of ε and β. You can refer to the previous table, and replace the elements with appropriate values. Make sure you state the reason for each element even if it looks obvious.

Table 2: Comparing SGD vs GD in terms of testing precision

                                   SGD     GD
  Computational cost per update    O(1)
  Number of updates to reach ε
  Total cost

For a typical β = 0.2, explain why choosing GD over SGD can be very unwise.

SLIDE 25

CS 361 SP 2020 project

(I) Stochastic First Order Optimization (65 pts)
(II) Stochastic Optimization Implementation (40 pts)

Remember what a convex function is; SGD doesn't work well on non-convex functions!!

SLIDE 26

Convex set and convex function

If a set is convex, any line connecting two points in the set is completely included in the set.

A convex function: the area above the curve is convex.

  f(λx + (1 - λ)y) < λf(x) + (1 - λ)f(y)

Credit: Dr. Kevin Murphy
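The inequality on this slide can be spot-checked numerically for a known convex function such as f(x) = x². This is an illustrative check at random points, not a proof of convexity (and it uses ≤ with a small tolerance, since equality holds when x = y):

```python
import random

random.seed(1)

def f(t):
    return t * t  # x**2 is a convex function

for _ in range(1000):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    lam = random.random()
    # Convexity: f(lam*x + (1 - lam)*y) <= lam*f(x) + (1 - lam)*f(y)
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-9
```

For example, with x = 2, y = 4, λ = 1/2: f(3) = 9 ≤ (1/2)f(2) + (1/2)f(4) = 10.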

SLIDE 27

Implement a state-of-the-art stochastic optimization algorithm: ADAM (Adaptive Momentum Estimation), a gradient descent algorithm.

- Works well for non-convex problems.
- Try the ADAM algorithm on linear regression problems and neural-network classification → discussion session.
SLIDE 28

The goals are:

- Understand convexity, learning rates, and convergence in practice.
- Understand the difference between methods.
- Critical thinking based on your understanding of the method and problem.
- Coding is minimal given the starter.
SLIDE 29

CS 361: Probability and Statistics for Computer Science

(Spring 2020)

Stochastic Optimization Implementation

To find the starter code, please take a look at the 361 Project Github. There is also a helpful 'CS 361 Final Project Coding Instructions.pdf' that will help you get your environment set up. (Note: if issues arise with a local environment setup, they will be much harder to resolve this semester given the unique remote situation.) Objective: implement state-of-the-art stochastic optimization algorithms for machine learning problems, linear regression and classification (using neural networks).

4.1 Adaptive Momentum Estimation (ADAM) Gradient Descent Algorithm

SGD can be ineffective when the function is highly non-convex. Unfortunately, most modern applications of machine learning involve non-convex optimization problems. One stochastic optimization algorithm that works well even under non-convexity is ADAM [KB14]. ADAM uses past gradient information to "speed up" optimization by smoothing, and the algorithm has been further improved [SSS18]. You will implement the ADAM stochastic optimization algorithm for a linear regression problem.

The pseudo-code for ADAM has been reproduced here from this paper. Credits to [KB14]. Disclaimer: the textbook, in Chapter 13, uses β for parameters, but we will be using θ.

Algorithm 1: g_t² indicates the elementwise square g_t ⊙ g_t. Good default settings for the tested machine learning problems are α = 0.001, β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸. All operations on vectors are element-wise. With β₁ᵗ and β₂ᵗ, we denote β₁ and β₂ to the power t.

  Require: α: stepsize
  Require: ε: division-by-zero control parameter
  Require: β₁, β₂ ∈ [0, 1): exponential decay rates for the moment estimates
  Require: f(θ): stochastic objective function with parameters θ
  Require: θ₀: initial parameter vector
  m₀ ← 0 (initialize 1st moment vector)
  v₀ ← 0 (initialize 2nd moment vector)
  t ← 0 (initialize timestep)
  while θ_t not converged do
    t ← t + 1
    g_t ← ∇_θ f_t(θ_{t-1}) (get gradients w.r.t. stochastic objective at timestep t)
    m_t ← β₁ · m_{t-1} + (1 - β₁) · g_t (update biased first moment estimate)
    v_t ← β₂ · v_{t-1} + (1 - β₂) · g_t² (update biased second raw moment estimate)
    m̂_t ← m_t / (1 - β₁ᵗ) (compute bias-corrected first moment estimate)
    v̂_t ← v_t / (1 - β₂ᵗ) (compute bias-corrected second raw moment estimate)
    θ_t ← θ_{t-1} - α · m̂_t / (√v̂_t + ε) (update parameters)
  end while
  return θ_t (resulting parameters)

Exercise 9. Consider the following problem setting:

- The number of training data points is k = 1000. The number of test data points is 50.
- The ith data point is represented by x_i and is a 10-dimensional vector. Each element of x_i is generated from a uniform distribution over the interval [-1, 1].
- θ_true (i.e. the true parameter set) and θ (i.e. the variable indicating a candidate parameter set) are also 10-dimensional vectors. Assume that θ_true is all ones. However, we will pretend that we do not know it and we want to estimate it using the training data.

SLIDE 30

- Data is generated according to y_i = x_iᵀ θ_true + 0.1 · ξ (where ξ ~ N(0, 1) is a sample from the normal distribution). The label y_i is a scalar.
- Let us assume that all the algorithm variants start from the same initial θ, whose elements are picked uniformly in the interval [0, 0.1].
- Use 1000 updates.
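Algorithm 1 from the previous page can be sketched in plain Python. This is a minimal one-dimensional illustration on a hypothetical quadratic objective f(θ) = θ²/2 with an exact gradient, not the starter code or the full Exercise 9 setup:

```python
import math

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=5000):
    """One-dimensional ADAM loop following Algorithm 1."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)                       # gradient at current parameters
        m = beta1 * m + (1 - beta1) * g       # biased first moment estimate
        v = beta2 * v + (1 - beta2) * g * g   # biased second raw moment estimate
        m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Minimize f(theta) = theta**2 / 2, whose gradient is theta; the minimum is at 0.
theta_star = adam(lambda th: th, theta0=1.0)
```

With the default α = 0.001, the bias-corrected ratio m̂/√v̂ is close to the sign of the gradient while far from the optimum, so the iterate moves roughly α per step and then settles near 0.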

Since this is a regression problem, we need to define a loss function. Let's use the following format:

  Q(θ) = (1/k) Σ_{j=1}^k Q_j(θ)   (4.1)

  Q_j(θ) = |x_jᵀ θ - y_j|^γ   (4.2)

where γ is a hyper-parameter that we control and that defines the objective. When you answer the following questions, snippets of code are not necessary. You should state your findings, provide analysis, and substantiate them with necessary plots in a clean way.

1. (3 points) For γ = 2, the problem becomes the least squares regression that you learned from Chapter 13. State the closed form solution (i.e. θ = ...) in your report, and then implement the solution to solve for θ. Provide the results of the experiment and state whether it is close to the true value.

2. (3 points) For Q_j(θ), find an expression that gives you the gradient ∇Q_j(θ). Report this expression, and implement it in the appropriate function. Hint: for r(θ) = h(e(θ)) = h(x_jᵀ θ - y_j), the gradient can be written as ∇r(θ) = (∂h/∂e) · ∇e(θ) = (∂h/∂e) · x_j according to the chain rule. Hint: the sign function, sgn(x), may be useful.

3. (3 points) Code the ADAM optimization algorithm (with default hyper-parameters such as the learning rate as mentioned in the pseudocode above) and SGD to find the best parameter θ. Use a batch size of 1 for this problem, and perform 1000 parameter updates. Report the final set of parameters.

4. (3 points) Update your code to compute the average test loss at every 50th update, and plot these values. You might notice that the error bars of ADAM and SGD overlap. This is due to the inherent randomness from sampling values. To avoid this probabilistic overlap, increase the number of replicates (num_rep in the starter code) until the error bars between ADAM and SGD do not overlap. Report this curve.

5. (9 points) Run the ADAM method for each of γ = 0.4, 0.7, 1, 2, 3, and 5. Report your observations clearly, and analyze the trends you are seeing. State whether ADAM consistently outperforms SGD. Your analysis should include the reason why one method outperforms the other under each setting.

6. (8 points) Another SGD variant is called ADAGRAD [DHS11]. We define the following for ADAGRAD:

   - Given the noisy gradient g_t at step t, construct the running squared sum G_t = Σ_{n=1}^t g_n ⊙ g_n, where ⊙ represents element-wise multiplication.
   - The update rule at time step t is:

       θ_t = θ_{t-1} - (α / (√G_t + ε)) ⊙ g_t   (4.3)

     where α is the learning rate and ε is the division-by-zero control hyper-parameter.

   (a) Implement ADAGRAD, and show a training plot for the default setting where γ = 2 and α and ε are the same as mentioned in Algorithm 1. How does ADAGRAD do relative to SGD and ADAM? Why?
   (b) A student says "ADAGRAD is better at handling larger learning rates". Use a learning rate of α = 0.1 only for ADAGRAD, and leave the ADAM and SGD learning rates the same as before. How do the results change? Does this confirm the student's claim? Is this a fair experiment? Also provide the training plot.
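The ADAGRAD update (4.3) can likewise be sketched in plain Python. This is a minimal one-dimensional illustration on the same hypothetical quadratic f(θ) = θ²/2, using the α = 0.1 mentioned in part (b); it is not the starter code:

```python
import math

def adagrad(grad, theta0, alpha=0.1, eps=1e-8, steps=500):
    """One-dimensional ADAGRAD loop following update rule (4.3)."""
    theta, G = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        G += g * g                                  # running squared-gradient sum
        theta -= alpha * g / (math.sqrt(G) + eps)   # per-coordinate scaled step
    return theta

# Minimize f(theta) = theta**2 / 2 (gradient: theta), minimum at 0.
theta_star = adagrad(lambda th: th, theta0=1.0)
```

Because G_t only grows, the effective step size α/√G_t shrinks monotonically, which is the "theoretical issue" part (c) hints at: with a small α the steps can decay before the iterate gets anywhere near the optimum.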

SLIDE 31

(c) Keep using α = 0.1 only for ADAGRAD. If you think about the underlying math of ADAGRAD, you may notice some theoretical issues. Tweak the problem hyper-parameters so that ADAGRAD performs worse than only 1 of the other 2 methods (i.e. worse than either ADAM or SGD). Report the plot and explain why ADAGRAD performs better than one method and worse than the other method.³

You are free to tweak things like the loss function, adding artificial noise to the gradient, starting from an unusual initial θ, changing the data generating model, changing the data dimension/size, etc. However, you should justify the changes based on some insight gained from theory. (Randomly changing the setting until a desirable outcome appears is not accepted.)

4.2 Classifying Handwritten Digits with Neural Networks

Next, we'll use neural networks to classify handwritten digits from the MNIST dataset (a dataset of handwritten digits). The objective is to use the different stochastic optimization algorithms that you have seen so far and compare their performances (GD, SGD, ADAM). You will train the model and then classify handwritten digits. A self-contained starter code in Python has been provided for your reference. You need to change a few lines of code to answer the exercise questions.

Fun fact: one of the co-creators of the MNIST dataset (Dr. Yann LeCun) is also the co-recipient of the 2018 ACM Turing award for his work on neural networks.

Exercise 10. We will run the starter code provided, understand different blocks of the code, and then run different gradient descent algorithms.

1. To run the Gradient Descent (GD) algorithm: set the (mini) batch size to 60000 (the size of the MNIST dataset). Run the SGD optimizer with a learning rate of 0.003.
   (1 point) Why is this the same as the GD algorithm if we are using the SGD optimizer?
   (2 points) What do you observe? Report the accuracy.
   (2 points) List at least 2 ways to improve the accuracy.

2. To run the Stochastic Gradient Descent (SGD) algorithm: set the (mini) batch size to 64. Run the SGD optimizer with a learning rate of 0.003.
   (2 points) What do you observe? Report the accuracy.
   (1 point) List at least one way to improve accuracy further.

3. To run the ADAM algorithm: set the (mini) batch size to 64. Run the ADAM optimizer with default learning rates (Algorithm 1). Hint: you can use PyTorch (or any other library in any language) for setting up the ADAM optimizer.
   (2 points) What do you observe? Report the accuracy.
   (1 point) Why does ADAM converge faster than SGD?

References

[RM51] Herbert Robbins and Sutton Monro. "A stochastic approximation method". In: The Annals of Mathematical Statistics (1951), pp. 400–407.
[Sac58] Jerome Sacks. "Asymptotic distribution of stochastic approximation procedures". In: The Annals of Mathematical Statistics 29.2 (1958), pp. 373–405.
[NY83] Arkadi Semenovich Nemirovsky and David Borisovich Yudin. "Problem complexity and method efficiency in optimization." In: Wiley-Interscience Series in Discrete Mathematics (1983).
[CZ07] Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Vol. 24. Cambridge University Press, 2007.
[Nem+09] Arkadi Nemirovski et al. "Robust stochastic approximation approach to stochastic programming". In: SIAM Journal on Optimization 19.4 (2009), pp. 1574–1609.

³ Extra Credit Ex. 5 asks you to implement other optimization algorithms in a similar fashion.
SLIDE 32

[Bot10] Léon Bottou. "Large-scale machine learning with stochastic gradient descent". In: Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177–186.
[DHS11] John Duchi, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization". In: Journal of Machine Learning Research 12.Jul (2011), pp. 2121–2159.
[Bot12] Léon Bottou. "Stochastic gradient descent tricks". In: Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
[DGL13] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Vol. 31. Springer Science & Business Media, 2013.
[JZ13] Rie Johnson and Tong Zhang. "Accelerating stochastic gradient descent using predictive variance reduction". In: Advances in Neural Information Processing Systems. 2013, pp. 315–323.
[Vap13] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2014. arXiv: 1412.6980 [cs.LG].
[MV18] Pierre Moulin and Venugopal Veeravalli. Statistical Inference for Engineers and Data Scientists. Cambridge University Press, 2018.
[SSS18] Sashank Reddi, Satyen Kale, and Sanjiv Kumar. On the Convergence of Adam and Beyond. 2018.

Acknowledgements

Ehsan Saleh and Anay Pattanik created the first draft of the project outline. Ehsan Saleh, Hongye Liu, Ajay Fewell, Muhammed Imran, Brian Yang, and Jinglin contributed to the new edition. This document was compiled and inspired by the work and ideas shown in [RM51; Sac58; NY83; Nem+09; Bot10; JZ13; Bot12; Vap13; DGL13; CZ07; MV18].

SLIDE 33

Motivation

So far, the processes we learned, such as the Bernoulli and Poisson processes, are sequences of independent trials.

There are a lot of real-world situations where sequences of events are NOT independent, in comparison.

A Markov chain is one type of characterization of a series of dependent trials.
SLIDE 34

An example of dependent events in a sequence

I had a glass of wine with my grilled ...

SLIDE 35

An example of dependent events in a sequence

SLIDE 36

Markov chain

A Markov chain is a process in which the outcome of any trial in a sequence is conditioned by the outcome of the trial immediately preceding it, but not by earlier ones.

Such dependence is called chain dependence.

Andrey Markov (1856-1922)

Handwritten: X_1, ..., X_{n-1}, X_n;  P(X_n | X_1, ..., X_{n-1}) = P(X_n | X_{n-1})

SLIDE 37

Markov chain in terms of probability

Let X_0, X_1, ... be a sequence of discrete finite-valued random variables.

The sequence is a Markov chain if the probability distribution of X_t only depends on the distribution of the immediately preceding random variable:

  P(X_t | X_0, ..., X_{t-1}) = P(X_t | X_{t-1})

If the conditional probabilities (transition probabilities) do NOT change with time, it's called a constant Markov chain:

  P(X_t | X_{t-1}) = P(X_{t-1} | X_{t-2}) = ... = P(X_1 | X_0)
SLIDE 38

Coin example

Toss a fair coin until you see two heads in a row and then stop. What is the probability of stopping after exactly n flips?

Use a state diagram, which is a directed graph. Circles are the states of likely outcomes. Arrow directions show the direction of transitions. Numbers over the arrows show transition probabilities.

States:
1) Start, or just had a tail (restart)
2) Had one head after start/restart
3) Two heads in a row (stop)

Each transition out of states 1 and 2 (H or T) has probability 1/2.
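The state diagram above can be turned into a transition matrix, and the stopping-time distribution computed by propagating the state probabilities; a minimal sketch:

```python
# States: 0 = start/just had a tail, 1 = one head, 2 = two heads in a row (stop).
# P[i][j] = probability of moving from state i to state j on one flip.
P = [
    [0.5, 0.5, 0.0],   # from start/tail: T -> restart, H -> one head
    [0.5, 0.0, 0.5],   # from one head:   T -> restart, H -> stop
    [0.0, 0.0, 1.0],   # stop is absorbing
]

def prob_stop_at(n):
    """P(stop after exactly n flips): absorbed at step n but not at step n - 1."""
    dist = [1.0, 0.0, 0.0]   # start in state 0 with probability 1
    absorbed_prev = 0.0
    for _ in range(n):
        absorbed_prev = dist[2]
        dist = [sum(dist[i] * P[i][j] for i in range(3)) for j in range(3)]
    return dist[2] - absorbed_prev
```

For example, stopping after exactly 2 flips requires HH, with probability 1/4, and stopping after exactly 3 flips requires THH, with probability 1/8; the matrix computation reproduces both.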
SLIDE 39

Is this a Markov chain? And why?

SLIDE 40

Is this a Markov chain? And why?

Yes. Because for each trial, the probability distribution of the outcomes is only conditioned on the previous trial.

SLIDE 41

Final Exam

- Time: 7-10pm, 5/12, Central Time. Conflicts need to be requested 1 week ahead with the graduate assistant.
- Duration: 3 hrs
- Content coverage: Ch 1-14, except 8, evenly distributed
- Open book and lecture notes
- Format: 50 multiple choice questions

SLIDE 42

Additional References

- Robert V. Hogg, Elliot A. Tanis and Dale L. Zimmerman. "Probability and Statistical Inference"
- Kevin Murphy, "Machine Learning: A Probabilistic Perspective"

SLIDE 43

Acknowledgement

Thank You!