

Slide 1

HiGrad: Statistical Inference for Stochastic Approximation and Online Learning

Weijie Su

University of Pennsylvania

Slide 2

Collaborator

  • Yuancheng Zhu (UPenn)

Slides 3–4

Learning by optimization

Sample Z1, . . . , ZN, and f(θ, z) is a cost function. Learn the model by minimizing

  argmin_θ (1/N) ∑_{n=1}^N f(θ, Zn)

  • Maximum likelihood estimation (MLE); more generally, M-estimation
  • Often no closed-form solution
  • Need optimization

Slide 5

Gradient descent

◮ Start at some θ0
◮ Iterate

  θj = θj−1 − (γj/N) ∑_{n=1}^N ∇f(θj−1, Zn),

where γj are step sizes. Dates back to Newton, Gauss, and Cauchy.
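To make the iterate concrete, here is a minimal full-batch sketch for least squares; the function name, data layout, and fixed step size are illustrative assumptions, not part of the talk.

```python
import numpy as np

def gradient_descent(X, y, n_steps=100, gamma=0.1):
    # Full-batch GD for f(theta, z) = (1/2)(y - x'theta)^2:
    # theta_j = theta_{j-1} - (gamma_j / N) * sum_n grad f(theta_{j-1}, Z_n)
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_steps):
        grad = X.T @ (X @ theta - y) / N  # average gradient over all N samples
        theta -= gamma * grad
    return theta
```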

Slides 6–8

Difficulty with gradient descent

Modern machine learning

  • Data arrives in a stream
  • Number of data points N is exceedingly large

Gradient descent often not feasible due to

  • Essentially an offline algorithm
  • Evaluating the full gradient is computationally expensive

Slides 9–12

Stochastic gradient descent (SGD)

Aka incremental gradient descent

◮ Start at some θ0
◮ Iterate

  θj = θj−1 − γj∇f(θj−1, Zj)

SGD resolves these challenges

  • Online in nature
  • One pass over data
  • Optimal properties (Nemirovski & Yudin, 1983; Bertsekas, 1999; Agarwal et al, 2012; Rakhlin et al, 2012; Hardt et al, 2015)
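The corresponding one-pass sketch replaces the full gradient with a single-sample gradient; the step-size schedule γj = 0.5 j^{−0.55} is the one used later in the talk, while the function name and least-squares setting are illustrative assumptions.

```python
import numpy as np

def sgd(X, y, gamma=lambda j: 0.5 * j ** -0.55):
    # One pass of SGD: theta_j = theta_{j-1} - gamma_j * grad f(theta_{j-1}, Z_j)
    N, d = X.shape
    theta = np.zeros(d)
    for j in range(1, N + 1):
        x, yj = X[j - 1], y[j - 1]
        theta -= gamma(j) * (x @ theta - yj) * x  # single-sample gradient
    return theta
```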

Slide 13

SGD in one line

Slide 14

SGD vs GD

[Figure: comparison of SGD and GD iterate paths]

Slide 15

SGD: past and now

Statistics

  • Robbins & Monro (1951); Kiefer & Wolfowitz (1952); Robbins & Siegmund (1971); Ruppert (1988); Polyak & Juditsky (1992)

Machine learning and optimization

  • Nesterov & Vial (2008); Nemirovski et al (2009); Bottou (2010); Bach & Moulines (2011); Duchi et al (2011); Kingma & Ba (2014)

Applications

  • Deep learning, recommender systems, MCMC, Kalman filtering, phase retrieval, networks, and many more

Slides 16–17

Using SGD for prediction

Averaged SGD

An estimator of θ∗ := argmin_θ Ef(θ, Z) is given by averaging:

  θ̄ = (1/N) ∑_{j=1}^N θj

Recall that θj = θj−1 − γj∇f(θj−1, Zj) for j = 1, . . . , N.

Given a new instance z = (x, y) with y unknown, interested in µx(θ̄), where

  • Linear regression: µx(θ) = x′θ
  • Logistic regression: µx(θ) = e^{x′θ}/(1 + e^{x′θ})
  • Generalized linear models: µx(θ) = Eθ(Y | X = x)
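A sketch of the averaged estimator and the predicted mean; the helper names are hypothetical, and only the linear and logistic links from the slide are implemented.

```python
import numpy as np

def averaged_sgd(X, y, gamma=lambda j: 0.5 * j ** -0.55):
    # Ruppert-Polyak averaging: theta_bar = (1/N) * sum_j theta_j
    N, d = X.shape
    theta, theta_bar = np.zeros(d), np.zeros(d)
    for j in range(1, N + 1):
        x, yj = X[j - 1], y[j - 1]
        theta -= gamma(j) * (x @ theta - yj) * x
        theta_bar += (theta - theta_bar) / j  # running mean of the iterates
    return theta_bar

def mu_x(theta, x, model="linear"):
    # Predicted mean at a new x: x'theta (linear) or the logistic link
    z = x @ theta
    return z if model == "linear" else 1.0 / (1.0 + np.exp(-z))
```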

Slides 18–19

How much can we trust SGD predictions?

We would observe a different µx(θ̄) if we

  • Re-sample Z′1, . . . , Z′N
  • Sample with replacement N times from a finite population z1, . . . , zm

Decision-making requires uncertainty quantification

  • Should I invest in Bitcoin?
  • How early to leave to catch a flight?

Slide 20

A real data example

Adult dataset on the UCI repository¹

  • 123 features
  • Y = 1 if an individual’s annual income exceeds $50,000
  • 32,561 instances

Randomly pick 1,000 as a test set. Run SGD 500 times independently, each with 20 epochs and step sizes γj = 0.5 j^{−0.55}. Construct empirical confidence intervals with α = 10%.

¹https://archive.ics.uci.edu/ml/datasets/Adult

Slide 21

High variability of SGD predictions

[Figure: confidence interval length (0%–100%) against predicted probability (0.01%–100%) on the test set]

Slides 22–23

What is desired

Can we construct a confidence interval for µ∗x := µx(θ∗)?

Remarks

  • Bootstrap is computationally infeasible
  • Most existing works concern bounding generalization errors or minimizing regrets (Shalev-Shwartz et al, 2011; Rakhlin et al, 2012)
  • Chen et al (2016) proposed a batch-mean estimator of the SGD covariance, and Fang et al (2017) proposed a perturbation-based resampling procedure

Slides 24–27

This talk: HiGrad

A new method: Hierarchical Incremental GRAdient Descent

Properties of HiGrad

◮ Online in nature with the same computational cost as vanilla SGD
◮ A confidence interval for µ∗x in addition to an estimator
◮ Estimator (almost) as accurate as vanilla SGD

Slides 28–33

Preview of HiGrad

[Figure: a HiGrad tree whose level-0 segment, with average θ̄∅, splits into two level-1 segments with averages θ̄1 and θ̄2, giving two threads]

  • θ^1 = (1/3)θ̄∅ + (2/3)θ̄1, θ^2 = (1/3)θ̄∅ + (2/3)θ̄2
  • µ^1_x := µx(θ^1) = 0.15, µ^2_x := µx(θ^2) = 0.11
  • HiGrad estimator: µ̄x = (µ^1_x + µ^2_x)/2 = 0.13
  • The 90% HiGrad confidence interval for µ∗x is
    [µ̄x − t1,0.95 √0.375 |µ^1_x − µ^2_x|, µ̄x + t1,0.95 √0.375 |µ^1_x − µ^2_x|] = [−0.025, 0.285]
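The preview numbers can be reproduced with a two-line check; scipy's t quantile t.ppf(0.95, df=1) ≈ 6.314 plays the role of t1,0.95 (the script itself is an illustration, not from the talk).

```python
from scipy.stats import t

mu1, mu2 = 0.15, 0.11
mu_bar = (mu1 + mu2) / 2                             # HiGrad estimator: 0.13
half = t.ppf(0.95, df=1) * 0.375 ** 0.5 * abs(mu1 - mu2)
print(mu_bar - half, mu_bar + half)                  # approx [-0.025, 0.285]
```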

Slide 34

Outline

  1. Deriving HiGrad
  2. Constructing Confidence Intervals
  3. Configuring HiGrad
  4. Empirical Performance

Slide 35

Problem statement

Minimize convex f:

  θ∗ = argmin_θ f(θ) ≡ Ef(θ, Z)

Observe i.i.d. Z1, . . . , ZN and can evaluate an unbiased noisy gradient g(θ, Z):

  E g(θ, Z) = ∇f(θ) for all θ

To be fulfilled

◮ Online in nature with the same computational cost as vanilla SGD
◮ A confidence interval for µ∗x in addition to an estimator
◮ Estimator (almost) as accurate as vanilla SGD

Slides 36–37

The idea of contrasting and sharing

  • Need more than one value of µx to quantify variability: contrasting
  • Need to share gradient information to elongate threads: sharing

Slides 38–41

The HiGrad tree

  • K + 1 levels
  • Each level-k segment is of length nk and is split into Bk+1 segments
  • Budget constraint (see the sketch below):

  n0 + B1n1 + B1B2n2 + B1B2B3n3 + · · · + B1B2 · · · BKnK = N

An example of a HiGrad tree: B1 = 2, B2 = 3, K = 2

[Figure: the example tree, with one level-0 segment splitting into two level-1 segments, each splitting into three level-2 segments]
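As referenced above, a small helper makes the budget constraint concrete; the function name is hypothetical.

```python
def total_samples(n, B):
    # Total data consumed by a HiGrad tree with segment lengths n = (n0,...,nK)
    # and branching factors B = (B1,...,BK): sum over levels of n_k * B_1*...*B_k
    total, prod = n[0], 1
    for k in range(1, len(n)):
        prod *= B[k - 1]
        total += prod * n[k]
    return total

# Example tree from the slide: B1 = 2, B2 = 3, K = 2, equal segment lengths:
# n0 + 2*n1 + 6*n2 = N, so with n0 = n1 = n2 = 100 the tree needs 900 samples.
assert total_samples((100, 100, 100), (2, 3)) == 900
```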

Slides 42–45

Iterate along the HiGrad tree

Recall: noisy gradient g(θ, Z) unbiased for ∇f(θ); partition {Z^s} of {Z1, . . . , ZN}; and Lk := n0 + · · · + nk

◮ Iterate along the level-0 segment: θj = θj−1 − γj g(θj−1, Zj) for j = 1, . . . , n0, starting from some θ0

◮ Iterate along each level-1 segment s = (b1) for 1 ≤ b1 ≤ B1:

  θ^s_j = θ^s_{j−1} − γ_{j+L0} g(θ^s_{j−1}, Z^s_j)

for j = 1, . . . , n1, starting from θ_{n0}

◮ Generally, for the segment s = (b1 · · · bk), iterate

  θ^s_j = θ^s_{j−1} − γ_{j+L_{k−1}} g(θ^s_{j−1}, Z^s_j)

for j = 1, . . . , nk, starting from θ^{(b1 ··· b_{k−1})}_{n_{k−1}}

Slides 46–48

A second look at the HiGrad tree

An example of a HiGrad tree: B1 = 2, B2 = 3, K = 2

Fulfilled

  • Online in nature with the same computational cost as vanilla SGD

Bonus

Easier to parallelize than vanilla SGD!

Slide 49

The HiGrad algorithm in action

Require: g(·, ·), Z1, . . . , ZN, (n0, n1, . . . , nK), (B1, . . . , BK), step sizes (γ1, . . . , γ_{LK}), θ0
  θ̄^s = 0 for all segments s
  function NodeTreeSGD(θ, s)
    θ^s_0 = θ
    k = #s
    for j = 1 to nk do
      θ^s_j ← θ^s_{j−1} − γ_{j+L_{k−1}} g(θ^s_{j−1}, Z^s_j)
      θ̄^s ← θ̄^s + θ^s_j / nk
    end for
    if k < K then
      for bk+1 = 1 to Bk+1 do
        s+ ← (s, bk+1)
        execute NodeTreeSGD(θ^s_{nk}, s+)
      end for
    end if
  end function
  execute NodeTreeSGD(θ0, ∅)
Output: θ̄^s for all segments s
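Below is a minimal runnable Python transcription of NodeTreeSGD, a sketch under the assumption that the data stream is simply consumed in recursion order (which yields a valid partition {Z^s} since the data are i.i.d.); all names are hypothetical.

```python
import numpy as np

def higrad(g, Z, n, B, gamma, theta0):
    # n = (n0,...,nK) segment lengths, B = (B1,...,BK) branching factors,
    # gamma(j) step sizes, g(theta, z) the noisy gradient.
    # Returns {segment s: theta_bar_s} keyed by tuples, with () the root.
    K = len(B)
    L = np.cumsum(n)                  # L[k] = n0 + ... + nk
    stream = iter(Z)                  # each segment consumes its own chunk
    averages = {}

    def node_tree_sgd(theta, s):
        k = len(s)                    # level of segment s (k = #s)
        offset = 0 if k == 0 else L[k - 1]
        theta = theta.copy()
        theta_bar = np.zeros_like(theta)
        for j in range(1, n[k] + 1):
            theta = theta - gamma(j + offset) * g(theta, next(stream))
            theta_bar += theta / n[k]
        averages[s] = theta_bar
        if k < K:
            for b in range(1, B[k] + 1):
                node_tree_sgd(theta, s + (b,))

    node_tree_sgd(np.asarray(theta0, dtype=float), ())
    return averages
```

For the example tree (B1 = 2, B2 = 3), the recursion produces 1 + 2 + 6 = 9 segment averages and consumes n0 + 2n1 + 6n2 samples in total.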

Slide 50

Outline

  1. Deriving HiGrad
  2. Constructing Confidence Intervals
  3. Configuring HiGrad
  4. Empirical Performance

Slides 51–53

Estimate µ∗x through each thread

Average over each segment s = (b1, . . . , bk):

  θ̄^s = (1/nk) ∑_{j=1}^{nk} θ^s_j

Given weights w0, w1, . . . , wK that sum up to 1, the weighted average along thread t = (b1, . . . , bK) is

  θ^t = ∑_{k=0}^K wk θ̄^{(b1,...,bk)}

Estimator yielded by thread t:

  µ^t_x := µx(θ^t)

How to construct a confidence interval based on T := B1B2 · · · BK many such µ^t_x estimates?
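A sketch of the thread-averaging step, reusing the {segment: average} dictionary produced by the recursion above (names hypothetical):

```python
from itertools import product

def thread_estimates(averages, w, B, mu_x):
    # theta_t = sum_{k=0}^K w_k * theta_bar_{(b1,...,bk)} for each thread
    # t = (b1,...,bK); then mu_x^t = mu_x(theta_t). Note t[:0] == () is the root.
    K = len(B)
    out = {}
    for t in product(*(range(1, b + 1) for b in B)):
        theta_t = sum(w[k] * averages[t[:k]] for k in range(K + 1))
        out[t] = mu_x(theta_t)
    return out
```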

Slide 54

Assume normality

Denote by µx the T-dimensional vector consisting of all µ^t_x

Normality of µx (to be proved soon)

  √N(µx − µ∗x 1) converges weakly to a normal distribution N(0, Σ) as N → ∞

Slides 55–56

Convert to simple linear regression

From µx approximately ∼ N(µ∗x 1, Σ/N) we get

  Σ^{−1/2} µx ≈ (Σ^{−1/2} 1) µ∗x + z̃, z̃ ∼ N(0, I/N)

Simple linear regression! The least-squares estimator of µ∗x is

  (1′Σ^{−1/2}Σ^{−1/2}1)^{−1} 1′Σ^{−1/2}Σ^{−1/2} µx = (1′Σ^{−1}1)^{−1} 1′Σ^{−1} µx = (1/T) ∑_{t∈T} µ^t_x ≡ µ̄x

HiGrad estimator

Just the sample mean µ̄x

Slides 57–58

A t-based confidence interval

A pivot for µ∗x:

  (µ̄x − µ∗x) / SEx approximately ∼ tT−1,

where the standard error is given as

  SEx = √( (µx − µ̄x1)′ Σ^{−1} (µx − µ̄x1) / (T − 1) ) · √(1′Σ1) / T

HiGrad confidence interval of coverage 1 − α:

  [µ̄x − tT−1,1−α/2 SEx, µ̄x + tT−1,1−α/2 SEx]
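A sketch of the interval computation, given a NumPy vector of the T thread estimates and any Σ known up to a scalar (the scalar cancels in SEx); names hypothetical.

```python
import numpy as np
from scipy.stats import t

def higrad_ci(mu, Sigma, alpha=0.1):
    # SE = sqrt((mu - mu_bar 1)' Sigma^{-1} (mu - mu_bar 1) / (T-1))
    #      * sqrt(1' Sigma 1) / T; rescaling Sigma leaves SE unchanged.
    T = len(mu)
    mu_bar = mu.mean()
    r = mu - mu_bar
    se = np.sqrt(r @ np.linalg.solve(Sigma, r) / (T - 1)) * np.sqrt(Sigma.sum()) / T
    q = t.ppf(1 - alpha / 2, df=T - 1)
    return mu_bar, (mu_bar - q * se, mu_bar + q * se)
```
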
Slide 59

Do we know the covariance Σ?

Slides 60–63

An extension of Ruppert–Polyak normality

Given a thread t = (b1, . . . , bK), denote its segments by sk = (b1, b2, . . . , bk)

Fact (informal)

  √n0(θ̄^{s0} − θ∗), √n1(θ̄^{s1} − θ∗), . . . , √nK(θ̄^{sK} − θ∗) converge to i.i.d. centered normal distributions

  • Hessian H = ∇²f(θ∗) and V = E[g(θ∗, Z)g(θ∗, Z)′]. Ruppert (1988), Polyak (1990), and Polyak and Juditsky (1992) prove √N(θ̄N − θ∗) ⇒ N(0, H⁻¹V H⁻¹)
  • Difficult to estimate the sandwich covariance H⁻¹V H⁻¹ (Chen et al, 2016)
  • To know the covariance of {µx(θ^t)}, do we really need to know H⁻¹V H⁻¹?

Slides 64–65

Covariance determined by the number of shared segments

Consider µx(θ) = T(x)′θ and observe

  • √n0(µx(θ̄^{s0}) − µ∗x), √n1(µx(θ̄^{s1}) − µ∗x), . . . , √nK(µx(θ̄^{sK}) − µ∗x) converge to i.i.d. centered univariate normal distributions
  • µ^t_x − µ∗x = µx(θ^t) − µ∗x = ∑_{k=0}^K wk (µx(θ̄^{sk}) − µ∗x)

Fact (informal)

For any two threads t and t′ that agree at the first k segments and differ henceforth, we have

  Cov(µ^t_x, µ^{t′}_x) = (1 + o(1)) σ² ∑_{i=0}^k w²_i / ni

Slides 66–68

Specify Σ up to a multiplicative factor

If µx(θ) = T(x)′θ, then for any two threads t and t′ that agree only at the first k segments,

  Σ_{t,t′} = (1 + o(1)) C ∑_{i=0}^k w²_i N / ni

  • Do we need to know C as well?
  • No! The standard error of µ̄x is invariant under multiplying Σ by a scalar:

  SEx = √( (µx − µ̄x1)′ Σ^{−1} (µx − µ̄x1) / (T − 1) ) · √(1′Σ1) / T
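A sketch that builds Σ up to the factor C directly from the tree structure (names hypothetical); C is omitted since it cancels in SEx.

```python
import numpy as np
from itertools import product

def sigma_matrix(n, B, w, N):
    # Sigma[t,t'] proportional to sum_{i=0}^{k} w_i^2 * N / n_i, where
    # threads t, t' share the segments s_0, ..., s_k.
    threads = list(product(*(range(1, b + 1) for b in B)))
    T = len(threads)
    Sigma = np.zeros((T, T))
    for a, t in enumerate(threads):
        for c, t2 in enumerate(threads):
            k = 0
            while k < len(B) and t[k] == t2[k]:
                k += 1                      # count shared branching choices
            Sigma[a, c] = sum(w[i] ** 2 * N / n[i] for i in range(k + 1))
    return Sigma
```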

Slide 69

Some remarks

  • In generalized linear models, µx often takes the form µx(θ) = η⁻¹(T(x)′θ) for an increasing η. Construct a confidence interval for η(µ∗x) and then invert it
  • For a general nonlinear but smooth µx(θ), use the delta method
  • Need less than Ruppert–Polyak: the results remain valid whenever √N(θ̄N − θ∗) converges to some centered normal distribution

Slide 70

Formal statement of theoretical results

Slide 71

Assumptions

  1. Local strong convexity. f(θ) ≡ Ef(θ, Z) is convex and differentiable with Lipschitz gradients; the Hessian ∇²f(θ) is locally Lipschitz and positive-definite at θ∗
  2. Noise regularity. V(θ) = E[g(θ, Z)g(θ, Z)′] is Lipschitz and does not grow too fast; the noisy gradient g(θ, Z) has a 2 + o(1) moment locally at θ∗

Slide 72

Examples satisfying the assumptions

  • Linear regression: f(θ, z) = (1/2)(y − x′θ)²
  • Logistic regression: f(θ, z) = −y x′θ + log(1 + e^{x′θ})
  • Penalized regression: add a ridge penalty λ‖θ‖²
  • Huber regression: f(θ, z) = ρλ(y − x′θ), where ρλ(a) = a²/2 for |a| ≤ λ and ρλ(a) = λ|a| − λ²/2 otherwise

Sufficient conditions

  X in generic position, E‖X‖^{4+o(1)} < ∞, and E|Y|^{2+o(1)}‖X‖^{2+o(1)} < ∞

Slides 73–75

Main theoretical results

Theorem (S. and Zhu)

Assume K and B1, . . . , BK are fixed, nk ∝ N as N → ∞, and µx has a nonzero derivative at θ∗. Taking γj ≍ j^{−α} for α ∈ (0.5, 1) gives

  (µ̄x − µ∗x) / SEx ⇒ tT−1

Confidence intervals

  lim_{N→∞} P( µ∗x ∈ [µ̄x − tT−1,1−α/2 SEx, µ̄x + tT−1,1−α/2 SEx] ) = 1 − α

Fulfilled

  • Online in nature with the same computational cost as vanilla SGD
  • A confidence interval for µ∗x in addition to an estimator

Slide 76

How accurate is the HiGrad estimator?

Slides 77–80

Optimal variance with optimal weights

By Cauchy–Schwarz,

  N Var(µ̄x) = (1 + o(1)) σ² [∑_{k=0}^K nk ∏_{i=1}^k Bi] [∑_{k=0}^K w²_k / (nk ∏_{i=1}^k Bi)]
             ≥ (1 + o(1)) σ² (∑_{k=0}^K wk)² = (1 + o(1)) σ²,

with equality if w∗_k = nk ∏_{i=1}^k Bi / N

  • Segments at an early level are weighted less
  • The HiGrad estimator has the same asymptotic variance as vanilla SGD
  • Achieves the Cramér–Rao lower bound when the model is correctly specified
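A sketch of the optimal weights w∗_k = nk ∏_{i=1}^k Bi / N (function name hypothetical):

```python
def optimal_weights(n, B, N):
    # w*_k = n_k * (B_1 * ... * B_k) / N, so earlier levels get less weight.
    prod, w = 1, []
    for k, nk in enumerate(n):
        if k > 0:
            prod *= B[k - 1]
        w.append(nk * prod / N)
    return w  # sums to 1 whenever the budget constraint holds
```

For the default configuration K = 2, B1 = B2 = 2, n0 = n1 = n2 = N/7 (introduced later), this gives w∗ = (1/7, 2/7, 4/7).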

Slide 81

Prediction intervals for vanilla SGD

Theorem (S. and Zhu)

Run vanilla SGD on a fresh dataset of the same size, producing µ^SGD_x. Then, with the optimal weights,

  lim_{N→∞} P( µ^SGD_x ∈ [µ̄x − √2 tT−1,1−α/2 SEx, µ̄x + √2 tT−1,1−α/2 SEx] ) = 1 − α

  • µ^SGD_x can be replaced by a HiGrad estimator with the same structure
  • Interpretable even under model misspecification

Slide 82

HiGrad enjoys three appreciable properties

Under certain assumptions, for example, f being locally strongly convex:

Fulfilled

  • Online in nature with the same computational cost as vanilla SGD
  • A confidence interval for µ∗x in addition to an estimator
  • Estimator (almost) as accurate as vanilla SGD

Slide 83

Outline

  1. Deriving HiGrad
  2. Constructing Confidence Intervals
  3. Configuring HiGrad
  4. Empirical Performance

Slide 84

Which one?

Slides 85–88

Length of confidence intervals

Denote by LCI = 2 tT−1,1−α/2 SEx the length of the HiGrad confidence interval.

Proposition (S. and Zhu)

  √N · E LCI → 2σ √2 tT−1,1−α/2 Γ(T/2) / (√(T − 1) Γ((T − 1)/2))

  • The function tT−1,1−α/2 Γ(T/2) / (√(T − 1) Γ((T − 1)/2)) is decreasing in T ≥ 2
  • The more threads, the shorter the HiGrad confidence interval on average
  • More contrasting leads to shorter confidence intervals
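The proposition's T-dependent factor can be tabulated directly; a sketch using scipy (gammaln keeps the gamma-function ratio numerically stable):

```python
import numpy as np
from scipy.stats import t
from scipy.special import gammaln

def length_factor(T, alpha=0.05):
    # t_{T-1,1-alpha/2} * Gamma(T/2) / (sqrt(T-1) * Gamma((T-1)/2))
    q = t.ppf(1 - alpha / 2, df=T - 1)
    return q * np.exp(gammaln(T / 2) - gammaln((T - 1) / 2)) / np.sqrt(T - 1)

# Decreasing in T: steep drop from T = 2, flattening quickly afterwards
print([round(length_factor(T), 3) for T in range(2, 11)])
```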

Slide 89

Really want to set T = 1000?

Slide 90

T = 4 is sufficient

[Figure: plot of tT−1,0.975 Γ(T/2) / (√(T − 1) Γ(T/2 − 0.5)) against T = 2, . . . , 10]

  • Too many threads result in inaccurate normality (unless N is huge)
  • Large T leads to much contrasting and little sharing

Slides 91–93

How to choose (n0, . . . , nK)?

  n0 + B1n1 + B1B2n2 + B1B2B3n3 + · · · + B1B2 · · · BKnK = N

Length of each thread

  LK := n0 + n1 + · · · + nK

  • Sharing: want a larger LK by setting n0 > n1 > · · · > nK
  • Contrasting: want n0 < n1 < · · · < nK

Slide 94

Outline

  1. Deriving HiGrad
  2. Constructing Confidence Intervals
  3. Configuring HiGrad
  4. Empirical Performance

Slide 95

General simulation setup

X generated as i.i.d. N(0, 1) and Z = (X, Y) ∈ R^d × R. Set N = 10⁶ and use γj = 0.5 j^{−0.55}.

  • Linear regression: Y ∼ N(µX(θ∗), 1), where µx(θ) = x′θ
  • Logistic regression: Y ∼ Bernoulli(µX(θ∗)), where µx(θ) = e^{x′θ}/(1 + e^{x′θ})

Criteria

  • Accuracy: ‖θ̄ − θ∗‖², where θ̄ is averaged over the T threads
  • Coverage probability and length of confidence intervals
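A sketch of the data-generating process (the seed and function name are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta_star, N=10**6, model="linear"):
    # X ~ N(0, I_d); Y ~ N(x'theta*, 1) (linear) or
    # Bernoulli(sigmoid(x'theta*)) (logistic), as in the setup above.
    d = len(theta_star)
    X = rng.standard_normal((N, d))
    eta = X @ theta_star
    if model == "linear":
        Y = eta + rng.standard_normal(N)
    else:
        Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return X, Y
```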

Slide 96

Accuracy

Dimension d = 50. MSE ‖θ̄ − θ∗‖² normalized by that of vanilla SGD

  • Null case: θ∗1 = · · · = θ∗50 = 0
  • Dense case: θ∗1 = · · · = θ∗50 = 1/√50
  • Sparse case: θ∗1 = · · · = θ∗5 = 1/√5, θ∗6 = · · · = θ∗50 = 0

Slide 97

Accuracy

[Figure: six panels of normalized risk (1.00–1.30) against the total number of steps (1e+04 to 5e+05), for linear and logistic regression in the null, sparse, and dense cases]

Slide 98

Coverage and CI length

HiGrad configurations

  • K = 1, with n1 = n0 (r = 1)
  • K = 2, with n1/n0 = n2/n1 = r ∈ {0.75, 1, 1.25, 1.5}

Set θ∗_i = (i − 1)/d for i = 1, . . . , d and α = 5%. Use the measure

  (1/20) ∑_{i=1}^{20} 1(µ_{x_i}(θ∗) ∈ CI_{x_i})

Slide 99

Linear regression: d = 20

(K, B, r)     Coverage   CI length
1, 4, 1       0.9348     0.0621
1, 8, 1       0.9245     0.0618
1, 12, 1      0.9185     0.0606
1, 16, 1      0.925      0.0605
1, 20, 1      0.9378     0.0633
2, 2, 1       0.935      0.062
2, 2, 1.25    0.9318     0.0614
2, 2, 1.5     0.924      0.061
2, 2, 2       0.9448     0.0815
3, 2, 1       0.9452     0.0828
3, 2, 1.25    0.9472     0.0811
3, 2, 1.5     0.9425     0.0801
3, 2, 2       0.8488     0.0637
2, 3, 1       0.887      0.0637
2, 3, 1.25    0.9185     0.0653
2, 3, 1.5     0.938      0.0683
2, 3, 2       0.956      0.0851

Slide 100

Linear regression: d = 100

(K, B, r)     Coverage   CI length
1, 4, 1       0.9115     0.15
1, 8, 1       0.897      0.1491
1, 12, 1      0.8992     0.1466
1, 16, 1      0.894      0.1457
1, 20, 1      0.917      0.1489
2, 2, 1       0.9148     0.1453
2, 2, 1.25    0.9065     0.1428
2, 2, 1.5     0.9        0.1412
2, 2, 2       0.9302     0.1972
3, 2, 1       0.9358     0.1946
3, 2, 1.25    0.9338     0.1927
3, 2, 1.5     0.9312     0.1917
3, 2, 2       0.9125     0.2649
2, 3, 1       0.92       0.2495
2, 3, 1.25    0.9308     0.2312
2, 3, 1.5     0.9478     0.2197
2, 3, 2       0.9472     0.2403

Slide 101

A real data example: setup

From the 1994 census data on the UCI repository. Y indicates whether an individual’s annual income exceeds $50,000

  • 123 features
  • 32,561 instances
  • Randomly pick 1,000 as a test set

Use N = 10⁶, α = 10%, and γj = 0.5 j^{−0.55}. Run HiGrad L = 500 times. Use the measure

  coverage_i = (1/(L(L − 1))) ∑_{ℓ1} ∑_{ℓ2 ≠ ℓ1} 1(p̂_{i,ℓ1} ∈ PI_{i,ℓ2})

Slide 102

A real data example: histogram

[Figure: histogram of coverage probabilities (0.0–1.0) over test points, with counts up to about 300]

Slide 103

Comparisons of HiGrad configurations

[Table: accuracy, coverage, and CI length compared across HiGrad configurations]

Slide 104

Default HiGrad parameters

HiGrad R package default values:

  K = 2, B1 = 2, B2 = 2, n0 = n1 = n2 = N/7
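A quick check that the default configuration exhausts the budget (the value of N here is an arbitrary multiple of 7):

```python
# Default tree: K = 2, B1 = B2 = 2, n0 = n1 = n2 = N/7, so the budget
# n0 + B1*n1 + B1*B2*n2 = (1 + 2 + 4) * N/7 = N holds, with T = 4 threads.
N = 7 * 10**5
n0 = n1 = n2 = N // 7
assert n0 + 2 * n1 + 4 * n2 == N
```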

Slide 105

Concluding Remarks

Slide 106

Straightforward extensions

  • Flexible tree structures: the HiGrad tree can be asymmetric
  • N unknown: grow the tree assuming a lower bound on N
  • Burn-in: get a better initial point
  • A criterion for stopping: need to incorporate selective inference
  • Mini-batch sizes: evaluate the (less) noisy gradient g(θ, Z_{1:m}) = (1/m) ∑_{i=1}^m g(θ, Zi), as sketched below
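A one-line sketch of the mini-batch gradient from the last bullet (function name hypothetical):

```python
def minibatch_gradient(g, theta, batch):
    # g(theta, Z_{1:m}) = (1/m) * sum_i g(theta, Z_i): averaging m i.i.d.
    # single-sample gradients reduces the noise variance by a factor of m.
    return sum(g(theta, z) for z in batch) / len(batch)
```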

Slides 107–110

Future extensions

Improving statistical properties

◮ Finite-sample guarantees: better coverage probability
◮ Extend Ruppert–Polyak to high dimensions: number of unknown variables growing

A new template for online learning

◮ Adaptive step sizes and pre-conditioned SGD: AdaGrad (Duchi et al, 2011) and Adam (Kingma & Ba, 2014)
◮ General convex optimization and non-convex problems: SVM, regularized GLM, and deep learning

Slides 111–113

Take-home messages

Idea

Contrasting and sharing through hierarchical splitting

Properties (under local strong convexity)

◮ Online in nature with the same computational cost as vanilla SGD
◮ A confidence interval for µ∗x in addition to an estimator
◮ Estimator (almost) as accurate as vanilla SGD

Bonus

Easier to parallelize than vanilla SGD!

Slide 114

Thanks!

  • Reference. Statistical Inference for Stochastic Approximation and Online Learning via Hierarchical Incremental Gradient Descent, Weijie Su and Yuancheng Zhu, coming soon
  • Software. R package HiGrad, coming soon