1. Statistical Aspects of Quantum Computing
Yazhen Wang, Department of Statistics, University of Wisconsin-Madison
http://www.stat.wisc.edu/~yzwang
Near-term Applications of Quantum Computing, Fermilab, December 6-7, 2017

2. Outline
• Statistical learning with quantum annealing
• Statistical analysis of quantum computing data

3-5. Statistics and Optimization
MLE/M-estimation, non-parametric smoothing, ...
• Stochastic optimization problem: $\min_\theta L(\theta; X_n) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; X_i)$
• The minimization solution gives an estimator or a classifier.
Examples: $\ell(\theta; X_i)$ = log pdf; residual sum of squares; loss + penalty
Take $g(\theta) = E[L(\theta; X_n)] = E[\ell(\theta; X_1)]$
• Optimization problem: $\min_\theta g(\theta)$
• The minimization solution defines the true parameter value.
Goals: use the data $X_n$ to
(i) evaluate estimators/classifiers (the minimization solutions) - Computing
(ii) study estimators/classifiers statistically - Inference
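To make the estimator-versus-true-parameter distinction concrete, here is a minimal sketch (not from the slides; the data, the loss, and the grid search are illustrative choices) that minimizes the empirical loss $\frac{1}{n}\sum_i (X_i - \theta)^2$ and compares the result with the population minimizer of $g(\theta) = E[(X_1 - \theta)^2]$:

```python
# Minimal sketch (illustrative, not from the slides): the stochastic
# optimization problem min_theta (1/n) sum_i ell(theta; X_i) versus the
# population problem min_theta g(theta) = E[ell(theta; X_1)],
# with ell(theta; x) = (x - theta)^2.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=1000)   # data X_1, ..., X_n

def empirical_loss(theta):
    """L(theta; X_n) = (1/n) sum_i (X_i - theta)^2."""
    return np.mean((X - theta) ** 2)

# Empirical minimizer (the estimator): grid minimization for transparency.
grid = np.linspace(0.0, 4.0, 4001)
theta_hat = grid[np.argmin([empirical_loss(t) for t in grid])]

# Population minimizer (the true parameter value): the argmin of
# g(theta) = E[(X_1 - theta)^2] is E[X_1] = 2.0.
print("estimator (empirical minimizer):", theta_hat)
print("true parameter (population minimizer): 2.0")
```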

6-8. Computer Power Demand
[Figure slides: Computer Power Demand, driven by scientific studies, computational applications, and BIG DATA.]

9-12. Learning Examples
Machine learning and compressed sensing
• Matrix completion, matrix factorization, tensor decomposition, phase retrieval, neural networks.
Neural network: layers in a chain structure; each layer is a function of the layer preceding it.
Layer $j$: $h_j = g_j(a_j h_{j-1} + b_j)$, where $(a_j, b_j)$ are the weights and $g_j$ is the activation function (sigmoid, softmax, or rectifier).
[Figures: History; Dog vs cat.]
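A minimal sketch of the chain structure $h_j = g_j(a_j h_{j-1} + b_j)$, with made-up layer sizes, random weights, and the activations named on the slide (no training is performed):

```python
# Minimal sketch (illustrative): a chain-structured network where layer j
# computes h_j = g_j(A_j h_{j-1} + b_j). Weights, shapes, and the choice of
# activations are made up for the example.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):                      # "rectifier" in the slide's terminology
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 3]                      # input -> hidden -> hidden -> output
activations = [relu, sigmoid, softmax]          # g_1, g_2, g_3
weights = [(rng.normal(size=(m, k)), rng.normal(size=m))  # (A_j, b_j)
           for k, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    """Apply the layers in order: h_0 = x, h_j = g_j(A_j h_{j-1} + b_j)."""
    h = x
    for (A, b), g in zip(weights, activations):
        h = g(A @ h + b)
    return h

x = rng.normal(size=layer_sizes[0])
print(forward(x))                 # output of the final (softmax) layer
```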

13-17. Gradient Descent Algorithms: solve $\min_\theta g(\theta)$
Gradient descent algorithm
• Start at an initial value $x_0$: $x_k = x_{k-1} - \delta \nabla g(x_{k-1})$, where $\delta$ is the learning rate and $\nabla$ is the gradient operator.
• A continuous curve $X_t$ approximating the discrete iterates $\{x_k : k \ge 0\}$ obeys the differential equation $\dot{X}_t + \nabla g(X_t) = 0$, where $\dot{X}_t = dX_t/dt$.
Accelerated gradient descent algorithm (Nesterov)
• Start at initial values $x_0$ and $y_0 = x_0$: $x_k = y_{k-1} - \delta \nabla g(y_{k-1})$, $y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1})$.
• A continuous curve $X_t$ approximating the discrete iterates $\{x_k : k \ge 0\}$ obeys the differential equation $\ddot{X}_t + \frac{3}{t} \dot{X}_t + \nabla g(X_t) = 0$, where $\ddot{X}_t = d^2 X_t/dt^2$.
Convergence to the minimization solution as $k, t \to \infty$: rate $1/k$ (or $1/t$) for plain gradient descent; rate $1/k^2$ (or $1/t^2$) for the accelerated case.
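The two recursions above can be checked numerically. A minimal sketch, assuming a simple quadratic objective $g(x) = \|x - c\|^2$ whose minimizer is known (the objective, step size, and iteration count are illustrative):

```python
# Minimal sketch (illustrative): plain and Nesterov-accelerated gradient
# descent, as written on the slide, applied to g(x) = ||x - c||^2.
import numpy as np

c = np.array([1.0, -3.0])

def grad_g(x):
    """Gradient of g(x) = ||x - c||^2."""
    return 2.0 * (x - c)

delta = 0.05          # learning rate
K = 200               # number of iterations

# Plain gradient descent: x_k = x_{k-1} - delta * grad g(x_{k-1})
x = np.zeros(2)
for k in range(1, K + 1):
    x = x - delta * grad_g(x)

# Nesterov acceleration: x_k = y_{k-1} - delta * grad g(y_{k-1}),
#                        y_k = x_k + (k - 1) / (k + 2) * (x_k - x_{k-1})
x_acc = np.zeros(2)
y = x_acc.copy()
x_prev = x_acc.copy()
for k in range(1, K + 1):
    x_acc = y - delta * grad_g(y)
    y = x_acc + (k - 1) / (k + 2) * (x_acc - x_prev)
    x_prev = x_acc

print("plain GD:      ", x)       # both should be close to c = (1, -3)
print("accelerated GD:", x_acc)
```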

18-21. Stochastic Gradient Descent
Stochastic optimization: $\min_\theta L(\theta; X_n)$, $X_n = (X_1, \ldots, X_n)$
• Gradient descent algorithm to compute $x_k$ iteratively: $x_k = x_{k-1} - \delta \nabla L(x_{k-1}; X_n)$, where $\nabla L(\theta; X_n) = \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(\theta; X_i)$.
Big data: it is expensive to evaluate all $\nabla \ell(\theta; X_i)$ at each iteration.
• Replace $\nabla L(\theta; X_n)$ by $\nabla \hat{L}_m(\theta; X^*_m) = \frac{1}{m} \sum_{j=1}^{m} \nabla \ell(\theta; X^*_j)$, $m \ll n$, where $X^*_m = (X^*_1, \ldots, X^*_m)$ is a subsample of $X_n$ (a minibatch or bootstrap sample).
Stochastic gradient descent algorithm: $x^*_k = x^*_{k-1} - \delta \nabla \hat{L}_m(x^*_{k-1}; X^*_m)$
A continuous curve $X^*_t$ approximating the discrete iterates $\{x^*_k : k \ge 0\}$ obeys a stochastic differential equation.
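A minimal sketch of the minibatch update $x^*_k = x^*_{k-1} - \delta \nabla \hat{L}_m(x^*_{k-1}; X^*_m)$, assuming a least-squares loss and a fresh random subsample of size $m \ll n$ at each iteration (all data and tuning choices are illustrative):

```python
# Minimal sketch (illustrative): minibatch stochastic gradient descent for
# the least-squares loss ell(theta; X_i) = (y_i - z_i . theta)^2, replacing
# the full-sample gradient by a minibatch average with m << n.
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 10_000, 5, 32                     # sample size, dimension, minibatch size
theta_true = rng.normal(size=p)
Z = rng.normal(size=(n, p))
y = Z @ theta_true + rng.normal(scale=0.5, size=n)

def minibatch_grad(theta, idx):
    """(1/m) * sum over the minibatch of grad ell(theta; X_j)."""
    resid = y[idx] - Z[idx] @ theta
    return -2.0 * Z[idx].T @ resid / len(idx)

delta = 0.01                                # learning rate
theta = np.zeros(p)
for _ in range(3000):
    idx = rng.choice(n, size=m, replace=False)   # subsample (minibatch)
    theta -= delta * minibatch_grad(theta, idx)

print("max abs error:", np.max(np.abs(theta - theta_true)))
```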

22-23. Gradient Descent vs Stochastic Gradient Descent
[Figure slides: Gradient Descent; Stochastic Gradient Descent.]

24-25. Statistical Analysis of Gradient Descent (Wang, 2017)
Continuous curve model: the stochastic differential equation $dX^*_t + \nabla g(X^*_t)\,dt + \sigma(X^*_t)\,dW_t = 0$, where $W_t$ is Brownian motion.
For the accelerated case: a second-order stochastic differential equation.
The algorithms and their asymptotic distributions as $m, n \to \infty$ are studied via these stochastic differential equations.
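For intuition about the continuous-time model, here is a minimal Euler-Maruyama simulation sketch; it is not the paper's implementation, and it assumes a quadratic $g$ and a constant diffusion coefficient $\sigma$:

```python
# Minimal sketch (not the paper's implementation): Euler-Maruyama simulation
# of the continuous-time model dX_t = -grad g(X_t) dt - sigma(X_t) dW_t,
# assuming a quadratic g and a constant diffusion coefficient sigma.
import numpy as np

rng = np.random.default_rng(0)

c = np.array([1.0, -3.0])

def grad_g(x):
    return 2.0 * (x - c)           # g(x) = ||x - c||^2

def sigma(x):
    return 0.1                     # constant noise level (an assumption)

dt = 1e-3
T = 5.0
steps = int(T / dt)
X = np.zeros(2)                    # X_0
path = [X.copy()]
for _ in range(steps):
    dW = rng.normal(scale=np.sqrt(dt), size=2)   # Brownian increment
    X = X - grad_g(X) * dt - sigma(X) * dW
    path.append(X.copy())

path = np.array(path)
print("X_T:", path[-1])            # drifts toward the minimizer c, plus noise
```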
