On the interplay of network structure and gradient convergence in deep learning. Vikas Singh, Vamsi K. Ithapu, Sathya N. Ravi. Computer Sciences; Biostatistics and Medical Informatics, University of Wisconsin-Madison. Sep 28, 2016.


  1. The Problem – reformulated: We need informed, systematic design strategies for choosing the network structure.

  2. The Solution strategy – This work: What is the best possible network for the given task? We need informed design strategies. Part I: construct the relevant bounds (gradient convergence + learning mechanism + network/data statistics). Part II: construct design procedures using the bounds; for the given dataset and a pre-specified convergence level, find the depth, hidden layer lengths, etc.

  3. The Interplay: Gradient convergence + learning mechanism + network/data statistics. The relevant network/data quantities: the depth parameter L; the layer lengths (d_0, d_1, ..., d_{L-1}, d_L); the activation functions (σ_1, ..., σ_L), assumed bounded and smooth (focus on the sigmoid); and the average first and second moments of the data, μ_x = (1/d_0) Σ_j E[x_j] and τ_x = (1/d_0) Σ_j E[x_j²].
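
As a concrete reading of these statistics, a minimal sketch (assuming a sample matrix X of shape (n, d_0), estimating the expectations by sample means; the names are illustrative, not from the slides):

```python
import numpy as np

def data_statistics(X):
    """Coordinate-averaged first and second moments of the data:
    mu_x  = (1/d0) * sum_j E[x_j]
    tau_x = (1/d0) * sum_j E[x_j^2]
    """
    n, d0 = X.shape
    mu_x = X.mean(axis=0).sum() / d0          # average first moment
    tau_x = (X ** 2).mean(axis=0).sum() / d0  # average second moment
    return mu_x, tau_x
```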

  4. The Interplay – Learning mechanism: min_W f(W) := E_{(x,y)∼X} L(x, y; W), with L := the ℓ2 loss. Two settings: W ∈ R^d, handled with stochastic gradients, or W ∈ Ω := [−w, w]^d (a box constraint), handled with projected stochastic gradients.
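
In the box-constrained case the projection onto Ω = [−w, w]^d is a coordinate-wise clip, so a projected stochastic gradient step can be sketched as follows (the gradient is a placeholder input, not part of the slides):

```python
import numpy as np

def projected_sgd_step(W, grad, gamma, w):
    """One projected step: W <- Proj_{[-w,w]^d}(W - gamma * grad)."""
    return np.clip(W - gamma * grad, -w, w)
```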

  5. The Interplay – Gradient convergence: Ideally we are interested in generalization; instead we settle for control on the last/stopping iteration. R: the last iteration (in general, training time is fixed a priori). The quantity of interest is the expected gradients, ∆ := E_{R,x,y} ‖∇_W f(W^R)‖². Under mild assumptions, ∆ can be bounded whenever R is chosen randomly [Ghadimi and Lan 2013].

  6. The recipe: gradient backpropagation + a randomly chosen stopping iteration (sketched below).
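
A minimal sketch of this recipe, assuming a generic stochastic gradient oracle and a given stopping distribution P_R over iterations (both placeholders; the deck does not pin down these interfaces):

```python
import numpy as np

def random_stop_sgd(W0, stoch_grad, gammas, P_R, rng=np.random.default_rng(0)):
    """Run SGD and return the iterate at a randomly chosen stopping iteration.

    stoch_grad(W) : stochastic gradient oracle (placeholder assumption)
    gammas        : stepsizes gamma_1, ..., gamma_N
    P_R           : stopping distribution over iterations 1..N (sums to 1)
    """
    N = len(gammas)
    R = rng.choice(np.arange(1, N + 1), p=P_R)  # random stopping iteration
    W = W0.copy()
    for k in range(R):
        W = W - gammas[k] * stoch_grad(W)
    return W, R
```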

  7. Gradient convergence – Single-layer network, decreasing stepsizes. Expected gradients: for a 1-layer network with stepsizes γ_k = γ/k^ρ (ρ > 0) and stopping distribution P_R(k) ∝ γ_k (1 − 0.75 γ_k), we have ∆ ≤ D_f/H_N + Ψ. Here N is the maximum allowable number of iterations, the stopping iteration satisfies R ∈ [1, N] (N ≫ R), and ∆ is the expected gradients.

  8. Interpreting the bound, term by term: D_f ≈ f(W^1), the goodness of fit at the start, i.e., the influence of the initialization W^1. H_N ≈ 0.2 γ GenHar(N, ρ), where GenHar(N, ρ) is the generalized harmonic number, so the D_f/H_N term decays sublinearly in N.
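
A sketch that evaluates this bound numerically, reading GenHar(N, ρ) as the generalized harmonic number Σ_{k=1}^N k^{−ρ} and using the approximations above, including Ψ ≈ q d_0 d_1 γ/B from the next slide (my reading of the extracted formulas, not verified against the paper):

```python
import numpy as np

def gen_har(N, rho):
    """Generalized harmonic number: sum_{k=1}^N k^(-rho)."""
    return np.sum(np.arange(1, N + 1, dtype=float) ** (-rho))

def single_layer_bound(D_f, N, rho, gamma, d0, d1, B, q=0.1):
    """Approximate bound  Delta <= D_f / H_N + Psi  with
    H_N ~ 0.2 * gamma * GenHar(N, rho)  and  Psi ~ q * d0 * d1 * gamma / B
    (constants as read from the slides; q lies in (0.05, 0.25))."""
    H_N = 0.2 * gamma * gen_har(N, rho)
    Psi = q * d0 * d1 * gamma / B
    return D_f / H_N + Psi

# Example: the bound decays sublinearly in N and shrinks with larger batches B.
print(single_layer_bound(D_f=5.0, N=10_000, rho=0.5, gamma=0.01, d0=784, d1=256, B=128))
```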

  9. Ψ ≈ q d_0 d_1 γ/B with 0.05 < q < 0.25, where d_0 d_1 is the number of unknowns: this term captures the influence of the number of free parameters (degrees of freedom) and the bias from the mini-batch size B.

  10. Ideal scenario: a large number of samples and a small network. Realistic scenario: a reasonably sized network, with a large mini-batch size B and a long training time.

  11. Special cases: For small ρ, i.e., slow stepsize decay, P_R(k) approaches a uniform distribution and ∆ ≲ 5 D_f/(N γ) + Ψ. When ρ = 0, i.e., a constant stepsize, P_R(k) := UNIF[1, N] and ∆ ≤ D_f/(N γ) + Ψ. Uniform stopping, however, may not be interesting!

  12. Single-layer network + customized P_R(k): push R to be as close as possible to N, e.g., P_R(k) = 0 for the early iterations and P_R(k) = ν/N for the late ones; we require P_R(k) ≤ P_R(k+1), and for ν ≫ 1, R → N. Expected gradients with this P_R(·): for a 1-layer network with constant stepsize γ, ∆ ≤ 5 ν D_f/(N γ) + Ψ. For ν ≫ 1, however, the bound becomes too loose.
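
One way to realize such a distribution, reading the slide as putting zero mass on the early iterations and mass ν/N on the last roughly N/ν iterations (an assumption; the deck only shows the two values of P_R(k)):

```python
import numpy as np

def customized_P_R(N, nu):
    """P_R(k) = 0 for early k, ~nu/N for the last ~N/nu iterations.
    Non-decreasing in k, as required, and normalized to sum to 1."""
    P = np.zeros(N)
    tail = max(1, int(round(N / nu)))  # number of late iterations carrying mass
    P[-tail:] = 1.0 / tail             # equals nu/N up to rounding
    return P

P = customized_P_R(N=1000, nu=20)
assert np.isclose(P.sum(), 1.0) and np.all(np.diff(P) >= 0)
```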

  13. Single-layer network, using T independent random stopping iterations: a large-deviation estimate. Let ǫ > 0 and 0 < δ ≪ 1. An (ǫ, δ)-solution guarantees Pr( min_t ‖∇_W f(W^{R_t})‖² ≤ ǫ ) ≥ 1 − δ.
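
The estimate suggests the following procedure, sketched here by reusing random_stop_sgd from the earlier sketch: run T independent random-stop trials and keep the iterate with the smallest (estimated) squared gradient norm. The oracle names are placeholders.

```python
import numpy as np

def epsilon_delta_solution(W0, stoch_grad, grad_norm_sq, gammas, P_R, T, seed=0):
    """Return the best of T independent random-stop trials:
    argmin_t ||grad f(W^{R_t})||^2 (norms estimated by grad_norm_sq)."""
    best_W, best_val = None, np.inf
    for t in range(T):
        rng = np.random.default_rng(seed + t)
        W, _ = random_stop_sgd(W0, stoch_grad, gammas, P_R, rng=rng)
        val = grad_norm_sq(W)  # placeholder estimate of the squared gradient norm
        if val < best_val:
            best_W, best_val = W, val
    return best_W, best_val
```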

  14. The Interplay – Multi-layer networks: a multi-layer neural network is L − 1 single-layer networks put together. Typical mechanism: initialize (warm-start or pretrain) each of the layers sequentially. Corrupt x → x̃ (w.p. 1 − ζ, the j-th unit is set to 0), and learn h^1 = σ(W^1 x̃) with W ∈ [−w, w]^d and L(x, W) = ‖x − h^1‖²; this is referred to as a denoising autoencoder (DA). L − 1 such DAs are learned: x → h^1 → ... → h^{L−2} → h^{L−1}.
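
A sketch of one DA pretraining step as read from the slide: corrupt x, encode with a sigmoid, take the squared reconstruction loss, and project the weights back onto the box. The analytic gradient and the square weight matrix are my assumptions, made so that ‖x − h^1‖² is well-defined.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def da_pretrain_step(W, x, zeta, gamma, w, rng):
    """One denoising-autoencoder step: corrupt x, encode h1 = sigmoid(W @ x_tilde),
    loss = ||x - h1||^2, gradient step, then project W onto [-w, w]."""
    mask = rng.random(x.shape) < zeta   # unit kept w.p. zeta, zeroed w.p. 1 - zeta
    x_tilde = x * mask
    h1 = sigmoid(W @ x_tilde)
    r = h1 - x                          # residual (assumes W is d0 x d0)
    grad = 2.0 * np.outer(r * h1 * (1.0 - h1), x_tilde)  # d(loss)/dW
    return np.clip(W - gamma * grad, -w, w)
```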

  15. Typical mechanism, continued: bring in the ys and perform backpropagation. Use stochastic gradients, starting at the L-th layer and propagating the gradients backwards; with dropout, only a fraction (ζ) of all the parameters is updated in each iteration.
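
A sketch of this dropout-style update, where only a random fraction ζ of the parameters receives the gradient in a given iteration (the masking scheme is illustrative):

```python
import numpy as np

def dropout_update(W, grad, gamma, zeta, rng):
    """Update only a random fraction zeta of the parameters this iteration."""
    active = rng.random(W.shape) < zeta   # which parameters get updated
    return W - gamma * (grad * active)
```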

  16. The Interplay – Learning mechanism for multi-layer networks. The new mechanism: a randomized stopping strategy at all stages (sketched below). The L − 1 layers are initialized to (α, δ_α)-solutions, where α is the goodness of pretraining; gradient backpropagation is then performed to an (ǫ, δ)-solution.
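
Schematically, the whole mechanism might be orchestrated as below; every callable is a placeholder for the corresponding (α, δ_α) or (ǫ, δ) procedure, not an interface from the slides.

```python
def train_network(layers, pretrain_layer, backprop, alpha, delta_alpha, eps, delta):
    """Proposed mechanism, schematically:
    1) pretrain each of the L-1 layers to an (alpha, delta_alpha)-solution
       with a randomized stopping iteration;
    2) run gradient backpropagation (with dropout) to an (eps, delta)-solution,
       again with randomized stopping."""
    for layer in layers[:-1]:
        pretrain_layer(layer, alpha, delta_alpha)   # DA pretraining, random stop
    return backprop(layers, eps, delta)             # fine-tuning, random stop
```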

  17. The most general result – Multi-layer network. Expected gradients: for an L-layered network with dropout rate ζ and constant stepsize γ, pretrained to (α, δ_α)-solutions, we have ∆ ≤ D_f/(N e) + Π.
