Statistical Learning Theory Machine Learning Summer School, Kyoto, - PowerPoint PPT Presentation

Statistical Learning Theory Upon observing the training data {( x 1 , y 1 ) , . . . , ( x n , y n )} , the learner is asked to summarize what she had learned about the relationship between x and y . f n ∶ X ↦ Y . The hat The learner’s summary takes the form of a function ˆ indicates that this function depends on the training data. Learning algorithm : a mapping {( x 1 , y 1 ) , . . . , ( x n , y n )} � → ˆ f n . The quality of the learned relationship is given by comparing the response ˆ f n ( x ) to y for a pair ( x , y ) independently drawn from the same distribution P : E ( x , y ) ℓ ( ˆ f n ( x ) , y ) where ℓ ∶ Y × Y ↦ R is a loss function . This is our measure of performance. 21 / 130

Loss Functions ▸ Indicator loss (classification): ℓ ( y , y ′ ) = I { y ≠ y ′ } ▸ Square loss: ℓ ( y , y ′ ) = ( y − y ′ ) 2 ▸ Absolute loss: ℓ ( y , y ′ ) = ∣ y − y ′ ∣ 22 / 130

Examples Probably the simplest learning algorithm that you are probably familiar with is linear least squares : Given ( x 1 , y 1 ) , . . . , ( x n , y n ) , let ∑ n ( y i − ⟨ β , x i ⟩) 2 1 ˆ β = arg min n β ∈ R d i = 1 and define f n ( x ) = ⟨ ˆ β , x ⟩ ˆ Another basic method is regularized least squares : ( y i − ⟨ β , x i ⟩) 2 + λ ∥ β ∥ 2 n 1 ∑ ˆ β = arg min n β ∈ R d i = 1 23 / 130

Methods vs Problems Algorithms ˆ Distributions P f n 24 / 130

Expected Loss and Empirical Loss The expected loss of any function f ∶ X ↦ Y is L ( f ) = E ℓ ( f ( x ) , y ) Since P is unknown, we cannot calculate L ( f ) . However, we can calculate the empirical loss of f ∶ X ↦ Y L ( f ) = 1 n ℓ ( f ( x i ) , y i ) ∑ ˆ n i = 1 25 / 130

... again, what is random here? Since data ( x 1 , y 1 ) , . . . , ( x n , y n ) are a random i.i.d. draw from P , L ( f ) is a random quantity ▸ ˆ ▸ ˆ f n is a random quantity (a random function, output of our learning procedure after seeing data) ▸ hence, L ( ˆ f n ) is also a random quantity ▸ for a given f ∶ X → Y , the quantity L ( f ) is not random! It is important that these are understood before we proceed further. 26 / 130

The Gold Standard Within the framework we set up, the smallest expected loss is achieved by the Bayes optimal function f ∗ = arg min L ( f ) f where the minimization is over all (measurable) prediction rules f ∶ X ↦ Y . The value of the lowest expected loss is called the Bayes error : L ( f ∗ ) = inf f L ( f ) Of course, we cannot calculate any of these quantities since P is unknown. 27 / 130

Bayes Optimal Function Bayes optimal function f ∗ takes on the following forms in these two particular cases: ▸ Binary classification ( Y = { 0, 1 } ) with the indicator loss: f ∗ ( x ) = I { η ( x )≥ 1 / 2 } , η ( x ) = E [ Y ∣ X = x ] where 1 η ( x ) 0 28 / 130

Bayes Optimal Function Bayes optimal function f ∗ takes on the following forms in these two particular cases: ▸ Binary classification ( Y = { 0, 1 } ) with the indicator loss: f ∗ ( x ) = I { η ( x )≥ 1 / 2 } , η ( x ) = E [ Y ∣ X = x ] where 1 η ( x ) 0 ▸ Regression ( Y = R ) with squared loss: f ∗ ( x ) = η ( x ) , η ( x ) = E [ Y ∣ X = x ] where 28 / 130

The big question: is there a way to construct a learning algorithm with a guarantee that L ( ˆ f n ) − L ( f ∗ ) is small for large enough sample size n ? 29 / 130

Outline Introduction Statistical Learning Theory The Setting of SLT Consistency, No Free Lunch Theorems, Bias-Variance Tradeoff Tools from Probability, Empirical Processes From Finite to Infinite Classes Uniform Convergence, Symmetrization, and Rademacher Complexity Large Margin Theory for Classification Properties of Rademacher Complexity Covering Numbers and Scale-Sensitive Dimensions Faster Rates Model Selection Sequential Prediction / Online Learning Motivation Supervised Learning Online Convex and Linear Optimization Online-to-Batch Conversion, SVM optimization 30 / 130

Consistency An algorithm that ensures n →∞ L ( ˆ f n ) = L ( f ∗ ) lim almost surely is called consistent . Consistency ensures that our algorithm is approaching the best possible prediction performance as the sample size increases. The good news: consistency is possible to achieve. ▸ easy if X is a finite or countable set ▸ not too hard if X is infinite, and the underlying relationship between x and y is “continuous” 31 / 130

The bad news... In general, we cannot prove anything “interesting” about L ( ˆ f n ) − L ( f ∗ ) , unless we make further assumptions (incorporate prior knowledge ). What do we mean by “nothing interesting”? This is the subject of the so-called “No Free Lunch” Theorems. Unless we posit further assumptions, 32 / 130

The bad news... In general, we cannot prove anything “interesting” about L ( ˆ f n ) − L ( f ∗ ) , unless we make further assumptions (incorporate prior knowledge ). What do we mean by “nothing interesting”? This is the subject of the so-called “No Free Lunch” Theorems. Unless we posit further assumptions, f n , any n and any ǫ > 0, there exists a distribution ▸ For any algorithm ˆ P such that L ( f ∗ ) = 0 and E L ( ˆ f n ) ≥ 1 2 − ǫ 32 / 130

The bad news... In general, we cannot prove anything “interesting” about L ( ˆ f n ) − L ( f ∗ ) , unless we make further assumptions (incorporate prior knowledge ). What do we mean by “nothing interesting”? This is the subject of the so-called “No Free Lunch” Theorems. Unless we posit further assumptions, f n , any n and any ǫ > 0, there exists a distribution ▸ For any algorithm ˆ P such that L ( f ∗ ) = 0 and E L ( ˆ f n ) ≥ 1 2 − ǫ ▸ For any algorithm ˆ f n , and any sequence a n that converges to 0, there exists a probability distribution P such that L ( f ∗ ) = 0 and for all n E L ( ˆ f n ) ≥ a n Reference: (Devroye, Gy¨ orfi, Lugosi: A Probabilistic Theory of Pattern Recognition ), (Bousquet, Boucheron, Lugosi, 2004) 32 / 130

is this really “bad news”? Not really. We always have some domain knowledge. Two ways of incorporating prior knowledge: ▸ Direct way: assume that the distribution P is not arbitrary (also known as a modeling approach, generative approach, statistical modeling) ▸ Indirect way: redefine the goal to perform as well as a reference set F of predictors: L ( ˆ f n ) − inf f ∈F L ( f ) This is known as a discriminative approach. F encapsulates our inductive bias . 33 / 130

Pros/Cons of the two approaches Pros of the discriminative approach: we never assume that P takes some particular form, but we rather put our prior knowledge into “what are the types of predictors that will do well”. Cons: cannot really interpret ˆ f n . Pros of the generative approach: can estimate the model / parameters of the distribution ( inference ). Cons: it is not clear what the analysis says if the assumption is actually violated. Both approaches have their advantages. A machine learning researcher or practitioner should ideally know both and should understand their strengths and weaknesses. In this tutorial we only focus on the discriminative approach. 34 / 130

Example: Linear Discriminant Analysis Consider the classification problem with Y = { 0, 1 } . Suppose the class-conditional densities are multivariate Gaussian with the same covariance Σ = I : p ( x ∣ y = 0 ) = ( 2 π ) − k / 2 exp {− 1 2 ∥ x − µ 0 ∥ 2 } and p ( x ∣ y = 1 ) = ( 2 π ) − k / 2 exp {− 1 2 ∥ x − µ 1 ∥ 2 } The “best” (Bayes) classifier is f ∗ = I { P ( y = 1 ∣ x ) ≥ 1 / 2 } which corresponds to the half-space defined by the decision boundary p ( x ∣ y = 1 ) ≥ p ( x ∣ y = 0 ) . This boundary is linear . 35 / 130

Example: Linear Discriminant Analysis The (linear) optimal decision boundary comes from our generative assumption on the form of the underlying distribution. Alternatively, we could have indirectly postulated that we will be looking for a linear discriminant between the two classes, without making distributional assumptions. Such linear discriminant (classification) functions are I {⟨ w , x ⟩≥ b } for a unit-norm w and some bias b ∈ R . Quadratic Discriminant Analysis: If unequal correlation matrices Σ 1 and Σ 2 are assumed, the resulting boundary is quadratic. We can then define classification function by I { q ( x )≥ 0 } where q ( x ) is a quadratic function. 36 / 130

Bias-Variance Tradeoff How do we choose the inductive bias F ? L ( ˆ f n ) − L ( f ∗ ) = L ( ˆ f n ) − inf f ∈F L ( f ) + f ∈F L ( f ) − L ( f ∗ ) inf �ÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜ�ÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜ� �ÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜ�ÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜ� Estimation Error Approximation Error ˆ f ∗ f F f n F Clearly, the two terms are at odds with each other: ▸ Making F larger means smaller approximation error but (as we will see) larger estimation error ▸ Taking a larger sample n means smaller estimation error and has no effect on the approximation error. ▸ Thus, it makes sense to trade off size of F and n . This is called Structural Risk Minimization , or Method of Sieves , or Model Selection . 37 / 130

Bias-Variance Tradeoff We will only focus on the estimation error, yet the ideas we develop will make it possible to read about model selection on your own. Note: if we guessed correctly and f ∗ ∈ F , then L ( ˆ f n ) − L ( f ∗ ) = L ( ˆ f n ) − inf f ∈F L ( f ) For a particular problem, one hopes that prior knowledge about the problem can ensure that the approximation error inf f ∈F L ( f ) − L ( f ∗ ) is small. 38 / 130

Occam’s Razor Occam’s Razor is often quoted as a principle for choosing the simplest theory or explanation out of the possible ones. However, this is a rather philosophical argument since simplicity is not uniquely defined. We will discuss this issue later. What we will do is to try to understand “complexity” when it comes to behavior of certain stochastic processes. Such a question will be well-defined mathematically. 39 / 130

Looking Ahead So far: represented prior knowledge by means of the class F . Looking forward, we can find an algorithm that, after looking at a dataset of size n , produces ˆ f n such that L ( ˆ f n ) − inf f ∈F L ( f ) decreases (in a certain sense which we will make precise) at a non-trivial rate which depends on “richness” of F . This will give a sample complexity guarantee: how many samples are needed to make the error smaller than a desired accuracy. 40 / 130

Types of Bounds In expectation vs in probability (control the mean vs control the tails): E { L ( ˆ f n ) − inf f ∈F L ( f )} < ψ ( n ) P ( L ( ˆ f n ) − inf f ∈F L ( f ) ≥ ǫ ) < ψ ( n , ǫ ) vs 42 / 130

Types of Bounds In expectation vs in probability (control the mean vs control the tails): E { L ( ˆ f n ) − inf f ∈F L ( f )} < ψ ( n ) P ( L ( ˆ f n ) − inf f ∈F L ( f ) ≥ ǫ ) < ψ ( n , ǫ ) vs The in-probability bound can be inverted as P ( L ( ˆ f n ) − inf f ∈F L ( f ) ≥ φ ( δ , n )) < δ by setting δ ∶ = ψ ( ǫ , n ) and solving for ǫ . In this lecture, we are after the function φ ( δ , n ) . We will call it “the rate”. “With high probability” typically means logarithmic dependence of φ ( δ , n ) on 1 / δ . Very desirable: the bound grows only modestly even for high confidence bounds. 42 / 130

Sample Complexity Sample complexity is the sample size required by the algorithm ˆ f n to guarantee L ( ˆ f n ) − inf f ∈F L ( f ) ≤ ǫ with probability at least 1 − δ . Of course, we just need to invert a bound P ( L ( ˆ f n ) − inf f ∈F L ( f ) ≥ φ ( δ , n )) < δ by setting ǫ ∶ = φ ( δ , n ) and solving for n . In other words, n ( ǫ , δ ) is sample complexity of the algorithm ˆ f n if P ( L ( ˆ f n ) − inf f ∈F L ( f ) ≥ ǫ ) ≤ δ as soon as n ≥ n ( ǫ , δ ) . Hence, “rate” can be translated into “sample complexity” and vice versa. Easy to remember: rate O ( 1 /√ n ) means O ( 1 / ǫ 2 ) sample complexity, whereas rate O ( 1 / n ) is a smaller O ( 1 / ǫ ) sample complexity. 43 / 130

Types of Bounds Other distinctions to keep in mind: We can ask for bounds (either in expectation or in probability) on the following random variables: L ( ˆ f n ) − L ( f ∗ ) ( A ) L ( ˆ f n ) − inf f ∈F L ( f ) ( B ) L ( ˆ f n ) − ˆ L ( ˆ f n ) ( C ) { L ( f ) − ˆ L ( f )} ( D ) sup f ∈F { L ( f ) − ˆ L ( f ) − pen n ( f )} ( E ) sup f ∈F Let’s make sure we understand the differences between these random quantities! 44 / 130

Types of Bounds Upper bounds on ( D ) and ( E ) are used as tools for achieving the other bounds. Let’s see why. f n ∈ F , Obviously, for any algorithm that outputs ˆ L ( ˆ f n ) − ˆ L ( ˆ f n ) ≤ sup { L ( f ) − ˆ L ( f )} f ∈F and so a bound on ( D ) implies a bound on ( C ) . How about a bound on ( B ) ? Is it implied by ( C ) or ( D ) ? It depends on what the algorithm does! Denote f F = arg min f ∈F L ( f ) . Suppose ( D ) is small. It then makes sense to ask the learning algorithm to minimize or (approximately minimize) the empirical error (why?) 45 / 130

Canonical Algorithms Empirical Risk Minimization (ERM) algorithm: f n = arg min L ( f ) ˆ ˆ f ∈F Regularized Empirical Risk Minimization algorithm: f n = arg min L ( f ) + pen n ( f ) ˆ ˆ f ∈F We will deal with the regularized ERM a bit later. For now, let’s focus on ERM. Remark: to actually compute f ∈ F minimizing the above objectives, one needs to employ some optimization methods. In practice, the objective might be optimized only approximately. 46 / 130

Performance of ERM If ˆ f n is an ERM, L ( ˆ f n ) − L ( f F ) ≤ { L ( ˆ f n ) − ˆ L ( ˆ f n )} + { ˆ L ( ˆ f n ) − ˆ L ( f F )} + { ˆ L ( f F ) − L ( f F )} ≤ { L ( ˆ f n ) − ˆ L ( ˆ f n )} + { ˆ L ( f F ) − L ( f F )} �ÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜ�ÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜ� ( C ) ≤ sup { L ( f ) − ˆ L ( f )} + { ˆ L ( f F ) − L ( f F )} f ∈F �ÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜ�ÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜ� ( D ) because the second term is negative. So, ( C ) also implies a bound on ( B ) f n is ERM (or “close” to ERM). Also, ( D ) also implies a bound on when ˆ ( B ) . L ( f F ) − L ( f F ) ? Central Limit Theorem says What about this extra term ˆ that for i.i.d. random variables with bounded second moment, the average converges to the expectation. Let’s quantify this. 47 / 130

Hoeffding Inequality Let W , W 1 , . . . , W n be i.i.d. such that P ( a ≤ W ≤ b ) = 1. Then 2 nǫ 2 P ( E W − 1 ∑ n W i > ǫ ) ≤ exp ( − ( b − a ) 2 ) n i = 1 and 2 nǫ 2 P ( 1 ∑ n W i − E W > ǫ ) ≤ exp ( − ( b − a ) 2 ) n i = 1 Let W i = ℓ ( f F ( x i ) , y i ) . Clearly, W 1 , . . . , W i are i.i.d. Then, P (∣ L ( f F ) − ˆ L ( f F )∣ > ǫ ) ≤ 2 exp ( − ( b − a ) 2 ) 2 nǫ 2 assuming a ≤ ℓ ( f F ( x ) , y ) ≤ b for all x ∈ X , y ∈ Y . 48 / 130

Wait, Are We Done? Can’t we conclude directly that ( C ) is small? That is, P ( E ℓ ( ˆ f n ( x ) , y ) − 1 ℓ ( ˆ f n ( x i ) , y i ) > ǫ ) ≤ 2 exp ( − ( b − a ) 2 ) 2 nǫ 2 ∑ n ? n i = 1 49 / 130

Wait, Are We Done? Can’t we conclude directly that ( C ) is small? That is, P ( E ℓ ( ˆ f n ( x ) , y ) − 1 ℓ ( ˆ f n ( x i ) , y i ) > ǫ ) ≤ 2 exp ( − ( b − a ) 2 ) 2 nǫ 2 ∑ n ? n i = 1 No! The random variables ℓ ( ˆ f n ( x i ) , y i ) are not necessarily independent and it is possible that E ℓ ( ˆ f n ( x ) , y ) = E W ≠ E ℓ ( ˆ f n ( x i ) , y i ) = E W i The expected loss is “out of sample performance” while the second term is “in sample”. We say that ℓ ( ˆ f n ( x i ) , y i ) is a biased estimate of E ℓ ( ˆ f n ( x ) , y ) . How bad can this bias be? 49 / 130

Example ▸ X = [ 0, 1 ] , Y = { 0, 1 } ▸ ℓ ( f ( X i ) , Y i ) = I { f ( X i )≠ Y i } ▸ distribution P = P x × P y ∣ x with P x = Unif [ 0, 1 ] and P y ∣ x = δ y = 1 ▸ function class F = ∪ n ∈ N { f = f S ∶ S ⊂ X , ∣ S ∣ = n , f S ( x ) = I { x ∈ S } } 1 0 1 ERM ˆ f n memorizes (perfectly fits) the data, but has no ability to generalize. Observe that 0 = E ℓ ( ˆ f n ( x i ) , y i ) ≠ E ℓ ( ˆ f n ( x ) , y ) = 1 This phenomenon is called overfitting . 50 / 130

Example Not only is ( C ) large in this example. Also, uniform deviations ( D ) do not converge to zero. For any n ∈ N and any ( x 1 , y 1 ) , . . . , ( x n , y n ) ∼ P { E x , y ℓ ( f ( x ) , y ) − 1 ∑ n ℓ ( f ( x i ) , y i )} = 1 sup n f ∈F i = 1 Where do we go from here? Two approaches: 1. understand how to upper bound uniform deviations ( D ) 2. find properties of algorithms that limit in some way the bias of ℓ ( ˆ f n ( x i ) , y i ) . Stability and compression are two such approaches. 51 / 130

Uniform Deviations We first focus on understanding { E x , y ℓ ( f ( x ) , y ) − 1 ℓ ( f ( x i ) , y i )} ∑ n sup n f ∈F i = 1 If F = { f 0 } consists of a single function, then clearly { E ℓ ( f ( x ) , y ) − 1 n ℓ ( f ( x i ) , y i )} = { E ℓ ( f 0 ( x ) , y ) − 1 n ℓ ( f 0 ( x i ) , y i )} ∑ ∑ sup n n f ∈ F i = 1 i = 1 This quantity is O P ( 1 /√ n ) by Hoeffding’s inequality, assuming a ≤ ℓ ( f 0 ( x ) , y ) ≤ b . Moral: for “simple” classes F the uniform deviations ( D ) can be bounded while for “rich” classes not. We will see how far we can push the size of F . 52 / 130

A bit of notation to simplify things... To ease the notation, ▸ Let z i = ( x i , y i ) so that the training data is { z 1 , . . . , z n } ▸ g ( z ) = ℓ ( f ( x ) , y ) for z = ( x , y ) ▸ Loss class G = { g ∶ g ( z ) = ℓ ( f ( x ) , y )} = ℓ ○ F g n = ℓ ( ˆ f n ( ⋅ ) , ⋅ ) , g G = ℓ ( f F ( ⋅ ) , ⋅ ) ▸ ˆ ▸ g ∗ = arg min g E g ( z ) = ℓ ( f ∗ ( ⋅ ) , ⋅ ) is Bayes optimal (loss) function We can now work with the set G , but keep in mind that each g ∈ G corresponds to an f ∈ F : g ∈ G → f ∈ F ← Once again, the quantity of interest is { E g ( z ) − 1 n g ( z i )} ∑ sup n g ∈G i = 1 On the next slide, we visualize deviations E g ( z ) − 1 n ∑ n i = 1 g ( z i ) for all possible functions g and discuss all the concepts introduces so far. 53 / 130

Empirical Process Viewpoint E g 0 g ∗ all functions 54 / 130

Empirical Process Viewpoint n 1 X g ( z i ) E g n i = 1 0 g ∗ all functions 54 / 130

Empirical Process Viewpoint n 1 X g ( z i ) E g n i = 1 0 g ∗ ˆ g n all functions 54 / 130

Empirical Process Viewpoint n 1 X g ( z i ) n i = 1 0 g ∗ ˆ g n 54 / 130

Empirical Process Viewpoint n G 1 X g ( z i ) E g n i = 1 0 g ∗ all functions 54 / 130

Empirical Process Viewpoint n G 1 X g ( z i ) E g n i = 1 0 g ∗ g G ˆ g n all functions 54 / 130

Empirical Process Viewpoint n G 1 X g ( z i ) E g n i = 1 0 g ∗ all functions 54 / 130

Empirical Process Viewpoint A stochastic process is a collection of random variables indexed by some set. An empirical process is a stochastic process { E g ( z ) − 1 n g ( z i )} ∑ n g ∈ G i = 1 indexed by a function class G . Uniform Law of Large Numbers: ∣ E g − 1 ∑ n g ( z i )∣ → 0 sup n g ∈ G i = 1 in probability. 55 / 130

Empirical Process Viewpoint A stochastic process is a collection of random variables indexed by some set. An empirical process is a stochastic process { E g ( z ) − 1 n g ( z i )} ∑ n g ∈ G i = 1 indexed by a function class G . Uniform Law of Large Numbers: ∣ E g − 1 ∑ n g ( z i )∣ → 0 sup n g ∈ G i = 1 in probability. Key question: How “big” can G be for the supremum of the empirical process to still be manageable? 55 / 130

Union Bound (Boole’s inequality) Boole’s inequality: for a finite or countable set of events, P ( ∪ j A j ) ≤ ∑ P ( A j ) j Let G = { g 1 , . . . , g N } . Then P ( ∃ g ∈ G ∶ E g − 1 g ( z i ) > ǫ ) ≤ P ( E g j − 1 g j ( z i ) > ǫ ) ∑ n ∑ N ∑ n n n i = 1 j = 1 i = 1 Assuming P ( a ≤ g ( z i ) ≤ b ) = 1 for every g ∈ G , P ( sup { E g − 1 n g ( z i )} > ǫ ) ≤ N exp ( − ( b − a ) 2 ) 2 nǫ 2 ∑ n g ∈ G i = 1 56 / 130

Finite Class Alternatively, we set δ = N exp (− 2 nǫ 2 ( b − a ) 2 ) and write √ P ⎛ log ( N ) + log ( 1 / δ ) ⎞ { E g − 1 g ( z i )} > ( b − a ) ⎠ ≤ δ ∑ n ⎝ sup n 2 n g ∈G i = 1 Another way to write it: with probability at least 1 − δ , √ log ( N ) + log ( 1 / δ ) { E g − 1 ∑ n g ( z i )} ≤ ( b − a ) sup n 2 n g ∈ G i = 1 Hence, with probability at least 1 − δ , the ERM algorithm ˆ f n for a class F of cardinality N satisfies √ log ( N ) + log ( 1 / δ ) L ( ˆ f n ) − inf f ∈ F L ( f ) ≤ 2 ( b − a ) 2 n assuming a ≤ ℓ ( f ( x ) , y ) ≤ b for all f ∈ F , x ∈ X , y ∈ Y . The constant 2 is due to the L ( f F ) − ˆ L ( f F ) term. This is a loose upper bound. 57 / 130

Once again... A take-away message is that the following two statements are worlds apart: with probability at least 1 − δ , for any g ∈ G , E g − 1 g ( z i ) ≤ ǫ ∑ n n i = 1 vs for any g ∈ G , with probability at least 1 − δ , E g − 1 n g ( z i ) ≤ ǫ ∑ n i = 1 The second statement follows from CLT, while the first statement is often difficult to obtain and only holds for some G . 58 / 130

Countable Class: Weighted Union Bound Let G be countable and fix a distribution w on G such that ∑ g ∈G w ( g ) ≤ 1. For any δ > 0, for any g ∈ G √ P ⎛ ⎞ log 1 / w ( g ) + log ( 1 / δ ) ⎝ E g − 1 n g ( z i ) ≥ ( b − a ) ⎠ ≤ δ ⋅ w ( g ) ∑ n 2 n i = 1 by Hoeffding’s inequality (easy to verify!). By the Union Bound, √ P ⎛ ⎞ log 1 / w ( g ) + log ( 1 / δ ) ⎝ ∃ g ∈ G ∶ E g − 1 n g ( z i ) ≥ ( b − a ) ⎠ ≤ δ ∑ w ( g ) ≤ δ ∑ 2 n n g ∈ G i = 1 Therefore, with probability at least 1 − δ , for all f ∈ F √ log 1 / w ( f ) + log ( 1 / δ ) L ( f ) − ˆ L ( f ) ≤ ( b − a ) 2 n �ÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜ�ÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜÜ� pen n ( f ) 60 / 130

Countable Class: Weighted Union Bound If ˆ f n is a regularized ERM, L ( ˆ f n ) − L ( f F ) ≤ { L ( ˆ f n ) − ˆ L ( ˆ f n ) − pen n ( ˆ f n )} + { ˆ L ( ˆ f n ) + pen n ( ˆ f n ) − ˆ L ( f F ) − pen n ( f F )} + { ˆ L ( f F ) − L ( f F )} + pen n ( f F ) ≤ sup { L ( f ) − ˆ L ( f ) − pen n ( f )} + { ˆ L ( f F ) − L ( f F )} + pen n ( f F ) f ∈F So, ( E ) implies a bound on ( B ) when ˆ f n is regularized ERM. From the weighted union bound for a countable class: L ( ˆ f n ) − L ( f F ) ≤ { ˆ L ( f F ) − L ( f F )} + pen n ( f F ) √ log 1 / w ( f F ) + log ( 1 / δ ) ≤ 2 ( b − a ) 2 n 61 / 130

Uncountable Class: Compression Bounds Let us make the dependence of the algorithm ˆ f n on the training set S = {( x 1 , y 1 ) , . . . , ( x n , y n )} explicit: ˆ f n = ˆ f n [ S ] . Suppose F has the property that there exists a “compression function” C k which selects from any dataset S of any size n a subset of k labeled examples C k ( S ) ⊆ S such that the algorithm can be written as f n [ S ] = ˆ f k [ C k ( S )] ˆ Then, L ( ˆ f n ) − ˆ L ( ˆ f n ) = E ℓ ( ˆ f k [ C k ( S )]( x ) , y ) − 1 n ℓ ( ˆ f k [ C k ( S )]( x i ) , y i ) ∑ n i = 1 ≤ I ⊂ { 1,..., n } , ∣ I ∣ ≤ k { E ℓ ( ˆ f k [ S I ]( x ) , y ) − 1 n ℓ ( ˆ f k [ S I ]( x i ) , y i )} ∑ max n i = 1 62 / 130

Uncountable Class: Compression Bounds f k [ S I ] only depends on k out of n points, the empirical average is Since ˆ “mostly out of sample”. Adding and subtracting ℓ ( ˆ f k [ S I ]( x ′ ) , y ′ ) 1 ∑ n ( x ′ , y ′ )∈ W for an additional set of i.i.d. random variables W = {( x ′ 1 ) , . . . , ( x ′ k )} 1 , y ′ k , y ′ results in an upper bound ⎧ ⎫ ⎪ ⎪ ⎪ ⎪ + ( b − a ) k ⎪ ⎪ ⎨ E ℓ ( ˆ f k [ S I ]( x ) , y ) − 1 ℓ ( ˆ f k [ S I ]( x ) , y ) ⎬ ∑ ⎪ ⎪ max ⎪ ⎪ ⎪ ⎪ I ⊂ { 1,..., n } , ∣ I ∣ ≤ k ⎩ n ⎭ n ( x , y ) ∈ S ∖ S I ∪ W ∣ I ∣ We appeal to the union bound over the ( n k ) possibilities, with a Hoeffding’s bound for each. Then with probability at least 1 − δ , √ k log ( en / k ) + log ( 1 / δ ) + ( b − a ) k L ( ˆ f n ) − inf f ∈ F L ( f ) ≤ 2 ( b − a ) 2 n n assuming a ≤ ℓ ( f ( x ) , y ) ≤ b for all f ∈ F , x ∈ X , y ∈ Y . 63 / 130

Example: Classification with Thresholds in 1D ▸ X = [ 0, 1 ] , Y = { 0, 1 } ▸ F = { f θ ∶ f θ ( x ) = I { x ≥ θ } , θ ∈ [ 0, 1 ]} ▸ ℓ ( f θ ( x ) , y ) = I { f θ ( x )≠ y } ˆ f n 0 1 For any set of data ( x 1 , y 1 ) , . . . , ( x n , y n ) , the ERM solution ˆ f n has the property that the first occurrence x l on the left of the threshold has label y l = 0, while first occurrence x r on the right – label y r = 1. Enough to take k = 2 and define ˆ f n [ S ] = ˆ f 2 [( x l , 0 ) , ( x r , 1 )] . 64 / 130

Stability Yet another way to limit the bias of ℓ ( ˆ f n ( x i ) , y i ) as an estimate of L ( ˆ f n ) is through a notion of stability. An algorithm ˆ f n is stable if a change (or removal) of a single data point does not change (in a certain mathematical sense) the function ˆ f n by much. f n = f 0 without even looking at Of course, a dumb algorithm which outputs ˆ data is very stable and ℓ ( ˆ f n ( x i ) , y i ) are independent random variables... But it is not a good algorithm! We would like to have an algorithm that both approximately minimizes the empirical error and is stable. Turns out, certain types of regularization methods are stable. Example: ( f ( x i ) − y i ) 2 + λ ∥ f ∥ 2 f n = arg min n 1 ∑ ˆ K f ∈F n i = 1 where ∥ ⋅ ∥ is the norm induced by the kernel of a reproducing kernel Hilbert space (RKHS) F . 65 / 130

Summary so far We proved upper bounds on L ( ˆ f n ) − L ( f F ) for ▸ ERM over a finite class ▸ Regularized ERM over a countable class (weighted union bound) ▸ ERM over classes F with the compression property ▸ ERM or Regularized ERM that are stable (only sketched it) What about a more general situation? Is there a way to measure complexity of F that tells us whether ERM will succeed? 66 / 130

Uniform Convergence and Symmetrization Let z ′ 1 , . . . , z ′ n be another set of n i.i.d. random variables from P . Let ǫ 1 , . . . , ǫ n be i.i.d. Rademacher random variables: P ( ǫ i = − 1 ) = P ( ǫ i = + 1 ) = 1 / 2 Let’s get through a few manipulations: { E g ( z ) − 1 ∑ n g ( z i )} = E z 1 ∶ n sup { E z ′ 1 ∶ n { 1 ∑ n g ( z ′ i )} − 1 ∑ n g ( z i )} E sup n n n g ∈G g ∈ G i = 1 i = 1 i = 1 By Jensen’s inequality, this is upper bounded by { 1 n g ( z ′ i ) − 1 n g ( z i )} ∑ ∑ 1 ∶ n sup E z 1 ∶ n , z ′ n n g ∈ G i = 1 i = 1 which is equal to { 1 n ǫ i ( g ( z ′ i ) − g ( z i ))} ∑ 1 ∶ n sup E ǫ 1 ∶ n E z 1 ∶ n , z ′ n g ∈ G i = 1 68 / 130

Uniform Convergence and Symmetrization { 1 n ǫ i ( g ( z ′ i ) − g ( z i ))} ∑ 1 ∶ n sup E ǫ 1 ∶ n E z 1 ∶ n , z ′ n g ∈G i = 1 ≤ E sup { 1 ∑ n ǫ i g ( z ′ i )} + E sup { 1 ∑ n − ǫ i g ( z i )} n n g ∈ G g ∈ G i = 1 i = 1 = 2 E sup { 1 n ǫ i g ( z i )} ∑ n g ∈ G i = 1 The empirical Rademacher averages of G are defined as ̂ R n ( G ) = E [ sup { 1 n ǫ i g ( z i )} ∣ z 1 , . . . , z n ] ∑ g ∈ G n i = 1 The Rademacher average (or Rademacher complexity ) of G is R n ( G ) = E z 1 ∶ n ̂ R n ( G ) 69 / 130

Classification: Loss Function Disappears Let us focus on binary classification with indicator loss and let F be a class of { 0, 1 } -valued functions. We have ℓ ( f ( x ) , y ) = I { f ( x )≠ y } = ( 1 − 2 y ) f ( x ) + y and thus ̂ R n ( G ) = E [ sup { 1 n ǫ i ( f ( x i )( 1 − 2 y i ) + y i )} ∣ ( x 1 , y 1 ) . . . , ( x n , y n )] ∑ n f ∈F i = 1 ǫ i f ( x i )} ∣ x 1 , . . . , x n ] = ̂ = E [ sup { 1 ∑ n R n ( F ) n f ∈ F i = 1 because, given y 1 , . . . , y n , the distribution of ǫ i ( 1 − 2 y i ) is the same as ǫ i . 70 / 130

Vapnik-Chervonenkis Theory for Classification We are now left examining E [ sup { 1 ∑ n ǫ i f ( x i )} ∣ x 1 , . . . , x n ] n f ∈F i = 1 Given x 1 , . . . , x n , define the projection of F onto sample: F ∣ x 1 ∶ n = {( f ( x 1 ) , . . . , f ( x n )) ∈ { 0, 1 } n ∶ f ∈ F } ⊆ { 0, 1 } n Clearly, this is a finite set and √ 2 log card ( F ∣ x 1 ∶ n ) ̂ R n ( F ) = E ǫ 1 ∶ n ∑ n ǫ i v i ≤ 1 max v ∈ F∣ x 1 ∶ n n n i = 1 This is because a maximum of N (sub)Gaussian random variables ∼ √ log N . The bound is nontrivial as long as log card ( F ∣ x 1 ∶ n ) = o ( n ) . 71 / 130

Vapnik-Chervonenkis Theory for Classification The growth function is defined as Π F ( n ) = max { card ( F ∣ x 1 ,..., x n ) ∶ x 1 , . . . , x n ∈ X } The growth function measures expressiveness of F . In particular, if F can produce all possible signs (that is, Π F ( n ) = 2 n ), the bound becomes useless. We say that F shatters some set x 1 , . . . , x n if F ∣ x n = { 0, 1 } n . The Vapnik-Chervonenkis (VC) dimension of the class F is defined as vc ( F ) = max { d ∶ Π F ( t ) = 2 t } Vapnik-Chervonenkis-Sauer-Shelah Lemma: If d = vc ( F ) < ∞ , then Π F ( n ) ≤ d ( n d ) ≤ ( en d ) d ∑ i = 0 72 / 130

Vapnik-Chervonenkis Theory for Classification Conclusion: for any F with vc ( F ) < ∞ , the ERM algorithm satisfies √ 2 d log ( en / d ) E { L ( ˆ f n ) − inf f ∈F L ( f )} ≤ 2 n While we proved the result in expectation, the same type of bound holds with high probability. VC dimension is a combinatorial dimension of a binary-valued function class. Its finiteness is necessary and sufficient for learnability if we place no assumptions on the distribution P . Remark: the bound is similar to that obtained through compression. In fact, the exact relationship between compression and VC dimension is still an open question. 73 / 130

Vapnik-Chervonenkis Theory for Classification Examples of VC classes: ▸ Half-spaces F = { I {⟨ w , x ⟩+ b ≥ 0 } ∶ w ∈ R d , ∥ w ∥ = 1, b ∈ R } has vc ( F ) = d + 1 ▸ For a vector space H of dimension d , VC dimension of F = { I { h ( x )≥ 0 } ∶ h ∈ H } is at most d ▸ The set of Euclidean balls F = { I {∑ d i = 1 ∥ x i − a i ∥ 2 ≤ b } ∶ a ∈ R d , b ∈ R } has VC dimension at most d + 2. ▸ Functions that can be computed using a finite number of arithmetic operations (see (Goldberg and Jerrum, 1995) ) However: F = { f α ( x ) = I { sin ( αx )≥ 0 } ∶ α ∈ R } has infinite VC dimension, so it is not correct to think of VC dimension as the number of parameters! 74 / 130

Vapnik-Chervonenkis Theory for Classification Examples of VC classes: ▸ Half-spaces F = { I {⟨ w , x ⟩+ b ≥ 0 } ∶ w ∈ R d , ∥ w ∥ = 1, b ∈ R } has vc ( F ) = d + 1 ▸ For a vector space H of dimension d , VC dimension of F = { I { h ( x )≥ 0 } ∶ h ∈ H } is at most d ▸ The set of Euclidean balls F = { I {∑ d i = 1 ∥ x i − a i ∥ 2 ≤ b } ∶ a ∈ R d , b ∈ R } has VC dimension at most d + 2. ▸ Functions that can be computed using a finite number of arithmetic operations (see (Goldberg and Jerrum, 1995) ) However: F = { f α ( x ) = I { sin ( αx )≥ 0 } ∶ α ∈ R } has infinite VC dimension, so it is not correct to think of VC dimension as the number of parameters! Unfortunately, the VC theory is unable to explain the good performance of neural networks and Support Vector Machines! This prompted the development of a margin-based theory. 74 / 130

Classification with Real-Valued Functions Many methods use I ( F ) = { I { f ≥ 0 } ∶ f ∈ F } for classification. The VC dimension can be very large, yet in practice the methods work well. Example: f ( x ) = f w ( x ) = ⟨ w , ψ ( x )⟩ where ψ is a mapping to a high- dimensional feature space (see Kernel Methods). The VC dimension of the set is typically huge (equal to the dimensionality of ψ ( x ) ) or infinite, yet the methods perform well! Is there an explanation beyond VC theory? 76 / 130

Margins Hard margin: ∃ f ∈ F ∶ ∀ i , y i f ( x i ) ≥ γ f ( x ) More generally, we hope to have card ({ i ∶ y i f ( x i ) < γ }) ∃ f ∈ F ∶ is small n 77 / 130

Surrogate Loss Define ⎧ ⎪ if s ≤ 0 ⎪ ⎪ ⎪ 1 φ ( s ) = ⎨ 1 − s / γ if 0 < s < γ ⎪ ⎪ ⎪ ⎪ if s ≥ γ ⎩ 0 I { y ≠ sign ( f ( x ))} = I { yf ( x )≤ 0 } ≤ φ ( yf ( x )) ≤ ψ ( yf ( x )) = I { yf ( x )≤ γ } Then: The function φ is an example of a surrogate loss function . φ ( yf ( x )) ψ ( yf ( x )) I { yf ( x ) 6 0 } yf ( x ) γ Let L φ ( f ) = E φ ( yf ( x )) L φ ( f ) = 1 n φ ( y i f ( x i )) ∑ ˆ and n i = 1 Then L ( f ) ≤ L φ ( f ) , L φ ( f ) ≤ ˆ L ψ ( f ) ˆ 78 / 130

Surrogate Loss Now consider uniform deviations for the surrogate loss: { L φ ( f ) − ˆ L φ ( f )} E sup f ∈F We had shown that this quantity is at most 2 R n ( φ ( F )) for φ ( F ) = { g ( z ) = φ ( yf ( x )) ∶ f ∈ F } A useful property of Rademacher averages: R n ( φ ( F )) ≤ L R n ( F ) if φ is L -Lipschitz. Observe that in our example φ is 1 / γ -Lipschitz. Hence, { L φ ( f ) − ˆ L φ ( f )} ≤ 2 γ R n ( F ) E sup f ∈F 79 / 130

Margin Bound Same result in high probability: with probability at least 1 − δ , √ log ( 1 / δ ) { L φ ( f ) − ˆ L φ ( f )} ≤ 2 γ R n (F) + sup 2 n f ∈F With probability at least 1 − δ , for all f ∈ F √ log ( 1 / δ ) L ( f ) ≤ ˆ L ψ ( f ) + 2 γ R n (F) + 2 n If ˆ f n is minimizing margin loss f n = arg min ∑ n φ ( y i f ( x i )) 1 ˆ f ∈F n i = 1 then with probability at least 1 − δ √ log ( 1 / δ ) L ( ˆ f n ) ≤ inf f ∈ F L ψ ( f ) + 4 γ R n (F) + 2 2 n Note: φ assumes knowledge of γ , but this assumption can be removed. 80 / 130

Useful Properties 1. If F ⊆ G , then ̂ R n ( F ) ≤ ̂ R n ( G ) R n ( F ) = ̂ ̂ R n ( conv ( F )) 2. 3. For any c ∈ R , ̂ R n ( c F ) = ∣ c ∣ ̂ R n ( F ) 4. If φ ∶ R ↦ R is L -Lipschitz (that is, φ ( a ) − φ ( b ) ≤ L ∣ a − b ∣ for all a , b ∈ R ), then R n ( φ ○ F ) ≤ L ̂ ̂ R n ( F ) 82 / 130

Rademacher Complexity of Kernel Classes ▸ Feature map φ ∶ X ↦ ℓ 2 and p.d. kernel K ( x 1 , x 2 ) = ⟨ φ ( x 1 ) , φ ( x 2 )⟩ ▸ The set F B = { f ( x ) = ⟨ w , φ ( x )⟩ ∶ ∥ w ∥ ≤ B } is a ball in H ▸ Reproducing property f ( x ) = ⟨ f , K ( x , ⋅ )⟩ An easy calculation shows that empirical Rademacher averages are upper bounded as ̂ R n ( F B ) = E sup ∑ n ǫ i f ( x i ) = E sup ∑ n ǫ i ⟨ f , K ( x i , ⋅ )⟩ 1 1 n n f ∈F 1 f ∈ F B i = 1 i = 1 = E sup ⟨ f , 1 n ǫ i K ( x i , ⋅ )⟩ = B ⋅ E ∥ 1 n ǫ i K ( x i , ⋅ )∥ ∑ ∑ n n f ∈ F B i = 1 i = 1 n E ⎛ ǫ i ǫ j ⟨ K ( x i , ⋅ ) , K ( x j , ⋅ )⟩⎞ − 1 / 2 = B n ∑ ⎝ ⎠ i , j = 1 − 1 / 2 ≤ B n ( n K ( x i , x i )) ∑ i = 1 A data-independent bound of O ( Bκ /√ n ) can be obtained if sup x ∈ X K ( x , x ) ≤ κ 2 . Then κ and B are the effective “dimensions”. 83 / 130

Other Examples Using properties of Rademacher averages, we may establish guarantees for learning with neural networks, decision trees, and so on. Powerful technique, typically requires only a few lines of algebra. Occasionally, covering numbers and scale-sensitive dimensions can be easier to deal with. 84 / 130

Real-Valued Functions: Covering Numbers Consider ▸ a class F of [ − 1, 1 ] -valued functions ▸ let Y = [ − 1, 1 ] , ℓ ( f ( x ) , y ) = ∣ f ( x ) − y ∣ We have L ( f ) ≤ 2 E x 1 ∶ n ̂ L ( f ) − ˆ R n (F) E sup f ∈F For real-valued functions the cardinality of F∣ x 1 ∶ n is infinite. However, similar functions f and f ′ with ( f ( x 1 ) , . . . , f ( x n )) ≈ ( f ′ ( x 1 ) , . . . , f ′ ( x n )) should be treated as the same. α 86 / 130

Statistical Learning Theory Machine Learning Summer School, Kyoto, - PowerPoint PPT Presentation

Statistical Learning Theory Machine Learning Summer School, Kyoto, Japan Alexander (Sasha) Rakhlin University of Pennsylvania, The Wharton School Penn Research in Machine Learning (PRiML) August 27-28, 2012 1 / 130 References Parts of these

COMPLETE STATISTICAL THEORY OF LEARNING LEARNING USING STATISTICAL INVARIANTS Vladimir Vapnik

Statistical and Computational Statistical and Computational Learning Theory Learning Theory

Statistical Statistical Statistical Model Statistical Model Model Checking Model Checking

Foundations of AI Why learning works 1 6 . Statistical Machine Learning Bayesian Learning and

MIT 9.520/6.860, Fall 2019 Statistical Learning Theory and Applications Class 02: Statistical

Chapter 1: Probability Theory (a recap) STK4011/9011: Statistical Inference Theory Johan Pensar

Day 1: Introduction to Statistical Learning Lucas Leemann Essex Summer School Introduction to

Statistical graphics with Statistical graphics with ggplot2 ggplot2 Programming for Statistical

Overview of statistical learning theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Statistical

Statistical Machine Translation Statistical Machine Translation p Lecture 2 Theory and Praxis of

COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 23. PGM

STA 214: Probability & Statistical Models STA 214: Analysis of Statistical Models

Econ 2148, fall 2019 Statistical decision theory Maximilian Kasy Department of Economics,

Econ 2148, fall 2017 Statistical decision theory Maximilian Kasy Department of Economics,

Advanced Econometrics 2, Hilary term 2020 Statistical decision theory Maximilian Kasy Department

Advanced Econometrics 2, Hilary term 2021 Statistical decision theory Maximilian Kasy Department

Monolithic Batch Goes Microservice Streaming A story about one transformation Charles Tye &

CONGA: Distributed Congestion-Aware Load Balancing for Datacenters Mohammad Alizadeh Tom Edsall,

MCT-MultiPlex Features Three Technologies Near Infrared (NIR) based on MCT-360 NIR Transmitter;

Scalable and Reliable Data Broadcast with Kascade ephane Martin, Tomasz Buchert, Pierric Willemet,

Angular Correlations in High Energy Collisions Andrew Larkoski SLAC with Martin Jankowiak

Conformal Symmetry and the Weak Scale Hermann Nicolai MPI f ur Gravitationsphysik, Potsdam

Summary: Netstation Properties 1. Physical Attributes - Gigabit channels. No slot limit.

Models Social and Economic Networks Jafar Habibi MohammadAmin Fazli Social and Economic

Sambuz

Useful Links

Newsletter

Mail Us

Statistical Learning Theory Machine Learning Summer School, Kyoto, - PowerPoint PPT Presentation

Statistical Learning Theory Machine Learning Summer School, Kyoto, Japan Alexander (Sasha) Rakhlin University of Pennsylvania, The Wharton School Penn Research in Machine Learning (PRiML) August 27-28, 2012 1 / 130 References Parts of these

COMPLETE STATISTICAL THEORY OF LEARNING LEARNING USING STATISTICAL INVARIANTS Vladimir Vapnik

Statistical and Computational Statistical and Computational Learning Theory Learning Theory

Statistical Statistical Statistical Model Statistical Model Model Checking Model Checking

Foundations of AI Why learning works 1 6 . Statistical Machine Learning Bayesian Learning and

MIT 9.520/6.860, Fall 2019 Statistical Learning Theory and Applications Class 02: Statistical

Chapter 1: Probability Theory (a recap) STK4011/9011: Statistical Inference Theory Johan Pensar

Day 1: Introduction to Statistical Learning Lucas Leemann Essex Summer School Introduction to

Statistical graphics with Statistical graphics with ggplot2 ggplot2 Programming for Statistical

Overview of statistical learning theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Statistical

Statistical Machine Translation Statistical Machine Translation p Lecture 2 Theory and Praxis of

COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 23. PGM

STA 214: Probability &amp; Statistical Models STA 214: Analysis of Statistical Models

Econ 2148, fall 2019 Statistical decision theory Maximilian Kasy Department of Economics,

Econ 2148, fall 2017 Statistical decision theory Maximilian Kasy Department of Economics,

Advanced Econometrics 2, Hilary term 2020 Statistical decision theory Maximilian Kasy Department

Advanced Econometrics 2, Hilary term 2021 Statistical decision theory Maximilian Kasy Department

Monolithic Batch Goes Microservice Streaming A story about one transformation Charles Tye &amp;

CONGA: Distributed Congestion-Aware Load Balancing for Datacenters Mohammad Alizadeh Tom Edsall,

MCT-MultiPlex Features Three Technologies Near Infrared (NIR) based on MCT-360 NIR Transmitter;

Scalable and Reliable Data Broadcast with Kascade ephane Martin, Tomasz Buchert, Pierric Willemet,

Angular Correlations in High Energy Collisions Andrew Larkoski SLAC with Martin Jankowiak

Conformal Symmetry and the Weak Scale Hermann Nicolai MPI f ur Gravitationsphysik, Potsdam

Summary: Netstation Properties 1. Physical Attributes - Gigabit channels. No slot limit.

Models Social and Economic Networks Jafar Habibi MohammadAmin Fazli Social and Economic

Sambuz

Useful Links

Newsletter

Mail Us

STA 214: Probability & Statistical Models STA 214: Analysis of Statistical Models

Monolithic Batch Goes Microservice Streaming A story about one transformation Charles Tye &