SLIDE 1 Scalable Machine Learning
Alex Smola Yahoo! Research and ANU
http://alex.smola.org/teaching/berkeley2012 Stat 260 SP 12
SLIDE 2
Essential tools for data analysis
xkcd.com
SLIDE 3 Statistics
- Probabilities
  - Bayes rule, dependence, independence, conditional probabilities
  - Priors, Naive Bayes classifier
- Tail bounds
  - Chernoff, Hoeffding, Chebyshev, Gaussian
  - A/B testing
- Kernel density estimation
  - Parzen windows, nearest neighbors, Watson-Nadaraya estimator
- Exponential families
  - Gaussian, multinomial, Poisson
  - Conjugate distributions and smoothing, integrating out
SLIDE 7 2.1 Probabilities
(figures: Bayes, Kolmogorov)
SLIDE 8
Statistics 101
SLIDE 9 Probability
- Space of events X, e.g.
  - server working; slow response; server broken
  - income of the user (e.g. $95,000)
  - query text for search (e.g. "statistics tutorial")
- Probability axioms (Kolmogorov):
  Pr(X) ∈ [0, 1], Pr(X) = 1 for the space of all events, and Pr(∪_i X_i) = Σ_i Pr(X_i) if X_i ∩ X_j = ∅ for i ≠ j
- Example queries
  - Pr(server working) = 0.999
  - Pr(90,000 < income < 100,000) = 0.1
SLIDES 10-12 Venn Diagram
- All events; two events X and X′ overlapping in X ∩ X′
- Inclusion-exclusion: Pr(X ∪ X′) = Pr(X) + Pr(X′) − Pr(X ∩ X′)
SLIDES 13-15 (In)dependence
- Independent events: Pr(x, y) = Pr(x) · Pr(y)
  - Login behavior of two users (approximately)
  - Disk crashes in different colos (approximately)
- Dependent events: Pr(x, y) ≠ Pr(x) · Pr(y)
  - Emails
  - Queries
  - News stream / Buzz / Tweets
  - IM communication
  - Russian Roulette
- Dependence is everywhere!
SLIDE 16
A Graphical Model
spam → mail
p(spam, mail) = p(spam) p(mail|spam)
SLIDE 17 Bayes Rule
- Joint Probability
- Bayes Rule
- Hypothesis testing
- Reverse conditioning
Pr(X, Y) = Pr(X|Y) Pr(Y) = Pr(Y|X) Pr(X), hence Pr(X|Y) = Pr(Y|X) · Pr(X) / Pr(Y)
SLIDES 18-19 AIDS test (Bayes rule)
- Data
  - Approximately 0.1% are infected
  - Test detects all infections
  - Test reports positive for 1% of healthy people
- Probability of having AIDS if the test is positive:
  Pr(a = 1|t = 1) = Pr(t = 1|a = 1) · Pr(a = 1) / Pr(t = 1)
                  = Pr(t = 1|a = 1) · Pr(a = 1) / [Pr(t = 1|a = 1) · Pr(a = 1) + Pr(t = 1|a = 0) · Pr(a = 0)]
                  = 1 · 0.001 / (1 · 0.001 + 0.01 · 0.999) ≈ 0.091
SLIDES 20-23 Improving the diagnosis
- Use a follow-up test
  - Test 2 reports positive for 90% of infections
  - Test 2 reports positive for 5% of healthy people
- Why can't we use Test 1 twice?
  Its outcomes are not independent, but tests 1 and 2 are conditionally independent given the infection status: p(t1, t2|a) = p(t1|a) · p(t2|a)
- Probability of being healthy despite two positive tests:
  Pr(a = 0|t1 = 1, t2 = 1) = 0.01 · 0.05 · 0.999 / (1 · 0.9 · 0.001 + 0.01 · 0.05 · 0.999) ≈ 0.357
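The two posteriors above are easy to get wrong by hand; a minimal numeric check (a sketch, with variable names of our choosing, not from the slides):

p_a = 0.001                       # Pr(a = 1): roughly 0.1% infected
p_t1 = {1: 1.0, 0: 0.01}          # Pr(t1 = 1 | a): detects all, 1% false positives
p_t2 = {1: 0.9, 0: 0.05}          # Pr(t2 = 1 | a): 90% detection, 5% false positives

# One positive test.
num = p_t1[1] * p_a
print(num / (num + p_t1[0] * (1 - p_a)))   # Pr(a=1 | t1=1) ~ 0.091

# Both tests positive: conditional independence lets us multiply likelihoods.
num = p_t1[1] * p_t2[1] * p_a
den = num + p_t1[0] * p_t2[0] * (1 - p_a)
print(num / den)                           # Pr(a=1 | t1=1, t2=1) ~ 0.643
print(1 - num / den)                       # Pr(a=0 | t1=1, t2=1) ~ 0.357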
SLIDE 24 Logarithms are good
- Floating point numbers: an IEEE 754 double has 1 sign bit, 11 exponent bits, and 52 mantissa bits
- Probabilities can be very small, in particular products of many probabilities. Underflow!
- Store the data in the mantissa, not the exponent: work with π_i = log p_i
- Known bug e.g. in Mahout Dirichlet clustering
- Products become sums: ∏_i p_i → Σ_i π_i
- For sums use log-sum-exp: log Σ_i p_i = max π + log Σ_i exp(π_i − max π)
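A minimal sketch of the log-sum-exp trick from the slide (the function name is ours):

import math

def log_sum_exp(log_ps):
    # log sum_i exp(pi_i) = max(pi) + log sum_i exp(pi_i - max(pi))
    m = max(log_ps)
    return m + math.log(sum(math.exp(p - m) for p in log_ps))

log_ps = [-1000.0] * 500                  # each p_i = exp(-1000) underflows to 0.0
print(sum(math.exp(p) for p in log_ps))   # 0.0 - the naive sum is useless
print(log_sum_exp(log_ps))                # -1000 + log(500) ~ -993.79, exact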
SLIDE 25
Application: Naive Bayes
SLIDE 26
Naive Bayes Spam Filter
SLIDES 27-29 Naive Bayes Spam Filter
- Key assumption: words occur independently of each other given the label of the document
  p(w_1, ..., w_n|spam) = ∏_{i=1}^n p(w_i|spam)
- Spam classification via Bayes Rule
  p(spam|w_1, ..., w_n) ∝ p(spam) ∏_{i=1}^n p(w_i|spam)
- Parameter estimation: compute the spam probability and the word distributions for spam and ham
SLIDES 30-31 Naive Bayes Spam Filter
Under this independence assumption the following phrases are all equally likely:
- Get rich quick. Buy UCB stock.
- Buy Viagra. Make your UCB experience last longer.
- You deserve a PhD from UCB. We recognize your expertise.
- Make your rich UCB PhD experience last longer.
SLIDES 32-35 A Graphical Model
- Directed model: spam → w_1, w_2, ..., w_n with
  p(w_1, ..., w_n|spam) = ∏_{i=1}^n p(w_i|spam)
- Plate notation: spam → w_i with plate i = 1..n
- Question: how do we estimate p(w|spam)?
SLIDE 36 Naive Bayes Spam Filter
- Data
  - Emails (headers, body, metadata)
  - Labels (spam/ham) - assume that users actually label all mails
- Processing capability
  - Billions of e-mails
  - 1000s of servers
- Need to estimate p(y) and p(x_i|y)
  - Compute the distribution of x_i for every y
  - Compute the distribution of y
SLIDE 37
- date
- time
- recipient path
- IP number
- sender
- encoding
- many more features
Delivered-To: alex.smola@gmail.com Received: by 10.216.47.73 with SMTP id s51cs361171web; Tue, 3 Jan 2012 14:17:53 -0800 (PST) Received: by 10.213.17.145 with SMTP id s17mr2519891eba.147.1325629071725; Tue, 03 Jan 2012 14:17:51 -0800 (PST) Return-Path: <alex+caf_=alex.smola=gmail.com@smola.org> Received: from mail-ey0-f175.google.com (mail-ey0-f175.google.com [209.85.215.175]) by mx.google.com with ESMTPS id n4si29264232eef.57.2012.01.03.14.17.51 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:51 -0800 (PST) Received-SPF: neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of alex+caf_=alex.smola=gmail.com@smola.org) client- ip=209.85.215.175; Authentication-Results: mx.google.com; spf=neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of alex +caf_=alex.smola=gmail.com@smola.org) smtp.mail=alex+caf_=alex.smola=gmail.com@smola.org; dkim=pass (test mode) header.i=@googlemail.com Received: by eaal1 with SMTP id l1so15092746eaa.6 for <alex.smola@gmail.com>; Tue, 03 Jan 2012 14:17:51 -0800 (PST) Received: by 10.205.135.18 with SMTP id ie18mr5325064bkc.72.1325629071362; Tue, 03 Jan 2012 14:17:51 -0800 (PST) X-Forwarded-To: alex.smola@gmail.com X-Forwarded-For: alex@smola.org alex.smola@gmail.com Delivered-To: alex@smola.org Received: by 10.204.65.198 with SMTP id k6cs206093bki; Tue, 3 Jan 2012 14:17:50 -0800 (PST) Received: by 10.52.88.179 with SMTP id bh19mr10729402vdb.38.1325629068795; Tue, 03 Jan 2012 14:17:48 -0800 (PST) Return-Path: <althoff.tim@googlemail.com> Received: from mail-vx0-f179.google.com (mail-vx0-f179.google.com [209.85.220.179]) by mx.google.com with ESMTPS id dt4si11767074vdb.93.2012.01.03.14.17.48 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:48 -0800 (PST) Received-SPF: pass (google.com: domain of althoff.tim@googlemail.com designates 209.85.220.179 as permitted sender) client-ip=209.85.220.179; Received: by vcbf13 with SMTP id f13so11295098vcb.10 for <alex@smola.org>; Tue, 03 Jan 2012 14:17:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; h=mime-version:sender:date:x-google-sender-auth:message-id:subject :from:to:content-type; bh=WCbdZ5sXac25dpH02XcRyDOdts993hKwsAVXpGrFh0w=; b=WK2B2+ExWnf/gvTkw6uUvKuP4XeoKnlJq3USYTm0RARK8dSFjyOQsIHeAP9Yssxp6O 7ngGoTzYqd+ZsyJfvQcLAWp1PCJhG8AMcnqWkx0NMeoFvIp2HQooZwxSOCx5ZRgY+7qX uIbbdna4lUDXj6UFe16SpLDCkptd8OZ3gr7+o= MIME-Version: 1.0 Received: by 10.220.108.81 with SMTP id e17mr24104004vcp.67.1325629067787; Tue, 03 Jan 2012 14:17:47 -0800 (PST) Sender: althoff.tim@googlemail.com Received: by 10.220.17.129 with HTTP; Tue, 3 Jan 2012 14:17:47 -0800 (PST) Date: Tue, 3 Jan 2012 14:17:47 -0800 X-Google-Sender-Auth: 6bwi6D17HjZIkxOEol38NZzyeHs Message-ID: <CAFJJHDGPBW+SdZg0MdAABiAKydDk9tpeMoDijYGjoGO-WC7osg@mail.gmail.com> Subject: CS 281B. Advanced Topics in Learning and Decision Making From: Tim Althoff <althoff@eecs.berkeley.edu>
this is a gross simplification
SLIDES 38-39 Recall - Map Reduce
- 1000s of (faulty) machines
- Lots of jobs are mostly embarrassingly parallel (except for a sorting/transpose phase)
- Functional programming origins
  - Map(key, value) processes each (key, value) pair and outputs a new (key, value) pair
  - Reduce(key, values) reduces all instances with the same key to an aggregate
- Example - extremely naive wordcount
  - Map(docID, document): for each document emit many (wordID, count) pairs
  - Reduce(wordID, counts): sum over all counts for a given wordID and emit (wordID, aggregate)
(figure from Ramakrishnan, Sakrejda, Canon, DoE 2011)
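A toy, single-process imitation of this wordcount, with a dictionary standing in for the shuffle/sort phase (all names and data are ours; real MapReduce shards this across machines):

from collections import defaultdict

def map_fn(doc_id, document):
    for word in document.split():
        yield word, 1                       # emit (wordID, count) pairs

def reduce_fn(word, counts):
    return word, sum(counts)                # aggregate all counts for one key

docs = {1: "to be or not to be", 2: "to do is to be"}
shuffled = defaultdict(list)                # stands in for the sort/transpose phase
for doc_id, doc in docs.items():
    for word, count in map_fn(doc_id, doc):
        shuffled[word].append(count)
print(dict(reduce_fn(w, c) for w, c in shuffled.items()))
# {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}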
SLIDE 40 Naive NaiveBayes Classifier
- Two classes (spam/ham)
- Binary features (e.g. presence of $$$, viagra)
- Simplistic algorithm
  - Count occurrences of each feature for spam/ham
  - Count the number of spam/ham mails
- Feature probability p(x_i = TRUE|y) = n(i, y)/n(y) and spam probability p(y) = n(y)/n
- p(y|x) ∝ (n(y)/n) ∏_{i: x_i=TRUE} n(i, y)/n(y) ∏_{i: x_i=FALSE} [n(y) − n(i, y)]/n(y)
SLIDES 41-42 Naive NaiveBayes Classifier
p(y|x) ∝ (n(y)/n) ∏_{i: x_i=TRUE} n(i, y)/n(y) ∏_{i: x_i=FALSE} [n(y) − n(i, y)]/n(y)
- What if n(i, y) = 0? What if n(i, y) = n(y)?
  Either way one factor of the product becomes 0, so a single never-seen (or always-seen) feature forces p(y|x) = 0 for that class.
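A sketch that makes the failure mode concrete on made-up counts (names and data are ours):

def posterior(x, n_iy, n_y, n):
    scores = {}
    for y in n_y:
        s = n_y[y] / n                      # p(y) = n(y)/n
        for i, xi in enumerate(x):
            p = n_iy[y][i] / n_y[y]         # p(x_i = TRUE | y) = n(i, y)/n(y)
            s *= p if xi else (1 - p)
        scores[y] = s
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

n_y = {"spam": 4, "ham": 4}                 # 4 spam and 4 ham mails
n_iy = {"spam": [3, 4], "ham": [2, 0]}      # n(i, y): feature 1 never seen in ham
print(posterior([True, True], n_iy, n_y, n=8))
# {'spam': 1.0, 'ham': 0.0} - one zero count silences ham entirely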
SLIDES 43-44 Simple Algorithm
- For each document (x, y) do (a trivially parallel pass over all data)
  - Aggregate label counts for y
  - For each feature x_i in x aggregate statistics for (x_i, y)
- For y estimate the distribution p(y)
- For each (x_i, y) pair estimate the distribution p(x_i|y), e.g. Parzen windows, exponential family (Gauss, Laplace, Poisson, ...), mixture
- Given a new instance compute p(y|x) ∝ p(y) ∏_j p(x_j|y)
SLIDES 45-46 MapReduce Algorithm
- Map(document (x, y)) - runs locally per chunkserver
  - For each feature x_i in x aggregate statistics for (x_i, y) for each y
  - Send statistics (key = (x_i, y), value = counts) to the reducer
- Reduce(x_i, y)
  - Aggregate over all messages from the mappers
  - Estimate the distribution p(x_i|y), e.g. Parzen windows, exponential family (Gauss, Laplace, Poisson, ...), mixture
  - Send the coordinate-wise model to global storage
- Given a new instance compute p(y|x) ∝ p(y) ∏_j p(x_j|y), fetching only the model coordinates needed
SLIDE 47
Estimating Probabilities
SLIDE 48 Binomial Distribution
- Two outcomes (head, tail); (0, 1)
- Data likelihood: p(X; π) = π^{n_1} (1 − π)^{n_0}
- Maximum Likelihood Estimation
  - Constrained optimization problem: π ∈ [0, 1]
  - Incorporate the constraint via p(x; θ) = e^{xθ}/(1 + e^θ)
  - Taking derivatives yields θ = log(n_1/n_0), i.e. p(x = 1) = n_1/(n_0 + n_1)
SLIDES 49-50 ... in detail ...
p(X; θ) = ∏_{i=1}^n p(x_i; θ) = ∏_{i=1}^n e^{θx_i}/(1 + e^θ)
⟹ log p(X; θ) = θ Σ_{i=1}^n x_i − n log(1 + e^θ)
⟹ ∂_θ log p(X; θ) = Σ_{i=1}^n x_i − n e^θ/(1 + e^θ)
Setting the derivative to zero: (1/n) Σ_{i=1}^n x_i = e^θ/(1 + e^θ) = p(x = 1),
i.e. the model matches the empirical probability of x = 1.
SLIDE 51 Discrete Distribution
- n outcomes (e.g. USA, Canada, India, UK, NZ)
- Data likelihood: p(X; π) = ∏_i π_i^{n_i}
- Maximum Likelihood Estimation
  - Constrained optimization problem ... or ...
  - Incorporate the constraint via p(x; θ) = exp(θ_x) / Σ_{x′} exp(θ_{x′})
  - Taking derivatives yields θ_i = log(n_i / Σ_j n_j), i.e. p(x = i) = n_i / Σ_j n_j
SLIDES 52-53 Tossing a Die
(figures: empirical outcome frequencies; panels labeled 24, 120, 60, 12)
SLIDE 54 Key Questions
- Do empirical averages converge?
  - Probabilities
  - Means / moments
- Rate of convergence and limit distribution
- Worst case guarantees
- Using prior knowledge
- Applications: drug testing, semiconductor fabs, computational advertising, user interface design, ...
SLIDE 55 2.2 Tail Bounds
(figures: Chernoff, Hoeffding, Chebyshev)
SLIDE 56 Expectations
- Random variable x with probability measure p
- Expected value of f(x): E[f(x)] = ∫ f(x) dp(x)
- Special case - discrete probability mass (the same trick works for intervals):
  Pr{x = c} = E[1{x = c}] = ∫ 1{x = c} dp(x)
- Draw x_i identically and independently from p
- Empirical averages:
  E_emp[f(x)] = (1/n) Σ_{i=1}^n f(x_i) and Pr_emp{x = c} = (1/n) Σ_{i=1}^n 1{x_i = c}
SLIDES 57-58 Deviations
- A gambler rolls a die 100 times; '6' only occurs 11 times. The fair expectation is 16.7. IS THE DIE TAINTED?
- Empirical frequency: P̂(x = 6) = (1/n) Σ_{i=1}^n 1{x_i = 6}
- Probability of seeing '6' at most 11 times:
  Pr(X ≤ 11) = Σ_{i=0}^{11} p(i) = Σ_{i=0}^{11} (100 choose i) (1/6)^i (5/6)^{100−i} ≈ 7.0%
- It's probably OK ... can we develop a general theory?
- The same question appears everywhere: is the ad campaign working, is the new page layout better, is the drug working?
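The 7% figure is a one-line computation; a quick check, assuming scipy is available:

from scipy.stats import binom

# Pr(X <= 11) for 100 rolls of a fair die, exactly as on the slide
print(binom.cdf(11, 100, 1/6))   # ~0.07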
SLIDE 59 Empirical average for a die
(figure: running empirical average over 10^1 to 10^3 rolls, values between 1 and 6)
How quickly does it converge?
SLIDE 60 Law of Large Numbers
- Random variables x_i with mean μ = E[x_i]
- Empirical average: μ̂_n := (1/n) Σ_{i=1}^n x_i
- Weak Law of Large Numbers (convergence in probability):
  lim_{n→∞} Pr(|μ̂_n − μ| ≤ ε) = 1 for any ε > 0
- Strong Law of Large Numbers (almost sure convergence):
  Pr(lim_{n→∞} μ̂_n = μ) = 1
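A minimal simulation of the law of large numbers for die rolls (numpy; the seed and sample sizes are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1000)                # x_i in {1, ..., 6}
running_mean = np.cumsum(rolls) / np.arange(1, 1001)
for n in (10, 100, 1000):
    print(n, running_mean[n - 1])                    # drifts towards 3.5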
SLIDE 61 Empirical average for a die
- Upper and lower bounds are μ ± √(Var(x)/n)
- This is an example of the central limit theorem
(figure: 5 sample traces of the running average over 10^1 to 10^3 rolls)
SLIDES 62-63 Central Limit Theorem
- Independent random variables x_i with mean μ_i and standard deviation σ_i. Then
  z_n := [Σ_{i=1}^n σ_i²]^{−1/2} Σ_{i=1}^n (x_i − μ_i)
  converges to a Normal distribution: z_n → N(0, 1)
- Special case - IID random variables and their average:
  (√n/σ) [(1/n) Σ_{i=1}^n x_i − μ] → N(0, 1)
- Convergence rate is O(n^{−1/2})
SLIDE 64 Slutsky's Theorem
- Continuous mapping theorem
- X_i and Y_i are sequences of random variables
- X_i has as its limit the random variable X
- Y_i has as its limit the constant c
- g(x, y) is continuous at all points (x, c)
- Then g(X_i, Y_i) converges in distribution to g(X, c)
SLIDE 65 Delta Method
- Random variable X_n convergent to b: a_n^{−2}(X_n − b) → N(0, Σ) with a_n² → 0 for n → ∞
- g is a continuously differentiable function at b
- Then g(X_n) inherits the convergence properties:
  a_n^{−2}(g(X_n) − g(b)) → N(0, [∇_x g(b)] Σ [∇_x g(b)]^⊤)
- Proof: use a Taylor expansion for g(X_n) − g(b):
  a_n^{−2}[g(X_n) − g(b)] = [∇_x g(ξ_n)]^⊤ a_n^{−2}(X_n − b), where ξ_n lies on the line segment [X_n, b]
- By Slutsky's theorem ∇_x g(ξ_n) converges to ∇_x g(b), hence g(X_n) is asymptotically normal
SLIDE 66
Tools for the proof
SLIDE 67 Fourier Transform
- Fourier transform relations
  F[f](ω) := (2π)^{−d/2} ∫_{ℝ^d} f(x) exp(−i⟨ω, x⟩) dx
  F^{−1}[g](x) := (2π)^{−d/2} ∫_{ℝ^d} g(ω) exp(i⟨ω, x⟩) dω
- Useful identities
  - Identity: F^{−1} ∘ F = F ∘ F^{−1} = Id
  - Derivative: F[∂_x f] = iωF[f]
  - Convolution (also holds for the inverse transform): F[f ∗ g] = (2π)^{d/2} F[f] · F[g]
SLIDE 68 The Characteristic Function Method
- Characteristic function (the inverse Fourier transform of the density, up to constants):
  φ_X(ω) := ∫ exp(i⟨ω, x⟩) dp(x)
- For X and Y independent:
  - the density of the sum is a convolution: p_{X+Y}(z) = ∫ p_X(z − y) p_Y(y) dy = (p_X ∗ p_Y)(z)
  - the characteristic function is therefore a product: φ_{X+Y}(ω) = φ_X(ω) · φ_Y(ω)
- Proof - plug in the definition of the Fourier transform
- The characteristic function is unique, so it determines the distribution
SLIDE 69 Proof - Weak law of large numbers
- Require that the expectation exists
- Taylor expansion of the exponential (need to assume that we can bound the tail):
  exp(iωx) = 1 + i⟨ω, x⟩ + o(|ω|) and hence φ_X(ω) = 1 + iωE_X[x] + o(|ω|)
- Average of m random variables: the convolution of densities becomes a product of characteristic functions, with the higher-order terms vanishing:
  φ_{μ̂_m}(ω) = (1 + i(ω/m)μ + o(m^{−1}|ω|))^m
- Limit: φ_{μ̂_m}(ω) → exp(iωμ) = 1 + iωμ + ..., the characteristic function of the constant distribution at the mean μ
SLIDE 70 Warning
- Moments may not always exist
- Cauchy distribution: p(x) = (1/π) · 1/(1 + x²)
- For the mean to exist the following integral would have to converge:
  ∫ |x| dp(x) ≥ (2/π) ∫_1^∞ x/(1 + x²) dx ≥ (1/π) ∫_1^∞ (1/x) dx = ∞
SLIDE 71 Proof - Central limit theorem
- Require that second order moments exist (we assume they're all identical WLOG)
- Characteristic function:
  exp(iωx) = 1 + iωx − (1/2)ω²x² + o(|ω|²) and hence φ_X(ω) = 1 + iωE_X[x] − (1/2)ω² var_X[x] + o(|ω|²)
- Subtract out the mean (centering): z_n := [Σ_{i=1}^n σ_i²]^{−1/2} Σ_{i=1}^n (x_i − μ_i)
- φ_{z_m}(ω) = (1 − ω²/(2m) + o(m^{−1}|ω|²))^m → exp(−ω²/2) for m → ∞
- This is the Fourier transform of a Normal distribution
SLIDE 72 Central Limit Theorem in Practice
(figures: histograms of averages for growing sample sizes, unscaled vs. scaled)
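A sketch reproducing the effect: averages of uniform variables, unscaled vs. scaled (numpy; the choice of U[0, 1] and the sample sizes are ours):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, (1 / 12) ** 0.5           # mean and std of U[0, 1]
for n in (1, 3, 10, 100):
    means = rng.random((100_000, n)).mean(axis=1)
    z = np.sqrt(n) / sigma * (means - mu)  # the scaled average from the slide
    print(f"n={n:3d}  var(z)={z.var():.3f}")
# var(z) ~ 1 for every n while the unscaled means pile up at 1/2;
# histograms of z approach the N(0, 1) bell shape as n grows.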
SLIDE 73
Finite sample tail bounds
SLIDE 74 Simple tail bounds
- Gauss-Markov: nonnegative random variable X with mean μ. Then Pr(X ≥ ε) ≤ μ/ε.
  Proof - decompose the expectation:
  Pr(X ≥ ε) = ∫_ε^∞ dp(x) ≤ ∫_ε^∞ (x/ε) dp(x) ≤ ε^{−1} ∫_0^∞ x dp(x) = μ/ε
- Chebyshev: random variable X with mean μ and variance σ². Then Pr(|X − μ| ≥ ε) ≤ σ²/ε².
  Proof - applying Gauss-Markov to Y = (X − μ)² with confidence ε² yields the result.
- For the average of m draws: Pr(|μ̂_m − μ| > ε) ≤ σ² m^{−1} ε^{−2}, or equivalently ε ≤ σ/√(δm)
SLIDE 75 Scaling behavior
- Gauss-Markov: ε ≤ μ/δ - scales properly in μ but expensive in δ
- Chebyshev: ε ≤ σ/√(δm) - proper scaling in σ but still bad in δ
- Can we get logarithmic scaling in δ?
SLIDE 76 Chernoff bound
- KL-divergence variant of the Chernoff bound:
  K(p, q) = p log(p/q) + (1 − p) log[(1 − p)/(1 − q)]
- For n independent tosses of a biased coin with heads probability p and q > p (w.l.o.g.):
  Pr{Σ_i x_i ≥ nq} ≤ exp(−n K(q, p))
- Proof: for k ≥ qn,
  Pr{Σ_i x_i = k|q} / Pr{Σ_i x_i = k|p} = q^k (1 − q)^{n−k} / [p^k (1 − p)^{n−k}]
  ≥ q^{qn} (1 − q)^{n−qn} / [p^{qn} (1 − p)^{n−qn}] = exp(n K(q, p))
  hence Σ_{k≥nq} Pr{Σ_i x_i = k|p} ≤ exp(−n K(q, p)) Σ_{k≥nq} Pr{Σ_i x_i = k|q} ≤ exp(−n K(q, p))
- Pinsker's inequality, K(q, p) ≥ 2(q − p)², turns this into the familiar exponential bound exp(−2n(q − p)²)
SLIDE 77 McDiarmid Inequality
- Independent random variables X_i
- Function f : X^m → R with bounded differences:
  |f(x_1, ..., x_i, ..., x_m) − f(x_1, ..., x_i′, ..., x_m)| ≤ c_i
- Deviation from the expected value:
  Pr(|f(x_1, ..., x_m) − E_{X_1,...,X_m}[f(x_1, ..., x_m)]| > ε) ≤ 2 exp(−2ε²/C²), where C² = Σ_{i=1}^m c_i²
- Special case: if f is the average and the X_i have bounded range c (Hoeffding's bound):
  Pr(|μ̂_m − μ| > ε) ≤ 2 exp(−2mε²/c²)
SLIDE 78 Scaling behavior
δ := Pr(|μ̂_m − μ| > ε) ≤ 2 exp(−2mε²/c²)
⟹ log(δ/2) ≤ −2mε²/c² ⟹ ε ≤ c √[(log 2 − log δ)/(2m)]
This helps when we need to combine several tail bounds since we only pay logarithmically for their combination.
SLIDE 79 More tail bounds
- Higher order moments
- Bernstein inequality (needs a variance bound): for independent centered random variables X_i,
  Pr(Σ_i X_i ≥ t) ≤ exp(−(t²/2) / (Σ_i E[X_i²] + Mt/3))
  here M upper-bounds the random variables X_i; with t = mε this bounds Pr(μ̂_m − μ ≥ ε)
- Proof via the Gauss-Markov inequality applied to exponential sums (hence "exponential inequality")
- See also Azuma, Bennett, Chernoff, ...
- Absolute / relative error bounds
- Bounds for (weakly) dependent random variables
SLIDE 80
Tail bounds in practice
SLIDE 81 A/B testing
- Two possible webpage layouts
- Which layout is better?
- Experiment
- Half of the users see A
- The other half sees design B
- How many trials do we need to decide which page attracts
more clicks?
Assume that the probabilities are p(A) = 0.1 and p(B) = 0.11 respectively and that p(A) is known
SLIDE 82 Chebyshev Inequality
- Need to bound a deviation of 0.01
- The mean is p(B) = 0.11 (we don't know this yet)
- Want a failure probability δ of 5%
- With no prior knowledge we can only bound the variance by σ² = 0.25:
  m = σ²/(ε²δ) = 0.25/(0.01² · 0.05) = 50,000 users suffice
- If we know that the click probability is at most 0.15 we can bound the variance by 0.15 · 0.85 = 0.1275. This requires at most 25,500 users.
SLIDE 83 Hoeffding's bound
- The random variable has bounded range [0, 1] (click or no click), hence c = 1
- Solve Hoeffding's inequality for m:
  m = −c² log(δ/2)/(2ε²) = −log 0.025/(2 · 0.01²) < 18,445 users
- This is slightly better than Chebyshev.
SLIDE 84 Normal Approximation (Central Limit Theorem)
- Use asymptotic normality
- The Gaussian interval containing 0.95 probability is given by ε = 1.96σ:
  (2πσ²)^{−1/2} ∫_{μ−ε}^{μ+ε} exp(−(x − μ)²/(2σ²)) dx = 0.95
- Use the variance bound of 0.1275 (see Chebyshev):
  m = 1.96² σ²/ε² = 1.96² · 0.1275/0.01² ≈ 4,898 users
- Same rate as the Hoeffding bound! Better bounds by bounding the variance.
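Putting the three sample-size computations side by side (a sketch; the shared variance bound 0.1275 and the 95% two-sided quantile 1.96 are the values used above):

import math

eps, delta, var = 0.01, 0.05, 0.1275

m_chebyshev = var / (eps**2 * delta)               # sigma^2 / (eps^2 * delta)
m_hoeffding = -math.log(delta / 2) / (2 * eps**2)  # range c = 1
m_normal = 1.96**2 * var / eps**2                  # 1.96 sigma covers 95%

print(round(m_chebyshev))   # 25500
print(round(m_hoeffding))   # 18444, i.e. just under 18,445
print(round(m_normal))      # ~4898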
SLIDE 85 Beyond
- Many different layouts?
- Combinatorial strategy to generate them
(aka the Thai Restaurant process)
- What if it depends on the user / time of day
- Stateful user (e.g. query keywords in search)
- What if we have a good prior of the response
(rather than variance bound)?
- Explore/exploit/reinforcement learning/control
(more details at the end of this class)
SLIDE 86 2.3 Kernel Density Estimation
(figure: Parzen)
SLIDE 87 Density Estimation
- For discrete bins (e.g. male/female; English/French/German/Spanish/Chinese) we get good uniform convergence
- Applying the union bound and Hoeffding:
  Pr(sup_{a∈A} |p̂(a) − p(a)| ≥ ε) ≤ Σ_{a∈A} Pr(|p̂(a) − p(a)| ≥ ε) ≤ 2|A| exp(−2mε²)
- Solving for the error probability δ: ε ≤ √[(log(2|A|) − log δ)/(2m)] - good news
SLIDE 88 Density Estimation
- Continuous domain = infinite number of bins
- Curse of dimensionality
  - 10 bins on [0, 1] is probably good
  - 10^10 bins on [0, 1]^10 require a very accurate estimate: the probability mass per cell also decreases by a factor of 10^10
(figures: sample histogram vs. underlying density)
SLIDES 89-91 Bin Counting (figures)
SLIDE 92 Parzen Windows
- Use the empirical density (delta distributions): p_emp(x) = (1/m) Σ_{i=1}^m δ_{x_i}(x)
- This breaks if we see slightly different instances
- Kernel density estimate: smear out the empirical density with a nonnegative smoothing kernel k_x(x′) satisfying ∫ k_x(x′) dx′ = 1 for all x
SLIDE 93 Parzen Windows
- Density estimate: p_emp(x) = (1/m) Σ_{i=1}^m δ_{x_i}(x) becomes p̂(x) = (1/m) Σ_{i=1}^m k_{x_i}(x)
- Smoothing kernels (figures: the four kernel shapes):
  - Gauss: (2π)^{−1/2} exp(−x²/2)
  - Laplace: (1/2) exp(−|x|)
  - Epanechnikov: (3/4) max(0, 1 − x²)
  - Uniform: (1/2) χ_{[−1,1]}(x)
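A minimal Parzen window estimator with a Gaussian kernel (1-d sketch; the data and width are made up by us):

import numpy as np

def parzen(x, data, r):
    # p_hat(x) = (1/m) sum_i r^{-1} k((x - x_i)/r) with Gaussian k, d = 1
    u = (x[:, None] - data[None, :]) / r
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return k.mean(axis=1) / r

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(60, 5, 200), rng.normal(85, 8, 100)])
grid = np.linspace(40, 110, 8)
print(parzen(grid, data, r=3.0).round(4))   # a smooth bimodal estimate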
SLIDES 94-95 Size matters
(figures: kernel density estimates for widths 0.3, 1, 3, 10 against the sample and the underlying density)
- The kernel width r enters via k_{x_i}(x) = r^{−d} h((x − x_i)/r)
- Too narrow overfits; too wide smoothes towards a constant distribution
- How to choose r?
SLIDES 96-98 Smoothing (figures)
SLIDE 99 Capacity Control
SLIDE 100 Capacity control
- Need an automatic mechanism to select the scale r
- Overfitting: maximum likelihood will lead to r = 0, since the smoothing kernels peak at the instances - but this is (typically) a set of measure 0
- Validation set: set aside data just for calibrating r
- Leave-one-out: estimate the likelihood of each instance using all the others
- Alternatives: use a prior on r; convergence analysis
SLIDE 101 Capacity Control
- Validation set:
  log p̂(X′) = Σ_{x′∈X′} log p̂(x′) = Σ_{x′∈X′} log Σ_{x∈X} k((x − x′)/r) − |X′| [d log r + log |X|]
- Leave-one-out crossvalidation:
  p̂_{X∖{x}}(x) = (1/(m − 1)) Σ_{x′∈X∖{x}} r^{−d} k((x′ − x)/r) = (m/(m − 1)) [p̂(x) − m^{−1} r^{−d} k(0)]
  ⟹ L[X] = m log(m/(m − 1)) + Σ_{x∈X} log[p̂(x) − m^{−1} r^{−d} k(0)]
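Rather than the closed-form correction above, a sketch can compute the leave-one-out likelihood directly by zeroing the self-contribution k(0) (names and data are ours):

import numpy as np

def loo_log_likelihood(data, r):
    m = len(data)
    u = (data[:, None] - data[None, :]) / r
    k = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * r)
    np.fill_diagonal(k, 0.0)          # drop the self-contribution k(0)
    p_loo = k.sum(axis=1) / (m - 1)   # p_hat_{X \ {x}}(x)
    return np.log(p_loo).sum()

rng = np.random.default_rng(0)
data = rng.normal(70, 10, 300)
for r in (0.3, 1.0, 3.0, 10.0):
    print(r, round(loo_log_likelihood(data, r), 1))
# very small r overfits and scores poorly; intermediate widths score best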
SLIDE 102
Leave-one-out estimate
SLIDE 103
Optimal estimate
SLIDE 104
Silverman’s rule
SLIDE 105 Silverman's rule
- Chicken and egg problem
  - Want a wide kernel in low density regions
  - Want a narrow kernel where we have much data
  - Need a density estimate to estimate the density
- Simple hack: use the average distance from the k nearest neighbors
  r_i = (r/k) Σ_{x∈NN(x_i,k)} ‖x_i − x‖
SLIDE 106
Density
true density
SLIDE 107
non-adaptive estimate
SLIDE 108
adaptive estimate
SLIDE 109
distance distribution
SLIDE 110
Watson-Nadaraya estimator
SLIDE 111 Weighted smoother
- Given pairs (x_i, y_i), estimate y|x for a new x
- Use a distance-weighted average of the labels y_i with local weights k_{x_i}(x):
  ŷ(x) = Σ_i y_i k_{x_i}(x) / Σ_j k_{x_j}(x)
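A direct transcription of the estimator (toy data; the names and the sin target are ours):

import numpy as np

def watson_nadaraya(x, X, y, r):
    u = (x[:, None] - X[None, :]) / r
    w = np.exp(-0.5 * u**2)                      # local weights k_{x_i}(x)
    return (w * y).sum(axis=1) / w.sum(axis=1)   # weighted average of labels

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 200)
y = np.sin(X) + rng.normal(0, 0.3, 200)          # noisy labels
grid = np.linspace(0, 10, 6)
print(watson_nadaraya(grid, X, y, r=0.5).round(2))   # close to sin(grid)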
SLIDE 112
SLIDE 113
Watson-Nadaraya Classifier
SLIDE 114
SLIDE 115
Watson-Nadaraya regression estimate
SLIDE 116 k-Nearest Neighbors
- Further simplification
  - Same weight for all nearest neighbors
  - Same number of neighbors everywhere
- Classification: use majority rule to estimate the label
- Regression: use the average of the neighbor labels
SLIDE 117
2.4 Exponential Families
SLIDE 118
Exponential Families
SLIDE 119 Exponential Families
p(x; θ) = exp(⟨φ(x), θ⟩ − g(θ)) where g(θ) = log Σ_{x′} exp(⟨φ(x′), θ⟩)
SLIDES 120-121 Exponential Families
- Density function: p(x; θ) = exp(⟨φ(x), θ⟩ − g(θ)) where g(θ) = log Σ_{x′} exp(⟨φ(x′), θ⟩)
- The log partition function generates cumulants: ∂_θ g(θ) = E[φ(x)] and ∂²_θ g(θ) = Var[φ(x)]
- g is convex (its second derivative is p.s.d.)
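A quick numeric check of ∂_θ g(θ) = E[φ(x)] for the binomial family (our toy verification, not from the slides):

import math

g = lambda t: math.log(1 + math.exp(t))          # log partition, binomial family
theta, h = 0.7, 1e-6
mean = math.exp(theta) / (1 + math.exp(theta))   # E[phi(x)] = p(x = 1)
dg = (g(theta + h) - g(theta - h)) / (2 * h)     # central finite difference
print(mean, dg)                                  # agree to ~1e-10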
SLIDE 122 Examples
- Binomial distribution: φ(x) = x
- Discrete distribution: φ(x) = e_x (e_x is the unit vector for x)
- Gaussian: φ(x) = (x, (1/2)xx^⊤)
- Poisson (counting measure 1/x!): φ(x) = x
- Dirichlet, Beta, Gamma, Wishart, ...
SLIDE 123
Normal Distribution
SLIDE 124 Poisson Distribution
p(x; λ) = λ^x e^{−λ} / x!
SLIDE 125 Beta Distribution
p(x; α, β) = x^{α−1} (1 − x)^{β−1} / B(α, β)
SLIDE 126
Dirichlet Distribution
... this is a distribution over distributions ...
SLIDE 127
Maximum Likelihood
SLIDE 128 Maximum Likelihood
log p(X; θ) = Σ_{i=1}^n [⟨φ(x_i), θ⟩ − g(θ)]
SLIDE 129 Maximum Likelihood
- Negative log-likelihood: −log p(X; θ) = Σ_{i=1}^n [g(θ) − ⟨φ(x_i), θ⟩]
- Taking derivatives:
  −∂_θ log p(X; θ) = n [E[φ(x)] − (1/n) Σ_{i=1}^n φ(x_i)]
  (mean under the model vs. empirical average)
- We pick the parameter such that the distribution matches the empirical average.
SLIDE 130 Conjugate Priors
- Unless we have lots of data, estimates are weak
- Usually we have an idea of what to expect - we might even have 'seen' such data before
- Solution: add 'fake' observations
  p(θ) ∝ p(X_fake|θ), hence p(θ|X) ∝ p(X|θ) p(X_fake|θ) = p(X ∪ X_fake|θ)
- Inference (generalized Laplace smoothing): with posterior p(θ|X) ∝ p(X|θ) · p(θ), the sufficient statistics become
  (1/n) Σ_{i=1}^n φ(x_i) → (1/(n + m)) Σ_{i=1}^n φ(x_i) + (m/(n + m)) μ_0
  with fake mean μ_0 and fake count m
SLIDE 131 Example: Gaussian Estimation
- Sufficient statistics: x, x²
- Mean and variance given by μ = E_x[x] and σ² = E_x[x²] − (E_x[x])²
- Maximum Likelihood Estimate:
  μ̂ = (1/n) Σ_{i=1}^n x_i and σ̂² = (1/n) Σ_{i=1}^n x_i² − μ̂²
- Maximum a Posteriori Estimate (n₀ fake observations act as a smoother):
  μ̂ = (1/(n + n₀)) Σ_{i=1}^n x_i and σ̂² = (1/(n + n₀)) Σ_{i=1}^n x_i² + (n₀/(n + n₀)) · 1 − μ̂²
SLIDE 132 Collapsing
- With p(θ) ∝ p(X_fake|θ) we can integrate out θ:
  p(x|X) = ∫ p(x|θ) p(θ|X) dθ ∝ ∫ p(x|θ) p(X|θ) p(X_fake|θ) dθ = ∫ p({x} ∪ X ∪ X_fake|θ) dθ
- Hence we know how to compute the normalization - look up the closed form expansions for the conjugate pairs:
  (Beta, binomial), (Dirichlet, multinomial), (Gamma, Poisson), (Wishart, Gauss)
- http://en.wikipedia.org/wiki/Exponential_family
SLIDES 133-135 Conjugate Prior in action
- Discrete distribution; tossing a die
- p(x = i) = n_i/n becomes p(x = i) = (n_i + m_i)/(n + m) with m_i = m · [μ_0]_i

  Outcome        1     2     3     4     5     6
  Counts         3     6     2     1     4     4
  MLE            0.15  0.30  0.10  0.05  0.20  0.20
  MAP (m = 6)    0.15  0.27  0.12  0.08  0.19  0.19
  MAP (m = 100)  0.16  0.19  0.16  0.15  0.17  0.17

- Rule of thumb: you need about 10 data points (or prior observations) per parameter
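The table rows follow from one line of smoothing arithmetic; a sketch that reproduces them (a uniform fake mean [μ_0]_i = 1/6 is assumed):

counts = [3, 6, 2, 1, 4, 4]
n = sum(counts)                 # 20 tosses
for m in (0, 6, 100):           # m = 0 recovers the MLE row
    probs = [(c + m / 6) / (n + m) for c in counts]
    print(m, [round(p, 2) for p in probs])
# 0   -> [0.15, 0.3, 0.1, 0.05, 0.2, 0.2]
# 6   -> [0.15, 0.27, 0.12, 0.08, 0.19, 0.19]
# 100 -> [0.16, 0.19, 0.16, 0.15, 0.17, 0.17]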
SLIDE 136 Honest die (figures: MLE vs. MAP estimates)
SLIDE 137 Tainted die (figures: MLE vs. MAP estimates)
SLIDE 138 Priors (part deux)
- Parameter smoothing via a prior on θ: p(θ) ∝ exp(−λ‖θ‖_1) or p(θ) ∝ exp(−λ‖θ‖_2²)
- Posterior:
  p(θ|X) ∝ ∏_{i=1}^m p(x_i|θ) p(θ) ∝ exp(Σ_{i=1}^m ⟨φ(x_i), θ⟩ − m g(θ) − (1/(2σ²))‖θ‖_2²)
- Convex optimization problem (MAP estimation):
  minimize_θ  g(θ) − ⟨(1/m) Σ_{i=1}^m φ(x_i), θ⟩ + (1/(2mσ²)) ‖θ‖_2²
SLIDE 139 Statistics
- Probabilities
  - Bayes rule, dependence, independence, conditional probabilities
  - Priors, Naive Bayes classifier
- Tail bounds
  - Chernoff, Hoeffding, Chebyshev, Gaussian
  - A/B testing
- Kernel density estimation
  - Parzen windows, nearest neighbors, Watson-Nadaraya estimator
- Exponential families
  - Gaussian, multinomial, Poisson
  - Conjugate distributions and smoothing, integrating out
SLIDE 140 Further reading
- Manuscript (book chapters 1 and 2)
  http://alex.smola.org/teaching/berkeley2012/slides/chapter1_2.pdf