

Scalable Machine Learning — 2. Statistics
Alex Smola, Yahoo! Research and ANU
http://alex.smola.org/teaching/berkeley2012 (Stat 260, SP 12; comic: xkcd.com)

2. Statistics — essential tools for data analysis: statistics, probabilities, Bayes rule, ...


  1. Example — a raw email as the classifier sees it. The slide shows the complete delivery header of one message (Delivered-To, Received, Return-Path, Received-SPF, Authentication-Results, DKIM-Signature, MIME-Version, Sender, Date: Tue, 3 Jan 2012 14:17:47 -0800, Message-ID, Subject: CS 281B. Advanced Topics in Learning and Decision Making, From: Tim Althoff <althoff@eecs.berkeley.edu>), annotated with the features one can extract from it: • date • time • recipient path • IP number • sender • encoding • many more features. (This is a gross simplification.)

  2. Recall — MapReduce
• 1000s of (faulty) machines
• Lots of jobs are mostly embarrassingly parallel (except for a sorting/transpose phase)
• Functional programming origins:
  • Map(key, value) processes each (key, value) pair and outputs a new (key, value) pair
  • Reduce(key, value) reduces all instances with the same key to an aggregate
• Example — extremely naive wordcount (a single-process sketch follows below):
  • Map(docID, document): for each document emit many (wordID, count) pairs
  • Reduce(wordID, count): sum over all counts for a given wordID and emit (wordID, aggregate)
(from Ramakrishnan, Sakrejda, Canon, DoE 2011)

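To make the wordcount example concrete, here is a minimal single-process sketch in Python of the Map and Reduce steps described above. The function names, the toy corpus, and the in-memory "shuffle" are illustrative; a real MapReduce job would shard documents across machines and let the framework group keys.

```python
from collections import defaultdict

def map_doc(doc_id, document):
    """Map(docID, document): emit (wordID, count) pairs for one document."""
    counts = defaultdict(int)
    for word in document.split():
        counts[word] += 1
    return list(counts.items())

def reduce_word(word_id, counts):
    """Reduce(wordID, counts): sum all counts for a given wordID."""
    return word_id, sum(counts)

docs = {1: "the cat sat on the mat", 2: "the dog sat"}   # toy stand-in for the input shards

grouped = defaultdict(list)                              # the sorting/transpose phase
for doc_id, document in docs.items():
    for word, count in map_doc(doc_id, document):
        grouped[word].append(count)

print(dict(reduce_word(w, c) for w, c in grouped.items()))
# {'the': 3, 'cat': 1, 'sat': 2, 'on': 1, 'mat': 1, 'dog': 1}
```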

  4. Naive NaiveBayes Classifier
• Two classes (spam/ham)
• Binary features (e.g. presence of $$$, viagra)
• Simplistic algorithm:
  • Count occurrences of each feature for spam/ham
  • Count the number of spam/ham mails
• Feature probability and spam probability:
$$p(x_i = \text{TRUE} \mid y) = \frac{n(i, y)}{n(y)} \qquad p(y) = \frac{n(y)}{n}$$
$$p(y \mid x) \propto \frac{n(y)}{n} \prod_{i:\, x_i = \text{TRUE}} \frac{n(i, y)}{n(y)} \prod_{i:\, x_i = \text{FALSE}} \frac{n(y) - n(i, y)}{n(y)}$$

  5. Naive NaiveBayes Classifier
• What if n(i, y) = n(y)? What if n(i, y) = 0?
$$p(y \mid x) \propto \frac{n(y)}{n} \prod_{i:\, x_i = \text{TRUE}} \frac{n(i, y)}{n(y)} \prod_{i:\, x_i = \text{FALSE}} \frac{n(y) - n(i, y)}{n(y)}$$


  7. Simple Algorithm
• For each document (x, y) do:
  • Aggregate the label counts for y
  • For each feature x_i in x: aggregate the statistics for (x_i, y) for each y
• For each y estimate the distribution p(y)
• For each (x_i, y) pair estimate the distribution p(x_i | y), e.g. Parzen windows, exponential family (Gauss, Laplace, Poisson, ...), mixture
• Given a new instance compute
$$p(y \mid x) \propto p(y) \prod_j p(x_j \mid y)$$

  8. Simple Algorithm
• For each document (x, y) do (one pass over all data):
  • Aggregate the label counts for y
  • For each feature x_i in x: aggregate the statistics for (x_i, y) for each y
• For each y estimate the distribution p(y) (trivially parallel)
• For each (x_i, y) pair estimate the distribution p(x_i | y), e.g. Parzen windows, exponential family (Gauss, Laplace, Poisson, ...), mixture
• Given a new instance compute
$$p(y \mid x) \propto p(y) \prod_j p(x_j \mid y)$$
(a counting sketch follows below)
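A minimal sketch of the counting pass and the prediction rule above, assuming binary (present/absent) features. The `alpha` smoothing constant is an addition, not part of the slide; it is one standard way to sidestep the n(i, y) = 0 and n(i, y) = n(y) cases raised earlier.

```python
import math
from collections import defaultdict

def train(docs, alpha=1.0):
    """One pass over (features, label) pairs; collects n(y) and n(i, y).
    alpha is an assumed add-one style smoothing constant, not from the slides."""
    n_y, n_iy = defaultdict(int), defaultdict(int)
    for features, y in docs:
        n_y[y] += 1
        for i in features:                # features = set of indices with x_i = TRUE
            n_iy[(i, y)] += 1
    return n_y, n_iy, alpha

def log_posterior(features, y, n_y, n_iy, alpha, vocab):
    """log p(y) + sum_j log p(x_j | y), with p(x_i=TRUE|y) = (n(i,y)+alpha)/(n(y)+2*alpha)."""
    lp = math.log(n_y[y] / sum(n_y.values()))
    for i in vocab:
        p_true = (n_iy[(i, y)] + alpha) / (n_y[y] + 2 * alpha)
        lp += math.log(p_true if i in features else 1.0 - p_true)
    return lp

docs = [({"$$$", "viagra"}, "spam"), ({"meeting"}, "ham"), ({"$$$"}, "spam")]
vocab = {"$$$", "viagra", "meeting"}
n_y, n_iy, alpha = train(docs)
scores = {y: log_posterior({"$$$"}, y, n_y, n_iy, alpha, vocab) for y in n_y}
print(max(scores, key=scores.get))        # expected: 'spam'
```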

  9. MapReduce Algorithm
• Map(document (x, y)):
  • For each feature x_i in x aggregate the statistics for (x_i, y) for each y
  • Send the statistics (key = (x_i, y), value = counts) to the reducer
• Reduce(x_i, y):
  • Aggregate over all messages from the mappers
  • Estimate the distribution p(x_i | y), e.g. Parzen windows, exponential family (Gauss, Laplace, Poisson, ...), mixture
  • Send the coordinate-wise model to global storage
• Given a new instance compute
$$p(y \mid x) \propto p(y) \prod_j p(x_j \mid y)$$

  10. MapReduce Algorithm
• Map(document (x, y)) — local per chunkserver:
  • For each feature x_i in x aggregate the statistics for (x_i, y) for each y
  • Send the statistics (key = (x_i, y), value = counts) to the reducer — only the aggregates are needed
• Reduce(x_i, y):
  • Aggregate over all messages from the mappers
  • Estimate the distribution p(x_i | y), e.g. Parzen windows, exponential family (Gauss, Laplace, Poisson, ...), mixture
  • Send the coordinate-wise model to global storage
• Given a new instance compute
$$p(y \mid x) \propto p(y) \prod_j p(x_j \mid y)$$
(a mapper/reducer sketch follows below)
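The same counting phrased as map and reduce steps, as a rough in-process sketch; the reserved `__label__` key for the label counts is just an illustrative convention, and a real job would route keys to reducers instead of grouping everything in one dictionary.

```python
from collections import defaultdict

def nb_map(features, y):
    """Map(document (x, y)): emit one count per observed (feature, label) pair,
    plus a label-count record under a reserved key."""
    yield ("__label__", y), 1
    for i in features:
        yield (i, y), 1

def nb_reduce(key, values):
    """Reduce((x_i, y), counts): aggregate the sufficient statistics for one coordinate."""
    return key, sum(values)

docs = [({"$$$", "viagra"}, "spam"), ({"meeting"}, "ham"), ({"$$$"}, "spam")]

grouped = defaultdict(list)
for features, y in docs:                     # each mapper handles its local chunk
    for key, value in nb_map(features, y):
        grouped[key].append(value)

model = dict(nb_reduce(k, v) for k, v in grouped.items())
print(model)                                 # e.g. {('__label__', 'spam'): 2, ('$$$', 'spam'): 2, ...}
```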

  11. Estimating Probabilities

  12. Binomial Distribution
• Two outcomes (head, tail); (0, 1)
• Data likelihood $p(X; \pi) = \pi^{n_1} (1 - \pi)^{n_0}$
• Maximum likelihood estimation:
  • Constrained optimization problem since $\pi \in [0, 1]$
  • Incorporate the constraint via $p(x; \theta) = \dfrac{e^{x\theta}}{1 + e^{\theta}}$
  • Taking derivatives yields
$$\theta = \log \frac{n_1}{n_0} \iff p(x = 1) = \frac{n_1}{n_0 + n_1}$$
(checked numerically below)
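A quick numerical check of the closed form above, using nothing beyond the slide's formulas: the log-odds of the counts give θ, and the sigmoid of θ gives back n_1/(n_0 + n_1). The counts are made up for illustration.

```python
import math

n0, n1 = 37, 63                            # illustrative counts of 0s and 1s
theta = math.log(n1 / n0)                  # maximum likelihood estimate of theta
p_one = math.exp(theta) / (1 + math.exp(theta))
print(theta, p_one, n1 / (n0 + n1))        # p_one equals n1/(n0+n1) = 0.63
```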


  14. ... in detail ...
$$p(X; \theta) = \prod_{i=1}^n p(x_i; \theta) = \prod_{i=1}^n \frac{e^{\theta x_i}}{1 + e^{\theta}}$$
$$\Rightarrow\ \log p(X; \theta) = \theta \sum_{i=1}^n x_i - n \log\left(1 + e^{\theta}\right)$$
$$\Rightarrow\ \partial_\theta \log p(X; \theta) = \sum_{i=1}^n x_i - n\, \frac{e^{\theta}}{1 + e^{\theta}}$$
$$\iff\ \frac{1}{n} \sum_{i=1}^n x_i = \frac{e^{\theta}}{1 + e^{\theta}} = p(x = 1)$$
The left-hand side is the empirical probability of x = 1.

  15. Discrete Distribution
• n outcomes (e.g. USA, Canada, India, UK, NZ)
• Data likelihood $p(X; \pi) = \prod_i \pi_i^{n_i}$
• Maximum likelihood estimation:
  • Constrained optimization problem ... or ...
  • Incorporate the constraint via $p(x; \theta) = \dfrac{\exp \theta_x}{\sum_{x'} \exp \theta_{x'}}$
  • Taking derivatives yields
$$\theta_i = \log \frac{n_i}{\sum_j n_j} \iff p(x = i) = \frac{n_i}{\sum_j n_j}$$

  16. Tossing a Dice (figure: empirical frequencies of the six outcomes after 12, 24, 60, and 120 tosses)


  18. Key Questions
• Do empirical averages converge?
  • Probabilities
  • Means / moments
• Rate of convergence and limit distribution
• Worst case guarantees
• Using prior knowledge
Applications: drug testing, semiconductor fabs, computational advertising, user interface design, ...

  19. 2.2 Tail Bounds — Chebyshev, Chernoff, Hoeffding

  20. Expectations
• Random variable x with probability measure p
• Expected value of f(x):
$$\mathbf{E}[f(x)] = \int f(x)\, dp(x)$$
• Special case — discrete probability mass (the same trick works for intervals):
$$\Pr\{x = c\} = \mathbf{E}[\mathbf{1}\{x = c\}] = \int \mathbf{1}\{x = c\}\, dp(x)$$
• Draw x_i identically and independently from p
• Empirical averages:
$$\mathbf{E}_{\mathrm{emp}}[f(x)] = \frac{1}{n} \sum_{i=1}^n f(x_i) \qquad \Pr\nolimits_{\mathrm{emp}}\{x = c\} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i = c\}$$

  21. Deviations
• A gambler rolls a dice 100 times:
$$\hat{P}(X = 6) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i = 6\}$$
• '6' only occurs 11 times; the fair number is 16.7. IS THE DICE TAINTED?
• Probability of seeing '6' at most 11 times:
$$\Pr(X \le 11) = \sum_{i=0}^{11} p(i) = \sum_{i=0}^{11} \binom{100}{i} \left(\frac{1}{6}\right)^i \left(\frac{5}{6}\right)^{100 - i} \approx 7.0\%$$
It's probably OK ... can we develop a general theory?

  22. Deviations
• A gambler rolls a dice 100 times:
$$\hat{P}(X = 6) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i = 6\}$$
• '6' only occurs 11 times; the fair number is 16.7. IS THE DICE TAINTED? (The same question arises for: is the ad campaign working, is the new page layout better, is the drug working.)
• Probability of seeing '6' at most 11 times:
$$\Pr(X \le 11) = \sum_{i=0}^{11} p(i) = \sum_{i=0}^{11} \binom{100}{i} \left(\frac{1}{6}\right)^i \left(\frac{5}{6}\right)^{100 - i} \approx 7.0\%$$
It's probably OK ... can we develop a general theory? (The sum is computed below.)
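The ≈7% figure can be reproduced directly from the binomial sum above; a small sketch using only the Python standard library:

```python
import math

p, n = 1 / 6, 100
prob = sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(12))
print(f"Pr(X <= 11) = {prob:.3f}")         # approximately 0.07
```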

  23. Empirical average for a dice (figure: the running average, on a scale of 1 to 6, plotted against the number of tosses from 10^1 to 10^3). How quickly does it converge?

  24. Law of Large Numbers
• Random variables x_i with mean $\mu = \mathbf{E}[x_i]$
• Empirical average $\hat{\mu}_n := \frac{1}{n} \sum_{i=1}^n x_i$
• Weak law of large numbers (this means convergence in probability):
$$\lim_{n \to \infty} \Pr(|\hat{\mu}_n - \mu| \le \epsilon) = 1 \text{ for any } \epsilon > 0$$
• Strong law of large numbers (almost sure convergence):
$$\Pr\left(\lim_{n \to \infty} \hat{\mu}_n = \mu\right) = 1$$

  25. Empirical average for a dice (figure: five sample traces of the running average over 10^1 to 10^3 tosses; a simulation sketch follows below)
• The upper and lower bounds are $\mu \pm \sqrt{\mathrm{Var}(x)/n}$
• This is an example of the central limit theorem
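A simulation sketch of the plot described above: five running averages of die rolls, plus the half-width of the μ ± sqrt(Var(x)/n) envelope at a couple of sample sizes. The seed and the number of rolls are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_max, mu, var = 1000, 3.5, 35 / 12              # mean and variance of a fair die
ns = np.arange(1, n_max + 1)

for trace in range(5):                           # five sample traces of the running average
    rolls = rng.integers(1, 7, size=n_max)       # uniform rolls in {1, ..., 6}
    running_avg = np.cumsum(rolls) / ns
    print(f"trace {trace}: average after {n_max} rolls = {running_avg[-1]:.3f}")

half_width = np.sqrt(var / ns)                   # the mu +/- sqrt(Var(x)/n) band
print(f"mu = {mu}, band half-width at n=100: {half_width[99]:.3f}, at n=1000: {half_width[-1]:.3f}")
```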

  26. Central Limit Theorem
• Independent random variables x_i with mean μ_i and standard deviation σ_i
• The random variable
$$z_n := \left[\sum_{i=1}^n \sigma_i^2\right]^{-\frac{1}{2}} \left[\sum_{i=1}^n (x_i - \mu_i)\right]$$
converges to a normal distribution $\mathcal{N}(0, 1)$
• Special case — IID random variables and their average:
$$\frac{\sqrt{n}}{\sigma} \left[\frac{1}{n} \sum_{i=1}^n x_i - \mu\right] \to \mathcal{N}(0, 1)$$
• Convergence rate $O\!\left(n^{-\frac{1}{2}}\right)$ (a small simulation follows below)
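A small simulation of the IID special case: standardized averages of exponential draws have roughly zero mean and unit variance regardless of n. The distribution, seed, and sample sizes are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = sigma = 1.0                                     # an Exponential(1) variable has mean 1 and std 1
for n in (1, 10, 100, 1000):
    x = rng.exponential(scale=1.0, size=(20000, n))
    z = np.sqrt(n) / sigma * (x.mean(axis=1) - mu)   # sqrt(n)/sigma * (sample mean - mu)
    print(f"n={n:5d}  mean(z)={z.mean():+.3f}  var(z)={z.var():.3f}")
```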


  28. Slutsky's Theorem (continuous mapping theorem)
• X_i and Y_i are sequences of random variables
• X_i has the random variable X as its limit
• Y_i has the constant c as its limit
• g(x, y) is continuous at (x, c) for all x
• Then g(X_i, Y_i) converges in distribution to g(X, c)

  29. Delta Method
• Random variable X_n convergent to b:
$$a_n^{-2} (X_n - b) \to \mathcal{N}(0, \Sigma) \text{ with } a_n^2 \to 0 \text{ for } n \to \infty$$
• g is a continuously differentiable function at b
• Then g(X_n) inherits the convergence properties:
$$a_n^{-2} \left(g(X_n) - g(b)\right) \to \mathcal{N}\!\left(0, [\nabla_x g(b)]\, \Sigma\, [\nabla_x g(b)]^\top\right)$$
• Proof: use a Taylor expansion for g(X_n) − g(b),
$$a_n^{-2} \left[g(X_n) - g(b)\right] = [\nabla_x g(\xi_n)]^\top a_n^{-2} (X_n - b)$$
where ξ_n lies on the line segment [X_n, b]. By Slutsky's theorem $\nabla_x g(\xi_n)$ converges to $\nabla_x g(b)$, hence g(X_n) is asymptotically normal.

  30. Tools for the proof

  31. Fourier Transform
• Fourier transform relations:
$$F[f](\omega) := (2\pi)^{-\frac{d}{2}} \int_{\mathbb{R}^d} f(x) \exp(-i \langle \omega, x \rangle)\, dx$$
$$F^{-1}[g](x) := (2\pi)^{-\frac{d}{2}} \int_{\mathbb{R}^d} g(\omega) \exp(i \langle \omega, x \rangle)\, d\omega$$
• Useful identities:
  • Identity: $F^{-1} \circ F = F \circ F^{-1} = \mathrm{Id}$
  • Derivative: $F[\partial_x f] = -i \omega F[f]$
  • Convolution (also holds for the inverse transform): $F[f \star g] = (2\pi)^{\frac{d}{2}}\, F[f] \cdot F[g]$

  32. The Characteristic Function Method
• Characteristic function:
$$\phi_X(\omega) := F^{-1}[p(x)] = \int \exp(i \langle \omega, x \rangle)\, dp(x)$$
• For X and Y independent:
  • The distribution of the sum is a convolution:
$$p_{X+Y}(z) = \int p_X(z - y)\, p_Y(y)\, dy = (p_X \star p_Y)(z)$$
  • The characteristic function is a product:
$$\phi_{X+Y}(\omega) = \phi_X(\omega) \cdot \phi_Y(\omega)$$
• Proof — plug in the definition of the Fourier transform
• The characteristic function is unique

  33. Proof — Weak Law of Large Numbers
• Require that the expectation exists
• Taylor expansion of the exponential:
$$\exp(iwx) = 1 + i \langle w, x \rangle + o(|w|) \quad\text{and hence}\quad \phi_X(\omega) = 1 + iw\, \mathbf{E}_X[x] + o(|w|)$$
(we need to assume that we can bound the tail)
• An average of random variables is a convolution, so its characteristic function is a product:
$$\phi_{\hat{\mu}_m}(\omega) = \left(1 + \frac{i}{m} w \mu + o(m^{-1} |w|)\right)^m$$
(the higher order terms vanish)
• The limit is the constant distribution at the mean:
$$\phi_{\hat{\mu}_m}(\omega) \to \exp(i \omega \mu) = 1 + i \omega \mu + \ldots$$

  34. Warning
• Moments may not always exist
• Cauchy distribution:
$$p(x) = \frac{1}{\pi} \frac{1}{1 + x^2}$$
• For the mean to exist the following integral would have to converge:
$$\int |x|\, dp(x) \ge \frac{2}{\pi} \int_1^\infty \frac{x}{1 + x^2}\, dx \ge \frac{1}{\pi} \int_1^\infty \frac{1}{x}\, dx = \infty$$
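A sketch of the warning in practice: running means of Cauchy samples never settle down, unlike the die-roll averages earlier. Seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_cauchy(100000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in (100, 1000, 10000, 100000):              # the running mean keeps jumping around
    print(f"n={n:6d}  running mean = {running_mean[n - 1]:+.2f}")
```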

  35. Proof — Central Limit Theorem
• Require that the second order moment exists (we assume they're all identical WLOG)
• Characteristic function:
$$\exp(iwx) = 1 + iwx - \tfrac{1}{2} w^2 x^2 + o(|w|^2) \quad\text{and hence}\quad \phi_X(\omega) = 1 + iw\, \mathbf{E}_X[x] - \tfrac{1}{2} w^2\, \mathrm{var}_X[x] + o(|w|^2)$$
• Subtract out the mean (centering):
$$z_n := \left[\sum_{i=1}^n \sigma_i^2\right]^{-\frac{1}{2}} \left[\sum_{i=1}^n (x_i - \mu_i)\right]$$
• Then
$$\phi_{Z_m}(\omega) = \left(1 - \frac{1}{2m} w^2 + o(m^{-1} |w|^2)\right)^m \to \exp\left(-\tfrac{1}{2} w^2\right) \text{ for } m \to \infty$$
This is the Fourier transform of a normal distribution.

  36. Central Limit Theorem in Practice (figure: five panels of densities per row, unscaled in the top row and rescaled in the bottom row)

  37. Finite sample tail bounds

  38. Simple tail bounds
• Gauss-Markov inequality: for a nonnegative random variable X with mean μ,
$$\Pr(X \ge \epsilon) \le \mu / \epsilon$$
Proof — decompose the expectation:
$$\Pr(X \ge \epsilon) = \int_\epsilon^\infty dp(x) \le \int_\epsilon^\infty \frac{x}{\epsilon}\, dp(x) \le \epsilon^{-1} \int_0^\infty x\, dp(x) = \mu / \epsilon$$
• Chebyshev inequality: for a random variable X with mean μ and variance σ², the empirical average over m draws satisfies
$$\Pr(|\hat{\mu}_m - \mu| > \epsilon) \le \sigma^2 m^{-1} \epsilon^{-2} \quad\text{or equivalently}\quad \epsilon \le \sigma / \sqrt{\delta m} \text{ with confidence } 1 - \delta$$
Proof — applying Gauss-Markov to $Y = (\hat{\mu}_m - \mu)^2$ with threshold $\epsilon^2$ yields the result.
(Both bounds are checked empirically below.)
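A quick empirical check of both bounds, assuming Exponential(1) samples (mean 1, variance 1); the observed frequencies stay below the guarantees. The thresholds and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Gauss-Markov on a single nonnegative draw: Pr(X >= 3) <= mu / 3
x = rng.exponential(size=100000)
print(np.mean(x >= 3.0), "<=", 1.0 / 3.0)

# Chebyshev on averages of m = 50 draws: Pr(|mean - mu| > eps) <= sigma^2 / (m * eps^2)
m, eps = 50, 0.3
means = rng.exponential(size=(100000, m)).mean(axis=1)
print(np.mean(np.abs(means - 1.0) > eps), "<=", 1.0 / (m * eps**2))
```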

  39. Scaling behavior
• Gauss-Markov: $\epsilon \le \mu / \delta$ — scales properly in μ but is expensive in δ
• Chebyshev: $\epsilon \le \sigma / \sqrt{m \delta}$ — proper scaling in σ but still bad in δ
Can we get logarithmic scaling in δ?

  40. Chernoff bound
• KL-divergence variant of the Chernoff bound:
$$K(p, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$
• For n independent tosses of a coin with bias p:
$$\Pr\left(\sum_i x_i \ge nq\right) \le \exp(-n K(q, p)) \le \exp\left(-2n (p - q)^2\right)$$
(the second inequality is Pinsker's inequality; both exponents are compared numerically below)
• Proof: w.l.o.g. q > p; for every k ≥ qn,
$$\frac{\Pr\{\sum_i x_i = k \mid q\}}{\Pr\{\sum_i x_i = k \mid p\}} = \frac{q^k (1 - q)^{n - k}}{p^k (1 - p)^{n - k}} \ge \frac{q^{qn} (1 - q)^{n - qn}}{p^{qn} (1 - p)^{n - qn}} = \exp(n K(q, p))$$
and hence
$$\sum_{k \ge nq} \Pr\left\{\sum_i x_i = k \,\middle|\, p\right\} \le \exp(-n K(q, p)) \sum_{k \ge nq} \Pr\left\{\sum_i x_i = k \,\middle|\, q\right\} \le \exp(-n K(q, p))$$
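The two exponents compared numerically; p, q, and n are illustrative values, and the KL form is indeed the tighter of the two, as Pinsker's inequality promises.

```python
import math

def kl(q, p):
    """K(q, p) = q log(q/p) + (1 - q) log((1 - q)/(1 - p))."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

p, q, n = 1 / 6, 0.25, 100                        # true bias p, threshold q > p
chernoff = math.exp(-n * kl(q, p))
pinsker = math.exp(-2 * n * (p - q) ** 2)
print(f"KL bound {chernoff:.3e} <= Pinsker-style bound {pinsker:.3e}")
```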

  41. McDiarmid Inequality
• Independent random variables X_i and a function $f : \mathcal{X}^m \to \mathbb{R}$
• Deviation from the expected value:
$$\Pr\left(\left|f(x_1, \ldots, x_m) - \mathbf{E}_{X_1, \ldots, X_m}[f(x_1, \ldots, x_m)]\right| > \epsilon\right) \le 2 \exp\left(-2 \epsilon^2 C^{-2}\right)$$
where $C^2 = \sum_{i=1}^m c_i^2$ and
$$\left|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)\right| \le c_i$$
• Hoeffding's theorem: if f is the average and the X_i have bounded range c, then
$$\Pr(|\hat{\mu}_m - \mu| > \epsilon) \le 2 \exp\left(\frac{-2 m \epsilon^2}{c^2}\right)$$

  42. Scaling behavior
• Hoeffding:
$$\delta := \Pr(|\hat{\mu}_m - \mu| > \epsilon) \le 2 \exp\left(\frac{-2 m \epsilon^2}{c^2}\right)
\;\Rightarrow\; \log(\delta / 2) \le \frac{-2 m \epsilon^2}{c^2}
\;\Rightarrow\; \epsilon \le c \sqrt{\frac{\log 2 - \log \delta}{2m}}$$
This helps when we need to combine several tail bounds since we only pay logarithmically for their combination. (A small calculator for this bound follows below.)
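The ε(m, δ) expression above as a tiny calculator, for bounded range c = 1 and a few sample sizes; a sketch only.

```python
import math

def hoeffding_eps(m, delta, c=1.0):
    """Deviation guaranteed with probability 1 - delta after m samples of range c."""
    return c * math.sqrt((math.log(2.0) - math.log(delta)) / (2.0 * m))

for m in (100, 1000, 10000):
    print(m, round(hoeffding_eps(m, delta=0.05), 4))
# epsilon shrinks like 1/sqrt(m) and grows only logarithmically as delta decreases
```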

  43. More tail bounds
• Higher order moments
• Bernstein inequality (needs a variance bound): if the random variables X_i are upper-bounded by M, then
$$\Pr(\hat{\mu}_m - \mu \ge t) \le \exp\left(-\frac{t^2 / 2}{\sum_i \mathbf{E}[X_i^2] + M t / 3}\right)$$
• Proof via the Gauss-Markov inequality applied to exponential sums (hence an exponential inequality)
• See also Azuma, Bennett, Chernoff, ...
• Absolute / relative error bounds
• Bounds for (weakly) dependent random variables

  44. Tail bounds in practice

  45. A/B testing
• Two possible webpage layouts — which layout is better?
• Experiment:
  • Half of the users see A
  • The other half sees design B
• How many trials do we need to decide which page attracts more clicks?
Assume that the click probabilities are p(A) = 0.1 and p(B) = 0.11 respectively and that p(A) is known.

  46. Chebyshev Inequality
• We need to resolve a deviation of 0.01
• The mean is p(B) = 0.11 (we don't know this yet)
• We want a failure probability of 5%
• With no prior knowledge we can only bound the variance by σ² = 0.25:
$$m \le \frac{\sigma^2}{\epsilon^2 \delta} = \frac{0.25}{0.01^2 \cdot 0.05} = 50{,}000$$
• If we know that the click probability is at most 0.15 we can bound the variance by 0.15 · 0.85 = 0.1275. This requires at most 25,500 users.

  47. Hoeffding's bound
• The random variable has bounded range [0, 1] (click or no click), hence c = 1
• Solve Hoeffding's inequality for m:
$$m \le \frac{-c^2 \log(\delta / 2)}{2 \epsilon^2} = \frac{-1 \cdot \log 0.025}{2 \cdot 0.01^2} < 18{,}445$$
This is slightly better than Chebyshev.

  48. Normal Approximation (Central Limit Theorem)
• Use asymptotic normality
• The Gaussian interval containing 0.95 of the probability mass,
$$\int_{\mu - \epsilon}^{\mu + \epsilon} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) dx = 0.95,$$
is given by ε = 2.96σ.
• Use the variance bound of 0.1275 (see Chebyshev):
$$m \le \frac{2.96^2 \sigma^2}{\epsilon^2} = \frac{2.96^2 \cdot 0.1275}{0.01^2} \le 11{,}172$$
Same rate as the Hoeffding bound! Better bounds come from bounding the variance. (The three sample sizes are reproduced below.)
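The three sample-size calculations of the last slides reproduced as a sketch; the 2.96σ interval width is taken from the slide as given.

```python
import math

eps, delta = 0.01, 0.05

# Chebyshev: m <= sigma^2 / (eps^2 * delta)
print("Chebyshev, sigma^2 = 0.25   :", 0.25 / (eps**2 * delta))              # 50,000
print("Chebyshev, sigma^2 = 0.1275 :", 0.1275 / (eps**2 * delta))            # 25,500

# Hoeffding: m <= -c^2 * log(delta/2) / (2 * eps^2), with c = 1
print("Hoeffding                   :", -math.log(delta / 2) / (2 * eps**2))  # ~18,444

# Normal approximation with the slide's interval width of 2.96 * sigma
print("Normal approximation        :", 2.96**2 * 0.1275 / eps**2)            # ~11,171
```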

  49. Beyond
• Many different layouts?
• A combinatorial strategy to generate them (aka the Thai Restaurant process)
• What if it depends on the user / time of day?
• Stateful users (e.g. query keywords in search)
• What if we have a good prior on the response (rather than a variance bound)?
• Explore/exploit, reinforcement learning, control (more details at the end of this class)

  50. 2.3 Kernel Density Estimation — Parzen

  51. Density Estimation
• For discrete bins (e.g. male/female; English/French/German/Spanish/Chinese) we get good uniform convergence.
• Applying the union bound and Hoeffding:
$$\Pr\left(\sup_{a \in A} |\hat{p}(a) - p(a)| \ge \epsilon\right) \le \sum_{a \in A} \Pr(|\hat{p}(a) - p(a)| \ge \epsilon) \le 2 |A| \exp\left(-2 m \epsilon^2\right)$$
(good news)
• Solving for the error at confidence 1 − δ:
$$\frac{\delta}{2|A|} \le \exp(-2 m \epsilon^2) \implies \epsilon \le \sqrt{\frac{\log(2|A|) - \log \delta}{2m}}$$

  52. Density Estimation (figure: a sample histogram and the underlying density on the range 40–110)
• A continuous domain means an infinite number of bins
• Curse of dimensionality:
  • 10 bins on [0, 1] is probably good
  • 10^10 bins on [0, 1]^10 require high accuracy in the estimate: the probability mass per cell also decreases by a factor of 10^10.

  53. Bin Counting


  56. Parzen Windows
• Naive approach: use the empirical density (delta distributions)
$$p_{\mathrm{emp}}(x) = \frac{1}{m} \sum_{i=1}^m \delta_{x_i}(x)$$
• This breaks if we see slightly different instances
• Kernel density estimate: smear out the empirical density with a nonnegative smoothing kernel $k_x(x')$ satisfying
$$\int_{\mathcal{X}} k_x(x')\, dx' = 1 \text{ for all } x$$

  57. Parzen Windows
• Density estimates:
$$p_{\mathrm{emp}}(x) = \frac{1}{m} \sum_{i=1}^m \delta_{x_i}(x) \qquad \hat{p}(x) = \frac{1}{m} \sum_{i=1}^m k_{x_i}(x)$$
• Smoothing kernels (figure: Gauss, Epanechnikov, Uniform, Laplace):
  • Gauss: $(2\pi)^{-\frac{1}{2}} e^{-x^2/2}$
  • Laplace: $\frac{1}{2} e^{-|x|}$
  • Epanechnikov: $\frac{3}{4} \max(0, 1 - x^2)$
  • Uniform: $\frac{1}{2} \chi_{[-1, 1]}(x)$
(a small KDE sketch follows below)
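A minimal Gaussian Parzen-window estimator following the formula above; the data and the bandwidth are illustrative stand-ins.

```python
import numpy as np

def parzen_kde(x_query, samples, r):
    """p_hat(x) = (1/m) * sum_i k_{x_i}(x) with a Gaussian kernel of width r."""
    diffs = (x_query[:, None] - samples[None, :]) / r
    kernel = np.exp(-0.5 * diffs**2) / (r * np.sqrt(2 * np.pi))
    return kernel.sum(axis=1) / len(samples)

rng = np.random.default_rng(4)
samples = rng.normal(loc=70.0, scale=10.0, size=200)   # stand-in for the data in the figures
grid = np.linspace(40.0, 110.0, 8)
print(np.round(parzen_kde(grid, samples, r=3.0), 4))
```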

  58. Size matters (figure: kernel density estimates of the same sample for bandwidths 0.3, 1, 3, and 10, next to the sample histogram and the underlying density)

  59. Size matters
• Kernel width r:
$$k_{x_i}(x) = r^{-d}\, h\!\left(\frac{x - x_i}{r}\right)$$
• Too narrow overfits
• Too wide smoothes everything towards a constant distribution
• How to choose?

  60. Smoothing


  63. Capacity Control

  64. Capacity control
• We need an automatic mechanism to select the scale r
• Overfitting:
  • Maximum likelihood will lead to r = 0 (smoothing kernels peak at the instances)
  • This is (typically) a set of measure 0
• Validation set: set aside data just for calibrating r
• Leave-one-out estimation: estimate the likelihood of each instance using all the other instances (a sketch follows below)
• Alternatives: use a prior on r; convergence analysis
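A sketch of leave-one-out calibration for the bandwidth r: score each candidate by the held-out log-likelihood of every point under the estimate built from the remaining points. The Gaussian kernel, the candidate grid, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def loo_log_likelihood(samples, r):
    """Sum over i of log p_hat_{-i}(x_i) for a Gaussian Parzen window of width r."""
    m = len(samples)
    diffs = (samples[:, None] - samples[None, :]) / r
    kernel = np.exp(-0.5 * diffs**2) / (r * np.sqrt(2 * np.pi))
    np.fill_diagonal(kernel, 0.0)                     # leave the point itself out
    p_loo = kernel.sum(axis=1) / (m - 1)
    return np.sum(np.log(p_loo + 1e-300))             # tiny constant guards against log(0)

rng = np.random.default_rng(5)
samples = rng.normal(loc=70.0, scale=10.0, size=200)
scores = {r: loo_log_likelihood(samples, r) for r in (0.3, 1.0, 3.0, 10.0)}
print(max(scores, key=scores.get), scores)            # the best r is neither tiny nor huge
```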
