Challenges of real-world data


  1. Advances in Privacy-Preserving Machine Learning
     Claire Monteleoni, Center for Computational Learning Systems, Columbia University

     Challenges of real-world data
     - We face an explosion in data from, e.g., Internet transactions, satellite measurements, environmental sensors, ...
     - Real-world data can be:
       - Vast (many examples)
       - High-dimensional
       - Noisy (incorrect/missing labels/features)
       - Sparse (the relevant subspace is low-dimensional)
       - Streaming, time-varying
       - Sensitive/private

     Machine learning
     - Goal: design algorithms to detect patterns in real-world data.
     - Given labeled data points, find a good classification rule that:
       - Describes the data
       - Generalizes well
     - E.g., linear separators.

     Principled ML for real-world data
     - We want efficient algorithms, with performance guarantees.
     - Learning with online constraints: algorithms for streaming, or time-varying, data.
     - Active learning: algorithms for settings in which unlabeled data is abundant and labels are difficult to obtain.
     - Privacy-preserving machine learning: algorithms to detect cumulative patterns in real databases, while maintaining the privacy of individuals.
     - New applications of machine learning, e.g., Climate Informatics: algorithms to detect patterns in climate data, to answer pressing questions.

  2. Privacy-preserving machine learning
     - Sensitive personal data is increasingly being digitally aggregated and stored.
     - Problem: how to maintain the privacy of individuals, when detecting patterns in cumulative, real-world data?
     - E.g.: disease studies and insurance risk; economics research and credit risk; analysis of social networks.

     Anonymization: not enough
     - Anonymization does not ensure privacy.
     - Attacks may be possible, e.g., with auxiliary information or structural information.
     - Privacy attacks:
       - [Narayanan & Shmatikov '08] identify Netflix users from anonymized records and IMDB.
       - [Backstrom, Dwork & Kleinberg '07] identify LiveJournal social relations from anonymized network topology and minimal local information.

     Related work
     - Data mining: algorithms, often lacking strong privacy guarantees; subject to various attacks.
     - Cryptography and information security: privacy guarantees, but machine learning less explored.
     - Learning theory: learning guarantees for algorithms that adhere to strong privacy protocols, but are not efficient.

     Related work (continued)
     - Data mining: k-anonymity [Sweeney '02], l-diversity [Machanavajjhala et al. '06], t-closeness [Li et al. '07]; each found privacy attacks on the previous, and all are subject to composition attacks [Ganta et al. '08].
     - Cryptography and information security: [Dwork, McSherry, Nissim & Smith, TCC 2006] introduce differential privacy and the sensitivity method; extensions in [Nissim et al. '07].
     - Learning theory: [Blum et al. '08] give a method to publish data that is differentially private under certain query types (can be computationally prohibitive); [Kasiviswanathan et al. '08] give an exponential-time (in dimension) algorithm to find classifiers that respect differential privacy.

  3. ε-differential privacy
     - [DMNS '06]: Given two databases D_1, D_2 that differ in one element, a random function f is ε-private if, for any output t:
         Pr[f(D_1) = t] ≤ (1 + ε) Pr[f(D_2) = t]
     - Idea: the effect of any one person's data on the output is low.

     The sensitivity method
     - [DMNS '06]: a method to produce an ε-private approximation to any function g of a database.
     - Sensitivity: for a function g, the sensitivity S(g) is the maximum change in g when one input element changes.
     - [DMNS '06]: add noise proportional to the sensitivity, and output
         f(D) = g(D) + Lap(0, S(g)/ε)

     Motivations and contributions
     - Goal: machine learning algorithms that maintain privacy yet output good classifiers.
       - Adapt canonical, widely-used machine learning algorithms.
       - Learning performance guarantees.
       - Efficient algorithms with good practical performance.
     - [Chaudhuri & Monteleoni, NIPS 2008]:
       - A new privacy-preserving technique: perturb the optimization problem, instead of perturbing the solution.
       - Applied both techniques to logistic regression, a canonical ML algorithm.
       - Proved learning performance guarantees that are significantly tighter for our new algorithm.
       - Encouraging results in simulation.

     Regularized logistic regression
     - We apply the sensitivity method of [DMNS '06] to regularized logistic regression, a canonical, widely-used algorithm for learning a linear separator.
     - Input: (x_1, y_1), ..., (x_n, y_n), with each x_i in R^d of norm at most 1 and each y_i in {-1, +1}.
     - Output:
         w* = argmin_w (λ/2) w^T w + (1/n) Σ_{i=1}^n log(1 + exp(-y_i w^T x_i))
     - Derived from the model p(y | x; w) = 1 / (1 + exp(-y w^T x)); the first term is the regularization.
     - w in R^d predicts SIGN(w^T x) for x in R^d.
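     To make the sensitivity method concrete, here is a minimal Python sketch of the add-noise step f(D) = g(D) + Lap(0, S(g)/ε), illustrated on a simple statistic. The statistic (a mean of values assumed to lie in [0, 1]), its sensitivity bound of 1/n, and the name private_mean are illustrative assumptions, not taken from the slides.

         import numpy as np

         def private_mean(data, epsilon, rng=None):
             """epsilon-private release of the mean of values assumed to lie in [0, 1]."""
             rng = rng or np.random.default_rng()
             n = len(data)
             g = float(np.mean(data))       # the non-private statistic g(D)
             sensitivity = 1.0 / n          # changing one entry in [0, 1] moves the mean by at most 1/n
             noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
             return g + noise               # f(D) = g(D) + Lap(0, S(g)/epsilon)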

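     The regularized logistic regression objective above can likewise be sketched in a few lines. This is a hedged illustration using numpy and scipy; the helper names reg_logistic_objective and train_lr, and the default value of the regularization weight, are assumptions for illustration rather than the authors' code.

         import numpy as np
         from scipy.optimize import minimize

         def reg_logistic_objective(w, X, y, lam):
             """(lam/2) w^T w + (1/n) sum_i log(1 + exp(-y_i w^T x_i))"""
             margins = y * (X @ w)
             return 0.5 * lam * (w @ w) + np.mean(np.log1p(np.exp(-margins)))

         def train_lr(X, y, lam=0.1):
             """Return w* minimizing the regularized logistic loss."""
             d = X.shape[1]
             return minimize(reg_logistic_objective, np.zeros(d), args=(X, y, lam)).x

     With a data matrix X of shape (n, d) whose rows have norm at most 1 and labels y in {-1, +1}, train_lr(X, y) returns a separator w*, and predictions are taken as SIGN(X @ w*).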
  4. Sensitivity method applied to LR
     - Lemma: the sensitivity of regularized logistic regression is 2/(nλ).
     - Algorithm 1 [Sensitivity-based PPLR]:
       1. Solve w = regularized logistic regression with parameter λ.
       2. Pick a vector h: pick |h| from Γ(d, 2/(nλε)), where the density of Γ(d, t) at x is proportional to x^{d-1} e^{-x/t}, and pick the direction of h uniformly at random.
       3. Output w + h.
     - Theorem 1: Algorithm 1 is ε-private.

     New method for PPML
     - A new privacy-preserving technique: perturb the optimization problem, instead of perturbing the solution.
       - No need to bound sensitivity (which may be difficult for other ML algorithms).
       - The noise does not depend on (the sensitivity of) the function to be learned.
       - Optimization happens after perturbation.
     - Application to regularized logistic regression, Algorithm 2 [New PPLR]:
       1. Pick a vector b: pick |b| from Γ(d, 2/ε), and pick the direction of b uniformly at random.
       2. Output:
          w* = argmin_w (λ/2) w^T w + (1/n) Σ_{i=1}^n log(1 + exp(-y_i w^T x_i)) + (1/n) b^T w

     Privacy of Algorithm 2
     - Theorem 2: Algorithm 2 is ε-private.
     - Remark: Algorithm 2 solves a convex program similar to standard regularized LR, so it has a similar running time.
     - Proof outline (Theorem 2): we want to show Pr[f(D_1) = w*] ≤ (1 + ε) Pr[f(D_2) = w*], for
         D_1 = {(x_1, y_1), ..., (x_{n-1}, y_{n-1}), (a, y)} and D_2 = {(x_1, y_1), ..., (x_{n-1}, y_{n-1}), (a', y')},
       with ||x_i|| ≤ 1 for all i and ||a||, ||a'|| ≤ 1.
       - Pr[f(D_1) = w*] = Pr[w* | x_1, ..., x_{n-1}, y_1, ..., y_{n-1}, x_n = a, y_n = y]
       - Pr[f(D_2) = w*] = Pr[w* | x_1, ..., x_{n-1}, y_1, ..., y_{n-1}, x_n = a', y_n = y']
       - We must bound the ratio
           Pr[w* | x_1, ..., y_{n-1}, x_n = a', y_n = y'] / Pr[w* | x_1, ..., y_{n-1}, x_n = a, y_n = y] = h(b_1) / h(b_2) = e^{-(ε/2)(||b_1|| - ||b_2||)},
         where b_1 is the unique value of b that yields w* on input D_1 (likewise b_2), and h(b_i) is the Γ density at b_i.
       - The b's are unique because both terms in the objective are differentiable everywhere.
       - Bound the right-hand side using the optimality of w* for both problems and the bounded norms.

     General PPML for a class of convex loss functions
     - Theorem 3: Given a database X = {x_1, ..., x_n}, to minimize functions of the form
         F(w) = G(w) + Σ_{i=1}^n l(w, x_i):
       if G(w) and l(w, x_i) are everywhere differentiable with continuous derivatives, G(w) is strongly convex, l(w, x_i) is convex, and ||∇_w l(w, x)|| ≤ κ for all x, then outputting
         w* = argmin_w G(w) + Σ_{i=1}^n l(w, x_i) + b^T w,
       where b = B·r with B drawn from Γ(d, 2κ/ε) and r a random unit vector, is ε-private.
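     Below is a hedged Python sketch of Algorithm 1 (output perturbation) and Algorithm 2 (objective perturbation) as the slides describe them: the noise norm is drawn from Γ(d, 2/(nλε)) for Algorithm 1 and Γ(d, 2/ε) for Algorithm 2, with a uniformly random direction. The optimizer, helper names, and code structure are illustrative assumptions, not the authors' implementation.

         import numpy as np
         from scipy.optimize import minimize

         def _objective(w, X, y, lam):
             # (lam/2) w^T w + (1/n) sum_i log(1 + exp(-y_i w^T x_i))
             return 0.5 * lam * (w @ w) + np.mean(np.log1p(np.exp(-y * (X @ w))))

         def _random_direction(d, rng):
             # uniformly random unit vector in R^d
             v = rng.standard_normal(d)
             return v / np.linalg.norm(v)

         def algorithm1_sensitivity_pplr(X, y, lam, epsilon, rng=None):
             """Output perturbation: solve regularized LR, then add noise h."""
             rng = rng or np.random.default_rng()
             n, d = X.shape
             w_star = minimize(_objective, np.zeros(d), args=(X, y, lam)).x    # step 1
             norm_h = rng.gamma(shape=d, scale=2.0 / (n * lam * epsilon))      # step 2: |h| ~ Gamma(d, 2/(n lam eps))
             return w_star + norm_h * _random_direction(d, rng)                # step 3: output w + h

         def algorithm2_new_pplr(X, y, lam, epsilon, rng=None):
             """Objective perturbation: add (1/n) b^T w to the objective, then optimize."""
             rng = rng or np.random.default_rng()
             n, d = X.shape
             b = rng.gamma(shape=d, scale=2.0 / epsilon) * _random_direction(d, rng)   # step 1: |b| ~ Gamma(d, 2/eps)
             perturbed = lambda w: _objective(w, X, y, lam) + (b @ w) / n              # step 2: perturbed objective
             return minimize(perturbed, np.zeros(d)).x

     Note how Algorithm 2 never uses the sensitivity bound 2/(nλ): the noise scale depends only on ε, and the optimization runs on the already-perturbed convex objective, which is why its running time is comparable to standard regularized LR.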
