Challenges of real-world data We face an explosion in data from - PowerPoint PPT Presentation

Advances in Privacy-Preserving Machine Learning
Claire Monteleoni
Center for Computational Learning Systems, Columbia University

Challenges of real-world data
We face an explosion in data from e.g.:
• Internet transactions
• Satellite measurements
• Environmental sensors
• …
Real-world data can be:
• Vast (many examples)
• High-dimensional
• Noisy (incorrect/missing labels/features)
• Sparse (relevant subspace is low-dim.)
• Streaming, time-varying
• Sensitive/private

Machine learning
Given labeled data points, find a good classification rule.
• Describes the data
• Generalizes well
E.g. linear separators.

Principled ML for real-world data
Goal: design algorithms to detect patterns in real-world data.
• Want efficient algorithms, with performance guarantees.
Learning with online constraints:
• Algorithms for streaming, or time-varying data.
Active learning:
• Algorithms for settings in which unlabeled data is abundant, and labels are difficult to obtain.
Privacy-preserving machine learning:
• Algorithms to detect cumulative patterns in real databases, while maintaining the privacy of individuals.
New applications of machine learning:
• E.g. Climate Informatics: algorithms to detect patterns in climate data, to answer pressing questions.

Privacy-preserving machine learning
Sensitive personal data is increasingly being digitally aggregated and stored.
Problem: How to maintain the privacy of individuals, when detecting patterns in cumulative, real-world data?
E.g.
• Disease studies, insurance risk
• Economics research, credit risk
• Analysis of social networks

Anonymization: not enough
Anonymization does not ensure privacy.
Attacks may be possible e.g. with:
• Auxiliary information
• Structural information
Privacy attacks:
• [Narayanan & Shmatikov '08] identify Netflix users from anonymized records, IMDB.
• [Backstrom, Dwork & Kleinberg '07] identify LiveJournal social relations from anonymized network topology and minimal local information.

Related work
Data mining:
• Algorithms, often lacking strong privacy guarantees.
• Subject to various attacks.
Cryptography and information security:
• Privacy guarantees, but machine learning less explored.
Learning theory:
• Learning guarantees for algorithms that adhere to strong privacy protocols, but are not efficient.

Related work (continued)
Data mining:
• k-anonymity [Sweeney '02], l-diversity [Machanavajjhala et al. '06], t-closeness [Li et al. '07]. Each found privacy attacks on the previous.
• All are subject to composition attacks [Ganta et al. '08].
Cryptography and information security:
• [Dwork, McSherry, Nissim & Smith, TCC 2006]: Differential privacy, and sensitivity method. Extensions, [Nissim et al. '07].
Learning theory:
• [Blum et al. '08] method to publish data that is differentially private under certain query types. (Can be computationally prohibitive.)
• [Kasiviswanathan et al. '08] exponential time (in dimension) algorithm to find classifiers that respect differential privacy.

$\epsilon$-differential privacy
[DMNS '06]: Given two databases $D_1$, $D_2$ that differ in one element:
A random function $f$ is $\epsilon$-private if, for any $t$,
$$\Pr[f(D_1) = t] \le (1 + \epsilon)\,\Pr[f(D_2) = t]$$
Idea: Effect of one person's data on the output: low.

The sensitivity method
[DMNS '06]: method to produce an $\epsilon$-private approximation to any function of a database.
Sensitivity: For function $g$, the sensitivity $S(g)$ is the maximum change in $g$ with one input.
[DMNS '06]: Add noise, proportional to sensitivity. Output:
$$f(D) = g(D) + \mathrm{Lap}(0,\, S(g)/\epsilon)$$
[Figure: axis labels $t$, $g(D_1)$, $g(D_2)$]

Motivations and contributions
Goal: machine learning algorithms that maintain privacy yet output good classifiers.
• Adapt canonical, widely-used machine learning algorithms
• Learning performance guarantees
• Efficient algorithms with good practical performance
[Chaudhuri & Monteleoni, NIPS 2008]:
• A new privacy-preserving technique: perturb the optimization problem, instead of perturbing the solution.
• Applied both techniques to logistic regression, a canonical ML algorithm.
• Proved learning performance guarantees that are significantly tighter for our new algorithm.
• Encouraging results in simulation.

Regularized logistic regression
We apply the sensitivity method of [DMNS '06] to regularized logistic regression, a canonical, widely-used algorithm for learning a linear separator.
Regularized logistic regression:
Input: $(x_1, y_1), \ldots, (x_n, y_n)$.
• $x_i \in \mathbb{R}^d$, norm at most 1. $y_i \in \{-1, +1\}$.
Output:
$$w^* = \arg\min_w \; \frac{\lambda}{2} w^T w + \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i w^T x_i)\right)$$
• Derived from model: $p(y \mid x; w) = \dfrac{1}{1 + \exp(-y\, w^T x)}$
• First term: regularization.
• $w \in \mathbb{R}^d$ predicts $\mathrm{SIGN}(w^T x)$ for $x \in \mathbb{R}^d$.
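The sensitivity method above can be illustrated on a simple query. The sketch below is not from the slides: it assumes, for illustration only, that g is the mean of n values in [0, 1], so that changing one element moves g by at most 1/n. The function names (`laplace_noise`, `mean_sensitivity`, `private_mean`) are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Lap(0, scale).

    The difference of two i.i.d. exponentials with mean `scale`
    follows a Laplace distribution with location 0 and that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def mean_sensitivity(n):
    """S(g) for g = mean of n values in [0, 1] (illustrative assumption).

    Replacing one value changes the sum by at most 1, so the mean
    changes by at most 1/n.
    """
    return 1.0 / n


def private_mean(values, epsilon, rng=None):
    """Return f(D) = g(D) + Lap(0, S(g)/epsilon) with g = mean."""
    if rng is None:
        rng = random.Random()
    n = len(values)
    g = sum(values) / n
    return g + laplace_noise(mean_sensitivity(n) / epsilon, rng)


if __name__ == "__main__":
    data = [0.0, 1.0, 1.0, 0.0, 1.0]
    # Smaller epsilon means more noise and a stronger privacy guarantee.
    print(private_mean(data, epsilon=0.5, rng=random.Random(0)))
```

The noise scale S(g)/ε shrinks as the database grows (through the 1/n sensitivity) and grows as ε decreases, which is the trade-off the slide's formula expresses.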

Sensitivity method applied to LR
Sensitivity method [DMNS '06] applied to logistic regression:
Lemma: The sensitivity of regularized logistic regression is $2/(n\lambda)$.
Algorithm 1 [Sensitivity-based PPLR]:
1. Solve $w$ = regularized logistic regression with parameter $\lambda$.
2. Pick a vector $h$:
   • Pick $|h|$ from $\Gamma(d,\, 2/(n\epsilon\lambda))$, where the density of $\Gamma(d, t)$ at $x$ is $\propto x^{d-1} e^{-|x|/t}$.
   • Pick direction of $h$ uniformly.
3. Output $w + h$.
Theorem 1: Algorithm 1 is $\epsilon$-private.

New method for PPML
A new privacy-preserving technique: perturb the optimization problem, instead of perturbing the solution.
• No need to bound sensitivity (may be difficult for other ML algorithms).
• Noise does not depend on (the sensitivity of) the function to be learned.
• Optimization happens after perturbation.
Application to regularized logistic regression:
Algorithm 2 [New PPLR]
1. Pick a vector $b$:
   • Pick $|b|$ from $\Gamma(d,\, 2/\epsilon)$.
   • Pick direction of $b$ uniformly.
2. Output:
$$w^* = \arg\min_w \; \frac{\lambda}{2} w^T w + \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i w^T x_i)\right) + \frac{1}{n} b^T w$$

New method for PPML
Theorem 2: Algorithm 2 is $\epsilon$-private.
Remark: Algorithm 2 solves a convex program similar to standard, regularized LR, so similar running time.
General PPML for a class of convex loss functions:
Theorem 3: Given database $X = \{x_1, \ldots, x_n\}$, to minimize functions of the form
$$F(w) = G(w) + \sum_{i=1}^{n} l(w, x_i):$$
if $G(w)$ and $l(w, x_i)$ are everywhere differentiable and have continuous derivatives, $G(w)$ is strongly convex, $l(w, x_i)$ is convex $\forall i$, and $\|\nabla_w l(w, x)\| \le \kappa$ for any $x$, then outputting
$$w^* = \arg\min_w \; G(w) + \sum_{i=1}^{n} l(w, x_i) + b^T w,$$
where $b = B\,r$, s.t. $B$ is drawn from $\Gamma(d,\, 2\kappa/\epsilon)$ and $r$ is a random unit vector, is $\epsilon$-private.

Privacy of Algorithm 2
Proof outline (Theorem 2):
Want to show $\Pr[f(D_1) = w^*] \le (1 + \epsilon)\,\Pr[f(D_2) = w^*]$.
$$D_1 = \{(x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), (a, y)\}$$
$$D_2 = \{(x_1, y_1), \ldots, (x_{n-1}, y_{n-1}), (a', y')\}$$
$$\forall i,\ \|x_i\| \le 1; \qquad \|a\|, \|a'\| \le 1$$
$$\Pr[f(D_1) = w^*] = \Pr[w^* \mid x_1, \ldots, x_{n-1}, y_1, \ldots, y_{n-1}, x_n = a, y_n = y]$$
$$\Pr[f(D_2) = w^*] = \Pr[w^* \mid x_1, \ldots, x_{n-1}, y_1, \ldots, y_{n-1}, x_n = a', y_n = y']$$
We must bound the ratio:
$$\frac{\Pr[w^* \mid x_1, \ldots, x_{n-1}, y_1, \ldots, y_{n-1}, x_n = a, y_n = y]}{\Pr[w^* \mid x_1, \ldots, x_{n-1}, y_1, \ldots, y_{n-1}, x_n = a', y_n = y']} = \frac{h(b_1)}{h(b_2)} = e^{-\frac{\epsilon}{2}\left(\|b_1\| - \|b_2\|\right)}$$
• Where $b_1$ is the unique value of $b$ that yields $w^*$ on input $D_1$. (Likewise $b_2$.)
  • The $b$'s are unique because both terms in the objective are differentiable everywhere.
• Where $h(b_i)$ is the $\Gamma$ density function at $b_i$.
• Bound the RHS, using optimality of $w^*$ for both problems, and bounded norms.
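Algorithms 1 and 2 above can be sketched side by side in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the solver is ordinary gradient descent with a fixed step (the slides do not specify a solver), the iteration count is arbitrary, and all function names are hypothetical. The noise norm is drawn with `random.gammavariate(d, t)`, whose density is proportional to x^(d-1) e^(-x/t), matching the Γ(d, t) on the slides.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample_noise_vector(d, scale, rng):
    """Norm ~ Gamma(shape=d, scale=scale); direction uniform on the sphere."""
    norm = rng.gammavariate(d, scale)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]  # normalized Gaussian = uniform direction
    g_norm = math.sqrt(dot(g, g))
    return [norm * gi / g_norm for gi in g]


def sigmoid_neg(m):
    """Numerically stable 1 / (1 + exp(m))."""
    if m >= 0:
        e = math.exp(-m)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(m))


def solve_lr(X, y, lam, b=None, iters=2000):
    """Minimize (lam/2) w.w + (1/n) sum_i log(1 + exp(-y_i w.x_i)) + (1/n) b.w.

    Plain gradient descent (an assumption of this sketch). With ||x_i|| <= 1
    the objective is (lam + 1/4)-smooth, which justifies the step size.
    """
    n, d = len(X), len(X[0])
    if b is None:
        b = [0.0] * d
    w = [0.0] * d
    step = 1.0 / (lam + 0.25)
    for _ in range(iters):
        grad = [lam * wj + bj / n for wj, bj in zip(w, b)]
        for xi, yi in zip(X, y):
            c = yi * sigmoid_neg(yi * dot(w, xi)) / n
            for j in range(d):
                grad[j] -= c * xi[j]
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w


def algorithm1(X, y, lam, eps, rng):
    """Sensitivity-based PPLR: solve first, then perturb the solution."""
    n, d = len(X), len(X[0])
    w = solve_lr(X, y, lam)
    h = sample_noise_vector(d, 2.0 / (n * eps * lam), rng)
    return [wj + hj for wj, hj in zip(w, h)]


def algorithm2(X, y, lam, eps, rng):
    """New PPLR: perturb the objective with (1/n) b.w, then solve."""
    d = len(X[0])
    b = sample_noise_vector(d, 2.0 / eps, rng)
    return solve_lr(X, y, lam, b=b)


if __name__ == "__main__":
    # Toy data with ||x_i|| <= 1 and labels in {-1, +1}.
    X = [[0.6, 0.2], [0.5, -0.1], [-0.7, 0.1], [-0.4, -0.3]]
    y = [1, 1, -1, -1]
    rng = random.Random(0)
    print(algorithm1(X, y, lam=0.1, eps=1.0, rng=rng))
    print(algorithm2(X, y, lam=0.1, eps=1.0, rng=rng))
```

Note the structural difference the slides emphasize: in `algorithm1` the noise scale depends on the sensitivity 2/(nλ) of the learned function, while in `algorithm2` the noise scale 2/ε does not, and the optimization runs after the perturbation.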
