Machine Learning for Compressive Privacy, S.Y. Kung, Princeton University (PowerPoint PPT Presentation)



slide-1
SLIDE 1

Machine Learning for Compressive Privacy

S.Y. Kung
Princeton University

slide-2
SLIDE 2

Professor S.Y. Kung

  • Email: kung@princeton.edu

References:

  • “Kernel Methods and Machine Learning”, S.Y. Kung, Cambridge University Press (CUP), 2014.

  • SPS Magazine: Invited Column on Compressive Privacy

Supplementary Materials:

  • Tentative Power Point Slides (notes provided)
  • DCA Paper (paper provided)

Machine Learning for Compressive Privacy

slide-3
SLIDE 3
  • i. Overview of Big Data Analysis
  • ii. Subspace and Visualization of Big Data (DCA)
  • iii. Compressive Privacy in Brandeis Program
  • iv. Differential Utility/Cost Advantage (DUCA)
  • v. Nonlinear and Kernel Learning Machines

Machine Learning for Compressive Privacy

slide-4
SLIDE 4

In the internet era, we experience a phenomenon of “digital everything”. Due to its quantitative (volume and velocity) and qualitative (variety) challenges, it is imperative to address various computational aspects of big data. For big data, the curse of high feature dimensionality raises grave concerns about computational complexity and over-training.

Machine Learning for Compressive Privacy

  • i. Overview of Big Data Analysis
slide-5
SLIDE 5

The old Moore’s Law will soon lose ground to a new Moore’s Law on massive data readily captured in digital format:

  • Device Moore’s Law: VLSI gates double every 1.5 years
  • Data Moore’s Law: data in digital format doubles every 2 years

“The rise of data scientists” [Schonberger12].

Quantitatively:

  • Over 200 million emails are sent and 100 hours of video are uploaded to YouTube every minute!!

slide-6
SLIDE 6

Facebook: In 2012, Facebook had over one billion users and, more importantly, over 100 billion social links. Every day, more than 250 million new photos are added to Facebook.

Examples of Big Data

slide-7
SLIDE 7

Volume/Velocity: Curse of Dimensionality in Big Data

big = extremely high feature dimensionality (M ≫ N):

  • computational complexity
  • data over-training, causing performance degradation.

big = extremely large-scale data size (N ≫ M):

  • computational complexity, causing costly hardware, long computing time, and high processing power.

“The era of trillions of sensors is upon us....”

slide-8
SLIDE 8

Three distributions of the learned threshold, from small to large N (N = sample number per class); the accurate threshold is 0 (error function).

slide-9
SLIDE 9

Data Matrix (M × N): e.g., a microarray data matrix or a document data matrix.

slide-10
SLIDE 10

Qualitatively speaking, big data will incur a paradigm shift towards the development of “error-tolerant” machine learning algorithms.

  • The variety of big data analysis stems from its divergent types of sources, sometimes termed “megasourcing” [“The Big Data Paradox”, David Auerbach].
  • Moreover, big data are often messy, i.e. imprecise and incomplete. For example, some data sources are unreliable, vulnerable, missing, or deliberately hidden from public access.

It is high time to seriously consider a major paradigm shift, both quantitatively and qualitatively, in learning machines for processing massive and messy (m&m) big data with coordinated hardware/software codesign.

slide-11
SLIDE 11

Qualitatively:

Big Data Variety: vectorial or non-vectorial data
Big Data Veracity: uncertain or imprecise data

An implicit principle behind big data analysis is to make use of ALL of the available data, vectorial or non-vectorial (i.e. the variety), whether they are messy, imprecise, or incomplete (i.e. the veracity).

slide-12
SLIDE 12

“Blind men and the elephant”

Big Data to Users = Elephant to Blind Men

Quantitatively and qualitatively:

slide-13
SLIDE 13

Google Flu Trends

Google Flu Trends detects a significant increase in H1N1 flu activity two weeks earlier than the CDC, using big data learning and cloud computing.

A lot of hard work behind the success:

  • 45 keywords are chosen from 5M keywords
  • Half a billion mathematical models tried

(Figure: 2007-2008 U.S. flu activity, Mid-Atlantic region, percentage; Google Flu Trends vs. CDC; Jan 28, 2008.)

slide-14
SLIDE 14

Objective of Machine Learning

The main function of machine learning is to convert the wealth of training data into useful knowledge by learning: data ≠ knowledge.

Machine learning is a research field that studies theories and technologies for adapting a system model to a training dataset, so that the learned model can generalize and provide a correct classification or useful guidance even for previously unseen inputs.

slide-15
SLIDE 15

Machine Learning System: Feature Engineering and Labeling Engineering

Much of the success of machine learning lies in an effective representation of the objects. Learning models such as deep learning and kernel machines can derive a much-expanded vector space from massive and messy big data, from which features may then be further extracted for classification. Such a learning methodology has had numerous successful and practical applications.

slide-16
SLIDE 16

Feature Representation

The secret of success of a machine learning system lies in finding an effective representation for the objects of interest.

In a basic representation, an object is represented as a feature vector in a finite-dimensional vector space. Two common strategies:

  • dimension reduction (x ⇒ z): to battle over-training
  • dimension expansion (x ⇒ φ(x)): to battle under-training
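The two strategies can be sketched numerically; a minimal toy illustration in NumPy (the input vector, the projection direction, and the degree-2 polynomial map are our own choices, not from the slides):

```python
import numpy as np

# Toy numerical illustration of the two strategies (vectors and maps are
# our own choices, not from the slides).
x = np.array([3.0, 4.0])

# Dimension reduction x => z: project onto one unit direction w.
w = np.array([0.6, 0.8])   # unit vector
z = w @ x                  # 1-D code

# Dimension expansion x => phi(x): a degree-2 polynomial map, as used
# implicitly by polynomial kernels.
phi = np.array([x[0], x[1], x[0] ** 2, x[0] * x[1], x[1] ** 2])
```

Reduction compresses the 2-D input to one coordinate; expansion lifts it to five, which is what makes a linear classifier in the expanded space nonlinear in the original one.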

slide-17
SLIDE 17

Machine Learning for Compressive Privacy

We shall explore various projection methods, e.g. PCA, for dimension reduction, as a prelude to visualization and privacy preservation of big data. An important development is Discriminant Component Analysis (DCA), which offers a compression scheme to enhance privacy protection in contextual and collaborative learning environments.

An implicit principle behind big data analysis is to make use of ALL of the available data, vectorial or non-vectorial (i.e. the variety), whether they are messy, imprecise, or incomplete (i.e. the veracity). Massive data are being rapidly captured in digital format, including digital books/voice/images/video/commerce. They come from divergent types of sources, from physical (sensor/IoT) to social and cyber (web) types. Note that visualization tools are meant to supplement (not replace) the domain expertise (e.g. a cardiologist) and provide a big picture to help users formulate critical questions and subsequently postulate heuristic and insightful answers.

  • ii. Subspace and Visualization of Big Data (DCA)
slide-18
SLIDE 18

Important Research Areas on Big Data: by CCF vote, 55% goes to Visualization (vs. HPC, System Integration, and Revolutionary Algorithms).

slide-19
SLIDE 19

Visualization of (High-Dimensional) Big Data

  • It facilitates the formulation of insightful questions and the postulation of heuristic answers. A picture is worth a thousand words!!
  • Visualization offers an informal presentation of massive and messy big data and supplements (instead of replaces) the domain expertise (e.g. a cardiologist).

PCA/DCA serves as a good example/evidence!

slide-20
SLIDE 20

Visualization of (High-Dimensional) Big Data: Dimension Reduction

  • Unsupervised: PCA (vector data), KPCA (non-vector data); Multi-Dimensional Scaling (MDS); Deep Learning Auto-Encoder; Tree Clustering; K-means / Kernel K-means; SOM / Kernel SOM; Vectorization
  • Supervised: DCA (vector data), KDCA (non-vector data)

slide-21
SLIDE 21

PCA for Subspace Projection & Dimension Reduction

Visualization for unsupervised learning datasets.

PCA aims at the best recoverability of the original data in the Original Vector Space. The signal-subspace components of PCA are associated with the power, while the noise-subspace components of PCA are tightly coupled with the recoverability.

  • Max signal subspace: maximize power.
  • Min noise subspace: minimize RE (mean-square-error), to best show the original image.

slide-22
SLIDE 22

PCA for a 2D Dataset

Projection onto the major subspace:

$$y = \mathbf{v}_1^T \mathbf{x} = [\,v_{11}\;\; v_{21}\,] \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

Reconstruction from $y$:

$$\hat{\mathbf{x}} = \mathbf{v}_1 y = P\mathbf{x}, \qquad P = \mathbf{v}_1 \mathbf{v}_1^T = \begin{bmatrix} v_{11} \\ v_{21} \end{bmatrix} [\,v_{11}\;\; v_{21}\,]$$

The residual error $\mathbf{x} - \hat{\mathbf{x}}$ lies in the minor subspace.

slide-23
SLIDE 23

The original input vector can be reconstructed from y as:

Trace Norm Criterion: M-Dimensional Case

Useful for Proof:

slide-24
SLIDE 24

PCA in Trace-Norm Formulation:

$$W_{\mathrm{PCA}} \equiv \arg\max_{W:\, W^T W = I} \sum_{i=1}^{m} \mathbf{w}_i^T S\, \mathbf{w}_i = \arg\max_{W \in \mathbb{R}^{M \times m}:\, W^T W = I} \mathrm{trace}\,[\,W^T S\, W\,]$$
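A minimal sketch of this trace-norm formulation on toy data: the optimizer is given by the top-m eigenvectors of the scatter matrix S, and the captured power equals the sum of the top-m eigenvalues (data dimensions and m are illustrative):

```python
import numpy as np

# Minimal sketch of the trace-norm formulation on toy data: the top-m
# eigenvectors of the scatter matrix S maximize trace(W^T S W) subject to
# W^T W = I, and the captured power is the sum of the top-m eigenvalues.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # N = 100 samples, M = 5 features
Xc = X - X.mean(axis=0)                # center the data
S = Xc.T @ Xc / len(X)                 # scatter (covariance) matrix

m = 2
vals, vecs = np.linalg.eigh(S)         # eigenvalues in ascending order
W = vecs[:, -m:]                       # top-m principal directions

power = np.trace(W.T @ S @ W)          # equals vals[-m:].sum()
```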

slide-25
SLIDE 25

Visualization for Supervised Learning Datasets

  • PCA: unsupervised learning, aiming at recoverability.
  • DCA: supervised learning, aiming at classification.

Scatter (covariance) matrix: $S = U \Lambda U^T$, with total power $\mathrm{trace}(S) = \sum_{i=1}^{M} \lambda_i = \sum_{i=1}^{M} p_i$.

Signal subspace in CVS (maximize power): enables classification. DD = “Discriminant Distance” in CVS.

Noise subspace in EVS (RE, the mean-square-error):
  • Recoverability (visualization): show the original image.
  • Anti-recoverability (privacy): hide the original image.

slide-26
SLIDE 26

For a supervised (i.e. labeled) dataset with L classes, the scatter matrix decomposes as

$$S = S_B + S_W = \text{Signal Matrix} + \text{Noise Matrix}$$

  • Signal Matrix = $S_B$ = Between-Class Matrix (beneficial for discrimination)
  • Noise Matrix = $S_W$ = Within-Class Matrix (detrimental to discrimination)
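The decomposition can be verified numerically; a sketch on toy two-class data (class sizes, means, and dimension are illustrative choices of ours):

```python
import numpy as np

# Numerical check of S = SB + SW on toy two-class data (class sizes,
# means, and dimension are illustrative).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 3)), rng.normal(2.0, 1.0, (40, 3))])
y = np.array([0] * 30 + [1] * 40)

mu = X.mean(axis=0)
S = (X - mu).T @ (X - mu)          # total scatter matrix

SB = np.zeros((3, 3))              # between-class (signal) matrix
SW = np.zeros((3, 3))              # within-class (noise) matrix
for c in (0, 1):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    SB += len(Xc) * np.outer(mc - mu, mc - mu)
    SW += (Xc - mc).T @ (Xc - mc)
```

The identity S = SB + SW holds exactly when SB weights each class by its sample count, which is the convention used here.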

slide-27
SLIDE 27

Fisher’s Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) maximizes the signal-to-noise ratio:

$$\mathrm{SNR} = \frac{s}{n} = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}$$

DCA optimizes the sum of SNRs, $DD^2$:

$$\mathrm{SumSNR} \equiv DD^2 = \sum_{i=1}^{m} \frac{s_i}{n_i} = \sum_{i=1}^{m} \frac{\mathbf{w}_i^T S_B \mathbf{w}_i}{\mathbf{w}_i^T S_W \mathbf{w}_i}$$

Discriminant Distance (DD) ⇔ Discriminant Power (DP):

$$DP \equiv \sum_{i=1}^{m} \frac{p'_i}{n_i} = \sum_{i=1}^{m} \frac{s_i + n_i}{n_i} = DD^2 + m$$

The goal is to enhance the (projected) $S_B$ and suppress the (projected) $S_W$.
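A sketch of the Fisher criterion on toy two-class data (class centers and spreads are our own choices): the maximizing direction is the principal eigenvector of $S_W^{-1} S_B$, and its SNR dominates that of any other direction.

```python
import numpy as np

# Sketch of the Fisher criterion on toy two-class data: the maximizing
# direction is the principal eigenvector of SW^{-1} SB.
rng = np.random.default_rng(2)
X0 = rng.normal([0.0, 0.0], 0.5, (50, 2))   # class 0
X1 = rng.normal([3.0, 1.0], 0.5, (50, 2))   # class 1
mu = np.vstack([X0, X1]).mean(axis=0)

SW = ((X0 - X0.mean(0)).T @ (X0 - X0.mean(0))
      + (X1 - X1.mean(0)).T @ (X1 - X1.mean(0)))          # within-class
SB = (50 * np.outer(X0.mean(0) - mu, X0.mean(0) - mu)
      + 50 * np.outer(X1.mean(0) - mu, X1.mean(0) - mu))  # between-class

vals, vecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
w = np.real(vecs[:, np.argmax(np.real(vals))])

snr = (w @ SB @ w) / (w @ SW @ w)   # maximal signal-to-noise ratio
```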

slide-28
SLIDE 28

component analysis

Canonical Vector Space (CVS)

Rotational Invariance of DD/DP

  • Mahalanobis Orthonormality = Canonical Orthonormality

I W S W

W T

=

Rotational Invariant

  • Euclidean Orthonormality in the original EVS

I W WT =

Not Rotational Invariant

DCA Optimization Criterion: Discriminant Power (DP)

⇒ Orthonormal Bases

  • Invariance of Orthonormality
  • Invariance of DP (Redundancy)

⇒ Rotational Invariance (Visualization )

slide-29
SLIDE 29

DD² = Sum of SNRs

DD in CVS, where $\tilde{S}_W = I$:

$$\mathrm{SumSNR} = DD^2 = \sum_{i=1}^{m} \frac{\tilde{\mathbf{w}}_i^T \tilde{S}_B \tilde{\mathbf{w}}_i}{\tilde{\mathbf{w}}_i^T \tilde{S}_W \tilde{\mathbf{w}}_i} = \sum_{i=1}^{m} \tilde{\mathbf{w}}_i^T \tilde{S}_B \tilde{\mathbf{w}}_i$$

Class-ceiling property: at most L-1 discriminative components, often giving a 10:1 compression ratio.

slide-30
SLIDE 30

DP in CVS = Power in CVS: DCA ⇒ PCA

DCA and PCA are equivalent in CVS!

Preservation of DP under coordinate transformation. Algebraically,

$$DP \equiv \sum_{i=1}^{m} \frac{p_i}{n_i} = \sum_{i=1}^{m} \frac{\mathbf{w}_i^T S\, \mathbf{w}_i}{\mathbf{w}_i^T S_W \mathbf{w}_i} = \sum_{i=1}^{m} \frac{\tilde{\mathbf{w}}_i^T \tilde{S}\, \tilde{\mathbf{w}}_i}{\tilde{\mathbf{w}}_i^T \tilde{\mathbf{w}}_i} \;\overset{\tilde{\mathbf{w}}_i^T \tilde{\mathbf{w}}_i = 1}{=}\; \sum_{i=1}^{m} \tilde{\mathbf{w}}_i^T \tilde{S}\, \tilde{\mathbf{w}}_i = \mathrm{trace}(\tilde{W}^T \tilde{S}\, \tilde{W})$$

slide-31
SLIDE 31

Data Laundry ⇒ Canonical Vector Space (CVS)

The canonical vector space is defined as the whitened space; DCA is PCA in the whitened space (here L = 3, so L-1 = 2 components). Pictorially, whitening maps the original space (EVS) to the canonical space (CVS):

$$\tilde{\mathbf{x}} = S_W^{-1/2}\, \mathbf{x}$$

and the projection vectors are re-mapped by

$$\mathbf{w} = [S_W^{-1/2}]^T\, \tilde{\mathbf{w}}$$
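The data-laundry step can be sketched as follows, assuming toy $S_W$ and $S_B$ matrices: whiten with $S_W^{-1/2}$, take the principal direction of the whitened signal matrix (PCA in CVS), and back-map to the original space.

```python
import numpy as np

# "Data laundry" sketch on toy scatter matrices: whiten with SW^{-1/2},
# run PCA on the whitened signal matrix, and back-map the result.
SW = np.array([[2.0, 0.5], [0.5, 1.0]])   # within-class (noise) matrix
SB = np.array([[1.0, 0.3], [0.3, 0.5]])   # between-class (signal) matrix

d, U = np.linalg.eigh(SW)
SW_inv_sqrt = U @ np.diag(d ** -0.5) @ U.T   # whitening transform SW^{-1/2}

SB_tilde = SW_inv_sqrt @ SB @ SW_inv_sqrt    # signal matrix in CVS
vals, vecs = np.linalg.eigh(SB_tilde)
w_tilde = vecs[:, -1]                        # principal direction in CVS
w = SW_inv_sqrt @ w_tilde                    # back-map to the original space
```

The back-mapped direction satisfies the canonical orthonormality $\mathbf{w}^T S_W \mathbf{w} = 1$, which is exactly the Mahalanobis constraint of the previous slides.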

slide-32
SLIDE 32

DCA = PCA in CVS

Forward mapping: find the first m principal eigenvectors of the discriminant matrix under the “canonical normality” constraint. Direct EVS method: trace-norm optimizer. Backward mapping: map each eigenvector back to the original space, for i = 1, 2, …, m.

slide-33
SLIDE 33

Roles of DCA Eigenvalues

Principal eigenvalues for the signal subspace: DCA maximizes RDP and optimizes (minimizes or maximizes) RE for recoverability or anti-recoverability.

slide-34
SLIDE 34
Simulation Results: DCA vs. PCA

  • Dimension reduction (cost, power, storage, communication)
  • Prediction performance (classification accuracy in the prediction phase)

With the data-laundry process, DCA (i.e. supervised PCA) far outperforms PCA.

slide-35
SLIDE 35

HAR dataset (L = 6 classes, M = 561 features), m = 5, 200 runs: DCA yields a >10% accuracy gain over PCA (RE and accuracy panels from the HAR classifier experiment).

slide-36
SLIDE 36

Visualization: PCA vs. DCA

(Panels: PCA12, PCA34 vs. DCA23, DCA25; before vs. after data laundry.) After data laundry, DCA yields long inter-group distances.

slide-37
SLIDE 37

In the internet era, we benefit greatly from the combination of packet switching, bandwidth, processing, and storage capacities in the cloud. However, “big data” often has a connotation of “big brother”: since the data being collected on consumers like us is growing exponentially, attacks on our privacy are becoming a real threat. New technologies are needed to better assure our privacy protection when we upload personal data to the cloud. An important development is Discriminant Component Analysis (DCA), which offers a compression scheme to enhance privacy protection in contextual and collaborative learning environments. DCA can be viewed as a supervised PCA which can simultaneously rank-order (1) the sensitive components and (2) the desensitized components.

Machine Learning for Compressive Privacy

  • iii. Compressive Privacy in Brandeis Program
slide-38
SLIDE 38

Cloud Computing: Data Center

Pumped by wireless, internet, and parallel-processing technologies, cloud computing offers remotely hosted application logic units, data stores, and a diversity of application resources.

  • It offers data processing services ubiquitously, i.e. at any time, anywhere, and for anyone.
  • It manages the server farm, supports extensive databases and vast storage space, and is ready to lease out, on demand from clients, a variable number of machines. It holds the promise of elastic hosting, offering application domains for lease to clients.

The main problems of cloud computing lie in the communication cost and privacy protection. Big data nowadays has a connotation of Big Brother.

slide-39
SLIDE 39

Google’s Data Centers

Colorful water pipes cool one of Google’s data centers. The company sells user data to advertisers for targeted marketing campaigns. (Connie Zhou/Google)

slide-40
SLIDE 40

Machine Learning Approach to Compressive Privacy

With rapidly growing internet commerce, much of our daily activity is moving online, and an abundance of personal information (such as sales transactions) is being collected, stored, and circulated around the internet and cloud servers, often without the owner’s knowledge. This raises an imminent concern about the protection and safety of sensitive and private data, i.e. “online privacy”, also known as “internet privacy” or “cloud privacy”. This course presents some machine learning methods useful for privacy-preserving data mining, which has recently received a great deal of attention in the IT community.

slide-41
SLIDE 41

Why the DARPA Brandeis Program?

When the control of data protection is left entirely to the cloud server, data privacy unfortunately becomes vulnerable to hacker attacks or unauthorized leakage. It is therefore much safer to keep the control of data protection solely in the hands of the data owner and not take chances with the cloud.

slide-42
SLIDE 42

Public vs. Private Spaces

The analysis of PP (privacy-preserving) technology requires a clear separation of two spaces: the public sphere (space) and the private sphere (space).

slide-43
SLIDE 43
Encryption for Privacy Protection

  • Public space (cloud): encrypted data; cloud server; trusted authority.
  • Private space (client): decrypted data; data owners.
slide-44
SLIDE 44

Why not Encryption?

Theme of the Brandeis Program: control of data protection should be returned to the data owner rather than left at the mercy of the cloud server.

To ensure protection via encryption, during processing in the “public sphere” the input data of each owner and any intermediate results are revealed/decrypted only to the trusted authority.

Nevertheless, there remains a substantial chance of hacker attack or unauthorized leakage when data protection is left entirely in the hands (and at the mercy) of the cloud server.

slide-45
SLIDE 45

Data owner should have control over data

From the data privacy perspective, the accessibility of data is divided into two separate spheres: (1) the private sphere, where data owners generate and process decrypted data; and (2) the public sphere, where cloud servers can generally access only encrypted data, with the exception that only the trusted authority may access decrypted data confidentially. When the control of data protection is left entirely to the cloud server, however, the data become vulnerable to hacker attacks or unauthorized leakage. It is safer to let the data owner control the data privacy and not to take chances with the cloud servers. To this end, we must design privacy-preserving information systems so that the shared/pushed data are only useful for the intended utility and not easily diverted for malicious privacy intrusion.

slide-46
SLIDE 46

DARPA Brandeis Program on Internet Privacy (IP): US$60M / 4.5 years.

Build information systems so that the shared data could be

  • effective and relevant for the intended utility (e.g. classification),
  • but not easily diverted to other purposes (e.g. privacy).

slide-47
SLIDE 47

Brandeis program structure: Mobile CRT (Raytheon BBN, Invincea); TA1: Privacy Technologies (CMU, Tel Aviv U; UC Berkeley, MIT, Cornell, U MD; Iowa State U, Princeton; Stealth Software Technologies); TA3: Human Data Interaction / Experimental Prototype Systems; TA4: Measuring Privacy (Cybernetica, U of Tartu).

slide-48
SLIDE 48

In an emergency such as a bomb threat, many mobile images from various sources may be voluntarily pushed to the command center for wide-scale forensic analysis. In this case, CP may be used to compute the dimension-reduced feature subspace which can (1) effectively identify the suspect(s) and (2) adequately obfuscate the face images of the innocent.
slide-49
SLIDE 49

Mobile Information Flow

The control center makes a request for images near an incident. The user’s phone responds with some incident-relevant images per their privacy policy.

(User application with privacy policy PE, Android. Example: location = utility; face = privacy.)

slide-50
SLIDE 50

More on Mobile PP Applications

“The Android platform provides several sensors that let you monitor the motion of a device. Two of these sensors are always hardware-based (the accelerometer and gyroscope)” (most devices have both), “and three of these sensors can be either hardware-based or software-based (the gravity, linear acceleration, and rotation vector sensors).” Adapted from C.C. Liu, et al.

slide-51
SLIDE 51

More on Mobile PP Applications

Adapted from C.C. Liu, et al.: D (data) = Activity (B/L), Location (W/L), Tabs (B/L); app (M) = speech, motion, ID, password.

slide-52
SLIDE 52

Machine Learning Approach to Privacy Protection of Internet/Cloud Data

We need to develop new methods to jointly optimize two design considerations, P & U: privacy protection and utility maximization.

Objective: explore information systems that simultaneously perform

  • Utility space maximization: deliver the intended data mining, classification, and learning tasks.
  • Privacy space minimization: safeguard personal/private information.

slide-53
SLIDE 53

CP involves joint optimization over three design spaces: (i) the feature space; (ii) the utility subspace; and (iii) the cost subspace (i.e. the privacy subspace).

slide-54
SLIDE 54

Collaborative Learning for PP

Compression (PCA), random noise, random transform: public data vs. private data.

USC’s Pickle: a collaborative learning model built for MFCC speaker recognition (widely used in acoustic mobile applications) to enable speaker recognition without revealing the speech content itself.

slide-55
SLIDE 55
  • Single-user vs. Multi-user environments
  • Centralized vs. Distributed Processing
  • Supervised vs. Unsupervised Learning

Collaborative Learning for PP

slide-56
SLIDE 56

Example of CUEP

  • CU: classification for utility
  • EP: estimation for privacy

It enables classification of face/speech data (utility) while protecting the privacy (e.g. the face image or speech content) from malicious cloud users.

slide-57
SLIDE 57

CUEP example: a classification formulation for utility but an estimation formulation for privacy. (Figure: original vs. masked data; classification / personality / privacy.)
slide-58
SLIDE 58

  • Low alert level
  • High alert level

For example, in the US, EFF-100 systems streaming data from surveillance cameras (and, in the future, mobile cameras) can very well reveal where and when your car is.

slide-59
SLIDE 59

Discriminant Component Analysis (DCA)

Discriminant Component Analysis (DCA) offers a compression scheme to enhance privacy protection in contextual and collaborative learning environments. DCA has (a) its classification goal characterized by the discriminant distance and (b) its privacy components controlled by a ridge parameter. Therefore, DCA is a promising algorithmic tool for CP.

slide-60
SLIDE 60

ρ’ = 0.00001 ≈ 0: rank-order SU and NU; rank-order SP and NP. This is not good enough, as we want to rank-order both subspaces simultaneously.

slide-61
SLIDE 61

Noise-Subspace Eigenfaces ρ’ = 0.00001

slide-62
SLIDE 62
slide-63
SLIDE 63

Doubly-Ridged DCA

  • ρ: ridge parameter for the noise matrix
  • ρ′: ridge parameter for the signal matrix

Trace-norm optimizer; Regulated Discriminant Power (RDP).

slide-64
SLIDE 64

Algorithm: DCA Learning Model

  • Compute the discriminant matrix under the canonical-normality constraint.
  • Perform eigenvalue decomposition.
  • The optimal DCA projection matrix maximizes RE for anti-reconstruction and maximizes the component power for utility maximization.
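A hedged sketch of these steps (our reading of the slides, not the authors' reference implementation; the scatter matrices, m, and the ridge value ρ are illustrative):

```python
import numpy as np

# Hedged sketch of the DCA learning steps above. SB, SW, m, and the
# ridge rho are illustrative assumptions.
def dca_projection(SB, SW, m, rho=1e-3):
    """Whiten with the ridged noise matrix, eigen-decompose the
    discriminant matrix in the canonical space, and back-map."""
    M = SW.shape[0]
    d, U = np.linalg.eigh(SW + rho * np.eye(M))   # ridged noise matrix
    W_inv_sqrt = U @ np.diag(d ** -0.5) @ U.T     # (SW + rho*I)^(-1/2)
    D = W_inv_sqrt @ SB @ W_inv_sqrt              # discriminant matrix in CVS
    vals, vecs = np.linalg.eigh(D)                # ascending eigenvalues
    top = vecs[:, ::-1][:, :m]                    # m principal eigenvectors
    return W_inv_sqrt @ top                       # canonical-normal projection

SB = np.array([[1.0, 0.2], [0.2, 0.4]])
SW = np.array([[1.5, 0.1], [0.1, 0.8]])
W = dca_projection(SB, SW, m=1)
```

By construction the returned columns satisfy the canonical normality $W^T (S_W + \rho I) W = I$.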

slide-65
SLIDE 65

Dual Roles of P-DCA Eigenvalues

Simultaneously rank-order two subspaces:

  • principal eigenvalues for the private subspace;
  • minor eigenvalues for the privatized subspace.

slide-66
SLIDE 66
Example: Yale Face Dataset

  • The privacy space is characterized by a 15-category classification formulation, while
  • the utility aims at a utility-oriented classification (e.g. eyeglasses or not).

slide-67
SLIDE 67

ρ’ = -0.05 ρ’ = 0

slide-68
SLIDE 68

Private Eigenfaces PU (Privatized/Utilizable) Eigenfaces ρ’ = -0.05

slide-69
SLIDE 69
slide-70
SLIDE 70

Without rank ordering vs. with rank ordering: ρ’ = -0.05 vs. ρ’ = +0.00001.

slide-71
SLIDE 71

We shall explore joint optimization over three design spaces: (a) the feature space, (b) the classification space, and (c) the privacy space. This prompts a new paradigm called DUCA to explore information systems which simultaneously perform utility space maximization (delivering the intended data mining, classification, and learning tasks) and cost (privacy) space minimization (safeguarding personal/private information).

Machine Learning for Compressive Privacy

  • iv. Differential Utility/Cost Advantage (DUCA)
slide-72
SLIDE 72

DUCA ≠ DCA (panels: PCA vs. DUCA).

slide-73
SLIDE 73

Our (CP) approach involves joint optimization over three design spaces: (i) the feature space; (ii) the utility subspace; and (iii) the cost subspace (i.e. the privacy subspace).

slide-74
SLIDE 74
slide-75
SLIDE 75

Entropy and Venn Diagram

slide-76
SLIDE 76

Differential Privacy

slide-77
SLIDE 77

Compressive Privacy (CP)

CP enables the data owner to “encrypt” the message using a privacy-information-lossy transformation, e.g.

  • dimension reduction (subspace)
  • feature selection

hence

  • preserving the data owner’s privacy while
  • retaining the capability to facilitate the intended classification purpose.

slide-78
SLIDE 78

CP Theoretical Foundations

  • Statistical Analysis
  • Subspace Optimization
  • Information Theory
  • Estimation Theory
  • Machine Learning
slide-79
SLIDE 79

Entropy and Covariance Matrix

slide-80
SLIDE 80

Double Income Problem (DIP) Privacy Utility

slide-81
SLIDE 81

Compressive Privacy

slide-82
SLIDE 82

Estimation Theory and Compressive Privacy

Gauss-Markov (statistical estimation) theorem; ridge criterion:

$$\arg\min_{\mathbf{f}} \left( \| X\mathbf{f} - \mathbf{y} \|^2 + \rho \|\mathbf{f}\|^2 \right)$$
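The ridge criterion on this slide has the familiar closed form $\mathbf{f} = (X^T X + \rho I)^{-1} X^T \mathbf{y}$; a minimal sketch with synthetic data (the design matrix, noise level, and ρ are our own choices):

```python
import numpy as np

# Ridge-regression sketch: the minimizer of ||X f - y||^2 + rho ||f||^2
# has the closed form f = (X^T X + rho I)^{-1} X^T y.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))                     # design matrix
f_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ f_true + 0.01 * rng.normal(size=50)      # noisy observations

rho = 0.1
f_hat = np.linalg.solve(X.T @ X + rho * np.eye(4), X.T @ y)
```

With low noise and a small ridge, the estimate lands close to the generating coefficients; a larger ρ trades accuracy for stability (and, in the CP setting, for information loss).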

slide-83
SLIDE 83
slide-84
SLIDE 84
slide-85
SLIDE 85
slide-86
SLIDE 86
slide-87
SLIDE 87
slide-88
SLIDE 88
slide-89
SLIDE 89
slide-90
SLIDE 90
slide-91
SLIDE 91

Machine Learning and Compressive Privacy

slide-92
SLIDE 92
slide-93
SLIDE 93
slide-94
SLIDE 94

For CP problems, there are two types of teacher values in the dataset: one for the utility labels and one for the privacy labels. This implies that there are two types of between-class scatter matrices, one for the utility labels and one for the privacy labels.

slide-95
SLIDE 95

DUCA for Supervised Machine Learning

slide-96
SLIDE 96

Generalized Eigenvalue Problem

Again, the optimal queries can be directly derived from the principal eigenvectors of the generalized eigenvalue problem, where L and C denote the numbers of utility and privacy labels. Note that there are only L + C - 2 meaningful eigenvectors, because the two between-class matrices have ranks at most L - 1 and C - 1, respectively.
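Since the exact matrix is elided on the slide, the following is only a plausible DUCA-style sketch: maximize the utility discriminant power relative to the (ridged) privacy discriminant power via a generalized eigenproblem. The matrices and ρ below are illustrative assumptions, not the slides' values.

```python
import numpy as np

# Hedged DUCA-style sketch: assume the objective maximizes the utility
# power over the (ridged) privacy power via the generalized eigenproblem
#   SB_u q = lambda (SB_p + rho*I) q.
SB_u = np.array([[2.0, 0.3], [0.3, 0.5]])   # utility between-class matrix
SB_p = np.array([[0.4, 0.1], [0.1, 1.2]])   # privacy between-class matrix
rho = 1e-2
P = SB_p + rho * np.eye(2)

vals, vecs = np.linalg.eig(np.linalg.inv(P) @ SB_u)
q = np.real(vecs[:, np.argmax(np.real(vals))])   # principal "query"

# The achieved utility/privacy power ratio equals the top eigenvalue.
ratio = (q @ SB_u @ q) / (q @ P @ q)
```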

slide-97
SLIDE 97

DUCA is simply a generalization of its predecessor, DCA, which was designed for utility-only machine learning applications. Naturally, just like DCA, there are ways for DUCA to extract additional meaningful queries.

slide-98
SLIDE 98

With the utility/privacy class labels, “H/M/L” denotes the three (High/Middle/Low) utility classes (i.e. family income) and “+/-” denotes the two privacy classes (i.e. who earns more between the couple).

slide-99
SLIDE 99

UAM and CAM can be learned from the given dataset and their respective class labels.

Parallel extraction of multiple queries: the two principal generalized eigenvectors can then be computed as

g1 = [0.14, 0.87, 0.17, 0.44]^T,  g2 = [0.002, 0.11, 0.68, 0.72]^T.

slide-100
SLIDE 100

The two-dimensional visualization of the DUCA-CP subspace, in which the family income classes (H/M/L) are highly separable but not so for the income disparity categories (+/-). For example, the query here obviously belongs to the middle-class (utility) but its disparity (privacy) remains unknown, i.e. protected.

slide-101
SLIDE 101
slide-102
SLIDE 102

Four Categories of PPUM Formulations

  • CUCP: classification formulation for both utility and privacy;
  • EUEP: estimation formulation for both utility and privacy;
  • EUCP: estimation for utility but classification for privacy;
  • CUEP: classification for utility but estimation for privacy.

(C: classification; E: estimation; U: utility; P: privacy.)

If the privacy function is prescribed by a matrix P, the privacy scatter matrix is simply replaced accordingly; likewise when the utility function is prescribed by a matrix P.

slide-103
SLIDE 103

If the privacy function is prescribed by the matrix P, we have the following adjustments:

slide-104
SLIDE 104

KLM

slide-105
SLIDE 105
slide-106
SLIDE 106

Three reconstructed images are based on 399 FFT components selected by different filters: (a) rectangular window (no learning), 73%; (b) unsupervised filtering based on the components’ variances, 73.9%; (c) supervised filtering based on the components’ FDRs, 73%. (Courtesy of A. Filipowicz.)

slide-107
SLIDE 107

Signal Processing Solution

Utility = identification; privacy = reconstruction. (Phase vs. magnitude.)

slide-108
SLIDE 108

Full-Dimension Olivetti

Phase information proves vital for reconstruction; it has little, if not negative, impact on classification.

(Original vs. reconstruction with phase, classification accuracy 0.97, and without phase, 0.9725.)
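The phase/magnitude observation can be sketched on a toy 1-D signal (the Olivetti experiment itself is not reproduced; the signal length and seed are arbitrary):

```python
import numpy as np

# Toy sketch: reconstruct a signal from its FFT magnitude plus phase vs.
# magnitude alone. The 1-D "signal" stands in for a face image.
rng = np.random.default_rng(4)
x = rng.normal(size=64)

F = np.fft.fft(x)
# Magnitude + phase: exact reconstruction.
x_with_phase = np.real(np.fft.ifft(np.abs(F) * np.exp(1j * np.angle(F))))
# Magnitude only (phase discarded): reconstruction fails.
x_no_phase = np.real(np.fft.ifft(np.abs(F)))

err_with = np.linalg.norm(x - x_with_phase)
err_without = np.linalg.norm(x - x_no_phase)
```

Discarding the phase yields a large reconstruction error, which is why withholding phase can serve privacy while magnitude features may still support classification.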

slide-109
SLIDE 109

Big data analysis usually involves nonlinear data analysis, and the two most promising approaches are the kernel learning machine (KLM) and the deep learning machine (DLM). The safest possible protection is to withhold the privy data from sharing in the first place. This schema, however, presents a formidable challenge in developing machine learning tools for incomplete data analysis (IDA). Fortunately, KLM can naturally be extended to all types of non-vectorial data analysis, including IDA. Moreover, KLM can facilitate:

  • Intrinsic space and privacy: reduce the number of training vectors that need to be stored in the cloud (SVM), or even make it unnecessary to share any training data via the intrinsic kernel approach.
  • Auto-encoder for privacy: compare two nonlinear auto-encoders (KLM and DLM) as data minimizers.
  • Kernel learning machine for privacy: for example, partially-specified feature vectors can be pairwise correlated to yield a similarity or kernel function for the kernel learning machine (KLM). Thereafter, SVM or KRR supervised learning classifiers may be trained and deployed for prediction applications.

Machine Learning for Compressive Privacy

  • v. Nonlinear and Kernel Learning Machines
slide-110
SLIDE 110

Machine Learning System: Feature Engineering and Labeling Engineering

The key to success lies in an effective representation of the objects by a much-expanded vector space:

  • Kernel vector space
  • Multilayers of hidden nodes in deep learning

The manual feature-engineering step is very laborious: training data (e.g. human-labeled images) are used, often by trial and error, to craft features representative of the target (e.g. cars), with no learning at all in that step.

Kernel learning (aka shallow learning) vs. the dull nature of deep learning!!

slide-111
SLIDE 111

Deep back-propagation and kernel learning algorithms

Big data analysis usually involves nonlinear data analysis, for which kernel machine learning (KML) and deep machine learning (DML) represent the two most promising approaches. KML and DML differ in that KML is based on pairwise quantification of any pair of targeted objects, while DML extracts a feature vector for each targeted object using a cascade of feature-extraction layers. In this sense, the two learning approaches complement each other very well. This points to a potential integration of the KML and DML technologies.

slide-112
SLIDE 112

Nonlinear Supervised Learning Machine: Back-Propagation (BP) Neural Network (basic learning module in BP).

slide-113
SLIDE 113

BP was the key supervised learning network in the 80’s; today we have Deep Learning Networks, or Deep Learning Machines.

slide-114
SLIDE 114
slide-115
SLIDE 115
slide-116
SLIDE 116
slide-117
SLIDE 117
slide-118
SLIDE 118

Kernel Correlation Approach to Big Data Analysis

Data scientists advocate the use of correlation analysis for unraveling information hidden in big data [Schonberger12].

Kernel learning represents a major discipline for correlation analysis, where pairwise inter-sample correlations are represented by a kernel matrix [Kung2014].

slide-119
SLIDE 119
  • Curse of Dimensionality (Volume)
  • Curse of Large Data Size (Velocity)
  • Curse of High Feature Dimensionality

Data Matrix (N × O) vs. Kernel Matrix (O × O)

Kernel Approach?

  • Correlation Analysis

(Statistical Perspective)

  • Algorithmic Analysis

(Algebraic and Optimization Principles) Kernel vs. BDA

Covariance Matrix

Kernel Approach to (Big) Data Analysis
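The statistical (covariance) and pairwise (kernel) perspectives can be contrasted on a toy data matrix; the N × O layout (features × samples) and the linear kernel below are assumptions made for illustration.

```python
import numpy as np

# Data matrix Y: N features x O samples (assumed layout).
N, O = 4, 6
rng = np.random.default_rng(1)
Y = rng.normal(size=(N, O))
Yc = Y - Y.mean(axis=1, keepdims=True)   # center each feature across samples

S = (Yc @ Yc.T) / O      # covariance matrix, N x N  (statistical perspective)
K = Yc.T @ Yc            # linear kernel (Gram) matrix, O x O  (pairwise perspective)
```

The two matrices carry the same second-order information: O·S and K share their nonzero eigenvalues, which is the duality exploited when moving between PCA and kernel PCA.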

slide-120
SLIDE 120

The dimension J: training-data-independent intrinsic feature vectors 𝜚(𝒚).

slide-121
SLIDE 121

LSP (Learning Subspace Property) →

slide-122
SLIDE 122

Kernel Learning Machine

Original vector: 𝒚. Training samples: 𝒚1, …, 𝒚𝑂.

Kernel vector: 𝒍 = [𝑙(𝒚1, 𝒚), 𝑙(𝒚2, 𝒚), 𝑙(𝒚3, 𝒚), …, 𝑙(𝒚𝑂, 𝒚)]ᵀ; weighting vector: 𝒃 = [𝑏1, 𝑏2, 𝑏3, …, 𝑏𝑂]ᵀ; bias: 𝑐.

Discriminant: ℎ(𝒚) = Σ𝑜=1..𝑂 𝑏𝑜 𝑙(𝒚𝑜, 𝒚) + 𝑐

Decision: positive class if ℎ(𝒚) > 0; negative class if ℎ(𝒚) < 0.

LSP: holds for PCA and DCA.
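A minimal sketch of this empirical-space discriminant, using an RBF kernel and hand-picked (not learned) weights 𝒃 and bias 𝑐; all names and values here are illustrative.

```python
import numpy as np

def rbf(u, v, sigma=1.0):
    """Gaussian RBF kernel l(u, v)."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def h(y, Y_train, b, c, sigma=1.0):
    """Empirical-space discriminant: h(y) = sum_o b_o * l(y_o, y) + c."""
    l = np.array([rbf(y_o, y, sigma) for y_o in Y_train])  # kernel vector
    return float(b @ l + c)

Y_train = np.array([[0.0, 0.0],    # one sample per class
                    [1.0, 1.0]])
b = np.array([1.0, -1.0])          # illustrative weights, not learned here
c = 0.0

# A query near (0, 0) scores positive -> positive class.
label = 1 if h(np.array([0.1, 0.0]), Y_train, b, c) > 0 else -1
```

In practice 𝒃 and 𝑐 would come from a learner such as KRR or SVM; only the decision rule is shown here.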

slide-123
SLIDE 123

𝐾(𝒚1, 𝒚1) 𝐾(𝒚1, 𝒚2) ⋯ 𝐾(𝒚1, 𝒚𝑂)
𝐾(𝒚2, 𝒚1) 𝐾(𝒚2, 𝒚2) ⋯ 𝐾(𝒚2, 𝒚𝑂)
⋮ ⋮ ⋱ ⋮
𝐾(𝒚𝑂, 𝒚1) 𝐾(𝒚𝑂, 𝒚2) ⋯ 𝐾(𝒚𝑂, 𝒚𝑂)

Vectorization vs. Pairwise Kernel

  • Vectorization (e.g. for deep learning, with imputation of missing entries): each data item is individually quantified into a feature vector.
  • Kernelization via the kernel trick [Aizerman'64]: non-vectorial data are pairwise quantified by 𝐾(•, •) into a similarity (kernel) matrix 𝐾.

slide-124
SLIDE 124

Linear, Nonlinear, & Non-Vectorial Kernel Machines

Nonlinear Kernel Machines for vector data (from linear to nonlinear):

𝐾(𝒚1, 𝒚1) 𝐾(𝒚1, 𝒚2) ⋯ 𝐾(𝒚1, 𝒚𝑂)
𝐾(𝒚2, 𝒚1) 𝐾(𝒚2, 𝒚2) ⋯ 𝐾(𝒚2, 𝒚𝑂)
⋮ ⋮ ⋱ ⋮
𝐾(𝒚𝑂, 𝒚1) 𝐾(𝒚𝑂, 𝒚2) ⋯ 𝐾(𝒚𝑂, 𝒚𝑂)

Non-Vectorial Kernel Machines

Non-vectorial data are pairwise quantified, instead of individually quantified!!

Vector and non-vector correlation analysis

slide-125
SLIDE 125

PSI-BLAST Adjacency Matrix, Partial Correlation → Pairwise Quantified Kernel Matrix

slide-126
SLIDE 126

"Blind men and the elephant"

The "partial correlation" of the joint features points to the same (type of) animal:

𝐾(𝒙, 𝒚) ≡ ⟨𝒙_{𝐼xy}, 𝒚_{𝐼xy}⟩, where 𝐼xy denotes the index set of the features jointly observed in 𝒙 and 𝒚.

slide-127
SLIDE 127

The objective of machine learning is often the effective classification of new, previously unseen objects.

  • Kernel Ridge Regression (KRR):
  • tuning parameter (ρ)
  • Support Vector Machine (SVM):
  • tuning parameter (C)
  • Ridge-SVM
  • more tuning parameters

Machine Learning System: Feature Engineering, Labeling Engineering

Turing Machines ⇒ Tuning Machines
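For KRR in particular, the tuning parameter ρ enters through the familiar closed form 𝒃 = (𝑲 + ρ𝑰)⁻¹𝒕. The toy teacher values and linear kernel below are illustrative assumptions, not the slides' data.

```python
import numpy as np

def krr_fit(K, t, rho):
    """Kernel ridge regression weights: b = (K + rho*I)^{-1} t,
    where rho is the ridge tuning parameter from the slide."""
    O = K.shape[0]
    return np.linalg.solve(K + rho * np.eye(O), t)

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
t = np.sign(X[:, 0])            # toy teacher values (+/-1)
K = X @ X.T                     # linear kernel matrix
b = krr_fit(K, t, rho=0.1)
pred = np.sign(K @ b)           # in-sample predictions
```

Larger ρ smooths the solution (more regularization, less fitting), which is exactly the kind of knob the "Tuning Machines" quip is about.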

slide-128
SLIDE 128
  • Data Privacy

Comparison of DLP and KLP

  • Built upon fundamental foundations of Mathematics, Statistics, and Information Science.

  • Data Visualizability
  • Performance: Accuracy, Precision/Specificity/Sensitivity
  • Real-Time (Incremental) Learning
  • Real-Time Classification/Prediction
  • Computational Complexity
slide-129
SLIDE 129

SVM/KRR

Vectorial data → Deep Learning Features, or Kernel Features (No Learning): PCA, DCA, KPCA, KDCA, Feature Selection. Non-vectorial data → Pairwise Quantified.

System Integration

slide-130
SLIDE 130

A New Hope → Intrinsic Space of dimension O(J)

Original vector: 𝒚 = [𝑦1, …, 𝑦𝑁]ᵀ, mapped to the intrinsic vector [𝑨1(𝒚), 𝑨2(𝒚), 𝑨3(𝒚), …, 𝑨𝐾(𝒚)]ᵀ; decision vector 𝒗 = [𝑣1, …, 𝑣𝐾]ᵀ, threshold 𝑐.

Discriminant: ℎ(𝒚) = Σ𝑘=1..𝐾 𝑣𝑘 𝑨𝑘(𝒚) + 𝑐

Finite order means that the classification complexity can be independent of N.
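The finite-order claim can be checked concretely for the 2nd-order polynomial kernel 𝐾(𝒙, 𝒚) = (1 + 𝒙ᵀ𝒚)², whose intrinsic space in 2-D has only J = 6 basis functions; this particular kernel choice is an illustrative assumption.

```python
import numpy as np

def intrinsic_map(y):
    """Explicit intrinsic feature vector A(y) for the 2nd-order
    polynomial kernel K(x, y) = (1 + x.y)^2 in 2-D: a finite basis
    of J = 6 terms, independent of any training data."""
    y1, y2 = y
    return np.array([1.0,
                     np.sqrt(2) * y1, np.sqrt(2) * y2,
                     y1 ** 2, y2 ** 2,
                     np.sqrt(2) * y1 * y2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 0.25])
lhs = (1.0 + x @ y) ** 2                    # pairwise kernel evaluation
rhs = intrinsic_map(x) @ intrinsic_map(y)   # inner product in intrinsic space
```

Since the decision rule ℎ(𝒚) = 𝒗ᵀ𝑨(𝒚) + 𝑐 needs only the J intrinsic coefficients, its cost does not grow with the number of training samples.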
slide-131
SLIDE 131

Kernelized DCA (KDCA)

slide-132
SLIDE 132

Visualization via Linear DCA

slide-133
SLIDE 133

Derivation of KDCA in Intrinsic Space:

  • Data Matrix in Intrinsic Space
  • Projection Matrix in Intrinsic Space
  • KDCA Representation in Intrinsic Space
  • Within-Class Scatter Matrix in Intrinsic Space

slide-134
SLIDE 134

KDCA for Visualization and Classification of Incomplete Data

  • 1. Given N incomplete training data 𝒚1, 𝒚2, …, 𝒚𝑁, form the kernel matrices 𝑲, 𝑲̃, and 𝑲𝑊.
  • 2. Form the KDCA Discriminant Matrix: 𝑫KDCA = [𝑲𝑊 + ρ1𝑲]⁻¹[𝑲̃ + ρ2𝑲]
  • 3. Form 𝑨 from the first m principal eigenvectors of 𝑫KDCA (with scaling such that 𝑨ᵀ[𝑲𝑊 + ρ𝑲]𝑨 = 𝑰).

Projection: 𝒛 = [𝑧(1), 𝑧(2), …, 𝑧(𝑚)]ᵀ = 𝑨ᵀ𝒌(𝒙), where 𝒌(𝒙) = [𝐾(𝒙1, 𝒙), 𝐾(𝒙2, 𝒙), …, 𝐾(𝒙𝑁, 𝒙)]ᵀ.

Gaussian RBF Kernel Function: 𝐾(𝒙, 𝒚) = exp(−‖𝒙 − 𝒚‖² / 2σ²)
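A rough numerical sketch of the three KDCA steps, assuming the matrix forms 𝑫KDCA = [𝑲𝑊 + ρ1𝑲]⁻¹[𝑲̃ + ρ2𝑲] and the 𝑨ᵀ𝑴𝑨 = 𝑰 scaling as reconstructed from the slide; the toy SPD kernel matrices are fabricated for illustration, and this is not the paper's exact algorithm.

```python
import numpy as np

def kdca_projection(K, K_tilde, K_W, rho1, rho2, m):
    """Sketch of the KDCA recipe: build the discriminant matrix,
    keep its top-m eigenvectors, and scale each column a so that
    a^T (K_W + rho1*K) a = 1."""
    M = K_W + rho1 * K
    D = np.linalg.solve(M, K_tilde + rho2 * K)   # D_KDCA = M^{-1}(K~ + rho2*K)
    w, V = np.linalg.eig(D)                      # D need not be symmetric
    idx = np.argsort(-w.real)[:m]                # m principal eigenvectors
    A = V[:, idx].real
    scales = np.sqrt(np.einsum('ij,jk,ki->i', A.T, M, A))  # diag(A^T M A)
    return A / scales

rng = np.random.default_rng(3)
B = rng.normal(size=(6, 6))
K = B @ B.T + 6 * np.eye(6)        # toy SPD kernel matrix (N = 6 samples)
K_W = 0.5 * K + np.eye(6)          # toy within-class kernel matrix
K_tilde = K.copy()
A = kdca_projection(K, K_tilde, K_W, rho1=0.1, rho2=0.1, m=2)
```

New points would then be visualized via 𝒛 = 𝑨ᵀ𝒌(𝒙), reducing the N-dimensional empirical representation to m coordinates.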

slide-135
SLIDE 135

Visualization via (RBF) KDCA

slide-136
SLIDE 136

Original → Masked Data: how do we cope with masked data? It is not in vector form.

Partial Correlation → Pairwise Quantified Kernel Matrix

slide-137
SLIDE 137

"Blind men and the elephant"

The "partial correlation" of the joint features points to the same (type of) animal:

𝐾(𝒙, 𝒚) ≡ ⟨𝒙_{𝐼xy}, 𝒚_{𝐼xy}⟩, where 𝐼xy denotes the index set of the features jointly observed in 𝒙 and 𝒚.

slide-138
SLIDE 138

KDCA for Visualization and Classification of Incomplete Data

  • 1. Given N incomplete training data 𝒚1, 𝒚2, …, 𝒚𝑁, form the kernel matrices 𝑲, 𝑲̃, and 𝑲𝑊.
  • 2. Form the KDCA Discriminant Matrix: 𝑫KDCA = [𝑲𝑊 + ρ1𝑲]⁻¹[𝑲̃ + ρ2𝑲]
  • 3. Form 𝑨 from the first m principal eigenvectors of 𝑫KDCA (with scaling such that 𝑨ᵀ[𝑲𝑊 + ρ𝑲]𝑨 = 𝑰).

Projection: 𝒛 = [𝑧(1), 𝑧(2), …, 𝑧(𝑚)]ᵀ = 𝑨ᵀ𝒌(𝒙), where 𝒌(𝒙) = [𝐾(𝒙1, 𝒙), 𝐾(𝒙2, 𝒙), …, 𝐾(𝒙𝑁, 𝒙)]ᵀ.

Partial Correlation Kernel: 𝐾(𝒙, 𝒚) computed over the jointly observed feature set 𝐼xy.
slide-139
SLIDE 139

KAIDA (SVM), masking only, no dimension reduction (figure): accuracy vs. missing ratio (0%, 10%, 25%, 50%, 75%) for the non-imputed PC kernel and the imputed RBF kernel, on an accuracy axis from 86% to 100%; only about a 0.4% accuracy loss.
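One way to realize a partial-correlation (PC) kernel on masked data is to correlate only the jointly observed entries, with NaN marking a masked feature. The cosine-style normalization below is an assumption for illustration, not necessarily the slides' exact definition.

```python
import numpy as np

def partial_correlation_kernel(x, y):
    """Similarity computed over the jointly observed feature set I_xy
    only; masked entries are NaN. No imputation is performed."""
    joint = ~np.isnan(x) & ~np.isnan(y)       # index set I_xy
    if not joint.any():
        return 0.0                            # no shared features
    xj, yj = x[joint], y[joint]
    denom = np.linalg.norm(xj) * np.linalg.norm(yj)
    return float(xj @ yj / denom) if denom > 0 else 0.0

x = np.array([1.0, np.nan, 2.0, 0.0])
y = np.array([2.0, 1.0, np.nan, 0.0])
k = partial_correlation_kernel(x, y)   # uses features 0 and 3 only
```

Applying this to every pair of masked samples yields the pairwise quantified kernel matrix that KDCA then operates on, with no imputation step.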

slide-140
SLIDE 140

Original

Best of both worlds: Compression and Masking. DCA is good for compressed data.

Compressed Data

… in vector form.

Masked Data

How do we cope with masked and dimension-reduced data? …not in vector form.

Kernel DCA (KDCA)

slide-141
SLIDE 141

Imputed RBF vs PC Kernels: 40% Data Missing

Scatter plots (figure): RBF-KDCA components 2–3 (DMPC KDCA, imputed RBF) vs. PC-KDCA components 2–3; DCA after data laundry vs. PCA before data laundry: RBF-KPCA components 1–2 (imputed RBF) vs. PC-KPCA components 1–2.

slide-142
SLIDE 142
Some Parting Thoughts on Internet Privacy

  • DCA/KDCA: shields for privacy protection
  • DUCA-CP for joint optimization
  • Needs good foundations in Mathematics, Statistics, and Information Science
  • Thank you. Questions?