Machine Learning for Compressive Privacy

Professor S.Y. Kung
Princeton University
Email: kung@princeton.edu

References: S.Y. Kung, "Kernel Methods and Machine Learning," Cambridge University Press (CUP), 2014.

Supplementary Materials:
In the internet era, we experience a phenomenon of "digital everything". Due to its quantitative (volume and velocity) and qualitative (variety) challenges, it is imperative to address various computational aspects of big data. For big data, the curse of high feature dimensionality is causing grave concerns about computational complexity and over-training.
Old Moore's Law will soon lose ground to New Moore's Law: massive data is readily captured in digital format.
- Device Moore's Law: capacity doubles every 2 years.
- Data Moore's Law: data volume doubles every 1.5 years. [Schonberger12]
Quantitatively:
- 100 hours of video uploaded to YouTube every minute!
- Facebook: in 2012, Facebook had over one billion users and, more importantly, over 100 billion social links. Every day, more than 250 million new photos are added to Facebook.

Volume/Velocity: Curse of Dimensionality in Big Data
- big = extremely high feature dimensionality (M ≫ N): performance downgrades.
- big = extremely large-scale data size (N ≫ M): costly hardware, long computing time, and high processing power.
“The era of trillions of sensors is upon us...."
Distributions of Learned Threshold: From Small to Large N (N = sample number per class)

Microarray Data Matrix or Document Data Matrix (M features × N samples)
Qualitatively speaking, big data will incur a paradigm shift towards the development of "error-tolerant" machine learning algorithms, since the data come from many divergent sources - sometimes termed "megasourcing" [in "The Big Data Paradox", David Auerbach]. Parts of the data may be vulnerable, missing, or deliberately hidden from public access. It is high time to seriously consider a major paradigm shift, both quantitative and qualitative, in learning machines for processing massive and messy (m&m) big data with coordinated hardware/software codesign.
Qualitatively:
- Big Data Variety: vectorial or non-vectorial data
- Big Data Veracity: uncertain or imprecise data

"Blind men and the elephant": Big Data is to users as the elephant is to the blind men - each sees only a part, quantitatively and qualitatively.
Google Flu Trends detected a significant increase in H1N1 flu activity two weeks earlier than the CDC - using big data learning and cloud computing.

A lot of hard work behind the success:

[Figure: CDC flu percentage, Jan 28, 2008, Mid-Atlantic Region]

45 keywords were chosen from 5M candidate keywords; billions of models were tried.
Objective of Machine Learning
The main function of machine learning is to convert the wealth of training data into useful knowledge by learning: data ≠ knowledge.

Machine learning is a research field that studies theories and technologies for adapting a system model to a training dataset, so that the learned model can generalize and provide a correct classification or useful guidance even for inputs the system has never seen before.
Machine Learning System: Feature Engineering and Labeling Engineering
Much of the success of machine learning lies in an effective representation of the objects. Learning models such as deep learning and kernel machines can derive a much expanded vector space from massive and messy big data, which may then be further extracted for successful and practical applications.

The secret of success of a machine learning system lies in finding an effective representation for the objects. In a basic representation, an object is represented as a feature vector in a finite-dimensional vector space. Two common strategies:
- dimension reduction: x ⇒ z
- kernel-induced mapping: x ⇒ φ(x)
Machine Learning for Compressive Privacy
We shall explore various projection methods, e.g. PCA, for dimension reduction - a prelude to visualization and privacy preservation of big data. An important development is Discriminant Component Analysis (DCA), which offers a compression scheme to enhance privacy protection in contextual and collaborative learning environments.
An implicit principle behind big data analysis is to make use of ALL available data, whether messy, imprecise, or incomplete (i.e. the veracity). Massive data are being rapidly captured in digital format, including digital books/voice/images/video/commerce. They come from divergent types of sources, from physical (sensor/IoT) to social and cyber (web) types. Note that visualization tools are meant to supplement (instead of replace) the domain expertise (e.g. a cardiologist) and provide a big picture to help users formulate critical questions and subsequently postulate heuristic and insightful answers.
Important Research Areas on Big Data By CCF vote, 55% goes to Visualization
Visualization HPC System Integration Revolutionary Algorithms
Visualization of (High-Dimensional) Big Data

Visualization provides a big picture to help users formulate critical questions and postulate heuristic answers. It offers an effective presentation of massive and messy big data and supplements (instead of replaces) the domain expertise (e.g. a cardiologist).

PCA/DCA serves as a good example/evidence!
Visualization methods:
- Unsupervised: Multi-Dimensional Scaling (MDS), Deep Learning Auto-Encoder, Tree Clustering, K-means/Kernel K-means, SOM/Kernel SOM
- Supervised: discriminant projections (e.g. DCA)
Tools: Vectorization and Dimension Reduction
PCA for Subspace Projection & Dimension Reduction

PCA aims at the best recoverability of the original data in the Original Vector Space. The signal subspace components of PCA are associated with the power, while the noise subspace components of PCA are tightly coupled with the recoverability.

- Max Signal Subspace: maximize power.
- Min Noise Subspace: minimize RE (mean-square reconstruction error) - to best show the original image.
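A minimal numpy sketch of this power/RE duality (illustrative data and variable names, not from the slides): the power captured by the principal subspace and the mean-square reconstruction error always sum to the total variance, so maximizing one minimizes the other.

```python
# Sketch (not from the slides): PCA's two equivalent criteria on a toy
# dataset -- maximizing projected power equals minimizing the mean-square
# reconstruction error (RE). Variable names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])  # N x M
Xc = X - X.mean(axis=0)                      # center the data
S = Xc.T @ Xc / len(Xc)                      # covariance matrix (M x M)

evals, evecs = np.linalg.eigh(S)             # ascending eigenvalues
W = evecs[:, -2:]                            # top-2 principal directions

power = np.trace(W.T @ S @ W)                # projected signal power
Xhat = Xc @ W @ W.T                          # reconstruction x_hat = W W^T x
re = np.mean(np.sum((Xc - Xhat) ** 2, axis=1))

# Power captured + RE = total variance: the two criteria are duals.
assert np.isclose(power + re, np.trace(S))
```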
Visualization for Unsupervised Learning Dataset

Projection onto the major (signal) subspace (2-D example, m = 1):

y = Wx = [v11 v21] x = vᵀx

The original input vector can be reconstructed from y as:

x̂ = v y = v vᵀ x ≡ P x

where P = v vᵀ projects onto the major subspace; the residual error x − x̂ lies in the minor subspace.
Trace Norm Criterion: M-Dimensional Case

Useful for proof:

W_PCA = argmax Σ_{i=1}^{m} w_iᵀ S w_i = argmax trace(W S Wᵀ), subject to W Wᵀ = I, W ∈ R^{m×M}.
Visualization for Supervised Learning Dataset

- Signal Subspace in CVS: enables classification. DD: "Discriminant Distance" in CVS. Maximize power.
- Noise Subspace in EVS: show or hide the original image. Minimize RE (mean-square-error).

Scatter (Covariance) Matrix: S = Σ_{i=1}^{N} (x_i − μ)(x_i − μ)ᵀ

For a supervised (i.e. labeled) dataset with L classes:

S = S_B + S_W = Signal Matrix + Noise Matrix

- Signal Matrix = S_B = Between-Class Matrix (beneficial)
- Noise Matrix = S_W = Within-Class Matrix (detrimental)
Linear Discriminant Analysis (LDA) maximizes the signal-to-noise ratio:

SNR = s/n = (wᵀ S_B w) / (wᵀ S_W w)

Summing over m components gives the Discriminant Distance (DD):

DD ≡ SumSNR = Σ_{i=1}^{m} s_i/n_i = Σ_{i=1}^{m} (w_iᵀ S_B w_i) / (w_iᵀ S_W w_i)

and the Discriminant Power (DP):

DP ≡ Σ_{i=1}^{m} p_i = Σ_{i=1}^{m} (s_i + n_i)/n_i = DD + m

Discriminant Distance (DD) ⇔ Discriminant Power (DP): enhance the (projected) S_B and suppress the (projected) S_W.

This motivates discriminant component analysis in the Canonical Vector Space (CVS).
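A small numpy sketch of these definitions (toy data; the names S_B and S_W follow the slides, everything else is assumed): LDA's optimal direction solves the generalized eigenproblem S_B w = λ S_W w, and the attained SNR is the top generalized eigenvalue.

```python
# Sketch (illustrative, not the authors' code): build the between-class
# scatter S_B and within-class scatter S_W for a labeled toy set, then
# maximize the SNR (w^T S_B w)/(w^T S_W w) via the generalized
# eigenproblem S_B w = lambda * S_W w, as in LDA.
import numpy as np

rng = np.random.default_rng(1)
X0 = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))   # class 0
X1 = rng.normal(loc=[2, 1], scale=0.5, size=(50, 2))   # class 1
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
mu = np.vstack([X0, X1]).mean(axis=0)

# Within- and between-class scatter matrices
SW = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
SB = 50 * np.outer(mu0 - mu, mu0 - mu) + 50 * np.outer(mu1 - mu, mu1 - mu)

# Generalized eigenproblem via S_W^{-1} S_B (fine for a small, well-posed S_W)
evals, evecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
w = np.real(evecs[:, np.argmax(np.real(evals))])

snr = (w @ SB @ w) / (w @ SW @ w)            # the maximized signal-to-noise ratio
assert np.isclose(snr, np.max(np.real(evals)))
```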
Rotational Invariance of DD/DP

- Under the canonical normality constraint Wᵀ S_W W = I: rotational invariant.
- Under the orthonormality constraint W Wᵀ = I: not rotational invariant.
DCA Optimization Criterion: Discriminant Power (DP)
⇒ Orthonormal Bases ⇒ Rotational Invariance (Visualization)

In the whitened (canonical) space, S̃_W = I, so

DD ≡ SumSNR = Σ_{i=1}^{m} (w̃_iᵀ S̃_B w̃_i)/(w̃_iᵀ S̃_W w̃_i) = Σ_{i=1}^{m} w̃_iᵀ S̃_B w̃_i

Class-ceiling property: S̃_B has rank at most L − 1, often allowing a 10:1 compression ratio.

Likewise, with w̃_iᵀ w̃_i = 1,

DP' = Σ_{i=1}^{m} (w̃_iᵀ S̃ w̃_i)/(w̃_iᵀ S̃_W w̃_i) = Σ_{i=1}^{m} w̃_iᵀ S̃ w̃_i = trace(W̃ᵀ S̃ W̃)

DCA ⇒ PCA: DCA and PCA are equivalent in CVS!
Preservation of DP under Coordinate Transformation

Algebraically,

DP = Σ_{i=1}^{m} p_i = Σ_{i=1}^{m} (w_iᵀ S w_i) / (w_iᵀ S_W w_i)

is preserved under any coordinate transformation.

Data Laundry ⇒ Canonical Vector Space (CVS): the canonical vector space is defined as the whitened space (whitening via S_W), so DCA amounts to PCA in the whitened space. (Example: L = 3 classes, L − 1 = 2 discriminant components.)

Pictorially: a forward mapping (whitening) into CVS, followed by a re-mapping back.
Forward Mapping (PCA in CVS): find the first m principal eigenvectors under the "canonical normality" constraint.

Direct EVS Method: Trace Norm Optimizer on the Discriminant Matrix, with Backward Mapping for i = 1, 2, …, m.

Principal Eigenvalues for the Signal Subspace.
DCA maximizes RDP and optimizes (minimizes/maximizes) RE for recoverability or anti-recoverability.
(Cost, Power, Storage, Communication)
(Classification Accuracy in Prediction Phase)
With the data laundry process, DCA (i.e. supervised-PCA) far outperforms PCA.
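The data-laundry pipeline can be sketched as follows (an illustrative implementation under assumed variable names, not the authors' code): whiten with S_W^{-1/2} to reach the canonical space, run plain PCA on the whitened between-class scatter, and map the directions back.

```python
# Sketch of the "data laundry" idea: whiten the data with S_W^{-1/2} to
# reach the canonical vector space (CVS), then run plain PCA on S_B there,
# i.e. DCA viewed as a supervised PCA.
import numpy as np

def dca_directions(X, y, m):
    """Return m discriminant directions: PCA on S_B in the whitened space."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    SW = sum((X[y == c] - X[y == c].mean(axis=0)).T
             @ (X[y == c] - X[y == c].mean(axis=0)) for c in classes)
    SB = sum((y == c).sum() * np.outer(X[y == c].mean(axis=0) - mu,
                                       X[y == c].mean(axis=0) - mu)
             for c in classes)
    # Whitening transform S_W^{-1/2} (the "data laundry")
    d, V = np.linalg.eigh(SW)
    Wh = V @ np.diag(1.0 / np.sqrt(d)) @ V.T
    SB_t = Wh @ SB @ Wh                      # S_B in the canonical space
    evals, evecs = np.linalg.eigh(SB_t)
    U = evecs[:, ::-1][:, :m]                # top-m "PCA" directions in CVS
    return Wh @ U                            # map back to the original space

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0, 0], 0.4, (40, 3)),
               rng.normal([2, 0, 1], 0.4, (40, 3)),
               rng.normal([0, 2, 1], 0.4, (40, 3))])
y = np.repeat([0, 1, 2], 40)
W = dca_directions(X, y, m=2)                # L = 3 classes -> L-1 = 2 components
assert W.shape == (3, 2)
```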
Visualization: DCA vs. PCA

[Figure: PCA vs. DCA scatter plots, m = 5, 200 runs, HAR classifier (lab2); DCA yields a >10% accuracy gain and long inter-group distances.]
In the internet era, we benefit greatly from the combination of packet switching, bandwidth, processing, and storage capacities in the cloud. However, "big data" often has a connotation of "big brother": the data being collected on consumers like us is growing exponentially, and attacks on our privacy are becoming a real threat. We therefore need privacy protection when we upload personal data to the cloud. An important development is Discriminant Component Analysis (DCA), which offers a compression scheme to enhance privacy protection in contextual and collaborative learning environments. DCA can be viewed as a supervised PCA which can simultaneously rank-order (1) the sensitive components and (2) the desensitized components.
Machine Learning for Compressive Privacy
Pumped by wireless, internet, and parallel processing technologies, cloud computing offers remotely hosted application logic units, data stores, and a diversity of application resources - anytime, anywhere, and for anyone.

Cloud Computing: Data Center

A data center houses vast storage space and is ready to lease out, on demand from clients, a variable number of machines. It has the promise of elastic hosting, offering application domains for lease to clients.

The main problems of cloud computing lie in the communication cost and privacy protection. Big data nowadays has a connotation of Big Brother.
Colorful water pipes cool one of Google’s data centers. The company sells user data to advertisers for targeted marketing
With rapidly growing internet commerce, much of our daily activities are moving online; an abundance of personal information (such as sales transactions) is being collected, stored, and circulated around the internet and cloud servers, often without the owner's knowledge.

Machine Learning Approach to Compressive Privacy

This raises an imminent concern about the protection and safety of sensitive and private data, i.e. "Online Privacy", also known as "Internet Privacy" or "Cloud Privacy". This course presents some machine learning methods useful for internet privacy-preserving data mining, which has recently received a great deal of attention in the IT community.
When the control of data protection is left entirely to the cloud server, data privacy unfortunately becomes vulnerable to hacker attacks or unauthorized leakage. It is therefore much safer to keep the control of data protection solely in the hands of the data owner and not take chances with the cloud. Why the DARPA Brandeis Program?
public sphere (space) private sphere (space)
The analysis of PP technology requires a clear separation of two spaces: public vs. private spaces.
Public Space (Cloud) Private Space (Client)
Theme of the Brandeis Program: control of data protection should be returned to the data owner rather than left at the mercy of the cloud server.
To ensure protection via encryption, during the processing in the ``public sphere", the input data of each owner and any intermediate results will only be revealed/decrypted to the trusted authority.
Nevertheless, there exists a substantial chance of hacker attacks or unauthorized leakage when data protection is left entirely in the hands (at the mercy) of the cloud server.
Data owner should have control over data
From the data privacy perspective, the accessibility of data is divided into two separate spheres: (1) the private sphere, where data owners generate and process decrypted data; and (2) the public sphere, where cloud servers can generally access only encrypted data, with the exception that only the trusted authority may access decrypted data confidentially. When the control of data protection is left entirely to the cloud server, however, the data become vulnerable to hacker attacks or unauthorized leakage. It is safer to let the data owner control data privacy and not take chances with the cloud servers. To this end, we must design privacy-preserving information systems so that the shared/pushed data are only useful for the intended utility and not easily diverted for malicious privacy intrusion.
Build information systems so that the shared data can be useful only for the intended utility (e.g. classification) and not easily diverted for unintended purposes (e.g. privacy intrusion).
DARPA Brandeis Program
US$60M/4.5 Yrs.
Technical areas: TA1: Privacy Technologies; TA3: Human Data Interaction; TA4: Measuring Privacy; Experimental Prototype Systems. Performers include Raytheon BBN, Invincea, CMU, Tel Aviv U, Cybernetica, U of Tartu, UC Berkeley, MIT, Cornell, U MD, Iowa State U, Princeton, and Stealth Software Technologies.
Mobile Information Flow
Control Center makes request for images near an incident. User’s phone responds with some incident-relevant images per their privacy policy
User → Application → Privacy Policy (PE, Android). Example: location = utility; face = privacy.
"The Android platform provides several sensors that let you monitor the motion of a device" (e.g. accelerometer and gyroscope; most devices have both), and "three of these sensors can be either hardware-based or software-based (the gravity, linear acceleration, and rotation vector sensors)."

Adapted from C.C. Liu, et al.: D (data) = Activity (B/L), Location (W/L), Tabs (B/L); app (M) = speech, motion, ID, password; DP ⇒ private.
We need to develop new methods to jointly optimize two design considerations P & U: Privacy Protection & Utility Maximization .
Objective: explore information systems that simultaneously perform
- Utility Space Maximization: deliver intended data mining, classification, and learning tasks.
- Privacy Space Minimization: safeguard personal/private information.
CP involves joint optimization over three design spaces: (i) Feature Space (ii) Utility Subspace; and (iii) Cost Subspace (i.e. Privacy Subspace).
Tools: Compression (PCA), Random Noise, Random Transform.
USC’s Pickle: A collaborative learning model built for MFCC speaker recognition - widely used in acoustic mobile applications - to enable speaker recognition without revealing the speech content itself.
public data private data
Collaborative Learning for PP
Example of CUEP: it enables classification of face/speech data (utility) while protecting the privacy (e.g. the face image or speech content) from malicious cloud users.
Masked Data
Classification Personality Privacy
For example, in the US, networks of systems streaming data from surveillance cameras (and, in the future, mobile cameras) can very well reveal where and when your car travels [EFF].
Discriminant Component Analysis (DCA) offers a compression scheme to enhance privacy protection in contextual and collaborative learning environments. DCA has (a) its classification goal characterized by the discriminant distance and (b) its privacy components controlled by a ridge parameter. Therefore, DCA is a promising algorithmic tool for CP.
Discriminant Component Analysis (DCA)
[Figure: eigenfaces with ρ' = 0.00001 ≈ 0 - rank-order S_U and N_U; rank-order S_P and N_P. This is not good enough for the privacy goal.]

Noise-Subspace Eigenfaces, ρ' = 0.00001

ρ: ridge parameter for the noise matrix; ρ': ridge parameter for the signal matrix
Regulated Discriminant Power (RDP): maximize RDP under the canonical normality constraint - maximizing RE for anti-reconstruction while maximizing component power for utility maximization.

Simultaneously rank-order two subspaces:
- Principal Eigenvalues for the Private Subspace
- Minor Eigenvalues for the Privatized Subspace

(Utility example: classifying whether a face wears eyeglasses or not.)
[Figure: Private Eigenfaces (ρ' = −0.05) vs. PU (Privatized/Utilizable) Eigenfaces (ρ' = +0.00001), with and without rank-ordering.]
We shall explore joint optimization over three design spaces: (a) Feature Space, (b) Classification Space, and (c) Privacy Space. This prompts a new paradigm called DUCA, which explores information systems that simultaneously perform Utility Space Maximization (deliver intended data mining, classification, and learning tasks) and Privacy Space Minimization (safeguard personal/private information).
Machine Learning for Compressive Privacy
DUCA ≠ DCA
Our (CP) approach involves joint optimization over three design spaces: (i) Feature Space (ii) Utility Subspace; and (iii) Cost Subspace (i.e. Privacy Subspace).
Entropy and Venn Diagram
Differential Privacy
CS Theoretical Foundations
Entropy and Covariance Matrix
Double Income Problem (DIP) Privacy Utility
Compressive Privacy
Estimation Theory and Compressive Privacy: argmin_w ( ||Xᵀw − y||² + ρ||w||² ) - Gauss-Markov (Statistical Estimation) Theorem
Machine Learning and Compressive Privacy
For CP problems, there are two types of teacher values in the dataset: one for the utility labels and one for the privacy labels. Accordingly, there are two types of between-class scatter matrices, denoted by S_B^U for the utility labels and S_B^P for the privacy labels.
DUCA for Supervised Machine Learning
Again, the optimal queries can be directly derived from the principal eigenvectors of the discriminant matrix, where L and C denote the numbers of utility and privacy labels. Note that there are only L + C − 2 meaningful eigenvectors, because the utility and privacy between-class scatter matrices have ranks at most L − 1 and C − 1, respectively.

DUCA is a generalization of its predecessor, DCA, which was designed for utility-only machine learning applications. Naturally, just like DCA, there are ways for DUCA to extract additional meaningful queries.
where ``H/M/L" denotes the three (High/Middle/Low) utility classes (i.e. family income) and ``+/-" denotes the two privacy classes (i.e. who-earns-more between the couple).
with the utility/ privacy class labels
UAM and CAM can be learned from the given dataset and their respective class labels.
[Example queries g_1 and g_2, with entries such as 0.68 and 0.72.]
Parallel Extraction of Multiple Queries: thereafter, the two principal generalized eigenvectors can be computed from the utility and privacy scatter matrices.
The two-dimensional visualization of the DUCA-CP subspace, in which the family income classes (H/M/L) are highly separable but not so for the income disparity categories (+/-). For example, the query here obviously belongs to the middle-class (utility) but its disparity (privacy) remains unknown, i.e. protected.
Four Categories of PPUM Formulations
(C: classification, E: estimation; U: utility, P: privacy)

- CU: classification utility
- EU: estimation utility
- CP: classification privacy
- EP: estimation privacy

If the privacy function is prescribed by a matrix P, the privacy scatter matrix is simply replaced accordingly; likewise if the utility function is prescribed by a matrix P.
Three reconstruction images are based on 399 selected FFT components by different filters: (a) rectangular window (no-learning); (b) unsupervised filtering based on the component's variances; and (c) supervised filtering based on component's FDRs. (Courtesy from A. Filipowicz.)
Classification accuracies: (a) 73%; (b) 73.9%; (c) 73%.
[Figure: Olivetti faces, full dimension - utility = identification (phase/magnitude), privacy = reconstruction; original vs. reconstruction.]
Phase information proves vital for reconstruction, yet it has little, if not negative, impact on classification.
[Figure: reconstruction with vs. without phase; classification accuracies 0.9700 vs. 0.9725.]
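The phase-vs-magnitude point can be checked with a few lines of numpy (a 1-D stand-in for the face images; data and variable names are illustrative): reconstruction from the full FFT is exact, while discarding phase wrecks it.

```python
# Sketch (assumed 1-D setup): reconstruct a signal from its full FFT vs.
# from magnitude only (phase discarded), illustrating that phase is vital
# for reconstruction.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=256)                     # stand-in for a face image row
F = np.fft.fft(x)

full = np.fft.ifft(F).real                   # magnitude + phase
mag_only = np.fft.ifft(np.abs(F)).real       # phase zeroed out

err_full = np.linalg.norm(x - full)
err_mag = np.linalg.norm(x - mag_only)
assert err_full < 1e-8 and err_mag > 1.0     # losing phase wrecks reconstruction
```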
Big data analysis usually involves nonlinear data analysis, for which the kernel learning machine (KLM) and the deep learning machine (DLM) are the two most promising approaches. The safest possible protection is to withhold the privy data from sharing in the first place. This scheme, however, presents a formidable challenge in developing machine learning tools for incomplete data analysis (IDA). Fortunately, KLM can naturally be extended to all types of non-vectorial data analysis, including IDA. Moreover, KLM can facilitate:
- Intrinsic space and privacy: reduce the number of training vectors that need to be stored in the cloud (SVM), or even make it unnecessary to share any training data via the intrinsic kernel approach.
- Auto-encoder for privacy: compare two nonlinear auto-encoders (KLM and DLM) as data minimizers.
- Kernel learning machine for privacy: for example, partially-specified feature vectors can be pairwise correlated to yield a similarity or kernel function for KLM. Thereafter, SVM or KRR supervised learning classifiers may be trained and deployed for prediction applications.
Machine Learning for Compressive Privacy
The key to success lies in an effective representation of the objects by a much expanded vector space:
Feature Engineering and Labeling Engineering
The feature engineering step is very laborious. Training data (e.g. human-labeled images) are used, often by trial and error, to find features representative of the target (e.g. cars). There is no learning at all in this feature engineering step.
Kernel learning (aka shallow learning) vs. the dull nature of deep learning!!
Big data analysis usually involves nonlinear data analysis, for which kernel machine learning (KML) and deep machine learning (DML) represent the two most promising approaches. KML and DML differ in that KML is based on pairwise quantification of any pair of targeted objects, while DML extracts a feature vector for each targeted object using a cascade of feature-extraction layers. In this sense, the two learning approaches complement each other very well, pointing to a potential integration of the KML and DML technologies.
Deep back-propagation and kernel learning algorithms
Nonlinear Supervised Learning Machine: the Back Propagation (BP) Neural Network and its basic learning module.

BP was the key supervised learning network in the 80's; its deep extensions are today's Deep Learning Networks or Deep Learning Machines.
Data scientists advocate the use of correlation analysis for unraveling information hidden in big data. [Schonberger12].
Kernel Correlation Approach to Big Data Analysis
Kernel learning represents a major discipline for correlation analysis, where pairwise inter-sample correlations are represented by a kernel matrix. [Kung2014]
Data Matrix (M × N) → Kernel Matrix (N × N) or Covariance Matrix (M × M)
- Covariance Matrix: statistical perspective
- Kernel Matrix: algebraic and optimization principles
Kernel approach vs. BDA: which to use depends on whether N or M dominates.
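A quick numpy check of this N-vs-M trade-off (illustrative sizes and names): the N × N linear kernel matrix XᵀX and the M × M scatter matrix XXᵀ share the same nonzero eigenvalues, so one may always work in the smaller of the two spaces.

```python
# Sketch: from an M x N data matrix, form the N x N (linear) kernel matrix
# and the M x M scatter/covariance-style matrix; their nonzero eigenvalues
# match, so work in whichever of N or M is smaller.
import numpy as np

rng = np.random.default_rng(4)
M, N = 100, 8                                # many features, few samples
X = rng.normal(size=(M, N))                  # data matrix (columns = samples)

K = X.T @ X                                  # kernel matrix, N x N
C = X @ X.T                                  # scatter/covariance form, M x M

ek = np.sort(np.linalg.eigvalsh(K))[::-1]
ec = np.sort(np.linalg.eigvalsh(C))[::-1]
assert np.allclose(ek[:N], ec[:N])           # shared nonzero spectrum
```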
Kernel Approach to (Big) Data Analysis
The dimension J of the intrinsic space yields training-data-independent intrinsic feature vectors (LSP → Learning Subspace Property).

Kernel Learning Machine (empirical space): given training samples y_1, …, y_N, form the kernel vector k = [K(y_1, y), …, K(y_N, y)]ᵀ and, with weighting vector b and bias c, the discriminant function

h(y) = Σ_{n=1}^{N} b_n K(y_n, y) + c

Decision rule: positive class if h(y) > 0, negative class if h(y) < 0.
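A minimal sketch of training such an empirical-space discriminant with kernel ridge regression, b = (K + ρI)⁻¹e (KRR is one of the classifiers named on these slides; the toy data, σ, and ρ values are assumptions):

```python
# Sketch: train the empirical-space discriminant h(y) = sum_n b_n K(y_n, y) + c
# by kernel ridge regression, b = (K + rho*I)^{-1} e. The teacher vector e
# and ridge rho follow common KRR usage, not the slides' exact notation.
import numpy as np

def rbf(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between row-sample sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(5)
Y = np.vstack([rng.normal(-1, 0.3, (30, 2)), rng.normal(+1, 0.3, (30, 2))])
e = np.repeat([-1.0, +1.0], 30)              # teacher values (+/- classes)

K = rbf(Y, Y)
rho = 0.1                                    # ridge parameter
b = np.linalg.solve(K + rho * np.eye(len(Y)), e)
c = 0.0                                      # bias omitted in this sketch

h = rbf(np.array([[0.9, 1.1], [-1.0, -0.8]]), Y) @ b + c
assert h[0] > 0 and h[1] < 0                 # sign of h gives the class
```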
The LSP holds for PCA and DCA.
K = [ K(y_i, y_j) ],  i, j = 1, …, N
Two routes from non-vectorial data to learning machines:
- Vectorization (e.g. imputation, deep learning): each datum is individually quantified into a feature vector.
- Kernelization (kernel trick [Aizerman'64]): non-vectorial data are pairwise quantified by K(·,·) into a similarity (kernel) matrix K.
Linear, Nonlinear, & Non-Vectorial Kernel Machines
Nonlinear Kernel Machines for vector data (linear → nonlinear):
K = [ K(y_i, y_j) ],  i, j = 1, …, N
Non-Vectorial Kernel Machines
Non-vectorial data are pairwise quantified, instead of individually quantified!!
Vector and non-vector correlation analysis
PSI-BLAST Adjacency Matrix
Partial Correlation
Pairwise Quantified Kernel Matrix
"Blind men and the elephant”
The “partial correlation”
points to the same (type of) animal.
Partial Correlation Kernel: K(x, y) ≡ x_{I_xy}ᵀ y_{I_xy}, where I_xy denotes the index set of the joint (jointly observed) features of x and y.
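One plausible implementation of such a partial-correlation kernel for masked vectors (NaN marks a hidden feature; the cosine-style normalization here is an assumption, not necessarily the slides' exact formula):

```python
# Sketch of a partial-correlation-style kernel for masked vectors: correlate
# only over the jointly observed features I_xy. The normalization is one
# reasonable variant, labeled as an assumption.
import numpy as np

def partial_corr_kernel(x, y):
    """Normalized inner product restricted to jointly observed (non-NaN) entries."""
    joint = ~np.isnan(x) & ~np.isnan(y)      # index set I_xy
    if not joint.any():
        return 0.0
    xs, ys = x[joint], y[joint]
    denom = np.linalg.norm(xs) * np.linalg.norm(ys)
    return float(xs @ ys / denom) if denom > 0 else 0.0

a = np.array([1.0, np.nan, 2.0, 0.0])
b = np.array([1.0, 3.0, 2.0, np.nan])        # joint features: indices 0 and 2
assert np.isclose(partial_corr_kernel(a, b), 1.0)
```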
The objective of machine learning is often an effective classification of objects, new and unseen before.
Machine Learning System: Feature Engineering and Labeling Engineering
Turing Machines ⇒ Tuning Machines
Comparison of DLP and KLP
Mathematics, Statistics, Information Science.
- Vectorial data: Deep Learning Features, or Kernel Features (No Learning) → PCA, DCA → SVM/KRR
- Non-vectorial data: Pairwise Quantified → KPCA, KDCA → Feature Selection → SVM/KRR
A New Hope → the Intrinsic Space, of finite dimension O(J)

The original vector y = [y_1, …, y_N]ᵀ is mapped to the intrinsic vector [A_1(y), …, A_K(y)]ᵀ; with decision vector v and threshold c, the discriminant function is

h(y) = Σ_{k=1}^{K} v_k A_k(y) + c

A finite order means that the classification complexity can be independent of the number of training samples.
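The finite-order point can be made concrete with a degree-2 polynomial kernel (a hypothetical example; the slides' A_k(y) are generic intrinsic basis functions): its intrinsic space has J = 6 dimensions for 2-D inputs, so prediction cost is O(J), independent of the training set size.

```python
# Sketch: explicit intrinsic map for the degree-2 polynomial kernel
# K(x, y) = (1 + x.T y)^2 with 2-D inputs. The intrinsic dimension is a
# finite J = 6, independent of the number of training samples.
import numpy as np

def intrinsic_phi(y):
    """Intrinsic feature vector so that phi(x).phi(y) = (1 + x.T y)^2."""
    y1, y2 = y
    s = np.sqrt(2.0)
    return np.array([1.0, s * y1, s * y2, y1 * y1, y2 * y2, s * y1 * y2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 0.3])
k_direct = (1.0 + x @ y) ** 2                # kernel trick
k_intrinsic = intrinsic_phi(x) @ intrinsic_phi(y)
assert np.isclose(k_direct, k_intrinsic)     # J = 6 intrinsic features suffice
```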
Kernelized DCA (KDCA)
Derivation of KDCA in Intrinsic Space Data Matrix in Intrinsic Space Projection Matrix in Intrinsic Space KDCA representation in Intrinsic Space Within-Class Scatter Matrix in Intrinsic Space
KDCA for Visualization and Classification of Incomplete Data
Derivation sketch:

D_KDCA = [K_W + ρ1 K + ρ2 K̃]⁻¹ K

KDCA representation of a new sample x in the intrinsic space:

z = [z(1), z(2), …, z(m)]ᵀ = Aᵀ k(x),  k(x) = [K(x_1, x), K(x_2, x), …, K(x_N, x)]ᵀ

with scaling such that Aᵀ [K_W + ρK] A = I.
Gaussian RBF Kernel Function: K(x, y) = exp(−‖x − y‖² / (2σ²))
Masked Data
How do we cope with masked data? …not in vector form.
Pairwise Quantified Kernel Matrix
Partial Correlation
"Blind men and the elephant”
The “partial correlation”
points to the same (type of) animal.
Partial Correlation Kernel: K(x, y) ≡ x_{I_xy}ᵀ y_{I_xy}, where I_xy denotes the index set of the joint (jointly observed) features of x and y.
KDCA for Visualization and Classification of Incomplete Data
Partial Correlation Kernel: K(x, y) is computed over the jointly observed features I_xy only.
[Figure: no dimension reduction]
[Figure: accuracy vs. missing ratio (0%, 10%, 25%, 50%, 75%) for the non-imputed PC kernel and the imputed RBF kernel, accuracy axis 86%-100%; only about a 0.4% accuracy loss.]
Best of both worlds: compression and masking.
- DCA is good for compressed data (in vector form).
- How do we cope with masked and dimension-reduced data? (Not in vector form.)
[Figure: Imputed RBF vs. PC kernels with 40% of the data missing - RBF-KDCA vs. PC-KDCA (DCA: after data laundry) and RBF-KPCA vs. PC-KPCA (PCA: before data laundry).]
Some Parting Thoughts on Internet Privacy