Fundamentals of Data Science
Know Your Data
Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net 2019 EE448, Big Data Mining, Lecture 2
http://wnzhang.net/teaching/ee448/index.html
points, objects, tuples.
A news article
An image
A song
A Facebook user profile
A transcript of a student
A trajectory of a car from SJTU to FDU
data field, representing a characteristic or feature
positive)
between successive values is not known.
larger than the unit of measurement (10 K is twice as high as 5 K).
documents
attributes
using a finite number of digits
point variables
The frequency of ‘USA’ in a news article
The friend set of a Facebook user
The Algebra score of a student’s transcript
The time-location of the 3rd point of a trajectory
The upper left pixel RGB value of an image
The pitch of the 320th frame of a song
Record Data
Text Data
Image Data
Audio Speech Data
Network Data
Spatio-Temporal Data
JSON Format: { "WEEKDAY": "Monday", "GENDER": "Female", "AGE": 24, "CITY": "New York" }
that represents semantic meanings of human
Bag-of-Words Format:
{ text: 4; mining: 2; also: 1; referred: 1; to: 2; as: 1; data: 1; roughly: 1; equivalent: 1; analytics: 1; is: 1; the: 1; process: 1; of: 1; deriving: 1; high-quality: 1; information: 1; from: 1; }
Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text.
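A minimal sketch of how the bag-of-words counts could be produced. The function name `bag_of_words` and the tokenization rule (lower-casing, keeping hyphenated words together) are assumptions for illustration, not something the slides specify. The sentence is typed out in full as a string.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split it into word tokens, and count each token."""
    # Assumed tokenizer: runs of letters, keeping hyphenated words
    # such as "high-quality" as a single token.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(tokens)


sentence = (
    "Text mining, also referred to as text data mining, roughly equivalent "
    "to text analytics, is the process of deriving high-quality information "
    "from text."
)
bow = bag_of_words(sentence)
print(dict(bow))
```

Word order is discarded on purpose: only the count of each token survives, which is what makes the representation a "bag".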
http://languagelog.ldc.upenn.edu/nll/?p=8116
edges
Stanford network dataset collection: https://snap.stanford.edu/data/
Friendship Format:
Alice Bob
Bob Carl
Carl Victor
Bob Victor
Alice Victor
…
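A small sketch of loading the friendship edge list into an adjacency structure. It assumes that friendship is undirected and that each line holds exactly two names separated by whitespace; `build_adjacency` is a hypothetical helper name.

```python
from collections import defaultdict


def build_adjacency(edge_lines):
    """Turn 'A B' edge lines into a dict mapping each node to its neighbours."""
    adjacency = defaultdict(set)
    for line in edge_lines:
        u, v = line.split()
        # Assumption: friendship is symmetric, so store the edge both ways.
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


edges = ["Alice Bob", "Bob Carl", "Carl Victor", "Bob Victor", "Alice Victor"]
graph = build_adjacency(edges)
degree = {person: len(friends) for person, friends in graph.items()}
print(degree)
```

The node degree computed at the end is one of the simplest attributes that can be derived from network data.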
https://www.microsoft.com/en-us/research/project/trajectory-data-mining/
[Figure: a trajectory through the points p1 to p12]
$p_1 \to p_2 \to \cdots \to p_n$, where $p_i = (t, x, y, a)$
Slide credit: Yu Zheng
$p_i = (t, a)$
and spread
precision
$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
values otherwise
$\mu = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
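The arithmetic mean and the weighted mean can be sketched as follows. The scores and weights are made-up illustration values, and the function names are assumptions.

```python
def arithmetic_mean(xs):
    """mu = (1/n) * sum of x_i."""
    return sum(xs) / len(xs)


def weighted_mean(xs, ws):
    """mu = sum(w_i * x_i) / sum(w_i)."""
    if len(xs) != len(ws):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


scores = [80, 90, 70]   # hypothetical values
weights = [1, 2, 1]     # hypothetical weights
plain = arithmetic_mean(scores)
weighted = weighted_mean(scores, weights)
print(plain, weighted)
```

With all weights equal, the weighted mean reduces to the arithmetic mean.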
$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
and negatively skewed data
[Figure: three density curves of p(x) against x, each marking the mode, median, and mean]
Positively skewed data: mode < median
Negatively skewed data: mode > median
Symmetric data: mode = median
$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i = E[x]$
$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 = E[x^2] - E[x]^2$
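A quick numerical check that the two forms of the variance agree, using the population definition with the $1/n$ factor. The data values are made up for illustration.

```python
def variance_by_definition(xs):
    """sigma^2 = (1/n) * sum of (x_i - mu)^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n


def variance_by_moments(xs):
    """sigma^2 = E[x^2] - E[x]^2."""
    n = len(xs)
    mean_of_squares = sum(x * x for x in xs) / n
    mean = sum(xs) / n
    return mean_of_squares - mean ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
v1 = variance_by_definition(data)
v2 = variance_by_moments(data)
print(v1, v2)
```

The moment form needs only one pass over the data, but it can lose precision in floating point when the mean is large relative to the spread.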
whiskers, and plot outliers individually
[Figure: boxplot labeled min, Q1, median, Q3, max]
the box is IQR
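A sketch of the five-number summary behind a boxplot, using the standard-library `statistics.quantiles` function (Python 3.8+). Quartile conventions differ between textbooks and tools; the `"inclusive"` method used here is one assumption among several reasonable choices, and the data are hypothetical.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(data)
    # n=4 yields the three cut points that split the data into quartiles.
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


values = list(range(1, 10))  # hypothetical data: 1, 2, ..., 9
minimum, q1, median, q3, maximum = five_number_summary(values)
iqr = q3 - q1  # the height of the box
print(minimum, q1, median, q3, maximum, iqr)
```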
approximately $100 f_i\%$ of data are $\le x_i$
univariate distribution against the corresponding quantiles
plotted as points in the plane
tabulated frequencies, shown as bars
cases fall into each of several categories
specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
shown on the right may have the same boxplot representation
min, Q1, median, Q3, max
different data distributions
[Figure: two histograms of N(x) against x]
both the overall behavior and unusual occurrences)
approximately $100 f_i\%$ of data $\le x_i$
corresponding quantiles of another
those at Branch 2.
[Figure: Q-Q plot marking Q1, median, and Q3]
coordinates and plotted as points in the plane
variables by scatter data.
relationships among data
further quantitative analysis
derived
screen, one for each dimension
at the corresponding positions in the windows
(a) Income (b) Credit Limit (c) Transaction volume (d) Age
Note: here the m windows are arranged by income. We can check the correlations of other dimension data w.r.t. income.
projections of the data
projections of multidimensional data
https://plot.ly/pandas/line-and-scatter/
(A) The two-dimensional codes for 500 digits of each class produced by taking the first two principal components (B) The two-dimensional codes found by a 784-1000-500-250-2 autoencoder (a deep learning model).
Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
MNIST data of handwritten digits
(A) The codes produced by two-dimensional latent semantic analysis (LSA). (B) The codes produced by a 2000-500-250-125-2 autoencoder (a deep learning model).
Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
The Reuters Corpus Volume 2
Matrix of scatterplots (x-y-diagrams) of the k-dimensional data
Used by permission of M. Ward, Worcester Polytechnic Institute
representation which preserves the characteristics of the data
https://blogs.sas.com/content/sgf/2018/02/06/jazz-geo-map-colorful-icon-based-display-rules/
partitioning into subspaces
subspaces, which are ‘stacked’ into each other
important attributes should be used on the outer levels.
attribute 1 attribute 2 attribute 3 attribute 4
innermost world
dimensional worlds choosing these as the axes)
interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)
interaction by means of queries
screen into regions depending on the attribute values
according to the attribute values (classes)
MSR Netscan Image
http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg
http://www.cs.umd.edu/hcil/treemap-history/
Google News output
http://www.industrial-electronics.com/images/dmct_3e_2-20.jpg
tag is represented by font size/color
there are also methods to visualize relationships, such as visualizing social networks
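library(ggplot2)
# Map color, line type, and point shape to the categorical column `variable`;
# constant sizes are set outside aes() so they are not treated as mappings.
ggplot(data, aes(x = X1, y = value, color = variable)) +
  geom_line(aes(linetype = variable), size = 1) +
  geom_point(aes(shape = variable), size = 4)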
When a data scientist draws a plot, she only needs to differentiate the lines (color, line type) and points (color, shape) by a categorical variable, instead of specifying a particular style for each line and point.
http://ggplot2.tidyverse.org/
are
dimensions
Data matrix ($n \times p$):
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$
Dissimilarity matrix ($n \times n$, lower triangle):
$$
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
$sim(i, j) = 1 - d(i, j)$
attribute)
$d(i, j) = \frac{p - m}{p}$
x1 = [Weekday=Friday, Gender=Male, City=Shanghai]
x2 = [Weekday=Friday, Gender=Female, City=Shanghai]
$d(1, 2) = \frac{3 - 2}{3} = \frac{1}{3}$
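The simple-matching computation above can be sketched in a few lines. Representing each object as a dict keyed by attribute name is an assumption made for readability.

```python
def nominal_dissimilarity(a, b):
    """d(i, j) = (p - m) / p, with p attributes and m matching values."""
    if a.keys() != b.keys():
        raise ValueError("both objects must have the same attributes")
    p = len(a)
    m = sum(1 for key in a if a[key] == b[key])
    return (p - m) / p


x1 = {"Weekday": "Friday", "Gender": "Male", "City": "Shanghai"}
x2 = {"Weekday": "Friday", "Gender": "Female", "City": "Shanghai"}
d12 = nominal_dissimilarity(x1, x2)
print(d12)
```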
for each of the p nominal states
xi = [Weekday=Friday, Gender=Male, City=Shanghai]
xi = [0,0,0,0,1,0,0, 0,1, 0,0,1,0,…,0]
Whether Weekday=Friday
Whether City=Shanghai
vectors, which can be fed into various functions
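A minimal sketch of the one-hot (binary) encoding of nominal attributes. The state orderings are assumptions chosen to match the example vector (Monday-first weekdays, Female before Male), and the city list is hypothetical because the slide elides it.

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
GENDERS = ["Female", "Male"]
# Hypothetical city vocabulary; the real list is not given in the slides.
CITIES = ["Beijing", "Guangzhou", "Shanghai", "Shenzhen"]


def one_hot(value, states):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    return [1 if state == value else 0 for state in states]


def encode(record):
    """Concatenate the one-hot codes of all three nominal attributes."""
    return (one_hot(record["Weekday"], WEEKDAYS)
            + one_hot(record["Gender"], GENDERS)
            + one_hot(record["City"], CITIES))


xi = {"Weekday": "Friday", "Gender": "Male", "City": "Shanghai"}
vector = encode(xi)
print(vector)
```

Each nominal attribute contributes exactly one 1 to the vector, so the vector length equals the total number of states across all attributes.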
binary data
symmetric binary variables:
asymmetric binary variables:
                  Object j
                  1       0       sum
Object i    1     q       r       q + r
            0     s       t       s + t
            sum   q + s   r + t   p

$d(i, j) = \frac{r + s}{q + r + s + t}$
$d(i, j) = \frac{r + s}{q + r + s}$
$sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
measure for asymmetric binary variables):
$coherence(i, j) = \frac{sup(i, j)}{sup(i) + sup(j) - sup(i, j)} = \frac{q}{(q + r) + (q + s) - q}$
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      Y      N       N       N       N
$d(Jack, Mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
$d(Jack, Jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(Jim, Mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
$d(i, j) = \frac{r + s}{q + r + s}$

                  Object j
                  1       0       sum
Object i    1     q       r       q + r
            0     s       t       s + t
            sum   q + s   r + t   p
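The patient example can be reproduced with a short sketch. It assumes that Y and P are coded as 1 and N as 0, and that Gender is left out because it is a symmetric attribute; the function names are illustrative.

```python
def contingency(a, b):
    """Count q (1,1), r (1,0), s (0,1), t (0,0) over two 0/1 vectors."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


def asymmetric_binary_dissimilarity(a, b):
    """d(i, j) = (r + s) / (q + r + s); joint absences (t) are ignored."""
    q, r, s, _ = contingency(a, b)
    return (r + s) / (q + r + s)


# Attribute order: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = asymmetric_binary_dissimilarity(jack, mary)
d_jack_jim = asymmetric_binary_dissimilarity(jack, jim)
d_jim_mary = asymmetric_binary_dissimilarity(jim, mary)
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

The Jaccard similarity is one minus this dissimilarity, since $\frac{q}{q + r + s} = 1 - \frac{r + s}{q + r + s}$.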
x1 = [1.2, 3.5, 1.1, 2.7, 123.9]
x2 = [2.0, 1.5, 1.3, 3.1, 145.1]
This dimension may dominate the proximity calculation
$z = \frac{x - \mu}{\sigma}$
standard deviation
units of the standard deviation
above
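A sketch of z-score standardization applied column by column to the two vectors shown earlier. It assumes the population standard deviation (`statistics.pstdev`); with only two objects every standardized value is ±1, which is enough to show that the large fifth dimension no longer dominates.

```python
from statistics import mean, pstdev


def z_score_columns(rows):
    """Standardize each column: z = (x - mu) / sigma."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    if any(sigma == 0 for sigma in sigmas):
        raise ValueError("a constant column cannot be standardized")
    return [
        [(x - mu) / sigma for x, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]


x1 = [1.2, 3.5, 1.1, 2.7, 123.9]
x2 = [2.0, 1.5, 1.3, 3.1, 145.1]
z1, z2 = z_score_columns([x1, x2])
print([round(z, 2) for z in z1])
print([round(z, 2) for z in z2])
```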
Example:
[Figure: scatter plot of x1, x2, x3, x4, with both axes marked at 2 and 4]

Data Matrix
point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Dissimilarity Matrix (with Euclidean Distance)
        x1      x2      x3      x4
x1      0
x2      3.61    0
x3      2.24    5.1     0
x4      4.24    1       5.39    0
$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$
$x_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$
$d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$
different between two binary vectors
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
component (attribute) of the vectors
$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$
Example:
[Figure: scatter plot of x1, x2, x3, x4, with both axes marked at 2 and 4]

Data Matrix
point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Dissimilarity Matrices

Manhattan (L1)
        x1      x2      x3      x4
x1      0
x2      5       0
x3      3       6       0
x4      6       1       7       0

Euclidean (L2)
        x1      x2      x3      x4
x1      0
x2      3.61    0
x3      2.24    5.1     0
x4      4.24    1       5.39    0

Supremum (Lmax)
        x1      x2      x3      x4
x1      0
x2      3       0
x3      2       5       0
x4      3       1       5       0
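The three dissimilarity matrices can be checked with a short sketch of the Minkowski distance. The function name and the use of `math.inf` to select the supremum distance are illustrative choices.

```python
import math

# The four example points; x3's second attribute is 0, which is the
# value consistent with every distance listed for x3.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def minkowski(a, b, h):
    """L_h distance; h=1 Manhattan, h=2 Euclidean, h=inf supremum."""
    diffs = [abs(u - v) for u, v in zip(a, b)]
    if h == math.inf:
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


names = sorted(points)
for label, h in [("L1", 1), ("L2", 2), ("Lmax", math.inf)]:
    print(label)
    for i, row in enumerate(names):
        # Print only the lower triangle, as in the matrices above.
        cells = [round(minkowski(points[row], points[col], h), 2)
                 for col in names[: i + 1]]
        print(row, cells)
```

For any pair of points the values satisfy Lmax ≤ L2 ≤ L1, which is visible when the three matrices are compared entry by entry.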
recording the frequency of a particular word (such as keywords) or phrase in the document.
mapping, ...
vectors), then $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector $d$
Document  Team  Coach  Hockey  Baseball  Soccer  Penalty  Score  Win  Loss  Season
d1        5     0      3       0         2       0        0      2    0     0
d2        3     0      2       0         1       1        0      1    0     1
d3        0     7      0       2         1       0        0      3    0     0
d4        0     1      0       0         1       2        2      0    3     0
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
$d_1 \cdot d_2 = 5 \times 3 + 0 \times 0 + 3 \times 2 + 0 \times 0 + 2 \times 1 + 0 \times 1 + 0 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 1 = 25$
$\|d_1\| = (5 \times 5 + 0 \times 0 + 3 \times 3 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 0 \times 0)^{0.5} = 42^{0.5} = 6.48$
$\|d_2\| = (3 \times 3 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 1 \times 1 + 1 \times 1 + 0 \times 0 + 1 \times 1 + 0 \times 0 + 1 \times 1)^{0.5} = 17^{0.5} = 4.12$
$\cos(d_1, d_2) = 0.94$
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
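The worked example for d1 and d2 can be verified with a small sketch in plain Python; the function name is an illustrative choice.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))
```

Because only the angle between the vectors matters, scaling a document's term counts by a constant leaves the similarity unchanged.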
i-th object in the f-th variable by
scaled variables
$r_{if} \in \{1, \ldots, M_f\}$
$z_{if} = \frac{r_{if} - 1}{M_f - 1}$
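A sketch of mapping ordinal ranks onto [0, 1] with the formula above. The three grade levels are a hypothetical example and are not taken from the slides.

```python
def ordinal_to_unit_interval(rank, num_states):
    """z = (r - 1) / (M - 1) for a rank r in {1, ..., M}."""
    if num_states < 2:
        raise ValueError("need at least two ordered states")
    if not 1 <= rank <= num_states:
        raise ValueError("rank must lie in 1..num_states")
    return (rank - 1) / (num_states - 1)


# Hypothetical ordered states, lowest first.
levels = ["fair", "good", "excellent"]
z = {
    level: ordinal_to_unit_interval(rank, len(levels))
    for rank, level in enumerate(levels, start=1)
}
print(z)
```

After this mapping the values can be compared with the same distance measures used for numeric attributes.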
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
$z_{if} = \frac{r_{if} - 1}{M_f - 1}$
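A hedged sketch of the combined dissimilarity for attributes of mixed type. Several details are assumptions: a missing value is represented by `None` and gets $\delta_{ij}^{(f)} = 0$, numeric attributes use the absolute difference divided by the attribute's range, and ordinal attributes are first mapped to $z_{if}$. The special handling of asymmetric binary attributes is not covered. The attribute names, ranges, and records are hypothetical.

```python
def mixed_dissimilarity(xi, xj, schema):
    """Average the per-attribute dissimilarities over usable attributes."""
    total, usable = 0.0, 0
    for a, b, (kind, info) in zip(xi, xj, schema):
        if a is None or b is None:
            continue  # delta = 0: attribute f is skipped for this pair
        if kind == "nominal":
            d = 0.0 if a == b else 1.0
        elif kind == "ordinal":
            m = info  # number of ordered states M_f
            d = abs((a - 1) / (m - 1) - (b - 1) / (m - 1))
        elif kind == "numeric":
            low, high = info  # assumed known min and max of attribute f
            d = abs(a - b) / (high - low)
        else:
            raise ValueError(f"unknown attribute type: {kind}")
        total += d
        usable += 1
    if usable == 0:
        raise ValueError("no attribute is usable for this pair")
    return total / usable


# Hypothetical schema: city (nominal), grade rank (ordinal, 3 states),
# age (numeric, range 20..60).
schema = [("nominal", None), ("ordinal", 3), ("numeric", (20.0, 60.0))]
a = ("Shanghai", 3, 30.0)
b = ("Beijing", 1, 50.0)
c = ("Beijing", None, 50.0)  # ordinal value missing

d_ab = mixed_dissimilarity(a, b, schema)
d_ac = mixed_dissimilarity(a, c, schema)
print(round(d_ab, 4), round(d_ac, 4))
```

Every per-attribute term lies in [0, 1], so the combined value does too, regardless of how many attributes of each type are present.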
scaled, ratio-scaled
Web, image.
graphical displays
area of research.