CS145: INTRODUCTION TO DATA MINING
Instructor: Yizhou Sun
yzsun@cs.ucla.edu November 27, 2017
Sequence Data: Similarity Search
Methods to be Learnt
|                         | Vector Data | Set Data | Sequence Data | Text Data |
|-------------------------|-------------|----------|---------------|-----------|
| Classification          | Logistic Regression; Decision Tree; KNN; SVM; NN | | | Naïve Bayes for Text |
| Clustering              | K-means; hierarchical clustering; DBSCAN; Mixture Models | | | PLSA |
| Prediction              | Linear Regression; GLM* | | | |
| Frequent Pattern Mining | | Apriori; FP growth | GSP; PrefixSpan | |
| Similarity Search       | | | DTW | |
Time series: $Y_1, Y_2, \ldots, Y_n$, or $\{Y_t : t \in T\}$, where $T$ is the index set
[Figure: VanEck International Fund and Fidelity Selective Precious Metal and Mineral Fund, two similar mutual funds in different fund groups]
$c_i' = \dfrac{c_i - \mu(C)}{\sigma(C)}$
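Normalization of this kind (subtract the sequence mean, divide by the sequence standard deviation) can be sketched as follows. This is a minimal illustration, not the course's reference code: the function name `z_normalize` and the sample sequences are hypothetical, and dividing by n (population standard deviation) is an assumption, since the slide does not say which variant is used.

```python
import math


def z_normalize(seq):
    """Return the sequence with each value replaced by (value - mean) / std."""
    n = len(seq)
    mean = sum(seq) / n
    # Population standard deviation (divide by n): an assumption for this sketch.
    std = math.sqrt(sum((c - mean) ** 2 for c in seq) / n)
    if std == 0:
        raise ValueError("a constant sequence has zero standard deviation")
    return [(c - mean) / std for c in seq]


# Two hypothetical sequences with the same shape but different offset and scale.
a = [1.0, 2.0, 3.0, 4.0]
b = [110.0, 120.0, 130.0, 140.0]

norm_a = z_normalize(a)
norm_b = z_normalize(b)
```

After normalization the two sequences coincide, so a distance computed on `norm_a` and `norm_b` reflects shape rather than offset or amplitude.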
$D(i, j) = \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\} + d(x_i, y_j)$
Time complexity: O(MN)
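The O(MN) cost comes from filling one table cell per pair of positions in the two sequences. A minimal sketch of dynamic time warping (DTW) under stated assumptions: the function name `dtw_distance` is hypothetical, the local distance defaults to the absolute difference (the slides do not fix one), and no warping-window constraint is applied.

```python
def dtw_distance(x, y, dist=lambda a, b: abs(a - b)):
    """Cumulative cost of the cheapest warping path between sequences x and y."""
    m, n = len(x), len(y)
    inf = float("inf")
    # D[i][j] = cost of the best path aligning the first i items of x
    # with the first j items of y; row 0 and column 0 are boundary cells.
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):          # M rows
        for j in range(1, n + 1):      # N columns, so M * N cells in total
            cost = dist(x[i - 1], y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # x advances, y repeats
                                 D[i][j - 1],      # y advances, x repeats
                                 D[i - 1][j - 1])  # both advance
    return D[m][n]


# Hypothetical example: y is x with one value held for an extra time step.
x = [1, 2, 3]
y = [1, 2, 2, 3]
stretched = dtw_distance(x, y)
```

Because the path may repeat an element of either sequence, the stretched copy aligns with zero cost, which a point-by-point Euclidean distance could not even compute for sequences of different lengths.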
The Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain
Parseval's theorem: the Euclidean distance between two signals in the time domain is the same as their distance in the frequency domain
$$\sum_{t=0}^{n-1} |x_t|^2 = \sum_{f=0}^{n-1} |X_f|^2$$
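The equality of time-domain and frequency-domain distances can be checked numerically. The sketch below assumes a unitary discrete Fourier transform (scaled by 1/sqrt(n)), which is the normalization under which the equality holds exactly; the helper names `dft` and `euclidean` and the sample series are hypothetical, and the transform is written out naively in O(n^2) for clarity rather than speed.

```python
import cmath
import math


def dft(x):
    """Unitary DFT: X_f = (1 / sqrt(n)) * sum_t x_t * exp(-2*pi*i*f*t / n)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length real or complex sequences."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


# Two hypothetical time series of equal length.
x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 7.0]
y = [2.0, 2.0, 3.0, 4.0, 6.0, 5.0, 6.0, 8.0]

X, Y = dft(x), dft(y)
d_time = euclidean(x, y)           # distance in the time domain
d_freq = euclidean(X, Y)           # distance in the frequency domain
d_first3 = euclidean(X[:3], Y[:3]) # distance on the first 3 coefficients only
```

Here `d_time` and `d_freq` agree up to floating-point error. Since every squared term is non-negative, a distance computed on a subset of coefficients such as `d_first3` can never exceed the full distance.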
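Trend: the general direction in which a time series is moving over a long interval
Cyclic variation: oscillations about a trend line or curve
Seasonal variation: patterns that a time series tends to follow during corresponding months of successive years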
$\Delta \ln(Y_t) = \ln(Y_t) - \ln(Y_{t-1})$
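The first difference of the logarithm (log of the current value minus log of the previous value) can be computed as in the sketch below. The function name `log_differences` and the price values are hypothetical, and the series is assumed to be strictly positive so that the logarithm is defined.

```python
import math


def log_differences(series):
    """Return ln(series[t]) - ln(series[t - 1]) for t = 1 .. len(series) - 1."""
    return [math.log(series[t]) - math.log(series[t - 1])
            for t in range(1, len(series))]


# Hypothetical positive-valued series, e.g. a price level.
prices = [100.0, 102.0, 101.0, 105.0]
growth = log_differences(prices)
```

The result has one fewer element than the input. For small changes each entry is close to the proportional change, since ln(1 + r) is approximately r when r is small.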
Autocovariance
$\hat{\rho}_j = \dfrac{\widehat{\mathrm{cov}}(Y_t, Y_{t-j})}{\widehat{\mathrm{var}}(Y_t)}$
where $\widehat{\mathrm{cov}}(Y_t, Y_{t-j})$ is calculated as:
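| $Y_t$       | $Y_{t-j}$     |
|-------------|---------------|
| $y_{j+1}$   | $y_1$         |
| $y_{j+2}$   | $y_2$         |
| $\vdots$    | $\vdots$      |
| $y_{T-1}$   | $y_{T-j-1}$   |
| $y_T$       | $y_{T-j}$     |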
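Pairing each observation with the one j steps earlier, as in the two columns above, is all that is needed to estimate the lag-j autocorrelation. The sketch below is one common convention and not necessarily the exact estimator on the slide: it is an assumption that the overall sample mean is used for both columns and that covariance and variance are both divided by T. The function name `sample_autocorrelation` and the example series are hypothetical.

```python
def sample_autocorrelation(y, j):
    """Estimate the lag-j autocorrelation of the observed series y."""
    T = len(y)
    if not 0 <= j < T:
        raise ValueError("lag j must satisfy 0 <= j < len(y)")
    mean = sum(y) / T
    # Lagged pairs (y_{j+1}, y_1), (y_{j+2}, y_2), ..., (y_T, y_{T-j}).
    pairs = zip(y[j:], y[:T - j])
    cov = sum((a - mean) * (b - mean) for a, b in pairs) / T
    var = sum((v - mean) ** 2 for v in y) / T
    # A constant series has var == 0 and raises ZeroDivisionError here.
    return cov / var


# Hypothetical series: a steady upward drift and a strict alternation.
drifting = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
alternating = [1.0, -1.0, 1.0, -1.0]

rho_drift = sample_autocorrelation(drifting, 1)
rho_alt = sample_autocorrelation(alternating, 1)
```

A persistent series such as `drifting` gives a positive lag-1 value, while `alternating` gives a negative one, because each observation tends to sit on the opposite side of the mean from its predecessor.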
The lag-1 autocorrelation $\rho_1$ is very high: last quarter's inflation rate contains much information about this quarter's inflation rate