Data mining for data quality assurance
Data Mining: A Powerful Tool for Data Cleaning
Jiawei Han
Department of Computer Science University of Illinois at Urbana-Champaign
Nov. 4, 2003
Production year (pyear) must not be after review year (ryear)
Roger Ebert (reviewer) never reviews movies with rating < 5
Certain actor has never played in action movies
Rating and rrating tend to be strongly correlated
<movie, pyear, actor, rating>
<movie, genre, review, ryear, rrating, reviewer>
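The constraints above can be read as checks on a candidate pair of tuples, one from each schema. The following is a minimal sketch under stated assumptions, not the method used in the talk: the class names, the function names, the 0-10 rating scale, and the sample values are all hypothetical. The two hard checks follow the first two constraints directly; the agreement score is only a simple stand-in for "rating and rrating tend to be strongly correlated".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MovieTuple:
    """<movie, pyear, actor, rating>"""
    movie: str
    pyear: int
    actor: str
    rating: float


@dataclass
class ReviewTuple:
    """<movie, genre, review, ryear, rrating, reviewer>"""
    movie: str
    genre: str
    review: str
    ryear: int
    rrating: float
    reviewer: str


def hard_violations(m: MovieTuple, r: ReviewTuple) -> List[str]:
    """Return the hard constraints that the pair (m, r) would violate if matched."""
    reasons = []
    if m.pyear > r.ryear:
        reasons.append("pyear after ryear")
    if r.reviewer == "Roger Ebert" and m.rating < 5:
        reasons.append("Roger Ebert review with rating < 5")
    return reasons


def rating_agreement(m: MovieTuple, r: ReviewTuple, scale: float = 10.0) -> float:
    """Soft evidence in [0, 1]: 1.0 when rating and rrating coincide.

    The 0-10 scale is an assumption made for this sketch.
    """
    return 1.0 - min(abs(m.rating - r.rrating), scale) / scale


# Made-up sample values, for illustration only.
movie = MovieTuple("Example Movie", 1995, "Example Actor", 8.0)
good_review = ReviewTuple("Example Movie", "drama", "text", 1996, 9.0, "Roger Ebert")
early_review = ReviewTuple("Example Movie", "drama", "text", 1990, 9.0, "Someone Else")

print(hard_violations(movie, good_review))   # no violations: pair stays a candidate
print(hard_violations(movie, early_review))  # violates the year constraint
print(rating_agreement(movie, good_review))
```

A pair with any hard violation can be discarded outright, while the soft score only raises or lowers confidence in a match.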
all movies at the Internet Movie Database (imdb.com)
text of reviews from the New York Times
then transferred to related matching tasks
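[Figure: matching architecture. Inputs: tuple t1 from Table T1 and tuple t2 from Table T2. Components: Similarity Estimator, Match Filter, Combiner, Hard Profiler 1 … Hard Profiler n, Soft Profiler 1 … Soft Profiler m. Profiler sources: Expert Knowledge, Domain Data, Previous Matching Tasks, Training data. Output: Matching Prediction.]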
[Figure: tuples t1 and t2; Match Filter and Combiner; Hard Profilers … and Soft Profilers …; output: Matching Prediction.]
[Table: matching results on CiteSeer. Column headers: Recall, Precision, F-Value. Row labels: Baseline, PROM, DT, Man+AR, Man+DT, Man+AR+DT. Cell values as extracted: 0.78 0.95 0.85 0.67 0.76 0.87 0.96 0.88 0.82 0.97 0.91 0.86; 0.80 0.67 0.99]
Contingency table layout:
            milk    ¬milk
coffee      mc      ¬mc
¬coffee     m¬c     ¬(mc)

DB    mc     ¬mc    m¬c     ¬(mc)    λ       α      γ      χ2
A1    1000   100    100     1000     1.82    0.91   0.83   1472
A2    1000   100    100     10000    9.26    0.91   0.83   9055
A3    1000   100    100     100000   83.64   0.91   0.83   83452
A4    100    1000   1000    100000   8.44    0.09   0.05   670
A5    1000   100    10000   100000   9.18    0.09   0.09   8172
A6    1000   1000   1000    1000     1       0.5    0.33   0
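The values in the table can be recomputed from the four cell counts. The sketch below assumes the standard definitions of lift (λ), all_confidence (α), coherence/Jaccard (γ), and Pearson's χ2 for a 2x2 table; the function and variable names are hypothetical. Under these definitions the χ2 figures in the table appear to be truncated rather than rounded, so the printout truncates as well.

```python
def measures(mc, not_m_c, m_not_c, neither):
    """Return (lift, all_confidence, coherence, chi2) for one 2x2 table.

    mc       : transactions with both milk and coffee
    not_m_c  : coffee without milk
    m_not_c  : milk without coffee
    neither  : transactions with neither item
    """
    n = mc + not_m_c + m_not_c + neither
    milk = mc + m_not_c      # transactions containing milk
    coffee = mc + not_m_c    # transactions containing coffee
    lift = mc * n / (milk * coffee)
    all_conf = mc / max(milk, coffee)
    coherence = mc / (milk + coffee - mc)
    chi2 = (n * (mc * neither - not_m_c * m_not_c) ** 2
            / (milk * coffee * (n - milk) * (n - coffee)))
    return lift, all_conf, coherence, chi2


# Cell counts copied from the table, in the order (mc, ¬mc, m¬c, ¬(mc)).
DATABASES = {
    "A1": (1000, 100, 100, 1000),
    "A2": (1000, 100, 100, 10000),
    "A3": (1000, 100, 100, 100000),
    "A4": (100, 1000, 1000, 100000),
    "A5": (1000, 100, 10000, 100000),
    "A6": (1000, 1000, 1000, 1000),
}

for name, cells in DATABASES.items():
    lift, all_conf, coherence, chi2 = measures(*cells)
    print(f"{name}: lift={lift:.2f} all_conf={all_conf:.2f} "
          f"coherence={coherence:.2f} chi2={int(chi2)}")
```

Rows A1 to A3 differ only in the number of transactions containing neither item: lift and χ2 change by orders of magnitude, while all_confidence and coherence stay fixed.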
Measure                  Symbol
Klosgen's                K
Added value              AV
Certainty factor         F
Piatetsky-Shapiro's      PS
Cohen's                  k
Yule's Y                 Y
Yule's Q                 Q
φ-coefficient            φ
All_confidence           α
Coherence (Jaccard)      γ
Cosine                   IS
Laplace                  L
confidence               c
support                  s
Gini index               G
J-Measure                J
Mutual Information       M
Goodman-Kruskal's        g
χ2                       χ2
Collective Strength      S
lift                     λ
Conviction               V

Value ranges marked on the slide: from -1 to 1; from 0 to 1.
        B       ¬B
A       1000    100
¬A      100     |AB|

[Figure: three plots of the value of measures against the size of |AB| (1.E+02 to 1.E+06). Measures plotted: Q, Y, k, PS, F, AV, K; V, I, S; M, J, G, s, c, L, IS; φ, g, γ, α, λ.]
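A sweep of this kind is straightforward to reproduce for individual measures. The sketch below assumes that the varied cell is the count of transactions containing neither A nor B (the overbars seem to have been lost in the cell label), and uses the standard definitions of cosine (IS) and the φ-coefficient; function names are hypothetical. It illustrates one measure that ignores the varied cell and one that does not, not the full set of curves in the plots.

```python
import math


def cosine(ab, a_not_b, not_a_b, neither):
    """Cosine (IS): |AB| / sqrt(|A| * |B|). The 'neither' count is unused."""
    return ab / math.sqrt((ab + a_not_b) * (ab + not_a_b))


def phi(ab, a_not_b, not_a_b, neither):
    """phi-coefficient of the 2x2 contingency table."""
    n_a = ab + a_not_b
    n_b = ab + not_a_b
    n_not_a = not_a_b + neither
    n_not_b = a_not_b + neither
    return (ab * neither - a_not_b * not_a_b) / math.sqrt(
        n_a * n_b * n_not_a * n_not_b)


# Fixed cells taken from the table; the fourth cell is swept over 10^2 .. 10^6.
AB, A_NOT_B, NOT_A_B = 1000, 100, 100
SWEEP = [10 ** k for k in range(2, 7)]

cosine_values = [cosine(AB, A_NOT_B, NOT_A_B, d) for d in SWEEP]
phi_values = [phi(AB, A_NOT_B, NOT_A_B, d) for d in SWEEP]

for d, c, p in zip(SWEEP, cosine_values, phi_values):
    print(f"neither={d:>7}: cosine={c:.3f} phi={p:.3f}")
```

Under these assumptions cosine stays constant across the sweep, while φ keeps increasing as the fourth cell grows.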
        B       ¬B
A       100     1000
¬A      1000    |AB|

[Figure: three plots of the value of measures against the size of |AB| (1.E+02 to 1.E+06). Measures plotted: Q, Y, k, PS, F, AV, K; I, S; M, J, G, s, c, L, IS; φ, g, γ, α, χ2, λ.]
Input parameters:
        B       ¬B
A       1000    1000
¬A      1000    |AB|

Results:
[Figure: three plots of the value of measures against the size of |AB| (1.E+02 to 1.E+06). Measures plotted: Q, Y, k, PS, F, AV, K; I, S; M, J, G, s, c, L, IS; g, γ, α, φ, χ2, λ.]
        B       ¬B
A       |AB|    10000
¬A      100     10000

[Figure: value of measures (c, L, IS, γ, α; scale 0.00 to 1.00) and value of χ2 (scale 0.E+00 to 5.E+04) against the size of |AB| (312.5 to 640000).]
Account(account-id, district-id, frequency, date)
Loan(loan-id, account-id, date, amount, duration, payment)
Order(account-id, bank-to, account-to, amount, type)
Card(card-id, disp-id, type, issue-date)
Disposition(disp-id, account-id, client-id)
Client(client-id, birth-date, gender, district-id)
District(district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96)
Transaction(trans-id, account-id, date, type, amount, balance, symbol)
Example rules:
Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?, 'monthly', ?).
Loan(L, +) :- Loan(L, A, ?, ?, ?, '<1000'), Account(A, D, ?, ?), District(D, ?, region = 'northMoravia', ?, ?, …).
Target relation: Each tuple has a class label, indicating whether a loan is paid on time.
Loan
loan-id   account-id   amount   duration   payment   class
1         124          1000     12         120       +
2         124          4000     12         350       +
3         108          10000    24         500       -
4         45           12000    36         400       -
5         45           2000     24         90        +

Account
account-id   frequency   date
124          monthly     960227
108          weekly      950923
45           monthly     941209
67           weekly      950101
Loan(L, A, ?, ?, ?), Account(A, 'monthly' (or 'weekly'), ?).
Loan(L, A, ?, ?, ?), Account(A, ?, date < x (date > x)).
Account
account-id   frequency   date     IDs    Class Labels
124          monthly     960227   1, 2   2+, 0-
108          weekly      950923   3      0+, 1-
45           monthly     941209   4, 5   1+, 1-
67           weekly      950101

Loan
loan-id   account-id   amount   duration   payment   class
1         124          1000     12         120       +
2         124          4000     12         350       +
3         108          10000    24         500       -
4         45           12000    36         400       -
5         45           2000     24         90        +
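The IDs and Class Labels columns attached to Account can be reproduced by carrying each loan's ID and class label over to the account it references; CrossMine calls this tuple ID propagation. The sketch below uses the five loans and four accounts from the tables; the function names and the choice of dictionaries are assumptions made for illustration, not the authors' implementation.

```python
# (loan-id, account-id, class label), copied from the Loan table.
LOANS = [
    (1, 124, "+"),
    (2, 124, "+"),
    (3, 108, "-"),
    (4, 45, "-"),
    (5, 45, "+"),
]

# (account-id, frequency, date), copied from the Account table.
ACCOUNTS = [
    (124, "monthly", 960227),
    (108, "weekly", 950923),
    (45, "monthly", 941209),
    (67, "weekly", 950101),
]


def propagate_ids(loans, accounts):
    """Attach to every account the IDs and label counts of the loans joined to it."""
    ids = {account_id: [] for account_id, _freq, _date in accounts}
    labels = {account_id: {"+": 0, "-": 0} for account_id, _freq, _date in accounts}
    for loan_id, account_id, label in loans:
        ids[account_id].append(loan_id)
        labels[account_id][label] += 1
    return ids, labels


def coverage(accounts, ids, labels, predicate):
    """Loans covered by a predicate on Account, read off the propagated IDs."""
    covered, pos, neg = [], 0, 0
    for account in accounts:
        if predicate(account):
            account_id = account[0]
            covered.extend(ids[account_id])
            pos += labels[account_id]["+"]
            neg += labels[account_id]["-"]
    return sorted(covered), pos, neg


ids, labels = propagate_ids(LOANS, ACCOUNTS)
for account_id, _freq, _date in ACCOUNTS:
    print(account_id, ids[account_id], labels[account_id])

# Candidate predicate: Account.frequency = 'monthly'
print(coverage(ACCOUNTS, ids, labels, lambda a: a[1] == "monthly"))
```

Once the IDs sit on Account, any candidate predicate on Account (frequency, or a date threshold) can be scored by scanning Account alone, without joining back to Loan.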
[Figure: schema of the financial database (Account, Loan, Order, Card, Disposition, Client, District, Transaction), annotated with "Target relation", "First predicate", and "Second predicate".]
[Figures: scalability w.r.t. number of relations; scalability w.r.t. number of tuples]

Approach    Accuracy   Time
FOIL        74.0%      3338 sec
TILDE       81.3%      2429 sec
CrossMine   90.7%      15.3 sec
[Figure labels: Document Owner, Sensitive documents, Data Miner, Document mining]
Remove sensitive words (names, locations, …), numerical data, dates, etc.
Keep only common words
Scramble the order of the words
Remove up to 40% of the words and add up to 40% noise words
In regards to fractal compression, I have seen 2 fractal compressed "movies". They were both fairly impressive. The first one was a 64 gray scale "movie" of Casablanca, it was 1.3MB and had 11 minutes of 13 fps video. It was a little grainy but not bad at all. The second one I saw was only 3 minutes but it had 8 bit color with 10fps and measured in at 1.2MB. I consider the fractal movies a practical thing to explore. But unlike many other formats out there, you do end up losing resolution. I don't know what kind of software/hardware was used for creating the "movies" I saw but the guy that showed them to me said it took 5-15 minutes per frame to generate. But as I said above playback was 10 or more frames per second. And how else could you put 11 minutes on one floppy disk?

davidr@rincon.ema.rockwell.com
My opinions are my own except where they are shared by others in which case I will probably change my mind.

speed, minut, him, assign, regard, complex, took, cheer, reach, idl, send, state, consid, presum, through, divis, resolut, frame, perhap, disclaim, locat, lose, name, qualiti, except, mail, posit, cabl, els, ride, bit, gener, avail, hurt, format, said, sox, littl, own, chang, put, share, upon, softwar, card, mean, impress, util, point, saw, better, consult, file, read, movi, per, drive, mani, unlik, first, realli, thing, system, recent, want, could, apr, sometim, had, them, gui, fine, kind, math, entri, folk, show, seek, gov, second, meet
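The encoding steps can be sketched in a few lines. This is an illustration under assumptions, not the procedure used to produce the word list above: the function name, the tiny common-word and noise vocabularies, and the reading of "up to 40%" as a fraction of the kept words are all hypothetical, and the stemming visible in the word list is omitted.

```python
import random
import string


def encode_document(text, common_words, noise_words, rng,
                    max_drop=0.4, max_noise=0.4):
    """Turn a document into a scrambled, noisy bag of common words."""
    kept = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        # Tokens with digits, '@', etc. fail isalpha(); names and other
        # uncommon words are dropped because they are not in common_words.
        if word.isalpha() and word in common_words:
            kept.append(word)

    base = len(kept)
    rng.shuffle(kept)                                 # destroy word order
    n_drop = rng.randint(0, int(max_drop * base))     # remove up to 40%
    kept = kept[n_drop:]
    n_noise = rng.randint(0, int(max_noise * base))   # add up to 40% noise
    kept.extend(rng.choice(noise_words) for _ in range(n_noise))
    rng.shuffle(kept)                                 # hide where the noise went
    return kept


# Hypothetical vocabularies, for illustration only.
COMMON = {"seen", "fractal", "compressed", "movies", "first", "minutes", "video"}
NOISE = ["card", "drive", "file"]

text = ('I have seen 2 fractal compressed "movies". The first one was 1.3MB '
        'and had 11 minutes of 13 fps video.')
encoded = encode_document(text, COMMON, NOISE, random.Random(0))
print(encoded)
```

Passing a seeded `random.Random` keeps the sketch reproducible; a real deployment would presumably want an unpredictable source of randomness.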
a frequent pattern a class label
[Figures: accuracy on newsgroup dataset; accuracy on BankSearch dataset]
SecureClass is more accurate than SVM, Naïve Bayes, and CMAR.
The accuracy of SecureClass is less affected than that of those three approaches.
The efficiency of SecureClass is similar to that of SVM; it is slower than Naïve Bayes but faster than CMAR.