Cleaning data with forbidden itemsets (Joeri Rammelaere with Floris Geerts & Bart Goethals)



SLIDE 1

Cleaning data with forbidden itemsets

Joeri Rammelaere with Floris Geerts & Bart Goethals

SLIDE 2

I’ll talk about . . .

◮ What dirty data is
◮ What forbidden itemsets are and how to mine them
◮ How to repair dirty data using nearest neighbours
◮ Demo!

SLIDE 3

Dirty data

SLIDE 4

When is data dirty?

◮ Typically:
  ◮ Define constraints on data
  ◮ Data is dirty if constraints are violated
◮ What kind of constraints?
  ◮ Many formalisms exist
  ◮ For example, functional dependencies

SLIDE 5

How do we find constraints?

◮ Human experts
◮ Master data
◮ Constraint discovery
◮ . . .
◮ But what if we only have dirty data?

SLIDE 6

Dirty example

Age  MaritalStatus       Relationship   Sex     Country
39   Never-married       Not-in-family  Male    USA
38   Married-AF-spouse   Wife           Female  USA
17   Divorced            Not-in-family  Male    USA
37   Married-civ-spouse  Wife           Female  USA
28   Married-civ-spouse  Wife           Female  Cuba
29   Married-civ-spouse  Wife           Male    USA

Table: Some partial tuples from the UCI adult census dataset

SLIDE 7

Forbidden itemsets

SLIDE 8

Lift of an itemset

◮ Lift between two itemsets A and B:
  ◮ Number of occurrences of A ∪ B divided by the expected number of occurrences if A and B were statistically independent
◮ Lift of an itemset A:
  ◮ Maximum lift over all partitionings of A into two parts X and Y
◮ We are interested in itemsets with low lift
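
The two definitions above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names (support, lift_between, lift) are hypothetical, transactions are assumed to be frozensets of "Attribute=value" strings, and the data are the Relationship and Sex values of the six example tuples from the census table.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def lift_between(a, b, transactions):
    """Observed count of a ∪ b divided by the count expected under independence."""
    n = len(transactions)
    expected = support(a, transactions) * support(b, transactions) / n
    return support(a | b, transactions) / expected if expected else 0.0

def lift(itemset, transactions):
    """Maximum lift over all splits of `itemset` into two non-empty parts."""
    items = sorted(itemset)
    best = 0.0
    for size in range(1, len(items)):
        for part in combinations(items, size):
            x = frozenset(part)
            y = itemset - x
            best = max(best, lift_between(x, y, transactions))
    return best

# Relationship and Sex values of the six example tuples (assumed encoding)
transactions = [
    frozenset({"Relationship=Not-in-family", "Sex=Male"}),
    frozenset({"Relationship=Wife", "Sex=Female"}),
    frozenset({"Relationship=Not-in-family", "Sex=Male"}),
    frozenset({"Relationship=Wife", "Sex=Female"}),
    frozenset({"Relationship=Wife", "Sex=Female"}),
    frozenset({"Relationship=Wife", "Sex=Male"}),
]
wife_male = frozenset({"Relationship=Wife", "Sex=Male"})
print(lift(wife_male, transactions))  # 1 / (4 * 3 / 6) = 0.5
```

On these six rows, Wife occurs 4 times and Male 3 times, so 2 co-occurrences would be expected under independence. Only 1 is observed, giving a lift of 0.5.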

SLIDE 9

Converting tuples to transactions

◮ Tuple format:

Age  MaritalStatus  Relationship   Sex   Country
39   Never-married  Not-in-family  Male  USA

◮ Transaction format:

(Age=39, MaritalStatus=Never-married, Relationship=Not-in-family, Sex=Male, Country=USA)
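
The conversion above amounts to pairing each value with its attribute name. A minimal sketch, assuming items are encoded as "Attribute=value" strings (the helper name tuple_to_transaction is hypothetical):

```python
def tuple_to_transaction(row, attributes):
    """Turn one relational tuple into a set of 'Attribute=value' items."""
    return frozenset(f"{attr}={value}" for attr, value in zip(attributes, row))

attributes = ["Age", "MaritalStatus", "Relationship", "Sex", "Country"]
row = (39, "Never-married", "Not-in-family", "Male", "USA")

transaction = tuple_to_transaction(row, attributes)
print(sorted(transaction))
```

Prefixing each value with its attribute keeps equal values in different columns apart, so that a standard itemset miner can treat every tuple as a transaction.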

SLIDE 10

What are forbidden itemsets?

◮ Infrequent itemsets (support)
◮ Negative correlation between contained items (lift)

⇒ Express forbidden value combinations

◮ For example:
  ◮ (Relationship=Wife, Sex=Male)
  ◮ (Relationship=Husband, Sex=Female)
  ◮ (MaritalStatus=Divorced, Age=17)

SLIDE 11

Forbidden itemset mining

◮ Based on Eclat algorithm
◮ Maximum support threshold σ
◮ Maximum lift threshold τ
◮ Minimum support of items: θ = 1/τ
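
To make the role of the two thresholds concrete, here is a brute-force sketch. It is not Eclat and not the authors' miner: it simply enumerates every small itemset that occurs in the data and keeps those with low support and low lift. Several points are assumptions made for illustration: σ is treated as an absolute count, both comparisons are non-strict, itemsets are limited to max_size items, and the item-level bound θ is not modelled. All function names are hypothetical.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def lift(itemset, transactions):
    """Maximum lift over all splits of `itemset` into two non-empty parts."""
    n = len(transactions)
    supp = support(itemset, transactions)
    if supp == 0:
        return 0.0
    items = sorted(itemset)
    best = 0.0
    for size in range(1, len(items)):
        for part in combinations(items, size):
            x = frozenset(part)
            y = itemset - x
            best = max(best, n * supp / (support(x, transactions) * support(y, transactions)))
    return best

def mine_forbidden(transactions, sigma, tau, max_size=2):
    """Brute force: test every itemset of 2..max_size items that occurs in the data."""
    candidates = set()
    for t in transactions:
        for size in range(2, max_size + 1):
            candidates.update(frozenset(c) for c in combinations(sorted(t), size))
    return {
        c for c in candidates
        if support(c, transactions) <= sigma and lift(c, transactions) <= tau
    }

# Relationship and Sex values of the six example tuples (assumed encoding)
transactions = [
    frozenset({"Relationship=Not-in-family", "Sex=Male"}),
    frozenset({"Relationship=Wife", "Sex=Female"}),
    frozenset({"Relationship=Not-in-family", "Sex=Male"}),
    frozenset({"Relationship=Wife", "Sex=Female"}),
    frozenset({"Relationship=Wife", "Sex=Female"}),
    frozenset({"Relationship=Wife", "Sex=Male"}),
]
print(mine_forbidden(transactions, sigma=1, tau=0.5))
```

With these illustrative thresholds, only (Relationship=Wife, Sex=Male) is reported. A real miner would use a depth-first search with pruning instead of full enumeration.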

SLIDE 12

Forbidden itemset mining

SLIDE 13

Repairing dirty data

SLIDE 14

Nearest neighbour imputation

◮ Separate clean and dirty tuples
◮ Choose a similarity function
◮ For each dirty tuple, find nearest clean tuple
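
The three steps above could look as follows. This is a sketch under stated assumptions: tuples are plain dicts, forbidden itemsets are given as attribute-to-value dicts, and the similarity function (number of attributes on which two tuples agree) is just one possible choice, not necessarily the one used in the talk. All names are hypothetical.

```python
def is_dirty(row, forbidden):
    """A tuple is dirty if it contains every item of some forbidden itemset."""
    return any(all(row.get(a) == v for a, v in fi.items()) for fi in forbidden)

def similarity(a, b):
    """Assumed similarity: number of attributes on which two tuples agree."""
    return sum(1 for attr in a if a[attr] == b.get(attr))

def nearest_clean(dirty_row, clean_rows):
    """Return the clean tuple most similar to `dirty_row`."""
    return max(clean_rows, key=lambda c: similarity(dirty_row, c))

rows = [
    {"Age": 39, "MaritalStatus": "Never-married", "Relationship": "Not-in-family",
     "Sex": "Male", "Country": "USA"},
    {"Age": 37, "MaritalStatus": "Married-civ-spouse", "Relationship": "Wife",
     "Sex": "Female", "Country": "USA"},
    {"Age": 28, "MaritalStatus": "Married-civ-spouse", "Relationship": "Wife",
     "Sex": "Female", "Country": "Cuba"},
    {"Age": 29, "MaritalStatus": "Married-civ-spouse", "Relationship": "Wife",
     "Sex": "Male", "Country": "USA"},
]
forbidden = [{"Relationship": "Wife", "Sex": "Male"}]

# Step 1: separate clean and dirty tuples
clean = [r for r in rows if not is_dirty(r, forbidden)]
dirty = [r for r in rows if is_dirty(r, forbidden)]

# Steps 2 and 3: for each dirty tuple, find the nearest clean tuple
for d in dirty:
    print(d["Age"], "->", nearest_clean(d, clean)["Age"])
```

Here the only dirty tuple (Age 29) agrees with the Age 37 tuple on three attributes and with the other clean tuples on two, so the Age 37 tuple is chosen.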

SLIDE 15

Nearest neighbour imputation

◮ So we have a neighbour . . . what now?
  ◮ Copy entire tuple
  ◮ Copy attributes involved in forbidden itemsets
  ◮ Majority voting among donors
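
The last two options above can be sketched as small helpers. This is an illustration only: tuples are assumed to be dicts, the donors are assumed to be already selected, and the function names are hypothetical. Copying the entire tuple is simply dict(donor).

```python
from collections import Counter

def copy_forbidden_attributes(dirty_row, donor, forbidden_attrs):
    """Overwrite only the attributes that take part in a forbidden itemset."""
    repaired = dict(dirty_row)
    for attr in forbidden_attrs:
        repaired[attr] = donor[attr]
    return repaired

def majority_vote(dirty_row, donors, forbidden_attrs):
    """For each involved attribute, take the most common value among the donors."""
    repaired = dict(dirty_row)
    for attr in forbidden_attrs:
        repaired[attr] = Counter(d[attr] for d in donors).most_common(1)[0][0]
    return repaired

dirty_row = {"Age": 29, "MaritalStatus": "Married-civ-spouse",
             "Relationship": "Wife", "Sex": "Male", "Country": "USA"}
donors = [
    {"Age": 37, "MaritalStatus": "Married-civ-spouse",
     "Relationship": "Wife", "Sex": "Female", "Country": "USA"},
    {"Age": 28, "MaritalStatus": "Married-civ-spouse",
     "Relationship": "Wife", "Sex": "Female", "Country": "Cuba"},
    {"Age": 39, "MaritalStatus": "Never-married",
     "Relationship": "Not-in-family", "Sex": "Male", "Country": "USA"},
]
forbidden_attrs = ["Relationship", "Sex"]

print(copy_forbidden_attributes(dirty_row, donors[0], forbidden_attrs))
print(majority_vote(dirty_row, donors, forbidden_attrs))
```

Both strategies leave Age, MaritalStatus and Country untouched and change Sex to Female in this example. Note that voting per attribute does not by itself guarantee that the resulting combination is clean.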

SLIDE 16

A problem!

◮ Repairing may cause itemsets to become forbidden!
◮ Solution:
  ◮ Find the number of errors ε
  ◮ Re-mine all itemsets that may become forbidden after ε edits

SLIDE 17

Demo


SLIDE 21

Source code

◮ Available soon at:

http://adrem.ua.ac.be/joerirammelaere

SLIDE 22

The end . . .

Thank you for your attention! Questions?