Powerset Explorer: A Datamining Application

Jordan Lee

Background

PAST

– Datamining accomplished with human intuition

PRESENT

– Computer aided with AI and brute force CPU cycles

FUTURE

– Enter PowersetViewer….

Dataset

  • Alphabet

– Items that can be found in transactions

– E.g. apples, bread, chips

  • Transaction

– Sets of items (unordered)

– E.g. Tx1 = { Apples, Chips }

– E.g. Tx2 = { Bread }

  • Transaction database

– Collection of transactions (unordered, possibly containing duplicates)

– E.g. Walmart transaction logs
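
The three terms above can be sketched in Java. This is a minimal illustration, not code from the project: the class name `TransactionModel`, its methods, and the third sample transaction are assumptions made for the example. A `List` is used for the database only because it allows duplicates; its ordering carries no meaning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the dataset vocabulary from the slide; all names here are illustrative. */
final class TransactionModel {

    /** The alphabet: every item that may appear in a transaction. */
    static final Set<String> ALPHABET = Set.of("Apples", "Bread", "Chips");

    /** A transaction is an unordered set of items drawn from the alphabet. */
    static Set<String> transaction(String... items) {
        Set<String> tx = new TreeSet<>();
        for (String item : items) {
            if (!ALPHABET.contains(item)) {
                throw new IllegalArgumentException("Not in alphabet: " + item);
            }
            tx.add(item);
        }
        return tx;
    }

    /** A transaction database may hold the same transaction more than once. */
    static List<Set<String>> database() {
        List<Set<String>> db = new ArrayList<>();
        db.add(transaction("Apples", "Chips")); // Tx1
        db.add(transaction("Bread"));           // Tx2
        db.add(transaction("Chips", "Apples")); // assumed extra entry: same set as Tx1
        return db;
    }

    public static void main(String[] args) {
        List<Set<String>> db = database();
        System.out.println(db.size());                   // number of transactions
        System.out.println(db.get(0).equals(db.get(2))); // item order does not matter
    }
}
```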

Example Dataset

Student enrollment database

– Alphabet = courses

{ CPSC124, CPSC126, PHIL120, ANTH100, ENGL112 }

– Transaction = courses student is enrolled in

#29389002 -> { CPSC124, PHIL120, ENGL112 }

– Transaction DB = list of student course schedules

Example Dataset (cont’d)

72423298 5 676 1701 3046 3900 1327
38578546 7 175 178 1182 1701 3038 680 3912
7660625 5 326 676 1701 3038 3908
43359163 3 1177 1699 4317
26495781 6 676 1177 1701 3038 3900 4275
48536452 4 1699 2339 1327 2826
64251972 6 676 1177 1701 3038 3900 2549
23212318 5 676 1701 3040 3813 3900
19820119 5 104 676 1699 3038 3900
65954629 4 480 676 3040 3908
54392012 5 676 1701 3038 3813 3899
85833501 5 676 1699 3040 3813 3900
65136197 5 676 1699 3038 3900 2580
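
Each record in the listing appears to consist of an identifier, an item count, and then that many item codes. The slide does not document this layout, so it is an inference from the data. Under that assumption, a record could be read as follows; `RecordParser` and `Row` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for one flat record; the layout is inferred, not documented. */
final class RecordParser {

    /** One parsed record: an identifier plus its item codes. */
    static final class Row {
        final long id;
        final List<Integer> items;

        Row(long id, List<Integer> items) {
            this.id = id;
            this.items = items;
        }
    }

    /** Parses "id count item1 ... itemN" and checks that count matches N. */
    static Row parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Too few fields: " + line);
        }
        long id = Long.parseLong(fields[0]);
        int count = Integer.parseInt(fields[1]);
        if (fields.length != count + 2) {
            throw new IllegalArgumentException("Count does not match items: " + line);
        }
        List<Integer> items = new ArrayList<>();
        for (int i = 2; i < fields.length; i++) {
            items.add(Integer.parseInt(fields[i]));
        }
        return new Row(id, items);
    }

    public static void main(String[] args) {
        Row r = parse("43359163 3 1177 1699 4317");
        System.out.println(r.id + " -> " + r.items); // prints "43359163 -> [1177, 1699, 4317]"
    }
}
```

The count check makes a record that was split or merged during extraction fail loudly instead of being misread silently.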

Why?

Why is this interesting?

– Consumer transaction logs -> trends in consumer buying

– Student enrollment database -> trends in enrollment

What electives do most undergrad computer science students take?

Departments can determine which joint majors would fit the student population.

Why? (cont’d)

Dataset sizes growing exponentially

– Human intuition has reached its limits

– Require computers and AI (expensive)

– Information visualization can scale the power of human intuition

Powerset Explorer

Code base from TreeJuxtaposer (Munzner)

– AccordionDrawer package

Goals

– Focus + context exploration using grids

– Guaranteed visibility

– Marking of groups

Milestones Status Update

#1 Completion of the basic visualization of a randomized database of small set size (~10)

#2 Addition of a single level of “marking”.

#3 Addition of multiple levels of “marking” (6)

#4 Addition of background marking to demarcate areas of sets containing different amounts of items.

#5 Implement multiple constraints

#6 Increase maximum possible dataset size to at least 100.

Difficulties

Multiple constraints difficult to implement on current server-side dataminer

Cannot enumerate a powerset of alphabet size greater than 14 elements (integer = 32 bits)

– Solution: use Java class BigInteger

High CPU and memory usage

– Solution: upgrade computer! hack
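
The slide names `BigInteger` as the fix but does not show how it is used. One plausible reading is that each subset of the alphabet is identified by a bit mask with one bit per item, so the mask width grows with the alphabet. The sketch below illustrates that idea only. The encoding and the class name `PowersetIndex` are assumptions, not the project's actual code.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Sketch of subset <-> index encoding with BigInteger; not the project's actual code. */
final class PowersetIndex {

    /** Encodes a subset (given as item positions in the alphabet) as a bit mask. */
    static BigInteger encode(int... itemPositions) {
        BigInteger mask = BigInteger.ZERO;
        for (int pos : itemPositions) {
            mask = mask.setBit(pos); // bit `pos` set means item `pos` is in the subset
        }
        return mask;
    }

    /** Decodes a bit mask back into the ascending list of item positions it contains. */
    static List<Integer> decode(BigInteger mask) {
        List<Integer> positions = new ArrayList<>();
        for (int pos = 0; pos < mask.bitLength(); pos++) {
            if (mask.testBit(pos)) {
                positions.add(pos);
            }
        }
        return positions;
    }

    /** Number of subsets of an alphabet with n items: 2^n, exact for any n >= 0. */
    static BigInteger powersetSize(int n) {
        return BigInteger.ONE.shiftLeft(n);
    }

    public static void main(String[] args) {
        // Enumerate every subset of a 3-item alphabet by counting from 0 to 2^3 - 1.
        BigInteger limit = powersetSize(3);
        for (BigInteger i = BigInteger.ZERO; i.compareTo(limit) < 0; i = i.add(BigInteger.ONE)) {
            System.out.println(i + " -> " + decode(i));
        }
        // The subset count for a 100-item alphabet is still exactly representable.
        System.out.println(powersetSize(100));
    }
}
```

`BigInteger` removes the fixed-width limit on the index itself. It does not make full enumeration of a large powerset any cheaper, which fits the CPU and memory concern listed above.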

Current Status

  • Reduced database

8680433 3 0 7 5
2768129 2 6 4
6385608 5 1 9 10 9 11
147924 5 5 2 9 5 2
234140 3 11 4 8
4331093 4 4 6 0 0
3158394 5 12 1 12 5 4
5797538 6 11 4 3 13 12 4
6243191 1 5
5872060 4 3 8 9 6

  • Property file

0 CPSC 325 75.0 3
1 PHIL 327 84.0 1
2 ANTH 329 45.0 2
3 MATH 327 0.0 3
4 PSYC 328 0.0 1
5 ENGL 329 0.0 2
6 APSC 540 0.0 1
7 MECH 541 0.0 1
8 STAT 543 0.0 1
9 SPAN 201 71.0 1
10 FREN 258 76.0 2
11 ECON 260 84.0 1
12 LING 295 42.0 1
13 EECE 302 73.0 1
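
The item codes in the reduced database (0 to 13) appear to line up with the first column of the property file, which suggests the file maps each code to a course. That link is an assumption, and the slide does not say what the last two columns mean. A sketch of such a lookup, using the hypothetical class `PropertyLookup` and ignoring the unexplained columns:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lookup from item index to course label, built from property-file lines. */
final class PropertyLookup {

    /** Reads lines of the assumed form "index DEPT number value value". */
    static Map<Integer, String> load(List<String> lines) {
        Map<Integer, String> labels = new HashMap<>();
        for (String line : lines) {
            String[] f = line.trim().split("\\s+");
            if (f.length < 3) {
                throw new IllegalArgumentException("Too few fields: " + line);
            }
            // The two trailing numeric fields are ignored: their meaning is not given.
            labels.put(Integer.parseInt(f[0]), f[1] + f[2]);
        }
        return labels;
    }

    /** Replaces each item index with its course label, or "?index" if unknown. */
    static List<String> translate(List<Integer> items, Map<Integer, String> labels) {
        List<String> out = new ArrayList<>();
        for (int item : items) {
            out.add(labels.getOrDefault(item, "?" + item));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> labels = load(List.of(
                "0 CPSC 325 75.0 3",
                "5 ENGL 329 0.0 2",
                "7 MECH 541 0.0 1"));
        // Items of the first reduced record, "8680433 3 0 7 5".
        System.out.println(translate(List.of(0, 7, 5), labels)); // prints "[CPSC325, MECH541, ENGL329]"
    }
}
```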

Questions?