Data Cleaning & Integration Duen Horng (Polo) Chau Georgia Tech - - PowerPoint PPT Presentation



slide-1
SLIDE 1

http://poloclub.gatech.edu/cse6242


CSE6242 / CX4242: Data & Visual Analytics


Data Cleaning & Integration

Duen Horng (Polo) Chau
 Georgia Tech

Partly based on materials by 
 Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

slide-2
SLIDE 2

Data Cleaning


Why can data be dirty?

slide-3
SLIDE 3

Examples



 How dirty is real data?

http://blogs.verdantis.com/wp-content/uploads/2015/02/Data-cleansing.jpg

slide-4
SLIDE 4

Examples

  • duplicates
  • empty rows
  • abbreviations (different kinds)
  • differences in scale / inconsistent descriptions / units sometimes included
  • typos
  • missing values
  • trailing spaces
  • incomplete cells
  • synonyms of the same thing
  • skewed distribution (outliers)
  • bad formatting / not in relational format (in a format not expected)
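A few of the issues listed above (trailing spaces, empty rows, duplicates, abbreviations, missing values) can be handled with a short script. The following is a minimal sketch in plain Python, not a general-purpose cleaner. The sample rows and the `ABBREVIATIONS` mapping are made up for illustration.

```python
# Hypothetical mapping from abbreviations/synonyms to one canonical form.
ABBREVIATIONS = {
    "ga": "Georgia",
    "georgia": "Georgia",
    "ny": "New York",
    "new york": "New York",
}

def clean_rows(rows):
    """Trim whitespace, drop empty rows, unify abbreviations, remove duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        # Trim leading/trailing spaces in every cell.
        cells = [c.strip() if c is not None else "" for c in row]
        if not any(cells):
            continue  # skip rows where every cell is empty
        name, state = cells  # this sketch assumes exactly two columns
        # Map abbreviations to the canonical form; blank cells become None.
        state = ABBREVIATIONS.get(state.lower(), state) or None
        record = (name or None, state)
        if record in seen:
            continue  # skip duplicates (after normalization)
        seen.add(record)
        cleaned.append(record)
    return cleaned

raw = [
    ["Smith ", "GA"],        # trailing space, abbreviation
    ["Smith", "Georgia"],    # duplicate once normalized
    ["", ""],                # empty row
    ["Johnson", " ny"],      # leading space, lowercase abbreviation
    ["Obama", ""],           # missing value
]
cleaned = clean_rows(raw)
```

Note that the duplicate is only detected after normalization, which is why the order of cleaning steps matters.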


(Previous semester)


How dirty is real data?

slide-5
SLIDE 5

More to read

Big Data's Dirty Problem [Fortune]


http://fortune.com/2014/06/30/big-data-dirty-problem/

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights [New York Times]

http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0

slide-6
SLIDE 6

Data Janitor

slide-7
SLIDE 7

Data Cleaners

Watch videos

  • Open Refine (previously Google Refine)
  • Data Wrangler (research at Stanford)

Write down

  • Examples of data dirtiness
  • Tool features demonstrated (or that you like)

We will collectively summarize the similarities and differences afterwards.

Open Refine: http://openrefine.org
Data Wrangler: http://vis.stanford.edu/wrangler/

slide-8
SLIDE 8
slide-9
SLIDE 9
slide-10
SLIDE 10

How are the tools similar or different?

  • [G] cluster similar entities (e.g., T&M), synonyms
  • [G, W] history
  • [G] trailing spaces
  • [W] text extraction
  • [W] export script, code (work on other systems? interoperability)

  • [W, G] one-click (usability)
  • [G] distribution of data (apply log scale)
  • [W] pivot data (unfold)
  • [W] suggestions (even more usable)
  • [W + G] preview changes

G = Google Refine
W = Data Wrangler
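One simple way to cluster similar entities is to group values that share a normalized key (OpenRefine's documentation calls this family of methods "key collision"). The sketch below is a simplified approximation of that idea, not OpenRefine's actual code. The function names and sample values are illustrative.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Build a key that ignores case, punctuation, extra spaces, and word order."""
    value = value.strip().lower()
    value = value.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values with the same fingerprint; keep only groups of 2 or more."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

names = ["Georgia Tech", "georgia tech ", "Tech, Georgia", "Stanford"]
clusters = cluster(names)
```

A key like this catches formatting variants but not true synonyms or typos, which need a similarity function instead.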


slide-11
SLIDE 11

The videos show only some of the tools’ features. Try them out.

Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/

slide-12
SLIDE 12

Data Integration

slide-13
SLIDE 13

Course Overview

Collection → Cleaning → Integration → Visualization → Analysis → Presentation → Dissemination

slide-14
SLIDE 14

What is Data Integration? Why is it Important?

slide-15
SLIDE 15


Data Integration

Combining data from different sources to provide the user with a unified view

As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges

How to help people effectively leverage multiple data sources?


(People: analysts, researchers, practitioners, etc.)

slide-16
SLIDE 16

Example businesses that derive value from data integration

slide-17
SLIDE 17
slide-18
SLIDE 18
slide-19
SLIDE 19

Craigslist now has map view! 
 What problem has it solved?

https://atlanta.craigslist.org/search/hhh

slide-20
SLIDE 20
slide-21
SLIDE 21

More Examples?

  • Google Now
  • Yelp?
  • Amazon: different kinds of products (dpreview.com)
  • Dealnews, slickdeals, fatwallet?
  • tinder (facebook and instagram)
  • facebook (news stories, etc.)
  • walmart (different merchants)
  • search engines
  • ? any websites with advertising (e.g., new york times)


slide-22
SLIDE 22

More Examples?

  • [FREE] Mint: account app, integrates multiple accounts (credit card, bank, etc.), can parse receipts

  • Google News
  • Crime mapping
  • Feedly
  • apps that check gas prices, coupons
  • zillow-trulia/redfin
  • imdb (movie database)
  • Coin: combines multiple credit cards
  • ebay

(Previous semester)

slide-23
SLIDE 23

More Examples?

  • Palantir gotham
  • Yelp: restaurant reviews, business reviews
  • Facebook friend requests: look at your friends’ friends and recommend them as your friends

  • Trulia / zillow (real estate sites)
  • graph search (facebook)
  • waze
  • Yahoo Pipes
  • google search engine
  • google transit
  • google now / apple siri

(Previous semester)

slide-24
SLIDE 24

How to do data integration?

slide-25
SLIDE 25

“Low” Effort Approaches

Use database’s “Join”! (e.g., SQLite)
 When would this approach work? 
 (Or when won’t it work?)


Joined result:

id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

Source tables:

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA
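The join on the example tables can be sketched with Python's built-in `sqlite3` module and an in-memory SQLite database. The table names `names` and `states` are assumptions chosen for illustration; the rows are the ones from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE names  (id INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT);
    INSERT INTO names  VALUES (111, 'Smith'), (222, 'Johnson'), (333, 'Obama');
    INSERT INTO states VALUES (111, 'GA'), (222, 'NY'), (333, 'CA');
""")

# Inner join: combine rows from both tables that share the same id.
rows = conn.execute("""
    SELECT names.id, names.name, states.state
    FROM names
    JOIN states ON names.id = states.id
    ORDER BY names.id
""").fetchall()
conn.close()
```

A join like this relies on both tables sharing a clean, consistent key. An inner join silently drops rows whose key is missing or formatted differently in one of the sources.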

Google Refine


http://code.google.com/p/google-refine/ (video #3)

slide-26
SLIDE 26

Crowd-sourcing Approaches: Freebase

http://wiki.freebase.com/wiki/What_is_Freebase%3F

slide-27
SLIDE 27

Freebase


(a graph of entities)


 “…a large collaborative knowledge base consisting of metadata composed mainly by its community members…”

Source: Wikipedia.

slide-28
SLIDE 28

So what? 


What can you do with Freebase?


Hint: Google acquired it in 2010
 Freebase to move over to Wikidata in July (2015): http://goo.gl/3ZDTg7

slide-29
SLIDE 29

Given a graph of entities, like Freebase, what other cool things can you do?


slide-30
SLIDE 30

https://www.facebook.com/about/graphsearch

https://www.youtube.com/watch?v=W3k1USQbq80

slide-31
SLIDE 31

https://www.youtube.com/watch?v=mmQl6VGvX-c

slide-32
SLIDE 32

Feldspar

Finding Information by Association.
 CHI 2008


Polo Chau, Brad Myers, Andrew Faulring


Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E

slide-33
SLIDE 33
slide-34
SLIDE 34

We need ways to identify the many ways that one thing may be called. How?


(Screenshot from Freebase video)

slide-35
SLIDE 35

Entity Resolution


(A hard problem in data integration)
 


Polo Chau
P. Chau
Duen Horng Chau
Duen Chau
D. Chau
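To see why this is hard, consider a naive rule: two names refer to the same person if the surnames match and the first given names are equal or one is the other's initial. The sketch below is a hypothetical heuristic written for illustration, not a method from the slides. It assumes non-empty names written in "given ... surname" order.

```python
def parse(name):
    """Split a name into (given-name tokens, surname), ignoring periods and case."""
    tokens = name.replace(".", " ").split()
    return [t.lower() for t in tokens[:-1]], tokens[-1].lower()

def compatible(a, b):
    """Naive rule: same surname, and first given names equal or matching initial."""
    given_a, last_a = parse(a)
    given_b, last_b = parse(b)
    if last_a != last_b or not given_a or not given_b:
        return False
    x, y = given_a[0], given_b[0]
    if len(x) == 1 or len(y) == 1:
        return x[0] == y[0]  # one side is only an initial
    return x == y

same_person = compatible("Duen Horng Chau", "D. Chau")       # True
missed_nickname = compatible("Polo Chau", "Duen Horng Chau") # False
```

The rule links "D. Chau" to "Duen Horng Chau" but misses the nickname "Polo", and it would also wrongly link any two different people who share a surname and a first initial.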

slide-36
SLIDE 36

Why is entity resolution so Important?

Case Study 
 Let’s shop for an iPhone 6 on 
 Apple, Amazon and eBay

slide-37
SLIDE 37
slide-38
SLIDE 38
slide-39
SLIDE 39
slide-40
SLIDE 40

D-Dupe

Interactive Data Deduplication and Integration

TVCG 2008
 


University of Maryland
 Bilgic, Licamele, Getoor, Kang, Shneiderman

http://www.cs.umd.edu/projects/linqs/ddupe/ (skip to 0:55)
http://linqs.cs.umd.edu/basilic/web/Publications/2008/kang:tvcg08/kang-tvcg08.pdf

slide-41
SLIDE 41
slide-42
SLIDE 42

[Figure: Polo, Paolo, Alice, Bob, Carol, Dave]

slide-43
SLIDE 43

Numerous similarity functions

  • Euclidean distance (Euclidean norm / L2 norm)
  • Taxicab/Manhattan distance
  • Jaccard similarity (e.g., used with w-shingles), e.g., overlap of nodes’ neighbors
  • String edit distance, e.g., “Polo Chau” vs “Polo Chan”
  • Canberra distance (compare ranked items)
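The functions listed above can be sketched in a few lines of plain Python using their standard textbook definitions. Edit distance here means Levenshtein distance. Returning 1.0 for the Jaccard similarity of two empty sets and skipping 0/0 terms in the Canberra distance are conventions assumed for this sketch.

```python
import math

def euclidean(p, q):
    """L2 distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Taxicab (L1) distance."""
    return sum(abs(a - b) for a, b in zip(p, q))

def canberra(p, q):
    """Sum of |a - b| / (|a| + |b|); terms where both values are 0 are skipped."""
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(p, q) if a or b)

def jaccard(s, t):
    """Size of the intersection divided by size of the union."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t) if s | t else 1.0

def shingles(text, w=2):
    """Set of all runs of w consecutive words (w-shingles)."""
    tokens = text.split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete from s
                            curr[j - 1] + 1,             # insert into s
                            prev[j - 1] + (cs != ct)))   # substitute or match
        prev = curr
    return prev[-1]

d = edit_distance("Polo Chau", "Polo Chan")  # one substitution: u -> n
```

Distances grow as items differ, while Jaccard is a similarity that grows as items overlap, so the two kinds of scores have to be put on a common scale before they are combined.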

Excellent read: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf

slide-44
SLIDE 44

Core components: Similarity functions

Determine how similar two entities are.
D-Dupe’s approach: attribute similarity + relational similarity


Similarity score for a pair of entities

slide-45
SLIDE 45


Attribute similarity (a weighted sum)
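A weighted sum of per-attribute similarities might be sketched as follows. This illustrates the general idea only and is not D-Dupe's actual formula. The attribute names, the weights, and the use of `difflib.SequenceMatcher` as the per-attribute string similarity are all assumptions.

```python
from difflib import SequenceMatcher

def string_sim(a, b):
    """Similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def attribute_similarity(e1, e2, weights):
    """Weighted sum of per-attribute similarities, normalized by total weight."""
    total = sum(weights.values())
    return sum(w * string_sim(e1[attr], e2[attr])
               for attr, w in weights.items()) / total

# Hypothetical weights: the name matters more than the affiliation.
weights = {"name": 0.7, "affiliation": 0.3}
a = {"name": "Polo Chau", "affiliation": "Georgia Tech"}
b = {"name": "Polo Chan", "affiliation": "Georgia Tech"}
score = attribute_similarity(a, b, weights)
```

Dividing by the total weight keeps the score in [0, 1] even when the weights do not sum to 1.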

slide-46
SLIDE 46

Summary for data integration

  • Enable new services
  • Improve existing services
  • New ways to interact with data
