Data Cleaning & Integration Duen Horng (Polo) Chau Georgia Tech - - PowerPoint PPT Presentation



slide-1
SLIDE 1

http://poloclub.gatech.edu/cse6242


CSE6242 / CX4242: Data & Visual Analytics


Data Cleaning & Integration

Duen Horng (Polo) Chau
 Georgia Tech

Partly based on materials by 
 Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

slide-2
SLIDE 2

Last Time

Big data analytics building blocks

Data collection & simple data storage

  • Why SQLite?
  • Simplicity: nothing to install/maintain, database in a single file
  • Popular: cross-platform, cross-device
  • SQL basics (create table, join, create index, etc.)


Collection Cleaning Integration Visualization Analysis Presentation Dissemination

slide-3
SLIDE 3

Data Cleaning


Why can data be dirty?

slide-4
SLIDE 4

Examples



 How dirty is real data?

slide-5
SLIDE 5

Examples

  • duplicates
  • empty rows
  • abbreviations (different kinds)
  • differences in scale / inconsistent descriptions / units sometimes included
  • typos
  • missing values
  • trailing spaces
  • incomplete cells
  • synonyms of the same thing
  • skewed distribution (outliers)
  • bad formatting / not in relational format (in a format not expected)
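
Several of the problems listed above can be handled with a few lines of code. The following is a minimal sketch using only Python's standard library; the sample data, the column names, and the `STATE_SYNONYMS` table are all hypothetical and made up for illustration. It strips stray whitespace, skips empty rows, maps synonyms to one canonical form, marks missing values explicitly, and drops duplicates.

```python
import csv
import io

# Hypothetical raw data showing several problems from the list above:
# a trailing space, a duplicate, an empty row, a missing value, and
# different spellings of the same state.
RAW = """name,state,height
Smith ,GA,180
Smith,GA,180
,,
Johnson,N.Y.,
Lee,Georgia,175
"""

# Assumed synonym table; a real one would come from domain knowledge.
STATE_SYNONYMS = {"N.Y.": "NY", "Georgia": "GA"}


def clean(rows):
    """Return cleaned rows as a list of dicts."""
    seen = set()
    cleaned = []
    for row in rows:
        # Strip leading/trailing whitespace from every cell.
        row = {key: (value or "").strip() for key, value in row.items()}
        # Skip rows in which every cell is empty.
        if not any(row.values()):
            continue
        # Map synonyms and abbreviations to one canonical form.
        row["state"] = STATE_SYNONYMS.get(row["state"], row["state"])
        # Represent missing values explicitly as None.
        row = {key: (value if value else None) for key, value in row.items()}
        # Drop exact duplicates (compared after normalization).
        fingerprint = tuple(row.items())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


rows = clean(csv.DictReader(io.StringIO(RAW)))
for row in rows:
    print(row)
```

Typos, outliers, and inconsistent units are not handled here; those usually need domain knowledge or an interactive tool.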


(Fall’14)


How dirty is real data?

slide-6
SLIDE 6

More to read

Big Data's Dirty Problem [Fortune]


http://fortune.com/2014/06/30/big-data-dirty-problem/

A Taxonomy of Dirty Data [Won Kim+]


http://sci2s.ugr.es/docencia/m1/KimTaxonomy03.pdf
 (Very detailed, may be slightly outdated)


For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights [New York Times]

http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0


slide-7
SLIDE 7
slide-8
SLIDE 8

Data Cleaners

Watch videos

  • Open Refine (previously Google Refine)
  • Data Wrangler (research at Stanford)

Write down

  • Examples of data dirtiness
  • Tool features demoed (or that you like)

Will collectively summarize similarities and differences afterwards

Open Refine: http://openrefine.org
Data Wrangler: http://vis.stanford.edu/wrangler/

slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11

How are the tools similar or different?

G = Google Refine
W = Data Wrangler


slide-12
SLIDE 12


The videos only show some of the tools’ features. Try them out.

Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/

slide-13
SLIDE 13

Data Integration

slide-14
SLIDE 14

Course Overview

Collection Cleaning Integration Visualization Analysis Presentation Dissemination

slide-15
SLIDE 15

What is Data Integration? Why is it Important?

slide-16
SLIDE 16


Data Integration

Combining data from different sources to provide the user with a unified view

As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges

How to help people effectively leverage multiple data sources?


(People: analysts, researchers, practitioners, etc.)

slide-17
SLIDE 17

Examples of businesses based on data integration

slide-18
SLIDE 18
slide-19
SLIDE 19
slide-20
SLIDE 20
slide-21
SLIDE 21

Mashup

slide-22
SLIDE 22

More Examples?

  • [FREE] Mint: account app that integrates multiple accounts (credit card, bank, etc.) and can parse receipts
  • Google News
  • Crime mapping
  • Feedly
  • apps that check gas prices, coupons
  • Zillow / Trulia / Redfin
  • IMDb (movie database)
  • Coin: combines multiple credit cards
  • eBay


slide-23
SLIDE 23

More Examples?

  • Palantir Gotham
  • Yelp: restaurant reviews, business reviews
  • Facebook friend requests: look at your friends’ friends and recommend them as your friends
  • Trulia / Zillow (real estate sites)
  • Graph Search (Facebook)
  • Waze
  • Yahoo Pipes
  • Google search engine
  • Google Transit
  • Google Now / Apple Siri
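
The friends-of-friends idea in the Facebook bullet above can be sketched in a few lines. This is only an illustration, not Facebook's actual method: the `FRIENDS` graph is made up, and ranking candidates by number of mutual friends is an assumption added here.

```python
from collections import Counter

# Hypothetical friendship graph: each person maps to a set of friends.
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}


def recommend(person, friends):
    """Suggest friends of friends, ranked by number of mutual friends."""
    direct = friends[person]
    mutual_counts = Counter()
    for friend in direct:
        for candidate in friends[friend]:
            # Skip the person themselves and people they already know.
            if candidate != person and candidate not in direct:
                mutual_counts[candidate] += 1
    # Most mutual friends first; ties broken alphabetically (an assumption).
    return sorted(mutual_counts.items(), key=lambda item: (-item[1], item[0]))


suggestions = recommend("ann", FRIENDS)
print(suggestions)  # [('dan', 2), ('eve', 1)]
```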


slide-24
SLIDE 24

How to do data integration?

slide-25
SLIDE 25

“Low” Effort Approaches

Use database’s “Join”! (e.g., SQLite)

Google Refine
http://code.google.com/p/google-refine/ (video #3)

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA

Joined on id:

id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA
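
The join on this slide can be reproduced with Python's built-in `sqlite3` module. The sketch below is one possible way to do it: the table names `names` and `states` are assumptions (the slide does not name its tables), and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

# In-memory database; the table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names  (id INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT);
    INSERT INTO names  VALUES (111, 'Smith'), (222, 'Johnson'), (333, 'Obama');
    INSERT INTO states VALUES (111, 'GA'), (222, 'NY'), (333, 'CA');
""")

# Inner join: keep only ids that appear in both tables.
rows = conn.execute("""
    SELECT names.id, names.name, states.state
    FROM names
    JOIN states ON names.id = states.id
    ORDER BY names.id
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

An inner join silently drops ids that are missing from either table; a `LEFT JOIN` would keep every row of `names` and fill the missing state with NULL.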

slide-26
SLIDE 26

Crowd-sourcing Approaches: Freebase

http://wiki.freebase.com/wiki/What_is_Freebase%3F

slide-27
SLIDE 27

Freebase


(a graph of entities)


 “…a large collaborative knowledge base consisting of metadata composed mainly by its community members…”

Source: Wikipedia.

slide-28
SLIDE 28

So what? 


What can you do with Freebase?


Hint: Google acquired it in 2010
Freebase to move over to Wikidata in July (2015): http://goo.gl/3ZDTg7

slide-29
SLIDE 29

http://www.google.com/insidesearch/features/search/knowledge.html

slide-30
SLIDE 30

Given a graph of entities, like Freebase, what other cool things can you do?


slide-31
SLIDE 31

https://www.facebook.com/about/graphsearch

slide-32
SLIDE 32

Facebook’s 
 Graph Search

Integrate your friends’ info with yours


slide-33
SLIDE 33

Feldspar

Finding Information by Association.
 CHI 2008


Polo Chau, Brad Myers, Andrew Faulring


Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E

slide-34
SLIDE 34
slide-35
SLIDE 35

Summary for data integration

Opportunities

  • enable new services (Siri, PadMapper)
  • enable new ways to discover info
  • improve existing services
  • reduce redundancy
  • new ways to interact with data
  • promote knowledge transfer (e.g., between companies)
