Data Cleaning & Integration Duen Horng (Polo) Chau Georgia Tech - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Data Cleaning & Integration

CSE6242 / CX4242

Aug 28, 2014

Duen Horng (Polo) Chau Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

slide-2
SLIDE 2

Last Time

Big data analytics building blocks

Data collection & simple data storage

  • Why SQLite?
  • Simplicity: nothing to install/maintain, database in a single file
  • Popular: cross-platform, cross-device
  • SQL basics (create table, join, create index, etc.)

Collection Cleaning Integration Visualization Analysis Presentation Dissemination

slide-3
SLIDE 3

Data Cleaning


How dirty is real data?

slide-4
SLIDE 4

Data Cleaners

Watch videos

  • Open Refine (previously Google Refine)
  • Data Wrangler (research at Stanford)

Write down

  • Examples of data dirtiness
  • Tool features demoed (or that you like)

Will collectively summarize similarities and differences afterwards

Open Refine: http://openrefine.org
Data Wrangler: http://vis.stanford.edu/wrangler/

slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7

How dirty is real data?

Examples

  • duplicates
  • empty rows
  • abbreviations (different kinds)
  • differences in scale / inconsistent descriptions / values that sometimes include units
  • typos
  • missing values
  • trailing spaces
  • incomplete cells
  • synonyms of the same thing
  • skewed distribution (outliers)
  • bad formatting / not in relational format (in a format not expected)
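A few of the problems listed above (trailing spaces, empty rows, missing values, units mixed into values, duplicates) can be handled with a short script. The following is only a minimal sketch: the two-column layout, the `parse_weight_kg` and `clean_rows` helpers, and the rule that a bare number means kilograms are all assumptions made for illustration, not something taken from the slides or from either tool.

```python
import re

# Hypothetical pattern: a number optionally followed by "kg" or "g".
_WEIGHT = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(kg|g)?\s*$", re.IGNORECASE)

def parse_weight_kg(text):
    """Return the weight in kilograms, or None if the cell is missing or unparseable."""
    if text is None:
        return None
    match = _WEIGHT.match(text)
    if not match:
        return None
    value = float(match.group(1))
    unit = (match.group(2) or "kg").lower()  # assumption: bare numbers are kilograms
    return value / 1000.0 if unit == "g" else value

def clean_rows(rows):
    """Trim whitespace, drop empty rows, normalize units, and remove duplicates."""
    seen = set()
    cleaned = []
    for name, weight in rows:
        name = (name or "").strip()
        weight_text = (weight or "").strip()
        if not name and not weight_text:
            continue  # empty row
        record = (name, parse_weight_kg(weight_text))
        if record in seen:
            continue  # duplicate once values are normalized
        seen.add(record)
        cleaned.append(record)
    return cleaned

raw = [
    ("apple ", "150 g"),   # trailing space, grams
    ("apple", "0.15kg"),   # same item, different unit
    ("", ""),              # empty row
    ("pear", ""),          # missing value
    ("melon", "2"),        # no unit given
]
print(clean_rows(raw))
```

Note that the two "apple" rows only collapse into one after the units are normalized, which is why cleaning steps are usually applied before de-duplication.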


slide-8
SLIDE 8

How are the tools similar or different?

  • [G + W] can track changes (undo, redo, roll back)
  • [G] aggregation of similar-spelling items
  • [W] can import through copy and paste
  • [G] can import data through URL
  • [W] generates code/scripts
  • [G + W] can do value transformations (e.g., log)
  • [G] can do clustering
  • [W] can build graphs/charts
  • [W] can learn from your actions
  • [G + W] do sorting
  • [G + W] your data is “safe” (desktop app)
  • [W] calculated fields (similar to Excel)
  • [G] overview of data values (e.g., histogram/distribution plot)

G = Google Refine, W = Data Wrangler
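The "aggregation of similar-spelling items" and clustering features can be approximated with a key-collision approach: reduce each value to a normalized key and group the values whose keys match. The sketch below is loosely modeled on that idea and is not Refine's actual implementation; the `fingerprint` and `cluster` functions and the sample values are hypothetical.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Normalize a string: trim, lowercase, strip punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values that share a fingerprint; keep only groups with differing spellings."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

names = ["Georgia Tech", "georgia tech ", "Tech, Georgia", "Stanford", "Georgia  Tech."]
print(cluster(names))
```

This kind of keying catches differences in case, spacing, punctuation, and word order, but not typos; those would need a distance-based method instead.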

slide-9
SLIDE 9


The videos only show some of the tools’ features. Try them out.

Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/

slide-10
SLIDE 10

Data Integration

slide-11
SLIDE 11

Course Overview

Collection Cleaning Integration Visualization Analysis Presentation Dissemination

slide-12
SLIDE 12

What is Data Integration? Why is it Important?

slide-13
SLIDE 13


Data Integration

Combining data from different sources to provide the user with a unified view.

As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges.

How to help people effectively leverage multiple data sources?


(People: analysts, researchers, practitioners, etc.)

slide-14
SLIDE 14

Examples of businesses based on data integration

slide-15
SLIDE 15
slide-16
SLIDE 16
slide-17
SLIDE 17
slide-18
SLIDE 18

Mashup

slide-19
SLIDE 19

More Examples?

  • [FREE] Mint: account app that integrates multiple accounts (credit card, bank, etc.) and can parse receipts
  • Google News
  • Crime mapping
  • Feedly
  • Apps that check gas prices, coupons
  • Zillow / Trulia / Redfin
  • IMDb (movie database)
  • Coin: combines multiple credit cards
  • eBay

slide-20
SLIDE 20

More Examples?

  • Palantir Gotham
  • Yelp: restaurant reviews, business reviews
  • Facebook friend requests: look at your friends’ friends and recommend them as your friends
  • Trulia / Zillow (real estate sites)
  • Graph Search (Facebook)
  • Waze
  • Yahoo Pipes
  • Google search engine
  • Google Transit
  • Google Now / Apple Siri
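The friend-request example in the list above (recommend your friends' friends) amounts to combining several people's friend lists into one ranked view. A minimal sketch follows, assuming the friendship data is available as a dictionary of sets; the `recommend_friends` function, the ranking by number of mutual friends, and the sample names are all illustrative assumptions, not a description of how Facebook does it.

```python
from collections import Counter

def recommend_friends(graph, user):
    """Rank friends-of-friends by how many mutual friends they share with `user`."""
    direct = graph.get(user, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the user and people who are already friends.
            if candidate != user and candidate not in direct:
                counts[candidate] += 1
    # Most mutual friends first; ties broken alphabetically.
    return sorted(counts, key=lambda c: (-counts[c], c))

# Hypothetical friendship graph (each edge listed in both directions).
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
print(recommend_friends(graph, "ann"))
```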


slide-21
SLIDE 21

How to do data integration?

slide-22
SLIDE 22

“Low” Effort Approaches

Use database’s “Join”! (e.g., SQLite)

  • Google Refine


http://code.google.com/p/google-refine/ (video #3)

Joined table:

id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

Source tables:

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA
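The join on this slide can be reproduced with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table names `names` and `states` are hypothetical (the slide does not name the tables), and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

# In-memory database; table names "names" and "states" are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names  (id INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT);
""")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [(111, "Smith"), (222, "Johnson"), (333, "Obama")],
)
conn.executemany(
    "INSERT INTO states VALUES (?, ?)",
    [(111, "GA"), (222, "NY"), (333, "CA")],
)

# Inner join on the shared id column gives the unified three-column view.
rows = conn.execute("""
    SELECT n.id, n.name, s.state
    FROM names AS n
    JOIN states AS s ON n.id = s.id
    ORDER BY n.id
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

An inner join keeps only ids present in both tables; if some ids might be missing from one source, a `LEFT JOIN` would keep those rows with a NULL in the missing column.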

slide-23
SLIDE 23

Crowd-sourcing Approaches: Freebase

http://wiki.freebase.com/wiki/What_is_Freebase%3F

slide-24
SLIDE 24

Freebase


(a graph of entities)


 “…a large collaborative knowledge base consisting of metadata composed mainly by its community members…”

Source: Wikipedia.

slide-25
SLIDE 25

So what? 


What can you do with Freebase?


(Hint: Google acquired it in 2010)


slide-26
SLIDE 26

http://www.google.com/insidesearch/features/search/knowledge.html

slide-27
SLIDE 27

Given a graph of entities, like Freebase, what other cool things can you do?


slide-28
SLIDE 28

https://www.facebook.com/about/graphsearch

slide-29
SLIDE 29

Facebook’s Graph Search

Integrate your friends’ info with yours


slide-30
SLIDE 30

Feldspar

Finding Information by Association.
 CHI 2008


Polo Chau, Brad Myers, Andrew Faulring


Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E

slide-31
SLIDE 31
slide-32
SLIDE 32

Summary for data integration

Opportunities

  • enable new services (Siri, PadMapper)
  • enable new ways to discover info
  • improve existing services
  • reduce redundancy
  • new ways to interact with data
  • promote knowledge transfer (e.g., between companies)