http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics
Data Cleaning & Integration
Duen Horng (Polo) Chau Georgia Tech
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
Big data analytics building blocks Data collection & simple data storage
maintain, database in a single file
device
create index, etc.)
Collection Cleaning Integration Visualization Analysis Presentation Dissemination
Examples
(Fall’14)
Big Data's Dirty Problem [Fortune]
http://fortune.com/2014/06/30/big-data-dirty-problem/
A Taxonomy of Dirty Data [Won Kim+]
http://sci2s.ugr.es/docencia/m1/KimTaxonomy03.pdf (Very detailed, may be slightly outdated)
For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights [New York Times]
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0
Watch videos
Write down
Will collectively summarize similarities and differences afterwards
Open Refine: http://openrefine.org
Data Wrangler: http://vis.stanford.edu/wrangler/
G = Google Refine, W = Data Wrangler
Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/
Collection Cleaning Integration Visualization Analysis Presentation Dissemination
Combining data from different sources to provide the user with a unified view
As data's volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges
How to help people effectively leverage multiple data sources?
(People: analysts, researchers, practitioners, etc.)
card, bank, etc.), can parse receipts
recommend those friends as your friends
Use a database's "Join"! (e.g., SQLite)
Google Refine: http://code.google.com/p/google-refine/ (video #3)
id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA
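
The join behind these tables can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table names `people` and `states` and the function name `join_people_states` are assumptions made for the example (the slides do not name the tables), and an in-memory database stands in for a real SQLite file.

```python
import sqlite3


def join_people_states():
    """Join a (id, name) table with a (id, state) table on their shared id."""
    # In-memory database; a file path would work the same way.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Hypothetical table names; the columns follow the slide's example.
    cur.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT)")

    cur.executemany(
        "INSERT INTO people VALUES (?, ?)",
        [(111, "Smith"), (222, "Johnson"), (333, "Obama")],
    )
    cur.executemany(
        "INSERT INTO states VALUES (?, ?)",
        [(111, "GA"), (222, "NY"), (333, "CA")],
    )

    # Inner join: keep only ids that appear in both tables.
    rows = cur.execute(
        """
        SELECT p.id, p.name, s.state
        FROM people AS p
        JOIN states AS s ON p.id = s.id
        ORDER BY p.id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in join_people_states():
        print(row)
```

An inner join drops any id that is missing from either table; when unmatched rows should be kept, a LEFT JOIN is the usual alternative.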
http://wiki.freebase.com/wiki/What_is_Freebase%3F
(a graph of entities)
Wikipedia.
Hint: Google acquired it in 2010
Freebase to move over to Wikidata in July (2015): http://goo.gl/3ZDTg7
http://www.google.com/insidesearch/features/search/knowledge.html
https://www.facebook.com/about/graphsearch
Integrate your friends’ info with yours
Finding Information by Association. CHI 2008
Polo Chau, Brad Myers, Andrew Faulring
Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E
Opportunities
companies)