Data Cleaning & Integration
CSE6242 / CX4242
Aug 28, 2014
Duen Horng (Polo) Chau Georgia Tech
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
Last Time
Big data analytics building blocks
Data collection & simple data storage
maintain, database in a single file
device
create index, etc.)
Collection Cleaning Integration Visualization Analysis Presentation Dissemination
Watch videos
Write down
Will collectively summarize similarities and differences afterwards
Open Refine: http://openrefine.org
Data Wrangler: http://vis.stanford.edu/wrangler/
Examples
G = Google Refine, W = Data Wrangler
Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/
Collection Cleaning Integration Visualization Analysis Presentation Dissemination
Combining data from different sources to provide the user with a unified view.
As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges.
How to help people effectively leverage multiple data sources?
(People: analysts, researchers, practitioners, etc.)
card, bank, etc.), can parse receipts
recommend those friends as your friends
Use database’s “Join”! (e.g., SQLite)
http://code.google.com/p/google-refine/ (video #3)
id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA
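
The join on this slide can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the course's own code: the table names `names` and `states` are assumptions chosen for the example (the slide does not name its tables), and the rows are the ones shown above.

```python
import sqlite3

# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table names; the slide only shows the columns.
cur.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT)")

cur.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [(111, "Smith"), (222, "Johnson"), (333, "Obama")],
)
cur.executemany(
    "INSERT INTO states VALUES (?, ?)",
    [(111, "GA"), (222, "NY"), (333, "CA")],
)

# Inner join on the shared id column gives the unified view.
rows = cur.execute(
    """
    SELECT n.id, n.name, s.state
    FROM names AS n
    JOIN states AS s ON n.id = s.id
    ORDER BY n.id
    """
).fetchall()

for row in rows:
    print(row)

conn.close()
```

An inner join keeps only ids present in both tables; when one source may be missing records, a LEFT JOIN would keep the unmatched rows from the first table with NULL in the missing column.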
http://wiki.freebase.com/wiki/What_is_Freebase%3F
(a graph of entities)
Wikipedia.
(Hint: Google acquired it in 2010)
http://www.google.com/insidesearch/features/search/knowledge.html
https://www.facebook.com/about/graphsearch
Integrate your friends’ info with yours
Finding Information by Association. CHI 2008
Polo Chau, Brad Myers, Andrew Faulring
Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E
Opportunities
companies)