http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics
Data Cleaning & Integration
Duen Horng (Polo) Chau Georgia Tech
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
Examples
http://blogs.verdantis.com/wp-content/uploads/2015/02/Data-cleansing.jpg
Examples
(Previous semester)
Big Data's Dirty Problem [Fortune]
http://fortune.com/2014/06/30/big-data-dirty-problem/
For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights [New York Times]
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0
Watch videos
Write down
Will collectively summarize similarities and differences afterwards
Open Refine: http://openrefine.org
Data Wrangler: http://vis.stanford.edu/wrangler/
interoperability)
G = Google Refine
W = Data Wrangler
Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/
Collection → Cleaning → Integration → Visualization → Analysis → Presentation → Dissemination
Combining data from different sources to provide the user with a unified view.
As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges.
How to help people effectively leverage multiple data sources?
(People: analysts, researchers, practitioners, etc.)
Craigslist now has map view! What problem has it solved?
https://atlanta.craigslist.org/search/hhh
card, bank, etc.), can parse receipts
(Previous semester)
recommend those friends as your friends
(Previous semester)
Use database’s “Join”! (e.g., SQLite)
When would this approach work? (Or, when won’t it work?)
id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA
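The join on the slide can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the course's own code: the table names `names` and `states` and the in-memory database are assumptions made for the example; only the rows come from the slide.

```python
import sqlite3

# In-memory database; the table names `names` and `states` are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [(111, "Smith"), (222, "Johnson"), (333, "Obama")],
)
conn.executemany(
    "INSERT INTO states VALUES (?, ?)",
    [(111, "GA"), (222, "NY"), (333, "CA")],
)

# An inner join keeps only the ids that appear in both tables.
joined = conn.execute(
    "SELECT n.id, n.name, s.state "
    "FROM names AS n JOIN states AS s ON n.id = s.id "
    "ORDER BY n.id"
).fetchall()

for row in joined:
    print(row)
```

The join succeeds here because both tables share an exact, consistent key. If the two sources identified the same entity differently (different ids, or names spelled in different ways), an exact-match join would silently drop those rows.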
Google Refine
http://code.google.com/p/google-refine/ (video #3)
http://wiki.freebase.com/wiki/What_is_Freebase%3F
(a graph of entities)
Wikipedia.
Hint: Google acquired it in 2010
Freebase to move over to Wikidata in July (2015): http://goo.gl/3ZDTg7
https://www.facebook.com/about/graphsearch
https://www.youtube.com/watch?v=W3k1USQbq80
https://www.youtube.com/watch?v=mmQl6VGvX-c
Finding Information by Association. CHI 2008
Polo Chau, Brad Myers, Andrew Faulring
Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E
(Screenshot from FreeBase video)
(A hard problem in data integration)
Polo Chau
P. Chau
Duen Horng Chau
Duen Chau
Interactive Data Deduplication and Integration
TVCG 2008
University of Maryland
Bilgic, Licamele, Getoor, Kang, Shneiderman
http://www.cs.umd.edu/projects/linqs/ddupe/ (skip to 0:55)
http://linqs.cs.umd.edu/basilic/web/Publications/2008/kang:tvcg08/kang-tvcg08.pdf
Euclidean norm / L2 norm
e.g., overlap of nodes’ #neighbors
e.g., “Polo Chau” vs “Polo Chan”
(compare ranked items)
Excellent read: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
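Three of the measures hinted at above can be sketched in a few lines of Python. The slide's own headings did not survive, so pairing the fragments with Euclidean distance, Jaccard similarity over neighbor sets, and Levenshtein edit distance is an assumption; the function names are hypothetical, and the rank-comparison measure is not sketched because the slide does not say which one was meant.

```python
import math


def euclidean_distance(u, v):
    """L2 norm of the difference between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|, e.g. for the neighbor sets of two graph nodes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(euclidean_distance([0, 0], [3, 4]))                    # 5.0
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
print(edit_distance("Polo Chau", "Polo Chan"))               # 1
```

Note that the first and third functions return distances (smaller means more alike), while Jaccard returns a similarity in [0, 1]; they need to be put on a common scale before being combined.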
Determine how two entities are similar.
D-Dupe’s approach: Attribute similarity + relational similarity
Similarity score for a pair of entities
Attribute similarity (a weighted sum)
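A weighted-sum attribute similarity, and one way to fold in a relational score, might look like the sketch below. It is not D-Dupe's actual formula, which these slides do not reproduce: the function names, the per-attribute similarity functions, the weights, the mixing parameter `alpha`, and the placeholder relational score are all assumptions made for illustration.

```python
def exact_match(x, y):
    """1.0 if the two values are identical, else 0.0."""
    return 1.0 if x == y else 0.0


def token_jaccard(x, y):
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return len(tx & ty) / len(tx | ty) if (tx | ty) else 1.0


def attribute_similarity(a, b, weights, sim_fns):
    """Weighted sum of per-attribute similarities.
    Weights are normalized so the result stays in [0, 1]."""
    total = sum(weights.values())
    return sum(
        (w / total) * sim_fns[attr](a[attr], b[attr])
        for attr, w in weights.items()
    )


def pair_similarity(attr_sim, rel_sim, alpha=0.5):
    """Hypothetical linear mix of attribute and relational similarity."""
    return (1 - alpha) * attr_sim + alpha * rel_sim


# Illustrative records; the attribute names are assumptions.
a = {"name": "Polo Chau", "affiliation": "Georgia Tech"}
b = {"name": "Duen Horng Chau", "affiliation": "Georgia Tech"}
weights = {"name": 3.0, "affiliation": 1.0}
sim_fns = {"name": token_jaccard, "affiliation": exact_match}

attr = attribute_similarity(a, b, weights, sim_fns)
score = pair_similarity(attr, rel_sim=0.5, alpha=0.5)  # rel_sim is a placeholder
print(attr, score)
```

In practice the relational score would come from the graph, for example the overlap between the two entities' neighbor sets, and the weights would be tuned by the analyst.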