Data Integration Duen Horng (Polo) Chau Assistant Professor - - PowerPoint PPT Presentation

SLIDE 1

http://poloclub.gatech.edu/cse6242


CSE6242 / CX4242: Data & Visual Analytics


Data Integration

Duen Horng (Polo) Chau


Assistant Professor
 Associate Director, MS Analytics
 Georgia Tech

Partly based on materials by 
 Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

SLIDE 2

What is Data Integration? Why is it Important?

SLIDE 3

Data Integration

Combining data from different sources to provide the user with a unified view

How to help people effectively leverage multiple data sources?


(People: analysts, researchers, practitioners, etc.)

SLIDE 4

Example businesses that derive value via data integration

SLIDE 5
SLIDE 6
SLIDE 7

Craigslist now has map view! 
 What problem has it solved?

https://atlanta.craigslist.org/search/hhh

SLIDE 8
SLIDE 9

More Examples?

SLIDE 10

How to do data integration?

SLIDE 11

“Low” Effort Approaches

  • 1. Use database’s “Join”! (e.g., SQLite)


When does this approach work? 
 (Or, when does it NOT work?)

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
333  CA

Joined on id:

id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

  • 2. Google Refine (now OpenRefine)


http://openrefine.org (video #3)
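
A minimal sketch of approach 1, using Python's built-in sqlite3 module. The table names `names` and `states` are hypothetical; the rows are the ones from the example tables on this slide.

```python
import sqlite3

# In-memory SQLite database; table names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany("INSERT INTO names VALUES (?, ?)",
                 [(111, "Smith"), (222, "Johnson"), (333, "Obama")])
conn.executemany("INSERT INTO states VALUES (?, ?)",
                 [(111, "GA"), (222, "NY"), (333, "CA")])

# The join works here because both tables use the same id for the same person.
joined = conn.execute(
    "SELECT n.id, n.name, s.state "
    "FROM names AS n JOIN states AS s ON n.id = s.id "
    "ORDER BY n.id"
).fetchall()

for row in joined:
    print(row)
```

An inner join like this silently drops any row whose id is missing from the other table, and it cannot match records that describe the same entity under different ids. Those are the cases where this approach stops working.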

SLIDE 12

So, it’s great to assign an ID to everything!

But how?

SLIDE 13

Crowd-sourcing Approaches: Freebase

http://wiki.freebase.com/wiki/What_is_Freebase%3F

Freebase intro: https://www.youtube.com/watch?v=TJfrNo3Z-DU

Freebase to move over to Wikidata in July (2015): http://goo.gl/3ZDTg7

SLIDE 14

Freebase


(a graph of entities)


“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…” (Wikipedia)

SLIDE 15

So what? 


What can you do with Freebase?


Hint: Google acquired it in 2010

SLIDE 16

https://www.youtube.com/watch?v=mmQl6VGvX-c

SLIDE 17

https://www.facebook.com/about/graphsearch

https://www.youtube.com/watch?v=W3k1USQbq80

SLIDE 18

Feldspar

Finding Information by Association


Polo Chau, Brad Myers, Andrew Faulring


CHI 2008


Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E

SLIDE 19
SLIDE 20

What if we don’t have the luxury of having IDs?

(Screenshot from Freebase video)

A common problem in academia:

  • Polo Chau
  • Duen Horng Chau
  • Duen Chau
  • D. Chau

SLIDE 21

Then you need to do…

Entity Resolution

(A hard problem in data integration)

SLIDE 22

Why is entity resolution important?

Case Study: Let’s shop for an iPhone 6 on Apple, Amazon and eBay

SLIDE 23
SLIDE 24
SLIDE 25
SLIDE 26

D-Dupe

Interactive Data Deduplication and Integration

TVCG 2008
 


University of Maryland
 Bilgic, Licamele, Getoor, Kang, Shneiderman

http://www.cs.umd.edu/projects/linqs/ddupe/ (skip to 0:55)
http://linqs.cs.umd.edu/basilic/web/Publications/2008/kang:tvcg08/kang-tvcg08.pdf

SLIDE 27
SLIDE 28

(Figure: node labels Polo, Paolo, Alice, Bob, Carol, Dave)

SLIDE 29

Numerous similarity functions

  • Euclidean distance (Euclidean norm / L2 norm)
  • TaxiCab/Manhattan distance
  • Jaccard Similarity (e.g., used with w-shingles)
    e.g., overlap of nodes’ #neighbors
  • String edit distance
    e.g., “Polo Chau” vs “Polo Chan”
  • Canberra distance (compare ranked items)

Excellent read: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
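
The functions listed on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch using textbook definitions, not code from the course or from D-Dupe; all function names are made up here, and the edit distance shown is the Levenshtein variant (insert, delete, substitute, each with cost 1).

```python
import math

def euclidean(a, b):
    """L2 distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Taxicab (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    """Canberra distance: per-coordinate |x - y| / (|x| + |y|), summed."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:  # treat a 0/0 term as contributing nothing
            total += abs(x - y) / denom
    return total

def jaccard(s, t):
    """Jaccard similarity: |intersection| / |union| of two sets."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # convention chosen here for two empty sets
    return len(s & t) / len(s | t)

def shingles(text, w=2):
    """Set of w-shingles: all runs of w consecutive whitespace-separated tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(edit_distance("Polo Chau", "Polo Chan"))  # 1 (one substitution: u -> n)
print(jaccard({1, 2, 3}, {2, 3, 4}))            # 0.5 (2 shared of 4 total)
```

Note that Jaccard is a similarity (higher means more alike) while the others are distances (lower means more alike), so they need to be put on a common scale before being combined.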

SLIDE 30

Core components: Similarity functions

Determine how similar two entities are.

D-Dupe’s approach: similarity score for a pair of entities = attribute similarity + relational similarity
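
One way to picture the two-part score is the sketch below. It assumes relational similarity is measured as the Jaccard overlap of two entities' neighbor sets, which is an illustrative choice and not necessarily the measure D-Dupe uses. The graph edges, the function names, and the attribute similarity value of 0.8 are all hypothetical.

```python
# Hypothetical co-author graph: entity -> set of neighboring entities.
neighbors = {
    "Polo": {"Alice", "Bob", "Carol"},
    "Paolo": {"Bob", "Carol", "Dave"},
}

def relational_similarity(a, b, graph):
    """Jaccard overlap of the two entities' neighbor sets (assumed measure)."""
    na, nb = graph.get(a, set()), graph.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

def similarity_score(attribute_sim, relational_sim):
    """Pair score as the plain sum of the two components."""
    return attribute_sim + relational_sim

rel = relational_similarity("Polo", "Paolo", neighbors)  # 2 shared of 4 total
score = similarity_score(0.8, rel)  # 0.8 is a made-up attribute similarity
print(rel, score)
```

The point of the relational term is that two records with similar names become more likely to be the same entity when they also share many neighbors.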

SLIDE 31


Attribute similarity (a weighted sum)
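
A weighted sum over attributes might look like the sketch below. The attribute names, the weights, and the use of `difflib.SequenceMatcher` as the per-attribute string similarity are assumptions made for illustration; the slides do not specify which per-attribute functions or weights D-Dupe uses. Dividing by the total weight is a design choice here so that the result stays between 0 and 1.

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    """Similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def attribute_similarity(rec_a, rec_b, weights):
    """Weighted sum of per-attribute similarities, normalized by total weight."""
    total_weight = sum(weights.values())
    weighted = sum(
        w * string_similarity(rec_a[attr], rec_b[attr])
        for attr, w in weights.items()
    )
    return weighted / total_weight

# Hypothetical records and weights (name counts more than affiliation).
weights = {"name": 0.7, "affiliation": 0.3}
rec_1 = {"name": "Polo Chau", "affiliation": "Georgia Tech"}
rec_2 = {"name": "Duen Horng Chau", "affiliation": "Georgia Tech"}

print(attribute_similarity(rec_1, rec_2, weights))
```

Raising the weight of an attribute makes disagreement on that attribute cost more, which is how domain knowledge (for example, that names vary more than affiliations) can be encoded.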