Data Integration Duen Horng (Polo) Chau Assistant Professor - - PowerPoint PPT Presentation

SLIDE 1

http://poloclub.gatech.edu/cse6242


CSE6242 / CX4242: Data & Visual Analytics


Data Integration

Duen Horng (Polo) Chau


Assistant Professor
 Associate Director, MS Analytics
 Georgia Tech

Partly based on materials by 
 Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

SLIDE 2

What is Data Integration?

Combining data from multiple sources to provide the user with a unified view.

Why is it Important?


Think about the apps, websites, and services that you use every day.

SLIDE 3

Businesses derive value through data integration.

SLIDE 4
SLIDE 5

Apple Siri

SLIDE 6
SLIDE 7

More Examples?

  • Social media (data from users, businesses)
  • Facebook: your posts, advertisements, reviews
  • Search engines: Google, Bing, Yahoo, etc.
  • Smart assistants: Siri, Cortana, Alexa
  • Price comparison: Kayak
  • Uber, Lyft: drivers, traffic data, customers
  • Google Maps: users, restaurants, traffic, …

SLIDE 8

How to do data integration?

SLIDE 9

“Low” Effort Approaches

  • 1. Use database’s “Join”! (e.g., SQLite)


When does this approach work? 
 (Or, when does it NOT work?)

Joined table:

id   name     salary
111  Smith    $40k
222  Johnson  $60k
333  Lee      $50k

Source table 1:

id   name
111  Smith
222  Johnson
333  Lee

Source table 2:

id   salary
111  $40k
222  $60k
333  $50k
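
The join on this slide can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the course's own code: the table names `names` and `salaries` are assumptions made for the example, and the rows are the ones shown on the slide.

```python
import sqlite3

# In-memory database holding the two source tables from the slide.
# Table names "names" and "salaries" are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE salaries (id INTEGER PRIMARY KEY, salary TEXT);
    INSERT INTO names    VALUES (111, 'Smith'), (222, 'Johnson'), (333, 'Lee');
    INSERT INTO salaries VALUES (111, '$40k'), (222, '$60k'), (333, '$50k');
""")

# An inner join on the shared id column produces the unified view.
rows = conn.execute("""
    SELECT n.id, n.name, s.salary
    FROM names AS n
    JOIN salaries AS s ON s.id = n.id
    ORDER BY n.id
""").fetchall()

for row in rows:
    print(row)
```

The join only lines rows up because both tables use the same id values for the same person. If one source had no id column, or used a different id scheme, there would be nothing to join on.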

  • 2. OpenRefine


http://openrefine.org (Video #3 “Reconcile and Match Data”)

SLIDE 10

IDs are really important, and can simplify data integration!
 
 But who creates the IDs?

SLIDE 11

Crowd-sourcing Approaches: Freebase


Freebase intro video: https://youtu.be/TJfrNo3Z-DU

Learn more about Freebase at https://en.wikipedia.org/wiki/Freebase

SLIDE 12

Freebase


(a graph of entities)


 “…a large collaborative knowledge base consisting of metadata composed mainly by its community members…”

Source: Wikipedia.

SLIDE 13

So what? 


What can you do with the 
 Freebase knowledge graph?


Hint: Google acquired it in 2010.

SLIDE 14

Learn more about Google Knowledge Graph at https://goo.gl/mkCKMg

SLIDE 15

Freebase replaced by 
 Google Knowledge Graph API


Example: 
 What does Google know about Taylor Swift?
 https://developers.google.com/knowledge-graph/

SLIDE 16


What does Google know about Taylor Swift?
 https://developers.google.com/knowledge-graph/
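
One way to ask this question programmatically is the Knowledge Graph Search API. The sketch below uses only the Python standard library. The endpoint, the `query`/`key`/`limit` parameters, and the `itemListElement`/`result`/`resultScore` response fields reflect my understanding of the public documentation linked above and should be checked against it. The helper names `build_kg_url` and `search_entities` are hypothetical, and `YOUR_API_KEY` is a placeholder.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the Knowledge Graph Search API; verify against the docs.
KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_kg_url(query, api_key, limit=1):
    """Build the request URL for an entity search (hypothetical helper)."""
    params = {"query": query, "key": api_key, "limit": limit}
    return KG_ENDPOINT + "?" + urllib.parse.urlencode(params)


def search_entities(query, api_key, limit=1):
    """Return (name, description, score) tuples.

    Needs network access and a valid API key, so it is not called below.
    """
    with urllib.request.urlopen(build_kg_url(query, api_key, limit)) as resp:
        data = json.load(resp)
    return [
        (
            item["result"].get("name"),
            item["result"].get("description"),
            item.get("resultScore"),
        )
        for item in data.get("itemListElement", [])
    ]


# Only build and print the URL here; replace the placeholder key to query for real.
url = build_kg_url("Taylor Swift", "YOUR_API_KEY")
print(url)
```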

SLIDE 17

Google has the Knowledge Graph. Facebook has…

SLIDE 18

Graph Search intro video: https://youtu.be/W3k1USQbq80

SLIDE 19

What if we don’t have the luxury of having IDs?

(Screenshot from Freebase video)

A common problem in academia:

  • Polo Chau
  • Duen Horng Chau
  • Duen Chau
  • D. Chau

SLIDE 20

Then you need to do…

Entity Resolution

(A hard problem in data integration)

SLIDE 21

Why is entity resolution so difficult?

Let’s understand it through shopping for an iPhone on 
 Apple, Amazon and eBay

SLIDE 22
SLIDE 23
SLIDE 24
SLIDE 25

D-Dupe

Interactive Data Deduplication and Integration

TVCG 2008
 


University of Maryland
 Bilgic, Licamele, Getoor, Kang, Shneiderman

https://linqspub.soe.ucsc.edu/basilic/web/Publications/2006/bilgic:vast06/

SLIDE 26
SLIDE 27

[Figure: graph with nodes labeled Polo, Palo, Alice, Bob, Carol, Dave]

SLIDE 28

Core components: Similarity functions

Determine how two entities are similar. D-Dupe’s approach: 
 Attribute similarity + relational similarity


Similarity score for a pair of entities
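
To make "attribute similarity + relational similarity" concrete, here is a minimal sketch of a pair score. It is not D-Dupe's actual formula, which the slide does not give. The assumptions are that attribute similarity is a string ratio from Python's difflib, that relational similarity is the Jaccard overlap of each entity's neighbor set, and that the two are blended with a hypothetical weight `alpha`. The co-author sets are made up for illustration.

```python
from difflib import SequenceMatcher


def attribute_similarity(a, b):
    """String similarity in [0, 1] between two names (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def relational_similarity(neighbors_a, neighbors_b):
    """Jaccard overlap of the two neighbor sets; 0.0 if both are empty."""
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0


def pair_score(name_a, name_b, neighbors_a, neighbors_b, alpha=0.5):
    """Blend attribute and relational similarity; alpha is a hypothetical weight."""
    attr = attribute_similarity(name_a, name_b)
    rel = relational_similarity(neighbors_a, neighbors_b)
    return alpha * attr + (1 - alpha) * rel


# Hypothetical co-author sets for two author records that may be the same person.
coauthors = {
    "Polo": {"Alice", "Bob", "Carol"},
    "Palo": {"Alice", "Bob", "Dave"},
}
score = pair_score("Polo", "Palo", coauthors["Polo"], coauthors["Palo"])
print(round(score, 3))
```

Two records with similar names and many shared neighbors score high on both terms, which is the intuition behind combining them.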

SLIDE 29


Attribute similarity (a weighted sum)
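
A weighted sum over attributes could look like the sketch below. The attribute names, the weights, the two records, and the choice of difflib's ratio as the per-attribute measure are all assumptions made for illustration. The slide states only that attribute similarity is a weighted sum.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Per-attribute similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def attribute_similarity(record_a, record_b, weights):
    """Weighted sum of per-attribute similarities, normalized by total weight."""
    total = sum(weights.values())
    weighted = sum(
        w * string_similarity(record_a[attr], record_b[attr])
        for attr, w in weights.items()
    )
    return weighted / total


# Hypothetical records and weights: name counts more than affiliation.
record_a = {"name": "Duen Horng Chau", "affiliation": "Georgia Tech"}
record_b = {"name": "Polo Chau", "affiliation": "Georgia Tech"}
weights = {"name": 0.7, "affiliation": 0.3}

score = attribute_similarity(record_a, record_b, weights)
print(round(score, 3))
```

Dividing by the total weight keeps the score in [0, 1] even when the weights do not sum to 1.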

SLIDE 30

Numerous similarity functions

  • Euclidean distance


Euclidean norm / L2 norm

  • TaxiCab/Manhattan distance
  • Jaccard Similarity (e.g., used with w-shingles)


e.g., overlap of nodes’ #neighbors

  • String edit distance


e.g., “Polo Chau” vs “Polo Chan”
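
The four measures listed above can be sketched in plain Python as follows. These are textbook definitions, not code from the course. The edit distance here is the Levenshtein variant (insert, delete, substitute), which is one of several definitions in use. The `shingles` helper builds word-level w-shingles and is a hypothetical name.

```python
import math


def euclidean_distance(x, y):
    """L2 distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """L1 (taxicab) distance between two equal-length numeric vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def jaccard_similarity(s, t):
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def shingles(text, w=2):
    """Set of w-word shingles (contiguous word windows) of a text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(euclidean_distance((0, 0), (3, 4)))
print(manhattan_distance((0, 0), (3, 4)))
print(jaccard_similarity({"Alice", "Bob", "Carol"}, {"Alice", "Bob", "Dave"}))
print(edit_distance("Polo Chau", "Polo Chan"))
```

Note that the first two and the last are distances (smaller means more alike) while Jaccard is a similarity (larger means more alike), so they need to be put on a common scale before being combined into one score.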


Excellent read: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf

SLIDE 31


https://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.html

SLIDE 32

Excellent Tutorial on Entity Resolution

http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf by Lise Getoor and Ashwin Machanavajjhala