Novel Data Linkage Techniques

Dongwon Lee

The Pennsylvania State University http://pike.psu.edu/ dongwon@psu.edu

KOCSEA 2008


Problem Landscape

• Data Linkage: Given two data collections D1 and D2, identify/link the set S of all crosswise similar data objects, with few false positives
• Abundant research in many disciplines
  - DB: record linkage, merge/purge, approx. join
  - DL: citation matching, de-duplication
  - AI: identity matching
  - NLP: word sense disambiguation
  - IR: name disambiguation
  - LIS: name authority control


Data Linkage Proj. @ Penn State

• Since 2006
  - Supported by IBM, Microsoft, and NSF
  - http://pike.psu.edu/linkage/
• Focus on two unanswered challenges


Today’s Focus

• Flexibility: record, name, image, video, DNA seq., time series
• Data Scalability: parallel, distributed, indexing, blocking


1. Group Linkage [ICDM 06, ICDE 07]

Figure: two groups of elements to be linked (figure labels: T. Cruise, TX204, PP0Q03)
• Group of elements: Collateral, 04; The Last Samurai, 03; Minority Report, 02; Vanilla Sky, 02
• Group of elements: Vanilla Sky; The Last Samurai; Mission Impossible; Mission Impossible 2; Sofa-Jumping


1. Group Linkage [ICDM 06, ICDE 07]

• Key Ideas
  - BM: Generalized Jaccard similarity using max-weight bipartite matching M: O(N^3)
  - UB: Greedy-algorithm-based approx. of BM: O(N)
  - Theorem: IF UB(g1,g2) < θ → BM(g1,g2) < θ → g1 ≠ g2

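The filter-then-verify idea behind the theorem can be sketched as follows. This is a minimal illustration, not the algorithm from the cited papers: the element similarity `token_jaccard`, the parameter names `rho` (element-level threshold) and `theta` (group-level threshold), the brute-force matching (practical only for small groups), and the particular upper bound are all assumptions made here. The bound is a deliberately simple one: no matching can collect more weight than the sum of each element's best similarity, and the denominator |g1| + |g2| - |M| is never smaller than the size of the larger group.

```python
from functools import lru_cache


def token_jaccard(a, b):
    """Jaccard similarity of the lowercase word sets of two strings (assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def edge_weights(g1, g2, sim, rho):
    """Pairwise similarities; pairs below rho are treated as 'no edge' (weight 0)."""
    weights = []
    for a in g1:
        row = []
        for b in g2:
            s = sim(a, b)
            row.append(s if s >= rho else 0.0)
        weights.append(row)
    return weights


def bm_similarity(g1, g2, sim=token_jaccard, rho=0.5):
    """Matching weight / (|g1| + |g2| - |M|), with M found by exhaustive search."""
    if not g1 or not g2:
        return 0.0
    w = edge_weights(g1, g2, sim, rho)

    @lru_cache(maxsize=None)
    def best(i, used):
        # Best (total weight, matching size) for elements i.. of g1,
        # given the bitmask of g2 elements that are already matched.
        if i == len(g1):
            return (0.0, 0)
        result = best(i + 1, used)  # option: leave element i unmatched
        for j in range(len(g2)):
            if w[i][j] > 0 and not used & (1 << j):
                weight, size = best(i + 1, used | (1 << j))
                result = max(result, (weight + w[i][j], size + 1))
        return result

    weight, size = best(0, 0)
    return weight / (len(g1) + len(g2) - size)


def upper_bound(g1, g2, sim=token_jaccard, rho=0.5):
    """Cheap bound with upper_bound >= bm_similarity (assumed, looser than the papers')."""
    if not g1 or not g2:
        return 0.0
    w = edge_weights(g1, g2, sim, rho)
    row_best = sum(max(row) for row in w)        # best partner per g1 element
    col_best = sum(max(col) for col in zip(*w))  # best partner per g2 element
    return min(row_best, col_best) / max(len(g1), len(g2))


def same_entity(g1, g2, theta, sim=token_jaccard, rho=0.5):
    """Prune with the cheap bound first; run the expensive matching only if needed."""
    if upper_bound(g1, g2, sim, rho) < theta:
        return False  # UB < theta implies BM < theta
    return bm_similarity(g1, g2, sim, rho) >= theta
```

With the two movie-title groups from the Group Linkage example and the default settings, a pair whose `upper_bound` already falls below `theta` is rejected without ever running the matching, which is where the savings come from.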

2. Video Linkage [CIVR 08]

Figure: an <Original Video> and its edited variants
• Low resolution
• Brightness
• Color Enhancement
• Contrast
• Color Change
• Noise/Blur
• Small Logo
• TV size
• Crop
• Multi-editing



2. Video Linkage [CIVR 08]

• Video: a group of shots (shot1, shot2, shot3)
• Shot: a group of frames
• Key frames (Key frame 1, Key frame 2, …, Key frame n): a group of key frames
  - reduce # of computations
• Key frame selection
  1. Dynamic
  2. Uniform
  3. Hybrid = Dynamic + Uniform
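The three key frame selection strategies are only named on the slide, so the following is a sketch under stated assumptions rather than the method of the cited paper. It assumes that "uniform" means taking every `step`-th frame, that "dynamic" means starting a new key frame whenever the current frame differs from the last key frame by more than `threshold`, and that "hybrid" is the union of both. The frame representation (a list of numbers per frame) and the L1 distance are likewise hypothetical.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def uniform_key_frames(n_frames, step):
    """Indices of every step-th frame (assumed meaning of 'Uniform')."""
    return list(range(0, n_frames, step))


def dynamic_key_frames(frames, threshold):
    """Indices where content changed by more than threshold since the last key frame."""
    if not frames:
        return []
    keys = [0]
    for i in range(1, len(frames)):
        if l1_distance(frames[i], frames[keys[-1]]) > threshold:
            keys.append(i)
    return keys


def hybrid_key_frames(frames, step, threshold):
    """Union of the uniform and dynamic selections (assumed meaning of 'Hybrid')."""
    uniform = uniform_key_frames(len(frames), step)
    dynamic = dynamic_key_frames(frames, threshold)
    return sorted(set(uniform) | set(dynamic))
```

Under these assumptions the trade-off is visible in the code: uniform selection bounds the gap between key frames, dynamic selection reacts to content changes, and the hybrid keeps both guarantees at the cost of more key frames to compare.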

7/8/2008


2. Video Linkage [CIVR 08]

Figure: Video 1 (shot 1, shot 2) is compared with Video 2 (shot 1, shot 2, shot 3); each shot is a group of frames, so comparing shots means comparing frames.

We need features of a frame

2. Video Linkage [CIVR 08]

Frame features
1. HSV color histogram (CH)
   - H: 16 values, S: 4 values, V: 4 values
   - L = 256
2. Vector of YCbCr blocks
   - YCbCr average of each 16x16 pixel block
   - M = # of blocks
3. Motion vector histogram
   - Motion vectors of 16x16 pixel blocks
   - N = # of motion vectors = 9
   - (0,0), (0,1), (1,0), (1,1), (0,-1), (-1,0), (1,-1), (-1,1), (-1,-1)
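As an illustration of the first frame feature, the sketch below builds a 256-bin HSV color histogram with 16 hue, 4 saturation, and 4 value levels (16 x 4 x 4 = 256). The pixel format (8-bit RGB tuples), the equal-width quantization, the bin ordering, and the use of histogram intersection to compare two frames are assumptions made for this example; the slide only gives the bin counts.

```python
import colorsys

H_BINS, S_BINS, V_BINS = 16, 4, 4  # 16 * 4 * 4 = 256 bins in total


def hsv_bin(r, g, b):
    """Map one 8-bit RGB pixel to a histogram bin index in [0, 255]."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # Each component is in [0, 1]; clamp so that 1.0 falls into the last bin.
    hb = min(int(h * H_BINS), H_BINS - 1)
    sb = min(int(s * S_BINS), S_BINS - 1)
    vb = min(int(v * V_BINS), V_BINS - 1)
    return (hb * S_BINS + sb) * V_BINS + vb


def hsv_histogram(pixels):
    """Normalized 256-bin HSV histogram of an iterable of (r, g, b) pixels."""
    hist = [0.0] * (H_BINS * S_BINS * V_BINS)
    count = 0
    for r, g, b in pixels:
        hist[hsv_bin(r, g, b)] += 1.0
        count += 1
    return [c / count for c in hist] if count else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for two normalized histograms (assumed comparison)."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

A frame-to-frame score of this kind could then serve as the element similarity when two shots are compared as groups of key frames.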
3. Text Linkage [JCDL 06, QDB 08, WebDB 08]

Figure: Text is converted into one of two representations
• DNA gene sequence: BLAST, Parallel BLAST
• Time series: SAX, DTW, ED
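Two of the time-series measures named here, ED (Euclidean distance) and DTW (dynamic time warping), are standard and can be sketched in a few lines. This is the textbook dynamic-programming form of DTW with an absolute-difference local cost and no warping window; how the cited work configures it is not stated on the slide, so those choices are assumptions.

```python
import math


def euclidean_distance(a, b):
    """ED between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("ED requires series of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """DTW cost between two non-empty series, allowing them to stretch in time."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning the first i points of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances
                                     cost[i][j - 1],      # b advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Unlike ED, DTW accepts series of different lengths, which matters when two texts of different sizes are turned into time series.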


3. Text Linkage [JCDL 06, QDB 08, WebDB 08]

Pipeline: Text → N-gram token set with tf.idf weight → DNA conversion lookup table → BLAST

Figure: DNA conversion lookup table
• Tokens (e.g., VLDB, SIGMOD) are ordered by importance (rank)
• Lookup-table entries: 11000 → TGAA, 11001 → TGAC
• Other codes shown: CTATGCAG, GAGAGGGTGGGC
• Converted texts: GAGAGGGTGGGC CTATGCAG TGAA and GAGAGGGTGGGC TGAC

Figure: 1-bit, 2-bit, or 4-bit data coding
• X = word weight (normalized), with distribution Prob(X = Xi)
• Cut points C1 and C2 delimit Prob(X ≤ C1) and Prob(X ≥ C2)
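The text-to-DNA conversion can be illustrated with a small lookup table. The sketch below is an assumption-laden stand-in, not the scheme from the cited papers: it uses character n-grams as tokens, ranks them by raw frequency instead of tf.idf weight, and encodes each rank as a fixed-length base-4 codeword over the alphabet A, C, G, T. The function names and the codeword length are hypothetical, and the final BLAST search step is not shown.

```python
from collections import Counter

ALPHABET = "ACGT"  # one nucleotide per base-4 digit (assumed coding)


def char_ngrams(text, n=3):
    """Overlapping character n-grams of the lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def rank_to_dna(rank, length):
    """Encode a non-negative rank as a fixed-length base-4 DNA codeword."""
    if rank < 0 or rank >= 4 ** length:
        raise ValueError("rank does not fit into the codeword length")
    letters = []
    for _ in range(length):
        rank, digit = divmod(rank, 4)
        letters.append(ALPHABET[digit])
    return "".join(reversed(letters))


def build_lookup_table(documents, n=3, length=4):
    """Map each n-gram to a codeword; more frequent n-grams get lower ranks."""
    counts = Counter(g for doc in documents for g in char_ngrams(doc, n))
    # Frequency stands in for tf.idf here; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return {g: rank_to_dna(r, length) for r, g in enumerate(ranked)}


def text_to_dna(text, table, n=3):
    """Concatenate the codewords of the text's n-grams, skipping unknown ones."""
    return "".join(table[g] for g in char_ngrams(text, n) if g in table)
```

Once every record is a DNA string, approximate matching between records becomes a sequence-alignment problem, which is the reason for handing the result to a tool such as BLAST.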


3. Text Linkage [JCDL 06, QDB 08, WebDB 08]

Pipeline: Text → N-gram token set with tf.idf weight → Conversion lookup table + Hilbert curve / QWERTY layout
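The slide does not say how the Hilbert curve and the QWERTY layout enter the conversion. One plausible reading, assumed here purely for illustration, is that each character is located on a 2-D keyboard grid and the Hilbert curve turns that position into a single number, so that a text becomes a numeric series in which nearby keys receive nearby values. The grid coordinates in `KEY_POS` and the function `text_to_series` are hypothetical; `hilbert_index` is the standard iterative point-to-index conversion for a Hilbert curve on an n x n grid with n a power of two.

```python
QWERTY_ROWS = ["1234567890", "qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]

# Hypothetical coordinates: x = column within the row, y = row number.
KEY_POS = {ch: (x, y) for y, row in enumerate(QWERTY_ROWS) for x, ch in enumerate(row)}


def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def text_to_series(text, n=16):
    """Turn a text into a numeric series via key positions (assumed scheme)."""
    return [hilbert_index(n, *KEY_POS[ch]) for ch in text.lower() if ch in KEY_POS]
```

The appeal of a Hilbert curve in such a scheme is locality: cells that are consecutive along the curve are always adjacent in the grid, so small typing errors would tend to produce small changes in the resulting series.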


Conclusion

• Group Linkage
  - Handle the integration of CiteSeer and ACM DL
  - Each data collection with ~100,000 groups
• Video Linkage
  - Can detect copied videos w. high precision/recall
  - Applied to Flickr
• Text Linkage
  - Try to bridge three different disciplines
  - Solve record linkage and document clustering problems using DNA sequence or Time Series


Conclusion

• Other Data Linkage Techniques
  - Name Linkage [CACM 08, WIDM 08, SemEval 07]
  - Parallel Linkage [CIKM 08]
  - Adaptive Linkage [JCDL 07]
  - Hashed Linkage [TR 08]
• Future Work
  - Unifying Framework
  - Application to other data analysis problems


http://pike.psu.edu/linkage/


Credit

• Students @ Penn State
  - Yoojin Hong
  - Hung-sik Kim
  - Haibin Liu
  - Su Yan
  - Tao Yang
• Collaborators
  - Ergin Elmacioglu, Yahoo, USA
  - Jaewoo Kang, Korea U., Korea
  - Nick Koudas, U. Toronto, Canada
  - Jeongkyu Lee, U. Bridgeport, USA
  - Byung-Won On, U. British Columbia, Canada
  - Jian Pei, Simon Fraser U., Canada
  - Divesh Srivastava, AT&T Labs – Research, USA
