Novel Data Linkage Techniques

Dongwon Lee, The Pennsylvania State University


  1. Novel Data Linkage Techniques
     Dongwon Lee, The Pennsylvania State University
     http://pike.psu.edu/
     dongwon@psu.edu
     KOCSEA 2008

     Problem Landscape
     - Data Linkage: Given two data collections D1 and D2, identify/link the set S of all crosswise similar data objects, with few false positives
     - Abundant research in many disciplines
       - DB: record linkage, merge/purge, approx. join
       - DL: citation matching, de-duplication
       - AI: identity matching
       - NLP: word sense disambiguation
       - IR: name disambiguation
       - LIS: name authority control
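The problem statement above can be illustrated with a naive baseline: compare every object in D1 with every object in D2 and keep the pairs whose similarity reaches a threshold. This is a minimal sketch, not a method from the talk. The token-based Jaccard similarity, the threshold `theta`, and the sample records are all assumptions chosen for illustration.

```python
import re
from typing import Callable, List, Set, Tuple


def tokens(s: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets (an assumed similarity)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def link(d1: List[str], d2: List[str], theta: float,
         sim: Callable[[str, str], float] = jaccard) -> List[Tuple[str, str]]:
    """Return S: all crosswise pairs with similarity >= theta.

    Compares every pair, so the cost is |D1| * |D2| similarity calls.
    """
    return [(a, b) for a in d1 for b in d2 if sim(a, b) >= theta]


# Made-up sample collections.
d1 = ["Dongwon Lee", "Jian Pei"]
d2 = ["Lee, Dongwon", "J. Pei", "Nick Koudas"]
pairs = link(d1, d2, theta=0.5)
```

The quadratic number of comparisons in this baseline is what blocking, indexing, and parallel techniques try to avoid.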

  2. Data Linkage Proj. @ Penn State
     - Since 2006
     - Supported by IBM, Microsoft, and NSF
     - http://pike.psu.edu/linkage/
     - Focus on two unanswered challenges: Scalability and Flexibility
     [Figure: the two challenges drawn as axes. Labels recovered for Scalability: blocking, indexing, distributed, parallel. Labels recovered for Flexibility: record, name, image, video, time series, DNA seq. A region is marked "Today's Focus".]

     1. Group Linkage [ICDM 06, ICDE 07]
     [Figure: two groups of elements to be linked. One group, labeled "T. Cruise": Collateral, 04; The Last Samurai, 03; Minority Report, 02; Vanilla Sky, 02. The other group: Vanilla Sky; The Last Samurai; Mission Impossible; Mission Impossible 2. Other labels in the figure: Sofa-Jumping, PP0Q03, TX204.]
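A rough sketch of how two such groups of elements might be scored, using the movie titles from the figure. It computes a generalized Jaccard score over a maximum-weight bipartite matching (the BM measure named in this talk) by brute force, which only works for tiny groups. The element similarity `jaccard`, the edge threshold `rho`, and the group threshold `theta` are assumptions. `upper_bound` is a simple bound written for this illustration and is not the UB of the cited papers; it only shows the filtering idea that a cheap bound below the threshold makes the expensive matching unnecessary.

```python
from itertools import permutations
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (symmetric; an assumed element similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def bm(g1: List[str], g2: List[str], rho: float = 0.5,
       sim: Callable[[str, str], float] = jaccard) -> float:
    """Weight of a max-weight matching M over edges with sim >= rho,
    divided by |g1| + |g2| - |M|. Brute force over all assignments."""
    if not g1 or not g2:
        return 0.0
    small, large = (g1, g2) if len(g1) <= len(g2) else (g2, g1)
    best_weight, best_size = 0.0, 0
    for perm in permutations(large, len(small)):
        kept = [w for w in (sim(a, b) for a, b in zip(small, perm)) if w >= rho]
        # Maximize total weight; ties go to the larger matching (arbitrary choice).
        if (sum(kept), len(kept)) > (best_weight, best_size):
            best_weight, best_size = sum(kept), len(kept)
    return best_weight / (len(g1) + len(g2) - best_size)


def upper_bound(g1: List[str], g2: List[str], rho: float = 0.5,
                sim: Callable[[str, str], float] = jaccard) -> float:
    """Illustrative bound: each g1 element contributes at most its best edge,
    and |g1| + |g2| - |M| is at least max(|g1|, |g2|)."""
    if not g1 or not g2:
        return 0.0
    total = 0.0
    for a in g1:
        best = max(sim(a, b) for b in g2)
        if best >= rho:
            total += best
    return total / max(len(g1), len(g2))


g1 = ["Collateral", "The Last Samurai", "Minority Report", "Vanilla Sky"]
g2 = ["Vanilla Sky", "The Last Samurai", "Mission Impossible", "Mission Impossible 2"]
theta = 0.6
# Filter first: run the expensive matching only if the cheap bound allows a link.
linked = upper_bound(g1, g2) >= theta and bm(g1, g2) >= theta
```

For larger groups the brute-force search would be replaced by a polynomial-time assignment algorithm.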

  3. 1. Group Linkage [ICDM 06, ICDE 07]
     Key Ideas
     - BM: Generalized Jaccard Similarity using Max Weight Bipartite Matching M: O(N^3)
     - UB: Greedy algorithm based approx. of BM: O(N)
     - Theorem: IF UB(g1, g2) < θ → BM(g1, g2) < θ → g1 ≠ g2

     2. Video Linkage [CIVR 08]
     Original video and the edited variants shown:
     - Contrast
     - Brightness
     - Crop
     - Color Enhancement
     - Color Change
     - TV size
     - Multi-editing
     - Low resolution
     - Noise/Blur
     - Small Logo

  4. 2. Video Linkage [CIVR 08]
     - Video: a group of shots (shot 1, shot 2, shot 3)
     - Shot: a group of frames
     - Key frame selection: a group of key frames (key frame 1, key frame 2, ..., key frame n)
       1. Dynamic
       2. Uniform
       3. Hybrid = Dynamic + Uniform
     - Reduce # of computations

     2. Video Linkage [CIVR 08]
     [Figure: Video 1 (shot 1, shot 2) and Video 2 (shot 1, shot 2, shot 3), each shot a set of frames; shots are compared by comparing their frames. Slide dated 7/8/2008.]
     - We need features of a frame
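The following sketch illustrates the uniform key frame idea and why it reduces the number of frame comparisons. It is not the selection or matching procedure of the cited paper. A frame is assumed to be already reduced to a normalized feature vector, and both the histogram-intersection similarity and the best-match averaging in `shot_similarity` are assumptions made for illustration.

```python
from typing import List, Sequence

Frame = Sequence[float]  # hypothetical: a frame as a normalized feature vector


def uniform_key_frames(shot: List[Frame], n: int) -> List[Frame]:
    """Pick up to n evenly spaced frames from a shot."""
    if n <= 0 or not shot:
        return []
    if n >= len(shot):
        return list(shot)
    step = len(shot) / n
    return [shot[int(i * step)] for i in range(n)]


def frame_similarity(a: Frame, b: Frame) -> float:
    """Histogram intersection of two normalized vectors (assumed measure)."""
    return sum(min(x, y) for x, y in zip(a, b))


def shot_similarity(s1: List[Frame], s2: List[Frame], n: int = 3) -> float:
    """Average, over key frames of s1, of the best match among key frames of s2.

    Comparing n key frames per shot costs at most n * n frame comparisons
    instead of len(s1) * len(s2).
    """
    k1, k2 = uniform_key_frames(s1, n), uniform_key_frames(s2, n)
    if not k1 or not k2:
        return 0.0
    return sum(max(frame_similarity(a, b) for b in k2) for a in k1) / len(k1)


# Toy shot of three frames with two-bin feature vectors.
shot = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
same = shot_similarity(shot, shot)
```

A dynamic selector would instead pick frames where the content changes, and a hybrid one would combine both.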

  5. 2. Video Linkage [CIVR 08]
     Features of a frame:
     1. HSV color histogram (CH)
     2. Vector of YCbCr blocks
     3. Motion vector histogram
     [Figure: one panel per feature. Panel labels recovered: H, S, V; YCbCr average over a 16x16 pixel block; motion vectors over a 16x16 pixel block; 4 values, 16 values, 4 values; L = 256; M = # of blocks; N = # of motion vectors. Motion vector directions: (0,0), (0,1), (1,0), (1,1), (0,-1), (-1,0), (1,-1), (-1,1), (-1,-1), for a total of 9.]
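A minimal sketch of a motion vector histogram over the nine directions listed on the slide. How the cited paper assigns a motion vector to a direction is not recoverable from the slide; the sketch assumes each vector is quantized by the sign of its two components and that the histogram is normalized by N, the number of motion vectors.

```python
from typing import Iterable, List, Tuple

# The nine directions, in the order given on the slide.
DIRECTIONS = [(0, 0), (0, 1), (1, 0), (1, 1), (0, -1),
              (-1, 0), (1, -1), (-1, 1), (-1, -1)]


def sign(v: float) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def motion_histogram(vectors: Iterable[Tuple[float, float]]) -> List[float]:
    """Normalized 9-bin histogram of sign-quantized motion vectors."""
    counts = [0] * len(DIRECTIONS)
    n = 0
    for dx, dy in vectors:
        counts[DIRECTIONS.index((sign(dx), sign(dy)))] += 1
        n += 1
    return [c / n for c in counts] if n else [0.0] * len(DIRECTIONS)


# Toy motion vectors, one per 16x16 block (made-up values).
hist = motion_histogram([(0, 0), (3, -2), (5, -1), (-4, 0)])
```

Two frames could then be compared by any histogram distance over these nine bins.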
     3. Text Linkage [JCDL 06, QDB 08, WebDB 08]
     [Figure: text shown alongside two other kinds of sequence data and their tools. DNA gene sequence: BLAST, Parallel BLAST. Text: ED. Time series: DTW, SAX.]

  6. 3. Text Linkage [JCDL 06, QDB 08, WebDB 08]
     DNA conversion
     - Text → N-gram token set with tf.idf weight → Lookup Table → DNA sequence → BLAST
     - Data coding: 1-bit, 2-bit, or 4-bit coding
     - X = word weight (normalized), with cut points C1 and C2; labels shown: Prob(X = Xi), Prob(X ≤ C1), Prob(X ≥ C2)
     - Lookup table entries shown, by importance (rank): VLDB, 11000, TGAA; SIGMOD, 11001, TGAC
     - Converted sequences shown: CTATGCAG, GAGAGGGTGGGC

     3. Text Linkage [JCDL 06, QDB 08, WebDB 08]
     - Text → N-gram token set with tf.idf weight + Conversion Lookup Table
     - Lookup table layouts shown: Hilbert Curve, QWERTY Layout
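The exact lookup-table construction is not recoverable from the slide, so the sketch below uses a hypothetical scheme only to show the shape of the conversion. Tokens are ranked by weight, each rank is written as a fixed-length base-4 number, and each base-4 digit (2 bits) becomes one nucleotide. The alphabet order `ACGT`, the code length, and the sample weights are all assumptions.

```python
from typing import Dict, Iterable

ALPHABET = "ACGT"  # assumed 2-bit coding: one nucleotide per base-4 digit


def rank_to_dna(rank: int, code_len: int) -> str:
    """Write a rank as a fixed-length base-4 number over ALPHABET."""
    letters = []
    for _ in range(code_len):
        rank, digit = divmod(rank, 4)
        letters.append(ALPHABET[digit])
    return "".join(reversed(letters))


def build_lookup(weights: Dict[str, float], code_len: int = 4) -> Dict[str, str]:
    """Hypothetical lookup table: the heaviest token gets rank 0."""
    if len(weights) > 4 ** code_len:
        raise ValueError("code_len too short for this vocabulary")
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return {tok: rank_to_dna(rank, code_len) for rank, tok in enumerate(ranked)}


def text_to_dna(tokens: Iterable[str], table: Dict[str, str]) -> str:
    """Concatenate the codes of known tokens; unknown tokens are skipped."""
    return "".join(table[t] for t in tokens if t in table)


# Made-up normalized weights standing in for tf.idf scores.
table = build_lookup({"vldb": 0.9, "sigmod": 0.8, "the": 0.1})
dna = text_to_dna(["sigmod", "vldb"], table)
```

Once records are strings over a nucleotide alphabet, a sequence-alignment tool such as BLAST can be applied to them, which is the bridge the slide describes.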

  7. Conclusion
     - Group Linkage
       - Handle the integration of CiteSeer and ACM DL
       - Each data collection with ~100,000 groups
     - Video Linkage
       - Can detect copied videos with high precision/recall
       - Applied to Flickr
     - Text Linkage
       - Try to bridge three different disciplines
       - Solve record linkage and document clustering problems using DNA sequence or Time Series

     Conclusion
     - Other Data Linkage Techniques
       - Name Linkage [CACM 08, WIDM 08, SemEval 07]
       - Parallel Linkage [CIKM 08]
       - Adaptive Linkage [JCDL 07]
       - Hashed Linkage [TR 08]
     - Future Work
       - Unifying Framework
       - Application to other data analysis problems
     - http://pike.psu.edu/linkage/

  8. Credit
     - Students @ Penn State
       - Yoojin Hong
       - Hung-sik Kim
       - Haibin Liu
       - Su Yan
       - Tao Yang
     - Collaborators
       - Ergin Elmacioglu, Yahoo, USA
       - Jaewoo Kang, Korea U., Korea
       - Nick Koudas, U. Toronto, Canada
       - Jeongkyu Lee, U. Bridgeport, USA
       - Byung-Won On, U. British Columbia, Canada
       - Jian Pei, Simon Fraser U., Canada
       - Divesh Srivastava, AT&T Labs – Research, USA
