
SLIDE 1

Human-Powered Blocking in Entity Resolution: A Feasibility Study

Weiling Li, Jongwuk Lee, Dongwon Lee

The Pennsylvania State University, August 2014

SLIDE 2

A Motivating Example

• Matching: find the images that show the same sport type in an image data set.
• This is an Entity Resolution (ER) problem.
• Challenging!

SLIDE 3

Machine Based ER Techniques

• Similarity-based
  - Might not produce an accurate result (e.g., on an image data set).
• Learning-based
  - Needs a good training set to train the classifier.

SLIDE 4

Crowdsourced ER

• Human workers are assigned tasks, referred to as Human Intelligence Tasks (HITs).
• Example: (figure omitted)
• Naïve approach:
  - The number of HITs for ER on n records would be $O(n^2)$.

SLIDE 5

Crowdsourced ER

• Blocking:
  - Group records that are more likely to match into the same “block”. Run pair-wise comparisons only within blocks.
  - The number of HITs would be reduced to $O(n + kb^2)$, where k is the number of blocks and b is the average number of records in a block.
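To make the saving concrete, the sketch below counts pair-wise comparison HITs with and without blocking. It assumes one HIT per record pair and equal-sized blocks, and it ignores the HITs spent on the blocking step itself; the function names are illustrative and do not come from the slides.

```python
def naive_hits(n: int) -> int:
    """Pair-wise HITs when every record is compared with every other record."""
    return n * (n - 1) // 2


def blocked_hits(block_sizes: list[int]) -> int:
    """Pair-wise HITs when comparisons happen only inside each block."""
    return sum(naive_hits(b) for b in block_sizes)


if __name__ == "__main__":
    n, k = 1000, 10
    sizes = [n // k] * k  # assumption: k equal-sized blocks of b = n / k records
    print(naive_hits(n), blocked_hits(sizes))
```

With n = 1000 and ten blocks of 100 records, the pair-wise work drops by roughly a factor of k, which is the intuition behind restricting comparisons to blocks.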

SLIDE 6

Workflow

• Two variations of human-powered blocking:
  - An extension of the crowd-median method for blocking [1]
  - A hierarchical blocking method

[Figure: Data Set → Human-powered Blocking → Block 1, Block 2, …, Block N → Human-powered Pair-wise Matching (one per block) → Matching Pairs]

SLIDE 7

Human-powered Operations

• hp_match(r, r’)
• hp_most_similar(rt, C)
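In the study these two operations are answered by human workers. For offline experimentation they can be simulated, as in the sketch below. The simulation is an assumption, not the slides' method: records are taken to be 2-D points, a worker is replaced by a Euclidean distance check, and the default threshold of 1.41 is borrowed from the synthetic-data setting.

```python
import math

Point = tuple[float, float]


def hp_match(r: Point, r2: Point, threshold: float = 1.41) -> bool:
    """Simulated worker answer to: do r and r2 refer to the same entity?"""
    return math.dist(r, r2) <= threshold


def hp_most_similar(rt: Point, C: list[Point]) -> int:
    """Simulated worker answer: index of the candidate in C most similar to rt."""
    return min(range(len(C)), key=lambda i: math.dist(rt, C[i]))
```

Returning an index from hp_most_similar (rather than the record) is a design choice of this sketch; it makes it easy to use the answer as a block number.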

SLIDE 8

Human-powered Components

• FindCentroids
  - Given a data set D, the workers choose K centroids.
  - Uses hp_match(r, r’)
• Assign
  - Given a data set D and the set of centroids C, the workers assign each record r to the block whose centroid Ci is most similar to r.
  - Uses hp_most_similar(rt, C)
• PairwiseMatch
  - Given a data set D, the workers decide whether each pair of records matches.
  - Uses hp_match(r, r’)
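One possible reading of the three components is sketched below. The slides do not spell out how FindCentroids uses hp_match; here a record becomes a centroid when it matches none of the centroids picked so far, which is an assumption. The two hp_* functions are simulated with Euclidean distance on 2-D points, also an assumption.

```python
import math
from itertools import combinations

Point = tuple[float, float]


def hp_match(r: Point, r2: Point, threshold: float = 1.41) -> bool:
    # Stand-in for a crowd answer (assumption): close points count as a match.
    return math.dist(r, r2) <= threshold


def hp_most_similar(rt: Point, C: list[Point]) -> int:
    # Stand-in for a crowd answer (assumption): index of the nearest centroid.
    return min(range(len(C)), key=lambda i: math.dist(rt, C[i]))


def find_centroids(D: list[Point], K: int) -> list[Point]:
    """Keep a record as a centroid if it matches no centroid chosen so far."""
    C: list[Point] = []
    for r in D:
        if len(C) == K:
            break
        if not any(hp_match(r, c) for c in C):
            C.append(r)
    return C


def assign(D: list[Point], C: list[Point]) -> list[list[Point]]:
    """Put every record (centroids included) into the block of its most similar centroid."""
    blocks: list[list[Point]] = [[] for _ in C]
    for r in D:
        blocks[hp_most_similar(r, C)].append(r)
    return blocks


def pairwise_match(block: list[Point]) -> list[tuple[Point, Point]]:
    """Ask hp_match for every pair inside one block and keep the matches."""
    return [(a, b) for a, b in combinations(block, 2) if hp_match(a, b)]
```

Running find_centroids, then assign, then pairwise_match on each block reproduces the workflow of blocking followed by per-block matching. A real deployment would not spend a HIT on assigning a centroid to its own block.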

SLIDE 9

Human-powered Median-based Blocking

• Human-powered UpdateCentroids [1]: identify the centroid of a data set
  - Workers find the “outlier” in a set of three records (i.e., a triplet).
  - Block centroid: the record selected as the outlier least often.
  - Sampling of the triplets: L=1, H=5
• Crowdsourced k-means clustering: Assign and UpdateCentroids.
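The triplet idea can be sketched as follows. This is a simplified illustration under stated assumptions: the worker is simulated by picking the point with the largest total distance to the other two, and every triplet in the block is enumerated instead of being sampled, so the L and H sampling parameters are not modelled. The name hp_outlier is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

Point = tuple[float, float]


def hp_outlier(triplet: tuple[Point, Point, Point]) -> Point:
    """Simulated worker (assumption): the point farthest from the other two."""
    return max(triplet, key=lambda p: sum(math.dist(p, q) for q in triplet))


def update_centroid(block: list[Point]) -> Point:
    """Return the record chosen as the outlier least often over all triplets."""
    votes = Counter({p: 0 for p in block})
    for triplet in combinations(block, 3):
        votes[hp_outlier(triplet)] += 1
    return min(block, key=lambda p: votes[p])
```

Enumerating all triplets grows cubically with the block size, which is why a sampling scheme is needed in practice; the exhaustive loop here is only meant for small blocks.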

SLIDE 10

Human-powered Hierarchical Blocking

• Build a K-ary tree of blocks in a top-down fashion.
• Split: FindCentroids and Assign.
• Example: (figure omitted)

SLIDE 11

SLIDE 12

• Stopping criterion:
  - |block| ≤ the pre-defined block size threshold S
  - The number of HITs due to further blocking (HITb) exceeds that from direct matching (HITd).
    Stringent condition: minHITb ≥ HITd
• To improve the overall accuracy, the algorithm runs multiple iterations.
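A minimal sketch of the top-down recursion with both stopping criteria is given below. Several parts are assumptions rather than the slides' definitions: records are plain numbers, the human-powered Split is replaced by a split into K equal-width value ranges, and minHITb is estimated as n-ary FindCentroids plus n-ary Assign plus matching inside K evenly sized blocks. The helper names are hypothetical.

```python
from typing import Callable

Block = list[float]
Splitter = Callable[[Block, int], list[Block]]


def pairs(n: int) -> int:
    """HITs needed to match n records directly (HITd), one HIT per pair."""
    return n * (n - 1) // 2


def min_hit_b(n: int, K: int) -> int:
    """Best-case HITs if n records are split once more (assumed cost model)."""
    q, r = divmod(n, K)  # most even split: r blocks of q + 1, K - r blocks of q
    matching = r * pairs(q + 1) + (K - r) * pairs(q)
    return (K - 1) + max(n - K, 0) + matching


def split_by_range(block: Block, K: int) -> list[Block]:
    """Stand-in for the human-powered Split: K equal-width value ranges."""
    lo, hi = min(block), max(block)
    width = (hi - lo) / K or 1.0
    children: list[Block] = [[] for _ in range(K)]
    for x in block:
        children[min(int((x - lo) / width), K - 1)].append(x)
    return [c for c in children if c]


def hierarchical_blocking(block: Block, K: int, S: int, split: Splitter) -> list[Block]:
    """Return the leaf blocks of the K-ary tree built top-down."""
    n = len(block)
    if n <= S or min_hit_b(n, K) >= pairs(n):
        return [block]  # one of the two stopping criteria holds
    children = split(block, K)
    if len(children) < 2:
        return [block]  # the split made no progress, so stop here
    leaves: list[Block] = []
    for child in children:
        leaves.extend(hierarchical_blocking(child, K, S, split))
    return leaves
```

For example, `hierarchical_blocking([float(i) for i in range(100)], K=4, S=15, split=split_by_range)` splits twice and returns leaf blocks that each hold at most 15 records.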

SLIDE 13

Experimental Setting

• We employ two different HIT designs:
  - Binary HIT: processes two records at a time.
  - N-ary HIT: all centroids can be displayed to the workers at once.
• Table: number of HITs for each component:

Human-Powered Component    Binary HIT                           N-ary HIT
FindCentroids              ≥ 1 + 2 + … + (K-1) = K*(K-1)/2      ≥ K-1
Assign                     (|D|-K)*(K-1)                        (|D|-K)*1 = |D|-K
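The table entries can be evaluated directly for a given |D| and K. The short sketch below encodes the formulas as written; for FindCentroids the value is the lower bound from the table, not an exact count. The function names are illustrative.

```python
def find_centroids_hits(K: int, n_ary: bool) -> int:
    """Lower bound on HITs for FindCentroids: K*(K-1)/2 (binary) or K-1 (n-ary)."""
    return K - 1 if n_ary else K * (K - 1) // 2


def assign_hits(num_records: int, K: int, n_ary: bool) -> int:
    """HITs for Assign: (|D|-K)*(K-1) with binary HITs, |D|-K with n-ary HITs."""
    per_record = 1 if n_ary else K - 1
    return (num_records - K) * per_record


if __name__ == "__main__":
    for n_ary in (False, True):
        print(n_ary, find_centroids_hits(5, n_ary), assign_hits(1000, 5, n_ary))
```

With |D| = 1000 and K = 5, Assign needs 3980 binary HITs but only 995 n-ary HITs, which shows where the n-ary design saves work.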

SLIDE 14

Evaluation on Synthetic Data

• Synthetic data
  - 1000 points.
  - Ground truth: points p and q are matched if their Euclidean distance d(p, q) ≤ a distance threshold T (= 1.41).
  - Parameter setting: S=100, K=5.
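The ground-truth rule for the point data can be written down in a few lines. In the sketch below, only the threshold T = 1.41 and the Euclidean distance come from the slide; the way the points are generated (uniformly in a square of a chosen side length) is an assumption, and the function names are hypothetical.

```python
import math
import random
from itertools import combinations

Point = tuple[float, float]


def make_points(n: int, extent: float, seed: int = 0) -> list[Point]:
    """Draw n points uniformly from a square of side `extent` (assumed layout)."""
    rng = random.Random(seed)
    return [(rng.uniform(0, extent), rng.uniform(0, extent)) for _ in range(n)]


def ground_truth(points: list[Point], T: float = 1.41) -> set[tuple[int, int]]:
    """Index pairs (i, j), i < j, whose Euclidean distance is at most T."""
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= T
    }
```

Note that T = 1.41 is slightly below the square root of 2, so under this rule two points that differ by exactly 1 in both coordinates are not matched.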

SLIDE 15

Results on Synthetic Data: F1

SLIDE 16

Results on Synthetic Data: cost

SLIDE 17

Evaluation on Real-life Data

• Image data
  - 100 images from ImageNet [2].
  - Ground truth: two images are matched if they share the same parent node in the hierarchy.
  - Parameter setting: S=15, K=4.
  - Paid $0.01/question to crowd workers on Amazon Mechanical Turk (AMT)
  - Majority voting
  - Optimization of HIT assignments
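Majority voting and the per-question price can be illustrated with a small sketch. The slide does not say how many workers answer each question or how ties are resolved, so the number of workers is left as a parameter and ties are treated as "no match"; both choices are assumptions, as are the function names.

```python
from collections import Counter


def majority_vote(answers: list[bool]) -> bool:
    """True if more workers answered 'match' than 'no match'; ties give False."""
    counts = Counter(answers)
    return counts[True] > counts[False]


def total_cost_cents(num_questions: int, workers_per_question: int,
                     price_cents: int = 1) -> int:
    """Total payment in cents at price_cents ($0.01 by default) per answer."""
    return num_questions * workers_per_question * price_cents
```

Working in whole cents avoids floating-point rounding when many $0.01 payments are added up.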

SLIDE 18

Image Data

• This data set has 585 pairs of matching records in eight leaf nodes.
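A count of matching pairs like this follows from grouping the images by parent node and counting the pairs inside each group. The sketch below shows the computation on a toy mapping; the image names and category labels are hypothetical and are not taken from the actual data set.

```python
from collections import defaultdict
from itertools import combinations


def matching_pairs(parent_of: dict[str, str]) -> set[tuple[str, str]]:
    """All image pairs that share the same parent node."""
    groups: dict[str, list[str]] = defaultdict(list)
    for image, parent in parent_of.items():
        groups[parent].append(image)
    return {
        pair
        for members in groups.values()
        for pair in combinations(sorted(members), 2)
    }


if __name__ == "__main__":
    toy = {  # hypothetical labels, for illustration only
        "img1": "ball sport", "img2": "ball sport", "img3": "ball sport",
        "img4": "water sport", "img5": "water sport",
    }
    print(len(matching_pairs(toy)))
```

A group of m images contributes m(m-1)/2 matching pairs, so the total depends only on the group sizes.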

SLIDE 19

Results on Image Data: F1

SLIDE 20

Results on Image Data: cost

SLIDE 21

Conclusions

• Feasibility study for human-powered blocking
  - Relatively high accuracy
  - Much lower cost compared to the naïve approach

SLIDE 22

References

• [1] H. Heikinheimo and A. Ukkonen. The crowd-median algorithm. In First AAAI Conference on Human Computation and Crowdsourcing, 2013.
• [2] http://www.image-net.org/
• Dataset URLs:
  - http://pike.psu.edu/download/crowdsens14/1k
  - http://pike.psu.edu/download/crowdsens14/100imgs

SLIDE 23

Questions?
