Improving IR performance from OCRed text using cooccurrence RISOT



SLIDE 1

Improving IR performance from OCRed text using cooccurrence RISOT 2012

Kripabandhu Ghosh and Anirban Chakraborty Indian Statistical Institute Kolkata, India

Kripabandhu Ghosh and Anirban Chakraborty, Indian Statistical Institute, Kolkata, India. Improving IR performance from OCRed text using cooccurrence, RISOT 2012. 1 / 22

SLIDE 2

Title Improving IR performance from OCRed text using cooccurrence RISOT 2012

SLIDE 3

Key Terms

Co-occurrence: We say that two words co-occur if they appear within a window of a certain number of words of each other in a document.
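As a minimal sketch of this definition, the Python below collects, for every word, the set of words found within a fixed window of it in the same document. The function name `cooccurrences`, the pre-tokenized input, and the window size of 2 are illustrative assumptions; the slides do not state the window size used.

```python
from collections import defaultdict

def cooccurrences(docs, window=2):
    """Map each word to the set of words within `window` positions of it.

    `docs` is a list of token lists (one list per document). The window
    size is an assumed parameter, not a value taken from the slides.
    """
    neighbours = defaultdict(set)
    for tokens in docs:
        for i, w in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                # Skip the word itself and repeated occurrences of it.
                if j != i and tokens[j] != w:
                    neighbours[w].add(tokens[j])
    return neighbours

docs = [["tobacco", "industry", "report", "on", "the", "industrv", "sales"]]
near = cooccurrences(docs, window=2)
print(sorted(near["industry"]))  # ['on', 'report', 'tobacco']
```

Because the window is symmetric, the relation is symmetric too: if `a` is a neighbour of `b`, then `b` is a neighbour of `a`.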

LCS similarity: LCS stands for Longest Common Subsequence.

LCS(industry, industrial) = industr

LCS similarity(industry, industrial) = |LCS(industry, industrial)| / max(|industry|, |industrial|) = 7/10 = 0.7
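The measure can be sketched in a few lines of Python, assuming the similarity is the LCS length divided by the length of the longer word, which is what the worked value 0.7 implies. The function names `lcs_length` and `lcs_similarity` are illustrative, not taken from the slides.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_similarity(a, b):
    """LCS length normalized by the length of the longer word."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

print(lcs_length("industry", "industrial"))      # 7
print(lcs_similarity("industry", "industrial"))  # 0.7
```

Normalizing by the longer word keeps the score in [0, 1] and penalizes large length differences, so a short word embedded in a long one does not score as a near match.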

SLIDE 4

RISOT task

SLIDE 5

OCRed Resources

Legal documents as hard copies

IIT CDIP 1.0 corpus - TREC Legal Ad Hoc

SIGIR Digital Museum - Cleverdon, Salton, Sparck Jones

SLIDE 6

RISOT task - Without Original Corpus

SLIDE 7

Social Networks : Direct Connections

SLIDE 8

Social Networks : Indirect Connections

SLIDE 9

Clustering Algorithm

CLUSTERING

SLIDE 10

Clustering Algorithm (Phase I)

For each word w in the OCRed corpus: let Sw1 and Sw2 be empty sets, and let Sw = Sw1 ∪ Sw2.

1. For each word w1 co-occurring with w, calculate the LCS similarity between w and w1. Store w1 in Sw1 if LCS similarity(w, w1) > T, for some threshold T.

2. For each w′ in Sw1, find the words w2 co-occurring with w′ such that LCS similarity(w, w2) > T. Include all these words in Sw1.

3. Repeat step (2) until no new word is added to Sw1.
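The three steps amount to a breadth-first expansion over co-occurrence links, always comparing candidates against the original word w. The sketch below is one possible reading, not the authors' implementation: the names `phase_one` and `neighbours`, the threshold value 0.8, the toy data, and the choice to leave w itself out of the returned set are all assumptions.

```python
from collections import deque

def lcs_similarity(a, b):
    """LCS length divided by the length of the longer word."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)

def phase_one(w, neighbours, threshold):
    """Grow Sw1 from w by following co-occurrence links.

    `neighbours` maps a word to the set of words it co-occurs with.
    A candidate is kept only if it is similar enough to w itself.
    """
    s_w1 = set()
    queue = deque([w])  # step 1 starts from the co-occurrences of w
    while queue:        # steps 2-3: repeat until nothing new is added
        current = queue.popleft()
        for cand in neighbours.get(current, ()):
            if (cand != w and cand not in s_w1
                    and lcs_similarity(w, cand) > threshold):
                s_w1.add(cand)
                queue.append(cand)
    return s_w1

# Toy co-occurrence map (hypothetical data for illustration only).
neighbours = {
    "industry": {"tobacco", "industrv"},
    "industrv": {"industry", "lndustry", "sales"},
    "lndustry": {"industrv"},
    "tobacco": {"industry"},
    "sales": {"industrv"},
}
print(sorted(phase_one("industry", neighbours, threshold=0.8)))
# ['industrv', 'lndustry']
```

In this toy example "lndustry" never co-occurs with "industry" directly; it is reached only through "industrv", which is the indirect connection that step (2) is meant to capture.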

SLIDE 11

Clustering Algorithm (Phase I)

SLIDE 12

Clustering Algorithm (Phase II)

1. Consider the top m words (by frequency in the corpus) co-occurring with w. For each such word w3, find the words w4 co-occurring with w3 such that LCS similarity(w, w4) > T. Include all these words in Sw2.

2. For each w′′ in Sw2, find the words w5 co-occurring with w′′ such that LCS similarity(w, w5) > T. Include all these words in Sw2.

3. Repeat step (2) until no new word is added to Sw2.
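Phase II differs from Phase I only in its seeds: it starts from the neighbours of the m most frequent words co-occurring with w, so variants that share context with w without ever appearing next to it can still be found. The following is a hedged sketch under assumed names (`phase_two`, `neighbours`, `freq`), an assumed threshold of 0.8, and hypothetical toy data.

```python
from collections import deque

def lcs_similarity(a, b):
    """LCS length divided by the length of the longer word."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)

def phase_two(w, neighbours, freq, threshold, m):
    """Build Sw2 from the top-m frequent co-occurring words of w."""
    # Step 1: rank the co-occurring words of w by corpus frequency.
    top = sorted(neighbours.get(w, ()),
                 key=lambda x: (-freq.get(x, 0), x))[:m]
    s_w2, queue = set(), deque()

    def consider(cand):
        # Keep a candidate only if it is similar enough to w itself.
        if (cand != w and cand not in s_w2
                and lcs_similarity(w, cand) > threshold):
            s_w2.add(cand)
            queue.append(cand)

    for w3 in top:
        for w4 in neighbours.get(w3, ()):
            consider(w4)
    # Steps 2-3: expand through co-occurrences until a fixed point.
    while queue:
        for w5 in neighbours.get(queue.popleft(), ()):
            consider(w5)
    return s_w2

# Toy data (hypothetical): "industrv" shares the context word "tobacco".
neighbours = {
    "industry": {"tobacco", "the"},
    "tobacco": {"industry", "industrv"},
    "the": {"industry", "tobacco"},
    "industrv": {"tobacco", "lndustry"},
    "lndustry": {"industrv"},
}
freq = {"the": 100, "tobacco": 40, "industry": 30, "industrv": 2, "lndustry": 1}
print(sorted(phase_two("industry", neighbours, freq, threshold=0.8, m=2)))
# ['industrv', 'lndustry']
```

With m=1 only "the" is used as a seed and nothing is found, which shows how sensitive the result can be to the choice of m.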

SLIDE 13

Clustering Algorithm (Phase II)

SLIDE 14

Query word cluster mapping

1. Calculate LCS similarity(wq, wC), where wq is the query word and wC is a word in cluster C, for each word in the corpus.

2. Choose all the clusters C for which LCS similarity(wq, wC) is greater than a high threshold.

3. For each cluster C obtained from step (2), define C′ = C ∪ {wq}.

4. Create complete-linkage clusters from each C′ of step (3) and keep those clusters containing wq.

SLIDE 15

Query word cluster mapping

5. For a cluster C, consider the LCS similarity between each pair of words in it. Let GMC denote the geometric mean of the LCS similarities of all the pairs. Compute GMC for each cluster given by step (4).

6. Select the cluster C with maximum GMC as the appropriate cluster for wq.
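Steps (1) to (6) can be sketched end to end as below. This is an illustration under several assumptions that the slides leave open: a cluster qualifies in step (2) if any of its words exceeds the high threshold, complete-linkage merging stops at a similarity cut-off `cut`, and a cluster with no word pairs gets a geometric mean of 0. All function names, the thresholds 0.8 and 0.75, and the toy clusters are hypothetical.

```python
import math
from itertools import combinations

def lcs_similarity(a, b):
    """LCS length divided by the length of the longer word."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b), 1)

def complete_linkage(words, cut):
    """Agglomerative clustering; the similarity of two clusters is the
    minimum pairwise similarity. Stops when the best merge is below `cut`."""
    clusters = [[w] for w in words]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            s = min(lcs_similarity(a, b)
                    for a in clusters[i] for b in clusters[j])
            if best is None or s > best[0]:
                best = (s, i, j)
        if best[0] < cut:
            break
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

def geometric_mean_similarity(cluster):
    """Geometric mean of LCS similarity over all word pairs (step 5)."""
    sims = [lcs_similarity(a, b) for a, b in combinations(cluster, 2)]
    return math.prod(sims) ** (1 / len(sims)) if sims else 0.0

def map_query_word(wq, clusters, high, cut):
    candidates = []
    for c in clusters:
        if any(lcs_similarity(wq, wc) > high for wc in c):      # steps 1-2
            c_prime = sorted(set(c) | {wq})                      # step 3
            candidates += [sub for sub in complete_linkage(c_prime, cut)
                           if wq in sub]                         # step 4
    if not candidates:
        return None
    return max(candidates, key=geometric_mean_similarity)       # steps 5-6

clusters = [{"industrv", "industrial", "industries"}, {"indus", "hindustan"}]
best = map_query_word("industry", clusters, high=0.8, cut=0.75)
print(sorted(best))  # ['industrv', 'industry']
```

In the toy run, only the first cluster passes the high threshold, and complete linkage then splits it into {industrial, industries} and {industrv, industry}; the latter contains the query word and is returned. Using the minimum pairwise similarity favours compact clusters, since a single dissimilar word blocks a merge.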

SLIDE 16

Query word clusters

SLIDE 17

Results

Run                              MAP       P5
Original text                    0.2567    0.3485
OCRed text (baseline)            0.1791    0.2738
Proposed method on OCRed text    0.1974¹   0.2831

¹ Not significant

SLIDE 18

Querywise Performance

SLIDE 19

Failure Analysis : clusters

SLIDE 20

Failure Analysis

Compact clusters over all-inclusive clusters

Re-clustering based on string match and co-occurrence

Chance co-occurrence - harmful

Incorporation of co-occurrence frequencies - essential

SLIDE 21

Brighter side

Practical utility

Language independent

Context information - reliable

Captures both erroneous and inflectional variants (effect of stemming)

SLIDE 22

THANK YOU
