Jointly Modeling Relevance and Sensitivity for Search Among Sensitive Content



SLIDE 1

Jointly Modeling Relevance and Sensitivity for Search Among Sensitive Content

Mahmoud F. Sayed, Douglas W. Oard

SLIDE 2

Image credit: HITEC Dubai

SLIDE 3

  • 10,045 FOIA requests
  • ~30k work-related emails

SLIDE 4

E-Discovery

  • 1. Formulation
  • 2. Acquisition
  • 3. Review for Relevance
  • 4. Review for Privilege
  • 5. Analysis

Diagram labels: Requesting Party, Responding Party, ~75% of total cost, ~1 month

SLIDE 5

Motivation

  • Review is expensive
    ○ Hiring law firms
  • Review is time-consuming
    ○ Long elapsed time between a request and its response
    ○ Ineffective access to information
  • Objective is to build “Search and Protection Engines”
    ○ Protect sensitive content
    ○ Still retrieve relevant content
    ○ Affordable
    ○ Fast

Building blocks: Learning to Rank, Automatic Sensitivity Classification

SLIDE 6

Proposed Approaches

Prefilter
  • Components: Sensitivity Classifier, Filter, Ranker
  • Input and output: Documents, Query, Result

Postfilter
  • Components: Sensitivity Classifier, Filter, Ranker
  • Input and output: Documents, Query, Result
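The two pipelines named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `is_sensitive` predicate, the `score` function, and the toy documents are all assumptions made for the example. The only thing taken from the slide is the ordering implied by the names, filtering before ranking versus filtering after ranking.

```python
from typing import Callable, List


def prefilter_search(docs: List[str], query: str,
                     is_sensitive: Callable[[str], bool],
                     score: Callable[[str, str], float], k: int) -> List[str]:
    """Drop documents flagged as sensitive, then rank what is left."""
    allowed = [d for d in docs if not is_sensitive(d)]
    ranked = sorted(allowed, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]


def postfilter_search(docs: List[str], query: str,
                      is_sensitive: Callable[[str], bool],
                      score: Callable[[str, str], float], k: int) -> List[str]:
    """Rank every document first, then drop flagged ones from the ranked list."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return [d for d in ranked if not is_sensitive(d)][:k]


# Hypothetical toy data: a keyword "classifier" and a term-count "ranker".
toy_docs = ["budget report", "secret budget memo", "lunch menu"]
toy_is_sensitive = lambda d: "secret" in d
toy_score = lambda q, d: float(d.split().count(q))

pre = prefilter_search(toy_docs, "budget", toy_is_sensitive, toy_score, k=2)
post = postfilter_search(toy_docs, "budget", toy_is_sensitive, toy_score, k=2)
```

With a fixed scoring function, as in this toy, both orderings return the same list. The pipelines can differ once the ranker is learned, because a ranker placed after the filter is only ever trained and applied on documents that passed it.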

SLIDE 7

How to evaluate such approaches?

SLIDE 8

Discounted Cumulative Gain (DCG)

Gain by relevance level:

                  Highly Relevant    Somewhat Relevant    Not Relevant
  Retrieved       +3                 +1
  Not Retrieved

Legend: Highly Relevant, Somewhat Relevant, Not Relevant

Example ranking: DCG@5 = 5.7
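As a rough sketch of how a DCG value like the one on this slide is computed, the function below uses the common formulation in which the gain at each rank is divided by log2(rank + 1). That discount, the gain of 0 for documents that are not relevant, and the example ranking are assumptions; the slide only shows the +3 and +1 gains and the final score, and its actual ranking is not recoverable from the transcript.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Sum of gain / log2(rank + 1) over the top-k positions (ranks start at 1)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


# Hypothetical ranking: two highly relevant documents, one that is not
# relevant (assumed gain 0), then two somewhat relevant documents.
example_gains = [3, 3, 0, 1, 1]
example_score = dcg_at_k(example_gains, k=5)  # about 5.71
```

Under these assumptions the hypothetical ranking happens to score about 5.7, the same magnitude as the slide's example.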

SLIDE 9

Cost-Sensitive DCG (CS-DCG)

Gain by relevance level:

                  Highly Relevant    Somewhat Relevant    Not Relevant
  Retrieved       +3                 +1
  Not Retrieved

Cost by sensitivity:

                  Sensitive    Not Sensitive
  Retrieved       -10
  Not Retrieved

Legend: Highly Relevant, Somewhat Relevant, Sensitive, Neither Relevant nor Sensitive

Example rankings: CS-DCG@5 = 5.7 and CS-DCG@5 = -4.3
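A minimal sketch of a cost-sensitive DCG follows, under two stated assumptions: relevance gains are discounted by log2(rank + 1), and the penalty for showing a sensitive document is subtracted without any rank discount. The second assumption is a guess that is at least consistent with the slide's two example scores, which differ by exactly 10 (5.7 versus -4.3). The rankings below are hypothetical.

```python
import math
from typing import Sequence, Tuple


def cs_dcg_at_k(ranking: Sequence[Tuple[float, bool]], k: int,
                penalty: float = 10.0) -> float:
    """Discounted relevance gain minus a flat penalty per sensitive document in the top k."""
    total = 0.0
    for rank, (gain, sensitive) in enumerate(ranking[:k], start=1):
        total += gain / math.log2(rank + 1)
        if sensitive:
            total -= penalty  # assumption: the cost is not discounted by rank
    return total


# Hypothetical rankings as (gain, is_sensitive) pairs; they differ only in
# whether the non-relevant document at rank 3 is sensitive.
safe_ranking = [(3, False), (3, False), (0, False), (1, False), (1, False)]
leaky_ranking = [(3, False), (3, False), (0, True), (1, False), (1, False)]

safe_score = cs_dcg_at_k(safe_ranking, k=5)    # about 5.71
leaky_score = cs_dcg_at_k(leaky_ranking, k=5)  # about -4.29
```

A single sensitive document in the top five is enough to push the score below zero here, which is the behavior the measure is meant to expose.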

SLIDE 10

Normalized CS-DCG (nCS-DCG)

Legend: Highly Relevant, Somewhat Relevant, Sensitive, Neither Relevant nor Sensitive

  • Example rankings: CS-DCG@5 = 5.7 and CS-DCG@5 = -4.3
  • Best Ranking: CS-DCG = 5.95
  • Worst Ranking: CS-DCG = -19.8
  • Normalized scores shown: nCS-DCG@5 = 0.60 and nCS-DCG@5 = 0.71
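One normalization that fits the numbers on this slide is min-max scaling between the worst and best achievable CS-DCG. This is an assumption inferred from the values, not a formula stated in the transcript: scaling -4.3 between -19.8 and 5.95 gives roughly 0.60, one of the two normalized scores shown. Which ranking the 0.71 belongs to is not recoverable here. The function name is hypothetical.

```python
def normalize_cs_dcg(score: float, best: float, worst: float) -> float:
    """Min-max scale a CS-DCG score so the worst ranking maps to 0 and the best to 1."""
    if best == worst:
        raise ValueError("best and worst scores must differ")
    return (score - worst) / (best - worst)


# Values taken from the slide; the pairing with 0.60 is inferred, not stated.
slide_value = normalize_cs_dcg(-4.3, best=5.95, worst=-19.8)  # about 0.60
```

Scaling this way keeps scores comparable across queries even when some queries have many more sensitive or relevant documents than others.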

SLIDE 11

Experiments

SLIDE 12

LETOR OHSUMED Test Collection

  • 348,566 medical publications
    ○ Fields: title, abstract, Medical Subject Heading (MeSH), etc.
    ○ 14,430 (with relevance judgments) for evaluation
    ○ 334,136 for sensitivity classifier training
  • 106 queries (~150 relevance judgments per query)
    ○ 3 levels: (2) Highly Relevant, (1) Somewhat Relevant, and (0) Not Relevant
  • Simulating “sensitivity”
    ○ 2 MeSH labels represent sensitive content (out of 118)
      ■ Male Urogenital Diseases [C12]
      ■ Female Urogenital Diseases and Pregnancy Complications [C13]
    ○ 12.2% of judged documents are sensitive
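The simulated sensitivity labels described above could be derived along these lines. This is only a sketch: it assumes each document carries MeSH tree numbers as dotted strings whose first segment is the category code (C12, C13), which may not match how the collection actually stores its MeSH field. The helper name and the sample tree numbers are hypothetical.

```python
from typing import Iterable

# Category codes named on the slide as standing in for sensitive content.
SENSITIVE_MESH_CATEGORIES = ("C12", "C13")


def is_sensitive(mesh_tree_numbers: Iterable[str]) -> bool:
    """True if any tree number falls under one of the sensitive categories."""
    return any(number.split(".")[0] in SENSITIVE_MESH_CATEGORIES
               for number in mesh_tree_numbers)


# Illustrative tree numbers only; they are not taken from the collection.
flagged = is_sensitive(["C12.777.419", "C04.588"])
not_flagged = is_sensitive(["C04.588", "C14.280"])
```

Comparing the first dotted segment exactly, instead of using a plain string prefix test, avoids matching an unrelated code that merely starts with the same characters.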

SLIDE 13

Sensitivity is Topic-Dependent

Figure panels: Easy topics, Hard topics

SLIDE 14

nCS-DCG@10 Comparison

SLIDE 15

Proposed Approaches

Prefilter
  • Components: Sensitivity Classifier, Filter, Ranker
  • Input and output: Documents, Query, Result

Postfilter
  • Components: Sensitivity Classifier, Filter, Ranker
  • Input and output: Documents, Query, Result

Joint
  • Components: Sensitivity Classifier, Ranker
  • Input and output: Documents, Query, Result
  • Listwise LtR Optimizing nCS-DCG
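The idea behind the joint approach, letting the ranker see the sensitivity signal and tuning it directly against the cost-sensitive measure, can be illustrated with a deliberately small toy. It is a stand-in, not the authors' listwise learner: it tunes one weight by grid search, optimizes unnormalized CS-DCG rather than nCS-DCG to stay short, and assumes an undiscounted penalty of 10 per sensitive document. All names, fields, and data are hypothetical.

```python
import math


def cs_dcg_at_k(ranking, k, penalty=10.0):
    """Discounted gain minus a flat penalty per sensitive document (assumed form)."""
    total = 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        total += doc["gain"] / math.log2(rank + 1)
        if doc["sensitive"]:
            total -= penalty
    return total


def rank_docs(docs, weight):
    """Order documents by relevance score minus weight * predicted sensitivity."""
    return sorted(docs, key=lambda d: d["rel_score"] - weight * d["sens_prob"],
                  reverse=True)


def tune_weight(queries, k, grid):
    """Return the grid weight with the highest mean CS-DCG@k over the training queries."""
    best_weight, best_mean = None, -math.inf
    for weight in grid:
        mean = sum(cs_dcg_at_k(rank_docs(docs, weight), k)
                   for docs in queries) / len(queries)
        if mean > best_mean:
            best_weight, best_mean = weight, mean
    return best_weight, best_mean


# One toy training query: the top-scoring document is also the sensitive one.
toy_queries = [[
    {"rel_score": 0.9, "sens_prob": 0.80, "gain": 3, "sensitive": True},
    {"rel_score": 0.7, "sens_prob": 0.10, "gain": 3, "sensitive": False},
    {"rel_score": 0.5, "sens_prob": 0.05, "gain": 1, "sensitive": False},
]]
best_weight, best_mean = tune_weight(toy_queries, k=2, grid=[0.0, 0.5, 1.0])
```

In this toy only the largest weight pushes the sensitive document out of the top two, so the search selects it. Unlike a hard filter, the ranker can still surface a document with a moderate sensitivity score when its relevance is high enough to outweigh the expected cost.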

SLIDE 16

nCS-DCG@10 Comparison

Listwise LtR

SLIDE 17

CS-DCG@10 Comparison

Can we reduce the number of queries with negative CS-DCG scores?

Chart labels: 20.7%, 44.3%, 27.3%, 25.4%

SLIDE 18

Cluster-Based Replacement (CBR)

  • Similar to diversity ranking
    ○ Retrieved documents are clustered
    ○ Any potentially sensitive document in the result list is replaced with a less sensitive document from the same cluster
  • 20 clusters using repeated bisection

Chart labels: 11%, 20.7%
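The replacement step described above might look like the sketch below. Several details are assumptions, since the slide does not specify them: a probability threshold decides what counts as potentially sensitive, the replacement is the least sensitive unused document in the same cluster, and a document is kept when its cluster offers nothing less sensitive. Cluster assignments are taken as given (the slide mentions 20 clusters from repeated bisection), and all names and data are hypothetical.

```python
from typing import Dict, List


def cluster_based_replacement(results: List[str], candidates: List[str],
                              cluster_of: Dict[str, int],
                              sens_prob: Dict[str, float],
                              threshold: float = 0.5) -> List[str]:
    """Swap each potentially sensitive result for a less sensitive cluster mate."""
    used = set(results)
    output = []
    for doc in results:
        if sens_prob[doc] < threshold:
            output.append(doc)  # not flagged: keep as is
            continue
        mates = [c for c in candidates
                 if c not in used
                 and cluster_of[c] == cluster_of[doc]
                 and sens_prob[c] < sens_prob[doc]]
        if mates:
            replacement = min(mates, key=lambda c: sens_prob[c])
            used.add(replacement)
            output.append(replacement)
        else:
            output.append(doc)  # assumption: keep the document if nothing better exists
    return output


# Hypothetical data: d2 is flagged and shares cluster 1 with d3 and d4.
toy_results = ["d1", "d2"]
toy_candidates = ["d1", "d2", "d3", "d4"]
toy_clusters = {"d1": 0, "d2": 1, "d3": 1, "d4": 1}
toy_sens = {"d1": 0.1, "d2": 0.9, "d3": 0.2, "d4": 0.6}
replaced = cluster_based_replacement(toy_results, toy_candidates,
                                     toy_clusters, toy_sens)
```

Because the replacement is chosen for low sensitivity and not for relevance, a swap can lower the relevance gain at that rank even as it avoids the penalty.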

SLIDE 19

CBR Adversely Affects nCS-DCG

                 No filter               Prefilter               Postfilter              Joint
                 unclustered  clustered  unclustered  clustered  unclustered  clustered  unclustered  clustered
BM25             0.727        0.779*     0.800        0.797      0.800        0.797      0.727        0.779*
Linear reg.      0.761        0.764      0.811*       0.785      0.817*       0.785      0.727        0.790*
LambdaMART       0.765        0.771      0.812*       0.788      0.823*       0.792      0.753        0.786*
AdaRank          0.756        0.779      0.822*       0.792      0.817*       0.791      0.823*       0.799
Coord. Ascent    0.762        0.781      0.816*       0.791      0.818*       0.790      0.842*       0.805

* Indicates two-tailed t-test with p < 0.05

SLIDE 20

Conclusion

  • Proposed CS-DCG and nCS-DCG to balance between relevance and sensitivity
  • Joint modeling approach yields better performance than straightforward approaches
  • Cluster-based replacement can reduce the number of queries with negative CS-DCG scores

SLIDE 21

Next Steps

  • Train a sensitivity classifier with fewer examples
  • Build test collections with real sensitivities
  • Experiment with tri-state classification
    ○ Sensitive
    ○ Needs human review
    ○ Not Sensitive

SLIDE 22

Thanks!

Mahmoud F. Sayed mfayoub@cs.umd.edu

Data and code can be found at https://github.com/mfayoub/SASC
