Privacy-aware Document Ranking with Neural Signals
Jinjin Shao, Shiyu Ji, Tao Yang
Department of Computer Science, University of California, Santa Barbara, United States
Challenge for Private Ranking
The client uploads encrypted documents and an encrypted index to the cloud, utilizing the cloud's massive storage and computing power. The server is honest-but-curious, i.e., it correctly executes protocols but observes/infers private information.

Challenges for private search:
- Feature leakage (e.g., term frequency) can lead to plaintext leakage.
- Crypto-heavy techniques are too expensive.
[Figure: the client sends Enc(Query) to the cloud; the cloud returns Enc(Doc id) results.]
Related Work for Private Search
- Searchable Encryption [Curtmola et al. Crypto06, Cash et al. Crypto13] does not support ranking.
- Leakage abuse attacks on encrypted indexes and features [Islam et al. NDSS12, Cash et al. CCS15, Wang et al. S&P17] exploit term frequency/co-occurrence.
- Order Preserving Encryption [Boldyreva et al. Crypto11] does not support arithmetic operations.
- Private additive ranking: [Xia et al. TPDS16] works for small datasets only; [Agun et al. WWW18] supports only partial cloud ranking.
- Private tree-based ranking: [Bost et al. NDSS15] uses computationally heavy techniques such as homomorphic encryption; [Ji et al. SIGIR18] does not support neural signals.
Neural Ranking Models for Ad-hoc Search
Two categories of neural ranking models:
- Representation-based
- Interaction-based
Interaction-based models outperform in TREC relevance benchmarks [Guo et al. CIKM16, Xiong et al. SIGIR17, Dai et al. WSDM18].

Steps of interaction-based neural ranking (see the sketch below):
- Pairwise interaction of query and document terms
- Kernel vector derivation from the interaction matrices
- Forward neural network calculation
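To make these steps concrete, below is a minimal NumPy sketch of KNRM-style kernel pooling [Xiong et al. SIGIR17]. The embeddings, kernel means `mus`, and width `sigma` are illustrative placeholders, not trained values from the paper.

```python
import numpy as np

def kernel_features(query_emb, doc_emb, mus, sigma=0.1):
    """query_emb: n x dim, doc_emb: m x dim, rows L2-normalized."""
    # Step 1: pairwise interaction -> n x m cosine-similarity matrix.
    sim = query_emb @ doc_emb.T
    # Step 2: RBF kernel pooling -> K kernel values per query term,
    # each summing soft matches around a kernel mean mu over doc terms.
    kernels = [np.exp(-(sim - mu) ** 2 / (2 * sigma ** 2)).sum(axis=1)
               for mu in mus]          # K arrays of length n
    K = np.stack(kernels, axis=1)      # n x K kernel values
    # Step 3: log-sum over query terms -> the K-dim feature vector
    # fed to the forward network.
    return np.log(np.clip(K, 1e-10, None)).sum(axis=0)

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 50)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(80, 50)); d /= np.linalg.norm(d, axis=1, keepdims=True)
mus = [-0.5, 0.0, 0.5, 1.0]            # the mu = 1 kernel is exact match
print(kernel_features(q, d, mus))
```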
Leakage in Interaction-based Neural Ranking
[Figure: a document with m terms and a query with n terms interact to form an m×n similarity matrix of real values; kernel computation derives n×K kernel values, which the forward network turns into a score.] Plaintext attacks [Islam et al. NDSS12, Cash et al. CCS15] can recover term frequency / term co-occurrence from these intermediate values.
Mitigation in this work:
1. Pre-compute kernel vectors with a closed soft match map.
2. Hide the exact match signal and obfuscate kernel values.
How Kernel Values Leak Term Frequency
The kernel features fed to the forward network are
$$\sum_{t \in q} \log K_1(t, d),\ \sum_{t \in q} \log K_2(t, d),\ \ldots,\ \sum_{t \in q} \log K_K(t, d),$$
where $K_j(t, d)$ is the $j$-th kernel value on the interaction of a possible query term $t$ and document $d$, representing semantic similarity [Xiong et al. SIGIR17].

Decompose the kernel values into two parts:
- $K_1(t, d), \ldots, K_{K-1}(t, d)$: soft match signals
- $K_K(t, d)$: exact match signal

Our analysis: the term frequency of $t$ in $d$ can be well approximated by $K_K(t, d)$ (see the check below). Solution for privacy preservation: replace $K_K(t, d)$ with relevance scores from a private tree ensemble.
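A small numeric check of this leakage claim, assuming distinct terms have random unit embeddings: with the exact match kernel (mean 1, tiny width), $K_K(t, d)$ essentially counts the occurrences of $t$ in $d$.

```python
import numpy as np

def exact_match_kernel(term_vec, doc_emb, sigma=0.001):
    sim = doc_emb @ term_vec           # cosine similarity to one term
    # Only doc terms identical to term_vec have sim ~ 1 and contribute.
    return np.exp(-(sim - 1.0) ** 2 / (2 * sigma ** 2)).sum()

rng = np.random.default_rng(1)
vocab = rng.normal(size=(100, 50))
vocab /= np.linalg.norm(vocab, axis=1, keepdims=True)
doc_ids = rng.integers(0, 100, size=200)   # a 200-term document
doc_emb = vocab[doc_ids]
t = 7
print("tf:", (doc_ids == t).sum(),
      " exact-match kernel:", round(exact_match_kernel(vocab[t], doc_emb), 2))
```

The two printed numbers coincide, which is exactly why the exact match kernel value reveals term frequency.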
How to Hide/Approximate Exact Match Signal
Proposed privacy-preserving approach: use a private tree ensemble over encrypted features to compute a relevance score [Ji et al. SIGIR18]. The approximated kernel vector replaces the exact match entries $\log K_K(t, d)$, $t \in q$, of the kernel vector with this score (sketched below). The encrypted features include, e.g., term frequency, proximity, and page quality score.
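A hedged sketch of the replacement: `tree_score` below is a hand-written stand-in for the private tree ensemble of [Ji et al. SIGIR18] (its encrypted-feature evaluation protocol is omitted); the point is only how the exact match entry of the kernel vector is swapped for a tree-based relevance score.

```python
import numpy as np

def tree_score(features):
    # Stand-in for the private tree ensemble: a toy stump ensemble over
    # (term frequency, proximity, page quality) features.
    tf, proximity, quality = features
    score = 0.6 if tf > 2 else 0.1            # stump on term frequency
    score += 0.3 if proximity > 0.5 else 0.0  # stump on proximity
    score += 0.1 * quality                    # linear leaf on quality
    return score

def approximated_kernel_vector(soft_log_kernels, features):
    # Soft match entries stay; the exact match entry becomes the tree score.
    return np.append(soft_log_kernels, tree_score(features))

soft = np.array([-2.1, 0.4, 1.7])   # log K_1 .. log K_{K-1} for one term
print(approximated_kernel_vector(soft, (3, 0.8, 0.5)))
```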
Closed Soft Match Map in Detail
Motivation for Soft Match
- Limit precomputation: avoid computing kernel values for all possible pairs of terms and documents.
- Otherwise, 1 million docs cost ~10 TB of storage.
- Basic idea: precompute kernel values for a term $t$ and document $d$ only if $t$ appears in $d$ or $t$ is soft-relevant to $d$.

Closed soft match:
- For two terms $t_1$ and $t_2$: if 1) $(t_1, d)$ is in a closed soft match map and 2) $t_1$ and $t_2$ are similar, then $(t_2, d)$ is in that map. Build the closed soft match map with clustering.
- Privacy advantage: prevents leaking term occurrence to the server (shown later).
Build Closed Soft Match Map with Clustering
If a term $t_1$ is in a $\tau$-similar term closure, there exists a term $t_2$ in the closure with $\mathrm{Sim}(t_1, t_2) \ge \tau$. Fixed-threshold clustering: apply a uniform $\tau$ for all closures. Weakness: closures can include 1) too many terms, which incurs huge storage cost, or 2) too few terms, which leads to high privacy leakage.
[Figure: example terms A-G with Sim(A, B) = 0.763, Sim(B, C) = 0.722, Sim(D, E) = 0.601, Sim(B, D) = 0.531, Sim(E, F) = 0.513, Sim(F, G) = 0.481, Sim(C, F) = 0.467, ...; threshold: 0.5.]
Build Closed Soft Match Map with Clustering
If a term $t_1$ is in a $\tau$-similar term closure, there exists a term $t_2$ in the closure with $\mathrm{Sim}(t_1, t_2) \ge \tau$. Adaptive clustering: given a closure minimum size $p$ and maximum size $x$, apply a series of decreasing thresholds $\tau_1 > \tau_2 > \cdots > \tau_s$ to gradually expand all term closures, such that in the end all closures have sizes between $p$ and $x$ (see the sketch below).
[Figure: same term-similarity example; threshold 1: 0.7, threshold 2: 0.4, size target: [3, 4].]
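Below is a sketch of adaptive clustering under stated assumptions: a greedy heuristic that merges a similar pair only while some closure is still below the minimum size $p$ and the merged closure would not exceed the maximum size $x$; the paper's exact merge rule may differ. On the example above (thresholds 0.7 then 0.4, size target [3, 4]) it yields closures of sizes 3 and 4.

```python
def adaptive_closures(terms, sims, thresholds, p, x):
    """sims: dict mapping frozenset({t1, t2}) -> similarity in [0, 1]."""
    closure = {t: {t} for t in terms}          # each term starts alone
    for tau in sorted(thresholds, reverse=True):
        # Consider the most similar pairs first at each threshold level.
        for pair, s in sorted(sims.items(), key=lambda kv: -kv[1]):
            if s < tau:
                continue
            t1, t2 = tuple(pair)
            c1, c2 = closure[t1], closure[t2]
            if c1 is c2:
                continue                       # already in one closure
            needs_growth = len(c1) < p or len(c2) < p
            if needs_growth and len(c1) + len(c2) <= x:
                merged = c1 | c2
                for t in merged:
                    closure[t] = merged
    return {frozenset(c) for c in closure.values()}

terms = list("ABCDEFG")
sims = {frozenset("AB"): 0.763, frozenset("BC"): 0.722,
        frozenset("DE"): 0.601, frozenset("BD"): 0.531,
        frozenset("EF"): 0.513, frozenset("FG"): 0.481,
        frozenset("CF"): 0.467}
print(adaptive_closures(terms, sims, thresholds=[0.7, 0.4], p=3, x=4))
# -> {frozenset({'A', 'B', 'C'}), frozenset({'D', 'E', 'F', 'G'})}
```

Fixed-threshold clustering corresponds to running a single threshold with no size constraints.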
Privacy Property of Closed Soft Match Map
Objective: given a closed soft match map, show that a server adversary is unlikely to learn term frequency/occurrence in dataset $D$.

How to prove: there are many different datasets $D'$ whose soft match maps, compared to $D$,
- have the same set of keys (guaranteed by the closed soft match map);
- have indistinguishable kernel values.
The cloud server is unlikely to differentiate them.

How to produce those many datasets: use a closure-based transformation (see the sketch below).
- Step 1: For each document $d$, partition all terms in $d$ into groups such that the terms in each group belong to the same term closure.
- Step 2: For each term group in $d$, replace that group with any nonempty subset of the term closure associated with that group.

Example: with term closure $\{t_1, t_3, t_6, t_7, t_8, t_9\}$, document $d = \{t_1, t_2, t_3, t_4, t_5, t_6\}$ can become $d' = \{t_1, t_2, t_7, t_4, t_5, t_8, t_9\}$.

Note: the server only knows hashed term ids in each term closure, not their meanings or their individual statistics. The statistical distance between the kernel values of $d$ and $d'$ with respect to a term can be very small.
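A minimal sketch of the two transformation steps with the toy term ids above; uniform random choice of the nonempty replacement subset is an assumption made here for illustration.

```python
import random

def transform(doc_terms, closures):
    """doc_terms: list of terms; closures: list of sets of terms."""
    by_term = {t: c for c in closures for t in c}
    groups, rest = {}, []
    # Step 1: partition the document's terms by term closure.
    for t in doc_terms:
        c = by_term.get(t)
        if c is None:
            rest.append(t)                   # term in no closure: keep
        else:
            groups.setdefault(id(c), (c, set()))[1].add(t)
    # Step 2: replace each group with a random nonempty closure subset.
    out = list(rest)
    for c, group in groups.values():
        k = random.randint(1, len(c))
        out.extend(random.sample(sorted(c), k))
    return out

closure = {"t1", "t3", "t6", "t7", "t8", "t9"}
doc = ["t1", "t2", "t3", "t4", "t5", "t6"]
print(transform(doc, [closure]))
# e.g. ['t2', 't4', 't5', 't1', 't7', 't8', 't9']
```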
Closure-based Transformation: Produce Indistinguishable Datasets
Kernel values of a term $t$ in document $d$ and its transformation $d'$:
$$\vec{f}_{t,d} = (a_1, a_2, a_3, \ldots, a_{K-1}), \qquad \vec{f}_{t,d'} = (a'_1, a'_2, a'_3, \ldots, a'_{K-1}),$$
$$\varepsilon \ \ge\ \mathrm{StatDist}\big(\vec{f}_{t,d}, \vec{f}_{t,d'}\big) = \frac{1}{2} \sum_{k=1}^{K-1} \left| a_k - a'_k \right|,$$
holding for every document $d$ and its transformation $d'$, over all terms.

Takeaway: a smaller $\varepsilon$ yields a smaller probability of successfully differentiating $d$ from $d'$.

Definition: $\varepsilon$-statistically indistinguishable.
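A worked check of the distance computation with illustrative values:

```python
# Statistical distance between the kernel value vectors of d and d'
# for one term: half the L1 distance of their K-1 soft match entries.
a  = [1.00, 2.00, 3.00, 1.00]   # f_{t,d}:  a_1 .. a_{K-1}
ap = [1.00, 2.10, 2.90, 1.00]   # f_{t,d'}: a'_1 .. a'_{K-1}
dist = 0.5 * sum(abs(x - y) for x, y in zip(a, ap))
print(dist)  # ~0.1, so d and d' are 0.1-statistically indistinguishable
```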
How to Minimize $\mathrm{StatDist}\big(\vec{f}_{t,d}, \vec{f}_{t,d'}\big)$

Kernel value obfuscation: for the $k$-th soft kernel value in the kernel value vector,
$$a_k = \begin{cases} \left\lceil \log_r K_k(t, d) \right\rceil, & \text{if } K_k(t, d) > 1, \\ 1, & \text{otherwise,} \end{cases}$$
where $r$ is a privacy parameter, $t$ is a term, and $d$ is a document.

Trade-off between privacy and ranking accuracy: increasing $r$ decreases the statistical distance, which increases the privacy guarantee but decreases the effectiveness of the soft match signals.
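A minimal sketch of the obfuscation rule above: each soft kernel value is bucketed by the ceiling of its base-$r$ logarithm, so a larger $r$ collapses more values into the same bucket.

```python
import math

def obfuscate(kernel_value, r):
    # a_k = ceil(log_r K_k(t, d)) if K_k(t, d) > 1, else 1.
    if kernel_value > 1:
        return math.ceil(math.log(kernel_value, r))
    return 1

for K_k in (0.4, 3.0, 30.0, 300.0):
    print(K_k, "-> r=5:", obfuscate(K_k, 5), " r=10:", obfuscate(K_k, 10))
```

With $r = 5$ the four values map to buckets 1, 1, 3, 4; with $r = 10$ they map to 1, 1, 2, 3, illustrating the privacy/accuracy trade-off.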
Datasets and Evaluation Objectives
- Robust04: ~0.5 million docs with 250 queries.
- ClueWeb09-Cat-B: ~50 million docs with 150 queries from the TREC Web tracks 2009-2011.

Evaluation objectives:
1. Can kernel vectors approximated with the private tree ensemble rank well?
2. Can kernel value obfuscation preserve ranking accuracy?
3. How effective are the two methods of clustering term closures for closed soft match maps?
Evaluation on Approx. Exact Match Signal
Model        | ClueWeb09-Cat-B            | Robust04
             | NDCG@1  NDCG@3  NDCG@10    | NDCG@1  NDCG@3  NDCG@10
LambdaMART   | 0.2893  0.2828  0.2827     | 0.5181  0.4610  0.4044
DRMM         | 0.2586  0.2659  0.2634     | 0.5049  0.4872  0.4528
KNRM         | 0.2663  0.2739  0.2681     | 0.4983  0.4812  0.4527
C-KNRM       | 0.3155  0.3124  0.3085     | 0.5373  0.4875  0.4586
C-KNRM*      | 0.2884  0.2927  0.2870     | 0.5007  0.4702  0.4510
C-KNRM*/T    | 0.3175  0.3122  0.3218     | 0.5404  0.5006  0.4657
C-KNRM is CONV-KNRM [Dai et al. WSDM18]; C-KNRM* is C-KNRM without bigram-bigram interaction; C-KNRM*/T is C-KNRM* with the private tree ensemble. Takeaway: integrating tree signals into neural kernel vectors ranks well, and can even boost ranking performance.
Evaluation on Kernel Value Obfuscation
Model                       | ClueWeb09-Cat-B            | Robust04
                            | NDCG@1  NDCG@3  NDCG@10    | NDCG@1  NDCG@3  NDCG@10
C-KNRM                      | 0.3155  0.3124  0.3085     | 0.5373  0.4875  0.4586
C-KNRM*                     | 0.2884  0.2927  0.2870     | 0.5007  0.4702  0.4510
C-KNRM*/TO, no obfuscation  | 0.3175  0.3122  0.3218     | 0.5404  0.5006  0.4657
C-KNRM*/TO, r = 5           | 0.3178  0.3067  0.3100     | 0.5306  0.4987  0.4613
C-KNRM*/TO, r = 10          | 0.3121  0.3097  0.3100     | 0.5221  0.4980  0.4623
C-KNRM*/TO is C-KNRM* with the private tree ensemble and kernel value obfuscation. Takeaway: kernel value obfuscation results in only a small degradation in ranking performance (~1.6% for NDCG@1 on ClueWeb) when r = 10.
Evaluation on Term Clustering Methods
Clustering method  | ClueWeb09-Cat-B                      | Robust04
                   | NDCG@1  NDCG@3  NDCG@10  (Storage)   | NDCG@1  NDCG@3  NDCG@10  (Storage)
C-KNRM             | 0.3155  0.3124  0.3085               | 0.5373  0.4875  0.4586
Fixed τ = 0.3      | 0.3136  0.3078  0.3091   (1700 TB)   | 0.5225  0.4974  0.4621   (45 TB)
Fixed τ = 0.7      | 0.3064  0.3048  0.3104   (16 TB)     | 0.4886  0.4644  0.4169   (0.3 TB)
Adaptive τ = 0.3   | 0.3052  0.3069  0.3120   (46 TB)     | 0.5127  0.4892  0.4582   (1 TB)
Adaptive τ = 0.7   | 0.3067  0.3012  0.3060   (7.6 TB)    | 0.4899  0.4608  0.4090   (0.3 TB)
These runs use C-KNRM*/TOC: private tree ensemble, kernel value obfuscation, and the closed soft match map. Takeaways: 1) the choice of clustering threshold has an impact on relevance; 2) adaptive clustering is competitive with up to ~40x storage cost savings.
Concluding Remarks
- Contribution: a privacy-aware neural ranking scheme for this open problem.
- Evaluation results with two datasets:
  - NDCG can be improved by approximating the exact match kernel of neural ranking with a tree ensemble.
  - Kernel value obfuscation on soft match signals carries a modest relevance trade-off for privacy.
  - Adaptive clustering for term closures delivers competitive relevance at substantially lower storage cost.