ANALYZING MASSIVE DATASETS WITH MISSING ENTRIES: MODELS AND ALGORITHMS


SLIDE 1

ANALYZING MASSIVE DATASETS WITH MISSING ENTRIES
MODELS AND ALGORITHMS

Nithin Varma
Thesis Advisor: Sofya Raskhodnikova

SLIDE 2

Algorithms for massive datasets

■ Cannot read the entire dataset
– Sublinear-time algorithms
■ Performance Metrics
– Speed
– Memory efficiency
– Accuracy
– Resilience to faults in data

SLIDE 3

Faults in datasets

■ Wrong Entries (Errors)
– sublinear algorithms
– machine learning
– error detection and correction
■ Missing Entries (Erasures): Our Focus

SLIDE 4

Occurrence of erasures: Reasons

■ Data collection
■ Hidden friend relations on social networks
■ Accidental deletion
■ Adversarial deletion

SLIDE 5

Large dataset with erasures: Access

■ Algorithm queries the oracle for dataset entries
■ Algorithm does not know in advance what's erased
■ Oracle returns:
– the nonerased entry, or
– special symbol ⊥ if queried point is erased

[Figure: the algorithm interacts with an oracle that holds the dataset (functions, codewords, graphs). Image sources: CC BY-SA]
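The access model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ErasureOracle`, the use of `None` for ⊥, and the query counter are choices made here, not part of the talk.

```python
from typing import Optional, Sequence, Set

ERASED = None  # stands in for the special symbol ⊥ (an assumption of this sketch)


class ErasureOracle:
    """Query access to a dataset in which some entries are erased."""

    def __init__(self, data: Sequence[int], erased: Set[int]) -> None:
        self._data = list(data)
        self._erased = set(erased)  # hidden from the algorithm
        self.queries = 0            # number of oracle calls made so far

    def __len__(self) -> int:
        return len(self._data)

    def query(self, i: int) -> Optional[int]:
        """Return entry i, or ERASED if position i is erased."""
        self.queries += 1
        if i in self._erased:
            return ERASED
        return self._data[i]


oracle = ErasureOracle(data=[3, 1, 4, 1, 5], erased={1, 3})
answers = [oracle.query(i) for i in range(len(oracle))]
print(answers, oracle.queries)
```

The algorithm only learns that a point is erased by querying it, which is why erased points still cost queries.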

SLIDE 6

Overview of our contributions

■ Erasure-Resilient Testing
[Dixit, Raskhodnikova, Thakurta & Varma '18, Kalemaj, Raskhodnikova & Varma]
■ Local Erasure-Decoding
[Raskhodnikova, Ron-Zewi & Varma '19]
– Application to property testing
■ Erasure-Resilient Sublinear-time Algorithms for Graphs
[Levi, Pallavoor, Raskhodnikova & Varma]
■ Sensitivity of Graph Algorithms to Missing Edges
[Varma & Yoshida]

Datasets: Functions, Codewords, Graphs

SLIDE 7

Outline

■ Erasures in property testing
■ Erasures in local decoding
■ Average sensitivity of graph algorithms
– Definition
– Main results
■ Average sensitivity of approximate maximum matching
■ Current and future directions

SLIDE 8

Outline

■ Erasures in property testing
■ Erasures in local decoding
■ Average sensitivity of graph algorithms
– Definition
– Main results
■ Average sensitivity of approximate maximum matching
■ Current and future directions

SLIDE 9

Decision problem

■ Can't solve nontrivial decision problems without full access to input
■ Need a notion of approximation

[Figure: the universe of inputs split into YES and NO instances. YES: accept, w.p. ≥ 2/3. NO: reject, w.p. ≥ 2/3]

SLIDE 10

Property testing problem

[Rubinfeld & Sudan '96, Goldreich, Goldwasser & Ron '98]

■ ε-far from property
– ≥ ε fraction of values must be changed to satisfy property

ε-tester:
– Accept, w.p. ≥ 2/3, if the input satisfies the property
– Reject, w.p. ≥ 2/3, if the input is ε-far from the property

[Figure: the universe of inputs, showing the property and the inputs that are ε-far from the property]

SLIDE 11

(Error) Tolerant testing problem

[Parnas, Ron & Rubinfeld '06]

■ ≤ α fraction of input is wrong
■ α-close to property
– ≤ α fraction of values can be changed to satisfy property

(α, ε)-tolerant tester:
– Accept, w.p. ≥ 2/3, if the input is α-close to the property
– Reject, w.p. ≥ 2/3, if the input is ε-far from the property

[Figure: the universe of inputs, showing the property, the inputs within distance α of it, and the inputs that are ε-far from it]

SLIDE 12

Erasure-resilient testing problem

[Dixit, Raskhodnikova, Thakurta & Varma '16]

■ ≤ α fraction of input is erased
■ Worst-case erasures, made before tester queries
■ Completion
– Fill-in values at erased points

(α, ε)-erasure-resilient tester:
– Accept, w.p. ≥ 2/3, if the input can be completed to satisfy the property
– Reject, w.p. ≥ 2/3, if every completion is ε-far from the property

[Figure: the universe of inputs, showing the inputs that can be completed to satisfy the property and the inputs whose every completion is ε-far]
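A small worked example may help with the two cases of this definition. The sketch below takes monotonicity of a function on a line as the property and computes, for a partially erased list, the smallest distance to monotonicity over all completions. It assumes that distance is normalized by the full domain size and that `None` marks an erased point; the function names are hypothetical. It relies on the observation that erased points can always be filled in to extend a nondecreasing subsequence of the nonerased values, so only nonerased points ever need to change.

```python
from bisect import bisect_right
from typing import List, Optional


def longest_nondecreasing(values: List[float]) -> int:
    """Length of a longest nondecreasing subsequence (patience sorting)."""
    tails: List[float] = []
    for v in values:
        pos = bisect_right(tails, v)
        if pos == len(tails):
            tails.append(v)
        else:
            tails[pos] = v
    return len(tails)


def best_completion_distance(f: List[Optional[float]]) -> float:
    """Smallest fraction of points that must change, over all completions.

    Assumes f is non-empty and that None marks an erased point.
    """
    nonerased = [v for v in f if v is not None]
    return (len(nonerased) - longest_nondecreasing(nonerased)) / len(f)


# Can be completed to a monotone function: distance 0.
print(best_completion_distance([1, None, 2, None, 3]))
# Every completion needs at least one of the six points changed.
print(best_completion_distance([1, None, 3, 2, None, 5]))
```

An input falls in the "accept" case exactly when this quantity is 0, and in the "reject" case when it is at least ε.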

SLIDE 13

Relationship between models

[Figure: diagram relating the three models: Testing, Erasure-resilient testing, Tolerant testing]

SLIDE 14

Erasure-resilient testing: Our results

[Dixit, Raskhodnikova, Thakurta & Varma '18]

■ Blackbox transformations
■ Efficient erasure-resilient testers for other properties
■ Separation of standard and erasure-resilient testing

SLIDE 15

Our blackbox transformations

■ Make certain classes of uniform testers erasure-resilient
■ Work by simply repeating the original tester
■ Apply to:
– Monotonicity over general partial orders [FLNRRS02]
– Convexity of black and white images [BMR15]
– Boolean functions having at most k alternations in values

Query complexity of the (α, ε)-erasure-resilient tester equals that of the ε-tester, for α ∈ (0,1), ε ∈ (0,1)

SLIDE 16

Main properties that we study

■ Monotonicity, Lipschitz properties, and convexity of real-valued functions
■ Widely studied in property testing

[EKKRV00, DGLRRS99, LR01, FLNRRS02, PRR03, AC04, F04, HK04, BRW05, PRR06, ACCL07, BGJRW12, BCGM10, BBM11, AJMS12, DJRT13, JR13, CS13a, CS13b, BlRY14, CST14, BB15, CDJS15, CDST15, BB16, CS16, KMS18, BCS18, PRV18, B18, CS19, …]

■ Optimal testers for these properties are not uniform testers
– Our blackbox transformation does not apply

SLIDE 17

Optimal erasure-resilient testers

■ For functions f: [n] → ℝ
– Monotonicity
– Lipschitz properties
– Convexity

Query complexity of the (α, ε)-erasure-resilient tester equals that of the ε-tester, for α ∈ (0,1), ε ∈ (0,1)

■ For functions f: [n]^d → ℝ
– Monotonicity
– Lipschitz properties

Query complexity of the (α, ε)-erasure-resilient tester equals that of the ε-tester, for ε ∈ (0,1), α = O(ε/d)

SLIDE 18

Separation of erasure-resilient and standard testing

Theorem: There exists a property P on inputs of size n such that:
• P can be tested with a constant number of queries
• every erasure-resilient tester for P needs Ω̃(n) queries

SLIDE 19

Relationship between models

[Figure: diagram relating the three models: Standard testing, Erasure-resilient testing, Tolerant testing]

Some containments are strict:
• [Fischer & Fortnow '05]: standard vs. tolerant
• [Dixit, Raskhodnikova, Thakurta & Varma '18]: standard vs. erasure-resilient

SLIDE 20

Outline

■ Erasures in property testing
■ Erasures in local decoding
■ Average sensitivity of graph algorithms
– Definition
– Main results
■ Average sensitivity of approximate maximum matching
■ Current and future directions

SLIDE 21

Local decoding

■ Error correcting code C: Σ^n → Σ^N, for N > n
■ Decoding: Recover x from w if not too many errors or erasures
■ Local decoder: Sublinear-time algorithm for decoding

[Figure: Message x → Encoder → C(x) → Channel → Received word w]

Local decoding is extensively studied and has many applications

[GL89, BFLS91, BLR93, GLRSW91, GS92, PS94, BIKR93, KT00, STV01, Y08, E12, DGY11, BET10, …]
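To make the notion of a local decoder concrete, here is a sketch using the Hadamard code, a standard textbook example that is chosen here for illustration and is not taken from the talk. Each message bit is recovered from two queries to the received word, and a pair of queries is simply retried if either position is erased. The function names, the `None` marker for erased positions, and the `max_tries` cap are assumptions of this sketch.

```python
import random
from typing import List, Optional


def hadamard_encode(x: List[int]) -> List[int]:
    """Codeword position r holds the inner product <x, r> mod 2."""
    k = len(x)
    word = []
    for r in range(2 ** k):
        bit = 0
        for j in range(k):
            if (r >> j) & 1:
                bit ^= x[j]
        word.append(bit)
    return word


def local_decode_bit(w: List[Optional[int]], i: int, k: int,
                     rng: random.Random, max_tries: int = 200) -> Optional[int]:
    """Recover message bit i from two queries, retrying if one is erased."""
    for _ in range(max_tries):
        r = rng.randrange(2 ** k)
        a, b = w[r], w[r ^ (1 << i)]
        if a is not None and b is not None:
            return a ^ b  # <x, r> + <x, r + e_i> = x_i over GF(2)
    return None  # gave up: every attempt hit an erased position


message = [1, 0, 1, 1, 0, 1]
received: List[Optional[int]] = hadamard_encode(message)
for r in range(0, len(received), 4):
    received[r] = None  # erase a quarter of the codeword
rng = random.Random(0)
decoded = [local_decode_bit(received, i, len(message), rng)
           for i in range(len(message))]
print(decoded)
```

Because an erased position announces itself, the decoder never returns a wrong bit in this erasure-only setting; with errors instead of erasures, the same two-query check could silently return an incorrect value.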

SLIDE 22

Local decoding and property testing

[Raskhodnikova, Ron-Zewi & Varma '19]

Our Results

■ Initiate study of erasures in the context of local decoding
■ Erasures are easier than errors in local decoding
■ Separation between erasure-resilient and (error) tolerant testing

SLIDE 23

Separation of erasure-resilient and tolerant testing

[Raskhodnikova, Ron-Zewi & Varma '19]

Theorem: There exists a property P on inputs of size n such that:
• P has an erasure-resilient tester with a constant number of queries
• every (error) tolerant tester for P needs n^Ω(1) queries

SLIDE 24

Relationship between models

[Figure: diagram relating the three models: Testing, Erasure-resilient testing, Tolerant testing]

All containments are strict:
• [Fischer & Fortnow '05]: standard vs. tolerant
• [Dixit, Raskhodnikova, Thakurta & Varma '18]: standard vs. erasure-resilient
• [Raskhodnikova, Ron-Zewi & Varma '19]: erasure-resilient vs. tolerant

SLIDE 25

Outline

■ Erasures in property testing
■ Erasures in local decoding
■ Average sensitivity of graph algorithms
– Definition
– Main results
■ Average sensitivity of approximate maximum matching
■ Current and future directions

SLIDE 26

Motivation

■ Want to solve optimization problems on large graphs
– Maximum matching, min. vertex cover, min. cut, …
■ Cannot assume that we get access to the true graph
– A fraction of the edges, say 1%, might be missing
■ Need algorithms that are robust to missing edges

SLIDE 27

Towards average sensitivity

■ Want to solve problem on G; only have access to G′.
■ Similar to robustness notions in differential privacy [Dwork, Kenthapadi, McSherry, Mironov & Naor '06, Dwork, McSherry, Nissim & Smith '06], learning theory [Bousquet & Elisseeff '02], ….

[Figure: Algorithm A maps G = (V, E) to A(G), and maps G′ = (V, E′), where E′ ⊆ E, to A(G′)]

SLIDE 28

Average sensitivity: Deterministic algorithm [Varma & Yoshida]

■ A: deterministic graph algorithm outputting a set of edges or vertices
– e.g., A outputs a maximum matching
■ s_A: 𝒢 → ℝ, where 𝒢 is the universe of input graphs

Average sensitivity of deterministic algorithm A:
s_A(G) = avg_{e∈E} [Ham(A(G), A(G − e))]
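The definition for deterministic algorithms can be evaluated directly on small graphs. The sketch below assumes that the Hamming distance between two output edge sets is the size of their symmetric difference, and it uses a simple deterministic greedy matching as a stand-in algorithm; both the stand-in and the function names are choices made for this illustration.

```python
from typing import Callable, FrozenSet, List, Tuple

Edge = Tuple[int, int]


def greedy_matching(edges: List[Edge]) -> FrozenSet[Edge]:
    """Deterministic greedy: scan edges in sorted order, keep those that fit."""
    matched = set()
    matching = set()
    for u, v in sorted(edges):
        if u not in matched and v not in matched:
            matching.add((u, v))
            matched.update((u, v))
    return frozenset(matching)


def average_sensitivity(algorithm: Callable[[List[Edge]], FrozenSet[Edge]],
                        edges: List[Edge]) -> float:
    """Average over e in E of |A(G) symmetric-difference A(G - e)|."""
    base = algorithm(edges)
    total = 0
    for e in edges:
        remaining = [f for f in edges if f != e]
        total += len(base ^ algorithm(remaining))
    return total / len(edges)


path = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(average_sensitivity(greedy_matching, path))  # 1.5
```

On this path, deleting the first edge changes the whole output (distance 4), deleting the third changes one chosen edge (distance 2), and the other two deletions change nothing, which gives (4 + 0 + 2 + 0) / 4 = 1.5.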

SLIDE 29

Average sensitivity: Randomized algorithm [Varma & Yoshida]

■ s_A: 𝒢 → ℝ, where 𝒢 is the universe of input graphs
■ Algorithm with low average sensitivity: stable-on-average

Average sensitivity of randomized algorithm A:
s_A(G) = avg_{e∈E} [Dist(A(G), A(G − e))]
where A(G) and A(G − e) are output distributions

SLIDE 30

Average sensitivity: Randomized algorithms

Average sensitivity of randomized algorithm A, s_A(G), is defined as:
avg_{e∈E} [Dist(A(G), A(G − e))]

Dist: optimal cost of moving the probability mass from one distribution to the other

[Figure: for e ∈ E, probability mass p is moved from output x of distribution A(G) to output y of distribution A(G − e)]

cost(p, x → y) = p ⋅ Ham(x, y)

SLIDE 31

Average sensitivity: Randomized algorithms

[Varma & Yoshida]

Average sensitivity of randomized algorithm A, s_A(G), is defined as:
avg_{e∈E} [d_EM(A(G), A(G − e))]

d_EM: earth mover's distance, the optimal cost of moving the probability mass from one distribution to the other

[Figure: for e ∈ E, probability mass is moved between distribution A(G) and distribution A(G − e)]

Can extend definition to multiple missing edges
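Written out, the earth mover's distance used in this definition is a minimum over transport plans between the two output distributions. The formulation below is the standard one; the symbol T for the transport plan is introduced here for illustration and does not appear on the slides.

```latex
\[
d_{\mathrm{EM}}\bigl(A(G),\, A(G-e)\bigr)
  \;=\; \min_{T \ge 0} \; \sum_{x,\,y} T(x, y)\,\mathrm{Ham}(x, y)
\]
\[
\text{subject to}\quad
\sum_{y} T(x, y) = \Pr[A(G) = x],
\qquad
\sum_{x} T(x, y) = \Pr[A(G-e) = y],
\]
\[
s_A(G) \;=\; \frac{1}{|E|} \sum_{e \in E} d_{\mathrm{EM}}\bigl(A(G),\, A(G-e)\bigr).
\]
```

For a deterministic algorithm both distributions are point masses, so the only feasible plan moves all the mass from A(G) to A(G − e) and the distance reduces to Ham(A(G), A(G − e)).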

SLIDE 32

Locality implies low average sensitivity

[Figure: Algorithm A maps graph G to A(G). The local simulator L, given e ∈ E, outputs 1 if e ∈ A(G), and 0 otherwise]

Local computation algorithm [Rubinfeld, Tamir, Vardi & Xie '11]

q(G) ≜ 𝔼_{e∈E} [#queries by L]

Our Theorem: s_A(G) ≤ q(G)

SLIDE 33

Locality implies low average sensitivity

[Figure: Algorithm A with random string π maps graph G to A_π(G). The local simulator L, given e ∈ E and π, outputs 1 if e ∈ A_π(G), and 0 otherwise]

Local computation algorithm [Rubinfeld, Tamir, Vardi & Xie '11]

q(G) ≜ 𝔼_{π, e∈E} [#queries by L]

Our Theorem: s_A(G) ≤ q(G)

π is the random string

SLIDE 34

Main results

■ Approximation algorithms with low average sensitivity for
– Minimum spanning tree
– Global min cut
– Maximum matching
– Minimum vertex cover
■ Lower bounds on average sensitivity for
– Global min cut algorithms
– 2-coloring algorithms

SLIDE 35

Outline

■ Erasures in property testing
■ Erasures in local decoding
■ Average sensitivity of graph algorithms
– Properties of the definition
– Main results
■ Average sensitivity of approximate maximum matching
■ Current and open directions

SLIDE 36

Average sensitivity of approximating the maximum matching: Our results

Upper Bound: There exists a polynomial time matching algorithm with
– Approximation ratio: 1/2 − o(1)
– Average sensitivity: Õ(OPT^0.75)

Lower Bound: Every exact maximum matching algorithm has average sensitivity Ω(OPT).

OPT: size of max matching

SLIDE 37

Average sensitivity of exact maximum matching

■ Even cycle C_n
– Exactly two max. matchings
– For every edge e, the graph C_n − e has exactly one max. matching
■ Deterministic max. matching algorithm A
– For n/2 edges e, outputs A(C_n) and A(C_n − e) differ in Ω(OPT) edges
– Average sensitivity of A is Ω(OPT)

Average sensitivity of exact max. matching is Ω(OPT).
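The even-cycle argument can be checked by brute force on a small instance. The sketch below uses the lexicographically smallest maximum matching as the deterministic exact algorithm and measures the distance between outputs as the size of their symmetric difference; the function names and the choice n = 8 are assumptions made for this illustration.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

Edge = Tuple[int, int]


def is_matching(edge_set: Tuple[Edge, ...]) -> bool:
    """True if no two edges in edge_set share an endpoint."""
    endpoints = [v for e in edge_set for v in e]
    return len(endpoints) == len(set(endpoints))


def lex_smallest_max_matching(edges: List[Edge]) -> FrozenSet[Edge]:
    """Brute force: the first maximum matching in lexicographic order."""
    ordered = sorted(edges)
    for size in range(len(ordered), -1, -1):
        for subset in combinations(ordered, size):  # lexicographic order
            if is_matching(subset):
                return frozenset(subset)
    return frozenset()


def cycle(n: int) -> List[Edge]:
    """Edges of the cycle on vertices 0, ..., n - 1."""
    return [(i, i + 1) for i in range(n - 1)] + [(0, n - 1)]


n = 8
edges = cycle(n)
base = lex_smallest_max_matching(edges)
opt = len(base)
distances = [len(base ^ lex_smallest_max_matching([f for f in edges if f != e]))
             for e in edges]
average = sum(distances) / len(edges)
print(opt, sorted(distances), average)
```

Deleting any of the n/2 edges used by the output forces the algorithm onto the other perfect matching, so the two outputs are disjoint and differ in n = 2·OPT edges; deleting an unused edge changes nothing. The average is therefore exactly OPT on this instance.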

SLIDE 38

Upper bound: Starting point

Randomized greedy matching algorithm A

On input G:
• As long as possible, add a fresh uniformly random edge of G into the matching M
• Output M

Approximation ratio: 1/2

Local algorithm for A with query complexity ≤ Δ(G) [Yoshida, Yamamoto & Ito '12]
[Parnas & Ron '07; Nguyen & Onak '08; Onak, Ron, Rosen & Rubinfeld '12]

Average sensitivity ≤ Δ(G) (locality implies low sensitivity)
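A direct Python rendering of the randomized greedy algorithm might look as follows. It reads "fresh" as "sharing no endpoint with the edges already in the matching", which is an interpretation made here; the helper `is_maximal_matching` is only there to check the output.

```python
import random
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def randomized_greedy_matching(edges: List[Edge], rng: random.Random) -> Set[Edge]:
    """Repeatedly add a uniformly random edge that still fits into the matching."""
    matching: Set[Edge] = set()
    available = list(edges)  # edges with both endpoints still unmatched
    while available:
        u, v = rng.choice(available)
        matching.add((u, v))
        available = [(a, b) for (a, b) in available
                     if a not in (u, v) and b not in (u, v)]
    return matching


def is_maximal_matching(edges: List[Edge], matching: Set[Edge]) -> bool:
    """Check that `matching` is a matching and that no further edge can be added."""
    matched = [x for e in matching for x in e]
    if len(matched) != len(set(matched)):
        return False  # two chosen edges share a vertex
    covered = set(matched)
    return all(u in covered or v in covered for u, v in edges)


graph = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
result = randomized_greedy_matching(graph, random.Random(1))
print(result, is_maximal_matching(graph, result))
```

The output is always a maximal matching, and every maximal matching has at least half the size of a maximum one, which is where the approximation ratio of 1/2 comes from.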

SLIDE 39

Improving average sensitivity of A

Average sensitivity of A ≤ Δ(G)

Average sensitivity can be high when max. degree is large

Idea: Remove all vertices of degree ≥ m/(ε⋅OPT) + Lap(λ), and then run A

Let ε ∈ (0,1/2)

≤ ε ⋅ OPT vertices removed ⇒ Approximation ratio is 1/2 − ε

Average sensitivity of vertex-removal step can be large

m: Number of edges

SLIDE 40

Improving average sensitivity of A

Average sensitivity of A ≤ Δ(G)

Average sensitivity can be high when max. degree is large

Idea: Remove all vertices of degree ≥ m/(ε⋅OPT) + Lap(λ), and then run A

Let ε ∈ (0,1/2) and λ = Θ(m/(ε⋅OPT) ⋅ 1/ln n)

W.h.p. ≤ ε ⋅ OPT vertices removed ⇒ W.h.p. approximation ratio is 1/2 − ε

SLIDE 41

Degree-reduction matching algorithm

Algorithm A′

On input G:
• Sample L ∼ m/(ε⋅OPT) + Lap(m/(ε⋅OPT) ⋅ 1/ln n)
• Run A on the graph after removing vertices of degree at least L

Approximation ratio: 1/2 − ε
Average sensitivity: O((m/(ε⋅OPT))^3)

Sequential Composition [Varma & Yoshida]
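The two steps of the degree-reduction algorithm can be sketched as below. Several things here are assumptions of the sketch and not statements from the talk: OPT is passed in as a parameter `opt`, the Laplace scale is taken to be m/(ε⋅OPT) divided by ln n with constant 1, the number of vertices `n` is at least 2, and a random-order greedy matching stands in for the algorithm that is run after vertex removal.

```python
import math
import random
from collections import Counter
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def laplace(scale: float, rng: random.Random) -> float:
    """Lap(scale) sample: the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def greedy(edges: List[Edge], rng: random.Random) -> Set[Edge]:
    """Stand-in matching routine: greedy over a random edge order."""
    order = list(edges)
    rng.shuffle(order)
    used: Set[int] = set()
    matching: Set[Edge] = set()
    for u, v in order:
        if u not in used and v not in used:
            matching.add((u, v))
            used.update((u, v))
    return matching


def degree_reduction_matching(edges: List[Edge], n: int, opt: int, eps: float,
                              rng: random.Random) -> Tuple[Set[Edge], float]:
    """Remove vertices of degree >= a noisy threshold, then match what is left."""
    m = len(edges)
    center = m / (eps * opt)
    threshold = center + laplace(center / math.log(n), rng)
    degree = Counter(x for e in edges for x in e)
    kept = [(u, v) for u, v in edges
            if degree[u] < threshold and degree[v] < threshold]
    return greedy(kept, rng), threshold


# A star with six leaves plus a short path: 11 vertices, 9 edges, OPT = 3.
graph = [(0, i) for i in range(1, 7)] + [(7, 8), (8, 9), (9, 10)]
matching, threshold = degree_reduction_matching(
    graph, n=11, opt=3, eps=0.5, rng=random.Random(7))
print(matching, threshold)
```

The noise on the threshold is what keeps the set of removed vertices from changing abruptly when a single edge is deleted; a fixed threshold could flip a vertex in or out of the graph on one deletion.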

SLIDE 42

Lexicographically smallest matching

■ Fix an ordering on vertex pairs
■ Algorithm A′′ outputs the lexicographically smallest maximum matching

Our Theorem: Average sensitivity of A′′ ≤ OPT^2/m

SLIDE 43

Final Algorithm B

Degree-reduction algorithm A′: s_A′(G) = O((m/(ε⋅OPT))^3)
Lex. smallest matching algorithm A′′: s_A′′(G) = OPT^2/m

On input G:
• Run A′ with probability s_A′′(G) / (s_A′′(G) + s_A′(G)) and run A′′ with remaining probability

Approximation ratio: 1/2 − ε
Average sensitivity: O((OPT/ε)^0.75)

Parallel Composition [Varma & Yoshida]
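A back-of-the-envelope calculation shows where the exponent 0.75 can come from. It reads the two bounds on this slide as s_{A'} = (m/(ε·OPT))^3 for the degree-reduction algorithm and s_{A''} = OPT^2/m for the lexicographically smallest matching algorithm, with m the number of edges, and treats them as exact. It also ignores the additional term that a composition theorem has to account for because the selection probability itself changes when an edge is removed, so it is a heuristic sketch and not the proof.

```latex
\begin{align*}
p\,s_{A'} + (1-p)\,s_{A''}
  &= \frac{2\,s_{A'}\,s_{A''}}{s_{A'} + s_{A''}}
  \;\le\; 2\min\{s_{A'},\, s_{A''}\},
  \qquad\text{where } p = \frac{s_{A''}}{s_{A'} + s_{A''}},\\[4pt]
\Bigl(\frac{m}{\varepsilon\,\mathrm{OPT}}\Bigr)^{3} = \frac{\mathrm{OPT}^{2}}{m}
  &\iff m = \varepsilon^{3/4}\,\mathrm{OPT}^{5/4}
  \;\Longrightarrow\;
  \frac{\mathrm{OPT}^{2}}{m} = \Bigl(\frac{\mathrm{OPT}}{\varepsilon}\Bigr)^{3/4}.
\end{align*}
```

Since the first bound grows with m and the second shrinks with m, their minimum is largest where they cross, which gives the (OPT/ε)^0.75 behaviour for every value of m.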

SLIDE 44

What we saw

Theorem: Matching algorithm with
– Approximation ratio: 1/2 − o(1)
– Average sensitivity: Õ(OPT^0.75)

SLIDE 45

Outline

■ Erasures in property testing
■ Erasures in local decoding
■ Average sensitivity of graph algorithms
– Properties of the definition
– Main results
■ Average sensitivity of approximate maximum matching
■ Current and future directions

SLIDE 46

Current and future directions

■ Erasure-resilience in other models of sublinear algorithms
■ Erasure-resilient testing under different erasure models
– Ongoing work with Sofya Raskhodnikova and Iden Kalemaj
■ Average sensitivity bounds for other optimization problems

SLIDE 47

Thanks to my Wonderful Collaborators

Kashyap Dixit
Iden Kalemaj
Amit Levi
Ramesh Pallavoor
Sofya Raskhodnikova
Noga Ron-Zewi
Abhradeep Thakurta
Yuichi Yoshida

SLIDE 48

Current and future directions

■ Erasure-resilience in other models of sublinear algorithms
■ Erasure-resilient testing under different erasure models
– Ongoing work with Sofya Raskhodnikova and Iden Kalemaj
■ Average sensitivity bounds for other optimization problems

Thank You!