The Vegan Data Diet: How Wikipedia cuts down privacy issues while keeping data fit

Marcel Ruiz Forns, Software Developer, Analytics Team


slide-1
SLIDE 1

The Vegan Data Diet

How Wikipedia cuts down privacy issues while keeping data fit

slide-2
SLIDE 2

Marcel Ruiz Forns

Software Developer Analytics Team

slide-3
SLIDE 3

Marcel Ruiz Forns

Software Developer Analytics Team

slide-4
SLIDE 4

Anyone can edit!

Marcel Ruiz Forns

Software Developer Analytics Team

slide-5
SLIDE 5

Anyone can edit!

Privacy

  • Why privacy
  • What we do
  • Implementation
  • Pros and cons
  • Questions

Marcel Ruiz Forns

Software Developer Analytics Team

slide-6
SLIDE 6

https://blog.wikimedia.org/2018/04/18/greece-legal-case-ended

slide-7
SLIDE 7

By: Hugh D'Andrade, Senior Designer @ EFF https://commons.wikimedia.org/wiki/File:Laptop-spying.jpg

slide-8
SLIDE 8

https://transparency.wikimedia.org

slide-9
SLIDE 9

Privacy

slide-10
SLIDE 10

https://foundation.wikimedia.org/wiki/Privacy_policy

slide-11
SLIDE 11
  • Read or edit without account.

https://foundation.wikimedia.org/wiki/Privacy_policy

slide-12
SLIDE 12
  • Read or edit without account.
  • Register account without name, email or any other info.

https://foundation.wikimedia.org/wiki/Privacy_policy

slide-13
SLIDE 13
  • Read or edit without account.
  • Register account without name, email or any other info.
  • Never selling/sharing your info with third parties.

https://foundation.wikimedia.org/wiki/Privacy_policy

slide-14
SLIDE 14
  • Read or edit without account.
  • Register account without name, email or any other info.
  • Never selling/sharing your info with third parties.
  • Retaining your info for shortest time possible.

https://foundation.wikimedia.org/wiki/Privacy_policy

slide-15
SLIDE 15

Usage Data

slide-16
SLIDE 16

Usage Data

500M web requests PER HOUR

slide-17
SLIDE 17

Usage Data

500M web requests PER HOUR

2000 events PER SECOND

slide-18
SLIDE 18

https://stats.wikimedia.org/v2/#/all-projects/reading/legacy-page-views

slide-19
SLIDE 19

https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os

slide-20
SLIDE 20

username mforns ip_address 31.214.189.167 user_agent Mozilla/5.0 (X11; Linux ... session_id 8c878625792be023 edit_count 4257 ui_skin minerva

slide-21
SLIDE 21

username mforns ip_address 31.214.189.167 user_agent Mozilla/5.0 (X11; Linux ... session_id 8c878625792be023 edit_count 4257 ui_skin minerva

Data Retention Guidelines

https://meta.wikimedia.org/wiki/Data_retention_guidelines

“After at most 90 days, it will be deleted, aggregated, or de-identified.”

slide-22
SLIDE 22

Deleting Data

slide-23
SLIDE 23

Deleting Data

Are you sure?

Cancel Delete

slide-24
SLIDE 24

Deleting Data

slide-25
SLIDE 25
  • -dry-run: undef -> execute
  • -tables-to-delete: undef -> all

  • -execute: undef -> dry-run
  • -tables-to-delete: undef -> none, * -> all

slide-26
SLIDE 26

Executing tests… Tests passed. Starting DRY-RUN. Checking partitions to delete… Partitions that would be deleted by execution:

  • year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
  • year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
  • year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
  • year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
  • year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
  • year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
  • year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
  • year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
  • year=2019, month=1, day=1, hour=2, wiki=de.wikibooks

DRY-RUN finished.

slide-27
SLIDE 27
  • -database=event
  • -tables=menuClicks
  • -wikis=en.wikipedia
  • -older-than=90
  • -skip-trash=true
  • -execute=<checksum>
slide-28
SLIDE 28
  • -database=event
  • -tables=menuClicks
  • -wikis=en.wikipedia
  • -older-than=90
  • -skip-trash=true

Executing tests… Tests passed. Starting DRY-RUN. Checking partitions to delete… Partitions that would be deleted by execution:

  • year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
  • year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
  • year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
  • year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
  • year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
  • year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
  • year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
  • year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
  • year=2019, month=1, day=1, hour=2, wiki=de.wikibooks

DRY-RUN finished. Parameter checksum: 57ca7987d987e9e98a6c79

  • -execute=<checksum>
slide-29
SLIDE 29
  • -database=event
  • -tables=menuClicks
  • -wikis=en.wikipedia
  • -older-than=90
  • -skip-trash=true

Executing tests… Tests passed. Starting DRY-RUN. Checking partitions to delete… Partitions that would be deleted by execution:

  • year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
  • year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
  • year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
  • year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
  • year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
  • year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
  • year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
  • year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
  • year=2019, month=1, day=1, hour=2, wiki=de.wikibooks

DRY-RUN finished. Parameter checksum: 57ca7987d987e9e98a6c79

  • -execute=<checksum>

#1 Dry-run #2 Execute
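
The two-step flow on these slides (a dry run that prints a parameter checksum, then an execution that must repeat that checksum) could be sketched roughly as follows. This is an illustrative Python sketch, not the actual Wikimedia script: the function names, the SHA-256 based checksum, and the sample partitions are assumptions made for the example.

```python
import hashlib
from datetime import date, timedelta


def parameter_checksum(params):
    """Hash the sorted parameters so any change yields a different checksum."""
    canonical = "&".join(f"{key}={params[key]}" for key in sorted(params))
    # Truncation length is arbitrary here (assumption), chosen for readability.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:22]


def partitions_to_delete(partitions, older_than_days, today):
    """Select partitions dated more than `older_than_days` before `today`."""
    cutoff = today - timedelta(days=older_than_days)
    return [p for p in partitions if p["date"] < cutoff]


def run(params, partitions, today, execute=None):
    """Dry-run by default; delete only if `execute` matches the checksum."""
    doomed = partitions_to_delete(partitions, params["older_than"], today)
    checksum = parameter_checksum(params)
    if execute is None:
        # Safe default: report what would be deleted, touch nothing.
        return {"mode": "dry-run", "would_delete": doomed, "checksum": checksum}
    if execute != checksum:
        raise ValueError("checksum mismatch: dry-run these exact parameters first")
    return {"mode": "execute", "deleted": doomed}


# Parameters mirror the flags on the slide; the partitions are made up.
params = {
    "database": "event",
    "tables": "menuClicks",
    "wikis": "en.wikipedia",
    "older_than": 90,
    "skip_trash": True,
}
partitions = [
    {"date": date(2019, 1, 1), "wiki": "en.wikipedia"},
    {"date": date(2019, 5, 1), "wiki": "en.wikipedia"},
]
today = date(2019, 6, 1)

preview = run(params, partitions, today)                       # step 1: dry-run
result = run(params, partitions, today, preview["checksum"])   # step 2: execute
```

Because the checksum is derived from the parameters, changing any flag between the dry run and the execution invalidates it, so the operator cannot delete something they have not previewed.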

slide-30
SLIDE 30

Sanitizing Data

slide-31
SLIDE 31

Sanitizing Data

Advanced

slide-32
SLIDE 32

90 days

Unsanitized data

slide-33
SLIDE 33

90 days

Unsanitized data Sanitized data

Kept indefinitely

S

slide-34
SLIDE 34

90 days

Unsanitized data Sanitized data

Kept indefinitely

S

slide-35
SLIDE 35

90 days

Unsanitized data Sanitized data

Kept indefinitely

S S

slide-36
SLIDE 36

date 2019-01-01 ip 31.214.189.167 user_agent Mozilla/5.0 (X11; Linux ... wiki en.wikipedia action click target menu

Unsanitized

slide-37
SLIDE 37

date 2019-01-01 ip 31.214.189.167 user_agent Mozilla/5.0 (X11; Linux ... wiki en.wikipedia action click target menu

Unsanitized Black-list

slide-38
SLIDE 38

date 2019-01-01 ip 31.214.189.167 user_agent Mozilla/5.0 (X11; Linux ... wiki en.wikipedia action click target menu

Unsanitized

date 2019-01-01 ip NULL user_agent NULL wiki en.wikipedia action click target menu

Sanitized Black-list

slide-39
SLIDE 39

date 2019-01-01 ip 31.214.189.167 user_agent Mozilla/5.0 (X11; Linux ... wiki en.wikipedia action click target menu cookie_id 724310

Unsanitized

date 2019-01-01 ip NULL user_agent NULL wiki en.wikipedia action click target menu cookie_id 724310

Sanitized Black-list
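
The weakness of the black-list shown on this slide can be illustrated with a small Python sketch. The function name and the list contents are assumptions for illustration; only the field names and values come from the slide.

```python
# Fields known to be sensitive at the time the list was written (assumption).
BLACKLIST = {"ip", "user_agent"}


def sanitize_blacklist(event, blacklist=BLACKLIST):
    """Null out black-listed fields; every other field passes through."""
    return {key: (None if key in blacklist else value)
            for key, value in event.items()}


event = {
    "date": "2019-01-01",
    "ip": "31.214.189.167",
    "user_agent": "Mozilla/5.0 (X11; Linux ...",
    "wiki": "en.wikipedia",
    "action": "click",
    "target": "menu",
    "cookie_id": 724310,  # field added after the black-list was written
}

sanitized = sanitize_blacklist(event)
# cookie_id is not on the list, so it survives sanitization unchanged.
```

A newly added sensitive field is kept by default until someone remembers to add it to the list, which is the failure mode the white-list approach avoids.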

slide-40
SLIDE 40

date 2019-01-01 ip 31.214.189.167 user_agent Mozilla/5.0 (X11; Linux ... wiki en.wikipedia action click target menu

Unsanitized White-list

slide-41
SLIDE 41

date 2019-01-01 ip 31.214.189.167 user_agent Mozilla/5.0 (X11; Linux ... wiki en.wikipedia action click target menu

Unsanitized

date 2019-01-01 ip NULL user_agent NULL wiki en.wikipedia action click target menu

Sanitized White-list

slide-42
SLIDE 42

date 2019-01-01 ip 31.214.189.167 user_agent Mozilla/5.0 (X11; Linux ... wiki en.wikipedia action click target menu cookie_id 724310

Unsanitized

date 2019-01-01 ip NULL user_agent NULL wiki en.wikipedia action click target menu cookie_id NULL

Sanitized White-list

slide-43
SLIDE 43

date 2019-01-01 ip 31.214.189.167 user_agent Mozilla/5.0 (X11; Linux ... wiki en.wikipedia action click target menu cookie_id 724310

Unsanitized

date 2019-01-01 ip Spain user_agent NULL wiki en.wikipedia action click target menu cookie_id NULL

Sanitized White-list

slide-44
SLIDE 44

date 2019-01-01 ip 31.214.189.167 user_agent Mozilla/5.0 (X11; Linux ... wiki en.wikipedia action click target menu cookie_id 724310

Unsanitized

date 2019-01-01 ip Spain user_agent Linux wiki en.wikipedia action click target menu cookie_id NULL

Sanitized White-list

slide-45
SLIDE 45

date 2019-01-01 ip 31.214.189.167 user_agent Mozilla/5.0 (X11; Linux ... wiki en.wikipedia action click target menu cookie_id 724310

Unsanitized

date 2019-01-01 ip Spain user_agent Linux wiki en.wikipedia action click target menu cookie_id 8d56ab209e10

Sanitized White-list

#
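
The white-list behaviour built up over the preceding slides (unlisted fields become NULL, the IP is reduced to a country, the user agent to an operating system, and the cookie ID to a hash) might look something like the sketch below. The lookup table, the helper names, and the use of a salted HMAC-SHA256 truncated to 12 hex characters are all assumptions; the slides do not say how the real system geolocates, parses user agents, or hashes.

```python
import hashlib
import hmac

# Hypothetical stand-in for a real geolocation database.
IP_TO_COUNTRY = {"31.214.189.167": "Spain"}


def country_of(ip):
    """Reduce an IP address to a country name (None if unknown)."""
    return IP_TO_COUNTRY.get(ip)


def os_family(user_agent):
    """Very rough OS detection; a real system would use a UA parser."""
    for name in ("Android", "Windows", "Linux"):
        if name in user_agent:
            return name
    return None


def salted_hash(value, salt):
    """Keyed hash so equal IDs still match but the raw ID is not stored."""
    digest = hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def make_whitelist(salt):
    """Map each allowed field to the transformation applied to its value."""
    def keep(value):
        return value
    return {
        "date": keep, "wiki": keep, "action": keep, "target": keep,
        "ip": country_of,
        "user_agent": os_family,
        "cookie_id": lambda value: salted_hash(value, salt),
    }


def sanitize_whitelist(event, whitelist):
    """Fields absent from the white-list are nulled by default."""
    return {key: (whitelist[key](value) if key in whitelist else None)
            for key, value in event.items()}


event = {
    "date": "2019-01-01",
    "ip": "31.214.189.167",
    "user_agent": "Mozilla/5.0 (X11; Linux ...",
    "wiki": "en.wikipedia",
    "action": "click",
    "target": "menu",
    "cookie_id": 724310,
    "session_id": "8c878625792be023",  # not white-listed, so it is dropped
}

sanitized = sanitize_whitelist(event, make_whitelist(salt=b"example-salt"))
```

Here a field that nobody has reviewed yet, such as `session_id`, is removed by default, which is the opposite of the black-list behaviour.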

slide-46
SLIDE 46

Privacy Culture

slide-47
SLIDE 47

Unique visitor

slide-48
SLIDE 48

UUID

Unique visitor

slide-49
SLIDE 49

UUID, REQ UUID UUID, REQ

Unique visitor

slide-50
SLIDE 50

SELECT COUNT(DISTINCT uuid) FROM database.table WHERE date = '2019-01-01';

UUID UUID, REQ

Unique visitor

UUID, REQ

slide-51
SLIDE 51

UUID

Unique visitor

UUID, REQ

slide-52
SLIDE 52

LAST ACCESS

Unique visitor

slide-53
SLIDE 53

LAST ACCESS

Unique visitor

LA, REQ LA, REQ

slide-54
SLIDE 54

SELECT COUNT(*) FROM database.table WHERE (la IS NULL OR la < date) AND date = '2019-01-01';

LAST ACCESS

Unique visitor

LA, REQ LA, REQ
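
To see why the last-access query can stand in for the UUID query, here is a small Python sketch that runs both slide queries against an in-memory SQLite table. The table name `requests` and the sample rows are invented for the example; the column names `uuid`, `la` and `date` follow the slides. The `uuid` column is stored here only to compare the two counts; with the last-access approach it would not need to be collected at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (date TEXT, uuid TEXT, la TEXT)")

rows = [
    # (request date, visitor uuid, last-access date sent by the browser)
    ("2019-01-01", "a", None),          # visitor a, first visit ever
    ("2019-01-01", "a", "2019-01-01"),  # visitor a again, same day
    ("2019-01-01", "b", "2018-12-30"),  # visitor b, back after two days
    ("2019-01-01", "b", "2019-01-01"),  # visitor b again, same day
    ("2019-01-01", "b", "2019-01-01"),
]
conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", rows)

# Approach 1: count distinct persistent identifiers.
by_uuid = conn.execute(
    "SELECT COUNT(DISTINCT uuid) FROM requests WHERE date = '2019-01-01'"
).fetchone()[0]

# Approach 2: count only the first request of the day per browser, which is
# the one whose last-access value is missing or older than the request date.
by_last_access = conn.execute(
    "SELECT COUNT(*) FROM requests "
    "WHERE (la IS NULL OR la < date) AND date = '2019-01-01'"
).fetchone()[0]

print(by_uuid, by_last_access)
```

ISO-formatted date strings sort in chronological order, so the plain string comparison `la < date` is enough in this sketch.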

slide-55
SLIDE 55

By: Victor Grigas https://commons.wikimedia.org/wiki/File:Papaul_Tshibamba-4.jpg

slide-56
SLIDE 56

The Vegan Data Diet

slide-57
SLIDE 57

The Vegan Data Diet

  • Guarantee of privacy
  • Less work related to data requests
  • Easier to publicize
slide-58
SLIDE 58

The Vegan Data Diet

  • Guarantee of privacy
  • Less work related to data requests
  • Easier to publicize

  • Extra work
  • Privacy culture needs time
  • Analysts have to compromise

slide-59
SLIDE 59

QUESTIONS?

By: Randall Munroe @ XKCD https://xkcd.com/285/

slide-60
SLIDE 60