Collecting User's Data in a Socially-Responsible Manner



slide-1
SLIDE 1

“Collecting User's Data in a Socially-Responsible Manner.”

Photograph: Daniel Beltra/Greenpeace

Konark Modi

@konarkmodi

Josep M. Pujol

@solso

slide-2
SLIDE 2

About Cliqz

  • 80+ - Team size
  • 500,000 - DAU
  • 3 Million+ - Downloads (Germany only)
  • 1 billion+ - Indexed pages (We do not believe in indexing the web.)
  • 5 TB - In-Memory indexed (Based on open source and in-house built NoSQL stores.)
  • 10x more coverage for anti-phishing protection, as compared to other players like Safe Browsing by Google.

  • Upcoming products like Anti-tracking etc.
slide-3
SLIDE 3

About Cliqz

slide-4
SLIDE 4

We Love Data …

slide-5
SLIDE 5

Let's step back a bit in time, to get the context.

slide-6
SLIDE 6

Source : http://thehumanfaceofbigdata.com

“ Data is the new oil ”

  • Clive Humby (2006)
slide-7
SLIDE 7

Data is still being collected without enough controls & measures.

Is privacy the new Green ?

slide-8
SLIDE 8

The biggest by-product of which is SESSIONS.

Is privacy the new Green ?

slide-9
SLIDE 9

How ?

Alice Alice Bob

MAP/REDUCE :D

Server-Side Alice Alice Bob Client-Side

Uncharted water

slide-10
SLIDE 10

Instead …

Uncharted water

Server-Side Alice Alice Bob Client-Side Alice Alice Bob

MAP/REDUCE :D MAP/REDUCE :D MAP/REDUCE :D

slide-11
SLIDE 11

Who is responsible ?

Is there a conspiracy theory or an evil plan ?

slide-12
SLIDE 12

Well, we have a simpler explanation:

It’s the consequence of common development practices, which result in trading users’ data, knowingly or unknowingly!

slide-13
SLIDE 13

Demo

slide-14
SLIDE 14

This looks like a toy example ?

slide-15
SLIDE 15

Which queries are so bad that they force people to redo the same query elsewhere?

Let’s take a more complex case

slide-16
SLIDE 16

Alice apache big data conf search engine 2 search engine 1 Alice apache big data conf

Client-Side

slide-17
SLIDE 17

Alice apache big data conf search engine 2 search engine 1 Alice apache big data conf

Uncharted water

Alice apache big data conf search engine 2 search engine 1 Alice apache big data conf Map-Reduce

Client-Side

Server - Side

slide-18
SLIDE 18

Alice apache big data conf search engine 2 search engine 1 Alice apache big data conf

Uncharted water

Alice apache big data conf search engine 2 search engine 1 Alice apache big data conf Map-Reduce

Client-Side

Server - Side

slide-19
SLIDE 19

Alice apache big data conf search engine 2 search engine 1 Alice apache big data conf

Uncharted water

Alice apache big data conf search engine 2 search engine 1 Alice apache big data conf Map-Reduce Alice apache big data conf search engine 2 search engine 1 Alice apache big data conf Map-Reduce

Client-Side

Server - Side

slide-20
SLIDE 20

Alice apache big data conf search engine 2 search engine 1 Alice apache big data conf

Uncharted water

Alice apache big data conf search engine 2 search engine 1 Alice apache big data conf Map-Reduce Alice apache big data conf search engine 2 search engine 1 Alice apache big data conf Map-Reduce

Client-Side

Server - Side

slide-21
SLIDE 21

As we mentioned before, we believe in data and are not against its collection.

  • Stopping data collection altogether would be foolish and dangerous. This also means stopping the wheels of innovation.
  • Who would benefit the most by supporting the ban on advertisements of tobacco products?

slide-22
SLIDE 22
slide-23
SLIDE 23

“Socially responsible manner” is an analogy: it means ensuring that the events being collected do not suffer from pollutants like explicit IDs and implicit IDs, and that they reach home securely.

slide-24
SLIDE 24

Why does CLIQZ Care ?

slide-25
SLIDE 25

  • German Data Privacy Laws
  • Security breaches
  • When government knocks on your door
slide-26
SLIDE 26

So what do we bring to the table?

slide-27
SLIDE 27

HUMAN WEB

  • We have developed HumanWeb to balance the Right-to-Privacy with the need to build products that improve the web and allow for more openness.
  • Ensuring that data that can be used to infer sessions or linkages to navigation patterns is not collected.
  • Does not create so much data that it could allow identification of individuals.
  • We do not want to know who "YOU" are, what "YOU" searched and when "YOU" searched.
  • Designed so that a "malicious/untrustworthy" actor, or for that matter anyone at Cliqz, who gets access to the raw data flow cannot infer or identify individuals.

slide-28
SLIDE 28

Sample events:

{
  "action": action of the message,
  "ver": version name,
  "type": "humanweb",
  "payload": { },  // the actual data
  "ts": UTC time capped to the day, e.g. 20150909
}

  • Sample event for Page
  • Sample event for Query
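
The envelope above can be sketched in a few lines. This is a minimal illustration, not Cliqz's implementation: the function name `make_event`, the action name `"query"`, the version string and the payload fields are all hypothetical. The one point it demonstrates is that `ts` keeps only the UTC day, so the time of day never leaves the client.

```python
import json
from datetime import datetime, timezone


def make_event(action, payload, version="1.0", now=None):
    """Build a HumanWeb-style envelope (hypothetical helper).

    `ts` is truncated to the UTC day (YYYYMMDD), so hours, minutes
    and seconds are dropped before the event is queued.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "action": action,
        "ver": version,
        "type": "humanweb",
        "payload": payload,
        "ts": now.astimezone(timezone.utc).strftime("%Y%m%d"),
    }


# Example with a fixed clock; the payload shape is an assumption.
event = make_event(
    "query",
    {"q": "apache big data conf"},
    now=datetime(2015, 9, 9, 17, 42, 5, tzinfo=timezone.utc),
)
print(json.dumps(event))
```

Two events created on the same day carry the same `ts`, which removes one of the signals that could otherwise be used to order events into a session.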
slide-29
SLIDE 29

HumanWeb

[ {event1}, {event2}, {event3} ]

Event Queue | Schedule to ensure not sent in batch Final checks Filtering Sanitisation / Masking Secure Channel

Client-side

Local storage | Structural data about webpages

Map-Reduce

Aggregations, Heuristics, Filtering, Hashing

slide-30
SLIDE 30

Privacy breaches on the way home

To achieve total privacy, we must rely on a network of proxies that remove any network-related data, like cookies, IP addresses and headers, so that fingerprinting is impossible.

slide-31
SLIDE 31

SecureChannel : Protection from network fingerprinting

slide-32
SLIDE 32

SecureChannel : What do we encrypt ?

  • The queries from the user (initiated by them upon activity on Cliqz’s instrumented Firefox address bar).
  • All telemetry signals (initiated by Cliqz’s instrumented Firefox).
  • All messages regarding the HumanWeb data collection effort.

Also, before reaching our infrastructure, the encrypted messages are routed through a mesh of proxies.

slide-33
SLIDE 33

SecureChannel : How do we encrypt ?

Life-Cycle of hashes / keys :

  • AES : Hash-keys used with AES are used only one time, even if the user types the same query.
  • Public / Private KeyPair ( Client ) :
  • The keys on the client side are all short-lived; we continuously generate keys on the client side.
  • The public/private key pair of the client (the Extension) is meant to be used only once and then thrown away. The key pairs are regenerated to fill a pool while the browser is idle.
  • Public / Private KeyPair ( Server ) :
  • Only the public part of this key is shared with the extension.
  • The client uses it while encrypting the request. This is a long-lived key, currently changed only in case it is compromised.

Client side : 128-bit symmetric AES encryption, OpenSSL RSA 1024-bit encryption.
EventLogger : 128-bit symmetric AES encryption, OpenSSL RSA 4096-bit encryption.
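
The "use once, refill while idle" life-cycle of the client keys can be sketched as a small pool. This is an assumption-laden illustration: the class `OneTimeKeyPool` and its methods are hypothetical, and random bytes stand in for real RSA key pairs, which would need a cryptography library to generate.

```python
import secrets
from collections import deque


class OneTimeKeyPool:
    """Pool of single-use keys (hypothetical sketch).

    `make_key` is a stand-in: it returns random bytes here, where the
    real extension would generate a public/private key pair.
    """

    def __init__(self, target_size=4, make_key=lambda: secrets.token_bytes(16)):
        self._pool = deque()
        self._target = target_size
        self._make_key = make_key

    def refill(self):
        """Top the pool up to its target size; meant to run while idle."""
        while len(self._pool) < self._target:
            self._pool.append(self._make_key())

    def take(self):
        """Hand out a key exactly once; it is removed from the pool."""
        if not self._pool:
            # Pool exhausted between idle periods: generate on demand.
            self._pool.append(self._make_key())
        return self._pool.popleft()

    def __len__(self):
        return len(self._pool)


pool = OneTimeKeyPool(target_size=3)
pool.refill()
first, second = pool.take(), pool.take()
```

Because `take` removes the key from the pool, no two requests can share a client key, which is the property the bullets above describe.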

slide-34
SLIDE 34

SecureChannel : How do we encrypt ? (Extension)

encryptedRequest(iv:encryptedMsg:encryptedKey)
iv : Initialization Vector
msg = (originalRequest + ExtensionPublicKey)
key = md5(msg)
encryptedMsg = AES.encrypt(msg, key, {mode: CBC, padding: PKCS7, iv: iv})
encryptedKey = sign(EventLoggerPublicKey, key)

Each request to be encrypted has the following components :

  • Message / Request to encrypt : Query or Data.
  • ExtensionPublicKey : Chosen from a pool of public keys for that user on the machine (the key is used only once and then discarded).
  • Initialisation Vector : Derived from a word array of 16 bytes (the AES block size).
  • EventLoggerPublicKey : Our public key, shared with the extension.
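
The key-derivation step can be tried out with the standard library alone. The sketch below is an illustration under assumptions, not the extension's code: the function names are hypothetical, the base64 encoding of the three envelope fields is an assumption (the slides only give the `iv:encryptedMsg:encryptedKey` layout), and the AES and RSA steps are left out because they need a cryptography library. It shows why the AES key changes on every request: `key = md5(msg)` and `msg` includes a one-time public key.

```python
import base64
import hashlib
import os


def derive_request_key(original_request: bytes, extension_public_key: bytes):
    """Return (msg, key, iv) as described on the slide.

    key = md5(msg) is 16 bytes, i.e. a 128-bit AES key.
    iv is 16 random bytes, the AES block size.
    """
    msg = original_request + extension_public_key
    key = hashlib.md5(msg).digest()
    iv = os.urandom(16)
    return msg, key, iv


def pack_envelope(iv: bytes, encrypted_msg: bytes, encrypted_key: bytes) -> str:
    """Join the three fields with ':' (base64 per field is an assumption)."""
    fields = (iv, encrypted_msg, encrypted_key)
    return ":".join(base64.b64encode(f).decode("ascii") for f in fields)


# Same query, two different one-time public keys (placeholder bytes).
_, key_a, iv_a = derive_request_key(b"apache big data conf", b"one-time-pk-A")
_, key_b, iv_b = derive_request_key(b"apache big data conf", b"one-time-pk-B")
print(key_a != key_b)
```

Since the standard base64 alphabet does not contain `:`, the separator stays unambiguous when the server splits the envelope.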
slide-35
SLIDE 35

SecureChannel : Routing ? (Extension)

  • The extension maintains a list of proxies which are healthy / good at that point in time.
  • When sending the request / message, the extension picks the end-point in a round-robin fashion (round-robin for now).
  • To avoid the risk of proxies being malicious with the message, we implement scrambling and splitting of messages into a random ‘n’ parts just before sending the message from the extension.
  • The value of ‘n’ is determined by the extension; we expect ‘n’ to be 1, 2, 4 or 8 for the time being. Also, the value of ‘n’ is not known to the proxies, hence a proxy cannot tell whether it has all the parts.
  • The only way to tamper with a message is to have all the parts needed to decrypt it, but since messages are scrambled, split and sent through different proxies, the messages are safe from the proxies.
  • The Event Logger waits for all the parts of a message; only by combining them at our (secure) Event Logger can the message be decrypted.
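
The split-and-route step above can be sketched as follows. Everything here is a hedged illustration: `split_into_parts` and `route_parts` are hypothetical names, equal-sized byte chunks are an assumption (the slides do not say how a message is cut), and shuffling the chunk order stands in for "scrambling". Only the choice of n from 1, 2, 4 or 8 and the round-robin proxy selection come from the slide.

```python
import itertools
import random


def split_into_parts(message: bytes, n: int):
    """Cut `message` into n consecutive chunks (the last may be shorter)."""
    size = -(-len(message) // n)  # ceiling division
    return [message[i * size:(i + 1) * size] for i in range(n)]


def route_parts(message: bytes, proxy_cycle, n_choices=(1, 2, 4, 8), rng=random):
    """Split into a random n parts, scramble them, assign proxies round-robin.

    `proxy_cycle` is an iterator over the currently healthy proxies,
    e.g. itertools.cycle(proxies), so consecutive parts go to
    consecutive proxies. Returns a list of (proxy, part) pairs.
    """
    n = rng.choice(n_choices)
    parts = split_into_parts(message, n)
    rng.shuffle(parts)  # proxies see the parts in scrambled order
    return [(next(proxy_cycle), part) for part in parts]


proxies = ["proxy-1", "proxy-2", "proxy-3"]  # placeholder names
assignments = route_parts(b"x" * 100, itertools.cycle(proxies),
                          rng=random.Random(0))
```

A proxy only ever receives its own `(proxy, part)` pairs and is never told n, so it cannot know whether what it holds is the whole message.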

slide-36
SLIDE 36

SecureChannel : How do we decrypt ? (Server)

EncryptedRequest = iv:encryptedMsg:encryptedKey
key = unlock(EventLoggerPrivateKey, encryptedKey)
msg = AES.decrypt(encryptedMsg, key, {mode: CBC, padding: PKCS7, iv: iv})
request = msg.data
ExtensionPublicKey = msg.pk (We need it to sign the response)

Important:

  • Because the server receives messages in parts, to get the key and message we rely on combinations.
  • The message itself is scrambled, so even if it is decrypted we need to stitch it together by trying different combinations.
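
Stitching by trying combinations can be sketched like this. The slides do not say how the server recognizes the correct combination, so the check used here is an assumption: since the extension sets `key = md5(msg)`, a candidate ordering is accepted when its MD5 digest equals the recovered key. The function name `stitch` is hypothetical, and the parts are treated as plaintext chunks for simplicity.

```python
import hashlib
from itertools import permutations


def stitch(parts, key: bytes):
    """Try every ordering of `parts`; return the one whose md5 equals `key`.

    Returns None when no ordering matches, e.g. because a part is
    still missing. With at most 8 parts there are at most 40320
    orderings to try.
    """
    for candidate in permutations(parts):
        msg = b"".join(candidate)
        if hashlib.md5(msg).digest() == key:
            return msg
    return None


original = b"apache big data conf|one-time-pk"  # placeholder message
key = hashlib.md5(original).digest()
scrambled = [original[10:], original[:5], original[5:10]]
recovered = stitch(scrambled, key)
```

The same check also tells the server when it has to keep waiting: an incomplete set of parts never produces a matching digest.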

slide-37
SLIDE 37

All talk and no play makes Jack a dull boy!

Demo

slide-38
SLIDE 38

Thank You

http://www.cliqz.com/en

We believe it’s possible, we are actually doing it

photo: projectsecretidentity.org