Twitter De-Identification, Jonathon Storrick, Jon.Storrick@gmail.com


SLIDE 1

CASOS 1

Center for Computational Analysis of Social and Organizational Systems http://www.casos.cs.cmu.edu/

Twitter De-Identification

Jonathon Storrick

Jon.Storrick@gmail.com

June 2020


SLIDE 2


Why It’s Necessary

After the Cambridge Analytica scandal, there is MASSIVE concern for how data is stored.

EU passes General Data Protection Regulation (GDPR).

Personally Identifiable Information (PII).

The more information we gather about a given individual, the more likely it is we’ll be able to reverse engineer their real identity.

That can cause issues with grants and data transfer, and may limit the amount of data you can collect for a given subject.

Because Twitter said it is.


The Solution

We developed the Twitter De-Identifier, a standalone tool for processing Twitter data.

Reduces PII, handles large datasets, and removes only superfluous information.

For information on how to access the De-Identifier, please email Dr. Carley.

SLIDE 3


De-Identifier: the Challenges

While a typical tweet is limited to 280 characters (mostly), an individual tweet has 10-20x as much info associated with it. Each tweet would need to be carefully handled such that no user could take a De-ID tweet and find its source.

A record of the anonymization needs to be kept, in case project heads absolutely need it, and to keep anonymizations consistent across multiple datasets.

Speed. A Twitter dataset can contain millions of tweets.

Not removing data that is of analytic use.


The Approach

Direct Identifiers: Tweet IDs, Tweet Usernames, Mentions

Indirect Identifiers: User Profiles, Locations, Dates

Masking: Should something need to be anonymized, its relevant portion is replaced by pseudo-random text.

Recognizing data that doesn’t need to be anonymized.

– News reports, verified individuals, etc.
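
The masking idea above can be illustrated with a short sketch. It is not the De-Identifier's actual code: the function name `mask_mentions`, the `key_db` dictionary, the `user_` prefix, and the use of Python's `secrets` module to generate the replacement text are all assumptions made for the example.

```python
import re
import secrets

# Assumed pattern: "@" not preceded by a word character (skips e-mail
# addresses), followed by a 1-15 character username.
MENTION = re.compile(r"(?<!\w)@(\w{1,15})\b")

def mask_mentions(text, key_db):
    """Replace each @mention with pseudo-random text.

    key_db is a hypothetical in-memory record of earlier replacements,
    so the same username always receives the same mask.
    """
    def replace(match):
        name = match.group(1).lower()  # usernames are case-insensitive
        if name not in key_db:
            key_db[name] = "user_" + secrets.token_hex(4)
        return "@" + key_db[name]
    return MENTION.sub(replace, text)

key_db = {}
first = mask_mentions("Thanks @Alice and @bob!", key_db)
second = mask_mentions("@alice replied", key_db)
```

After these two calls, `key_db` holds one entry per distinct username, and the mask for `@Alice` in the first tweet is the same as the mask for `@alice` in the second.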

SLIDE 4


[Flow diagram of the De-Identifier. Labels recoverable from the figure: Tweet, Tweet Processor, Anonymized Tweet; ID Anonymizer (Tweet ID, Retweet ID, Tweet Key DB); User Anonymizer (Tweeter, Mentioned Users, Retweet Sources, Tweeter Key DB); Whitelist Check, Verified Check, News Check, US Govt Check, with outcomes Allowed and Needs Anonymized; IP Addresses, Pronouns, Websites.]
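
The four checks named in the diagram (Whitelist, Verified, News, US Govt) could be chained roughly as follows. This is a sketch under assumptions, not the tool's implementation: the function name `needs_anonymizing` and the three account sets are hypothetical, and the user object is assumed to follow the Twitter API v1.1 layout with `screen_name` and `verified` fields.

```python
def needs_anonymizing(user, whitelist, news_accounts, govt_accounts):
    """Return False if any check allows the account to stay identifiable.

    The three sets are hypothetical, operator-supplied collections of
    lowercase screen names.
    """
    name = user.get("screen_name", "").lower()
    checks = (
        name in whitelist,           # Whitelist Check
        bool(user.get("verified")),  # Verified Check
        name in news_accounts,       # News Check
        name in govt_accounts,       # US Govt Check
    )
    return not any(checks)

private_user = {"screen_name": "someone", "verified": False}
news_user = {"screen_name": "CNN", "verified": False}

mask_private = needs_anonymizing(private_user, set(), {"cnn"}, set())
mask_news = needs_anonymizing(news_user, set(), {"cnn"}, set())
```

Here the private account is flagged for anonymization, while the account listed in the assumed news set is allowed through unchanged.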


Operation Speed

The primary bottleneck is read/write speed. Twitter data is far too large to fit entirely in memory.

Even then, it can process 20k tweets per minute with a typical non-SSD hard drive.
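
One common way to handle data that does not fit in memory is to stream it one tweet at a time. The sketch below shows that pattern; it assumes the input is stored as one JSON tweet per line, which the slides do not state, and the names `deidentify_file` and `replace_user_name` are hypothetical.

```python
import json
import os
import tempfile

def deidentify_file(src_path, dst_path, transform):
    """Stream tweets line by line so only one tweet is in memory at a time."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            tweet = transform(json.loads(line))
            dst.write(json.dumps(tweet, ensure_ascii=False) + "\n")
            count += 1
    return count

def replace_user_name(tweet):
    # Placeholder transform; a real one would consult the key tables.
    tweet["user"]["screen_name"] = "00001"
    return tweet

# Small demonstration on a temporary two-tweet file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "tweets.jsonl")
    dst = os.path.join(tmp, "tweets_deid.jsonl")
    with open(src, "w", encoding="utf-8") as f:
        f.write('{"id": 1, "user": {"screen_name": "alice"}}\n')
        f.write('{"id": 2, "user": {"screen_name": "alice"}}\n')
    processed = deidentify_file(src, dst, replace_user_name)
    with open(dst, encoding="utf-8") as f:
        output = [json.loads(line) for line in f]
```

Because each tweet is read, transformed, and written before the next one is loaded, memory use stays roughly constant and throughput is bounded mainly by disk read/write speed.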

SLIDE 5


Demo


Summary of Features

It must be capable of importing a tweet in JSON format, and exporting a de-identified tweet in the exact same format.

It must remove as much personally identifiable information as possible, without removing information important to analysis.

Users must have options in what gets anonymized. If they want to leave certain users or agencies un-anonymized, they should be able to.

De-identified IDs should be carried throughout the process. If a tweeter is "00001" in one place, he should be "00001" in every other place.

A lookup table for tweets and users should be output to allow for looking into specific agents or to keep de-identified IDs the same across multiple runs.
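
A lookup table that survives across runs might look like the sketch below. The class name `KeyTable`, the JSON file format, and the sequential five-digit numbering are assumptions chosen to match the "00001" example above, not a description of the tool's actual storage.

```python
import json
import os
import tempfile

class KeyTable:
    """Maps real identifiers (strings) to zero-padded de-identified IDs."""

    def __init__(self, path):
        self.path = path
        self.table = {}
        if os.path.exists(path):  # reuse the table from an earlier run
            with open(path, encoding="utf-8") as f:
                self.table = json.load(f)

    def lookup(self, real_id):
        if real_id not in self.table:
            self.table[real_id] = f"{len(self.table) + 1:05d}"
        return self.table[real_id]

    def save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.table, f, indent=2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "user_keys.json")

    run1 = KeyTable(path)
    a1 = run1.lookup("alice")
    b1 = run1.lookup("bob")
    run1.save()

    run2 = KeyTable(path)  # a later run reloads the saved table
    a2 = run2.lookup("alice")
    c2 = run2.lookup("carol")
```

In this demonstration `alice` keeps the same ID in the second run, and the new user `carol` receives the next unused one. Since the saved table can reverse the de-identification, it needs to be stored at least as carefully as the original data.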