[PPT] - MULTILINGUAL AUTOMATED TEXT ANONYMIZATION Francisco Dias PowerPoint Presentation

SLIDE 1

MULTILINGUAL AUTOMATED TEXT ANONYMIZATION

Francisco Dias

francisco.m.c.dias@tecnico.ulisboa.pt

03/06/2016 Instituto Superior Técnico Slide 1 of 52

SLIDE 2

INTRODUCTION

INTRODUCTION;
RELATED WORK;
ARCHITECTURE;
ANONYMIZATION METHODS;
EVALUATION;
INTEGRATING OUR SYSTEM;
CONCLUSION.

SLIDE 3


INTRODUCTION

ANONYMIZATION

From the Ancient Greek anónumos (“without name”);
It suppresses names and sensitive information;

TEXT

It processes data in the form of text;
A text contains unstructured data;

AUTOMATED

It runs without human intervention;

MULTILINGUAL

It processes texts written in different languages.

SLIDE 4

MOTIVATION

Information sharing in text-form is important in some areas;

(clinical and scientific research, decision making, among others)

Texts may contain private information, protected by law;
In order to share information in text-form, all sensitive information should be removed.

Manual redaction is a hard and time-consuming task.

An automated anonymization system could help in this task.


SLIDE 5

CHALLENGE

To implement a multilingual anonymization system:

→ STRING NLP Chain;
→ Unbabel Translation Pipeline;

Support 4 languages: English, German, Portuguese, Spanish;
Evaluate the anonymization system:

→ does it remove all sensitive information?
→ does it replace the same entities with the same label?
→ do the results look natural to a human reader?

SLIDE 6


RELATED WORK

SLIDE 7

RELATED WORK

Most previous work is based on NER techniques;
Previous work was evaluated on the detection of entities in the text;
i2b2 launched two de-identification challenges in the past: 2006 and 2014.

SLIDE 8

RELATED WORK

MITRE, Wellner et al., 2006
Model-based and Pattern-matching techniques;
Best performance on i2b2 2006 challenge;


SLIDE 9

RELATED WORK

Szarvas et al. System, 2006
Model-based classifiers in parallel and a voting module;
Post-processing iteration in order to detect more candidates;


SLIDE 10

RELATED WORK

Aramaki et al. System, 2006
A CRF* classifier detects candidate sensitive information;
Label-consistency post-processing;

* CRF: Conditional Random Fields

SLIDE 11

RELATED WORK

HIDE, Gardner et al., 2008
A CRF classifier detects candidate sensitive information;
Uses coreferences in order to detect more candidates;

SLIDE 12

RELATED WORK

“Nottingham System”, Yang & Garibaldi, 2014
Model-based (CRF) and Pattern-matching techniques;
It uses coreferences in order to detect more candidates;
Best performance on i2b2 2014 challenge;


SLIDE 13


ARCHITECTURE

SLIDE 14

ARCHITECTURE

Pipeline with 5 modules;
Based on NER techniques;
Post-processing and coreference modules;

Pre-processing → NER → Second-pass Detection → Coreference Resolution → Anonymization

SLIDE 15

ARCHITECTURE

The NER module detects sensitive information contained in the text;
It is composed of several parallel components;

Pattern-matching, Main NER Classifier, Parallel NER Classifier 1, Parallel NER Classifier 2 → Voting
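
A minimal sketch of how a voting step could combine the token labels proposed by the parallel components. The slides do not state the voting rule, so the majority vote with the main classifier as tie-breaker is an assumption, and the function name and label strings are hypothetical.

```python
from collections import Counter

def vote(main_labels, *other_labels):
    """Combine per-token labels from several NER components.

    Assumed rule (not stated in the slides): majority vote per token,
    keeping the main classifier's label whenever it is among the winners.
    """
    combined = []
    for position, main in enumerate(main_labels):
        votes = Counter([main] + [labels[position] for labels in other_labels])
        best_count = max(votes.values())
        winners = [label for label, count in votes.items() if count == best_count]
        combined.append(main if main in winners else winners[0])
    return combined

# Tokens: "Francisco", "visited", "Lisbon"
main = ["PERSON", "O", "O"]
parallel_1 = ["PERSON", "O", "LOCATION"]
parallel_2 = ["O", "O", "LOCATION"]
print(vote(main, parallel_1, parallel_2))  # ['PERSON', 'O', 'LOCATION']
```

The main classifier misses "Lisbon" here, and the two parallel classifiers outvote it, which is the recall gain a parallel set-up is meant to provide.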

SLIDE 16

ARCHITECTURE

SECOND-PASS DETECTION

Post-processing step; corrections over NER results;
It applies Short-forms and Label-Consistency;
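
A minimal sketch of what a second pass of this kind could look like. The exact rules are not given in the slides; here a capitalized token that was labeled somewhere in the document receives the same label wherever it is still unlabeled, which covers both exact repetitions (label consistency) and one-word short forms of longer names. The function name and the "O" label for unlabeled tokens are assumptions.

```python
def second_pass(tokens, labels):
    """Propagate labels found by NER to unlabeled repetitions of the same token.

    Only capitalized tokens are propagated, so that function words inside
    multi-word entities (e.g. "de") are not spread over the whole document.
    """
    known = {}
    for token, label in zip(tokens, labels):
        if label != "O" and token[:1].isupper():
            known.setdefault(token, label)  # the first label seen wins
    return [known.get(token, "O") if label == "O" else label
            for token, label in zip(tokens, labels)]

tokens = ["João", "Silva", "arrived", ".", "Silva", "left", "."]
labels = ["PERSON", "PERSON", "O", "O", "O", "O", "O"]
print(second_pass(tokens, labels))
# ['PERSON', 'PERSON', 'O', 'O', 'PERSON', 'O', 'O']
```

Because the propagation ignores context, a token that is an entity in one sentence and an ordinary word in another would also be labeled, which is one way such a step can add false positives.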

COREFERENCE RESOLUTION

Groups named entities into mentions;
Each mention refers to the same extra-linguistic object;

ANONYMIZATION MODULE

Implements anonymization methods;
Returns an anonymized text and a table of solutions.


SLIDE 17


ANONYMIZATION METHODS

SLIDE 18

ANONYMIZATION METHODS

The methods obfuscate original entities in text using replacement tags or entities;
We implemented 4 anonymization methods:

Suppression: Lisbon → *****
Tagging: Lisbon → [LOCATION]
Random Substitution: Lisbon → Cairo
Generalization: Lisbon → City
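
A minimal sketch of the two simplest methods, suppression and tagging, assuming the entities have already been detected and grouped by coreference. The tuple layout, the function name and the numbered tag format are assumptions made for illustration; the returned table plays the role of a table of solutions that maps each replacement back to the original strings.

```python
def anonymize(text, entities, method="tagging"):
    """Replace detected entities in `text`.

    `entities` is a list of (start, end, entity_class, entity_id) tuples;
    mentions sharing an entity_id are coreferent and receive the same tag.
    Returns the anonymized text and a table {replacement: [originals]}.
    """
    table = {}
    tag_numbers = {}  # (class, entity_id) -> number of the entity within its class
    counters = {}     # class -> how many distinct entities were seen so far
    pieces, cursor = [], 0
    for start, end, entity_class, entity_id in sorted(entities):
        if method == "suppression":
            replacement = "*****"
        else:
            key = (entity_class, entity_id)
            if key not in tag_numbers:
                counters[entity_class] = counters.get(entity_class, 0) + 1
                tag_numbers[key] = counters[entity_class]
            replacement = f"[{entity_class}{tag_numbers[key]}]"
        table.setdefault(replacement, []).append(text[start:end])
        pieces.append(text[cursor:start])
        pieces.append(replacement)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), table

text = "Lisbon is sunny. Ana loves Lisbon."
first = text.find("Lisbon")
second = text.find("Lisbon", first + 1)
ana = text.find("Ana")
entities = [(first, first + 6, "LOCATION", "e1"),
            (ana, ana + 3, "PERSON", "e2"),
            (second, second + 6, "LOCATION", "e1")]
print(anonymize(text, entities, "tagging")[0])
# [LOCATION1] is sunny. [PERSON1] loves [LOCATION1].
print(anonymize(text, entities, "suppression")[0])
# ***** is sunny. ***** loves *****.
```

With tagging, both mentions of Lisbon keep the same label, so the reader can still tell they refer to the same place; with suppression that link is lost.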

SLIDE 19

RANDOM SUBSTITUTION

Random substitution replaces an entity by another random entity from the same class and with the same morphosyntactic features;
Morphosyntactic features are determined by the headword;

Aeroporto (masc, sing) → Recinto (masc, sing)
Frankfurts (masc, sing, genitiv) → Wegs (masc, sing, genitiv)

Random entities are looked up from a default list of entities;

Language  Class     Number    Gender     Case        Term
PT        location  singular  masculine  -           recinto
ES        location  singular  feminine   -           arena
EN        location  singular  -          -           venue
DE        location  singular  neuter     nominative  Wahrzeichen
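
A minimal sketch of the lookup, using the four rows of the default list shown above. How the real system stores its lists and picks among several candidates is not described in the slides; the tuple layout, the function name and the use of `None` for features a language does not mark are assumptions.

```python
import random

# (language, class, number, gender, case, term); None marks a feature that
# the default list leaves empty. Rows are copied from the list above.
DEFAULT_ENTITIES = [
    ("PT", "location", "singular", "masculine", None, "recinto"),
    ("ES", "location", "singular", "feminine", None, "arena"),
    ("EN", "location", "singular", None, None, "venue"),
    ("DE", "location", "singular", "neuter", "nominative", "Wahrzeichen"),
]

def random_substitute(language, entity_class, number, gender=None, case=None, rng=random):
    """Pick a random term with the same class and morphosyntactic features.

    Returns None when the default list has no matching entry.
    """
    wanted = (language, entity_class, number, gender, case)
    candidates = [row[5] for row in DEFAULT_ENTITIES if row[:5] == wanted]
    return rng.choice(candidates) if candidates else None

print(random_substitute("PT", "location", "singular", "masculine"))  # recinto
print(random_substitute("DE", "location", "singular", "feminine", "nominative"))  # None
```

Matching on gender, number and case keeps the surrounding articles and adjectives grammatical after the replacement, which is the point of determining the features from the headword.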

SLIDE 20

GENERALIZATION

Generalization is any method of replacing an entity by another that mentions an item of the same type but in a more general way;
This method accesses a Knowledge Base in order to retrieve the superclasses of a given entity.

Example: London, Berlin, Lisbon, Madrid → City
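
A minimal sketch of a superclass lookup over a toy knowledge base. The four city entries come from the slide's example; the "City → Location" link, the dictionary representation and the function name are hypothetical, and a real knowledge base would be queried rather than held in memory.

```python
# Toy knowledge base: entity or class -> direct superclass (illustrative data).
SUPERCLASS = {
    "London": "City",
    "Berlin": "City",
    "Lisbon": "City",
    "Madrid": "City",
    "City": "Location",  # assumed link, not taken from the slides
}

def generalize(entity, levels=1):
    """Walk `levels` steps up the hierarchy; return None if the KB has no answer."""
    current = entity
    for _ in range(levels):
        current = SUPERCLASS.get(current)
        if current is None:
            return None
    return current

print(generalize("Lisbon"))      # City
print(generalize("Lisbon", 2))   # Location
print(generalize("Cairo"))       # None
```

The last call shows the limitation of the method: an entity that is missing from the knowledge base cannot be generalized, so another method has to be used as a fallback.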

SLIDE 21


EVALUATION

SLIDE 22

EVALUATION

1) Detection of sensitive information (named entities)
→ does it remove all sensitive information?
2) Performance of the coreference between entities
→ does it replace the same entities with the same label?
3) Adequacy of the replacements
→ do the results look natural to a human reader?

Evaluation in previous studies on anonymization focused on:
detection of entities in a text (we also evaluate points 2 and 3);
clinical report text style (we target various text styles);

SLIDE 23

DATASETS

We target different text domains and languages.
We use corpora divided into documents from 2 different sources, with different text domains for each language:

English: CoNLL 2003 + DCEP
German: CoNLL 2003 + DCEP
Portuguese: Segundo HAREM + DCEP
Spanish: CoNLL 2002 + DCEP

DCEP reports were manually annotated for named entities;
All datasets were manually annotated for coreference between entities;

SLIDE 24

DETECTION OF SENSITIVE INFORMATION

Intrinsic evaluation of the performance of NER: F1-score (also recall);
3 classes of entities: Location, Organization and Person;
5 configurations:
- Baseline;
- Baseline + Pattern-matching;
- Baseline + Second-pass Detection;
- Baseline + Parallel NER classifier;
- Baseline + All previous configurations;
Results are tested for statistically significant differences from the baseline.
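
A minimal sketch of how precision, recall and F1-score can be computed for entity detection. Counting an entity as correct only when its span and class both match the annotation exactly is an assumption, as are the function name and the tuple layout; the slides do not say which matching criterion was used.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores over sets of (start, end, entity_class) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = {(0, 1, "PERSON"), (5, 6, "LOCATION"), (9, 10, "ORGANIZATION")}
predicted = {(0, 1, "PERSON"), (5, 6, "ORGANIZATION")}
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.5 0.33 0.4
```

For anonymization, recall matters most, because every missed entity is a piece of sensitive information left in the text; this is why recall is reported alongside the F1-score.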


SLIDE 25

DETECTION OF SENSITIVE INFORMATION

Baseline: performance depends on the text domain;
Gazetteers: significantly* improve recall;
Second-pass: significantly* improves F1-score (some datasets); also adds false positives;
Parallel NER: improves the performance (same training text domain); not significantly* when compared with Second-pass in CoNLL;
All modules: improve F1-score only on DCEP;

* p < 0.01, compared with baseline.

SLIDE 26

DETECTION OF SENSITIVE INFORMATION


SLIDE 27

DETECTION OF SENSITIVE INFORMATION


SLIDE 28

COMPARING WITH I2B2 RESULTS

i2b2 2006: 5th place, against 6 other systems
i2b2 2014: 5th place, against 10 other systems

SLIDE 29

COREFERENCE RESOLUTION

Baseline: no coreference;
Metrics: B-Cubed Score;
Results depend on the language and text domain;
Performance of coreference resolution is satisfactory:
Precision close to 1.0;
Recall much higher than the baseline.
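
A minimal sketch of the B-Cubed score, which averages a per-mention precision and recall over all mentions. It assumes that the gold and system clusterings cover exactly the same mentions; the function name and the mention identifiers are hypothetical.

```python
def b_cubed(gold_clusters, system_clusters):
    """Return (precision, recall) averaged over all mentions."""
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    system_of = {m: frozenset(c) for c in system_clusters for m in c}
    precision = recall = 0.0
    for mention, gold_cluster in gold_of.items():
        system_cluster = system_of[mention]
        overlap = len(gold_cluster & system_cluster)
        precision += overlap / len(system_cluster)
        recall += overlap / len(gold_cluster)
    return precision / len(gold_of), recall / len(gold_of)

gold = [{"m1", "m2", "m3"}, {"m4"}]
no_coreference = [{"m1"}, {"m2"}, {"m3"}, {"m4"}]   # baseline: every mention alone
system = [{"m1", "m2"}, {"m3"}, {"m4"}]

print(b_cubed(gold, no_coreference))  # precision 1.0, recall 0.5
print(b_cubed(gold, system))          # precision 1.0, recall about 0.67
```

The example mirrors the pattern reported above: a no-coreference baseline has perfect precision by construction, so the useful comparison is how much recall the coreference module adds without lowering precision.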


SLIDE 30

COREFERENCE RESOLUTION


SLIDE 31

ANONYMIZATION

Metrics: Availability and relevance of a substitution;
Effects of anonymization in the coreference of entities;
The relevance of a substitution within a context was measured using human raters, as a ratio.

SLIDE 32

ANONYMIZATION

ORIGINAL: Among other points, Parliament states its readiness to evaluate proposals for a general correction mechanism based on the principle of Community solidarity.

SUPPRESSION: Among other points, ****** states its readiness to evaluate proposals for a general correction mechanism based on the principle of ****** solidarity.

TAGGING: Among other points, ORGANIZATION1 states its readiness to evaluate proposals for a general correction mechanism based on the principle of ORGANIZATION2 solidarity.

RANDOM SUBSTITUTION: Among other points, Entreprise 1 states its readiness to evaluate proposals for a general correction mechanism based on the principle of Society 1 solidarity.

GENERALIZATION: Among other points, legislature states its readiness to evaluate proposals for a general correction mechanism based on the principle of social group solidarity.

SLIDE 33

ANONYMIZATION

AVAILABILITY
SUPPRESSION: Always available
TAGGING: Always available
RANDOM SUBSTITUTION: Always available
GENERALIZATION: Depends on the recall of the KB

SLIDE 34

ANONYMIZATION

RELEVANCE
SUPPRESSION: Not relevant
TAGGING: Depends on NER performance
RANDOM SUBSTITUTION: Low relevance; possible semantic drifts (EN-CONLL: 0.36; EN-DCEP: 0.16)
GENERALIZATION: Depends on NE Linking and KB; higher than Random Substitution (EN-CONLL: 0.75; EN-DCEP: 0.77)

SLIDE 35

ANONYMIZATION

COREFERENCE
SUPPRESSION: Low performance by humans; lower than the CRR module
TAGGING: Performance of CRR module
RANDOM SUBSTITUTION: Performance of CRR module
GENERALIZATION: Performance of CRR module

SLIDE 36

ANONYMIZATION

REPLACEMENTS
SUPPRESSION: Tags (xxxxxxx)
TAGGING: Tags ([**LOCAL1**])
RANDOM SUBSTITUTION: Real random entities; aims at MS* agreement
GENERALIZATION: Real superclass entities; no MS agreement

* MS = Morphosyntactic

SLIDE 37


INTEGRATING OUR SYSTEM

SLIDE 38

INTEGRATION IN THE L2F STRING CHAIN

We added a new module to the STRING chain that performs anonymization of Portuguese texts;
The STRING chain provides:
Named Entities;
Morphosyntactic features of the entities;
We convert the output of STRING to the internal data format of our system, process coreferences and run an anonymization method over this data.

SLIDE 39

INTEGRATION IN THE L2F STRING CHAIN

STRING supports a wide range of NE classes;
Anonymization methods were created for some of them;

Class            Original            Random                    Generalization
Phone Number     210000001           827843582                 -
Email            teste@teste.com     wusiksw@example.com       -
NIB              PT50 0000 00(...)   PT5068116292115585228351  -
Date (D M Y)     3 de Junho de 2016  2 de Abril de 2045        “certo dia”
Date (YYYY)      em 2016             em 2009                   “em certo ano”
Time Intervals   de 3 a 6 de Junho   de 4 a 21 de Abril        “entre duas datas”
Time References  2 dias depois       32 dias depois            “depois de algum tempo”
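
A minimal sketch of how random replacements for structured classes such as phone numbers and emails could be generated. The slides only show example outputs, so the rules here (same number of digits for a phone number, a random local part at example.com for an email) are assumptions inferred from those examples, and both function names are hypothetical.

```python
import random
import string

def random_digits_like(original, rng):
    """Replace every ASCII digit with a random digit; keep all other characters."""
    return "".join(rng.choice(string.digits) if ch in string.digits else ch
                   for ch in original)

def random_email(rng, length=7):
    """Build a random address on the reserved example.com domain."""
    local_part = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return f"{local_part}@example.com"

rng = random.Random(2016)  # seeded only to make the sketch repeatable
print(random_digits_like("210000001", rng))  # nine random digits
print(random_email(rng))                     # seven random letters @example.com
```

Using example.com avoids producing an address that could belong to a real person. A digit-by-digit replacement does not guarantee a value different from the original or a valid check digit, so a production version would need class-specific validation.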

SLIDE 40

INTEGRATION IN THE UNBABEL PIPELINE

Text is anonymized and sent to human translators;
Original entities are translated by a specialized MT system;
Automated anonymization is not perfect;
The anonymized text can be changed by the MT and humans.


SLIDE 41


CONCLUSION

SLIDE 42

CONCLUSION

We have presented an implementation of a multilingual, automated anonymization system for text documents;
Based on NER classifiers;
Maintains coreference between entities;
The second-pass detection and parallel NER classifiers were shown to significantly improve detection performance.

SLIDE 43

CONCLUSION

Suppression is a simple method but removes relevant semantic information from the text;
Tagging is able to keep some of the information about the type of entity and the coreferential integrity of the mentions;
Random substitution results in a natural text but introduces some semantic drift;
Generalization results in a natural text but is limited by the recall of the KB and NE linking.

SLIDE 44

FUTURE WORK

Improve:
Gazetteers and KBs;
Relevance of substitutions using Named Entity Linking;
Intelligent generalization method;

Test:
Larger datasets for evaluation;
Information content approach (instead of NER);
Dependency parser in order to identify and anonymize also indirect identifiers.

SLIDE 45

CONTRIBUTIONS

Multilingual text anonymization system;
Web-based annotation platform;
Gold-standard NER corpora composed of DCEP reports;
The integration of this system in the STRING chain;
The deployment of this system on the Unbabel servers;
A paper at the WCCI 2016 International Conference;
All data is publicly available at https://www.l2f.inesc-id.pt/~fdias/mscthesis/

SLIDE 46

MULTILINGUAL AUTOMATED TEXT ANONYMIZATION


SLIDE 47

O João Silva, de 25 anos, foi ao Porto em 3 de Junho de 201 num carro Mercedes de matrícula 12-13-KL com dispositivo de localização da Toshiba de endereço 128.0.0.1. Ele pretendia chegar à Escola Municipal às 14 horas. A viatura foi roubada. O João, ligou para o número do Francisco, 962952857, e localizaram a viatura depois de 6 horas. Ele comprou o localizador no OLX, transferindo para o NIB PT50123443211234567890172 uma quantia de 230 euros.

SLIDE 48

O -----, -----, foi ao ----- ----- num carro Mercedes de matrícula ----- com dispositivo de localização da ----- de endereço -----. Ele pretendia chegar à ----- -----. A viatura foi roubada. O -----, ligou para o número do -----, -----, e localizaram a viatura -----. Ele comprou o localizador no OLX, transferindo para o NIB ----- uma quantia -----.

SLIDE 49

O [PESSOA2], [DATA4], foi ao [LOCAL2] [DATA3] num carro Mercedes de matrícula [MATRICULA1] com dispositivo de localização da [ORGANIZACAO1] de endereço [ENDERECO IP1]. Ele pretendia chegar à [LOCAL1] [DATA2]. A viatura foi roubada. O [PESSOA2], ligou para o número do [PESSOA1], [PHONE1], e localizaram a viatura [DATA1]. Ele comprou o localizador no OLX, transferindo para o NIB [NIB1] uma quantia [VALOR1].

SLIDE 50

O [Miguel1], [data1], foi ao [Viseu] [em 14/9/2018] num carro Mercedes de matrícula [TW-23-KQ] com dispositivo de localização da [Empresa] de endereço [403.198.31.155]. Ele pretendia chegar à [Portalegre] [data]. A viatura foi roubada. O [Miguel1], ligou para o número do [Miguel], [899030283], e localizaram a viatura [depoiJunho de 2043 hora]. Ele comprou o localizador no OLX, transferindo para o NIB [PT5061590168374705433994] uma quantia [3441 euros].

SLIDE 51

O [Afonso], [data 2], foi ao [cidade] [data 2] num carro Mercedes de matrícula [matrícula] com dispositivo de localização da [negócio] de endereço [endereço de IP]. Ele pretendia chegar à [instituição educacional] [data 1]. A viatura foi roubada. O [Afonso], ligou para o número do [Rodrigo], [número de telefone], e localizaram a viatura [data]. Ele comprou o localizador no OLX, transferindo para o NIB [IBAN] uma quantia [quantia em dinheiro].
