SLIDE 1

MULTILINGUAL AUTOMATED TEXT ANONYMIZATION

Francisco Dias

francisco.m.c.dias@tecnico.ulisboa.pt

03/06/2016 Instituto Superior Técnico Slide 1 of 52

SLIDE 2

INTRODUCTION

  • INTRODUCTION
  • RELATED WORK
  • ARCHITECTURE
  • ANONYMIZATION METHODS
  • EVALUATION
  • INTEGRATING OUR SYSTEM
  • CONCLUSION

SLIDE 3

INTRODUCTION

ANONYMIZATION

  • From the Ancient Greek anónumos (translation: “without name”);
  • It suppresses names and sensitive information;

TEXT

  • It processes data in the form of text;
  • A text contains unstructured data;

AUTOMATED

  • It runs without human intervention;

MULTILINGUAL

  • It processes texts written in different languages.

SLIDE 4

MOTIVATION

  • Information sharing in text form is important in some areas (clinical and scientific research, decision making, among others);
  • Texts may contain private information, protected by law;
  • In order to share information in text form, all sensitive information should be removed;
  • Manual redaction is a hard and time-consuming task.

An automated anonymization system could help in this task.

SLIDE 5

CHALLENGE

  • To implement a multilingual anonymization system:
      → STRING NLP Chain;
      → Unbabel Translation Pipeline;
  • Support 4 languages: English, German, Portuguese, Spanish;
  • Evaluate the anonymization system:
      → does it remove all sensitive information?
      → does it replace the same entities with the same label?
      → do the results look natural to a human reader?

SLIDE 6

RELATED WORK

SLIDE 7

RELATED WORK

  • Most previous work is based on NER techniques;
  • The evaluation of previous work was based on the detection of entities in the text;
  • i2b2 launched two de-identification challenges in the past: 2006 and 2014.

SLIDE 8

RELATED WORK

  • MITRE, Wellner et al., 2006
  • Model-based and Pattern-matching techniques;
  • Best performance on i2b2 2006 challenge;

SLIDE 9

RELATED WORK

  • Szarvas et al. System, 2006
  • Model-based classifiers in parallel and a voting module;
  • Post-processing iteration in order to detect more candidates;

SLIDE 10

RELATED WORK

  • Aramaki et al. System, 2006
  • A CRF* classifier detects candidates for sensitive information;
  • Label-consistency post-processing;

* CRF: Conditional Random Fields

SLIDE 11

RELATED WORK

  • HIDE, Gardner et al., 2008
  • A CRF classifier detects candidates for sensitive information;
  • Uses coreferences in order to detect more candidates;

SLIDE 12

RELATED WORK

  • “Nottingham System”, Yang & Garibaldi, 2014
  • Model-based (CRF) and Pattern-matching techniques;
  • It uses coreferences in order to detect more candidates;
  • Best performance on i2b2 2014 challenge;

SLIDE 13

ARCHITECTURE

SLIDE 14

ARCHITECTURE

  • Pipeline with 5 modules;
  • Based on NER techniques;
  • Post-processing and coreference modules;

Pipeline modules: Pre-processing, NER, Second-pass Detection, Coreference Resolution, Anonymization.

SLIDE 15

ARCHITECTURE

  • The NER module detects sensitive information contained in the text;
  • It is composed of several parallel components;

Components: Pattern-matching, Main NER Classifier, Parallel NER Classifier 1, Parallel NER Classifier 2, Voting.
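A minimal sketch of how the outputs of the parallel NER components might be combined by voting. The slides do not state the voting rule, so the per-token majority vote below, with ties resolved in favour of the main classifier, is an assumption; the function name and the label strings are hypothetical.

```python
from collections import Counter

def vote(main_labels, *other_labels):
    """Combine per-token labels by majority; ties keep the main classifier's label."""
    combined = []
    for i, main in enumerate(main_labels):
        votes = Counter([main] + [labels[i] for labels in other_labels])
        best_count = max(votes.values())
        winners = {label for label, count in votes.items() if count == best_count}
        # Assumption: when the main classifier is among the tied winners, trust it.
        combined.append(main if main in winners else sorted(winners)[0])
    return combined

tokens = ["Maria", "visited", "Lisbon"]
main = ["PERSON", "O", "O"]
parallel_1 = ["PERSON", "O", "LOCATION"]
parallel_2 = ["O", "O", "LOCATION"]
print(list(zip(tokens, vote(main, parallel_1, parallel_2))))
```

In this toy run the two parallel classifiers outvote the main classifier on "Lisbon", which illustrates how parallel components can raise recall.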

SLIDE 16

ARCHITECTURE

SECOND-PASS DETECTION

  • Post-processing step; corrections over NER results;
  • It applies Short-forms and Label-Consistency;

COREFERENCE RESOLUTION

  • Groups named entities into mentions;
  • Each mention refers to the same extra-linguistic object;

ANONYMIZATION MODULE

  • Implements anonymization methods;
  • Returns an anonymized text and a table of solutions.

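The second-pass detection step described above can be illustrated with a short sketch. It assumes token-level labels with "O" for non-entities, and it treats every token of a detected entity as a possible short form of that entity; both choices are simplifications made for illustration, not the system's actual rules.

```python
def second_pass(tokens, labels):
    """Propagate first-pass labels to unlabelled repeats and short forms."""
    known = {}
    for token, label in zip(tokens, labels):
        if label != "O":
            # Label consistency: remember the first label seen for each token.
            known.setdefault(token, label)
    # Only tokens the first pass left unlabelled are corrected.
    return [known.get(token, label) if label == "O" else label
            for token, label in zip(tokens, labels)]

tokens = ["João", "Silva", "called", "Francisco", ".", "Silva", "left", "."]
first_pass = ["PERSON", "PERSON", "O", "O", "O", "O", "O", "O"]
print(second_pass(tokens, first_pass))
```

The second "Silva" is recovered as a short form of "João Silva". A rule this permissive can also label ordinary words that happen to match an entity token, which is one way a second pass adds false positives.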

SLIDE 17

ANONYMIZATION METHODS

SLIDE 18

ANONYMIZATION METHODS

  • The methods obfuscate original entities in the text using replacement tags or entities;
  • We implemented 4 anonymization methods:
  • Suppression: Lisbon → *****
  • Tagging: Lisbon → [LOCATION]
  • Random Substitution: Lisbon → Cairo
  • Generalization: Lisbon → City
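The four methods listed above can be sketched as interchangeable replacement functions. The entity list, the superclass table and the fallback to a tag when no superclass is known are assumptions made for illustration; the actual system uses its own resources.

```python
import random

# Hypothetical resources standing in for the system's lists and Knowledge Base.
RANDOM_ENTITIES = {"LOCATION": ["Cairo", "Oslo", "Lima"]}
SUPERCLASSES = {"Lisbon": "City"}

def suppress(entity, label):
    return "*****"

def tag(entity, label):
    return f"[{label}]"

def random_substitute(entity, label, rng=random):
    # Pick another entity of the same class, never the original itself.
    candidates = [e for e in RANDOM_ENTITIES[label] if e != entity]
    return rng.choice(candidates)

def generalize(entity, label):
    # Assumption: fall back to tagging when the KB has no superclass.
    return SUPERCLASSES.get(entity, tag(entity, label))

def anonymize(text, entities, method):
    """Replace each (surface form, label) pair in text using the chosen method."""
    for entity, label in entities:
        text = text.replace(entity, method(entity, label))
    return text

text = "She moved to Lisbon."
entities = [("Lisbon", "LOCATION")]
for method in (suppress, tag, random_substitute, generalize):
    print(method.__name__, "->", anonymize(text, entities, method))
```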

SLIDE 19

RANDOM SUBSTITUTION

  • Random substitution replaces an entity by another random entity of the same class and with the same morphosyntactic features;
  • Morphosyntactic features are determined by the headword;
  • Aeroporto (masc, sing) → Recinto (masc, sing)
  • Frankfurts (masc, sing, genitiv) → Wegs (masc, sing, genitiv)
  • Random entities are looked up from a default list of entities;

Language  Class     Number    Gender     Case        Term
PT        location  singular  masculine  -           recinto
ES        location  singular  feminine   -           arena
EN        location  singular  -          -           venue
DE        location  singular  neuter     nominative  Wahrzeichen
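A minimal sketch of the lookup in the default list of entities, keyed by the morphosyntactic features shown in the table. The dictionary layout, the use of None for features a language does not mark, and the function name are assumptions; a real list would hold many terms per key.

```python
import random

# (language, class, number, gender, case) -> candidate terms; None = not marked.
DEFAULT_ENTITIES = {
    ("PT", "location", "singular", "masculine", None): ["recinto"],
    ("ES", "location", "singular", "feminine", None): ["arena"],
    ("EN", "location", "singular", None, None): ["venue"],
    ("DE", "location", "singular", "neuter", "nominative"): ["Wahrzeichen"],
}

def pick_replacement(language, entity_class, number, gender=None, case=None, rng=random):
    """Return a random term with matching features, or None if the list has none."""
    key = (language, entity_class, number, gender, case)
    candidates = DEFAULT_ENTITIES.get(key)
    if not candidates:
        return None
    return rng.choice(candidates)

print(pick_replacement("PT", "location", "singular", gender="masculine"))
print(pick_replacement("DE", "location", "singular", gender="neuter", case="nominative"))
```

Matching on the full feature tuple is what lets the replacement agree with surrounding articles and adjectives in languages such as Portuguese and German.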

SLIDE 20

GENERALIZATION

  • Generalization is any method of replacing an entity by another that mentions an item of the same type but in a more general way;
  • This method accesses a Knowledge Base in order to retrieve the superclasses of a given entity.

London, Berlin, Lisbon, Madrid → City
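A minimal sketch of superclass retrieval, using a toy in-memory Knowledge Base. The City entries come from the slide; the "City" to "Location" link and the function names are hypothetical, and the real system queries an external KB.

```python
# Toy Knowledge Base: each item points to its direct superclass.
SUPERCLASS_OF = {
    "London": "City", "Berlin": "City", "Lisbon": "City", "Madrid": "City",
    "City": "Location",  # hypothetical extra level, for illustration only
}

def superclasses(entity, kb=SUPERCLASS_OF):
    """Return the chain of superclasses, most specific first."""
    chain, seen = [], {entity}
    current = entity
    # The 'seen' set guards against cycles in the KB.
    while current in kb and kb[current] not in seen:
        current = kb[current]
        chain.append(current)
        seen.add(current)
    return chain

def generalize(entity, kb=SUPERCLASS_OF):
    """Return the most specific superclass, or None when the KB lacks the entity."""
    chain = superclasses(entity, kb)
    return chain[0] if chain else None

print(generalize("Lisbon"))
print(superclasses("Lisbon"))
print(generalize("Atlantis"))
```

The None result for an unknown entity shows why the availability of this method depends on the coverage of the Knowledge Base.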

SLIDE 21

EVALUATION

SLIDE 22

EVALUATION

1) Detection of sensitive information (named entities)
      → does it remove all sensitive information?
2) Performance of the coreference between entities
      → does it replace the same entities with the same label?
3) Adequacy of the replacements
      → do the results look natural to a human reader?

Evaluation of previous studies on anonymization aimed at:

  • detection of entities in a text (we also evaluate points 2 and 3);
  • clinical report text style (we target various text styles);

SLIDE 23

DATASETS

  • We target different text domains and languages;
  • We use corpora divided into documents from 2 different sources, with different text domains, for each language:
      English: CoNLL 2003 + DCEP
      German: CoNLL 2003 + DCEP
      Portuguese: Segundo HAREM + DCEP
      Spanish: CoNLL 2002 + DCEP
  • DCEP reports were manually annotated for named entities;
  • All datasets were manually annotated for coreference between entities.

SLIDE 24

DETECTION OF SENSITIVE INFORMATION

  • Intrinsic evaluation of the performance of NER: f1-score (also recall);
  • 3 classes of entities: Location, Organization and Person;
  • 5 configurations:
      - Baseline;
      - Baseline + Pattern-matching;
      - Baseline + Second-pass Detection;
      - Baseline + Parallel NER classifier;
      - Baseline + All previous configurations;
  • Statistically different results from the baseline.
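For reference, a short sketch of how precision, recall and f1-score can be computed for detected entities. Exact matching of (start, end, label) triples is an assumption; the slides do not say which matching criterion was used.

```python
def prf1(gold, predicted):
    """Entity-level precision, recall and F1 over sets of (start, end, label)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = {(0, 2, "PERSON"), (5, 6, "LOCATION"), (9, 10, "ORGANIZATION")}
predicted = {(0, 2, "PERSON"), (5, 6, "ORGANIZATION")}
print(prf1(gold, predicted))
```

Recall is reported alongside f1-score because, for anonymization, every missed entity is sensitive information left in the text.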

SLIDE 25

DETECTION OF SENSITIVE INFORMATION

  • Baseline: performance depends on the text domain;
  • Gazetteers: significantly* improve recall;
  • Second-pass: significantly* improves f1-score (on some datasets); also adds false positives;
  • Parallel NER: improves the performance (same training text domain), not significantly* when compared with Second-pass in CoNLL;
  • All modules: improve f1-score only on DCEP.

* p < 0.01, compared with baseline.

SLIDE 26

DETECTION OF SENSITIVE INFORMATION

SLIDE 27

DETECTION OF SENSITIVE INFORMATION

SLIDE 28

COMPARING WITH I2B2 RESULTS

  • i2b2 2006: 5th place, against 6 other systems;
  • i2b2 2014: 5th place, against 10 other systems.

SLIDE 29

COREFERENCE RESOLUTION

  • Baseline: no coreference;
  • Metrics: B-Cubed Score;
  • Results depend on the language and text domain;
  • Performance of coreference resolution is satisfactory:
  • Precision close to 1.0;
  • Recall much higher than the baseline.
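A minimal sketch of the B-Cubed score, assuming the gold and system clusterings cover the same set of mentions. The mention identifiers and the toy clusters are hypothetical; the "no coreference" baseline is modelled as one singleton cluster per mention.

```python
def b_cubed(gold_clusters, system_clusters):
    """B-Cubed precision and recall, averaged over all mentions."""
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    system_of = {m: frozenset(c) for c in system_clusters for m in c}
    mentions = list(gold_of)
    # Per mention: overlap between its gold cluster and its system cluster.
    precision = sum(len(gold_of[m] & system_of[m]) / len(system_of[m])
                    for m in mentions) / len(mentions)
    recall = sum(len(gold_of[m] & system_of[m]) / len(gold_of[m])
                 for m in mentions) / len(mentions)
    return precision, recall

gold = [{"m1", "m2", "m3"}, {"m4"}]
no_coreference = [{"m1"}, {"m2"}, {"m3"}, {"m4"}]  # baseline: singletons
system = [{"m1", "m2"}, {"m3"}, {"m4"}]
print(b_cubed(gold, no_coreference))
print(b_cubed(gold, system))
```

In this toy case the singleton baseline already has perfect precision, so a coreference module can only show its value through recall, which matches the way the results above are reported.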

SLIDE 30

COREFERENCE RESOLUTION

SLIDE 31

ANONYMIZATION

  • Metrics: availability and relevance of a substitution;
  • Effects of anonymization on the coreference of entities;
  • The relevance of a substitution within a context was measured using human raters, as a ratio.

SLIDE 32

ANONYMIZATION

ORIGINAL: Among other points , Parliament states its readiness to evaluate proposals for a general correction mechanism based on the principle of Community solidarity .

SUPPRESSION: Among other points , ****** states its readiness to evaluate proposals for a general correction mechanism based on the principle of ****** solidarity .

TAGGING: Among other points , ORGANIZATION1 states its readiness to evaluate proposals for a general correction mechanism based on the principle of ORGANIZATION2 solidarity .

RANDOM SUBSTITUTION: Among other points , Entreprise 1 states its readiness to evaluate proposals for a general correction mechanism based on the principle of Society 1 solidarity .

GENERALIZATION: Among other points , legislature states its readiness to evaluate proposals for a general correction mechanism based on the principle of social group solidarity .

SLIDE 33

ANONYMIZATION

AVAILABILITY

  • SUPPRESSION: always available;
  • TAGGING: always available;
  • RANDOM SUBSTITUTION: always available;
  • GENERALIZATION: depends on the recall of the KB.

SLIDE 34

ANONYMIZATION

RELEVANCE

  • SUPPRESSION: not relevant;
  • TAGGING: depends on NER performance;
  • RANDOM SUBSTITUTION: low relevance; possible semantic drifts (EN-CONLL: 0.36; EN-DCEP: 0.16);
  • GENERALIZATION: depends on NE Linking and KB; higher than Random Substitution (EN-CONLL: 0.75; EN-DCEP: 0.77).

SLIDE 35

ANONYMIZATION

COREFERENCE

  • SUPPRESSION: low performance by humans; lower than the CRR module;
  • TAGGING: performance of the CRR module;
  • RANDOM SUBSTITUTION: performance of the CRR module;
  • GENERALIZATION: performance of the CRR module.

SLIDE 36

ANONYMIZATION

REPLACEMENTS

  • SUPPRESSION: tags (xxxxxxx);
  • TAGGING: tags ([**LOCAL1**]);
  • RANDOM SUBSTITUTION: real random entities; aims at MS* agreement;
  • GENERALIZATION: real superclass entities; no MS agreement.

* MS = Morphosyntactic

SLIDE 37

INTEGRATING OUR SYSTEM

SLIDE 38

INTEGRATION IN THE L2F STRING CHAIN

  • We added a new module to the STRING chain that performs anonymization of Portuguese texts;
  • The STRING chain provides:
      - Named Entities;
      - Morphosyntactic features of the entities;
  • We convert the output of STRING to the internal data format of our system, process coreferences and run an anonymization method over this data.

SLIDE 39

INTEGRATION IN THE L2F STRING CHAIN

  • STRING supports a wide range of NE classes;
  • Anonymization methods were created for some of them;

                  Original            Random                    Generalization
Phone Number      210000001           827843582
Email             teste@teste.com     wusiksw@example.com
NIB               PT50 0000 00(...)   PT5068116292115585228351
Date (D M Y)      3 de Junho de 2016  2 de Abril de 2045        “certo dia”
Date (YYYY)       em 2016             em 2009                   “em certo ano”
Time Intervals    de 3 a 6 de Junho   de 4 a 21 de Abril        “entre duas datas”
Time References   2 dias depois       32 dias depois            “depois de algum tempo”
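One possible way to produce random replacements for structured identifiers such as phone numbers, NIBs and e-mail addresses is to keep the format and randomize the content. This is only a sketch: the function names are hypothetical, check digits are not recomputed, and the slides do not describe how the actual module generates these values.

```python
import random
import string

def random_digits(original, rng=random):
    """Replace every digit with a random digit, keeping all other characters."""
    return "".join(rng.choice(string.digits) if ch.isdigit() else ch
                   for ch in original)

def random_email(rng=random, length=7):
    """Build a random local part on the reserved example.com domain."""
    local = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return f"{local}@example.com"

print(random_digits("210000001"))
print(random_digits("PT50123443211234567890172"))
print(random_email())
```

Keeping the length and the non-digit characters means the replacement still looks like a phone number or a NIB to a human reader, even though it no longer identifies anyone.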

SLIDE 40

INTEGRATION IN THE UNBABEL PIPELINE

  • Text is anonymized and sent to human translators;
  • Original entities are translated by a specialized MT system;
  • Automated anonymization is not perfect;
  • The anonymized text can be changed by the MT and humans.
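A minimal sketch of the round trip suggested by the points above: the text is tagged, a table of solutions is kept aside, and entities are put back after translation. The tag format follows the tagging examples in this deck; the function names, the global tag counter and the `translate_entity` callback (standing in for the specialized MT system) are assumptions.

```python
import re

def anonymize(text, entities):
    """Tag entities and return the anonymized text plus a table of solutions."""
    table = {}
    for index, (entity, label) in enumerate(entities, start=1):
        tag = f"[**{label}{index}**]"
        table[tag] = entity
        text = text.replace(entity, tag)
    return text, table

def restore(text, table, translate_entity=lambda entity: entity):
    """Put (translated) entities back; report tags that were lost on the way."""
    missing = [tag for tag in table if tag not in text]

    def put_back(match):
        tag = match.group(0)
        # Unknown tags are left untouched rather than guessed.
        return translate_entity(table[tag]) if tag in table else tag

    return re.sub(r"\[\*\*.+?\*\*\]", put_back, text), missing

anonymized, table = anonymize("O João foi ao Porto.",
                              [("João", "PESSOA"), ("Porto", "LOCAL")])
print(anonymized)
translated = "[**PESSOA1**] went to [**LOCAL2**]."  # pretend human translation
print(restore(translated, table))
```

Reporting the missing tags matters because MT systems and human translators may alter or drop a tag, in which case the original entity cannot be restored automatically.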

SLIDE 41

CONCLUSION

SLIDE 42

CONCLUSION

  • We have presented an implementation of a multilingual, automated anonymization system for text documents;
  • Based on NER classifiers;
  • Maintains coreference between entities;
  • The second-pass detection and parallel NER classifiers were shown to significantly improve the performance of the detection.

SLIDE 43

CONCLUSION

  • Suppression is a simple method but removes relevant semantic information from the text;
  • Tagging is able to keep some of the information about the type of entity and the coreferential integrity of the mentions;
  • Random substitution results in a natural text but introduces some semantic drift;
  • Generalization results in a natural text but is limited by the recall of the KB and NE linking.

SLIDE 44

FUTURE WORK

Improve:

  • Gazetteers and KBs;
  • Relevance of substitutions using Named Entity Linking;
  • Intelligent generalization method;

Test:

  • Larger datasets for evaluation;
  • Information content approach (instead of NER);
  • Dependency parser in order to also identify and anonymize indirect identifiers.

SLIDE 45

CONTRIBUTIONS

  • Multilingual text anonymization system;
  • Web-based annotation platform;
  • Gold-standard NER corpora composed of DCEP reports;
  • The integration of this system in the STRING chain;
  • The deployment of this system on the Unbabel servers;
  • A paper at the WCCI 2016 international conference;
  • All data is publicly available at https://www.l2f.inesc-id.pt/~fdias/mscthesis/

SLIDE 46

MULTILINGUAL AUTOMATED TEXT ANONYMIZATION

SLIDE 47

O João Silva, de 25 anos, foi ao Porto em 3 de Junho de 201 num carro Mercedes de matrícula 12-13-KL com dispositivo de localização da Toshiba de endereço 128.0.0.1. Ele pretendia chegar à Escola Municipal às 14 horas. A viatura foi roubada. O João, ligou para o número do Francisco, 962952857, e localizaram a viatura depois de 6 horas. Ele comprou o localizador no OLX, transferindo para o NIB PT50123443211234567890172 uma quantia de 230 euros.

SLIDE 48

O -----, -----, foi ao ----- ----- num carro Mercedes de matrícula ----- com dispositivo de localização da ----- de endereço -----. Ele pretendia chegar à ----- -----. A viatura foi roubada. O -----, ligou para o número do -----, -----, e localizaram a viatura -----. Ele comprou o localizador no OLX, -----transferindo para o NIB ----- uma quantia -----.

SLIDE 49

O [**PESSOA2**], [**DATA4**], foi ao [**LOCAL2**] [**DATA3**] num carro Mercedes de matrícula [**MATRICULA1**] com dispositivo de localização da [**ORGANIZACAO1**] de endereço [**ENDERECO IP1**]. Ele pretendia chegar à [**LOCAL1**] [**DATA2**]. A viatura foi roubada. O [**PESSOA2**], ligou para o número do [**PESSOA1**], [**PHONE1**], e localizaram a viatura [**DATA1**]. Ele comprou o localizador no OLX, transferindo para o NIB [**NIB1**] uma quantia [**VALOR1**].

SLIDE 50

O [**Miguel1**], [**data1**], foi ao [**Viseu**] [**em 14/9/2018**] num carro Mercedes de matrícula [**TW-23-KQ**] com dispositivo de localização da [**Empresa**] de endereço [**403.198.31.155**]. Ele pretendia chegar à [**Portalegre**] [**data**]. A viatura foi roubada. O [**Miguel1**], ligou para o número do [**Miguel**], [**899030283**], e localizaram a viatura [**depoiJunho de 2043 hora**]. Ele comprou o localizador no OLX, transferindo para o NIB [**PT5061590168374705433994**] uma quantia [**3441 euros**].

SLIDE 51

O [**Afonso**], [**data 2**], foi ao [**cidade**] [**data 2**] num carro Mercedes de matrícula [**matrícula**] com dispositivo de localização da [**negócio**] de endereço [**endereço de IP**]. Ele pretendia chegar à [**instituição educacional**] [**data 1**]. A viatura foi roubada. O [**Afonso**], ligou para o número do [**Rodrigo**], [**número de telefone**], e localizaram a viatura [**data**]. Ele comprou o localizador no OLX, transferindo para o NIB [**IBAN**] uma quantia [**quantia em dinheiro**].
