SLIDE 1 MULTILINGUAL AUTOMATED TEXT ANONYMIZATION
Francisco Dias
francisco.m.c.dias@tecnico.ulisboa.pt
03/06/2016 Instituto Superior Técnico Slide 1 of 52
SLIDE 2 INTRODUCTION
- INTRODUCTION
- RELATED WORK
- ARCHITECTURE
- ANONYMIZATION METHODS
- EVALUATION
- INTEGRATING OUR SYSTEM
- CONCLUSION
SLIDE 3
INTRODUCTION
ANONYMIZATION
- From the Ancient Greek anónumos (“without name”);
- It suppresses names and sensitive information;
TEXT
- It processes data in the form of text;
- A text contains unstructured data;
AUTOMATED
- It runs without human intervention;
MULTILINGUAL
- It processes texts written in different languages.
SLIDE 4 MOTIVATION
- Information sharing in text-form is important in some areas;
(clinical and scientific research, decision making, among others)
- Texts may contain private information, protected by law;
- In order to share information in text-form, all sensitive
information should be removed.
- Manual redaction is a hard and time-consuming task.
An automated anonymization system could help in this task.
SLIDE 5 CHALLENGE
- To implement a multilingual anonymization system for integration in:
→ STRING NLP Chain;
→ Unbabel Translation Pipeline;
- Support 4 languages: English, German, Portuguese, Spanish;
- Evaluate the anonymization system:
→ does it remove all sensitive information?
→ does it replace the same entities with the same label?
→ do the results look natural to a human reader?
SLIDE 6
RELATED WORK
SLIDE 7 RELATED WORK
- Most previous work is based on Named Entity Recognition (NER) techniques;
- Previous work was evaluated on the
detection of entities in the text;
- i2b2 organized two de-identification challenges:
2006 and 2014.
SLIDE 8 RELATED WORK
- MITRE, Wellner et al., 2006
- Model-based and Pattern-matching techniques;
- Best performance on i2b2 2006 challenge;
SLIDE 9 RELATED WORK
- Szarvas et al. System, 2006
- Model-based classifiers in parallel and a voting module;
- Post-processing iteration in order to detect more candidates;
SLIDE 10 RELATED WORK
- Aramaki et al. System, 2006
- A CRF * classifier detects candidate sensitive information;
- Label-consistency post-processing;
* CRF: Conditional Random Fields
SLIDE 11 RELATED WORK
- HIDE, Gardner et al., 2008
- A CRF classifier detects candidate sensitive information;
- Uses coreferences in order to detect more candidates;
SLIDE 12 RELATED WORK
- “Nottingham System”, Yang & Garibaldi, 2014
- Model-based (CRF) and Pattern-matching techniques;
- It uses coreferences in order to detect more candidates;
- Best performance on i2b2 2014 challenge;
SLIDE 13
ARCHITECTURE
SLIDE 14 ARCHITECTURE
- Pipeline with 5 modules;
- Based on NER techniques;
- Post-processing and coreference modules;
Modules: Pre-processing → NER → Second-pass Detection → Coreference Resolution → Anonymization
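The five-module pipeline can be pictured as a chain of functions, each one receiving the document produced by the previous module. The sketch below is only an illustration: the module bodies are empty stubs, the names `Document`, `make_stub`, `run_pipeline` and the `trace` field are hypothetical, and the order of the last two modules is assumed from the module descriptions on the following slides.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    text: str
    # Names of the modules applied so far (for illustration only).
    trace: List[str] = field(default_factory=list)


def make_stub(name: str) -> Callable[[Document], Document]:
    """Build a placeholder module that only records that it ran."""
    def module(doc: Document) -> Document:
        doc.trace.append(name)
        return doc
    return module


# Assumed module order; the real modules are replaced by stubs.
MODULE_NAMES = [
    "pre-processing",
    "ner",
    "second-pass detection",
    "coreference resolution",
    "anonymization",
]
MODULES = [make_stub(name) for name in MODULE_NAMES]


def run_pipeline(doc: Document, modules=MODULES) -> Document:
    """Apply each module in turn to the output of the previous one."""
    for module in modules:
        doc = module(doc)
    return doc
```

Keeping every module behind the same call signature is one way to let components be added or removed per configuration.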
SLIDE 15 ARCHITECTURE
- The NER module detects sensitive information contained in the text;
- It is composed of several parallel components;
Components: Pattern-matching, Main NER Classifier, Parallel NER Classifier 1, Parallel NER Classifier 2 → Voting
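The slides do not say how the voting step combines the parallel components. As one possible reading, the sketch below applies a per-token majority vote and falls back to the main classifier when there is a tie. The function name `vote` and the label strings are hypothetical.

```python
from collections import Counter


def vote(predictions):
    """Combine per-token labels from several classifiers.

    predictions: list of label sequences of equal length; the first
    sequence is assumed to come from the main classifier.
    """
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best, best_count = counts.most_common(1)[0]
        tied = [lab for lab, n in counts.items() if n == best_count]
        # On a tie, trust the main classifier (an assumption of this sketch).
        voted.append(best if len(tied) == 1 else labels[0])
    return voted
```

For example, `vote([["O", "PER"], ["O", "PER"], ["O", "O"]])` keeps `PER` for the second token because two of the three classifiers agree.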
SLIDE 16 ARCHITECTURE
SECOND-PASS DETECTION
- Post-processing step; corrections over NER results;
- It applies Short-forms and Label-Consistency;
COREFERENCE RESOLUTION
- Groups the mentions of named entities into coreference chains;
- All mentions in a chain refer to the same extra-linguistic object;
ANONYMIZATION MODULE
- Implements anonymization methods;
- Returns an anonymized text and a table of solutions.
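As an illustration of the second-pass detection step described above, the sketch below propagates labels found by the NER module to unlabelled occurrences of the same tokens. This covers label consistency (a repeated name) and short forms (a single token of a longer name) in one simplified rule. The function name `second_pass`, the `"O"` outside label and the capitalisation filter are assumptions of this sketch, not details taken from the system.

```python
def second_pass(tokens, labels, outside="O"):
    """Re-label unlabelled tokens that were labelled elsewhere in the document."""
    known = {}
    for token, label in zip(tokens, labels):
        # Only capitalised tokens are remembered, to avoid spreading labels
        # to function words inside multi-word names (a simplification).
        if label != outside and token[:1].isupper():
            known.setdefault(token, label)  # the first label seen wins
    return [
        known.get(token, label) if label == outside else label
        for token, label in zip(tokens, labels)
    ]
```

A rule like this raises recall but can also add false positives, for instance when a surname is also a common capitalised word.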
SLIDE 17
ANONYMIZATION METHODS
SLIDE 18 ANONYMIZATION METHODS
- The methods obfuscate original entities in the text
using replacement tags or entities;
- We implemented 4 anonymization methods:
- Suppression: Lisbon → *****
- Tagging: Lisbon → [LOCATION]
- Random substitution: Lisbon → Cairo
- Generalization: Lisbon → City
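A minimal sketch of the four methods applied to the "Lisbon" example from this slide. The `Entity` class, the function names, the toy `RANDOM_POOL` and `SUPERCLASS` tables, and the fallback from generalization to tagging are all hypothetical; they stand in for the default entity list and the Knowledge Base of the real system.

```python
import random
from dataclasses import dataclass


@dataclass
class Entity:
    text: str   # surface form found in the text
    label: str  # NER class, e.g. "LOCATION"


# Toy stand-ins for the default entity list and the Knowledge Base.
RANDOM_POOL = {"LOCATION": ["Cairo"]}
SUPERCLASS = {"Lisbon": "City"}


def suppress(entity):
    """Replace the entity with a neutral mask."""
    return "*****"


def tag(entity):
    """Replace the entity with its class label."""
    return f"[{entity.label}]"


def random_substitute(entity, rng=random):
    """Replace the entity with another entity of the same class."""
    pool = [e for e in RANDOM_POOL.get(entity.label, []) if e != entity.text]
    return rng.choice(pool) if pool else tag(entity)


def generalize(entity):
    """Replace the entity with its superclass; fall back to a tag (assumed)."""
    return SUPERCLASS.get(entity.text, tag(entity))
```

With `Entity("Lisbon", "LOCATION")` the four functions return `*****`, `[LOCATION]`, `Cairo` and `City` respectively.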
SLIDE 19 RANDOM SUBSTITUTION
- Random substitution replaces an entity by another random entity
from the same class and morphosyntactic features;
- Morphosyntactic features are determined by the headword;
Aeroporto (masc, sing) → Recinto (masc, sing)
Frankfurts (masc, sing, genitive) → Wegs (masc, sing, genitive)
- Random entities are looked up from a default list of entities;
Language  Class     Number    Gender     Case        Term
PT        location  singular  masculine  -           recinto
ES        location  singular  feminine   -           arena
EN        location  singular  -          -           venue
DE        location  singular  neuter     nominative  Wahrzeichen
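The lookup of a random replacement by class and morphosyntactic features could be sketched as a dictionary keyed on those features. The four entries below are the default-list rows shown on this slide; the key layout and the name `random_substitution` are assumptions, and a missing feature is stored as `None`.

```python
import random

# (language, class, number, gender, case) -> candidate replacement terms.
DEFAULT_ENTITIES = {
    ("PT", "location", "singular", "masculine", None): ["recinto"],
    ("ES", "location", "singular", "feminine", None): ["arena"],
    ("EN", "location", "singular", None, None): ["venue"],
    ("DE", "location", "singular", "neuter", "nominative"): ["Wahrzeichen"],
}


def random_substitution(lang, cls, number, gender=None, case=None, rng=None):
    """Pick a random default entity that agrees with the given features.

    Returns None when the default list has no entry for these features.
    """
    rng = rng or random
    candidates = DEFAULT_ENTITIES.get((lang, cls, number, gender, case), [])
    return rng.choice(candidates) if candidates else None
```

Because the features come from the headword of the original entity, the replacement can agree with the surrounding determiners and adjectives.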
SLIDE 20 GENERALIZATION
- Generalization is any method of replacing an entity by another
that mentions an item of the same type but in a more general way;
- This method accesses a Knowledge Base in order to retrieve the
superclasses of a given entity.
Example: London, Berlin, Lisbon, Madrid → City
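Retrieving the superclasses of an entity can be sketched as following parent links in a Knowledge Base. The toy `KB_PARENT` table below contains the cities from this slide; the extra `"Location"` level and the function name `superclasses` are hypothetical additions used only to show a chain with more than one step.

```python
# Toy Knowledge Base: each item points to its direct superclass.
KB_PARENT = {
    "London": "City",
    "Berlin": "City",
    "Lisbon": "City",
    "Madrid": "City",
    "City": "Location",  # hypothetical second level, for illustration
}


def superclasses(entity, kb=KB_PARENT):
    """Return the chain of superclasses, from most specific to most general."""
    chain = []
    seen = {entity}
    current = kb.get(entity)
    while current is not None and current not in seen:  # guard against cycles
        chain.append(current)
        seen.add(current)
        current = kb.get(current)
    return chain
```

An empty list means the Knowledge Base does not cover the entity, in which case no generalization is available for it.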
SLIDE 21
EVALUATION
SLIDE 22 EVALUATION
1) Detection of sensitive information (named entities)
→ does it remove all sensitive information?
2) Performance of the coreference between entities
→ does it replace the same entities with the same label?
3) Adequacy of the replacements
→ do the results look natural to a human reader?
Previous studies on anonymization evaluated only:
- the detection of entities in a text (we also evaluate points 2 and 3);
- clinical reports (we target various text styles);
SLIDE 23 DATASETS
- We target different text domains and languages.
- For each language, we use corpora with documents from 2 different sources,
each covering a different text domain:
English: CoNLL 2003 + DCEP
German: CoNLL 2003 + DCEP
Portuguese: Segundo HAREM + DCEP
Spanish: CoNLL 2002 + DCEP
- DCEP reports were manually annotated for named entities;
- All datasets were manually annotated for coreference
between entities;
SLIDE 24 DETECTION OF SENSITIVE INFORMATION
- Intrinsic evaluation of the performance of NER: F1-score (also recall);
- 3 classes of entities: Location, Organization and Person;
- 5 configurations:
- Baseline;
- Baseline + Pattern-matching;
- Baseline + Second-pass Detection;
- Baseline + Parallel NER classifier;
- Baseline + All previous configurations;
- Results are tested for statistically significant differences from the baseline.
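For reference, precision, recall and F1-score over detected entities can be computed as below. This is the standard definition under exact matching; representing an entity as a `(start, end, label)` tuple and the function name `prf` are assumptions of this sketch, and the matching criterion used in the actual evaluation is not stated on the slide.

```python
def prf(gold, predicted):
    """Precision, recall and F1 over sets of (start, end, label) entities."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Recall is reported separately because, for anonymization, a missed entity (a leak of sensitive information) is usually more costly than a false positive.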
SLIDE 25 DETECTION OF SENSITIVE INFORMATION
- Performance depends on the text domain;
- Significantly* improves recall;
- Significantly* improves F1-score (on some datasets), but also adds false positives;
- Improves performance (when the training text domain is the same); not significantly* when compared with Second-pass Detection on CoNLL;
- Improves F1-score only on DCEP;
* p < 0.01, compared with the baseline.
SLIDE 26 DETECTION OF SENSITIVE INFORMATION
SLIDE 27 DETECTION OF SENSITIVE INFORMATION
SLIDE 28 COMPARING WITH I2B2 RESULTS
2006: 5th place, against 6 other systems
2014: 5th place, against 10 other systems
SLIDE 29 COREFERENCE RESOLUTION
- Baseline: no coreference;
- Metrics: B-Cubed Score;
- Results depend on the language and text domain;
- Performance of coreference resolution is satisfactory:
- Precision close to 1.0;
- Recall much higher than the baseline.
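The B-Cubed score mentioned above can be computed per mention and then averaged. The sketch below follows the standard definition; the input format (a dictionary from mention id to cluster id) and the function name `b_cubed` are assumptions. The "no coreference" baseline is modelled here as every mention in its own cluster, which is also an assumption.

```python
def b_cubed(gold, system):
    """B-Cubed precision and recall.

    gold, system: dicts mapping each mention id to a cluster id,
    defined over the same set of mentions.
    """
    def mention_clusters(assignment):
        clusters = {}
        for mention, cluster in assignment.items():
            clusters.setdefault(cluster, set()).add(mention)
        # Map each mention to the full set of mentions in its cluster.
        return {m: clusters[c] for m, c in assignment.items()}

    gold_sets = mention_clusters(gold)
    sys_sets = mention_clusters(system)
    n = len(gold)
    precision = sum(len(gold_sets[m] & sys_sets[m]) / len(sys_sets[m]) for m in gold) / n
    recall = sum(len(gold_sets[m] & sys_sets[m]) / len(gold_sets[m]) for m in gold) / n
    return precision, recall
```

With singleton clusters the precision is always 1.0 while recall drops as gold clusters grow, which is why a good coreference module shows up mainly as a gain in recall.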
SLIDE 30 COREFERENCE RESOLUTION
SLIDE 31 ANONYMIZATION
- Metrics: Availability and relevance of a substitution;
- Effects of anonymization in the coreference of entities;
- The relevance of a substitution within a context was measured
using human raters, as the ratio:
SLIDE 32 ANONYMIZATION
ORIGINAL: Among other points, Parliament states its readiness to evaluate proposals for a general correction mechanism based on the principle of Community solidarity.
SUPPRESSION: Among other points, ****** states its readiness to evaluate proposals for a general correction mechanism based on the principle of ****** solidarity.
TAGGING: Among other points, ORGANIZATION1 states its readiness to evaluate proposals for a general correction mechanism based on the principle of ORGANIZATION2 solidarity.
RANDOM SUBSTITUTION: Among other points, Entreprise 1 states its readiness to evaluate proposals for a general correction mechanism based on the principle of Society 1 solidarity.
GENERALIZATION: Among other points, legislature states its readiness to evaluate proposals for a general correction mechanism based on the principle of social group solidarity.
SLIDE 33 ANONYMIZATION
AVAILABILITY
SUPPRESSION: Always available
TAGGING: Always available
RANDOM SUBSTITUTION: Always available
GENERALIZATION: Depends on the recall of the KB
SLIDE 34 ANONYMIZATION
RELEVANCE
SUPPRESSION: Not relevant
TAGGING: Depends on NER performance
RANDOM SUBSTITUTION: Low relevance; possible semantic drift (EN-CONLL: 0.36, EN-DCEP: 0.16)
GENERALIZATION: Depends on NE Linking and the KB; higher than Random Substitution (EN-CONLL: 0.75, EN-DCEP: 0.77)
SLIDE 35 ANONYMIZATION
COREFERENCE
SUPPRESSION: Low performance by humans; lower than the CRR module
TAGGING: Performance of the CRR module
RANDOM SUBSTITUTION: Performance of the CRR module
GENERALIZATION: Performance of the CRR module
SLIDE 36 ANONYMIZATION
REPLACEMENTS
SUPPRESSION: Tags (xxxxxxx)
TAGGING: Tags ([**LOCAL1**])
RANDOM SUBSTITUTION: Real random entities; aims at MS* agreement
GENERALIZATION: Real superclass entities; no MS agreement
* MS = Morphosyntactic
SLIDE 37
INTEGRATING OUR SYSTEM
SLIDE 38 INTEGRATION IN THE L2F STRING CHAIN
- We added a new module to the STRING chain
that performs anonymization of Portuguese texts;
- The STRING chain provides:
- Named Entities;
- Morphosyntactic features of the entities;
- We convert the output of STRING to the internal data format
of our system, resolve coreferences and run an anonymization
method over this data.
SLIDE 39 INTEGRATION IN THE L2F STRING CHAIN
- STRING supports a wide range of NE classes;
- Anonymization methods were created for some of them;
Class            Original            Random                    Generalization
Phone Number     210000001           827843582                 -
Email            teste@teste.com     -                         -
NIB              PT50 0000 00(...)   PT5068116292115585228351  -
Date (D M Y)     3 de Junho de 2016  2 de Abril de 2045        “certo dia” ("a certain day")
Date (YYYY)      em 2016             em 2009                   “em certo ano” ("in a certain year")
Time Intervals   de 3 a 6 de Junho   de 4 a 21 de Abril        “entre duas datas” ("between two dates")
Time References  2 dias depois       32 dias depois            "depois de algum tempo" ("after some time")
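For numeric classes such as phone numbers, a random replacement can be sketched by swapping every digit for a random one while keeping the rest of the format. The function name `randomize_digits` is hypothetical, and the sketch deliberately ignores details a real system may need, such as valid prefixes or check digits in a NIB.

```python
import random
import re


def randomize_digits(value, rng=None):
    """Replace each digit with a random digit, keeping all other characters."""
    rng = rng or random
    return re.sub(r"\d", lambda match: str(rng.randrange(10)), value)
```

Calling `randomize_digits("210000001")` returns another 9-digit string; since digits are drawn independently, a position may by chance keep its original value.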
SLIDE 40 INTEGRATION IN THE UNBABEL PIPELINE
- Text is anonymized and sent to human translators;
- Original entities are translated by a specialized MT system;
- Automated anonymization is not perfect;
- The anonymized text can be changed by the MT and humans.
SLIDE 41
CONCLUSION
SLIDE 42 CONCLUSION
- We have presented an implementation of a
multilingual, automated anonymization system for text documents;
- It is based on NER classifiers;
- It maintains coreference between entities;
- The second-pass detection and parallel NER classifiers
significantly improved detection performance.
SLIDE 43 CONCLUSION
- Suppression is a simple method but removes relevant
semantic information from the text;
- Tagging keeps some information about the type
of entity and the coreferential integrity of the mentions;
- Random substitution produces natural text but introduces
some semantic drift;
- Generalization results in a natural text but is limited by the
recall of the KB and NE linking.
SLIDE 44 FUTURE WORK
Improve:
- Gazetteers and KBs;
- Relevance of substitutions using Named Entity Linking;
- Intelligent generalization method;
Test:
- Larger datasets for evaluation;
- Information content approach (instead of NER);
- Dependency parser in order to also identify and anonymize indirect identifiers.
SLIDE 45 CONTRIBUTIONS
- Multilingual text anonymization system;
- Web-based annotation platform;
- Gold-standard NER corpus composed of DCEP reports;
- The integration of this system in the STRING chain;
- The deployment of this system on the Unbabel servers;
- A paper at the WCCI 2016 International Conference;
- All data is publicly available at https://www.l2f.inesc-id.pt/~fdias/mscthesis/
SLIDE 46 MULTILINGUAL AUTOMATED TEXT ANONYMIZATION
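SLIDE 47
O João Silva, de 25 anos, foi ao Porto em 3 de Junho de 201 num carro Mercedes de matrícula 12-13-KL com dispositivo de localização da Toshiba de endereço 128.0.0.1. Ele pretendia chegar à Escola Municipal às 14 horas. A viatura foi roubada. O João, ligou para o número do Francisco, 962952857, e localizaram a viatura depois de 6 horas. Ele comprou o localizador no OLX, transferindo para o NIB PT50123443211234567890172 uma quantia de 230 euros.
(English gloss: João Silva, aged 25, went to Porto on 3 June 201 in a Mercedes car with licence plate 12-13-KL, fitted with a Toshiba tracking device with address 128.0.0.1. He intended to arrive at the Escola Municipal at 14:00. The vehicle was stolen. João called Francisco's number, 962952857, and they located the vehicle after 6 hours. He bought the tracker on OLX, transferring 230 euros to the NIB PT50123443211234567890172.)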
SLIDE 48
O -----, -----, foi ao ----- ----- num carro Mercedes de matrícula ----- com dispositivo de localização da ----- de endereço -----. Ele pretendia chegar à ----- -----. A viatura foi roubada. O -----, ligou para o número do -----, -----, e localizaram a viatura -----. Ele comprou o localizador no OLX, ----- transferindo para o NIB ----- uma quantia -----.
SLIDE 49
O [**PESSOA2**], [**DATA4**], foi ao [**LOCAL2**] [**DATA3**] num carro Mercedes de matrícula [**MATRICULA1**] com dispositivo de localização da [**ORGANIZACAO1**] de endereço [**ENDERECO IP1**]. Ele pretendia chegar à [**LOCAL1**] [**DATA2**]. A viatura foi roubada. O [**PESSOA2**], ligou para o número do [**PESSOA1**], [**PHONE1**], e localizaram a viatura [**DATA1**]. Ele comprou o localizador no OLX, transferindo para o NIB [**NIB1**] uma quantia [**VALOR1**].
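In the tagged text above, both mentions of the same person receive the same tag, which is what keeps coreference readable after anonymization. A minimal sketch of such numbering is shown below: each (class, coreference cluster) pair gets one index. The function name `tag_mentions` and the input format are hypothetical, and the sketch numbers clusters by first appearance, which is not necessarily the order used on the slide.

```python
def tag_mentions(mentions):
    """Build one tag per mention, sharing the index within a coreference cluster.

    mentions: list of (label, cluster_id) pairs in text order.
    """
    index_of = {}   # (label, cluster_id) -> index within that label
    next_index = {}  # label -> last index handed out
    tags = []
    for label, cluster in mentions:
        key = (label, cluster)
        if key not in index_of:
            next_index[label] = next_index.get(label, 0) + 1
            index_of[key] = next_index[label]
        tags.append(f"[**{label}{index_of[key]}**]")
    return tags
```

For the mentions João, Porto, João, Francisco this yields `[**PESSOA1**]`, `[**LOCAL1**]`, `[**PESSOA1**]`, `[**PESSOA2**]`.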
SLIDE 50
O [**Miguel1**], [**data1**], foi ao [**Viseu**] [**em 14/9/2018**] num carro Mercedes de matrícula [**TW-23-KQ**] com dispositivo de localização da [**Empresa**] de endereço [**403.198.31.155**]. Ele pretendia chegar à [**Portalegre**] [**data**]. A viatura foi roubada. O [**Miguel1**], ligou para o número do [**Miguel**], [**899030283**], e localizaram a viatura [**depoiJunho de 2043 hora**]. Ele comprou o localizador no OLX, transferindo para o NIB [**PT5061590168374705433994**] uma quantia [**3441 euros**].
SLIDE 51
O [**Afonso**], [**data 2**], foi ao [**cidade**] [**data 2**] num carro Mercedes de matrícula [**matrícula**] com dispositivo de localização da [**negócio**] de endereço [**endereço de IP**]. Ele pretendia chegar à [**instituição educacional**] [**data 1**]. A viatura foi roubada. O [**Afonso**], ligou para o número do [**Rodrigo**], [**número de telefone**], e localizaram a viatura [**data**]. Ele comprou o localizador no OLX, transferindo para o NIB [**IBAN**] uma quantia [**quantia em dinheiro**].