SLIDE 1

Chemical Names: Terminological Resources and Corpora Annotation

Corinna Kolářik, Roman Klinger, C. M. Friedrich, M. Hofmann-Apitius, J. Fluck

Workshop BERBTM ’08 at LREC ’08 Marrakech, Morocco

26 May 2008

SLIDE 2

Outline Introduction Terminological Resources Test Corpus ML-based Recognition Conclusion & Summary

Outline

1. Introduction
2. Terminological Resources
3. Test Corpus
4. Machine Learning based Recognition
5. Conclusion & Summary

Roman Klinger – Chemical Names: Terminological Resources and Corpora Annotation 2/25

SLIDE 3

Introduction

Most efforts in Named Entity Recognition were spent on Genes and Proteins (well-established methods available, BioCreative I & II)
  Corpora, Comparable Systems, Standard Dictionary Sources
Chemical Named Entities are important for:
  Medical Applications, Drug Development, Pharmaceutical Research, Analysis of biochemical pathways, ...
⇒ Need for terminological resources and annotated corpora

SLIDE 4

Introduction – Examples

Novel nonnarcotic analgesics with an improved therapeutic ratio. Structure-activity relationships of 8-(methylthio)- and 8-(acylthio)-1,2,3,4,5,6-hexahydro-2,6-methano-3-benzazocines.

Conversion of the 8-phenolic 1,2,3,4,5,6-hexahydro-2,6-methano-3-benzazocines to the corresponding 8-thiophenolic analogues was achieved by three different routes. Diazotization of 8-amino-2,6-methano-3-benzazocine (2) followed by the reaction with CH3SNa afforded 8-(methylthio)-1,2,3,4,5,6-hexahydro-2,6-methano-3-benzazocine (3). Another route using Grewe cyclization was also examined for the synthesis of 3. As the most effective route, Newman-Kwart rearrangement of benzazocines was selected and closely investigated. 8-(N,N-Dimethylthiocarbamoyl)oxy derivatives (6a-e) rearranged to 8-(N,N-dimethylcarbamoyl)thio derivatives (7a-e) in good yields. Reductive cleavage of 7a-e and subsequent methylation or acylations gave the title compounds (3, 8-24). Although analgesic activities of sulfur-containing benzazocines decreased compared to the corresponding hydroxy compounds, the N-methyl derivative (S-metazocine, 8) showed potent analgesic activity.

PMID 2999399: Hori M, et al. J Med Chem. 1985 Nov;28(11):1656-61.

SLIDE 5

Introduction – Examples

Different nomenclatures: Aspirin
  Trade name: Aspirin
  Formula: C9H8O4
  IUPAC: 2-acetyloxybenzoic acid
  SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
  Other synonyms: Acetylsalicylate, Enterosarein, Acenterine, Acylpyrin, Acetosal, Colfarit, Acetylsalicylic Acid, Acetosalic acid, Enterosarine

Normalization

Mapping to unique structure or identifier

InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H
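Normalization as described on this slide can be sketched as a lookup from any synonym to one identifier. This is only an illustration: the table `SYNONYM_TO_ID`, the function `normalize_name`, and the lower-casing step are assumptions, not part of the presented work. The names and the InChI string are the ones listed on the slide.

```python
# Hypothetical mini-dictionary: each synonym maps to one identifier.
# The InChI string is the one from the slide; the table layout is an assumption.
ASPIRIN_INCHI = (
    "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12"
    "/h2-5H,1H3,(H,11,12)/f/h11H"
)

SYNONYM_TO_ID = {
    "aspirin": ASPIRIN_INCHI,
    "2-acetyloxybenzoic acid": ASPIRIN_INCHI,
    "acetylsalicylic acid": ASPIRIN_INCHI,
    "acetylsalicylate": ASPIRIN_INCHI,
    "acetosal": ASPIRIN_INCHI,
    "colfarit": ASPIRIN_INCHI,
}


def normalize_name(name):
    """Return the identifier for a chemical name, or None if it is unknown."""
    return SYNONYM_TO_ID.get(name.strip().lower())


# Trade name and synonym resolve to the same structure identifier.
print(normalize_name("Acetosal") == normalize_name("Aspirin"))  # True
```

A name that is missing from the table returns None, which is the case the resource analysis later in the talk tries to quantify.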

SLIDE 6

Introduction – Examples

Normalization

Mapping to unique structure or identifier

Additional available information

Molecular Weight: 180.15742 g/mol
Heavy Atom Count: 13
Structure Search
Classes: Benzoic acid family, Cyclooxygenase inhibitors
Therapeutic Indications: Fever, Inflammation, Pain
...

Where to find such information with structure and synonyms?

SLIDE 7

Terminological Resources – Commercial Databases

CrossFire Beilstein Database
  10 million organic compounds
  Information on bioactivity and physical properties
  Literature references

CAS Registry℠
  35 million organic and inorganic substances
  Unique IDs (CAS Registry Numbers) assigned ⇒ established IDs

The World Drug Index
  80,000 marketed and development drugs
  Drug names, synonyms, trade names, trivial names
  Drug activity, treatment, manufacturer, medical information

SLIDE 8

Terminological Resources – Freely Available Res.

PubChem (http://pubchem.ncbi.nlm.nih.gov/)
  Compound: 18.4 million compounds, structure information, SMILES, InChI, IUPAC, no synonyms
  Substance: 36.8 million entries: substances and proteins, mixtures, extracts, complexes, no SMILES, few InChI

Chemical Entities of Biological Interest (ChEBI) (http://www.ebi.ac.uk/chebi/)
  15,562 entries
  Ontological classification
  Synonyms and structure information

SLIDE 9

Terminological Resources – Freely Available Res.

MeSH Medical Subject Headings (referred to as MeSH_T)
  Thesaurus used for indexing MEDLINE, hierarchically organized
  Chemical category contains 8,612 entries
  No structure information

Supplementary Concept Records (formerly Supplementary Chemical Records) provided by the National Library of Medicine (referred to as MeSH_C)
  175,136 entries
  Synonyms, no structure but CAS identifier

SLIDE 10

Terminological Resources – Freely Available Res.

Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/)
  KEGG Compound (15,033 entries)
  KEGG Drug (6,834 entries)
  Synonyms and structures

DrugBank (http://www.drugbank.ca/)
  4,764 entries
  Synonyms and structure

Human Metabolome Database (HMDB) (http://www.hmdb.ca/)
  3,000 entries
  Many synonyms and structural information

SLIDE 11

Terminological Resources – Analysis

Number of entries in extracted dictionaries

[Figure: bar chart of the number of entries (logarithmic axis, 1,000 to 10,000,000) for PubChem, MeSH_T, ChEBI, DrugBank, HMDB, MeSH_C, KEGG]

SLIDE 12

Terminological Resources – Analysis

Are all synonyms included in PubChem?

[Figure: bar chart of the percentage of overlap with PubChem (axis up to 100 %) for PubChem, MeSH_T, ChEBI, DrugBank, HMDB, MeSH_C, KEGG]

Combining all analyzed dictionaries, 69 % of the synonyms are not from PubChem but from the other resources.
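An overlap analysis of this kind can be sketched with plain set operations. The function names, the lower-casing of names, and the toy dictionaries below are assumptions for illustration; the slides do not state how names were compared.

```python
def overlap_percentage(dictionary, reference):
    """Percentage of names in `dictionary` that also occur in `reference`."""
    names = {n.lower() for n in dictionary}
    ref = {n.lower() for n in reference}
    if not names:
        return 0.0
    return 100.0 * len(names & ref) / len(names)


def share_outside_reference(reference, others):
    """Percentage of the combined synonym set that is absent from `reference`."""
    ref = {n.lower() for n in reference}
    combined = set(ref)
    for d in others:
        combined |= {n.lower() for n in d}
    if not combined:
        return 0.0
    return 100.0 * len(combined - ref) / len(combined)


# Toy data, not taken from the real resources.
pubchem = {"aspirin", "ethanol"}
drugbank = {"Aspirin", "acetosal"}
kegg = {"ethanol", "colfarit", "acetosal"}

print(overlap_percentage(drugbank, pubchem))             # 50.0
print(share_outside_reference(pubchem, [drugbank, kegg]))  # 50.0
```

The second function corresponds to the kind of statement made on the slide: the share of all collected synonyms that would be lost if only the reference dictionary were used.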

SLIDE 13

Annotation of a Test Corpus

How to analyze the usability of the resources for text mining?
⇒ Test Corpus
⇒ Assumption: Some classes are easier to find in text than others.
⇒ Different classes, easy to annotate

SLIDE 14

Annotation of a Test Corpus – Classes

TRIVIAL
  Single-word terms (even if they are in fact IUPAC names)
  aspirin, estragon, testosterone, Acetylsalicylate

IUPAC(-like)
  Multi-word systematic names
  N-substituted-pyridino[2,3-f]indole-4,9-dione, 1-hexoxy-4-methyl-hexane, elaidic acid, 1,4-dihydronaphthoquinones

PART
  Partial chemical names (e.g. in enumerations)
  8-(methylthio)- and ..., 17beta-

SLIDE 15

Annotation of a Test Corpus – Classes

ABBREVIATION
  Abbreviations of names; not tagged separately when they are part of an IUPAC name
  TPA, AMPA

SUM
  Sum formulas
  CH3SNa, KOH

FAMILY
  Chemical families, not pharmacological/functional families (such as anti-inflammatory drug, chelator)
  disaccharide, pyrimidine, hydrazides

SLIDE 16

Annotation of a Test Corpus

Problems that occurred during annotation:
  Labels (H3-[chemical])
  Mixtures of abbreviations in long names
    (2R,10S)-N(1)-cyclopropylmethyl-2,10-dihydroxy-N(11)-ethylnorspermine abbreviated as (2R,10S)-(HO)(2)CPMENSPM
  Differentiation of family and trivial names

SLIDE 17

Annotation of a Test Corpus

MEDLINE abstracts
Two annotators worked independently on the corpus
Inter-annotator F1 measure, considering only the boundaries and not the classes:
  First run: 80 %
  After discussion and reannotation: 89 %
P. Corbett, C. Batchelor, S. Teufel (2007) published an inter-annotator F1 measure of 93 % for the annotation of a training corpus for Oscar
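A boundary-only inter-annotator F1 can be sketched as follows. Treating one annotator as the reference, requiring exact boundary matches, and representing entities as (start, end) character offsets are all assumptions here; the slides do not give the matching criterion.

```python
def boundary_f1(spans_a, spans_b):
    """F1 between two annotators, comparing only (start, end) boundaries.

    Annotator A is treated as the reference; classes are ignored.
    Only exact boundary matches count (an assumption of this sketch).
    """
    a, b = set(spans_a), set(spans_b)
    matches = len(a & b)
    if matches == 0:
        return 0.0
    precision = matches / len(b)
    recall = matches / len(a)
    return 2 * precision * recall / (precision + recall)


# Toy offsets: the annotators disagree on the end of the second entity.
annotator_a = [(0, 7), (12, 20), (30, 35)]
annotator_b = [(0, 7), (12, 18), (30, 35)]
print(boundary_f1(annotator_a, annotator_b))
```

With exact matching the measure is symmetric: swapping the annotators swaps precision and recall but leaves F1 unchanged.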

SLIDE 18

Annotation of a Test Corpus

Number of entities per class in the test corpus of 100 abstracts

[Figure: bar chart of the number of entities in the test corpus (axis up to 450) for IUPAC, PART, TRIVIAL, ABB., SUM, FAMILY]

⇒ Altogether: 1206 entities

SLIDE 19

Results on Test Corpus

Conditions of testing:
  No curation of raw dictionaries was done → no names removed, added or changed.
  Simple case-insensitive string search; dashes were ignored.
  No control of the correct association of the found names to the corresponding entry was performed.
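The testing conditions above can be sketched as a small lookup. The function names are hypothetical, and the reading of "dashes were ignored" as removing every "-" character before comparison is an assumption.

```python
def normalize_for_search(text):
    """Lower-case the text and drop dashes (assumed reading of the conditions)."""
    return text.lower().replace("-", "")


def dictionary_recall(entities, dictionary):
    """Fraction of annotated entities whose string exactly matches a dictionary name."""
    if not entities:
        return 0.0
    names = {normalize_for_search(n) for n in dictionary}
    found = sum(1 for e in entities if normalize_for_search(e) in names)
    return found / len(entities)


# Toy data: entity strings as they might be annotated, and a raw dictionary.
entities = ["Aspirin", "CH3SNa", "S-metazocine", "KOH"]
dictionary = ["aspirin", "koh", "smetazocine"]
print(dictionary_recall(entities, dictionary))  # 0.75
```

As on the slide, nothing checks that a matched name points to the correct entry, so this measures string coverage only.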

SLIDE 20

Results on Test Corpus

Exact match of synonyms in dictionary to test corpus

[Figure: bar chart of recall (axis up to 1) per class (IUPAC, PART, SUM, TRIV, ABB, FAM, All) for PubChem, ChEBI, MeSH_C, MeSH_T, HMDB, KEGG_C, KEGG_D, DrugBank, Combined]

SLIDE 21

Results on Test Corpus

Normalized with size of dictionaries:

[Figure: bar chart of normalized recall (axis up to 0.00014) per class (IUPAC, PART, SUM, TRIV, ABB, FAM, All) for PubChem, ChEBI, MeSH_C, MeSH_T, HMDB, KEGG_C, KEGG_D, DrugBank, Combined]
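One plausible reading of "normalized with size of dictionaries" is recall divided by the number of dictionary entries. That reading is an assumption, as are the function name and the example numbers below, which are not results from the talk.

```python
import math


def normalized_recall(recall, dictionary_size):
    """Recall per dictionary entry (assumed definition: recall / size)."""
    if dictionary_size <= 0:
        raise ValueError("dictionary_size must be positive")
    return recall / dictionary_size


# Hypothetical values: a small dictionary can score higher per entry
# than a large one even when its absolute recall is lower.
small = normalized_recall(0.30, 5_000)
large = normalized_recall(0.60, 1_000_000)
print(small > large)  # True
print(math.isclose(small, 6e-05))  # True
```

Under this definition, small curated resources are favoured, which matches the purpose of a size-normalized comparison.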

SLIDE 22

ML-based Detection of IUPAC-like Names

IUPAC entities are not found with a dictionary ⇒ machine-learning approach
  Training corpus annotation (463 abstracts, IA-F1 of 80 %)
  Conditional Random Field-based

Results
  On a test corpus annotated only with IUPAC (sampled from MEDLINE): 86.4 %
  On the presented test corpus, IUPAC and PART: F1 77.06 % (81.38 % precision, 73.18 % recall)
  On the presented test corpus, all entities: 91.41 % precision, 29.04 % recall

⇒ more at ISMB ’08 in Toronto
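The F1 value reported above is the harmonic mean of precision and recall, which can be checked with a few lines. The function name is hypothetical; the second figure is not stated in the talk and is derived here only by arithmetic from the reported precision and recall on all entities.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# IUPAC and PART on the presented test corpus: reproduces the reported F1.
print(round(f1_score(81.38, 73.18), 2))  # 77.06

# All entities: derived from the reported precision and recall.
print(round(f1_score(91.41, 29.04), 2))  # 44.08
```

The low derived value for all entities reflects that a model trained for IUPAC-like names misses most of the other classes, while staying precise on what it does tag.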

SLIDE 23

Conclusion & Summary

Resources for many mentions in text available
High recall possible for some classes
  → Combination of Resources necessary
  → NER System has to be developed on this approach

Not all classes can be found with string search
  → Combination of Approaches necessary
  → Combination of different classes for different approaches

Abbreviations? Mapping to long form in text.
References with Numbers in text?
Corpus availability: http://www.scai.fraunhofer.de/chem-corpora.html

SLIDE 24

Acknowledgements

SLIDE 25

roman.klinger@scai.fhg.de http://www.scai.fhg.de/klinger.html

Thank YOU for your attention! Questions? Remarks?