> > Overview Goals of TM How does TM work Components - - PowerPoint PPT Presentation

slide-2
SLIDE 2

22/01/08 Tamara Polajnar > BRC > U Glasgow 2

> Overview

  • Goals of TM
  • How does TM work
  • Components
  • History
  • Problems particular to biomedical texts
  • Examples
slide-3
SLIDE 3

> The Goal of Text Mining

The small G-Protein Ras is activated by many growth factor receptors and binds to the Raf-1 kinase with high affinity.

Action      Inter1    Inter2
activate    Ras       gfr
bind        ras       raf-1 kinase

slide-4
SLIDE 4

> How does TM work?

Text Mining:

  • Text collection
  • Text selection (IR, classification)
  • External knowledge
  • Information extraction
  • Knowledge management
  • Visualisation

slide-5
SLIDE 5

> How does TM work?

TM integrates:

  • Information retrieval, in order to retrieve a high proportion of relevant documents (high recall).
  • Text categorisation, for higher-precision document selection.
  • Named entity recognition, to identify relevant proteins, genes, cellular components, processes, etc.
  • Information extraction, to explain the relationships between the entities.
  • Knowledge management and visualisation, to store and access the results.

slide-6
SLIDE 6

> A Note on Evaluation

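precision = tp / (tp + fp)
recall = tp / (tp + fn)
F-measure = 2 * precision * recall / (precision + recall)

A – set of retrieved documents
B – set of relevant documents

Here tp (true positives) counts the documents in both A and B, fp (false positives) the documents in A only, and fn (false negatives) the documents in B only.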
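
The measures above can be sketched in a few lines of Python. This is a minimal illustration, assuming documents are identified by ids and that A and B are given as sets; the function name `evaluate` and the ids are hypothetical.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, f_measure) for two sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # retrieved and relevant
    fp = len(retrieved - relevant)  # retrieved but not relevant
    fn = len(relevant - retrieved)  # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


A = {"d1", "d2", "d3", "d4"}  # retrieved documents (hypothetical ids)
B = {"d2", "d3", "d5"}        # relevant documents (hypothetical ids)
precision, recall, f_measure = evaluate(A, B)
```

With these example sets two of the four retrieved documents are relevant, and two of the three relevant documents were found, so precision is lower than recall.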

slide-7
SLIDE 7

> Information Retrieval

  • IR is used to manage and access vast numbers of documents.
  • Many IR systems index documents and create dictionaries which associate words with documents.
  • Others also cluster documents based on topic.
  • Search engines retrieve many documents and rank them according to the query. Precision drops off as you look further down the ranked list.
  • IR systems in general have high recall but low precision.
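
The idea of a dictionary that associates words with documents can be sketched as a small inverted index. This is only an illustration, assuming whitespace tokenisation and lowercase matching; the function names and the three example documents are made up for the sketch and real IR systems add ranking, stemming, and much more.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {  # hypothetical mini-collection
    1: "Ras binds to the Raf-1 kinase",
    2: "Ras is activated by growth factor receptors",
    3: "the kinase is inhibited",
}
index = build_index(docs)
hits = search(index, "ras kinase")
```

Here `hits` contains only document 1, the single document that mentions both query words.
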
slide-8
SLIDE 8

> Medline

  • A database of citations and abstracts of articles published in major peer-reviewed journals since the 1950s.
  • Each entry contains bibliographic information.
  • Some entries contain abstracts, MeSH terms, citations...
  • It is available for download in XML format for text mining.
slide-9
SLIDE 9

> IR for Biomedical Texts

  • IR is key for biomedical research.
  • Search engines are the main way of looking for journal articles.
  • PubMed/Medline citations:
      > 2008: 16,880,015
      > 2007: 16,120,074
      > 2006: 15,433,668

slide-10
SLIDE 10

> Classification

  • Classification is used to put documents into two or more classes.
  • The classes are learned from examples, so manually compiled training data is usually needed.
  • Clustering can be done without training data, but the meaning of the clusters has to be determined afterwards.
  • Classification usually has much higher precision than IR while also maintaining high recall (depending on the data).

slide-11
SLIDE 11

> Information Extraction

  • Information extraction is a process by which knowledge contained in unstructured text is translated into predicate forms which can be used for reasoning.
  • IE usually involves several layers of processing, including named entity recognition, parsing, and pattern recognition.
  • In general, IE systems are manually engineered for particular problems.
  • IE usually has high precision, but may miss a lot of information that is presented in a form which was not observed before.
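
A toy illustration of hand-engineered, pattern-based extraction is given below. It is a sketch under strong assumptions: the two regular expressions, the simplified test sentences, and the name `extract` are hypothetical, and the argument order of the output follows the activate(ras, gfr) style used in these slides. Real IE systems layer named entity recognition and parsing underneath such patterns.

```python
import re

# Hand-written patterns, each paired with the predicate it produces.
PATTERNS = [
    (re.compile(r"(?P<a>[\w-]+) is activated by (?P<b>[\w-]+)"), "activate"),
    (re.compile(r"(?P<a>[\w-]+) binds to (?P<b>[\w-]+)"), "bind"),
]


def extract(sentence):
    """Return (predicate, arg1, arg2) tuples for every pattern match."""
    facts = []
    for pattern, predicate in PATTERNS:
        for match in pattern.finditer(sentence):
            facts.append((predicate, match.group("a"), match.group("b")))
    return facts


facts = extract("Ras is activated by GFR. Ras binds to Raf-1.")
missed = extract("GFR activates Ras.")  # active voice: no pattern covers it
```

The second call returns an empty list, which shows the trade-off described above: what the patterns match is usually right, but anything phrased in an unseen way is missed.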

slide-12
SLIDE 12

>

The small G-Protein Ras is activated by many growth factor receptors and binds to the Raf-1 kinase with high affinity.

activate(g-protein ras, gfr), bind(ras, raf-1 kinase)

[Figure: parse tree of the sentence, with subject, verb, and object marked]

Legend:

  • AP – adjective phrase
  • NP – noun phrase
  • PP – prepositional phrase
  • VP – verb phrase
  • Adj – adjective
  • Conj – conjunction
  • Det – determiner
  • N – noun
  • Prep – preposition
  • V – verb

The small [G-Protein] Ras is activated by many [growth factor] receptors and binds to the [Raf-1 kinase] with high affinity.

slide-13
SLIDE 13

> A Little History

  • The goal of natural language understanding has driven research into computational linguistics since the first computers.
  • In the 1940s the behaviorist theory prevailed. Early on it was shown that language can be closely approximated by counting the frequency (probability) of words and outputting words with the same frequencies, which supported this view.
  • In the 1950s this view was challenged by rationalists. Rationalists believe that human language is innate and only the syntactic rules of language are learned, whereas empiricists believe that human language is learned entirely through exposure.

slide-14
SLIDE 14

> Shannon

  • In his 1948 paper A Mathematical Theory of Communication, Claude Shannon showed that English can be approximated by statistical processes:
  • THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

From the Random Surrealism Shakespearian generator:

To debug a kilt, or not to debug a kilt, that is the question! Is this a roadworks sign which I see before me, the doorbell toward my lecturer? Come, let me charge headlong at thee.
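
Text like Shannon's example can be produced by a word-level model in which each word is chosen according to how often it followed the previous word in a training text. The sketch below is one simple way to do this, not Shannon's own procedure; the function names, the tiny training text, and the fixed random seed are assumptions made for the illustration.

```python
import random
from collections import defaultdict


def train_bigrams(text):
    """Record, for each word, the list of words observed directly after it."""
    words = text.split()
    followers = defaultdict(list)
    for current, following in zip(words, words[1:]):
        followers[current].append(following)
    return followers


def generate(followers, start, length, seed=0):
    """Generate up to `length` words, sampling each from the observed followers."""
    rng = random.Random(seed)
    word, output = start, [start]
    for _ in range(length - 1):
        options = followers.get(word)
        if not options:  # the word was never followed by anything
            break
        word = rng.choice(options)  # repeats in the list act as frequency weights
        output.append(word)
    return " ".join(output)


model = train_bigrams("to be or not to be that is the question")
sample = generate(model, "to", 8)
```

Every adjacent pair of words in `sample` occurred in the training text, which is why the output looks locally plausible even though it carries no overall meaning.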

slide-15
SLIDE 15

> Chomsky

  • In 1956 Noam Chomsky revolutionised the study of linguistics by formalising grammars and by organising languages in a hierarchical structure according to their capabilities. Chomsky's ideas influenced much of theoretical Computer Science, including automata theory and programming languages.
  • Chomsky postulated that human beings are born with an innate ability to process language and that different languages differ only in the syntactic surface structure, while the semantics of language, the deep structure, is common to all people.
  • The following sentences share the same deep structure:
      > I am reading this slide.
      > This slide is being read by me.

slide-16
SLIDE 16

> And Then...

  • Rationalists attempted to write down all the rules governing language, while empiricists concentrated on developing representative statistical models.
  • These approaches developed separately for several decades, each achieving individual success, but neither providing a comprehensive solution.
  • In the 90s the two approaches started coming together, resulting in statistical learning of rules and linguistic improvements to statistical methods. This made natural language processing much more practical.

slide-17
SLIDE 17

> NLP vs. Text Mining

  • Natural Language Processing is a research field in computing science, sometimes interchangeably called Computational Linguistics or Language Engineering.
  • It comprises all research involving computers and human languages, whether in speech or text form.
  • Text Mining is a combination of tools developed within NLP research which are aimed at extracting specific information from large textual corpora.
  • The term text mining is almost exclusively used for biological applications and is meant to allude to data mining.

slide-18
SLIDE 18

> Some Applications of NLP

  • Machine translation
  • Question answering
  • Information extraction
  • Speech generation
  • Speech processing
  • Dialogue systems
  • Author identification
  • Natural language querying
slide-19
SLIDE 19

> NLP for Biology

  • There are some specific challenges in biological texts which have interested the NLP community.
  • There is also a real need and use for the tools developed for biology, which is also driving the research in text mining.

slide-20
SLIDE 20

> NLP Tools in Text Mining

  • Tokenisation
  • Named entity recognition
  • Classification
  • Shallow Parsing
  • Full Parsing
slide-21
SLIDE 21

> Current Approaches to NLP

  • In general, most approaches are a combination of rule-based and stochastic methods.
  • Tokenisation, named entity recognition, and classification are generally done using statistical methods, but may employ dictionaries.
  • Parsing is usually done with rules, guided by the probability that each rule will occur.

slide-22
SLIDE 22

> Tokenisation

What is a word?

  • A sequence of letters, all lowercase, with first capital letter, or all capitals? mRNA, hnRNP
  • A sequence of letters? P26, RGL3, Nore1
  • A sequence of letters and numbers? Raf-1, JRE-IL6
  • Any sequence of symbols which ends with a space or a punctuation mark? H7-sensitive, PI3K-induced, 5'-triphosphatase
  • Usually each application will have its own definition of what a word is.
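
The effect of the chosen definition can be seen by running three different tokenisers over the same text. The sketch below is illustrative only: the three functions and their regular expressions are hypothetical definitions of a "word", not the rules of any particular tool.

```python
import re


def whitespace_tokens(text):
    """A word is anything between spaces (punctuation stays attached)."""
    return text.split()


def letter_tokens(text):
    """A word is an unbroken run of letters."""
    return re.findall(r"[A-Za-z]+", text)


def biomedical_tokens(text):
    """A word is letters/digits, optionally joined by inner hyphens or primes."""
    return re.findall(r"[A-Za-z0-9]+(?:['-]+[A-Za-z0-9]+)*", text)


text = "PI3K-induced Raf-1 binds 5'-triphosphatase, mRNA."
by_space = whitespace_tokens(text)
by_letters = letter_tokens(text)
by_biomed = biomedical_tokens(text)
```

The letters-only definition breaks PI3K-induced into PI, K, and induced and loses the 1 in Raf-1, whereas the third definition keeps PI3K-induced, Raf-1, and 5'-triphosphatase whole while still dropping the trailing comma and full stop.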

slide-23
SLIDE 23

> An Aside about Zipf

  • There are very few words which are used very often, e.g. the, a, as, by, that, those, what, between, on, and, etc.
  • These words carry very little information but make up most of the text, while the longer words which contain more information occur less often.
  • George Kingsley Zipf proposed that this is because the speaker prefers to use a lot of simple words, as that makes speaking easier.
  • On the other hand, the listener prefers precise words which convey the meaning succinctly, as that makes the intake of information easier.

slide-24
SLIDE 24

> Example: Zipf

Word frequencies from Shakespeare's Hamlet:

The, 1101; And, 898; To, 726; Of, 657; I, 561; You, 544; My, 508; A, 498; In, 414; It, 414; That, 389; Is, 334; Not, 315; This, 296; His, 292; But, 265; With, 257; For, 247; Your, 242; Me, 235; As, 228; Be, 226; Lord, 218; He, 216; What, 203; So, 197; Him, 189; Have, 179; Will, 169; Do, 150; No, 143; We, 140; Are, 131; On, 125; O, 121; Our, 119; By, 116; Shall, 114; If, 113; Or, 112; All, 110; Good, 109; Come, 104; Thou, 103; Now, 97; From, 95; More, 95; They, 95; Let, 94; How, 88; Thy, 87; Her, 86; At, 84; Was, 83; Most, 82; Like, 80; Would, 80; Hamlet, 78; Well, 78; There, 76; Know, 74; Sir, 74; Them, 74; May, 71.... .. . Assay'd, 1; Associates, 1; Assure, 1; Astonish, 1; Asunder, 1; As'es, 1; Attend, 1; Attended, 1; Attends, 1; Attent, 1; Attractive, 1; Attribute, 1; Audit, 1; Augury, 1; Aunt-Mother, 1; Auspicious, 1; Authorities, 1; Avouch, 1; Awry, 1; Aye, 1; Babe, 1; Backed, 1; Backward, 1; Bade, 1; Bait, 1; Baker's, 1; Ban, 1; Bands, 1; Baptista, 1; Bar, 1; Barber's, 1; Bare, 1; Barefaced, 1; Barefoot, 1; Bark, 1; Bark'd, 1; Barren, 1; Barr'd, 1; Baseness, 1; Bastard, 1; Bat, 1; Bated, 1; Battalions, 1; Batten, 1; Battery, 1; Battlements, 1; Bawd, 1; Bawdry, 1; Bawds, 1; Bawdy, 1; Be-Netted, 1; Beam, 1; Beards, 1; Bear't, 1; Beaten, 1; Beats, 1; Beauteous, 1; Beautied, 1; Beauties, 1; Beaver, 1; Became ...
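
A frequency list of this kind can be produced with a few lines of Python. The sketch below is not how the Hamlet counts above were computed; the tokenising regular expression, the function name, and the one-line example text are assumptions. Zipf's law is usually stated as a word's frequency being roughly inversely proportional to its rank in such a list.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercased words; apostrophes are kept inside words (e.g. bark'd)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


text = "To be, or not to be, that is the question"
freq = word_frequencies(text)
ranked = freq.most_common()  # (word, count) pairs, most frequent first
```

Even in this one line the pattern is visible: "to" and "be" occur twice, while every other word occurs once.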

slide-25
SLIDE 25

> Why Zipf?

  • In order to save space, many search engines and NLP programs disregard function words.
  • This causes a particular problem when tailoring applications for biological purposes.
  • Many proteins/genes are given names that are common words.
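
The problem can be made concrete with a small stop-word filter. This is a sketch only: the stop-word list and the token list are hypothetical, although AND and for are real gene names that also appear elsewhere in these slides.

```python
# A small, hypothetical stop-word list of common function words.
STOP_WORDS = {"the", "a", "and", "for", "can", "by", "on", "of", "in", "is"}


def remove_stop_words(tokens):
    """Drop every token whose lowercased form is a stop word."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


tokens = ["expression", "of", "AND", "and", "for", "in", "Drosophila"]
filtered = remove_stop_words(tokens)
```

Only "expression" and "Drosophila" survive, so the gene names AND and for are discarded along with the genuine function words. Making the filter case-sensitive would rescue AND, but not for, which is written in lowercase.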

slide-26
SLIDE 26

> Named Entity Recognition

  • Named entity recognition is used to find all proper nouns in text.
  • In news it is usually used to identify places, people, corporations, etc.
  • In biology it is used to recognise protein and gene names, drugs, diseases, snippets of DNA, etc.

slide-27
SLIDE 27

> NER for Biology

  • While in news NER is mostly a solved problem, biological texts have proved more difficult.
  • Biological names:
      > Share names with common English words.
      > New names are constantly being invented.
      > Names and acronyms are reused.
      > There are very few naming conventions, and they are not always followed.
      > Names are often written differently; the orthography changes.

slide-29
SLIDE 29

> Difficult Named Entities

  • AND – short for ANDANTE, a gene from Arabidopsis thaliana
  • And – Androcam, a gene from Drosophila melanogaster
  • jim – a gene in Drosophila melanogaster
  • FOR1 - floral organ regulator, a protein from Oryza sativa
  • for – short for foraging, a gene from Drosophila melanogaster
  • MAD - mothers against dpp, a gene from Drosophila melanogaster
  • can – cannonball, a gene from Drosophila melanogaster
  • MAPK, MAPKK, MAPKKK – mitogen activated protein kinase, kinase kinase, kinase kinase kinase
  • ASP – abnormal spindle protein, antisense promoter, ankylosing spondylitis

slide-30
SLIDE 30

> Why NER?

  • After entities are identified, they are evaluated with respect to each other and the surrounding text.
  • For example, if one were interested in corporate acquisitions, then after the companies were identified in the text the IE system should determine which company is acquiring the other, who the key players are, and when this is happening.
  • In biology the relationships are:
      > Relationships between proteins: binding, upregulation, activation, phosphorylation
      > Relationships between genes or genes and proteins
      > Drugs and diseases or drugs and target parts of cells
slide-31
SLIDE 31

> What Next?

  • Natural language parsing is used to generate a tree structure describing a sentence according to the grammatical rules of the language the sentence is written in.
  • Shallow parsing tries to loosely find constituent noun and verb phrases without worrying about deeper structure.
  • Full parsing considers the sentence as a whole, extracting the main components and their modifiers.
  • Full parsing is more time consuming than shallow parsing, both during the design and execution stages.

slide-32
SLIDE 32

> Shallow vs. Full Parsing

For the sentence:

The small G-Protein Ras is activated by many growth factor receptors and binds to the Raf-1 kinase with high affinity.

The shallow parse may yield (depending on the algorithm):

  • Subject noun phrase: The small G-Protein Ras
  • Verb phrase: is activated, binds
  • Object noun phrase: many growth factor receptors (gfr), the Raf-1 kinase

i.e. activate(g-protein ras, gfr), bind(ras, raf-1 kinase)

Whereas full parsing would yield:

activate(mod(small, mod(g-protein, noun(ras))), mod(many, gfr)) & mod(high affinity, bind(mod(small, mod(g-protein, noun(ras))), typeof(kinase, raf-1)))

slide-33
SLIDE 33

> The Big Problems in TM

  • Access!
  • Communication!
  • Tables and figures
  • Extraction over several sentences
  • Measurements and results
  • Quality
slide-34
SLIDE 34

> Research vs. Product

  • Bio NLP contains interesting challenges.
  • Many efforts do not extend beyond trying to engineer a new solution for a particular situation.
  • Research usually leads to the development of tools which can be used to create a product.
  • There are very few commercial TM products, and they still rely on manual curation.

slide-35
SLIDE 35

> GENIA – Tsujii Lab

  • MEDIE - retrieves abstracts or sentences from MEDLINE using deep-parsing results.

  • TerMine – biomedical term highlighter
  • Genia Corpus
  • http://www-tsujii.is.s.u-tokyo.ac.jp/
slide-37
SLIDE 37

> EBI

  • EBI Med – searches for co-occurrence between terms from key categories
  • Whatizit - allows you to highlight terms from different databases in text
  • http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp

slide-39
SLIDE 39

> GeneWays

  • GeneWays was developed at Columbia University. It processes full-text papers in order to extract knowledge about protein-protein interactions and store it in a searchable knowledge base.

phosphorylated Cbl coprecipitated with CrkL, which was constitutively associated with the C3G

[action, attach, [protein, Cbl, [state, phosphorylated]], [protein, CrkL, [action, attach, [protein, CrkL], [protein, C3G]]]]

  • Precision is 95% and recall is 65%. The system can be accessed through an internet gateway where a user can browse the knowledge base or upload an article to be analysed.

slide-41
SLIDE 41

> MedScan

  • It does full parsing and also relies heavily on hand-assembled data such as dictionaries.
  • It has 34% recall and 90% precision; however, the authors estimate that this could be improved with a larger lexicon.
  • The system allows the user to query PubMed and then parses the retrieved abstracts.

Novichkova, S.; Egorov, S. & Daraselia, N. MedScan, a natural language processing engine for MEDLINE abstracts. Bioinformatics, 2003, 19, 1699-1706

slide-43
SLIDE 43

> Conclusions

  • Biological natural language processing is a vibrant field.
  • Many of the algorithms which were successfully designed for news text don't perform quite as well in the biological domain, so new and more general algorithms are needed.
  • The main problems encountered are a lack of training resources (e.g. labeled text, free full-text articles, dictionaries), unstandardised naming schemes, and a lack of crossover knowledge between biologists and computer scientists.

  • This was just a brief introduction with many omissions.
slide-44
SLIDE 44

> FTHK 2008

Finding the Hidden Knowledge: Text mining for biology and medicine 21-22 February 2008, Glasgow http://www.bioinformatics-scotland.org/txt_mining/

slide-45
SLIDE 45

> FTHK 2008 Speakers

  • Sophia Ananiadou, National Center for Text Mining, University of Manchester
  • Doug Armstrong, Edinburgh Centre for Bioinformatics, University of Edinburgh
  • Wendy Bickmore, MRC Human Genetics Unit, Western General Hospital, Edinburgh
  • Ted Briscoe, Computer Laboratory, University of Cambridge
  • Elizabeth Fairley, ITI Life Sciences
  • William Hayes, Biogen Idec
  • Lawrence Hunter, University of Colorado Denver School of Medicine
  • Peter Jackson, Thomson Corporation
  • Mark Liberman, Linguistic Data Consortium / University of Pennsylvania
  • Tim Miller, Thomson Scientific
  • John Pestian, Cincinnati Children's Hospital