SLIDE 1

>

SLIDE 2

22/01/08 Tamara Polajnar > BRC > U Glasgow 2

> Overview

Goals of TM
How does TM work
Components
History
Problems particular to biomedical texts
Examples

SLIDE 3

> The Goal of Text Mining

The small G-Protein Ras is activated by many growth factor receptors and binds to the Raf-1 kinase with high affinity.

Action     Inter1   Inter2
activate   Ras      gfr
bind       ras      raf-1 kinase

SLIDE 4

> How does TM work?

[Diagram: the text mining process, with the components Text collection, Text selection (IR, classification), External knowledge, Information extraction, Knowledge management, and Visualisation.]

SLIDE 5

> How does TM work?

TM integrates:

Information Retrieval in order to retrieve a high proportion of relevant documents (high recall).
Text categorisation for a higher precision document selection.
Named entity recognition to identify relevant proteins, genes, cellular components, processes, etc.
Information extraction to explain the relationships between the entities.
Knowledge management and visualisation to store and access the results.

SLIDE 6

> A Note on Evaluation

precision = tp/(tp+fp)
recall = tp/(tp+fn)
F-measure = 2*precision*recall/(precision+recall)

A – Set of retrieved documents
B – Set of relevant documents

In terms of these sets, tp = |A ∩ B|, fp = |A \ B|, and fn = |B \ A|.
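
The measures on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `evaluate` and the document ids are made up, and A and B are modelled as plain Python sets.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, f_measure) for two sets of document ids."""
    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: A = retrieved documents, B = relevant documents.
A = {"d1", "d2", "d3", "d4"}
B = {"d3", "d4", "d5"}
precision, recall, f_measure = evaluate(A, B)
print(precision, recall, f_measure)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were retrieved (recall 2/3).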

SLIDE 7

> Information Retrieval

IR is used to manage and access vast numbers of documents.
Many IR systems index documents and create dictionaries which associate words with documents.
Others also cluster documents based on topic.
Search engines retrieve many documents and rank them according to the query. Precision drops off as you look further down the ranked list.

IR systems in general have high recall but low precision.
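
As a rough illustration of a dictionary that associates words with documents, here is a minimal inverted index in Python. The three documents and the function names `build_index` and `search` are invented for this sketch, and it does no ranking, which a real search engine would add.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical mini-collection.
docs = {
    1: "Ras binds Raf-1 kinase",
    2: "growth factor receptors activate Ras",
    3: "Raf-1 kinase signalling",
}
index = build_index(docs)
print(search(index, "ras"))
print(search(index, "raf-1 kinase"))
```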

SLIDE 8

> Medline

A database of citations and abstracts of articles published in major peer-reviewed journals since the 1950s.
Each entry contains bibliographic information.
Some entries contain abstracts, MeSH terms, citations...
It is available for download in XML format for text mining.

SLIDE 9

> IR for Biomedical Texts

IR is key for biomedical research
Search engines are the main way of looking for journal articles
PubMed/Medline citations:
  2008: 16,880,015
  2007: 16,120,074
  2006: 15,433,668

SLIDE 10

> Classification

Classification is used to put documents into two or more classes.
The classes are learned from examples, so manually compiled training data is usually needed.
Clustering can be done without training data, but the meaning of the clusters has to be determined afterwards.
Classification usually has much higher precision than IR while also maintaining high recall (depending on the data).

SLIDE 11

> Information Extraction

Information extraction is a process by which knowledge contained in unstructured text is translated into predicate forms which can be used for reasoning.
IE usually involves several layers of processing including: named entity recognition, parsing, and pattern recognition.
In general, IE systems are manually engineered for particular problems.
IE is usually high precision, but may miss a lot of information which is presented in a format which was not observed before.
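
A toy sketch of hand-engineered extraction patterns, assuming entities are single tokens of letters, digits and hyphens. The two patterns, the `extract` function and the example sentences are made up for illustration; a real IE system would run named entity recognition and parsing first.

```python
import re

# Assumption: an entity is one token made of letters, digits and hyphens.
ENTITY = r"([A-Za-z0-9-]+)"

# Hand-written patterns, each mapped to the predicate it produces.
PATTERNS = [
    ("activate", re.compile(ENTITY + r" is activated by " + ENTITY)),
    ("bind", re.compile(ENTITY + r" binds to " + ENTITY)),
]


def extract(text):
    """Return (predicate, arg1, arg2) tuples for every pattern match."""
    facts = []
    for predicate, pattern in PATTERNS:
        for match in pattern.finditer(text):
            facts.append((predicate, match.group(1), match.group(2)))
    return facts


seen = extract("Ras is activated by GFR. Ras binds to Raf-1.")
missed = extract("Raf-1 is bound by Ras.")
print(seen)
print(missed)
```

The second call returns nothing: the passive phrasing was not anticipated by either pattern, which is the kind of miss that keeps recall low for engineered systems.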

SLIDE 12

>

The small G-Protein Ras is activated by many growth factor receptors and binds to the Raf-1 kinase with high affinity.

Named entities: The small [G-Protein] Ras is activated by many [growth factor] receptors and binds to the [Raf-1 kinase] with high affinity.

[Figure: parse tree of the sentence, with the subject, verbs and objects marked.]

Legend:
AP – adjective phrase
NP – noun phrase
PP – prepositional phrase
VP – verb phrase
Adj – adjective
Conj – conjunction
Det – determiner
N – noun
Prep – preposition
V – verb

Result: activate(g-protein ras, gfr), bind(ras, raf-1 kinase)

SLIDE 13

> A Little History

The goal of natural language understanding has driven research into computational linguistics since the first computers.
In the 1940s the behaviorist theory prevailed. Early on it was shown that language can be closely approximated by counting frequency/probability of words and by outputting the words with like frequency, which supported this view.
In the 1950s this view was challenged by rationalists. Rationalists believe that human language is innate and only the syntactic rules of language are learned, whereas empiricists believe that human language is learned entirely through exposure.

SLIDE 14

> Shannon

In his 1948 paper A Mathematical Theory of Communication, Claude Shannon showed that English can be approximated by statistical processes:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

From the Random Surrealism Shakespearian generator:

To debug a kilt, or not to debug a kilt, that is the question! Is this a roadworks sign which I see before me, the doorbell toward my lecturer? Come, let me charge headlong at thee.
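
One simple statistical process of this kind picks each next word according to which words followed the current word in some training text. The Python sketch below is an assumption-laden illustration, not a reproduction of Shannon's procedure: the tiny corpus, the function names and the fixed random seed are all made up.

```python
import random
from collections import defaultdict


def train_bigrams(text):
    """Record, for every word, the list of words that follow it in the text."""
    words = text.split()
    followers = defaultdict(list)
    for current, following in zip(words, words[1:]):
        followers[current].append(following)
    return followers


def generate(followers, start, length, seed=0):
    """Generate up to `length` words by repeatedly sampling a follower."""
    rng = random.Random(seed)
    output = [start]
    # Stop early if the last word never had a follower in the training text.
    while len(output) < length and followers.get(output[-1]):
        output.append(rng.choice(followers[output[-1]]))
    return output


# Hypothetical training text; a real model would use a large corpus.
corpus = "to be or not to be that is the question"
model = train_bigrams(corpus)
sample = generate(model, "to", 8)
print(" ".join(sample))
```

Because followers are stored with repetition, frequent word pairs are sampled more often, so the output mirrors the pair frequencies of the training text.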

SLIDE 15

> Chomsky

In 1956 Noam Chomsky revolutionised the study of linguistics by formalising grammars and by organising languages in a hierarchical structure according to their capabilities. Chomsky's ideas influenced much of theoretical Computer Science including automata theory and programming languages.
Chomsky postulated that human beings are born with an innate ability to process language and that different languages only differ by the syntactic surface structure while the semantics of language, the deep structure, is common to all people.
The following sentences share the same deep structure:
I am reading this slide.
This slide is being read by me.

SLIDE 16

> And Then...

Rationalists attempted to write down all the rules governing language while empiricists concentrated on developing representative statistical models.
These approaches developed separately for several decades, each achieving individual success, but neither providing a comprehensive solution.
In the 90s the two approaches started coming together, resulting in statistical learning of rules and linguistic improvements to statistical methods. This made natural language processing much more practical.

SLIDE 17

> NLP vs. Text Mining

Natural Language Processing is a research field in computing science sometimes interchangeably called Computational Linguistics or Language Engineering.
It comprises all research involving computers and human spoken languages, whether in speech or text form.
Text Mining is a combination of tools developed within NLP research which are aimed at extracting specific information from large textual corpora.
The term text mining is almost exclusively used for biological applications and is meant to allude to data mining.

SLIDE 18

> Some Applications of NLP

Machine translation
Question answering
Information extraction
Speech generation
Speech processing
Dialogue systems
Author identification
Natural language querying

SLIDE 19

> NLP for Biology

There are some specific challenges in biological texts which have interested the NLP community.
There is also a real need and use for the tools developed for biology, which is also driving the research in text mining.

SLIDE 20

> NLP Tools in Text Mining

Tokenisation
Named entity recognition
Classification
Shallow Parsing
Full Parsing

SLIDE 21

> Current Approaches to NLP

In general most approaches are a combination of rule-based and stochastic methods.
Tokenisation, named entity recognition, and classification are generally done using statistical methods, but may employ dictionaries.
Parsing is usually a combination of rules guided by probabilities that the rule will occur.

SLIDE 22

> Tokenisation

What is a word?

A sequence of letters, all lowercase, with first capital letter, or all capitals? mRNA, hnRNP
A sequence of letters? P26, RGL3, Nore1
A sequence of letters and numbers? Raf-1, JRE-IL6
Any sequence of symbols which ends with a space or a punctuation mark? H7-sensitive, PI3K-induced, 5'-triphosphatase
Usually each application will have its own definition for what a word is.
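
To see how much the definition matters, the Python sketch below applies three candidate definitions of a word to the same text. The function names and the example sentence are made up, and the regular expressions are only one possible reading of each definition.

```python
import re


def letters_only(text):
    """A word is a run of letters."""
    return re.findall(r"[A-Za-z]+", text)


def letters_and_digits(text):
    """A word is a run of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def up_to_space(text):
    """A word is anything between spaces, with edge punctuation trimmed."""
    tokens = (token.strip(".,;:!?()") for token in text.split())
    return [token for token in tokens if token]


# Hypothetical sentence using names from the slide.
sentence = "PI3K-induced Raf-1 binds mRNA."
print(letters_only(sentence))
print(letters_and_digits(sentence))
print(up_to_space(sentence))
```

Only the last definition keeps PI3K-induced and Raf-1 intact; the first one even splits PI3K into two fragments.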

SLIDE 23

> An Aside about Zipf

There are very few words which are used very often, e.g. the, a, as, by, that, those, what, between, on, and, etc.
These words carry very little information but make up most of the text, while the longer words which contain more information occur less often.
George Kingsley Zipf proposed this is due to the fact that the speaker prefers to use a lot of simple words because that makes speaking easier.
On the other hand, the listener prefers precise words which convey the meaning succinctly as that makes intake of information easier.

SLIDE 24

> Example: Zipf

Word frequencies from Shakespeare's Hamlet:

The, 1101; And, 898; To, 726; Of, 657; I, 561; You, 544; My, 508; A, 498; In, 414; It, 414; That, 389; Is, 334; Not, 315; This, 296; His, 292; But, 265; With, 257; For, 247; Your, 242; Me, 235; As, 228; Be, 226; Lord, 218; He, 216; What, 203; So, 197; Him, 189; Have, 179; Will, 169; Do, 150; No, 143; We, 140; Are, 131; On, 125; O, 121; Our, 119; By, 116; Shall, 114; If, 113; Or, 112; All, 110; Good, 109; Come, 104; Thou, 103; Now, 97; From, 95; More, 95; They, 95; Let, 94; How, 88; Thy, 87; Her, 86; At, 84; Was, 83; Most, 82; Like, 80; Would, 80; Hamlet, 78; Well, 78; There, 76; Know, 74; Sir, 74; Them, 74; May, 71; ...

... Assay'd, 1; Associates, 1; Assure, 1; Astonish, 1; Asunder, 1; As'es, 1; Attend, 1; Attended, 1; Attends, 1; Attent, 1; Attractive, 1; Attribute, 1; Audit, 1; Augury, 1; Aunt-Mother, 1; Auspicious, 1; Authorities, 1; Avouch, 1; Awry, 1; Aye, 1; Babe, 1; Backed, 1; Backward, 1; Bade, 1; Bait, 1; Baker's, 1; Ban, 1; Bands, 1; Baptista, 1; Bar, 1; Barber's, 1; Bare, 1; Barefaced, 1; Barefoot, 1; Bark, 1; Bark'd, 1; Barren, 1; Barr'd, 1; Baseness, 1; Bastard, 1; Bat, 1; Bated, 1; Battalions, 1; Batten, 1; Battery, 1; Battlements, 1; Bawd, 1; Bawdry, 1; Bawds, 1; Bawdy, 1; Be-Netted, 1; Beam, 1; Beards, 1; Bear't, 1; Beaten, 1; Beats, 1; Beauteous, 1; Beautied, 1; Beauties, 1; Beaver, 1; Became ...
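
A frequency list like this one can be produced with a few lines of Python. The sketch below is illustrative only: it uses a single line of text rather than the full play, and its definition of a word (lowercased runs of letters and apostrophes) is an assumption, so it would not necessarily reproduce the counts above.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercased words, treating letters and apostrophes as word characters."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Stand-in text; for the slide's figures the whole play would be used.
sample_text = "To be, or not to be, that is the question"
freqs = word_frequencies(sample_text)

# Most frequent words first, as in the list above.
for word, count in freqs.most_common(3):
    print(word, count)
```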

SLIDE 25

> Why Zipf?

In order to save space many search engines and NLP programs disregard function words.
This causes a particular problem when tailoring applications for biological purposes.
Many proteins/genes are given names that are common words.
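
The problem can be shown with a small Python sketch. The stop word list and the sentence are invented for illustration; the gene names for and can are taken from the Difficult Named Entities slide of this deck.

```python
# Hypothetical stop word list of common function words.
STOP_WORDS = {"the", "a", "as", "by", "that", "on", "and", "for", "can"}


def remove_stop_words(tokens):
    """Drop every token whose lowercase form is in the stop word list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


# Made-up sentence mentioning the Drosophila genes "for" and "can".
tokens = "we studied the for gene and the can gene".split()
kept = remove_stop_words(tokens)
print(kept)
```

Both gene names disappear together with the real function words, so a later named entity recognition step never sees them.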

SLIDE 26

> Named Entity Recognition

Named entity recognition is used to find all proper nouns in text.
In news it is usually used to identify places, people, corporations, etc.
In biology it is used to recognise protein and gene names, drugs, diseases, snippets of DNA, etc.

SLIDE 27

> NER for Biology

While in news NER is mostly a solved problem, biological texts have proved more difficult.
Biological names:
  Names are shared with common English words.
  New names are constantly being invented.
  Names and acronyms are reused.
  There are very few naming conventions, and they aren't always followed.
  Names are often typed differently; the orthography changes.

SLIDE 29

> Difficult Named Entities

AND – short for ANDANTE, a gene from Arabidopsis thaliana
And – Androcam, a gene from Drosophila melanogaster
jim – a gene in Drosophila melanogaster
FOR1 – floral organ regulator, a protein from Oryza sativa
for – short for foraging, a gene from Drosophila melanogaster
MAD – mothers against dpp, a gene from Drosophila melanogaster
can – cannonball, a gene from Drosophila melanogaster
MAPK, MAPKK, MAPKKK – mitogen activated protein kinase, kinase, kinase...
ASP – abnormal spindle protein, antisense promoter, ankylosing spondylitis

SLIDE 30

> Why NER?

After entities are identified, they are evaluated with respect to each other and the surrounding text.
For example, if one was interested in corporate acquisitions, then after the companies were identified in text the IE system should determine which company is acquiring the other, who the key players are, and when this is happening.
In biology the relationships are:
Relationships between proteins: binding, upregulation, activation, phosphorylation
Relationships between genes or genes and proteins
Drugs and diseases or drugs and target parts of cells

SLIDE 31

> What Next?

Natural language parsing is used to generate a tree structure describing a sentence according to the grammatical rules of the language the sentence is written in.
Shallow parsing tries to loosely find constituent noun and verb phrases without worrying about deeper structure.
Full parsing considers the sentence as a whole, extracting the main components and their modifiers.
Full parsing is more time consuming than shallow parsing both during the design and execution stages.

SLIDE 32

> Shallow vs. Full Parsing

For the sentence:
The small G-Protein Ras is activated by many growth factor receptors and binds to the Raf-1 kinase with high affinity.

The shallow parse may yield (depending on the algorithm):
Subject noun phrase: The small G-Protein Ras
Verb phrase: is activated, binds
Object noun phrase: many growth factor receptors (gfr), the Raf-1 kinase
i.e. activate(g-protein ras, gfr), bind(ras, raf-1 kinase)

Whereas full parsing would yield:
activate(mod(small, mod(g-protein, noun(ras))), mod(many, gfr)) & mod(high affinity, bind(mod(small, mod(g-protein, noun(ras))), typeof(kinase, raf-1)))

SLIDE 33

> The Big Problems in TM

Access!
Communication!
Tables and figures
Extraction over several sentences
Measurements and results
Quality

SLIDE 34

> Research vs. Product

Bio NLP contains interesting challenges
Many do not extend beyond trying to engineer a new solution for a particular situation
Research usually leads to development of tools which can be used to create a product
There are very few commercial TM products, and they still rely on manual curation

SLIDE 35

> GENIA – Tsujii Lab

MEDIE – Retrieve abstracts or sentences in MEDLINE using deep-parsing results.

TerMine – biomedical term highlighter
Genia Corpus
http://www-tsujii.is.s.u-tokyo.ac.jp/

SLIDE 37

> EBI

EBI Med – searches for co-occurrence between terms from key categories
Whatizit – allows you to highlight terms from different databases in text
http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp

SLIDE 39

> GeneWays

GeneWays was developed at Columbia University. It processes full-text papers in order to extract knowledge about protein-protein interactions and store it in a searchable knowledge base.

phosphorylated Cbl coprecipitated with CrkL, which was constitutively associated with the C3G
[action, attach, [protein, Cbl, [state, phosphorylated]], [protein, CrkL, [action, attach, [protein, CrkL], [protein, C3G]]]]

Precision is 95% and recall is 65%. The system can be accessed through an internet gateway where a user can browse the knowledge base or upload an article to be analysed.

SLIDE 41

> MedScan

MedScan does full parsing and also relies heavily on hand-assembled data such as dictionaries.
It has 34% recall and 90% precision; however, its authors estimate this could be improved with a larger lexicon.
The system allows the user to query PubMed and then parses the retrieved abstracts.

Novichkova, S.; Egorov, S. & Daraselia, N. MedScan, a natural language processing engine for MEDLINE abstracts. Bioinformatics, 2003, 19, 1699-1706

SLIDE 43

> Conclusions

Biological natural language processing is a vibrant field.
Many of the algorithms which were successfully designed for news text don't perform quite as well in the biological domain, thus new and more general algorithms are needed.
The main problems which are encountered are lack of training resources (i.e. labeled text, free full text articles, dictionaries, etc.), unstandardised naming schemes, and lack of crossover knowledge between biologists and computer scientists.

This was just a brief introduction with many omissions.

SLIDE 44

> FTHK 2008

Finding the Hidden Knowledge: Text mining for biology and medicine
21-22 February 2008, Glasgow
http://www.bioinformatics-scotland.org/txt_mining/

SLIDE 45

> FTHK 2008 Speakers

Sophia Ananiadou, National Center for Text Mining, University of Manchester
Doug Armstrong, Edinburgh Centre for Bioinformatics, University of Edinburgh
Wendy Bickmore, MRC Human Genetics Unit, Western General Hospital, Edinburgh
Ted Briscoe, Computer Laboratory, University of Cambridge
Elizabeth Fairley, ITI Life Sciences
William Hayes, Biogen Idec
Lawrence Hunter, University of Colorado Denver School of Medicine
Peter Jackson, Thomson Corporation
Mark Liberman, Linguistic Data Consortium / University of Pennsylvania
Tim Miller, Thomson Scientific
John Pestian, Cincinnati Children's Hospital