22/01/08 Tamara Polajnar > BRC > U Glasgow 2
> Overview
- Goals of TM
- How does TM work
- Components
- History
- Problems particular to biomedical texts
- Examples
> The Goal of Text Mining
The small G-Protein Ras is activated by many growth factor receptors and binds to the Raf-1 kinase with high affinity.

Action | Inter1 | Inter2
activate | Ras | gfr
bind | ras | raf-1 kinase
> How does TM work?
[Figure: the text mining pipeline. Stages: text collection; text selection (IR, classification); information extraction; knowledge management; visualisation. External knowledge is an additional input.]
> How does TM work?
TM integrates:
- Information retrieval, in order to retrieve a high proportion of relevant documents (high recall).
- Text categorisation, for higher-precision document selection.
- Named entity recognition, to identify relevant proteins, genes, cellular components, processes, etc.
- Information extraction, to explain the relationships between the entities.
- Knowledge management and visualisation, to store and access the results.
> A Note on Evaluation
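precision = tp / (tp + fp)
recall = tp / (tp + fn)
F-measure = 2 * precision * recall / (precision + recall)
A – set of retrieved documents
B – set of relevant documents
tp = |A ∩ B| (retrieved and relevant), fp = |A \ B| (retrieved but not relevant), fn = |B \ A| (relevant but not retrieved)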
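A minimal Python sketch of these measures, treating A as the set of retrieved documents and B as the set of relevant documents. The function name `evaluate` and the document IDs are made up for illustration.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, f_measure) for two sets of document IDs."""
    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


A = {"d1", "d2", "d3", "d4"}   # hypothetical retrieved documents
B = {"d2", "d4", "d5"}         # hypothetical relevant documents
precision, recall, f_measure = evaluate(A, B)
print(precision, recall, f_measure)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were retrieved (recall 2/3).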
> Information Retrieval
- IR is used to manage and access vast numbers of documents.
- Many IR systems index documents and create dictionaries
which associate words with documents.
- Others also cluster documents based on topic.
- Search engines retrieve many documents and rank them
according to the query. Precision drops off as you look further down the ranked list.
- IR systems in general have high recall but low precision.
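The word-to-document dictionary described above is often called an inverted index. The following is a rough sketch only, assuming whitespace tokenisation and a toy ranking by number of matching query words; the documents and function names are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query words they contain."""
    scores = defaultdict(int)
    for word in set(query.lower().split()):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "d1": "Ras binds to the Raf-1 kinase",
    "d2": "growth factor receptors activate Ras",
    "d3": "the kinase cascade",
}
index = build_index(docs)
ranked = search(index, "Ras kinase")
print(ranked)
```

Every document that mentions either query word is returned (high recall), but only the top-ranked one mentions both, which illustrates why precision falls further down the list.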
> Medline
- A database of citations and abstracts of articles published in major peer-reviewed journals since the 1950s.
- Each entry contains bibliographic information.
- Some entries contain abstracts, MeSH terms, citations...
- It is available for download in XML format for text mining.
> IR for Biomedical Texts
- IR is key for biomedical research
- Search engines are the main way of looking for
journal articles
- PubMed/Medline citations:
  - 2008: 16,880,015
  - 2007: 16,120,074
  - 2006: 15,433,668
> Classification
- Classification is used to put documents into two or more
classes.
- The classes are learned from examples, so usually manually
compiled training data is needed.
- Clustering can be done without training data, but the meaning of the clusters has to be determined afterwards.
- Classification usually has much higher precision than IR while
also maintaining high recall (depending on the data).
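To make "learned from examples" concrete, here is a small sketch of a two-class document classifier. It uses a naive Bayes model with add-one smoothing, which is one common choice and not necessarily the method intended on this slide; the training texts and labels are invented.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label). Returns per-class word and document counts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts


def classify(text, word_counts, class_counts):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Manually compiled (here: invented) training data.
training = [
    ("ras binds raf-1 kinase", "relevant"),
    ("receptor activates ras protein", "relevant"),
    ("conference registration deadline", "irrelevant"),
    ("journal subscription prices", "irrelevant"),
]
model = train(training)
label = classify("kinase binds protein", *model)
print(label)
```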
> Information Extraction
- Information extraction is a process by which knowledge contained in unstructured text is translated into predicate forms which can be used for reasoning.
- IE usually involves several layers of processing, including named entity recognition, parsing, and pattern recognition.
- In general, IE systems are manually engineered for particular problems.
- IE usually has high precision, but may miss a lot of information which is presented in a form that was not observed before.
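A toy sketch of pattern-based extraction, assuming the entities have already been marked with square brackets by a named entity recogniser. The two hand-written patterns and the function name `extract` are illustrative only, not a real system.

```python
import re

ENTITY = r"\[([^\]]+)\]"   # an entity already bracketed by an NER step

# Hand-engineered patterns: (predicate name, compiled regular expression).
PATTERNS = [
    ("activate", re.compile(ENTITY + r" is activated by " + ENTITY)),
    ("bind", re.compile(ENTITY + r" binds to " + ENTITY)),
]


def extract(text):
    """Return (predicate, first entity, second entity) for every pattern match."""
    facts = []
    for predicate, regex in PATTERNS:
        for match in regex.finditer(text):
            facts.append((predicate, match.group(1), match.group(2)))
    return facts


seen = "[Ras] is activated by [growth factor receptors]. [Ras] binds to [Raf-1 kinase]."
unseen = "[Raf-1 kinase] is bound by [Ras]."
print(extract(seen))
print(extract(unseen))
```

The first text yields both predicates; the second states a binding relationship in a form the patterns have never seen, so nothing is extracted. This is the precision-versus-coverage trade-off noted in the last bullet.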
>
The small G-Protein Ras is activated by many growth factor receptors and binds to the Raf-1 kinase with high affinity.
activate(g-protein ras, gfr), bind(ras, raf-1 kinase)
[Figure: parse tree of the sentence, with the subject, verb, and object constituents labelled]
Legend: AP – adjective phrase; NP – noun phrase; PP – prepositional phrase; VP – verb phrase; Adj – adjective; Conj – conjunction; Det – determiner; N – noun; Prep – preposition; V – verb
The small [G-Protein] Ras is activated by many [growth factor] receptors and binds to the [Raf-1 kinase] with high affinity.
> A Little History
- The goal of natural language understanding has driven
research into computational linguistics since the first computers.
- In the 1940s behaviorist theory prevailed. Early on it was shown that language can be closely approximated by counting the frequencies (probabilities) of words and outputting words with matching frequencies, which supported this view.
- In the 1950s this view was challenged by rationalists.
Rationalists believe that human language is innate and only the syntactic rules of language are learned, whereas empiricists believe that human language is learned entirely through exposure.
> Shannon
- In his 1948 paper A Mathematical Theory of Communication, Claude Shannon showed that English can be approximated by statistical processes:
  - THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
- From the Random Surrealism Shakespearian generator:
  - To debug a kilt, or not to debug a kilt, that is the question! Is this a roadworks sign which I see before me, the doorbell toward my lecturer? Come, let me charge headlong at thee.
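A statistical approximation of the kind Shannon described can be sketched as a word-pair (bigram) model: each word is chosen at random from the words that followed the previous word in some source text. This is an illustrative sketch, not Shannon's original procedure; the tiny corpus, the function names, and the fixed random seed are assumptions.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Map each word to the list of words observed directly after it."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def generate(followers, start, length, seed=0):
    """Generate up to `length` words by repeatedly sampling a follower."""
    rng = random.Random(seed)
    output = [start]
    while len(output) < length:
        candidates = followers.get(output[-1])
        if not candidates:          # dead end: the word was never followed
            break
        output.append(rng.choice(candidates))
    return output


corpus = "to be or not to be that is the question"
model = build_bigram_model(corpus)
sample = generate(model, "to", 8)
print(" ".join(sample))
```

Because repeated followers appear several times in each list, frequent continuations are chosen proportionally more often.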
> Chomsky
- In 1956 Noam Chomsky revolutionised the study of linguistics by formalising grammars and by organising languages into a hierarchy according to their capabilities. Chomsky's ideas influenced much of theoretical computer science, including automata theory and programming languages.
- Chomsky postulated that human beings are born with an innate ability to process language, and that different languages differ only in their syntactic surface structure, while the semantics of language, the deep structure, is common to all people.
- The following sentences share the same deep structure:
  - I am reading this slide.
  - This slide is being read by me.
> And Then...
- Rationalists attempted to write down all the rules governing
language while empiricists concentrated on developing representative statistical models.
- These approaches developed separately for several decades
each achieving individual success, but neither providing a comprehensive solution.
- In the 1990s the two approaches started coming together, resulting in statistical learning of rules and linguistic improvements to statistical methods. This made natural language processing much more practical.
> NLP vs. Text Mining
- Natural Language Processing is a research field in computing science, sometimes interchangeably called Computational Linguistics or Language Engineering.
- It comprises all research involving computers and human languages, whether in spoken or written form.
- Text Mining is a combination of tools developed within NLP research which are aimed at extracting specific information from large textual corpora.
- The term text mining is almost exclusively used for biological applications and is meant to allude to data mining.
> Some Applications of NLP
- Machine translation
- Question answering
- Information extraction
- Speech generation
- Speech processing
- Dialogue systems
- Author identification
- Natural language querying
> NLP for Biology
- There are some specific challenges in biological
texts which have interested the NLP community.
- There is also a real need and use for the tools developed for biology, which is also driving the research in text mining.
> NLP Tools in Text Mining
- Tokenisation
- Named entity recognition
- Classification
- Shallow Parsing
- Full Parsing
> Current Approaches to NLP
- In general, most approaches are a combination of rule-based and stochastic methods.
- Tokenisation, named entity recognition, and classification are generally done using statistical methods, but may employ dictionaries.
- Parsing usually combines rules with probabilities that each rule will apply.
> Tokenisation
What is a word?
- A sequence of letters that is all lowercase, has a capital first letter, or is all capitals? Counterexamples: mRNA, hnRNP
- A sequence of letters? Counterexamples: P26, RGL3, Nore1
- A sequence of letters and numbers? Counterexamples: Raf-1, JRE-IL6
- Any sequence of symbols which ends with a space or a punctuation mark? Counterexamples: H7-sensitive, PI3K-induced, 5'-triphosphatase
- Usually each application will have its own definition of what a word is.
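The effect of each candidate definition can be seen by applying it to the same piece of text. The sketch below uses Python regular expressions; the three definitions are simplified readings of the questions above, and the example sentence is invented.

```python
import re

text = "PI3K-induced Raf-1 binds mRNA."

# Definition 1: a word is a run of letters only.
letters_only = re.findall(r"[A-Za-z]+", text)
# → ['PI', 'K', 'induced', 'Raf', 'binds', 'mRNA']

# Definition 2: a word is a run of letters and digits.
alphanumeric = re.findall(r"[A-Za-z0-9]+", text)
# → ['PI3K', 'induced', 'Raf', '1', 'binds', 'mRNA']

# Definition 3: split on whitespace, then strip trailing punctuation.
whitespace = [token.rstrip(".,;:!?") for token in text.split()]
# → ['PI3K-induced', 'Raf-1', 'binds', 'mRNA']

print(letters_only, alphanumeric, whitespace, sep="\n")
```

Only the third definition keeps Raf-1 and PI3K-induced intact, at the cost of treating every hyphenated compound as a single token.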
> An Aside about Zipf
- There are very few words which are used very often, e.g. the, a, as, by, that, those, what, between, on, and, etc.
- These words carry very little information but make up most of the text, while the longer words which contain more information occur less often.
- George Kingsley Zipf proposed that this is because the speaker prefers to use a lot of simple words, which makes speaking easier.
- On the other hand, the listener prefers precise words which convey the meaning succinctly, as that makes intake of information easier.
> Example: Zipf
Word frequencies from Shakespeare's Hamlet:
The, 1101; And, 898; To, 726; Of, 657; I, 561; You, 544; My, 508; A, 498; In, 414; It, 414; That, 389; Is, 334; Not, 315; This, 296; His, 292; But, 265; With, 257; For, 247; Your, 242; Me, 235; As, 228; Be, 226; Lord, 218; He, 216; What, 203; So, 197; Him, 189; Have, 179; Will, 169; Do, 150; No, 143; We, 140; Are, 131; On, 125; O, 121; Our, 119; By, 116; Shall, 114; If, 113; Or, 112; All, 110; Good, 109; Come, 104; Thou, 103; Now, 97; From, 95; More, 95; They, 95; Let, 94; How, 88; Thy, 87; Her, 86; At, 84; Was, 83; Most, 82; Like, 80; Would, 80; Hamlet, 78; Well, 78; There, 76; Know, 74; Sir, 74; Them, 74; May, 71.... .. . Assay'd, 1; Associates, 1; Assure, 1; Astonish, 1; Asunder, 1; As'es, 1; Attend, 1; Attended, 1; Attends, 1; Attent, 1; Attractive, 1; Attribute, 1; Audit, 1; Augury, 1; Aunt-Mother, 1; Auspicious, 1; Authorities, 1; Avouch, 1; Awry, 1; Aye, 1; Babe, 1; Backed, 1; Backward, 1; Bade, 1; Bait, 1; Baker's, 1; Ban, 1; Bands, 1; Baptista, 1; Bar, 1; Barber's, 1; Bare, 1; Barefaced, 1; Barefoot, 1; Bark, 1; Bark'd, 1; Barren, 1; Barr'd, 1; Baseness, 1; Bastard, 1; Bat, 1; Bated, 1; Battalions, 1; Batten, 1; Battery, 1; Battlements, 1; Bawd, 1; Bawdry, 1; Bawds, 1; Bawdy, 1; Be-Netted, 1; Beam, 1; Beards, 1; Bear't, 1; Beaten, 1; Beats, 1; Beauteous, 1; Beautied, 1; Beauties, 1; Beaver, 1; Became ...
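A frequency table like this can be produced with a few lines of Python. This is a minimal sketch: the regular expression used to define a word is an assumption, and the short sample string stands in for the full text of the play.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (word, count) pairs, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


sample = "to be or not to be that is the question"
ranked = word_frequencies(sample)
for word, count in ranked:
    print(word, count)
```

Even in this one line, the two function words "to" and "be" lead the list while every other word occurs once.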
> Why Zipf?
- In order to save space many search engines
and NLP programs disregard function words.
- This causes a particular problem when tailoring
applications for biological purposes.
- Many proteins and genes are named after common words.
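The problem can be shown with a toy stop-word filter. The stop-word list and the example phrase are made up for illustration; "for" is the Drosophila gene foraging.

```python
# Illustrative stop-word list; real systems use longer lists.
STOP_WORDS = {"and", "for", "can", "the", "is", "a", "of", "in", "to"}


def remove_stop_words(tokens):
    """Drop every token whose lowercased form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "expression of for in the Drosophila larva".split()
kept = remove_stop_words(tokens)
print(kept)
```

The gene name "for" is discarded along with the genuine function words, so a search for that gene can no longer match this text.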
> Named Entity Recognition
- Named entity recognition is used to find all
proper nouns in text.
- In news it is usually used to identify places,
people, corporations, etc.
- In biology it is used to recognise protein and
gene names, drugs, diseases, snippets of DNA, etc.
> NER for Biology
- While in news NER is mostly a solved problem, biological texts
have proved more difficult.
- Biological names:
  - Share names with common English words.
  - New names are constantly being invented.
  - Names and acronyms are reused.
  - There are very few naming conventions, and they are not always followed.
  - Names are often written differently; the orthography varies.
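One common way to cope with varying orthography is to normalise names before looking them up in a dictionary. The sketch below is an assumption-laden illustration: the lexicon is a three-entry toy, the normalisation rule (lowercase, drop everything except letters and digits) is only one possible choice, and multi-word names are not handled.

```python
import re


def normalise(name):
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Toy lexicon mapping normalised forms to a canonical spelling.
LEXICON = {normalise(n): n for n in ["Raf-1", "MAPK", "Nore1"]}


def find_entities(tokens):
    """Return (token, canonical name) for every token found in the lexicon."""
    return [
        (token, LEXICON[normalise(token)])
        for token in tokens
        if normalise(token) in LEXICON
    ]


found = find_entities(["RAF1", "activates", "mapk"])
print(found)
```

Normalisation trades precision for recall: the more aggressively spellings are merged, the more likely it is that unrelated names, or common English words, collide with a dictionary entry.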
> Difficult Named Entities
- AND – short for ANDANTE, a gene from Arabidopsis thaliana
- And – Androcam, a gene from Drosophila melanogaster
- jim – a gene in Drosophila melanogaster
- FOR1 – floral organ regulator, a protein from Oryza sativa
- for – short for foraging, a gene from Drosophila melanogaster
- MAD – mothers against dpp, a gene from Drosophila melanogaster
- can – cannonball, a gene from Drosophila melanogaster
- MAPK, MAPKK, MAPKKK – mitogen-activated protein kinase, kinase kinase, kinase kinase kinase
- ASP – abnormal spindle protein, antisense promoter, ankylosing spondylitis
> Why NER?
- After entities are identified, they are evaluated with respect to each other and the surrounding text.
- For example, if one is interested in corporate acquisitions, then once the companies have been identified in the text, the IE system should determine which company is acquiring the other, who the key players are, and when this is happening.
- In biology the relationships are:
  - Relationships between proteins: binding, upregulation, activation, phosphorylation
  - Relationships between genes, or between genes and proteins
  - Relationships between drugs and diseases, or between drugs and target parts of cells
> What Next?
- Natural language parsing is used to generate a tree structure
describing a sentence according to the grammatical rules of the language the sentence is written in.
- Shallow parsing tries to loosely find constituent noun and verb
phrases without worrying about deeper structure.
- Full parsing considers the sentence as a whole extracting the
main components and their modifiers.
- Full parsing is more time-consuming than shallow parsing, during both the design and execution stages.
> Shallow vs. Full Parsing
For the sentence:
The small G-Protein Ras is activated by many growth factor receptors and binds to the Raf-1 kinase with high affinity.
The shallow parse may yield (depending on the algorithm):
- Subject noun phrase: The small G-Protein Ras
- Verb phrase: is activated, binds
- Object noun phrase: many growth factor receptors (gfr), the Raf-1 kinase
i.e. activate(g-protein ras, gfr), bind(ras, raf-1 kinase)
Whereas full parsing would yield:
activate(mod(small, mod(g-protein, noun(ras))), mod(many, gfr)) & mod(high affinity, bind(mod(small, mod(g-protein, noun(ras))), typeof(kinase, raf-1)))
> The Big Problems in TM
- Access!
- Communication!
- Tables and figures
- Extraction over several sentences
- Measurements and results
- Quality
> Research vs. Product
- Bio NLP contains interesting challenges.
- Many projects do not extend beyond engineering a new solution for a particular situation.
- Research usually leads to the development of tools which can be used to create a product.
- There are very few commercial TM products, and they still rely on manual curation.
> GENIA – Tsujii Lab
- MEDIE - Retrieve abstracts or sentences in
MEDLINE using deep-parsing results.
- TerMine – biomedical term highlighter
- Genia Corpus
- http://www-tsujii.is.s.u-tokyo.ac.jp/
> EBI
- EBI Med – searches for co-occurrence between
terms from key categories
- Whatizit - allows you to highlight terms from
different databases in text
- http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
> GeneWays
- GeneWays was developed at Columbia University. It processes full-text papers in order to extract knowledge about protein-protein interactions and store it in a searchable knowledge base.
- Example: "phosphorylated Cbl coprecipitated with CrkL, which was constitutively associated with the C3G" yields
  [action, attach, [protein, Cbl, [state, phosphorylated]], [protein, CrkL, [action, attach, [protein, CrkL], [protein, C3G]]]]
- Precision is 95% and recall is 65%. The system can be accessed through an internet gateway, where a user can browse the knowledge base or upload an article to be analysed.
> MedScan
- MedScan does full parsing and also relies heavily on hand-assembled data such as dictionaries.
- It has 34% recall and 90% precision; however, the authors estimate that this could be improved with a larger lexicon.
- The system allows the user to query PubMed and then parses the retrieved abstracts.
Novichkova, S.; Egorov, S. & Daraselia, N. MedScan, a natural language processing engine for MEDLINE abstracts. Bioinformatics, 2003, 19, 1699-1706
> Conclusions
- Biological natural language processing is a vibrant field.
- Many of the algorithms which were successfully designed for news text don't perform quite as well in the biological domain, thus new and more general algorithms are needed.
- The main problems encountered are a lack of training resources (e.g. labeled text, free full-text articles, dictionaries), unstandardised naming schemes, and a lack of crossover knowledge between biologists and computer scientists.
- This was just a brief introduction with many omissions.
> FTHK 2008
Finding the Hidden Knowledge: Text mining for biology and medicine
21-22 February 2008, Glasgow
http://www.bioinformatics-scotland.org/txt_mining/
> FTHK 2008 Speakers
- Sophia Ananiadou, National Center for
Text Mining, University of Manchester
- Doug Armstrong, Edinburgh Centre for
Bioinformatics, University of Edinburgh
- Wendy Bickmore, MRC Human
Genetics Unit, Western General Hospital, Edinburgh
- Ted Briscoe, Computer Laboratory,
University of Cambridge
- Elizabeth Fairley, ITI Life Sciences
- William Hayes, Biogen Idec
- Lawrence Hunter, University of
Colorado Denver School of Medicine
- Peter Jackson, Thomson Corporation
- Mark Liberman, Linguistic Data
Consortium / University of Pennsylvania
- Tim Miller, Thomson Scientific
- John Pestian, Cincinnati Children's
Hospital