BioMinT: Biological Text Mining
EU FP5 Quality of Life Project
Dr. Dipl.-Ing. Alexander K. Seewald
Österreichisches Forschungsinstitut für Artificial Intelligence

Motivation
“Economic and business pressures are forcing drug companies to deploy computing, but there are still gaps between what users want and what can be achieved.” (Peter Rees - Scientific Computing World - Jul/Aug 2003)
“To be honest I don’t really understand why you can’t buy more [off-the-shelf bioinformatics software].” (Jim Fickett, global director bioinformatics, AstraZeneca - Scientific Computing World, Jul/Aug 2003)
“What might help is if the [bioinformatics] manufacturers have the scientists’ needs in mind.” (Michael Man, Pfizer - Genome Technology, Jan 2003)
Alexander K. Seewald, alex@seewald.at / alex.seewald.at

Background
The current frontier is biological text mining: finding research papers, extracting topics, ranking by relevance, extracting metabolic pathways...
• Still in its infancy
• Biology is a hard domain for general text mining
• Chronic lack of large training corpora
• "Access is a bigger problem than algorithms"
So, we concentrate on a small user group with clear requirements and address these issues.

BioMinT: Biological Text Mining
Research project funded by the EU (2003 – 2005)
• Develop a generic text mining tool for content-based and knowledge-intensive information retrieval and extraction
• To be applied to the annotation of the Swiss-Prot and PRINTS proteomics databases with information mined from scientific papers, and to generate human-readable reports
• Adapted to the needs of biological researchers in general and specifically for SwissProt / PRINTS annotation
= In-silico research / curator assistant
www.biomint.org

BioMinT Partners
• University of Manchester (UK), School of Biological Sciences – PRINTS and PRECIS providers
• Swiss Institute of Bioinformatics – SwissProt providers and users
• University of Antwerp (Belgium) – Language technology providers
• Österreichisches Forschungsinstitut für AI (ÖFAI, Austria) – Information extraction/retrieval providers
• University of Geneva (Switzerland) – Information extraction/retrieval providers
• PharmaDM (Belgium) – Relational data mining technology, architecture

Information Retrieval / Query Expansion
A semantic meta-query engine, built around the legacy search engines of servers such as PubMed, that operates in two steps:
1) Expansion of the initial query with synonyms or related terms derived either from domain ontologies or from existing database entries.
2) Filtering and ranking of the documents retrieved from these servers using task-specific heuristics.
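The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synonym table, the `Document` type, the function names and the mention-count ranking heuristic are all hypothetical, and the call to the remote search engine is left out entirely.

```python
from dataclasses import dataclass

# Hypothetical synonym table; the slide says such terms come from
# domain ontologies or existing database entries.
SYNONYMS = {
    "p53": ["TP53", "tumor protein p53"],
}


def expand_query(term, synonyms):
    """Step 1: OR together the term and its known synonyms."""
    names = [term] + synonyms.get(term, [])
    # Quote each name so multi-word synonyms stay together as phrases.
    return " OR ".join(f'"{name}"' for name in names)


@dataclass
class Document:
    doc_id: str
    text: str


def rank_documents(docs, term, synonyms, min_hits=1):
    """Step 2: filter and rank with a toy heuristic.

    The score is the number of case-insensitive substring hits for the
    term and its synonyms; this stands in for the task-specific
    heuristics, which the slides do not spell out.
    """
    names = [term] + synonyms.get(term, [])
    scored = []
    for doc in docs:
        lowered = doc.text.lower()
        hits = sum(lowered.count(name.lower()) for name in names)
        if hits >= min_hits:  # filtering
            scored.append((hits, doc.doc_id))
    # Ranking: most hits first, ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

In this sketch the expanded string would be sent to the search server, and `rank_documents` would then be applied to whatever abstracts come back.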

Query Expansion: Synonym DB
Download all 14 databases according to SIB (+ SwissProt)
Extract all relevant fields from each DB separately
Create all pairs of synonyms (noting source DB, field, ID)
7,652,510 pairs of synonyms; 737,040 unique names
[Figure: bar chart per source database (LocusLink, SwissProt, FlyBase, GDB, HUGO, MGD, OMIM, RGD, RatMap, SGD, TAIR, WormBase, SubtiList, EcoGen); series: No. Entries, Unique; y-axis from 0 to 3,250,000]
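The pair-creation step might look roughly like the following. The record layout (`db`, `field`, `id`, `names`) and the sample entries are invented for illustration; the real database dumps and field names are not given in the slides.

```python
from itertools import combinations


def synonym_pairs(records):
    """Yield (name_a, name_b, source_db, field, entry_id) tuples.

    Every unordered pair of distinct names found in the same field of
    the same entry is treated as a synonym pair, and its provenance is
    kept alongside it.
    """
    for rec in records:
        names = sorted(set(rec["names"]))  # drop duplicates, fix the order
        for name_a, name_b in combinations(names, 2):
            yield (name_a, name_b, rec["db"], rec["field"], rec["id"])


# Made-up entries from two hypothetical source databases.
records = [
    {"db": "DB_A", "field": "gene_name", "id": "A:1",
     "names": ["abc1", "ABC-1", "abcA"]},
    {"db": "DB_B", "field": "synonym", "id": "B:7",
     "names": ["abc1", "xyz"]},
]

pairs = list(synonym_pairs(records))
unique_names = {name for a, b, *_ in pairs for name in (a, b)}
```

An entry with n names contributes n(n-1)/2 pairs, which is one plausible reason the pair count on the slide is so much larger than the number of unique names.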

Named Entity Recognition…
Positive-only comparison allows us to recognize…
• Competitive performance of KeX & Yapex with sloppy comparison
• Overlong matches of KeX

All DEs   Yapex         KeX           GAPSCORE
Strict    0.202±0.401   0.097±0.296   0.192±0.394
PNP       0.606±0.423   0.529±0.374   0.629±0.414
Sloppy    0.732±0.443   0.775±0.420   0.761±0.427

Recent work
• Competitive performance of GAPSCORE vs. Yapex
• Ensemble of all approaches improves on best single system
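The slides do not define the comparison modes. As a hedged illustration, the sketch below assumes the common reading in which a strict match requires identical span boundaries and a sloppy match requires only some overlap; PNP is left out because its definition is not given here. Spans are hypothetical half-open character offsets `(start, end)`.

```python
def strict_match(pred, gold):
    """Assumed strict criterion: both boundaries must agree exactly."""
    return pred == gold


def sloppy_match(pred, gold):
    """Assumed sloppy criterion: the two half-open spans overlap."""
    return pred[0] < gold[1] and gold[0] < pred[1]


gold = (10, 20)       # annotated protein name
exact = (10, 20)      # tagger output with correct boundaries
overlong = (5, 25)    # tagger output that swallows surrounding words
adjacent = (20, 25)   # touches the gold span but shares no character

results = {
    "exact": (strict_match(exact, gold), sloppy_match(exact, gold)),
    "overlong": (strict_match(overlong, gold), sloppy_match(overlong, gold)),
    "adjacent": (strict_match(adjacent, gold), sloppy_match(adjacent, gold)),
}
```

Under these assumptions an overlong match fails the strict test but passes the sloppy one, which would be consistent with a tagger scoring low on the Strict row and high on the Sloppy row.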

Learning Large Training Corpora…
Learning approaches on top 20 species
• 75.5% Human domain expert
• 79.6% Mapping MeSH terms to species
• 88.9% JRip rule learner, 172 rules
• 89.3% Support vector machine (SMO)
Conclusion
• Domain experts are good at creating precise rules, but bad at managing the precision/recall trade-off
• JRip is good at managing the trade-off, but yields worse precision, offset by better recall.
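To make the MeSH-mapping baseline concrete, here is a minimal sketch that assumes the task is to assign one species to each paper from its MeSH terms. The mapping table, the first-match rule and the four labelled examples are all made up for illustration and are unrelated to the figures above.

```python
# Hypothetical lookup from MeSH headings to species names.
MESH_TO_SPECIES = {
    "Humans": "Homo sapiens",
    "Mice": "Mus musculus",
    "Rats": "Rattus norvegicus",
}


def predict_species(mesh_terms, mapping, default=None):
    """Return the species of the first MeSH term found in the mapping."""
    for term in mesh_terms:
        if term in mapping:
            return mapping[term]
    return default


def accuracy(examples, mapping):
    """Fraction of (mesh_terms, true_species) examples predicted correctly."""
    correct = sum(
        predict_species(terms, mapping) == label for terms, label in examples
    )
    return correct / len(examples)


# Toy labelled papers (invented).
examples = [
    (["Humans", "Proteins"], "Homo sapiens"),
    (["Mice", "Humans"], "Homo sapiens"),   # first-match rule picks the wrong term
    (["Animals"], "Rattus norvegicus"),     # no mapped term at all
    (["Rats"], "Rattus norvegicus"),
]

baseline_accuracy = accuracy(examples, MESH_TO_SPECIES)
```

A fixed lookup like this has no way to trade precision against recall, which is the kind of limitation a learned rule set or an SVM can address.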

Related Research
Textpresso: Question answering
• Small domain with simple nomenclature (C. elegans)
• Corpus of 2,700 full-text papers and 16,000 abstracts
• Open source, freely available search: www.textpresso.org
QUOSA: Query, Organize, Share, Analyze
• Commercial product, launched late 2002
• Establishes local paper collection by downloading
• Prioritizes full-text papers during search
• Available to hundreds of researchers in two US hospitals

Future Work
• Generating better PubMed queries
• Filtering and ranking documents
• User-interface improvements
• Bootstrap human-generated corpora
• Beat (or join) competition

Acknowledgments
• Terry Attwood, Alex Mitchell, Paul Bradley, Peter Bracken (University of Manchester)
• Luc Dehaspe, Andre Vandecandelaere, Kristof van Belleghem (PharmaDM)
• Johann Petrak (ÖFAI)
• Anne-Lise Veuthey, Violaine Pillet, Marc Zehnder, Pavel Dobrokhotov (SIB)
• Walter Daelemans, Frederik Durant, Fien De Meulder (CNTS, University of Antwerp)
• Melanie Hilario, Jee-Hyub Kim (University of Geneva)
