Programming in Python Lecture 8: Python Online Michael Schroeder - - PowerPoint PPT Presentation



slide-1
SLIDE 1


Programming in Python

Michael Schroeder Melissa Adasme

Lecture 8: Python Online

slide-2
SLIDE 2

Motivation: Access to Web Resources

Wildcards possible? Can I filter somewhere? Can I combine two different searches?

In most cases NO, since Web GUIs are simplified access points to the data!

slide-3
SLIDE 3

Solution: Programmatic Access

(Use programmatic access via power user gateways)

  • Wildcards / search for substrings
  • Filtering by selected properties
  • Combination of different criteria

Example query (URL):

https://www.ebi.ac.uk/chembl/api/data/molecule?molecule_properties__mw_freebase__lte=300&pref_name__iendswith=nib

Schema of ChEMBL data https://www.ebi.ac.uk/chembl/api/data/molecule/schema
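
The example query does not have to be typed by hand. The following is a minimal sketch of how it could be assembled with `urlencode` from the Python standard library. The variable names are illustrative, the two filter names are copied from the URL above, and the comments give our reading of those filters rather than a statement from the ChEMBL documentation.

```python
from urllib.parse import urlencode

# Base URL of the ChEMBL molecule resource, as used in the example query.
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data/molecule"

# Filters copied from the example query (interpretation is our reading):
filters = {
    "molecule_properties__mw_freebase__lte": 300,  # molecular weight at most 300
    "pref_name__iendswith": "nib",                 # preferred name ends with "nib"
}

# urlencode joins the pairs with "&" and escapes special characters.
url = BASE_URL + "?" + urlencode(filters)
print(url)
```

Keeping the filters in a dictionary makes it easy to add or remove a criterion without editing the URL string.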

slide-4
SLIDE 4

HTTP/REST

  • HTTP (Hypertext Transfer Protocol) is a protocol for the web
  • specifies how data can be transferred between machines in a network
  • defines several methods, e.g. GET, POST and DELETE
  • REST (Representational State Transfer) describes how the architecture of HTTP can/should be used as a uniform interface
  • REST or REST-like structures are available in many web service APIs
  • Usually defined by URL (web address) and HTTP method (action on that address)

URL: http://biowebsitexyz.com/pug/proteins
  GET     List all proteins
  POST    Create new protein entry (data is sent separately to the server, which creates the new URL)

URL: http://biowebsitexyz.com/pug/proteins/p21
  GET     Get the data for protein 21
  DELETE  Delete entry for protein 21 on the server
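
To make the pairing of URL and HTTP method concrete, here is a small sketch using `urllib.request.Request`. It only builds request objects and never sends them, because biowebsitexyz.com is the placeholder service from the slide and not a real server. The payload `b"name=p22"` is a made-up example of data sent along with a POST.

```python
from urllib.request import Request

# Placeholder service from the slide; nothing is sent over the network here.
BASE = "http://biowebsitexyz.com/pug/proteins"

list_all = Request(BASE)                              # no data, no method given: GET
create = Request(BASE, data=b"name=p22")              # data attached: POST (payload is hypothetical)
get_one = Request(BASE + "/p21")                      # GET on a single entry
delete_one = Request(BASE + "/p21", method="DELETE")  # method set explicitly

for req in (list_all, create, get_one, delete_one):
    print(req.get_method(), req.full_url)
```

The same URL appears with different methods, which is the point of the scheme above: the address names the resource, the method names the action.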

slide-5
SLIDE 5

Where can I use it?

Biological databases and services:

  • UniProt (Sequences)
  • ENRICHR (Ontology Enrichment)
  • PubMed (Literature)
  • PubChem, ChEMBL (chemical structures)
  • PDB (Structures)
  • etc.

Non-biological databases and services, etc.

slide-6
SLIDE 6

Constructing Queries

Just the base URL for the service:
http://biowebsitexyz.com/pug/proteins
  GET     List all proteins

Simple filter:
http://biowebsitexyz.com/pug/proteins?num_aa_gte=100
  GET     List all proteins with at least 100 amino acids

Multiple criteria:
http://biowebsitexyz.com/pug/proteins?num_aa_gte=100&organism=homo_sapiens
  GET     List all human proteins with at least 100 amino acids

We will focus on GET queries since you will mostly just need to read data from servers.
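
The three cases above (base URL, simple filter, multiple criteria) follow one pattern, which can be sketched as a small helper. The function name `build_url` is our own, and the service URL and filter names are the placeholders from the slide.

```python
from urllib.parse import urlencode

# Placeholder service from the slide, not a real server.
BASE_URL = "http://biowebsitexyz.com/pug/proteins"


def build_url(base_url, filters=None):
    """Return the base URL, followed by '?key=value&...' if filters are given."""
    if not filters:
        return base_url
    return base_url + "?" + urlencode(filters)


print(build_url(BASE_URL))
print(build_url(BASE_URL, {"num_aa_gte": 100}))
print(build_url(BASE_URL, {"num_aa_gte": 100, "organism": "homo_sapiens"}))
```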

slide-7
SLIDE 7

Revision: XML files

■ We can store any data in XML, the eXtensible Mark-up Language, e.g. Medline
■ Logical data organisation: yes, XML schema, which is enforced
■ Physical data organisation: None, we cannot optimise retrieval for common queries
■ Hierarchical organization
■ Commonly used as an exchange format for data

    <Article>
      <Journal>
        <ISSN>0270-7306</ISSN>
        <JournalIssue>
          <Volume>19</Volume>
          <Issue>11</Issue>
          <PubDate>
            <Year>1999</Year>
            <Month>Nov</Month>
          </PubDate>
        </JournalIssue>
      </Journal>
      <ArticleTitle>Differential regulation of the cell wall integrity mitogen-activated protein kinase pathway in budding yeast by the protein tyrosine phosphatases Ptp2 and Ptp3.</ArticleTitle>
      <Pagination>
        <MedlinePgn>7651-60</MedlinePgn>
      </Pagination>
      <Abstract>
        <AbstractText>Mitogen-activated protein kinases (MAPKs) are inactivated by dual-specificity and protein tyrosine phosphatases (PTPs) in yeasts. In Saccharomyces cerevisiae, two PTPs, Ptp2 and Ptp3, inactivate the MAPKs, Hog1 and Fus3, with different specificities...</AbstractText>
      </Abstract>
      <Affiliation>Department of Chemistry, University of Colorado, Boulder, Colorado 80309-0215, USA.</Affiliation>
      …

See also lecture 2
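
The hierarchical organisation is what makes XML easy to query: an element is addressed by the path of tags leading to it. The sketch below illustrates this on a shortened copy of the Medline record. It uses `xml.etree.ElementTree` from the standard library instead of lxml (our substitution, so that it runs without extra installs); the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Medline record shown above.
record = """
<Article>
  <Journal>
    <ISSN>0270-7306</ISSN>
    <JournalIssue>
      <Volume>19</Volume>
      <Issue>11</Issue>
      <PubDate>
        <Year>1999</Year>
        <Month>Nov</Month>
      </PubDate>
    </JournalIssue>
  </Journal>
  <Pagination>
    <MedlinePgn>7651-60</MedlinePgn>
  </Pagination>
</Article>
"""

article = ET.fromstring(record.strip())

# Walk down the hierarchy with an explicit path of tag names ...
issn = article.findtext("Journal/ISSN")
year = article.findtext("Journal/JournalIssue/PubDate/Year")

# ... or search the whole subtree for a tag, wherever it sits.
pages = article.findtext(".//MedlinePgn")

print(issn, year, pages)  # prints: 0270-7306 1999 7651-60
```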

slide-8
SLIDE 8

Application I:

What's the most recent article from the Schroeder group?

https://www.ncbi.nlm.nih.gov/pubmed
https://www.ncbi.nlm.nih.gov/home/develop/api/

slide-9
SLIDE 9

Application I:

What's the most recent article from the Schroeder group?

1. First we run the main query to obtain all articles from the group (with the author name Michael Schroeder):

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Michael+Schroeder%5Bauthor%5D

Documentation at https://www.ncbi.nlm.nih.gov/pmc/tools/developers/

slide-10
SLIDE 10

Application I:

What's the most recent article from the Schroeder group?

The first ID in the result is the ID of the last article published!

slide-11
SLIDE 11

Application I:

What's the most recent article from the Schroeder group?

2. Then, using the article ID we get the details for it:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=31811259&format=xml

slide-12
SLIDE 12

Application I:

What's the most recent article from the Schroeder group?

2. The details returned for the article ID include its title.
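
The two steps of this application (search for IDs, then fetch the details for one ID) differ only in the endpoint and the parameters. A minimal sketch of how both URLs could be generated follows; the function names are our own, and the parameters are exactly those visible in the URLs on the slides. Nothing is sent to the server here.

```python
from urllib.parse import urlencode

# Common prefix of both E-utilities endpoints used on the slides.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(author):
    """Step 1: URL that searches PubMed for articles by the given author."""
    return EUTILS + "esearch.fcgi?" + urlencode({"db": "pubmed", "term": author + "[author]"})


def efetch_url(pubmed_id):
    """Step 2: URL that fetches the XML record for one PubMed ID."""
    return EUTILS + "efetch.fcgi?" + urlencode({"db": "pubmed", "id": pubmed_id, "format": "xml"})


print(esearch_url("Michael Schroeder"))
print(efetch_url(31811259))
```

Note how `urlencode` turns the space into `+` and the square brackets into `%5B` and `%5D`, which is the escaped form seen in the search URL of step 1.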

slide-13
SLIDE 13

Application II: ChEMBL

Find compounds with desired properties

https://www.ebi.ac.uk/chembl
https://chembl.gitbook.io/chembl-interface-documentation/web-services/chembl-data-web-services

The query syntax is not the same for all web services!!

slide-14
SLIDE 14

Application II: ChEMBL

Find compounds with desired properties

1. Let's find compounds ending with rin with a MW between 150 and 200

slide-15
SLIDE 15

Application II: ChEMBL

Find compounds with desired properties

1. Let's find compounds ending with rin with a MW between 150 and 200:

https://www.ebi.ac.uk/chembl/api/data/molecule?molecule_properties__mw_freebase__gte=150&molecule_properties__mw_freebase__lte=200&pref_name__iendswith=rin

Aspirin!!

slide-16
SLIDE 16

Application II: ChEMBL

Find compounds with desired properties

1. The result for aspirin includes its canonical SMILES:

CC(=O)Oc1ccccc1C(=O)O

slide-17
SLIDE 17

Application II: ChEMBL

Find compounds with desired properties

2. Let's find another molecule with aspirin as a substructure:

Aspirin: CC(=O)Oc1ccccc1C(=O)O

https://www.ebi.ac.uk/chembl/api/data/substructure/CC(=O)Oc1ccccc1C(=O)O

(XML result data not shown)

Documentation at https://www.ebi.ac.uk/chembl/ws

slide-18
SLIDE 18

Application II: ChEMBL

Find compounds with desired properties

2. Second hit of the substructure search for aspirin: CHEMBL7666

(XML result data not shown)
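
When a SMILES string is placed inside a URL, some of its characters can clash with URL syntax: `#` (a triple bond in SMILES) starts a URL fragment, and `/` separates path segments. The sketch below percent-encodes the SMILES with `urllib.parse.quote` before appending it to the substructure endpoint. Whether ChEMBL needs this for aspirin is not stated on the slides (the slide URL uses the raw string), so treat it as a general precaution; the function name is our own.

```python
from urllib.parse import quote, unquote

# Substructure endpoint as shown on the slide.
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data/substructure/"


def substructure_url(smiles):
    """Append a percent-encoded SMILES string to the substructure endpoint."""
    # safe="" makes quote() escape "/" as well, which it leaves alone by default.
    return BASE_URL + quote(smiles, safe="")


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(substructure_url(aspirin))

# A SMILES with a triple bond: without escaping, everything after "#"
# would be treated as a URL fragment and not be sent to the server.
print(substructure_url("C#N"))

# Decoding gives back the original SMILES.
print(unquote(quote(aspirin, safe="")))
```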

slide-19
SLIDE 19

Important Information

  • Read the documentation of each service you are using
  • Sometimes you will need keys to have access
  • Don't send too many requests to the server (you could crash it or be blocked)
  • Some services don't allow parallel requests

With great power comes great responsibility!

https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html

USAGE POLICY: Please note that PUG REST is not designed for very large volumes (millions) of requests. We ask that any script or application not make more than 5 requests per second, in order to avoid overloading the PubChem servers. If you have a large data set that you need to compute with, please contact us for help on optimizing your task, as there are likely more efficient ways to approach such bulk queries.
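
One simple way to respect such a limit in a script is to pause between requests. The following sketch spaces out calls so that no more than a given number happen per second. The class name `Throttle` and its interface are our own invention, not part of any PubChem client; the `clock` and `sleep` parameters exist only so that the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_second` happen per second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second  # e.g. 0.2 s for 5 requests/s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a script one would create `throttle = Throttle(5)` once and call `throttle.wait()` immediately before every request. Requests are then sent one after another, which also avoids parallel requests.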

slide-20
SLIDE 20

Web Resources in Python

Simple example: Extract all authors for a paper
Part I: Choosing your tools

  • urllib library for fetching web resources
  • lxml for parsing XML result files

    from urllib.request import urlopen   # function to open the URL
    from lxml import etree               # module to read XML files

    baseurl = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
    query = "db=pubmed&id=27626687&format=xml"
    url = baseurl + query

    f = urlopen(url)                           # opens the URL
    resultxml = f.read()                       # reads the content behind the URL
    xml = etree.XML(resultxml)                 # parses the content into an XML tree
    resultelements = xml.xpath("//LastName")   # searches for all tags matching the XPath
    print([element.text for element in resultelements])   # prints all last names as one list

slide-21
SLIDE 21

Web Resources in Python

Simple example: Extract all authors for a paper
Part I: Choosing your tools

Import the libraries:

    from urllib.request import urlopen
    from lxml import etree

slide-22
SLIDE 22

Web Resources in Python

Simple example: Extract all authors for a paper
Part I: Choosing your tools

Define your URL query:

    baseurl = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
    query = "db=pubmed&id=27626687&format=xml"
    url = baseurl + query

slide-23
SLIDE 23

Web Resources in Python

Simple example: Extract all authors for a paper
Part I: Choosing your tools

Open the URL and read the XML:

    f = urlopen(url)             # opens the URL
    resultxml = f.read()         # reads the content behind the URL
    xml = etree.XML(resultxml)   # parses the content into an XML tree

slide-24
SLIDE 24

Web Resources in Python

Simple example: Extract all authors for a paper
Part I: Choosing your tools

Extract important data from the XML with a tag:

    resultelements = xml.xpath("//LastName")   # searches for all tags matching the XPath
    print([element.text for element in resultelements])


slide-26
SLIDE 26

Web Resources in Python

Simple example: Extract all authors for a paper
Part I: Choosing your tools

Output:

['Heinrich', 'Donakonda', 'Haupt', 'Lennig', 'Zhang', 'Schroeder']
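
For reuse in longer scripts it can help to separate fetching from parsing, so that the parsing step can be tried on any XML without contacting the server. The following is a sketch under our own assumptions: the function names are invented, the timeout value is arbitrary, and `xml.etree.ElementTree` from the standard library stands in for lxml.

```python
from urllib.request import urlopen
import xml.etree.ElementTree as ET


def fetch(url, timeout=30):
    """Download the raw bytes behind a URL and close the connection afterwards."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def last_names(xml_bytes):
    """Return the text of every <LastName> element in an XML document."""
    root = ET.fromstring(xml_bytes)
    # iter() visits matching elements at any depth, like the XPath //LastName.
    return [element.text for element in root.iter("LastName")]
```

With these two functions the example above becomes `last_names(fetch(url))`, where `url` is the efetch URL built from `baseurl` and `query`.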

slide-27
SLIDE 27

Web Resources in Python

Part II: Constructing queries

Example: Get last 3 PubMed Ids for an author

    # Import the libraries
    from urllib.request import urlopen
    from urllib.parse import urlencode
    from lxml import etree

    author_name = "Michael Schroeder"
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
    url = base_url + urlencode({"db": "pubmed", "term": author_name + "[author]"})

    f = urlopen(url)
    resultxml = f.read()
    xml = etree.XML(resultxml)
    resultelements = xml.xpath("//Id")
    print([element.text for element in resultelements[:3]])   # the first 3 Ids as one list

slide-28
SLIDE 28

Web Resources in Python

Part II: Constructing queries

Example: Get last 3 PubMed Ids for an author

Author name to search:

    author_name = "Michael Schroeder"

slide-29
SLIDE 29

Web Resources in Python

Part II: Constructing queries

Example: Get last 3 PubMed Ids for an author

Encode the query values:

    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
    url = base_url + urlencode({"db": "pubmed", "term": author_name + "[author]"})

slide-30
SLIDE 30

Web Resources in Python

Part II: Constructing queries

Example: Get last 3 PubMed Ids for an author

Open the URL and extract important data with a tag:

    f = urlopen(url)
    resultxml = f.read()
    xml = etree.XML(resultxml)
    resultelements = xml.xpath("//Id")
    print([element.text for element in resultelements[:3]])

slide-31
SLIDE 31

Web Resources in Python

Part II: Constructing queries

Example: Get last 3 PubMed Ids for an author

Output:

['30376559', '30239928', '29895899']

slide-32
SLIDE 32

Web Resources in Python

Part III: Parsing results iteratively

Example: Get molecules ending with…

    from urllib.request import urlopen
    from urllib.parse import urlencode
    from lxml import etree

    ending = "zumab"
    base_url = "https://www.ebi.ac.uk/chembl/api/data/molecule?"
    url = base_url + urlencode({"pref_name__iendswith": ending})   # drugs ending with zumab

    f = urlopen(url)
    resultxml = f.read()
    xml = etree.XML(resultxml)

    approvaldata = {}                                  # creates a new empty dictionary
    for element in xml.iter():                         # iterates over all elements of the XML tree
        if element.tag == 'molecules':                 # searches for a specific tag name
            for subele in element:                     # iterates over its children (second level)
                if subele.tag == 'molecule':
                    molname, approval = None, None     # reset for every molecule
                    for subsubele in subele:           # iterates over the children (third level)
                        if subsubele.tag == 'pref_name':        # gets the preferred name of the drug
                            molname = subsubele.text
                        if subsubele.tag == 'first_approval':   # gets the year of first approval
                            approval = subsubele.text
                    approvaldata[molname] = approval

    # keep only molecules that have an approval year
    print(["%s (%s)" % (mol, approvaldata[mol])
           for mol in approvaldata if approvaldata[mol] is not None])

Output:

['EFALIZUMAB (2003)', 'BEVACIZUMAB (2004)', 'PALIVIZUMAB (1998)', 'OMALIZUMAB (2003)', 'NATALIZUMAB (2004)', 'TRASTUZUMAB (1998)', 'ALEMTUZUMAB (2001)', 'DACLIZUMAB (1997)', 'RANIBIZUMAB (2006)', 'ECULIZUMAB (2007)', 'TOCILIZUMAB (2005)', 'BENRALIZUMAB (2017)', 'ELOTUZUMAB (2015)']
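
The three nested loops can be shortened by asking directly for every `molecule` element and reading its two children by tag name. The sketch below shows this on a small hand-written sample so that it runs without network access. The sample is a stand-in, not real ChEMBL output: the root tag `response` and the entry EXAMPLEZUMAB are assumptions for illustration, and `xml.etree.ElementTree` from the standard library is used in place of lxml.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a ChEMBL response (structure assumed from the
# tag names used in the example above; EXAMPLEZUMAB is a made-up entry).
SAMPLE = """
<response>
  <molecules>
    <molecule>
      <pref_name>EFALIZUMAB</pref_name>
      <first_approval>2003</first_approval>
    </molecule>
    <molecule>
      <pref_name>EXAMPLEZUMAB</pref_name>
      <first_approval/>
    </molecule>
    <molecule>
      <pref_name>BEVACIZUMAB</pref_name>
      <first_approval>2004</first_approval>
    </molecule>
  </molecules>
</response>
"""


def approvals(xml_text):
    """Map preferred name to approval year, skipping molecules without a year."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for molecule in root.iter("molecule"):       # every <molecule>, at any depth
        name = molecule.findtext("pref_name")    # None if the child is missing
        year = molecule.findtext("first_approval")
        if name and year:                        # skips missing or empty values
            result[name] = year
    return result


print(approvals(SAMPLE))  # prints: {'EFALIZUMAB': '2003', 'BEVACIZUMAB': '2004'}
```

Reading both values inside one loop body keeps name and year of the same molecule together, so a missing year cannot be filled with a value left over from the previous molecule.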

slide-33
SLIDE 33

Summary

  • GUIs are simplified access points to the data, so we need programmatic access via power user gateways
  • REST-like structures are available in many web services
  • URL-HTTP queries usually consist of a base URL plus the property filters
  • XML files are commonly used for exchanging data
  • The REST query syntax is different for each service
  • Always pay attention to the documentation (instructions/rules) of each service
  • Python has modules well suited to working with REST queries: urllib and lxml