Programming in Python
Michael Schroeder Melissa Adasme
Lecture 8: Python Online
Motivation: Access to Web Resources
Wildcards possible? Can I filter somewhere? Can I combine two different searches?
In most cases NO, since Web GUIs are limited.
Example Query (URL):
https://www.ebi.ac.uk/chembl/api/data/molecule?molecule_properties__mw_freebase__lte=300&pref_name__iendswith=nib
1. Wildcards / search for substrings (pref_name__iendswith=nib)
2. Filtering by selected properties (molecule_properties__mw_freebase__lte=300)
3. Combination of different criteria (&)
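Such a query URL can be assembled in Python with the standard library; a minimal sketch that only builds the string (no request is sent):

```python
from urllib.parse import urlencode

base_url = "https://www.ebi.ac.uk/chembl/api/data/molecule?"

# Combine several criteria: a property filter plus a substring
# wildcard on the name; urlencode joins them with "&"
params = {
    "molecule_properties__mw_freebase__lte": 300,  # filter by property
    "pref_name__iendswith": "nib",                 # wildcard: name ends with "nib"
}
url = base_url + urlencode(params)
print(url)
```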
Schema of ChEMBL data https://www.ebi.ac.uk/chembl/api/data/molecule/schema
URLs can/should be used as a uniform interface to a web service
http://biowebsitexyz.com/pug/proteins
  GET: list all proteins
  POST: create a new protein entry (data is sent separately; the server creates the new URL)
http://biowebsitexyz.com/pug/proteins/p21
  GET: get the data for protein 21
  DELETE: delete the entry for protein 21 on the server
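These verb/URL pairs can be expressed with urllib's Request objects; a sketch that only constructs the requests against the slide's hypothetical service, without sending anything:

```python
from urllib.request import Request

base = "http://biowebsitexyz.com/pug/proteins"   # hypothetical service from the slide

list_all  = Request(base)                                   # GET is the default method
create    = Request(base, data=b"name=p21", method="POST")  # data travels separately from the URL
fetch_p21 = Request(base + "/p21")                          # GET one entry
del_p21   = Request(base + "/p21", method="DELETE")

print(list_all.get_method(), create.get_method(), del_p21.get_method())
# prints: GET POST DELETE
```

urlopen(req) would actually send each request; for this lecture only the GET ones are needed.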
http://biowebsitexyz.com/pug/proteins
  GET: just the base URL for the service; lists all proteins
http://biowebsitexyz.com/pug/proteins?num_aa_gte=100
  GET: simple filter; lists all proteins with at least 100 amino acids
http://biowebsitexyz.com/pug/proteins?num_aa_gte=100&organism=homo_sapiens
  GET: multiple criteria; lists all human proteins with at least 100 amino acids
We will focus on GET queries, since you will mostly just need to read data from servers
■ We can store any data in XML, the eXtensible Markup Language, e.g. Medline
■ Logical data organisation: yes, XML schema, which is enforced ■ Physical data organisation: None, we cannot optimise retrieval for common queries ■ Hierarchical organization ■ Commonly used as an exchange format for data
<Article>
  <Journal>
    <ISSN>0270-7306</ISSN>
    <JournalIssue>
      <Volume>19</Volume>
      <Issue>11</Issue>
      <PubDate>
        <Year>1999</Year>
        <Month>Nov</Month>
      </PubDate>
    </JournalIssue>
  </Journal>
  <ArticleTitle>Differential regulation of the cell wall integrity mitogen-activated protein kinase pathway in budding yeast by the protein tyrosine phosphatases Ptp2 and Ptp3.</ArticleTitle>
  <Pagination>
    <MedlinePgn>7651-60</MedlinePgn>
  </Pagination>
  <Abstract>
    <AbstractText>Mitogen-activated protein kinases (MAPKs) are inactivated by dual-specificity and protein tyrosine phosphatases (PTPs) in yeasts. In Saccharomyces cerevisiae, two PTPs, Ptp2 and Ptp3, inactivate the MAPKs, Hog1 and Fus3, with different specificities...</AbstractText>
  </Abstract>
  <Affiliation>Department of Chemistry, University of Colorado, Boulder, Colorado 80309-0215, USA.</Affiliation>
  …
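A record like this can be navigated by tag name; a minimal sketch using the standard library's xml.etree on a trimmed inline copy of the record (the lecture's later examples use lxml instead):

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the Medline record above, inlined as a string
record = """<Article>
  <Journal>
    <ISSN>0270-7306</ISSN>
    <JournalIssue><Volume>19</Volume><Issue>11</Issue></JournalIssue>
  </Journal>
  <Pagination><MedlinePgn>7651-60</MedlinePgn></Pagination>
</Article>"""

article = ET.fromstring(record)        # parse the string into an element tree
issn = article.find("Journal/ISSN")    # a path selects nested elements
pages = article.find(".//MedlinePgn")  # ".//" searches the whole subtree
print(issn.text, pages.text)           # prints: 0270-7306 7651-60
```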
See also lecture 2
What's the most recent article from the Schroeder group?
https://www.ncbi.nlm.nih.gov/pubmed https://www.ncbi.nlm.nih.gov/home/develop/api/
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Michael+Schroeder%5Bauthor%5D
1. First we run the main query to obtain all articles from the group (with the author name Michael Schroeder).
Documentation at https://www.ncbi.nlm.nih.gov/pmc/tools/developers/
ID of the last article published!
2. Then, using the article ID, we get the details for it.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=31811259&format=xml
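Both steps can be chained from Python. A sketch, with helper functions of my own naming; the network calls are kept inside the main guard so nothing is fetched on import, and the standard library's xml.etree stands in for lxml:

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def esearch_url(term):
    # Step 1: search PubMed; the first <Id> in the result is the newest article
    return EUTILS + "esearch.fcgi?" + urlencode({"db": "pubmed", "term": term})

def efetch_url(pmid):
    # Step 2: fetch the full XML record for one article ID
    return EUTILS + "efetch.fcgi?" + urlencode({"db": "pubmed", "id": pmid, "format": "xml"})

if __name__ == "__main__":
    hits = ET.fromstring(urlopen(esearch_url("Michael Schroeder[author]")).read())
    newest = hits.find(".//Id").text                 # ID of the last article published
    record = ET.fromstring(urlopen(efetch_url(newest)).read())
    print(record.find(".//ArticleTitle").text)       # its title
```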
Find compounds with desired properties
https://www.ebi.ac.uk/chembl https://chembl.gitbook.io/chembl-interface-documentation/web-services/chembl-data-web-services
The query syntax is not the same for all web services!
1. Let's find compounds ending with "rin" with a MW between 150 and 200:
https://www.ebi.ac.uk/chembl/api/data/molecule?molecule_properties__mw_freebase__gte=150&molecule_properties__mw_freebase__lte=200&pref_name__iendswith=rin
Aspirin! Canonical SMILES: CC(=O)Oc1ccccc1C(=O)O
Documentation at https://www.ebi.ac.uk/chembl/ws
2. Let's find another molecule with aspirin (CC(=O)Oc1ccccc1C(=O)O) as a substructure:
https://www.ebi.ac.uk/chembl/api/data/substructure/CC(=O)Oc1ccccc1C(=O)O
(XML result data not shown)
Second hit: CHEMBL7666
Documentation at https://www.ebi.ac.uk/chembl/ws
https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html
USAGE POLICY: Please note that PUG REST is not designed for very large volumes (millions) of requests. We ask that any script or application not make more than 5 requests per second, in order to avoid overloading the PubChem servers. If you have a large data set that you need to compute with, please contact us for help on optimizing your task, as there are likely more efficient ways to approach such bulk queries.
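One simple way to honour such a limit in a script is to pace the requests; a minimal sketch (the RateLimiter class is my own, not part of any PubChem library):

```python
import time

class RateLimiter:
    """Allow at most `per_second` calls per second by sleeping in between."""
    def __init__(self, per_second=5):
        self.min_interval = 1.0 / per_second
        self.last = None
    def wait(self):
        # Sleep just long enough since the previous call, then record the time
        if self.last is not None:
            elapsed = time.monotonic() - self.last
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last = time.monotonic()

limiter = RateLimiter(per_second=5)
for cid in ["2244", "3672"]:      # example PubChem compound IDs
    limiter.wait()                # blocks until at least 0.2 s since the last call
    # urlopen(...) for the PUG REST request would go here
```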
Part I: Choosing your tools
Simple example: Extract all authors for a paper

from urllib.request import urlopen   # module to open URLs
from lxml import etree               # module to parse XML
baseurl = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
query = "db=pubmed&id=27626687&format=xml"   # define your URL query
url = baseurl + query
f = urlopen(url)                     # open the URL
resultxml = f.read()                 # read the URL content
xml = etree.XML(resultxml)           # parse the content into an XML tree
resultelements = xml.xpath("//LastName")   # find all tags matching the XPath
print([element.text for element in resultelements])
['Heinrich', 'Donakonda', 'Haupt', 'Lennig', 'Zhang', 'Schroeder']
Part II: Constructing queries
Example: Get last 3 PubMed Ids for an author

from urllib.request import urlopen
from urllib.parse import urlencode   # safely encodes the query values into a URL
from lxml import etree
author_name = "Michael Schroeder"    # author name to search
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
url = base_url + urlencode({"db": "pubmed", "term": author_name + "[author]"})
f = urlopen(url)                     # open the URL
resultxml = f.read()                 # read the URL content
xml = etree.XML(resultxml)           # parse the content into an XML tree
resultelements = xml.xpath("//Id")   # extract the Id tags
print([element.text for element in resultelements[:3]])
['30376559', '30239928', '29895899']
Part III: Parsing results iteratively
Example: Get molecules ending with "zumab"

from urllib.request import urlopen
from urllib.parse import urlencode
from lxml import etree

ending = "zumab"
base_url = "https://www.ebi.ac.uk/chembl/api/data/molecule?"
url = base_url + urlencode({"pref_name__iendswith": ending})   # drugs ending with zumab
f = urlopen(url)
resultxml = f.read()
xml = etree.XML(resultxml)
approvaldata = {}                                # new empty dictionary: name -> approval year
for element in xml.iter():                       # iterate over the XML elements
    if element.tag == 'molecules':               # look for a specific tag name
        for subele in element:                   # iterate over the second level
            if subele.tag == 'molecule':
                for subsubele in subele:         # iterate over the third level
                    if subsubele.tag == 'pref_name':        # preferred name of the drug
                        molname = subsubele.text
                    if subsubele.tag == 'first_approval':   # year of first approval
                        approval = subsubele.text
                approvaldata[molname] = approval
for mol in approvaldata:
    if approvaldata[mol] is not None:
        print("%s (%s)" % (mol, approvaldata[mol]))
['EFALIZUMAB (2003)', 'BEVACIZUMAB (2004)', 'PALIVIZUMAB (1998)', 'OMALIZUMAB (2003)', 'NATALIZUMAB (2004)', 'TRASTUZUMAB (1998)', 'ALEMTUZUMAB (2001)', 'DACLIZUMAB (1997)', 'RANIBIZUMAB (2006)', 'ECULIZUMAB (2007)', 'TOCILIZUMAB (2005)', 'BENRALIZUMAB (2017)', 'ELOTUZUMAB (2015)']
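The nested-loop pattern can be exercised offline on a small hand-written snippet shaped like the ChEMBL response (the molecule names and years below are invented; the standard library's xml.etree stands in for lxml):

```python
import xml.etree.ElementTree as ET

# A tiny hand-written stand-in for the ChEMBL response (values invented)
sample = """<response>
  <molecules>
    <molecule><pref_name>EXAMPLEZUMAB</pref_name><first_approval>2001</first_approval></molecule>
    <molecule><pref_name>OTHERZUMAB</pref_name><first_approval></first_approval></molecule>
  </molecules>
</response>"""

xml = ET.fromstring(sample)
approvaldata = {}
for element in xml.iter():                # walk every element in the tree
    if element.tag == "molecules":
        for subele in element:            # each <molecule>
            if subele.tag == "molecule":
                molname, approval = None, None
                for subsubele in subele:  # the fields of one molecule
                    if subsubele.tag == "pref_name":
                        molname = subsubele.text
                    if subsubele.tag == "first_approval":
                        approval = subsubele.text
                approvaldata[molname] = approval

for mol, year in approvaldata.items():
    if year is not None:                  # skip unapproved molecules
        print("%s (%s)" % (mol, year))    # prints: EXAMPLEZUMAB (2001)
```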
programmatic access via power user gateways
properties filtering
urllib and lxml