COMP60411: Modelling Data on the Web, Tree Data Models, Week 2


  1. COMP60411: Modelling Data on the Web
     Tree Data Models, Week 2
     Tim Morris & Uli Sattler, University of Manchester

  2. Reminder: Plagiarism & Academic Malpractice
     • We assume that you have all by now successfully completed the Plagiarism and Malpractice Test
     • ...if you haven’t: do so before you submit any coursework (assignment or assessment)
     • ...because we work under the assumption that
       – you know what you are doing
       – you take pride in your own thoughts & your own writing
       – you don’t steal thoughts or words from others
     • ...and if you don’t, and submit coursework in which you have copied other people’s work without correct attribution, it will cost you marks at the very least, and possibly more, e.g., your MSc

  3. Reminder
     We maintain 3 sources of information:
     • syllabus … /pgt/COMP60411/syllabus/
     • materials … /pgt/COMP60411/
       – growing continuously
       – with slides, reading material, etc.
       – with TA lab times
     • Blackboard via myManchester
       – growing continuously
       – Forums (subscribe, read, contribute)
         • General
         • Week 1, Week 2, …
       – Coursework

  4. Coursework - Week 1
     • Q1: looks good, will look better next week, BUT …
     • SE1: looks mostly good
       – use a good spell & grammar checker!
       – answer the question!
       – We know what ER diagrams are, no need to explain them
       – No need to explain logical/physical model in detail!
       – avoid non sequiturs
     • M1:
       – …
     • CW1:
       – …
     • For all:
       – check our feedback in the rubrics
       – if you can’t find them, ask us in labs
       – start in time

  5. Today
     We will encounter many things:
     Tree data models:
       1. Data Structure formalisms: JSON
       2. Schema Language: JSON Schema
       3. Data Manipulation: Python, JSON package
     General concepts:
     • Semi-structured data
     • Self-Describing
     • Trees
     • Regular Expressions
     • Internal & External Representation, Parsing, Serialisation
     • Validation, valid, …
     • Format
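The terms internal representation, external representation, parsing and serialisation from the list above can be illustrated with Python's standard `json` package. This is only a minimal sketch: the `person` record is hypothetical data invented for illustration and does not come from the slides.

```python
import json

# Internal representation: an ordinary Python dict, i.e. a small tree.
# The record below is hypothetical example data.
person = {
    "name": "Example Person",
    "emails": ["a@example.org", "b@example.org"],
    "address": {"city": "Manchester", "postcode": "M2 3OZ"},
}

# Serialisation: internal representation -> external representation (text).
text = json.dumps(person, indent=2)

# Parsing: external representation -> internal representation.
parsed = json.loads(text)

print(type(text).__name__)  # str
print(parsed == person)     # True: the round trip preserves the tree
```

The same tree has two forms here: the `dict` a program works on, and the JSON text that can be stored or sent to others.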

  6. Extending Last Week’s Running Example

  7. Extended Running Example
     • Remember last week’s example:
       – per person 1-3 data records, with address, phone, email, …
     • now combine this with management information:
       – who supervises/line manages whom?

     Management
     Manager ID | Managee ID
     1234124    | 1234123
     1234567    | 1234124
     1234124    | 1234567
     ...        | ...

     Employees
     Employee ID | Postcode | City       | …
     1234123     | M16 0P2  | Manchester | …
     1234124     | M2 3OZ   | Manchester | …
     1234567     | SW1 A    | London     | …
     ...         | ...      | ...        | ...

     • … what could go wrong?
     • … what did go wrong?

  8. Running Example (2)
     • Take a few minutes and sketch this SQL query:
       Q1: all postcodes of 4th-level managers

     Employees
     Employee ID | Postcode | City       | …
     1234123     | M16 0P2  | Manchester | …
     1234124     | M2 3OZ   | Manchester | …
     1234567     | SW1 A    | London     | …
     ...         | ...      | ...        | ...

     Management
     Manager ID | ManageeID
     1234124    | 1234123
     1234567    | 1234124
     1234123    | 1234567
     ...        | ...

  9. Q1: Tricky..

     Management
     ManagerID | ManageeID
     1234124   | 1234123
     1234567   | 1234124
     …         | …

     Q1’: Postcodes of all managers:
       SELECT Postcode
       FROM Employees E, Management M
       WHERE E.EmployeeID = M.ManagerID

     Q1’’: Postcodes of 2nd-level managers:
       SELECT Postcode
       FROM Employees E INNER JOIN
            (SELECT M1.ManagerID
             FROM Management M1, Management M2
             WHERE M1.ManageeID = M2.ManagerID) M
         ON E.EmployeeID = M.ManagerID

     … more and more joins!
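To make "more and more joins" concrete, here is a minimal sketch that generates the query for k-th level managers with Python's `sqlite3` module. It assumes that a k-th level manager is the top of a chain of k Management rows, which matches Q1’ (k = 1) and Q1’’ (k = 2). The helper names `level_k_manager_query` and `postcodes` are hypothetical, and the data is the acyclic sample shown on this slide.

```python
import sqlite3


def level_k_manager_query(k: int) -> str:
    """Build SQL for the postcodes of k-th level managers (k copies of Management)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    tables = ", ".join(f"Management M{i}" for i in range(1, k + 1))
    # Chain the copies: the managee in M_i is the manager in M_(i+1).
    links = [f"M{i}.ManageeID = M{i + 1}.ManagerID" for i in range(1, k)]
    where = " AND ".join(["E.EmployeeID = M1.ManagerID"] + links)
    return f"SELECT DISTINCT E.Postcode FROM Employees E, {tables} WHERE {where}"


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Postcode TEXT, City TEXT);
    CREATE TABLE Management (ManagerID INTEGER, ManageeID INTEGER);
    INSERT INTO Employees VALUES (1234123, 'M16 0P2', 'Manchester'),
                                 (1234124, 'M2 3OZ', 'Manchester'),
                                 (1234567, 'SW1 A', 'London');
    INSERT INTO Management VALUES (1234124, 1234123), (1234567, 1234124);
""")


def postcodes(k: int) -> list:
    """Run the generated query and return the postcodes in sorted order."""
    return sorted(row[0] for row in conn.execute(level_k_manager_query(k)))


print(postcodes(1))  # ['M2 3OZ', 'SW1 A']
print(postcodes(2))  # ['SW1 A']
print(postcodes(3))  # []
```

Each additional level adds one more copy of Management to the query, so Q1 (4th-level managers) needs four copies, and the query text has to change whenever the level does.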

  10. Running Example (2)
      • Take a few minutes and sketch this SQL query:
        Q2: “error” if we have a cyclic management structure

      Employees
      Employee ID | Postcode | City       | …
      1234123     | M16 0P2  | Manchester | …
      1234124     | M2 3OZ   | Manchester | …
      1234567     | SW1 A    | London     | …
      ...         | ...      | ...        | ...

      Management
      Manager ID | ManageeID
      1234124    | 1234123
      1234567    | 1234124
      1234123    | 1234567
      ...        | ...

  11. Q2: Tricky …

      Management
      ManagerID | ManageeID
      1234124   | 1234123
      1234567   | 1234124
      …         | …

      – Detecting management cycles of length 1:
        SELECT M.ManagerID
        FROM Management M
        WHERE M.ManageeID = M.ManagerID

      – Detecting management cycles of length 2:
        SELECT E1.EmployeeID
        FROM Employees E1 INNER JOIN
             (SELECT M1.ManagerID
              FROM Management M1, Management M2
              WHERE M1.ManageeID = M2.ManagerID
                AND M2.ManageeID = M1.ManagerID) M
          ON E1.EmployeeID = M.ManagerID

      – … where do we stop?
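Outside plain SQL joins, a cycle of any length can be found by treating the Management rows as edges of a graph. The following is a minimal sketch using depth-first search in Python. The function name `find_cycle` is hypothetical, and the pairs are the (manager, managee) rows from the Running Example (2) tables.

```python
def find_cycle(pairs):
    """Return one management cycle as a list of IDs, or None if there is none."""
    managees = {}
    for manager, managee in pairs:
        managees.setdefault(manager, []).append(managee)

    on_path = set()   # nodes on the current depth-first path
    done = set()      # nodes whose subtrees are fully explored
    path = []

    def visit(node):
        on_path.add(node)
        path.append(node)
        for nxt in managees.get(node, []):
            if nxt in on_path:
                # Reached a node already on the path: that closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if nxt not in done:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    for start in list(managees):
        if start not in done:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


cyclic = [(1234124, 1234123), (1234567, 1234124), (1234123, 1234567)]
acyclic = [(1234124, 1234123), (1234567, 1234124)]
print(find_cycle(cyclic))   # [1234124, 1234123, 1234567, 1234124]
print(find_cycle(acyclic))  # None
```

The recursion is fine for a sketch, but very deep management chains would need an iterative version because of Python's recursion limit.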

  12. A new example: UniProt, a Protein Database
      • A research community based & curated knowledge base of
        – 550K protein sequences,
        – comprising 192M amino acids
        – abstracted from 220K references.
      • Proteins largely determine how (parts of) living things work and interact
        – how/where diseases work
      • Used for a variety of research into
        – (causes of) diseases
        – genetics
        – (personalized) drugs
        – …

  14. Protein data from UniProt
      UniProt
      • provides a web query interface to the UniProt DB,
        – manual
        – programmatic
      • e.g., query http://www.uniprot.org/uniprot/ for ‘BRCA’
      • … biologists need to integrate, share, query, analyse, and search this data
      • ...so what format is/should it be in?
      • ...or what format should it be made available in to be integrated with other data?

  15. Protein data from UniProt as text

  16. Protein data from UniProt in a table (1)
      ... | Protein Full Name | Short Name | Alternative Name 1 | Alternative Name 2 | Alternative Name 3 | Gene 1 | Gene 2 | Gene 3 | ... | Organism | Taxon 1 | Taxon 2
      ... | Fanconi anemia group J | FACJ | ATP-dependent RNA helicase BRIP1 | BRCA1-interacting protein C-terminal helicase 1 | BRCA1-interacting protein 1 | BRIP1 | BACH1 | FANCJ | ... | Halorubrum phage HF2 | Viruses | dsDNA viruses, no RNA stage
      ... | ATP-dependent helicase | N/A | N/A | N/A | N/A | helicase | N/A | N/A | ... | Gallus gallus / Chicken | Eukaryota | Metazoa
      ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...

  17. Protein data from UniProt in many tables (2)

      Proteins
      Protein ID | Full Name              | Short Name | Organism                | ...
      1234123    | Fanconi anemia …       | FACJ       | Halorubrum phage HF2    | ...
      1234567    | ATP-dependent helicase | N/A        | Gallus gallus / Chicken | ...
      ...        | ...                    | ...        | ...                     | ...

      Protein-names
      Protein ID | Alternative Name
      1234123    | ATP-dependent RNA helicase BRIP1
      1234123    | BRCA1-interacting protein C-terminal helicase 1
      1234123    | BRCA1-interacting protein 1
      ...        | ...

      Protein-genes
      Protein ID | Genes
      1234123    | BRIP1
      1234123    | BACH1
      ...        | ...
      1234567    | helicase
      ...        | ...

      … too many joins!
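As a rough illustration of "too many joins", the sketch below loads a few of the rows above into an in-memory SQLite database and reassembles one protein. The table names `ProteinNames` and `ProteinGenes` are assumptions (the hyphenated names on the slide are not plain SQL identifiers), only three columns of Proteins are modelled, and the full name is taken from the XML version of this entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Proteins (ProteinID INTEGER PRIMARY KEY, FullName TEXT, ShortName TEXT);
    CREATE TABLE ProteinNames (ProteinID INTEGER, AlternativeName TEXT);
    CREATE TABLE ProteinGenes (ProteinID INTEGER, Gene TEXT);
    INSERT INTO Proteins VALUES (1234123, 'Fanconi anemia group J protein', 'FACJ');
    INSERT INTO ProteinNames VALUES
        (1234123, 'ATP-dependent RNA helicase BRIP1'),
        (1234123, 'BRCA1-interacting protein C-terminal helicase 1'),
        (1234123, 'BRCA1-interacting protein 1');
    INSERT INTO ProteinGenes VALUES (1234123, 'BRIP1'), (1234123, 'BACH1');
""")

# Two joins are needed just to see one protein with its names and genes.
rows = conn.execute("""
    SELECT P.FullName, N.AlternativeName, G.Gene
    FROM Proteins P
    JOIN ProteinNames N ON N.ProteinID = P.ProteinID
    JOIN ProteinGenes G ON G.ProteinID = P.ProteinID
    WHERE P.ProteinID = 1234123
""").fetchall()

# The flat result repeats every name for every gene (3 names x 2 genes).
print(len(rows))  # 6

# Folding the rows back into one nested record recovers the tree shape.
record = {
    "fullName": rows[0][0],
    "alternativeNames": sorted({r[1] for r in rows}),
    "genes": sorted({r[2] for r in rows}),
}
print(record["genes"])  # ['BACH1', 'BRIP1']
```

Every further multi-valued property (organisms, taxa, references) would add another table and another join, and the flat result grows multiplicatively.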

  18. Protein data from UniProt in JSON
      [{"accession": "Q9NXB0-3",
        "id": "MKS1-3_HUMAN",
        "proteinExistence": "Evidence at protein level",
        "info": {
          "type": "Swiss-Prot",
          "created": "2006-03-07",
          "modified": "2019-09-18",
          "version": 135
        },
        "organism": {
          "taxonomy": 9606,
          "names": [
            { "type": "scientific", "value": "Homo sapiens" },
            { "type": "common", "value": "Human" }
          ],
          "lineage": ["Eukaryota", "Metazoa", "Chordata", "Craniata", "Vertebrata",
                      "Euteleostomi", "Mammalia", "Eutheria", "Euarchontoglires",
                      "Primates", "Haplorrhini", "Catarrhini", "Hominidae", "Homo"]
          …
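Once parsed, such an entry is a tree that can be navigated directly, without joins. The sketch below uses Python's standard `json` package on a fragment of the entry above. The fragment is closed by hand and its lineage is shortened so that it parses; the real entry continues where the slide shows "…".

```python
import json

# Shortened, hand-closed fragment of the slide's entry (an assumption made
# for illustration; the real entry has more fields and a longer lineage).
text = """
[{"accession": "Q9NXB0-3",
  "id": "MKS1-3_HUMAN",
  "organism": {
    "taxonomy": 9606,
    "names": [{"type": "scientific", "value": "Homo sapiens"},
              {"type": "common", "value": "Human"}],
    "lineage": ["Eukaryota", "Metazoa", "Chordata"]
  }}]
"""

entries = json.loads(text)   # parse: JSON text -> Python lists and dicts
entry = entries[0]

# Follow a path through the tree: entry -> organism -> names.
names = {n["type"]: n["value"] for n in entry["organism"]["names"]}

print(entry["accession"])                # Q9NXB0-3
print(names["scientific"])               # Homo sapiens
print(entry["organism"]["lineage"][-1])  # Chordata
```

Multi-valued properties such as the names and the lineage sit inside the record as lists, so there is no need for separate tables.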

  19. Protein data from UniProt in an XML doc (1)
 <?xml version="1.0" encoding="UTF-8"?>
 <uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
 xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd"> 
 <entry dataset="Swiss-Prot" created="2005-01-04" modified="2010-08-10" version="80"> 
 <accession>Q9BX63</accession> 
 <accession>Q3MJE2</accession> 
 <accession>Q8NCI5</accession> 
 <name>FANCJ_HUMAN</name> 
 <protein> 
 <recommendedName ref="1"> 
 <fullName>Fanconi anemia group J protein</fullName> 
 <shortName>Protein FACJ</shortName> 
 </recommendedName> 
 <alternativeName> 
 <fullName>ATP-dependent RNA helicase BRIP1</fullName> 
 </alternativeName> 
 <alternativeName> 
 <fullName>BRCA1-interacting protein C-terminal helicase 1</fullName> 
 <shortName>BRCA1-interacting protein 1</shortName> 
 </alternativeName> 
 <alternativeName> 
 <fullName>BRCA1-associated C-terminal helicase 1</fullName> 
 </alternativeName> 
 </protein> 
 <gene> 
 <name type="primary">BRIP1</name> 
 <name type="synonym">BACH1</name> 
 <name type="synonym">FANCJ</name> 
 </gene> 
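The XML version is a tree as well. As a minimal sketch, the fragment below can be parsed with Python's standard `xml.etree.ElementTree` module. The fragment is a shortened copy of the document above, closed by hand and without the XML declaration; the prefix `up` is an arbitrary name chosen here for the UniProt namespace.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-closed fragment of the slide's document (for illustration).
xml_text = """
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <accession>Q9BX63</accession>
    <name>FANCJ_HUMAN</name>
    <gene>
      <name type="primary">BRIP1</name>
      <name type="synonym">BACH1</name>
      <name type="synonym">FANCJ</name>
    </gene>
  </entry>
</uniprot>
"""

# All elements live in the default namespace, so lookups need a prefix mapping.
ns = {"up": "http://uniprot.org/uniprot"}

root = ET.fromstring(xml_text)
entry = root.find("up:entry", ns)
accession = entry.find("up:accession", ns).text
genes = [(n.get("type"), n.text) for n in entry.findall("up:gene/up:name", ns)]

print(accession)  # Q9BX63
print(genes)      # [('primary', 'BRIP1'), ('synonym', 'BACH1'), ('synonym', 'FANCJ')]
```

Note that `name` occurs both directly under `entry` and under `gene`; the path `up:gene/up:name` selects only the gene names.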
