Introduction to Databases Michael Schroeder BIOTEC TU Dresden - - PowerPoint PPT Presentation

introduction to databases
SMART_READER_LITE
LIVE PREVIEW

Introduction to Databases Michael Schroeder BIOTEC TU Dresden - - PowerPoint PPT Presentation

Introduction to Databases Michael Schroeder BIOTEC TU Dresden ms@biotec.tu-dresden.de Structure Motivation Types of Databases Introduction to Relational Databases Important Concepts of MySQL Usage and Example Queries


slide-1
SLIDE 1

Michael Schroeder

BIOTEC TU Dresden ms@biotec.tu-dresden.de

Introduction to Databases

slide-2
SLIDE 2

Structure

■ Motivation ■ Types of Databases ■ Introduction to Relational Databases ■ Important Concepts of MySQL ■ Usage and Example Queries

slide-3
SLIDE 3

Motivation

slide-4
SLIDE 4

Motivation

■ Interaction with frontends / GUIs ■ manual submission of queries, ■ accessing information in the web browser or via tools ■ includes frequent copy&paste of intermediate results, IDs, … ■ switiching between websites, tools ■ Advantages ■ easy to retrieve information if website/tool is well-designed ■ no programming necessary ■Disadvantages ■ no batch processing possible in many cases ■ queries limited by interface provided by web page / tool ■ difficult/impossible to integrate information from different sites

How can we directly access databases?

slide-5
SLIDE 5

What actually happens when you retrieve data online?

… Compose result web page and send it ß Send result ß Display result 4 Start program that evaluates query by accessing database… à Send query à * Display home page * enter query * press submit 3 ß Send home page ß 2 Get it and send it à Get home page à 1 Web Server Message Client

slide-6
SLIDE 6

The flood of biomedical data…

■ Since 1980, the number and size of biomedical databases has been growing exponentially. ■ How can you find sources of information you are seeking?

■ Nucleic Acids Research Database Issue in January of every year (oxfordjournals.org/nar) ■ Wikipedia Article: List of Biological Databases

Susan B. Davidson, Biol537/CIS636, Fall 2003

slide-7
SLIDE 7

Early Databank Format of PDB

http://ibiblio.org/comphist/files/images/CardReaderPunch.jpg

ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79 1CRN 70 Exemplary PDB ATOM record

Early Databank Format of PDB

http://ibiblio.org/comphist/files/images/CardReaderPunch.jpg

ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79 1CRN 70 Exemplary PDB ATOM record

slide-8
SLIDE 8

Types of Databases

slide-9
SLIDE 9

What databases are about

■ Logical organization of data (focused on application needs)

■ data models, schema design, dictionaries

■ Physical organization of data (focused on database needs)

■ Fast retrieval, indexing, compact storage of data

■ Other requirements: ■ Logging (important to know who did what to the data) ■ Security and access control (important to know who can do what) ■ Transactions and concurrency control (important when more than one person is working on database) ■ Integrity (important to ensure that only valid entries in the database) ■ Recovery (important as hardware and software can sometimes fail

slide-10
SLIDE 10

Different types of databases

■ Flat files ■ XML ■ Relational database ■ Object databases ■ Object relational databases ■ Graph databases

slide-11
SLIDE 11

Flat files

■ We can store any data in a flat file, e.g. EMBL ■ But is this a database?

■ Logical data organisation: None, unless we define one (as done for EMBL) and adhere to it, which is not enforced ■ Physical data organisation: None, we cannot optimise retrieval for common queries ■ Logging: No ■ Access control: Implicit through Unix ■ Transaction and concurrency control: None ■ Integrity: None ■ Recovery: If files are backed- up they can be recovered. However, not on the fly

ID BTBPTIG standard; genomic DNA; MAM; 3998 BP. XX AC X03365; K00966; XX SV X03365.1 XX DT 18-NOV-1986 (Rel. 10, Created) DT 20-MAY-1992 (Rel. 31, Last updated, Version 3) XX DE Bovine pancreatic trypsin inhibitor (BPTI) gene XX KW Alu-like repetitive sequence; protease inhibitor; trypsin inhibitor. XX OS Bos taurus (cow) OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; OC Eutheria; Cetartiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; OC Bos. XX RN [1] RP 1-3998 RX MEDLINE; 86158754. RX PUBMED; 2420326. RA Kingston I.B., Anderson S.; RT "Sequences encoding two trypsin inhibitors occur in strikingly similar RT genomic environments"; RL Biochem. J. 233(2):443-450(1986). XX RN [2] RX MEDLINE; 84070725. RX PUBMED; 6580617. RA Anderson S., Kingston I.B.; RT "Isolation of a genomic clone for bovine pancreatic trypsin inhibitor by RT using a unique-sequence synthetic DNA probe.";

slide-12
SLIDE 12

XML files

■ We can store any data in XML, the eXtensible Mark-up Language, e.g. Medline ■ But is this a database?

■ Logical data organisation: yes, XML schema, which is enforced ■ Physical data organisation: None, we cannot optimise retrieval for common queries ■ Logging: No ■ Access control: Implicit through Unix ■ Transaction and concurrency control: None ■ Integrity: None ■ Recovery: If files are backed- up they can be recovered. However, not on the fly

<Article> <Journal> <ISSN>0270-7306</ISSN> <JournalIssue> <Volume>19</Volume> <Issue>11</Issue> <PubDate> <Year>1999</Year> <Month>Nov</Month> </PubDate> </JournalIssue> </Journal> <ArticleTitle>Differential regulation of the cell wall integrity mitogen-activated protein kinase pathway in budding yeast by the protein tyrosine phosphatases Ptp2 and Ptp3. </ArticleTitle> <Pagination> <MedlinePgn>7651-60</MedlinePgn> </Pagination> <Abstract> <AbstractText>Mitogen-activated protein kinases (MAPKs) are inactivated by dual-specificity and protein tyrosine phosphatases (PTPs) in yeasts. In Saccharomyces cerevisiae, two PTPs, Ptp2 and Ptp3, inactivate the MAPKs, Hog1 and Fus3, with different specificities... </AbstractText> </Abstract> <Affiliation>Department of Chemistry, University of Colorado, Boulder, Colorado 80309-0215, USA. </Affiliation>…

slide-13
SLIDE 13

Introduction to Relational Databases

+-------+------+---------+ | id | name | subject | +-------+------+---------+ | 46458 | rick | bio | | 46459 | gerd | bio | | 46460 | mary | bio | | 46461 | ella | math | | 14982 | anne | math | | 46462 | paul | math | +-------+------+---------+ +-------+------+---------+ | id | prof | subject | +-------+------+---------+ | 51221 | bert | bio | | 55435 | anne | math | +-------+------+---------+ +---------+------+-----+------+ | subject | room | day | time | +---------+------+-----+------+ | bio | A | mo | 3pm | | math | B | tue | 1pm | +---------+------+-----+------+

slide-14
SLIDE 14

Relational Database

■ Central Idea: Data as relations in a table ■ E.g. SCOP, Structural Classification of Proteins

+-------+------+---------+---------+--------------------------------------+ | id | type | sccs | sid | description | +-------+------+---------+---------+--------------------------------------+ | 46457 | cf | a.1 | - | Globin-like | | 46458 | sf | a.1.1 | - | Globin-like | | 46459 | fa | a.1.1.1 | - | Truncated hemoglobin | | 46460 | dm | a.1.1.1 | - | Truncated hemoglobin | | 46461 | sp | a.1.1.1 | - | Ciliate (Paramecium caudatum) | | 14982 | px | a.1.1.1 | d1dlwa_ | 1dlw A: | | 46462 | sp | a.1.1.1 | - | Green alga (Chlamydomonas eugametos) | | 14983 | px | a.1.1.1 | d1dlya_ | 1dly A: | | 63437 | sp | a.1.1.1 | - | Mycobacterium tuberculosis | | 62301 | px | a.1.1.1 | d1idra_ | 1idr A: | +-------+------+---------+---------+--------------------------------------+

slide-15
SLIDE 15

Relational Database

■ Central Idea: Data as relations in a table ■ E.g. Employee

+-------+------+---------+---------+ | id | name | salary | role | +-------+------+---------+---------+ | 46457 | pete | 50.000 | director| | 46458 | jane | 60.000 | nurse | | 46459 | asif | 70.000 | driver | +-------+------+---------+---------+

slide-16
SLIDE 16

Relational Database

■ Central Idea: Data as relations in a table ■ E.g. pets

+----------+--------+---------+------+------------+------------+ | name | owner | species | sex | birth | death | +----------+--------+---------+------+------------+------------+ | Whistler | Gwen | bird | | 0000-00-00 | NULL | | Chirpy | Gwen | bird | f | 1998-09-11 | 0000-00-00 | | Bowser | Diane | dog | m | 1979-08-31 | 1995-07-29 | | Fang | Benny | dog | m | 1990-08-27 | 0000-00-00 | | Buffy | Harold | dog | f | 1989-05-13 | 0000-00-00 | | Claws | Gwen | cat | m | 1994-03-17 | 0000-00-00 | | Fluffy | Harold | cat | f | 1993-02-04 | 0000-00-00 | | Slim | Benny | snake | m | 1996-04-29 | 0000-00-00 | +----------+--------+---------+------+------------+------------+

slide-17
SLIDE 17

Relational Database

■ Central Idea: Data as relations in a table ■ E.g. school

+-------+------+---------+ | id | name | subject | +-------+------+---------+ | 46458 | rick | bio | | 46459 | gerd | bio | | 46460 | mary | bio | | 46461 | ella | math | | 14982 | anne | math | | 46462 | paul | math | +-------+------+---------+ +-------+------+---------+ | id | prof | subject | +-------+------+---------+ | 51221 | bert | bio | | 55435 | anne | math | +-------+------+---------+ +---------+------+-----+------+ | subject | room | day | time | +---------+------+-----+------+ | bio | A | mo | 3pm | | math | B | tue | 1pm | +---------+------+-----+------+

slide-18
SLIDE 18

Bioinformatics: 10 years of resistance to flattening!

■ Why the resistance?

■ Bioinformatics data is naturally nested ■ Extensive Use of sets and lists ■E.g. Swissprot: Features, keywords, References

■ Such data can be flattened, but the resulting relational schema is hard to understand hence it is hard to formulate queries.

■ For example, storing the SWISSPROT entry in a relational database would split it over 15-20 tables.

slide-19
SLIDE 19

Relational Databases

■ RDB introduced in 1970 by Codd ■ Took off in the 80s ■ In the business world, relational databases are the rule (Oracle, PostgreSQL, MySQL, SQLite, MS Access). ■ Large biomedical databases typically use a relational technology; but there are also a lot of homegrown systems (ACeDB, SRS indexed files). ■ Data is almost always viewed and exported in a variety

  • f flat file formats (EMBL, GenBank among others)
slide-20
SLIDE 20

§ SQL: “Structured Query Language”—the most common standardized language used to access databases. § SQL has several parts: DDL – Data Definition Language {Defining, Deleting, Modifying relation schemas} DML – Data Manipulation Language {Inserting, Deleting, Modifying tuples in database} § Embedded SQL – defines how SQL statements can be standardized and used with programming languages

SQL: A Query Language

CREATE, DROP, RENAME, ... SELECT, INSERT, DELETE, MERGE, ...

slide-21
SLIDE 21

■ MySQL, the most popular Open Source SQL database, is developed, distributed and supported by MySQL AB. ■ MySQL is a relational database management system. ■ MySQL software is Open Source.

  • Written in C and C++. Tested with a broad range of different

compilers. ■Works on many different platforms. ■APIs for C, C++, Eiffel, Java, Perl, PHP, Python, Ruby, and Tcl. ■You can find MySQl manual and documentation at:

  • www.mysql.com/documentation/
  • You can download and install MySQL on your own computer

(under Windows, Mac OS, Linux)

MySQL Database System

slide-22
SLIDE 22

Important Concepts

  • f MySQL
slide-23
SLIDE 23

Relational Schema

The schema of a database is a set of relation names, their attribute names and types. Example:

Sequence(ID: int, Length: int, Seq: string, Mod: date) Feature(ID: int, Type: string, From: int,To: int)

§ Sequence and Feature are relation names § ID, Seq, Mod, etc are attribute names § Int, string and date are types

Susan B. Davidson, Biol537/CIS636, Fall 2003

slide-24
SLIDE 24

Relation Instance

An instance of a relation is a set of tuples of the relation. A tuple of Sequence could be:

( ID: 82814, Length: 597, Seq:“ccagctaaccg”, Mod: 1-7-95)

A tuple of Feature could be:

(ID: 82814, Type:“source”, From:1, To:8959)

Susan B. Davidson, Biol537/CIS636, Fall 2003

slide-25
SLIDE 25

Tabular representation

Typically, relations are displayed as tables Sequence: Feature:

ID Type From To 82814 “Source” 1 184 82814 “Gene” 23 65

}tuples

attributes

ID Length Seq Mod 82814 597 “ccagctaa...” 1-07-95 98608 18976 “accgcct...” 2-14-98

Susan B. Davidson, Biol537/CIS636, Fall 2003

slide-26
SLIDE 26

Entities and Relationships

§ One-to-many relationship from Sequence to Feature § each sequence can have many features § but a feature can be on at most one sequence. § The existence of a feature depends on the existence

  • f the owning sequence Þ referential integrity

§ Other types of relationships possible § One-to-one § Many-to-many § Many-to-one

Susan B. Davidson, Biol537/CIS636, Fall 2003

slide-27
SLIDE 27

Integrity Constraints

§ ID is the primary key of Sequence: It is not possible

for two tuples of Sequence to have the same ID.

§ In Feature, there is a referential integrity constraint

  • n ID: Every ID in Feature must appear in some tuple

in Entry. § It is specified in the data definition language (DDL). § It is enforced by the system as updates are made to the instance.

Susan B. Davidson, Biol537/CIS636, Fall 2003

slide-28
SLIDE 28

DDL for this relational schema

CREATE TABLE Sequence (Id INTEGER, Length INTEGER, Sequence LONGCHAR, Mod DATE, PRIMARY KEY (Id) ) CREATE TABLE Feature (Id INTEGER, Type CHAR(15), From INTEGER, To INTEGER, PRIMARY KEY (Id, Type, From, To) FOREIGN KEY (Id) REFERENCES Sequence ON DELETE CASCADE ON UPDATE CASCADE)

Susan B. Davidson, Biol537/CIS636, Fall 2003

slide-29
SLIDE 29

Locking and Concurrency

Zawodny: High Performance MySQL

§ Two users read from a table at the same time? § One user reads and the other one writes to a table at the same time? § Both write to one table at the same time?

What if ...

slide-30
SLIDE 30

Locking and Concurrency

Zawodny: High Performance MySQL

Solution: Read and write locks

§ What is locked? -> Lock Granularity § Table locks (most basic lock) § Fine, mostly read operations on MySQL § Inexpensive lock, but low concurrency § Page locks (portion of tables) § Concurrency depends on size of pages § End of table may be locked frequently § Row locks § High overhead (expensive), high concurrency § Important if multiple users need to update the same row at the same time

slide-31
SLIDE 31

Transactions

Zawodny: High Performance MySQL

Scenario: Banking Application

§ Banks with one table for checkings and savings § Jane wants to transfer 200 € to her savings § Multiple steps necessary for transaction? § What happens if the server crashes at a certain point? Where is the money?

slide-32
SLIDE 32

Transactions

Zawodny: High Performance MySQL

Example Transaction for Banking

BEGIN; [step 1] SELECT balance FROM checking WHERE customer_id = 10233276; [step 2] UPDATE checking SET balance = balance – 200.0 WHERE customer_id = 10233276; [step 3] UPDATE savings SET balance = balance + 200.0 WHERE customer_id = 10233276; COMMIT;

slide-33
SLIDE 33

Transactions

Zawodny: High Performance MySQL

Solution: ACID Transactions

§ Atomicity: Transactions forms an indivisible unit § Consistency: Databases have to move from one consistent state to the other § Isolation: Results of transaction are invisible to other transactions until the transaction is complete § Durability: Once commited, the results of a transaction are permanent

slide-34
SLIDE 34

Usage and example queries

SELECT * FROM Sequence WHERE Length < 10000;

slide-35
SLIDE 35

Querying relational databases

§ SQL as a standard for querying relational databases § Curious mixture of the relational algebra and relational calculus (formal languages) § allows new relations of information to be computed from a set of relations. § SQL is relationally complete § Enables you to link, filter or rename relations § SQL is powerful, offers all basic relational algebra

  • perations

(UNION, SELECTION, DIFFERENCE, etc.) § Unlike the relational algebra, it allows other useful stuff: COUNT, SUM, MIN, MAX, etc.

Susan B. Davidson, Biol537/CIS636, Fall 2003

slide-36
SLIDE 36

Simple queries

Typically, relations are displayed as tables Sequence: Feature:

ID Type From To 82814 “Source” 1 184 82814 “Gene” 23 65

}tuples

attributes

ID Length Seq Mod 82814 597 “ccagctaa...” 1-07-95 98608 18976 “accgcct...” 2-14-98

Susan B. Davidson, Biol537/CIS636, Fall 2003

slide-37
SLIDE 37

Simple queries

SELECT * FROM Sequence WHERE Length < 10000; SELECT Type FROM Feature; Print all sequences with length less than 10000. Print the type of all features.

ID Length Seq Mod 82814 597 “ccagctaa...” 1-07-95 76582 9976 “actgga…” 2-29-00 Type “Source” “Gene” “Gene” “Gene” “Source”

Susan B. Davidson, Biol537/CIS636, Fall 2003

slide-38
SLIDE 38

Summary

■ MySQL is a relational database ■ The language SQL lets you query such databases ■ SQL is “relationally complete”: allows you to perform

  • perators in an algebra of relations (the relational

algebra). ■ Additional features: string comparisons, set membership, arithmetic and grouping. ■ In contrast, flat files / XML files offer less features ■ Transactions as important feature for safety /consistency ■ Locks as important feature for concurrency / multiple users on one database

Susan B. Davidson, Biol537/CIS636, Fall 2003