SLIDE 1

Some notes on type systems and type theory

Data types and their application in computer programming languages have been the subject of extensive studies, but these made little progress until the λ calculus was introduced as an instrument for them. Typed λ calculus became one of the most important tools for these studies and for the study of type systems. Its syntax and rules are simple, and the most astonishing thing is that such a simple syntax with so few rules allows for such profound reasoning. Even more astonishing is the fact that you can use the theories to build efficient type systems, such as the Java type system. An "in-depth" understanding of the λ calculus and type theory is far beyond the scope of this course.

DD2471 (Lecture 02) Modern database systems & their applications Spring 2012 1 / 35

SLIDE 2

Some notes on type . . . , history

The ideas behind the λ calculus are not new. It all started with Leibniz, who stated around 1700 that he wanted to create a universal mathematical (or symbolic) language good enough to formulate all possible problems, and a method (algorithm) that could be used to formulate solutions to all the problems one could formulate in the universal language. If you stick to mathematical problems, creating the language is simple enough (and Gödel numbering, much later, told us that most problems can be coded as numbers and numeric computations). Set theory as formulated in first-order predicate logic is quite sufficient. The algorithm was a bigger problem. Once the notion of an algorithm had been formulated in the λ calculus by Alonzo Church (its founder) and, independently, by Alan Turing in his "Turing machine" (the name came far later), one could prove that such an algorithm cannot be designed. Consequently you had a definition of "computability" (or rather "provability") and also a proof that some problems have no solutions (some computations cannot be performed), as well as languages that are good for solving most (but not all) problems.

SLIDE 3

Some notes on type . . . , history . . .

Turing also showed that his notation and the λ calculus were equivalent. The λ calculus may be viewed as a language in its own right and serves as the base for a number of languages, e.g. Lisp, Scheme, Clean, ML, Miranda and Haskell. The Turing machine, on the other hand, is the model for the von Neumann machines (= computers), which are all conceptually Turing machines with random-access memory. Assembler languages are direct models of Turing machines, while imperative languages are all higher-order Turing machines. If you want to study extensible type systems you may start with untyped λ calculus to get the basics and then continue with typed λ calculus to understand type theory. This is just a pointer to the importance of λ calculus in data type and type systems research.

SLIDE 4

For those who want to dig deeper into type systems

  • You can take a course, but not at KTH. There are courses at Chalmers in Gothenburg, at Uppsala University and in Lund.
  • Or take up programming in ML, Haskell or Scheme and get into advanced problems.
  • Or read Barendregt: "The Lambda Calculus: Its Syntax and Semantics", 1984.
  • Or (more easily accessed) Hindley, Seldin: "Introduction to Combinators and λ-Calculus", 1986, Cambridge University Press (I have a reprint from 2008).
  • Or read the not so easily accessed article by Barendregt (link under "λ calculus (type systems)" on the course link page).

It is important to emphasize the importance of the λ calculus in the development of efficient type systems. E.g. the Java type system is built on modern type-theoretical models based on the λ calculus.

SLIDE 5

Some notes on data types in modern database applications, or data types in the world of mass media

  • text: formatted or not, free text or XML
  • graphics: vector graphics (CGM, FIG, PICT, PostScript)
  • images: bitmaps, photos (JPEG, MPEG, GIF)
  • animations: sequences of graphics
  • video: sequences of photos
  • audio: sound, music; mostly digital or digitized
  • combinations of all these

SLIDE 6

Some notes on data types . . .

Common factor: all have a rich internal structure and are big. (A simple sound sequence is 8 KB, a 100-page text 500 KB, a colour image 6-80 MB, and 5 minutes of high-quality video ≈ 50 GB.)

  • It’s normal to play (play back) several different data streams simultaneously.
  • Tough requirements on storage media and application programs.
  • Download/retrieval time may be a problem
  • Differences in download/retrieval time are another
  • Synchronization may be required, time delay calculations too
  • Intricate semantics with complex objects.
  • Querying is difficult. Meta-data is important while meta-data capture is complicated.
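
The size figures above can be checked with simple arithmetic. The sketch below assumes uncompressed data and, purely as an illustration, a 1920 × 1080 frame with 3 bytes per pixel at 25 frames per second. None of these parameters is stated on the slide, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of one uncompressed image (assumption: a fixed number
   of bytes per pixel and no padding between rows). */
uint64_t raw_image_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    /* Widen before multiplying so the product cannot overflow 32 bits. */
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Size in bytes of an uncompressed video: one raw image per frame. */
uint64_t raw_video_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel,
                         uint32_t frames_per_second, uint32_t seconds)
{
    return raw_image_bytes(width, height, bytes_per_pixel)
           * frames_per_second * seconds;
}
```

With these assumed parameters one frame is 6 220 800 bytes (about 6 MB), and 5 minutes (300 s) of video is 46 656 000 000 bytes, which is in line with the ≈ 50 GB quoted above.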

SLIDE 7

Meta-data

Meta-data is data that is needed to interpret other data, thus explaining the meaning of the data. It is important in information management and very important when managing complex data. In an RDBMS meta-data is used to describe classes of objects (table content). With new data types you may have to use meta-data to describe each object. Thus every row in every table may need to be associated with its own meta-data. Meta-data may be used in indexes as well as for linguistic annotation.

SLIDE 8

Meta-data . . .

Meta-data       Data
Type            Schoolbus
# Seats         54
Total Weight    20 tons
Price           32 k$

Analogue data must be digitized. The accuracy of digitized data depends on the sampling frequency. An image has m rows and n columns; each component of this 2D array is called a pixel and contains information about colour and intensity.

SLIDE 9

Meta-data . . .

Data may be external to or internal in the database. Internally, data is stored as a BLOB (binary large object) or a CLOB (character large object).
Lossless compression: 1/3 of the original size.
Lossy compression: 1/80 of the original size, with not too irritating (??) quality loss.

SLIDE 10

Meta-data . . .

JPEG (Joint Photographic Experts Group) has 4 modes:

  • sequential: left-to-right, top-to-bottom
  • progressive: starting with a few pixels, gradually adding pixels until the whole image is shown
  • lossless: exact correspondence with the original image
  • hierarchical: a variety of versions with different degrees of quality loss

SLIDE 11

Meta-data . . .

There is a large number of file formats and compression methods. What kind of meta-data do we need?

  • structure, colour, . . . (image)
  • frequencies (sound)
  • type-face, font-size, . . . (text)
  • direction of motion, lighting (video)
  • Id for a speaker, place and time for speech
  • . . .

SLIDE 12

Meta-data . . .

You need to extract or generate meta-data.
Images: colour and structure.
Histogram for colour (RGB, CMYK, Y'UV ...). If the colour model is not Y'UV, one may have to add luminous intensity as well.
Meta-data may be generated:

  • Manually, which is time-consuming.
  • Semi-automatically, where automatically generated meta-data is supplemented with manually generated meta-data.
  • Automatically.
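
Automatic generation of the colour histogram mentioned above can be sketched as follows. This is a minimal illustration and not a description of any particular DBMS. It assumes an interleaved 8-bit RGB pixel buffer and derives the luminous intensity (luma) with the ITU-R BT.601 weights 0.299, 0.587 and 0.114. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BINS 256

/* One 256-bin histogram per colour channel plus one for the derived luma. */
typedef struct {
    unsigned long r[BINS], g[BINS], b[BINS], luma[BINS];
} colour_histogram;

/* rgb points at pixel_count pixels stored as R,G,B byte triples (assumed layout). */
void build_histogram(const unsigned char *rgb, size_t pixel_count, colour_histogram *h)
{
    assert(rgb != NULL && h != NULL);
    memset(h, 0, sizeof *h);
    for (size_t i = 0; i < pixel_count; i++) {
        unsigned r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        /* BT.601 luma in integer arithmetic, rounded to nearest; stays in 0..255. */
        unsigned y = (299u * r + 587u * g + 114u * b + 500u) / 1000u;
        h->r[r]++;
        h->g[g]++;
        h->b[b]++;
        h->luma[y]++;
    }
}
```

The resulting counts are small compared with the image itself, which is what makes them usable as stored meta-data.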

SLIDE 13

Fetching data

Operation:    Text / Sound / Image
Manipulation: Character / Waveform / Geometric
              String / Sound editor / Pixel
              Editing, Filtering
Presentation: Formatting / Synchronization / Composition
              Decoding / Decompression / Decompression
              Conversion, Conversion
Analysis:     Indexing / Indexing / Indexing
              Searching / Searching / Searching

SLIDE 14

Fetching data . . .

    select wine_name as name, price, dbms_lob.substr(note, 10, 20) as "comment"
    from wine_list
    where dbms_lob.instr(note, 'poise, elegance and balance') <> 0

Non-standard LOB manipulation packages exist for most DBMS. Traditional data is managed as usual:

    select category, year, avg(price) as average_price, max(price) as highest
    from wine_list
    where region = 'Bordeaux'
    group by category, year
    having year between 1995 and 1998
    order by category, year

SLIDE 15

Presentation of results

  • Combining result sets
  • Results that resemble a given example
  • Results that may be ordered according to some criteria

On a lexical level
On a syntactic level
On a semantic level

[Figure: Presentation model, Dialogue model, Context, Interaction model, Application model, Application wrapper]

SLIDE 16

Presentation of results . . .

Example:

Elmander meets the ball at the far end of the box after a corner and nets as if he had a lacrosse racket for a right foot. Not a chance for the goalkeeper. Niicee!

SLIDE 17

Fetching data . . .

Every type of data has a set of features of interest (which may vary with the application). Images may be compared with respect to colour, form and structure. If the user asks for similar objects, all objects that are close enough (??) in colour, form and structure are retrieved.
Two issues: What information may be retrieved? How can we retrieve it?
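
What "close enough" means has to be decided per application. One possible reading, sketched here for colour only, is to compare two colour histograms and accept the pair if their distance is below a threshold. The distance measure (half the L1 distance between the normalised histograms, giving a value between 0 and 1), the threshold and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of all bin counts. */
static double total(const unsigned long *h, size_t bins)
{
    double sum = 0.0;
    for (size_t i = 0; i < bins; i++)
        sum += (double)h[i];
    return sum;
}

/* 0.0 = identical colour distribution, 1.0 = no colour in common. */
double histogram_distance(const unsigned long *a, const unsigned long *b, size_t bins)
{
    assert(a != NULL && b != NULL);
    double ta = total(a, bins), tb = total(b, bins);
    if (ta == 0.0 || tb == 0.0)          /* empty histogram: only equal to another empty one */
        return (ta == tb) ? 0.0 : 1.0;
    double dist = 0.0;
    for (size_t i = 0; i < bins; i++) {
        /* Normalise so that images of different size can be compared. */
        double d = (double)a[i] / ta - (double)b[i] / tb;
        dist += (d < 0.0) ? -d : d;
    }
    return dist / 2.0;
}

/* The threshold is application dependent (the "??" in the text above). */
int close_enough(const unsigned long *a, const unsigned long *b, size_t bins, double threshold)
{
    return histogram_distance(a, b, bins) <= threshold;
}
```

Form and structure would need their own feature vectors and distance measures; the same thresholding idea applies to each of them.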

SLIDE 18

Fetching data . . .

Three levels of complexity when querying:

  • Level 1: Fetch taking primitive features into consideration: colour, form, structure, spatial location, motion. E.g.: Find a sequence where a car is moving from left to right.
  • Level 2: Logical features related to an object. E.g.: Find a sequence with a starting car.
  • Level 3: Abstract attributes associated with an object. E.g.: Find an image with a car that stopped due to a carburettor problem.

SLIDE 19

Fetching data . . .

Levels 2 and 3 are called semantic media retrieval. Successful retrieval is restricted to level 1. The difference between simple questions as in level 1 and questions in levels 2 and 3 is referred to as the "semantic gap".

How can we retrieve information?

  • Attribute-based systems: same as in RDBMS.
  • Text-based systems: use of supplied additional information (descriptions, ...) combined with traditional methods.
  • Content-based systems: details that are automatically extracted from the data.

SLIDE 20

Text

Traditional DBMS have inadequate support for text. The only supported text data type was "string" (char, varchar), a character string of at most 255 characters. You could query these on exact match or with "wild cards". Later (after 1996) support was added for binary large objects (BLOB), which may be used to store text, but without the possibility of making use of the text structure or content, which limits meaningful querying.

Mainly you look for a text with either a specific content or a specific structure. Thus, specialized DBMS for text management were developed. These had a rich set of operators for searching for text content and "similarity" in content. "Similar" could mean tolerant to incorrect spelling or similarity in pronunciation.
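
Tolerance to incorrect spelling is commonly expressed as an edit distance: the number of single-character insertions, deletions and substitutions needed to turn one word into the other. The sketch below uses the Levenshtein distance as one possible measure; the text does not say which measure the specialized DBMS actually used, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Levenshtein distance between a and b, using one row of the
   dynamic-programming table. Returns (size_t)-1 if memory runs out. */
size_t edit_distance(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t n = strlen(a), m = strlen(b);
    size_t *row = malloc((m + 1) * sizeof *row);
    if (row == NULL)
        return (size_t)-1;
    for (size_t j = 0; j <= m; j++)
        row[j] = j;                      /* distance from "" to b[0..j-1] */
    for (size_t i = 1; i <= n; i++) {
        size_t diag = row[0];            /* value from previous row, previous column */
        row[0] = i;
        for (size_t j = 1; j <= m; j++) {
            size_t up = row[j];          /* value from previous row, same column */
            size_t best = diag + (a[i - 1] == b[j - 1] ? 0 : 1);   /* substitute or keep */
            if (up + 1 < best)
                best = up + 1;           /* delete a[i-1] */
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;   /* insert b[j-1] */
            row[j] = best;
            diag = up;
        }
    }
    size_t d = row[m];
    free(row);
    return d;
}
```

A query for "database" could then accept stored words within distance 1 or 2, so that a misspelling such as "databse" still matches. Similarity in pronunciation needs a phonetic encoding instead and is not covered by this sketch.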

SLIDE 21

Text . . .

In free-text DBMS special indexes were generated (some still do this). Indexes were built containing "word stems" and inflexion patterns (Swedish: böjningsmönster).

Each record in the index was a triple with a stem, an inflexion pattern and a list of pairs, each containing a document identifier and an offset to the location of the word.

In spite of dropping "stop" words ("and", "or", "the", ...) indexes became many times larger than the indexed documents.

[Figure: index example linking the Swedish word stems "ord", "ordstam" and "tall" to the inflexion patterns (−ar, −arna, −en) and (−, −en, −et)]

SLIDE 22

Text . . .

Therefore methods were invented to rapidly find keywords and for quick matching against string patterns. To find a text containing e.g. "media" and "database" stored in a BLOB you must either

  • fetch the BLOB, assign the type text to it and look for the keywords (in a possibly multi-GB object), or
  • store enough information about the object to avoid fetching it unnecessarily (thus an extensive index).

Even if we use a CLOB we have the same situation, though the type conversion won't happen. There is a large number of methods for navigating in big chunks of text.

SLIDE 23

Text matching, naive method

Note: All intuitive or naive methods are lousy. Suppose we have a text T[n] and a pattern M[m], where n and m are the lengths of the text and the pattern.

    for (i = 0; T[i] != '\0'; i++) {
        /* advance j while text and pattern agree */
        for (j = 0; T[i+j] != '\0' && M[j] != '\0' && T[i+j] == M[j]; j++)
            ;
        if (M[j] == '\0')
            printf("match at %d\n", i);   /* whole pattern matched */
    }

The two loops are nested: the outer one runs O(n) times and, for each of those, the inner one runs up to O(m) times. Thus the whole algorithm has O(m × n) order of complexity. That's not very rapid (mostly it is faster than the worst case, but still ...).

SLIDE 24

Text matching, naive method . . .

Suppose we want to look for the pattern 'nano' in the text 'banananobano'. Let's do it manually to really dig into the method. Every row represents one iteration of the outer loop, where i runs along the y-axis and j along the x-axis. X means "no match"; otherwise the matching pattern is written out.

SLIDE 25

Text matching, naive method . . .

      0  1  2  3  4  5  6  7  8  9 10 11
T:    b  a  n  a  n  a  n  o  b  a  n  o
 0:   X
 1:      X
 2:         n  a  n  X
 3:            X
 4:               n  a  n  o
 5:                  X
 6:                     n  X
 7:                        X
 8:                           X
 9:                              X
10:                                 n  X
11:                                    X

SLIDE 26

Text matching, naive method . . .

It is obvious that some of the work is unnecessary. At iteration 3 we have already seen that the next character is 'a', and thus we may skip this step; for the same reason we may skip lines 5, 6 and 7. As the pattern passes the end of the text in lines 9, 10 and 11, we may skip these steps too.

We need different algorithms. In 1977 two emerged, one designed by Knuth, Morris and Pratt and one by Boyer and Moore. Both are in the complexity class O(n).

SLIDE 27

Text matching (Knuth, Morris and Pratt)

Knuth, Morris and Pratt introduced the term "overlap", where you compare the partial match with the pattern. With their technique they jump exactly the lines I suggested. The algorithm:

    i = 0;
    while (i < n) {
        for (j = 0; T[i+j] != '\0' && M[j] != '\0' && T[i+j] == M[j]; j++)
            ;
        if (M[j] == '\0')
            we-have-a-match;
        /* shift past the part of the partial match that cannot start a new match */
        i = i + max(1, j - overlap(M[0..j-1], M[0..m-1]));
    }

SLIDE 28

Text matching (Knuth, Morris and Pratt) . . .

An overlap between two strings s1 and s2 is the longest suffix of s1 that is a prefix of s2, but not the entire s1 or the entire s2. It is possible to trim the algorithm further and precompute the overlap function. Also, normally you build a structure that works as a finite state machine to handle the matching step. The final algorithm is fast. If you are interested, there are many algorithms on

http://www-igm.univ-mlv.fr/~lecroq/string/

There are also literature references (as well as on the course web pages).
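
A trimmed version with a precomputed overlap function can be sketched as below. This is the textbook formulation of Knuth-Morris-Pratt with a "failure" table, not necessarily the exact variant from the slides, and the function names are my own. Entry fail[j] holds the overlap of M[0..j] with the pattern itself, so the text position never moves backwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* fail[j] = length of the longest proper suffix of M[0..j]
   that is also a prefix of M (the precomputed overlap). */
static void build_overlap_table(const char *M, size_t m, size_t *fail)
{
    size_t k = 0;
    fail[0] = 0;
    for (size_t j = 1; j < m; j++) {
        while (k > 0 && M[j] != M[k])
            k = fail[k - 1];             /* fall back to a shorter overlap */
        if (M[j] == M[k])
            k++;
        fail[j] = k;
    }
}

/* Index of the first occurrence of M in T, -1 if there is none,
   -2 if the table could not be allocated. */
long kmp_search(const char *T, const char *M)
{
    assert(T != NULL && M != NULL);
    size_t n = strlen(T), m = strlen(M);
    if (m == 0)
        return 0;                        /* the empty pattern matches at once */
    size_t *fail = malloc(m * sizeof *fail);
    if (fail == NULL)
        return -2;
    build_overlap_table(M, m, fail);

    long result = -1;
    size_t j = 0;                        /* number of pattern characters matched so far */
    for (size_t i = 0; i < n; i++) {
        while (j > 0 && T[i] != M[j])
            j = fail[j - 1];             /* reuse the overlap instead of restarting */
        if (T[i] == M[j])
            j++;
        if (j == m) {
            result = (long)(i + 1 - m);
            break;
        }
    }
    free(fail);
    return result;
}
```

For the pattern 'nano' the table is 0, 0, 1, 0, and searching 'banananobano' reports the match at position 4, the same row as in the manual trace. Each text character is read once, so the search takes O(n + m) steps.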

SLIDE 29

Text matching, other methods

You can use an inverted index at the price of an extra 10-300% overhead (compared to the length of the text). You can organize a hash index on top of that for further speed (at yet an additional cost). Beyond comparison the fastest way, but at a high price.

A nice method of the "quick-and-dirty" type is to use signature files. Suppose that we want to find "data" and "retrieval" in a text file that contains (for simplicity) the words "data" and "base". It is an old idea (Mooers 1949!!) in which you assign a unique random bit pattern to every meaningful unit in the text. Then you use an OR operation to merge all the bit patterns and use the result as a signature for the text. With an adequate signature length (we used 1024 bits) and a sufficient number of bits set to one (we used 509, the biggest prime less than half the length of the signature) it works astonishingly well. However, it has drawbacks ...

SLIDE 30

Text matching, other methods . . .

In this example I use 4-bit patterns in a 12-bit signature to make it more lucid. In a real case this is far too short a signature, with far too few bits set. Let M mean "match" and X mean "does not match".

Word             Signature
data             001 000 110 010
base             000 010 101 001
retrieval        001 001 001 100
doc signature    001 010 111 011

for 'data' we get:    M   M   M   M
for 'retrieval':      M   X   M   X

SLIDE 31

Text matching, other methods . . .

If you choose the length of the signature so that it is about half full of ones, you get a small number of false alarms, but it is a brilliant idea for filtering away a large number of documents that do not match the search criteria. After all, "no" actually means "no", while a match rather means "maybe". Slower than inverted indexes, but a nice "quick-and-dirty" filter.
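
The signature test itself is only a couple of bit operations. The sketch below holds a signature in a single 64-bit word, which is enough for the 12-bit example but far shorter than the 1024 bits mentioned earlier. The bit patterns used in the note after the code are the ones from the example table; the type and function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t signature_t;

/* Superimpose (OR together) the bit patterns of all words in a document. */
signature_t doc_signature(const signature_t *word_patterns, size_t count)
{
    assert(word_patterns != NULL || count == 0);
    signature_t sig = 0;
    for (size_t i = 0; i < count; i++)
        sig |= word_patterns[i];
    return sig;
}

/* 0 means the word is definitely absent; 1 only means "maybe",
   because other words may have set the same bits. */
int may_contain(signature_t doc_sig, signature_t word_pattern)
{
    return (doc_sig & word_pattern) == word_pattern;
}
```

With data = 001 000 110 010 (0x232) and base = 000 010 101 001 (0x0A9) the document signature becomes 001 010 111 011 (0x2BB). The test then answers "maybe" for 'data' and a definite "no" for 'retrieval' (0x24C), in agreement with the M and X rows of the example.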

SLIDE 32

Text, searching sets of documents

So far we have looked at methods to query (or not have to query) single documents, but the last filtering method points at the possibility of organizing or classifying large amounts of documents with respect to how relevant they are to a database query. Thus it must work "ad hoc". We can build a matrix of the terms in the texts and calculate the number of hits per document at the same time as we build a temporary index. For both the matrix and the index we skip "stop words".

doc/word: data, base, retrieval, deletion
doc1: 2 3 2
doc2: 1 2 2
doc3: 1 3
doc4: 2 3
doc5: 2 4
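
One row of such a matrix can be computed with a simple word scan. The sketch below is an illustration under simplifying assumptions: words are runs of letters, matching is case-sensitive and without stemming, and stop words are skipped simply by not including them in the term list. The function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Number of times term occurs as a whole word in text. */
size_t count_term(const char *text, const char *term)
{
    assert(text != NULL && term != NULL);
    size_t len = strlen(term), hits = 0;
    const char *p = text;
    while (*p != '\0') {
        while (*p != '\0' && !isalpha((unsigned char)*p))
            p++;                         /* skip separators */
        const char *start = p;
        while (*p != '\0' && isalpha((unsigned char)*p))
            p++;                         /* scan one word */
        if (len > 0 && (size_t)(p - start) == len && strncmp(start, term, len) == 0)
            hits++;
    }
    return hits;
}

/* Fill one matrix row: hits[i] = number of occurrences of terms[i] in text. */
void count_terms(const char *text, const char *const terms[], size_t term_count, size_t hits[])
{
    for (size_t i = 0; i < term_count; i++)
        hits[i] = count_term(text, terms[i]);
}
```

Calling count_terms once per document with the terms data, base, retrieval and deletion gives one row each. For the text "data base and data retrieval from the base" the row is 2, 2, 1, 0.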

SLIDE 33

Text, searching sets of documents . . .

Depending on the size of the register for a document and which of the words the document contains, we place a marker for each document in a multi-dimensional lattice, with one dimension per search criterion. Then we manipulate the marker positions according to some "closeness" criterion and we get a number of "clusters". Also, we may use statistics and signatures to cleanse the clustering and get a small enough number of documents to actually consider. We may also search with some "fuzziness" in the search criteria for text (and actually images too).

SLIDE 34

Text, searching sets of documents . . .

A number of methods already exist for querying the internet:

  • Cyclic incremental search. You find a document that is "close enough" to the document you are looking for and let the search engine or MMDBMS use it as a template to find "similar" documents.
  • You let a macro traverse all links in a document to find something like a convex hull.
  • Annotation of the result set to widen or restrict a search. You let the user add or delete search criteria during the run.
  • Visualization of spatial relations. Searching complex data is essentially spatial. Use it to find new proximity measures or concepts to widen or restrict the search.
  • Visualization of temporal aspects. Same thing but with focus on time.

SLIDE 35

Interesting system – Carrot2

An interesting research project that gave us considerably better hit rates on the internet, showing us that modern math works and can be used in implementations. You perform clustering with the proximity measure adjusted according to the context of all search parameters, and you get a noticeably better search result than without clustering. Most search engines on the internet use it or something similar to it today.

http://www.carrot2.org/
http://project.carrot2.org/
