1 Data and information services: Scientific data directories/ - - PDF document

DL - 2004 Introduction – Beeri/Feitelson 1

Digital Libraries

  • From Information retrieval to search engines
  • e-books, e-libraries & related topics

General background

Driving forces – rapid evolution of:

Hardware:

  • Computing power
  • Memory
  • Networking (internet)

Software:

  • DB systems
  • IR systems
  • Hypertext (WWW)
  • GUI/ presentation tools (html)

Consequences:

  • Transformation of existing applications
  • Generation of new applications

related to data collection, organization, classification, access

Some examples:

(Classical) Libraries:

  • Automation of catalogs (old stuff)
  • On-line e-journals
  • Collections of born-digital materials

New:

  • Digitized collections (images, maps, ...)
  • On-line archives
  • On-line, virtual museums

Digital libraries & bibliographic services:

  • ACM Digital Library (acm-diglib): collection of all (full) papers from ACM journals
  • SIGMOD digital anthology (anthology)
  • DBLP (dblp): collection of bibliographic information
  • Citeseer (citeseer): citation and impact factor data

Portals, directories, search engines

  • Yahoo – a directory (manual labor)
  • Google – a search engine (fully automatic): IR technology & hypertext structure
  • Amazon (& similar on-line sales companies): book/ dvd/ .. portal/ directory

Data and information services:

  • Medline & US National Library of Medicine (nlmed)
  • On-line encyclopedias: Wikipedia, art encyclopedia
  • Lexis-nexis (and many like it)

Scientific data directories/ repositories/ portals:

  • Bioinformatics: over 500 db’s/ data sources (bio-source)
  • Astronomy: the world-wide telescope project (wwt)
  • Classical humanities: the Perseus digital library (perseus)

Issues:

  • Heterogeneity: data transformation/ integration
  • Dependence on experiments: scientific experiment meta-data
  • Huge volumes: fast, approximate, on-line stream processing

New kinds of services:

  • Query subscription on XML/ data streams (niagara, niagaracq)
      – news
      – stocks
      – satellite data

Issues:

  • Fast streams
  • Millions and more queries & subscribers

Needs ultra-fast:

  • Stream processing
  • Query evaluation/ data routing
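
The routing problem above can be illustrated with a minimal sketch. It is not the NiagaraCQ design: it assumes that a subscription is just a set of keywords that must all occur in a stream item, and it indexes each subscription under one of its terms so that an incoming item is compared only against subscriptions sharing a word with it. The class and method names (SubscriptionIndex, subscribe, route) are hypothetical.

```python
from collections import defaultdict


class SubscriptionIndex:
    """Routes stream items to subscribers whose keyword queries they satisfy.

    Assumption: a subscription is a set of keywords that must all occur
    in the item. Real systems support far richer query languages.
    """

    def __init__(self):
        self._queries = {}                  # subscriber id -> required terms
        self._by_term = defaultdict(set)    # term -> subscriber ids

    def subscribe(self, subscriber, keywords):
        terms = frozenset(k.lower() for k in keywords)
        if not terms:
            raise ValueError("a subscription needs at least one keyword")
        self._queries[subscriber] = terms
        # Index under a single term only: an item lacking that term
        # cannot match, so the subscription is never even examined.
        self._by_term[min(terms)].add(subscriber)

    def route(self, item_text):
        """Return the sorted list of subscribers the item should be sent to."""
        words = set(item_text.lower().split())
        matched = set()
        for word in words:
            for subscriber in self._by_term.get(word, ()):
                if self._queries[subscriber] <= words:
                    matched.add(subscriber)
        return sorted(matched)


index = SubscriptionIndex()
index.subscribe("alice", ["stocks", "acme"])
index.subscribe("bob", ["satellite"])
index.subscribe("carol", ["news", "acme"])

stream = [
    "acme stocks fall sharply",
    "new satellite images released",
    "acme news conference today",
]
deliveries = [index.route(item) for item in stream]
```

The work per item depends on the number of words in the item and on the subscriptions indexed under those words, not on the total number of subscribers, which is the property a service with millions of subscribers needs.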

A few more relevant buzzwords:

  • Customer profiling: targeted advertisement
  • Knowledge management: collecting, managing, exploiting the know-how in large organizations
  • E-learning
  • E-publication

End of examples

Library/IR basic concepts

Classification:

  • Axiom: unique position for a book
  • Unique call number (+ secondary subjects)
  • Hierarchical & universal classification system: Dewey (1876), LC (1900) + subjects books
      – Single main catalog
      – Claim to universality

Metadata: data describing a collection/ item

Example: library bibliographic record for a book

See: Dublin Core & its use in Stanford
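
As an illustration of a metadata record, the sketch below builds a small Dublin Core description of a book with Python's standard XML library. The element names (title, creator, subject, date, language) come from the Dublin Core element set; the wrapper element "record", the helper name dublin_core_record and all field values are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build an XML metadata record from a dict of Dublin Core element names."""
    record = ET.Element("record")            # wrapper element name is an assumption
    for name, value in fields.items():
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = value
    return record


# Illustrative values only; not taken from a real catalog.
book = dublin_core_record({
    "title": "An Example Book",
    "creator": "Doe, Jane",
    "subject": "Digital libraries",
    "date": "2004",
    "language": "en",
})
xml_text = ET.tostring(book, encoding="unicode")
```

The record describes the book without containing it, which is the defining property of metadata.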

Operations in collection creation/ maintenance:

Cataloging: create metadata record for an item
(many style/ spelling … conventions used here)

Indexing: identify key terms (in all text/ some fields)

  • Controlled: uses a fixed vocabulary
  • Uncontrolled: terms chosen by indexer

Abstraction: create short description (of key ideas)
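
The difference between controlled and uncontrolled indexing can be sketched as follows. This is a toy illustration under stated assumptions: the stop-word list and the controlled vocabulary are invented for the example, and a human indexer would choose terms with far more judgment than "every non-stop word".

```python
import re

STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

# Hypothetical controlled vocabulary: variant word -> standard index term.
CONTROLLED_VOCABULARY = {
    "retrieval": "information retrieval",
    "search": "information retrieval",
    "library": "libraries",
    "libraries": "libraries",
    "xml": "XML",
}


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def uncontrolled_index(text):
    """Index terms chosen freely: here, every non-stop word in the text."""
    return sorted(set(tokenize(text)) - STOP_WORDS)


def controlled_index(text):
    """Index terms restricted to the fixed vocabulary."""
    return sorted({CONTROLLED_VOCABULARY[w] for w in tokenize(text)
                   if w in CONTROLLED_VOCABULARY})


title = "Search and retrieval in the digital library"
free_terms = uncontrolled_index(title)
fixed_terms = controlled_index(title)
```

With the controlled vocabulary, "search" and "retrieval" collapse into one standard term, while "digital" is dropped because the vocabulary has no entry for it.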

Products:

  • Bibliographic record db
  • Author/ title catalog
  • Subject (header) catalog

(provides entry points in on-line catalogs)

Currently, operations are performed manually

  • Expensive, slow, time-consuming
  • Require experts
  • Results are non-uniform (even with experts)

Automatic indexing & abstracting --- research areas

Thesaurus (data dictionary): a vocabulary of standard terms/ concepts

  • Hierarchical organization – every term has
      – BT – broader term
      – NT – narrower terms
  • Additional relationships
      – Synonyms
      – RT – related terms

Used for:

  • Indexing: standardization of index terms
  • Querying: standardization of query terms

(supposedly solves problem of non-uniform indexing)

Creation: complex, lengthy, community process

See also: ontology, KWIC (google them!)
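
A thesaurus with BT/NT/RT links and synonyms can be represented as a simple mapping. The sketch below is only an illustration: the entries are invented, and the two helper functions (standardize, expand) are hypothetical names showing how the same structure serves both indexing and querying.

```python
# A toy thesaurus; the entries are illustrative, not from a published thesaurus.
THESAURUS = {
    "vehicles": {"BT": None, "NT": ["cars", "bicycles"],
                 "RT": ["transport"], "synonyms": []},
    "cars": {"BT": "vehicles", "NT": [],
             "RT": ["roads"], "synonyms": ["automobiles", "autos"]},
    "bicycles": {"BT": "vehicles", "NT": [],
                 "RT": [], "synonyms": ["bikes"]},
}

# Reverse map from any synonym to its standard (preferred) term.
PREFERRED = {syn: term
             for term, entry in THESAURUS.items()
             for syn in entry["synonyms"]}


def standardize(word):
    """Map a word to its standard term; unknown words pass through unchanged."""
    word = word.lower()
    return word if word in THESAURUS else PREFERRED.get(word, word)


def expand(term):
    """Expand a standard term with its narrower terms, recursively."""
    result = [term]
    for narrower in THESAURUS.get(term, {}).get("NT", []):
        result.extend(expand(narrower))
    return result


index_term = standardize("Automobiles")
query_terms = expand(standardize("vehicles"))
```

Standardizing on both sides means an item indexed under "automobiles" and a query for "autos" meet at the same term, "cars".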

Technology/theory background

Structured data – dbms

  • Schema (structure) describes data precisely
  • Queries & query language (based on structure)
  • Indices, query optimization

covered in DB course

Big challenge: integration of multiple heterogeneous autonomous sources
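
For contrast with the unstructured case, the three ingredients of a DBMS listed above (schema, structure-based query, index) can be shown with Python's built-in sqlite3 module. The table, column names and rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: the structure of every record is declared up front.
conn.executescript("""
    CREATE TABLE book (
        id     INTEGER PRIMARY KEY,
        title  TEXT NOT NULL,
        author TEXT NOT NULL,
        year   INTEGER
    );
    CREATE INDEX book_author ON book(author);
""")

# Illustrative rows only.
conn.executemany(
    "INSERT INTO book (title, author, year) VALUES (?, ?, ?)",
    [("First Title", "Author A", 1999),
     ("Second Title", "Author B", 2001),
     ("Third Title", "Author A", 2003)],
)

# Query: phrased in terms of the schema, with an exact answer.
rows = conn.execute(
    "SELECT title, year FROM book WHERE author = ? ORDER BY year",
    ("Author A",),
).fetchall()
conn.close()
```

The answer is precise: a row either satisfies the condition or it does not, unlike the ranked, approximate answers of text retrieval.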

Unstructured data – (free) text: information retrieval

  • Data = collection of texts
  • Query = set of terms (words)
  • Indices = inverted lists (words to locations)
  • Results = ranked answers

Issues/ challenges:

  • Answers are imprecise, approximate
  • Difficult to evaluate answer goodness
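
The four notions above (texts, term queries, inverted lists, ranked answers) fit in a short sketch. It makes simplifying assumptions: the inverted lists record document ids and term counts rather than word positions, and the ranking is a plain tf-idf sum chosen for illustration, not a scoring formula prescribed by the course.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each word to {doc id: term frequency} (an inverted list)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, count in Counter(text.lower().split()).items():
            index[word][doc_id] = count
    return index


def search(index, n_docs, query):
    """Rank documents by a simple tf-idf sum over the query terms."""
    scores = defaultdict(float)
    for word in set(query.lower().split()):
        postings = index.get(word, {})
        if not postings:
            continue
        # Rare words get a higher weight than common ones.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    1: "digital libraries and digital archives",
    2: "search engines for the web",
    3: "libraries use search engines",
}
index = build_index(docs)
ranking = search(index, len(docs), "digital libraries")
```

Document 3 appears in the ranking although it contains only one of the two query words, which illustrates why answers are approximate and why their quality is hard to judge.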

Hypertext/ www:

html, http, soap, …

A browsing model

covered in bsdi course

Issues/ disadvantages:

  • No notion of query, just browsing
  • No structure on data
  • No data quality guarantees

Semi-structured data --- XML:

A standard for data exchange, also stored

covered in bsdi course

  • Self-describing data
  • Optional meta-data --- DTD/ schemas
  • Validation tools
  • Query language
  • Stream processing tools (under development)

Current move: extend to semantic web
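
Two of the points above, self-describing data and stream processing, can be seen in a small sketch using Python's standard xml.etree.ElementTree module. The document content is invented for the example; note that the third book has no year element, which a semi-structured format permits.

```python
import io
import xml.etree.ElementTree as ET

XML_DATA = """<catalog>
  <book id="b1"><title>First Title</title><year>1999</year></book>
  <book id="b2"><title>Second Title</title><year>2003</year></book>
  <book id="b3"><title>Third Title</title></book>
</catalog>"""

# Whole-document access: parse everything, then query by path.
# The tags carry the field names, so no external schema is needed.
root = ET.fromstring(XML_DATA)
titles = [book.findtext("title") for book in root.findall("book")]

# Stream-style access: handle each <book> as soon as its end tag is seen,
# then clear it, so its contents need not stay in memory.
recent_ids = []
source = io.BytesIO(XML_DATA.encode("utf-8"))
for _event, element in ET.iterparse(source, events=("end",)):
    if element.tag == "book":
        year = element.findtext("year")      # optional element: may be None
        if year is not None and int(year) >= 2000:
            recent_ids.append(element.get("id"))
        element.clear()
```

The second loop never needs the whole catalog at once, which is the idea behind processing XML as a stream.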

Machine learning:

Used for automatic

  • Classification
  • Indexing
  • Abstracting
  • Clustering

Of free text/ semi-structured data

covered in machine learning courses

End of technology survey

The course

  • IR -- classical to Google
      – System architectures
      – Kinds of queries
      – Auxiliary data structures (indices) & efficient query processing
      – Compression, a bit of theory, uses in IR
      – Extensions to hypertext (using link structure)
  • “Conceptual” topics
      – E-books
      – E-publishing

End of Introduction