
Data and information services: Scientific data directories/ - PDF document



  1. Digital Libraries
     From Information retrieval to search engines, e-books, e-libraries & related topics
     DL - 2004 Introduction – Beeri/Feitelson

     General background
     Driving forces – rapid evolution of:
     hardware
     • Computing power
     • Memory
     • Networking (internet)
     software
     • DB systems
     • IR systems
     • Hypertext (WWW)
     • GUI/ presentation tools (html)

     Consequences:
     • Transformation of existing applications
     • Generation of new applications related to data collection, organization, classification, access
     Some examples:

     (Classical) Libraries:
     • Automation of catalogs (old stuff)
     • On-line e-journals
     • Collections of born-digital materials
     New:
     • Digitized collections (images, maps,..)
     • on-line archives
     • On-line, virtual museums

     Digital libraries & bibliographic services:
     • ACM Digital Library (acm-diglib): collection of all (full) papers from ACM journals
     • SIGMOD digital anthology (anthology)
     • DBLP (dblp): collection of bibliographic information
     • Citeseer (citeseer): citation and impact factor data

     Portals, directories, search engines
     • Yahoo – a directory (manual labor)
     • Google – a search engine (fully automatic): IR technology & hypertext structure
     • Amazon (& similar on-line sales companies) (book/ dvd/ .. Portal/ directory)

  2. Data and information services:
     • Medline & US national library of medicine (nlmed)
     • On-line encyclopedias: Wikipedia, art encyclopedia
     • Lexis-nexis (and many like it)

     Scientific data directories/ repositories/ portals:
     • Bioinformatics: -- over 500 db's/ data sources (bio-source)
     • Astronomy – the world-wide telescope project (wwt)
     • Classical humanities – the Perseus digital library (perseus)

     Issues:
     • Heterogeneity → data transformation/ integration
     • Dependence on experiments → scientific experiment meta-data
     • Huge volumes → fast, approximate, on-line stream processing

     New kinds of services:
     • Query subscription on XML/ data streams (niagara, niagaracq)
       – news
       – Stocks
       – satellite data
     Needs ultra-fast
     • stream processing
     • Query evaluation/ data routing
     Issues:
     • Fast streams
     • Millions and more queries & subscribers

     A few more relevant buzzwords:
     • Customer profiling: targeted advertisement
     • Knowledge management: collecting, managing, exploiting the know-how in large organizations
     • E-learning
     • E-publication
     End of examples

     Library/IR basic concepts
     Classification (Hebrew: מיון):
     Axiom: unique position for a book
     → Unique call number (Hebrew: מס' מיון) (+ secondary subjects)
     → hierarchical & universal classification system: Dewey (1876), LC (1900)
     + subjects → books
     x Single main catalog
     x Claim to universality
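The "query subscription" slide asks for query evaluation and data routing over fast streams with very many subscribers. A minimal sketch of one way to think about it, in Python: subscriptions are indexed by term, so each incoming item is checked only against subscriptions that share a word with it. The class name `SubscriptionRouter`, the "all terms must appear" matching rule, and the sample subscribers are assumptions made for illustration; they are not taken from the slides or from the Niagara systems they link to.

```python
from collections import defaultdict


class SubscriptionRouter:
    """Toy router (hypothetical): a subscription is a set of terms that must all occur in an item."""

    def __init__(self):
        self.subscriptions = {}           # subscriber id -> frozenset of required terms
        self.by_term = defaultdict(set)   # term -> ids of subscriptions that mention it

    def subscribe(self, subscriber, terms):
        required = frozenset(t.lower() for t in terms)
        self.subscriptions[subscriber] = required
        for term in required:
            self.by_term[term].add(subscriber)

    def route(self, item_text):
        """Return the sorted subscribers whose required terms all occur in the item."""
        words = set(item_text.lower().split())
        # Only subscriptions sharing at least one word with the item are candidates,
        # so most subscriptions are never examined for a given item.
        candidates = set()
        for word in words:
            candidates |= self.by_term.get(word, set())
        return sorted(s for s in candidates if self.subscriptions[s] <= words)


router = SubscriptionRouter()
router.subscribe("alice", ["stocks", "ibm"])
router.subscribe("bob", ["satellite"])
router.subscribe("carol", ["news", "stocks"])

print(router.route("IBM stocks rise after earnings"))
```

The design point is that the subscriptions, not the data, are what gets indexed: the stream item plays the role of the "query" against a stored collection of subscriptions.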

  3. Metadata (Hebrew: נתוני על):
     data describing a collection/ item
     Example: library bibliographic record for a book
     (many style/ spelling … conventions used here)
     See: Dublin Core & its use in stanford

     Operations in collection creation/ maintenance:
     Cataloging (Hebrew: קטלוג): create metadata record for an item
     Indexing (Hebrew: מיפתוח): identify key terms (in all text/ some fields)
     • Controlled (Hebrew: מבוקר) -- uses a fixed vocabulary
     • Uncontrolled -- terms chosen by indexer
     Abstraction (Hebrew: תקצור): create short description (of key ideas)

     Products:
     • Bibliographic record db
     • Author/ title catalog
     • Subject (header) catalog
     (provides entry points in on-line catalogs)
     Currently, operations performed manually
     • Expensive, slow, time-consuming
     • Require experts
     • Results are non-uniform (even with experts)
     Automatic indexing & abstracting --- research areas

     Thesaurus (Hebrew: תיזאורוס, מילון נתונים):
     a vocabulary of standard terms/ concepts
     • Hierarchical organization – every term has
       – BT – broader term, NT – narrower terms
     • Additional relationships
       – Synonyms, RT – related terms
     Used for:
     • Indexing → standardization of index terms
     • Querying → standardization of query terms
     (supposedly solves problem of non-uniform indexing)
     Creation: complex, lengthy, community process
     See also: ontology, KWIC (google them!)

     Technology/theory background
     Structured data – dbms
     • Schema (structure) describes data precisely
     • Queries & query language (based on structure)
     • Indices, query optimization
     covered in DB course
     Big challenge: integration of multiple heterogeneous autonomous sources

     Unstructured data – (free) text: information retrieval (Hebrew: איחזור מידע)
     Data = collection of texts
     Query = set of terms (words)
     Indices = inverted lists (words to locations)
     Results = ranked answers
     Issues/ challenges:
     • Answers are imprecise, approximate
     • Difficult to evaluate answer goodness
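The information-retrieval slide summarizes the model as: data = collection of texts, query = set of terms, indices = inverted lists (words to locations), results = ranked answers. A minimal sketch of that pipeline in Python follows. The function names, the sample texts, and the scoring rule (count of query-term occurrences) are assumptions chosen for brevity; real IR systems use more refined ranking.

```python
from collections import defaultdict


def build_inverted_index(texts):
    """Map each word to a list of (text id, word position) pairs."""
    index = defaultdict(list)
    for text_id, text in enumerate(texts):
        for position, word in enumerate(text.lower().split()):
            index[word].append((text_id, position))
    return index


def search(index, query_terms):
    """Rank texts by how many query-term occurrences they contain (a deliberately crude score)."""
    scores = defaultdict(int)
    for term in query_terms:
        for text_id, _position in index.get(term.lower(), []):
            scores[text_id] += 1
    # Highest score first; ties broken by text id so the order is stable.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Made-up sample collection.
texts = [
    "digital libraries and search engines",
    "search engines rank answers",
    "library catalogs",
]
index = build_inverted_index(texts)
results = search(index, ["search", "answers"])
print(results)  # [(1, 2), (0, 1)]
```

Note that "library" and "libraries" are different words to this index, which is one small instance of the "answers are imprecise, approximate" issue listed on the slide.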

  4. Hypertext/ www:
     html, http, soap, … .
     A browsing model
     covered in bsdi course
     Issues/ disadvantages:
     • No notion of query, just browsing
     • No structure on data
     • no data quality guarantees
     Current move: extend to semantic web

     Semi-structured data --- XML:
     A standard for data exchange, also stored
     covered in bsdi course
     • Self-describing data
     • Optional meta-data --- DTD/ schemas
     • Validation tools
     • query language
     • Stream processing tools (under development)

     Machine learning:
     Used for automatic
     • Classification
     • Indexing
     • Abstracting
     • Clustering
     Of free text/ semi-structured data
     covered in machine learning courses
     End of technology survey

     The course
     • IR -- classical to Google
       – System architectures
       – Kinds of queries
       – Auxiliary data structures (indices) & efficient query processing
       – Compression, a bit of theory, uses in IR
       – Extensions to hypertext (using link structure)
     • "Conceptual" topics
       – E-books
       – E-publishing

     End of Introduction
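As a small illustration of the "self-describing data" and "query language" points on the XML slide: in the sketch below the element and attribute names carry the structure, so the records can be queried without a separate schema. The catalog content, the tag names, and the book titles are invented for the example; the code uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical records; the tags themselves describe what each value means.
RECORDS = """
<catalog>
  <book year="2004">
    <title>Example Book A</title>
    <subject>information retrieval</subject>
  </book>
  <book year="1999">
    <title>Example Book B</title>
    <subject>compression</subject>
    <subject>information retrieval</subject>
  </book>
</catalog>
""".strip()

root = ET.fromstring(RECORDS)

# Query by attribute value, using ElementTree's XPath subset.
titles_2004 = [book.findtext("title") for book in root.findall("book[@year='2004']")]

# Query by the text of a repeated child element.
ir_titles = [
    book.findtext("title")
    for book in root.findall("book")
    if any(subject.text == "information retrieval" for subject in book.findall("subject"))
]

print(titles_2004)
print(ir_titles)
```

The second book has two `subject` elements and the first has one; nothing had to be declared in advance for that to work, which is the sense in which the meta-data (DTD/ schema) is optional.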
