New CDSware System Tools

Eduardo Margallo Balbás, August 22, 2002



  1. New CDSware System Tools. Eduardo Margallo Balbás, August 22, 2002

  2. About the Talk
     Outline: BibSched (Overview, Architecture, Future Lines); BibReformat (Overview, Implementation, Future Lines); BibWords (Overview, Fulltext Index, Reverse Tables, Lexical Statistics, Future Lines); Conclusions.
     Figure 1: New tools in the system. Labels in the figure: BibSched, BibReformat, BibWords, BibUpload, BibFormat, Search.
     CDSware ETT-DH, August 22, 2002

  3. BibSched – Overview
     Why BibSched: Computing-intensive tasks may not be run at the same time. Data consistency requires mutual exclusion. Need for daemon processes and scheduling into the future.
     Goals: Handle concurrent access in an administrator-friendly way. Support for periodical tasks. Job control. Priority mechanism.
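The goals above (one task at a time, scheduling into the future, periodical tasks, priorities) can be illustrated with a small in-memory queue. This is only a sketch under stated assumptions: `Task`, `TaskQueue`, `submit` and `run_due` are hypothetical names, not BibSched's interface, and the ordering rule (earliest scheduled time first, priority as tie-breaker) is one possible choice.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    # Sort key: earliest runtime, then highest priority, then submission order.
    runtime: float
    neg_priority: int
    seq: int
    name: str = field(compare=False)
    period: float = field(default=0.0, compare=False)  # 0 means "run once"


class TaskQueue:
    """Hypothetical in-memory stand-in for a scheduler's task table."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, name, runtime, priority=0, period=0.0):
        """Queue a task to run at `runtime`; `period` > 0 makes it periodic."""
        task = Task(runtime, -priority, next(self._counter), name, period)
        heapq.heappush(self._heap, task)

    def run_due(self, now, runner):
        """Run every task due at `now`, strictly one after another."""
        executed = []
        while self._heap and self._heap[0].runtime <= now:
            task = heapq.heappop(self._heap)
            runner(task.name)  # serial execution gives mutual exclusion
            executed.append(task.name)
            if task.period > 0:
                # Periodic task: queue the next run one period from now.
                self.submit(task.name, now + task.period,
                            -task.neg_priority, task.period)
        return executed


if __name__ == "__main__":
    queue = TaskQueue()
    queue.submit("bibwords", runtime=10, priority=1)
    queue.submit("bibupload", runtime=10, priority=5)
    queue.submit("stats", runtime=5, period=60)
    print(queue.run_due(20, runner=lambda name: None))
```

Running tasks serially is the simplest way to honour the mutual exclusion requirement; it trades throughput for consistency.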

  4. BibSched – Architecture
     Figure 2: BibSched Architecture. Labels in the figure: BibFormat Interface, BibWords Interface, BibUpload Interface, schTASKS, BibSched Admin, BibSched, BibUpload, BibFormat, BibWords, Bib???.

  5. BibSched – Architecture
     Figure 3: BibSched Tasks State Diagram. States in the figure: WAITING, RUNNING, DONE, SLEEP SENT, SLEEPING, STOPPING, STOPPED, WAKING UP, ERROR. Signals in the figure: SIGUSR1, SIGUSR2, SIGCONT.

  6. BibSched – Future Lines
     User Interface: Task management in a user-friendly way. General/task-specific manager.
     Code Library: Common properties of tasks and interfaces: class design.
     Several Queues: FIFO is simple, but better performance is possible. Compatible tasks may run in parallel.
     System Monitor: Resource-based task scheduling.

  7. BibReformat – Overview
     Problem: Output formats are stored in the database for performance. Changes in bibliographic data require updating the formats database.
     Previous methodology: Download-Reformat-Upload cycle. Slow, error prone.
     Goal: A simple way to update prestored formats. Easy to use and reliable.

  8. BibReformat – Implementation
     Figure 4: BibReformat Architecture. Labels in the figure: Arbitrary Select, Collection Select, flxREFORMAT queue, Search, BibReformat, BibUpload, BibFormat.

  9. BibReformat – Future Lines
     BibSched Merging: BibReformat was a precursor for BibSched. Similar task queue mechanism. Now BibReformat is obsolete: rewrite into a BibSched task.
     User Interface Enhancement: Better collection select: collapse/expand(?).

  10. BibWords – Overview
      Fulltext Index: 220,000 fulltexts in the system. The capability to search inside them is very valuable. CDSware is one of the first bibliographic engines with combined metadata/reference/fulltext search!
      Word Tables: Forward tables: word → document number list. Reverse tables: document number → word list. Many advantages for little price.
      Lexical Analysis: What is our data like? Many common words, or rather many rare ones? Is the corpus size growing? What does the growth curve look like? Will our current system implementation cope with it?
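The two word tables can be pictured as a pair of mappings. The sketch below is illustrative only: `tokenize` and `build_tables` are hypothetical helpers, the tokenization rule is an assumption, and plain Python dictionaries stand in for database tables.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Return the set of unique lower-cased words in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_tables(documents):
    """Build (forward, reverse) tables from a {doc_number: text} mapping."""
    forward = defaultdict(set)  # word -> set of document numbers
    reverse = {}                # document number -> set of words
    for doc_number, text in documents.items():
        words = tokenize(text)
        reverse[doc_number] = words
        for word in words:
            forward[word].add(doc_number)
    return dict(forward), reverse


if __name__ == "__main__":
    docs = {1: "Higgs boson search", 2: "Search for the Higgs"}
    forward, reverse = build_tables(docs)
    print(sorted(forward["higgs"]))
    print(sorted(reverse[2]))
```

A search looks up the forward table; the reverse table exists so that one document's words can be found again without scanning every word.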

  11. BibWords – Fulltext Index
      Goals: Fulltext indexes for the maximal number of records. Indexing speed is a concern due to the high volume of data. Good search speed.
      Problems: Different file formats. Heterogeneous file access. Scanned bitmaps. Slow file access for external documents (transfer rate, timeouts, ...). Slow text extraction (mainly ps and pdf). Huge number of words.

  12. BibWords – Fulltext Index
      Figure 5: BibWords Architecture. Labels in the figure: bibwordsX, bibXXX, bibitem, WWW, BibWords, Document Server, Helper Programs (pdf2text, antiword, ...).

  13. BibWords – Fulltext Index
      Fulltext word extraction from URL.
      Indirect Lookup: URL might be direct: “http://foo.bar/spam.pdf”. URL might be indirect: “http://foo.bar/setlink?...”. Indirect URLs contain pointers to the actual locations of files.
      Document retrieval: urllib does the job for us.
      Text extraction: Helper programs permit flexible configurations (PPT, DOC, PS, PS.GZ, ...).
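A minimal sketch of the retrieve-then-extract flow, written against today's `urllib.request` rather than the 2002 module. Everything specific here is an assumption: the `HELPERS` table, the function names, the idea that an indirect URL resolves through an HTTP redirect, and the choice of `pdftotext`, `antiword` and `ps2ascii` as helper programs.

```python
import os
import shutil
import subprocess
import urllib.parse
import urllib.request

# Hypothetical extension-to-command table. The programs are real tools,
# but this mapping and these arguments are assumptions for illustration.
HELPERS = {
    ".pdf": ["pdftotext", "{path}", "-"],
    ".doc": ["antiword", "{path}"],
    ".ps": ["ps2ascii", "{path}"],
}


def helper_command(url, local_path):
    """Pick an extraction command from the file extension in the URL path."""
    extension = os.path.splitext(urllib.parse.urlparse(url).path)[1].lower()
    template = HELPERS.get(extension)
    if template is None:
        return None  # no helper configured for this kind of document
    return [part.format(path=local_path) for part in template]


def fetch(url, local_path, timeout=30):
    """Download a document; return the final URL after any HTTP redirects."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)
        return response.geturl()


def extract_text(url, local_path):
    """Fetch a document and run the matching helper; '' if none applies."""
    final_url = fetch(url, local_path)
    command = helper_command(final_url, local_path)
    if command is None:
        return ""
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(helper_command("http://foo.bar/spam.pdf", "/tmp/spam.pdf"))
```

Keeping the helper table as data is what makes the configuration flexible: supporting another format means adding one entry, not changing code.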

  14. BibWords – Reverse Tables
      Advantages: Database consistency throughout the indexing process. Faster updates: no need to look through all the words.
      Price: Lower indexing speed: word list compress and insert. Storage space.
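The "faster updates" point can be made concrete: with a reverse table, re-indexing a document only touches the words that were added or removed. The sketch assumes forward and reverse tables held as plain dictionaries of sets; `update_document` is a hypothetical helper, and the compression step mentioned under "Price" is left out.

```python
def update_document(forward, reverse, doc_number, new_words):
    """Re-index one document, touching only the words that changed."""
    new_words = set(new_words)
    old_words = reverse.get(doc_number, set())

    # Words no longer in the document: drop it from their posting lists.
    for word in old_words - new_words:
        postings = forward[word]
        postings.discard(doc_number)
        if not postings:
            del forward[word]

    # Words new to the document: add it to their posting lists.
    for word in new_words - old_words:
        forward.setdefault(word, set()).add(doc_number)

    reverse[doc_number] = new_words
    # Number of forward-table entries that had to be touched.
    return len(old_words ^ new_words)


if __name__ == "__main__":
    forward = {"higgs": {1, 2}, "boson": {1}, "search": {2}}
    reverse = {1: {"higgs", "boson"}, 2: {"higgs", "search"}}
    print(update_document(forward, reverse, 1, ["higgs", "quark"]))
```

Without the reverse table, removing a document would mean scanning every word's posting list to find where it appears.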

  15. BibWords – Lexical Statistics
      Analysis conditions: Analysis performed on indexable data (unique words in each document). Ordered by system number.
      Limitations: Every word is counted only once per document, whereas standard corpus statistics count all word appearances, so the results are not really comparable to standard corpus statistics. Nonetheless useful for predicting system evolution. System number is highly correlated with type of document, so anomalies appear in vocabulary growth curves.
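The counting rule described above (each word once per document) and a vocabulary growth curve can be sketched as follows. Both function names are hypothetical, and the input is assumed to be a list of word lists already ordered by system number.

```python
from collections import Counter


def document_frequencies(doc_words):
    """Count, for each word, the number of documents that contain it."""
    counts = Counter()
    for words in doc_words:
        counts.update(set(words))  # repeats inside one document count once
    return counts


def vocabulary_growth(doc_words):
    """Return the vocabulary size after each document, in the given order."""
    seen = set()
    curve = []
    for words in doc_words:
        seen.update(words)
        curve.append(len(seen))
    return curve


if __name__ == "__main__":
    docs = [["a", "b", "a"], ["b", "c"], ["c"]]
    print(document_frequencies(docs))
    print(vocabulary_growth(docs))
```

Because the curve depends on document order, a run of documents of one type (as implied by the system number correlation) shows up as a bend in the curve.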

  16. BibWords – Lexical Statistics
      Figure 6: Frequency Distribution for Title Index. Log-log plot, series "Title"; horizontal axis from 1 to 1e+06, vertical axis from 1 to 100000.

  17. BibWords – Lexical Statistics
      Figure 7: Frequency Distribution for Stuttgarter Zeitung. Log-log plot of Vm (vertical axis, 1 to 10000) against m (horizontal axis, 1e+00 to 1e+06). Data by Anke Lüdeling, Stefan Evert, Universität Stuttgart.
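Reading the axis labels of Figure 7 in the usual way for word frequency statistics, m is a frequency and Vm the number of distinct words that have exactly that frequency. That reading is an assumption, as is the hypothetical `frequency_spectrum` helper below, which takes a word-to-frequency mapping and returns the points of such a plot.

```python
from collections import Counter


def frequency_spectrum(word_frequencies):
    """Map each frequency m to V_m, the number of words with frequency m."""
    spectrum = Counter(word_frequencies.values())
    return dict(sorted(spectrum.items()))


if __name__ == "__main__":
    frequencies = {"a": 1, "b": 2, "c": 2, "d": 1, "e": 5}
    for m, v_m in frequency_spectrum(frequencies).items():
        print(m, v_m)
```

A spectrum dominated by small m means many rare words, which is the case that makes the word tables grow fastest.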
