New CDSware System Tools
Eduardo Margallo Balbás
August 22, 2002
About the Talk
- About the talk
- BibSched
Overview, Architecture, Future Lines
- BibReformat
Overview, Implementation, Future Lines
- BibWords
Overview, Fulltext Index, Reverse Tables, Lexical Statistics, Future Lines
- Conclusions
Figure 1: New tools in the system. (Diagram labels: BibFormat, BibUpload, Search, BibWords, BibSched, BibReformat.)
August 22, 2002
CDSware ETT-DH
BibSched – Overview
- Why BibSched
Computing-intensive tasks must not run at the same time. Data consistency requires mutual exclusion. Daemon processes and scheduling into the future are needed.
- Goals
Handle concurrent access in an administrator-friendly way. Support for periodic tasks. Job control. Priority mechanism.
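The goals above (one task at a time, priorities, periodic tasks) can be illustrated with a small sketch. This is not the BibSched code: the `Scheduler` class, its method names and the task names used in the example are assumptions made only for illustration.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    run_at: float                   # earliest time the task may start
    priority: int = 0               # higher value runs first
    period: Optional[float] = None  # re-run interval for periodic tasks
    seq: int = 0                    # submission order, used as FIFO tie-breaker


class Scheduler:
    """Toy scheduler: runs at most one task per call (mutual exclusion)."""

    def __init__(self):
        self.tasks = []
        self.log = []
        self._seq = itertools.count()

    def submit(self, name, run_at, priority=0, period=None):
        self.tasks.append(Task(name, run_at, priority, period, next(self._seq)))

    def next_task(self, now):
        # Only tasks whose start time has been reached are eligible.
        ready = [t for t in self.tasks if t.run_at <= now]
        if not ready:
            return None
        # Highest priority first, then first submitted.
        return min(ready, key=lambda t: (-t.priority, t.seq))

    def run_once(self, now):
        task = self.next_task(now)
        if task is None:
            return None
        self.tasks.remove(task)
        self.log.append(task.name)  # stand-in for actually running the task
        if task.period is not None:
            # Periodic tasks are put back in the queue for their next run.
            self.submit(task.name, now + task.period, task.priority, task.period)
        return task.name
```

Calling `run_once` repeatedly from a single daemon loop is what gives mutual exclusion in this sketch: a second task cannot start until the first call has returned.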
BibSched – Architecture
Figure 2: BibSched Architecture. (Diagram labels: BibSched, BibSched Admin, schTASKS; tasks BibUpload, BibFormat, BibWords, Bib???; BibUpload Interface, BibFormat Interface, BibWords Interface.)
Figure 3: BibSched Tasks State Diagram. (States: WAITING, RUNNING, SLEEP SENT, SLEEPING, WAKING UP, STOPPING, STOPPED, DONE, ERROR. Signals: SIGUSR1, SIGUSR2, SIGCONT.)
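A task life cycle of this kind can be written as a transition table. The state names below come from the figure; the edges and the event names are assumptions, since the arrows of the diagram did not survive in the text. The figure also lists SIGUSR1, SIGUSR2 and SIGCONT, but which signal triggers which edge is not recoverable, so the sketch uses neutral event names instead of signals.

```python
# (current state, event) -> next state.
# State names are from Figure 3; every edge and event name is hypothetical.
TRANSITIONS = {
    ("WAITING", "start"): "RUNNING",
    ("RUNNING", "sleep_requested"): "SLEEP SENT",
    ("SLEEP SENT", "sleep_acknowledged"): "SLEEPING",
    ("SLEEPING", "wake_requested"): "WAKING UP",
    ("WAKING UP", "wake_acknowledged"): "RUNNING",
    ("RUNNING", "stop_requested"): "STOPPING",
    ("STOPPING", "stop_acknowledged"): "STOPPED",
    ("RUNNING", "finished"): "DONE",
    ("RUNNING", "failed"): "ERROR",
}


def advance(state, event):
    """Return the next state, or raise ValueError for an illegal move."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state!r}")


def replay(events, state="WAITING"):
    """Apply a sequence of events starting from the given state."""
    for event in events:
        state = advance(state, event)
    return state
```

Keeping the table as data makes it easy to reject an illegal request (for example waking a task that is not sleeping) before anything is sent to the task process.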
BibSched – Future Lines
- User Interface
User-friendly task management. General and task-specific managers.
- Code Library
Properties common to tasks and interfaces: class design.
- Several Queues
FIFO is simple, but better performance is possible. Compatible tasks may run in parallel.
- System Monitor
Resource-based task scheduling.
BibReformat – Overview
- Problem
Output formats are stored in the database for performance. When bibliographic data changes, the formats database must be updated.
- Previous methodology
Download-Reformat-Upload cycle. Slow and error-prone.
- Goal
Simple way to update prestored formats. Easy to use and reliable.
BibReformat – Implementation
Figure 4: BibReformat Architecture. (Diagram labels: Collection Select, Arbitrary Select, flxREFORMAT queue, BibReformat, Search, BibFormat, BibUpload.)
BibReformat – Future Lines
- BibSched Merging
BibReformat was a precursor of BibSched. Similar task queue mechanism. BibReformat is now obsolete: rewrite it as a BibSched task.
- User Interface Enhancement
Better collection select: collapse/expand(?).
BibWords – Overview
- Fulltext Index
220,000 fulltexts in the system. The capability to search inside them is very valuable. CDSware is one of the first bibliographic engines with combined metadata/reference/fulltext search!
- Word Tables
Forward tables: word to document number list. Reverse tables: document number to word list. Many advantages for a small price.
- Lexical Analysis
What is our data like? Many common words, or rather many rare ones? Is the corpus size growing? What does the growth curve look like? Will our current system implementation cope with it?
BibWords – Fulltext Index
- Goals
Fulltext indexes for the maximum number of records. Speed is a concern due to the high volume of data. Good search speed.
- Problems
Different file formats. Heterogeneous file access. Scanned bitmaps. Slow file access for external documents (transfer rate, timeouts, ...). Slow text extraction (mainly PS and PDF). Huge number of words.
Figure 5: BibWords Architecture. (Diagram labels: BibWords, bibwordsX, bibXXX, bibitem, Document Server, Helper Programs (pdf2text, antiword, ...), WWW.)
Fulltext word extraction from a URL
- Indirect Lookup
A URL might be direct: "http://foo.bar/spam.pdf". A URL might be indirect: "http://foo.bar/setlink?...". Indirect URLs contain pointers to the actual locations of files.
- Document retrieval
urllib does the job for us.
- Text extraction
Helper programs permit flexible configurations (PPT, DOC, PS, PS.GZ, ...).
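The three steps above could be sketched as follows. This is only an illustration under stated assumptions: it uses today's `urllib.request` module, it assumes that indirect URLs answer with an HTTP redirect (which `urlopen` follows), and the `HELPERS` table is hypothetical. The slides name pdf2text and antiword as helpers; the table below swaps in `pdftotext` and `ps2ascii` as commonly available stand-ins.

```python
import os
import re
import subprocess
import tempfile
import urllib.request
from urllib.parse import urlparse

# Hypothetical configuration: file extension -> helper command template.
# Each helper is expected to print plain text on standard output.
HELPERS = {
    ".pdf": ["pdftotext", "{src}", "-"],
    ".doc": ["antiword", "{src}"],
    ".ps": ["ps2ascii", "{src}"],
}


def resolve(url, timeout=30):
    """Fetch a URL, following redirects; return (final_url, content)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl(), response.read()


def helper_for(final_url):
    """Pick the helper command template from the file extension."""
    ext = os.path.splitext(urlparse(final_url).path)[1].lower()
    return ext, HELPERS.get(ext)


def words(text):
    """Unique lower-cased words of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_words(url):
    """Download a document and return its set of words (empty if unsupported)."""
    final_url, data = resolve(url)
    ext, template = helper_for(final_url)
    if template is None:
        return set()
    with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
        tmp.write(data)
    try:
        command = [part.format(src=tmp.name) for part in template]
        output = subprocess.run(command, capture_output=True, check=True).stdout
        return words(output.decode("utf-8", "replace"))
    finally:
        os.unlink(tmp.name)
```

Keeping the helper commands in a table means a new format only needs a new entry, which is the kind of flexible configuration the slide refers to.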
BibWords – Reverse Tables
- Advantages
Database consistency throughout the indexing process. Faster updates: no need to look through all the words.
- Price
Lower indexing speed: each word list must be compressed and inserted. Extra storage space.
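Why a reverse table speeds up updates can be seen in a small in-memory sketch. The `WordIndex` class is hypothetical and ignores the compression and database storage mentioned above; it only shows that, with the old word list of a document at hand, an update touches just the words that changed.

```python
from collections import defaultdict


class WordIndex:
    """Toy index with a forward table and a reverse table."""

    def __init__(self):
        self.forward = defaultdict(set)  # word -> set of document numbers
        self.reverse = {}                # document number -> set of words

    def update(self, docno, word_list):
        new = set(word_list)
        # The reverse table gives the old words directly, so there is
        # no need to scan every word of the forward table.
        old = self.reverse.get(docno, set())
        for word in old - new:
            self.forward[word].discard(docno)
            if not self.forward[word]:
                del self.forward[word]
        for word in new - old:
            self.forward[word].add(docno)
        self.reverse[docno] = new

    def search(self, word):
        return sorted(self.forward.get(word, ()))
```

Without the reverse table, removing the stale entries of a re-indexed document would require visiting every word in the forward table to check whether the document number is listed there.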
BibWords – Lexical Statistics
- Analysis conditions
Analysis performed on indexable data (unique words in document). Ordered by system number.
- Limitations
Every word is counted only once per document, whereas standard corpus statistics count every occurrence of a word. The results are therefore not really comparable to standard corpus statistics, but they are nonetheless useful for predicting how the system will evolve. The system number is highly correlated with the type of document, so anomalies appear in the vocabulary growth curves.
Figure 6: Frequency Distribution for Title Index. (Plot title: Frequency distribution, Title; both axes logarithmic.)
Figure 7: Frequency Distribution for Stuttgarter Zeitung. (Axes: m and Vm, both logarithmic.) Data by Anke Lüdeling and Stefan Evert, Universität Stuttgart.
                        Global        Titles       Stuttgarter Z. (a)
Tokens N                50,794,361    4,929,586    36,197,372
Types (lemma) V         1,584,356     83,733       714,972
Hapax legomena          1,189,207     39,234       404,579
Dis legomena            116,786       11,045       96,981
Proportion of Hapax     75.05%        46.86%       56.59%
Proportion of Dis       7.37%         13.19%       13.56%
Mean Type Frequency     50.03         59.28        50.628
Standard Deviation      2300.12       1360.38      5,616
Figure 8: Descriptive statistics on CDSware words.
(a) Words were processed individually instead of grouped by document.
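The counts in the table can be reproduced on any collection with a few lines of code. The sketch below is an assumption about the procedure, not the script used for the talk: it counts each word once per document, as described under the analysis conditions, and takes the hapax and dis proportions relative to the number of types (for the Titles column, 39,234 / 83,733 is about 46.86%). The function name and the dictionary keys are illustrative.

```python
from collections import Counter


def lexical_stats(documents):
    """documents: iterable of word lists, one list per document."""
    freq = Counter()
    for word_list in documents:
        freq.update(set(word_list))  # each word counts once per document
    types = len(freq)
    hapax = sum(1 for count in freq.values() if count == 1)
    dis = sum(1 for count in freq.values() if count == 2)
    return {
        "tokens": sum(freq.values()),
        "types": types,
        "hapax": hapax,
        "dis": dis,
        "hapax_proportion": hapax / types if types else 0.0,
        "dis_proportion": dis / types if types else 0.0,
    }
```

For example, `lexical_stats([["a", "b", "a"], ["b", "c"], ["b", "c", "d"]])` sees four types, of which two ("a" and "d") occur in a single document and one ("c") in exactly two.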
Figure 9: Vocabulary Growth Curve for Titles Index. (Plot title: Vocabulary Growth Curves; curves: all, hapax, dis.)
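A growth curve like this one can be computed in a single pass over the documents, taken in system number order. The sketch below is illustrative only: the function name is hypothetical, and using the running token count as the horizontal value is an assumption about the figure's axis.

```python
def growth_curve(documents):
    """Return one (tokens, types, hapax, dis) point per document processed."""
    freq = {}
    tokens = hapax = dis = 0
    points = []
    for word_list in documents:
        for word in set(word_list):  # each word counts once per document
            count = freq.get(word, 0) + 1
            freq[word] = count
            tokens += 1
            if count == 1:           # new word: it starts as a hapax
                hapax += 1
            elif count == 2:         # hapax becomes a dis legomenon
                hapax -= 1
                dis += 1
            elif count == 3:         # dis legomenon becomes a frequent word
                dis -= 1
        points.append((tokens, len(freq), hapax, dis))
    return points
```

Because the points are recorded in processing order, a batch of documents of a different type shows up as a kink in the curve, which is the kind of anomaly mentioned under the limitations.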
- LNRE (Large Number of Rare Events) Statistics
Large number of hapax and dis legomena. Most of the words are used only once. The indexing cost of rare words is important!
- Productive Process
Linguistic rules provide an ever-increasing vocabulary. Growth for titles index: (value not recoverable) words/doc. Growth for global index: (value not recoverable) words/doc.
BibWords – Future Lines
- System Number Buckets
Updates are slowed down by the compress/decompress delay. The higher the number of documents, the longer the times. Solution: split up the system number range into chunks.
- Multithreaded Text Extraction
Document transfer and text extraction are very slow, while CPU usage is always under 10%. Solution: process several documents at the same time.
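Both proposals can be sketched briefly. The bucket size, the worker count and the function names below are assumptions for illustration; `extract_one` stands for whatever routine downloads one document and extracts its text.

```python
from concurrent.futures import ThreadPoolExecutor


def bucket_range(sysno, bucket_size=10000):
    """Return the (first, last) system numbers of the bucket holding sysno.

    With one compressed document list per (word, bucket), an update only
    decompresses the chunk that contains the document being changed.
    """
    first = (sysno // bucket_size) * bucket_size
    return first, first + bucket_size - 1


def extract_many(urls, extract_one, max_workers=8):
    """Run extract_one on several URLs at the same time.

    Threads fit here because the work is dominated by waiting on
    transfers and helper programs, not by CPU.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(extract_one, urls)))
```

`pool.map` returns results in the order of the input, so pairing them back with `urls` is safe even when documents finish out of order.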
Conclusions
- BibReformat
Small tool to automate the updating of stored bibliographic formats: a prelude to BibSched.
- BibSched
First working version of scheduler daemon. BibWords already integrated.
- BibWords
One of the main system tools strongly enhanced. Fulltext possibilities, better update speed, consistent indexing.
Thank You! This has been a great summer!