SLIDE 1

New CDSware System Tools

Eduardo Margallo Balbás

August 22, 2002

SLIDE 2

About the Talk

  • About the talk
  • BibSched
      • Overview
      • Architecture
      • Future Lines
  • BibReformat
      • Overview
      • Implementation
      • Future Lines
  • BibWords
      • Overview
      • Fulltext Index
      • Reverse Tables
      • Lexical Statistics
      • Future Lines
  • Conclusions

Figure 1: New tools in the system. (Diagram labels: BibFormat, BibUpload, Search, BibWords, BibSched, BibReformat.)

August 22, 2002

CDSware ETT-DH

Page 1

SLIDE 3

BibSched – Overview

  • Why BibSched
      • Compute-intensive tasks must not run at the same time.
      • Data consistency requires mutual exclusion.
      • Daemon processes and scheduling of future runs are needed.
  • Goals
      • Handle concurrent access in an administrator-friendly way.
      • Support for periodic tasks.
      • Job control.
      • Priority mechanism.
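The goals above (one task at a time, priorities, periodic tasks) can be illustrated with a minimal sketch. This is not the BibSched code: the `Task` and `Scheduler` names, the "lower number is more urgent" priority convention, and the simulated clock passed to `run_until` are all assumptions made for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(order=True)
class Task:
    run_at: float                  # earliest time the task may start
    priority: int                  # assumption: lower number = more urgent
    seq: int                       # tie-breaker: submission order
    name: str = field(compare=False)
    action: Callable[[], None] = field(compare=False)
    period: Optional[float] = field(default=None, compare=False)


class Scheduler:
    """Hypothetical single-worker scheduler: tasks never overlap."""

    def __init__(self):
        self._queue = []
        self._seq = itertools.count()

    def submit(self, name, action, run_at=0.0, priority=5, period=None):
        task = Task(run_at, priority, next(self._seq), name, action, period)
        heapq.heappush(self._queue, task)

    def run_until(self, now):
        """Run every task due at or before `now`, strictly one at a time."""
        ran = []
        while self._queue and self._queue[0].run_at <= now:
            task = heapq.heappop(self._queue)
            task.action()          # mutual exclusion: nothing else runs meanwhile
            ran.append(task.name)
            if task.period is not None:
                # Periodic task: queue the next run one period later.
                self.submit(task.name, task.action,
                            task.run_at + task.period, task.priority, task.period)
        return ran
```

Running everything through a single loop is the simplest way to get mutual exclusion; a real daemon would also need persistence and job control.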

SLIDE 4

BibSched – Architecture

Figure 2: BibSched Architecture. (Diagram labels: BibSched, BibUpload, BibFormat, BibWords, Bib???, BibUpload Interface, BibSched Admin, BibFormat Interface, BibWords Interface, schTASKS.)

SLIDE 5

BibSched – Architecture

Figure 3: BibSched Tasks State Diagram. (States: WAITING, RUNNING, SLEEP SENT, SLEEPING, WAKING UP, STOPPING, STOPPED, DONE, ERROR. Signals: SIGUSR1, SIGUSR2, SIGCONT.)
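A task life cycle like the one in Figure 3 can be sketched as a small state machine. The state names are taken from the figure, but the arrows did not survive in this transcript, so the transition table below is an assumption. The sketch also does not say which of SIGUSR1, SIGUSR2 or SIGCONT triggers which transition.

```python
from enum import Enum


class TaskState(Enum):
    WAITING = "WAITING"
    RUNNING = "RUNNING"
    SLEEP_SENT = "SLEEP SENT"
    SLEEPING = "SLEEPING"
    WAKING_UP = "WAKING UP"
    STOPPING = "STOPPING"
    STOPPED = "STOPPED"
    DONE = "DONE"
    ERROR = "ERROR"


# Assumed transitions; the real diagram may differ.
ALLOWED = {
    TaskState.WAITING: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.SLEEP_SENT, TaskState.STOPPING,
                        TaskState.DONE, TaskState.ERROR},
    TaskState.SLEEP_SENT: {TaskState.SLEEPING},
    TaskState.SLEEPING: {TaskState.WAKING_UP},
    TaskState.WAKING_UP: {TaskState.RUNNING},
    TaskState.STOPPING: {TaskState.STOPPED},
    TaskState.STOPPED: set(),
    TaskState.DONE: set(),
    TaskState.ERROR: set(),
}


def advance(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Intermediate states such as SLEEP SENT and WAKING UP suggest that a request is recorded first and acknowledged by the task later; the table keeps those two steps separate.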

SLIDE 6

BibSched – Future Lines

  • User Interface
      • User-friendly task management.
      • General and task-specific managers.
  • Code Library
      • Properties common to tasks and interfaces: class design.
  • Several Queues
      • FIFO is simple, but better performance is possible.
      • Compatible tasks may run in parallel.
  • System Monitor
      • Resource-based task scheduling.

SLIDE 7

BibReformat – Overview

  • Problem
      • Output formats are stored in the database for performance.
      • Changes in bibliographic data require updating the stored formats.
  • Previous methodology
      • Download-Reformat-Upload cycle.
      • Slow and error-prone.
  • Goal
      • A simple way to update prestored formats.
      • Easy to use and reliable.

SLIDE 8

BibReformat – Implementation

Figure 4: BibReformat Architecture. (Diagram labels: Collection Select, Arbitrary Select, flxREFORMAT queue, BibReformat, Search, BibFormat, BibUpload.)

SLIDE 9

BibReformat – Future Lines

  • BibSched Merging
      • BibReformat was a precursor of BibSched, with a similar task queue mechanism.
      • BibReformat is now obsolete: rewrite it as a BibSched task.
  • User Interface Enhancement
      • Better collection select: collapse/expand(?).

SLIDE 10

BibWords – Overview

  • Fulltext Index
      • 220,000 fulltexts in the system.
      • The capability to search inside them is very valuable.
      • CDSware is one of the first bibliographic engines with combined metadata/reference/fulltext search!
  • Word Tables
      • Forward tables: each word maps to a list of document numbers.
      • Reverse tables: each document number maps to a list of words.
      • Many advantages for a small price.
  • Lexical Analysis
      • What is our data like? Many common words, or rather many rare ones?
      • Is the corpus size growing? What does the growth curve look like?
      • Will our current system implementation cope with it?

SLIDE 11

BibWords – Fulltext Index

  • Goals
      • Fulltext indexes for the maximal number of records.
      • Speed is a concern due to the high volume of data.
      • Good search speed.
  • Problems
      • Different file formats.
      • Heterogeneous file access.
      • Scanned bitmaps.
      • Slow file access for external documents (transfer rate, timeouts, ...).
      • Slow text extraction (mainly PS and PDF).
      • Huge number of words.

SLIDE 12

BibWords – Fulltext Index

Figure 5: BibWords Architecture. (Diagram labels: BibWords, bibwordsX, bibXXX, bibitem, Document Server, Helper Programs (pdf2text, antiword, ...), WWW.)

SLIDE 13

BibWords – Fulltext Index

Fulltext word extraction from URL

  • Indirect Lookup
      • A URL might be direct: “http://foo.bar/spam.pdf”
      • A URL might be indirect: “http://foo.bar/setlink?...”
      • Indirect URLs contain pointers to the actual locations of files.
  • Document retrieval
      • urllib does the job for us.
  • Text extraction
      • Helper programs permit flexible configurations (PPT, DOC, PS, PS.GZ, ...).
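The three steps above can be sketched roughly as follows. This is not the BibWords source: the function names, the extension-to-helper table, and the assumption that a helper prints plain text to standard output are illustrative guesses. The sketch uses the Python 3 spelling `urllib.request`, and indirect URLs are only detected, not resolved.

```python
import os
import re
import subprocess
import urllib.parse
import urllib.request

# Assumed mapping from file extension to helper program. The helper names
# appear in the slides; how they are invoked here is a guess.
HELPERS = {".pdf": "pdf2text", ".doc": "antiword"}


def helper_for(url):
    """Return the helper for a direct URL, or None when the URL has no known
    extension (for example an indirect 'setlink?...' URL)."""
    path = urllib.parse.urlparse(url).path
    extension = os.path.splitext(path)[1].lower()
    return HELPERS.get(extension)


def fetch(url, timeout=30):
    """Download the document body with urllib."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_text(helper, filename):
    """Run a helper on a local file, assuming it prints text to stdout."""
    result = subprocess.run([helper, filename],
                            capture_output=True, text=True, check=True)
    return result.stdout


def unique_words(text):
    """Lower-case the text and return the set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))
```

Keeping the helper table as plain data is what makes the configuration flexible: supporting another format means adding one entry rather than changing code.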

SLIDE 14

BibWords – Reverse Tables

  • Advantages
      • Database consistency throughout the indexing process.
      • Faster updates: no need to look through all the words.
  • Price
      • Lower indexing speed: the word list must be compressed and inserted.
      • Storage space.
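Why a reverse table speeds up updates can be seen in a small in-memory sketch. The `WordIndex` class is hypothetical and ignores the database storage and compression mentioned in the slides. It only shows that, when a document is re-indexed, the reverse table tells us exactly which forward entries to touch.

```python
from collections import defaultdict


class WordIndex:
    """Hypothetical in-memory forward and reverse word tables."""

    def __init__(self):
        self.forward = defaultdict(set)   # word -> set of document numbers
        self.reverse = {}                 # document number -> set of words

    def index(self, docno, words):
        """Index or re-index one document, touching only changed words."""
        new = set(words)
        old = self.reverse.get(docno, set())
        for word in old - new:            # words that disappeared
            self.forward[word].discard(docno)
            if not self.forward[word]:
                del self.forward[word]
        for word in new - old:            # words that are new
            self.forward[word].add(docno)
        self.reverse[docno] = new

    def search(self, word):
        """Return the sorted document numbers that contain `word`."""
        return sorted(self.forward.get(word, ()))
```

Without the reverse table, removing a stale document number would require scanning every word in the forward table; the cost is the extra storage for the per-document word sets.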

SLIDE 15

BibWords – Lexical Statistics

  • Analysis conditions
      • Analysis performed on indexable data (unique words in each document).
      • Documents ordered by system number.
  • Limitations
      • Every word is counted only once per document, whereas standard corpus statistics count every occurrence, so the results are not really comparable to standard corpus statistics.
      • Nonetheless useful for predicting system evolution.
      • System number is highly correlated with document type, so anomalies appear in the vocabulary growth curves.
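The counting convention described above (each word counted once per document) can be made concrete with a short sketch. The function name and the dictionary keys are illustrative, and taking the mean type frequency as tokens divided by types is an assumption about how that figure is defined.

```python
from collections import Counter


def lexical_stats(documents):
    """Compute type/token statistics where each word counts once per document.

    `documents` is an iterable of word lists.
    """
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))       # unique words only
    tokens = sum(doc_freq.values())
    types = len(doc_freq)
    hapax = sum(1 for count in doc_freq.values() if count == 1)
    dis = sum(1 for count in doc_freq.values() if count == 2)
    return {
        "tokens": tokens,
        "types": types,
        "hapax": hapax,                   # words seen in exactly one document
        "dis": dis,                       # words seen in exactly two documents
        "hapax_share": hapax / types if types else 0.0,
        "dis_share": dis / types if types else 0.0,
        "mean_type_frequency": tokens / types if types else 0.0,
    }
```

Under this convention a word repeated many times inside one document still counts once, which is why the numbers differ from occurrence-based corpus statistics.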

SLIDE 16

BibWords – Lexical Statistics

Figure 6: Frequency Distribution for Title Index. (Log-log plot; plot title: Frequency distribution, Title.)

SLIDE 17

BibWords – Lexical Statistics

Figure 7: Frequency Distribution for Stuttgarter Zeitung. (Log-log plot; axes: m, Vm.) Data by Anke Lüdeling and Stefan Evert, Universität Stuttgart.

SLIDE 18

BibWords – Lexical Statistics

                        Global        Titles       Stuttgarter Z. (a)
Tokens (N)              50,794,361    4,929,586    36,197,372
Types (lemma) (V)       1,584,356     83,733       714,972
Hapax legomena          1,189,207     39,234       404,579
Dis legomena            116,786       11,045       96,981
Proportion of Hapax     75.05%        46.86%       56.59%
Proportion of Dis       7.37%         13.19%       13.56%
Mean Type Frequency     50.03         59.28        50.628
Standard Deviation      2300.12       1360.38      5,616

Figure 8: Descriptive statistics on CDSware words.

(a) Words were processed individually instead of grouped by document.

SLIDE 19

BibWords – Lexical Statistics

Figure 9: Vocabulary Growth Curve for Titles Index. (Plot title: Vocabulary Growth Curves; series: all, hapax, dis.)

SLIDE 20

BibWords – Lexical Statistics

  • LNRE Statistics
      • Large number of hapax and dis legomena.
      • Most of the words are used only once.
      • The indexing cost of rare words is important!
  • Productive Process
      • Linguistic rules provide an ever-increasing vocabulary.
      • Growth for titles index: [value not recoverable] words/doc.
      • Growth for global index: [value not recoverable] words/doc.
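A vocabulary growth curve of the kind discussed here can be computed in one pass over the documents. The sketch below is an illustration under the same "once per document" counting convention, not the analysis code behind the figures; the function name and the tuple layout are assumptions.

```python
from collections import Counter


def growth_curves(documents):
    """Return one (all_types, hapax, dis) point per document, in order.

    `documents` is an iterable of word lists, for example ordered by
    system number.
    """
    doc_freq = Counter()
    hapax = 0                              # words seen in exactly one document
    dis = 0                                # words seen in exactly two documents
    points = []
    for words in documents:
        for word in set(words):
            doc_freq[word] += 1
            if doc_freq[word] == 1:
                hapax += 1
            elif doc_freq[word] == 2:
                hapax -= 1
                dis += 1
            elif doc_freq[word] == 3:
                dis -= 1
        points.append((len(doc_freq), hapax, dis))
    return points
```

The slope of the first component between two points gives a growth rate in new words per document over that stretch of the collection.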

SLIDE 21

BibWords – Future Lines

  • System Number Buckets
      • Updates are slowed down by the compress/decompress delay.
      • The higher the number of documents, the longer the times.
      • Solution: split up the system number range into chunks.
  • Multithreaded Text Extraction
      • Very slow document transfer and text extraction.
      • CPU usage always under 10%.
      • Solution: process several documents at the same time.
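Both proposals can be sketched in a few lines. The function names, the bucket size, and the worker count are assumptions for illustration; `extract_one` stands in for whatever downloads one document and extracts its text.

```python
from concurrent.futures import ThreadPoolExecutor


def bucket_ranges(first, last, size):
    """Split the inclusive system-number range [first, last] into chunks."""
    return [(low, min(low + size - 1, last))
            for low in range(first, last + 1, size)]


def extract_all(urls, extract_one, workers=8):
    """Apply extract_one(url) to many URLs concurrently.

    Threads fit here because the work is dominated by waiting on transfers
    and helper programs, not by CPU. Results keep the input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_one, urls))
```

With buckets, an update only has to decompress and recompress the word lists of the chunk that contains the changed document instead of one list covering the whole range.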

SLIDE 22

Conclusions

  • BibReformat
      • Small tool to automate bibliographic format updates: a prelude to BibSched.
  • BibSched
      • First working version of the scheduler daemon.
      • BibWords already integrated.
  • BibWords
      • One of the main system tools, strongly enhanced.
      • Fulltext possibilities, better update speed, consistent indexing.

SLIDE 23

Thank You! This has been a great summer!
