Invenio Technology Introduction Selected Practical Software - - PowerPoint PPT Presentation

SLIDE 1

Invenio Technology Tibor Šimko Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Invenio Technology

Selected Practical Software Development Lessons From A Large Digital Library System

Tibor Šimko <tibor.simko@cern.ch>

Department of Information Technology
CERN

August 2011 / openlab talk

SLIDE 2

Outline

1. Introduction
   Digital Library
   Invenio
2. Case Studies
   Episode 1: Python
   Episode 2: Git
   Episode 3: Test Suite
   Episode 4: Building Efficient Indexes
   Episode 5: Load-balancing
3. Conclusions

SLIDE 4

What is a Digital Library?

“library in which collections are stored in digital formats (as opposed to print, microform, or other media) and accessible by computers”

(1) institutional document repositories
(2) world-wide subject-based information systems

Example #1: CERN Document Server
  managing CERN and selected non-CERN high-energy physics and related documents
  since ∼1993
  more than 1,000,000 records
  articles, books, theses, photos, videos, and more
  powered by Invenio, free digital library software
  http://cdsweb.cern.ch/

SLIDE 6

CDS: Collection Tree

SLIDE 7

CDS: Search for Books

SLIDE 8

CDS: Search for Photos

SLIDE 9

CDS Features: Commenting

SLIDE 10

Invenio Features: Reviewing

SLIDE 11

CDS: Create Personal Alert

SLIDE 12

CDS: Add to Personal Basket

SLIDE 13

CDS: Display Personal Basket

SLIDE 14

CDS: Organize and Share Your Baskets

SLIDE 15

CDS: Journals and Bulletins

SLIDE 16

What is a Digital Library?

Example #2: INSPIRE
  world-wide high-energy physics information system
  run by CERN, DESY, FNAL, SLAC
  metadata curation since 1960s, Invenio technology since 2007
  citation analysis, author/affiliation analysis
  close partnership with arXiv and ADS
  http://inspirebeta.net/

SLIDE 17

INSPIRE: full-text search

SLIDE 18

INSPIRE: cite summary

SLIDE 19

INSPIRE: citation history

SLIDE 20

INSPIRE: author pages

SLIDE 22

Invenio Key Features

navigable collection tree (regular, virtual)
powerful search engine
  Google-like speed for up to 5M records
  combined metadata, reference and fulltext search
flexible metadata (MARC, OA)
  handling any kind of document (multimedia)
  customizable input, formatting and linking
personalization and collaborative features:
  alerts, baskets, groups, reviews, comments
internationalisation (28 languages)
open source, GNU General Public License
  co-developed by CERN (2002–), EPFL (2004–), DESY/FNAL/SLAC (2008–), CfA (2009–)
  installed at ∼30 institutions world-wide

SLIDE 30

Invenio Architecture: Overview

[Diagram labels: Author, Ingestion, Database, Sources, Processing, Dissemination, User, Curation, Librarian, Overview]

SLIDE 37

Invenio Modules: Ingestion

[Diagram labels: Author; WebSubmit; WebSession, WebAccess; Metadata; Full-text; full-text document; BibUpload; BibSched; BibConvert; metadata; MARCXML; BibHarvest; OAI Data Source; ElmSubmit; Non-OAI Data Source; Ingestion]
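
The ingestion diagram lists MARCXML among its labels. As a rough illustration of what such a metadata record looks like, the sketch below parses a tiny hand-written MARCXML record with Python's standard library only. The record content and the helper name get_subfields are invented for this example; the use of field 245 $a for the title and 100 $a for the main author follows the usual MARC 21 convention. No Invenio API is used or implied.

```python
import xml.etree.ElementTree as ET

# MARCXML namespace of the Library of Congress "MARC21 slim" schema.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written sample record, invented for this example.
SAMPLE = """\
<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">12345</controlfield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Doe, J.</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>
"""


def get_subfields(record, tag, code):
    """Return all values of subfield `code` in datafields with `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [node.text for node in record.findall(path, NS)]


record = ET.fromstring(SAMPLE)
title = get_subfields(record, "245", "a")   # title statement
author = get_subfields(record, "100", "a")  # main author
print(title, author)
```

A real ingestion chain would validate and convert records before uploading them; this sketch only shows the shape of the data.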

SLIDE 44

Invenio Modules: Processing

[Diagram labels: Metadata, Full-text, RefExtract, BibClassify, Clusters, BibIndex, WebColl, BibRank, BibFormat, Processing]

SLIDE 57

Invenio Modules: Dissemination

[Diagram labels: Metadata, Full-text, Clusters, WebSearch, User, WebBasket, WebTag, WebAlert, BibHarvest, OAI Harvester, WebComment, WebMessage, WebJournal, BibCirculation, WebStat, WebHelp, Dissemination]

SLIDE 72

Invenio Modules: Curation

[Diagram labels: Metadata, Librarian, Full-text, BibEdit, MultiEdit, BatchUploader, BibCheck, BibCirculation, BibDocFile, BibClassify, RefExtract, Tasks, BibCatalog, Knowledge Bases, BibKnowledge, BibExport, BibMatch, BibMerge, Curation]

SLIDE 73

Invenio Modules: Summary

∼33 modules
codebase
  ∼290,000 lines of Python code
  ∼12,000 lines of JavaScript code
  ∼6,000 lines of XSL code
  ∼5,000 lines of autotools code
∼75 authors since inception
  ∼25 authors and contributors in 2010
  many short-term students
  importance of informal coding standards
∼10 years of development
  started at CERN, first release in 2002
  now co-developed world-wide (EU, US)
lego programming... but no silver bullet

SLIDE 75

Why Python?

easy to read and understand (good for many temporary developers)
suitable for rapid prototyping (good for organic-growth software development model)
write code to throw it away

SLIDE 76

Art of Ikebana

Japanese art of flower arrangement
“way of flowers”
natural shapes, graceful lines
minimalism
“disciplined art form in which nature and humanity are brought together”

SLIDE 77

Art of Ikebana Programming

Java?

    new Callable() { public Object call(Object x) { return x.times(k) } }

Python!

    lambda x: k * x
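
To make the Python side of this comparison concrete, here is a minimal sketch in which the multiplier k is captured from an enclosing function. The names make_scaler and triple are illustrative and do not come from the talk.

```python
def make_scaler(k):
    """Return a function that multiplies its argument by k."""
    # The lambda closes over k from the enclosing call.
    return lambda x: k * x


triple = make_scaler(3)
scaled = list(map(triple, [1, 2, 3]))
print(scaled)
```

The one-line lambda carries the same information as the anonymous class: one parameter, one captured value, one expression.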

SLIDE 79

Speeding Up Python

bytecode interpreted language: what about speed?
Cython makes it easy to write C extensions
combining efficiency of C with high-levelness of Python

Example: intbitset.pyx

    ctypedef unsigned long long int word_t
    ctypedef struct IntBitSet:
        int size
        int allocated
        word_t trailing_bits
        int tot
        word_t *bitset
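
To illustrate the idea behind a structure like this one, the following pure-Python sketch packs a set of non-negative integers into a list of fixed-width words. It is not the intbitset implementation: the class name TinyBitSet, the 64-bit word width, the method set, and the sample values are all assumptions made for illustration, and it has none of the speed a C extension would give.

```python
WORD_BITS = 64  # assumed word width, mirroring an unsigned 64-bit word_t


class TinyBitSet:
    """Set of non-negative integers stored as a list of 64-bit words."""

    def __init__(self, items=()):
        self.words = []  # words[i] holds members i*64 .. i*64 + 63
        for item in items:
            self.add(item)

    def add(self, n):
        if n < 0:
            raise ValueError("only non-negative integers are supported")
        index, bit = divmod(n, WORD_BITS)
        # Grow the word list on demand so that word `index` exists.
        if index >= len(self.words):
            self.words.extend([0] * (index + 1 - len(self.words)))
        self.words[index] |= 1 << bit

    def __contains__(self, n):
        if n < 0:
            return False
        index, bit = divmod(n, WORD_BITS)
        return index < len(self.words) and bool(self.words[index] >> bit & 1)

    def __and__(self, other):
        # Intersection is a word-by-word bitwise AND.
        result = TinyBitSet()
        result.words = [a & b for a, b in zip(self.words, other.words)]
        return result

    def __iter__(self):
        # Yield members in ascending order.
        for index, word in enumerate(self.words):
            for bit in range(WORD_BITS):
                if word >> bit & 1:
                    yield index * WORD_BITS + bit


# Hypothetical identifier sets, e.g. two partial search results.
first_hits = TinyBitSet([1, 5, 70, 200])
second_hits = TinyBitSet([5, 70, 300])
common = sorted(first_hits & second_hits)
print(common)
```

The point of the layout is that set operations reduce to bitwise operations on whole words, which is exactly the kind of inner loop that benefits from being written in C.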

SLIDE 81

Why Git?

good for distributed teams
  offline development possible
  “pull on demand” collaboration model (as opposed to “shared push” collaboration model)
inherent, natural code review process
  commit early, commit often (to private repositories)
  rebase and clean (before pushing for public consumption)
interplay with SVN

SLIDE 91

Git Branches

C1 master C2 C3 v1.0.0 C4 M1 maintenance M2 C5 M3 v1.0.1 M4 N1 next

slide-92
SLIDE 92

Invenio Technology Tibor Šimko Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Git Branches

C1 master C2 C3 v1.0.0 C4 M1 maintenance M2 C5 M3 v1.0.1 M4 N1 next C6

slide-93
SLIDE 93

Invenio Technology Tibor Šimko Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Git Branches

C1 master C2 C3 v1.0.0 C4 M1 maintenance M2 C5 M3 v1.0.1 M4 N1 next C6 N2

slide-94
SLIDE 94

Invenio Technology Tibor Šimko Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Git Branches

C1 master C2 C3 v1.0.0 C4 M1 maintenance M2 C5 M3 v1.0.1 M4 N1 next C6 N2 C7 v1.1.0

slide-95
SLIDE 95

Invenio Technology Tibor Šimko Introduction

Digital Library Invenio

Case Studies

Episode 1: Python Episode 2: Git Episode 3: Test Suite Episode 4: Building Efficient Indexes Episode 5: Load-balancing

Conclusions

Git Branches

C1 master C2 C3 v1.0.0 C4 M1 maintenance M2 C5 M3 v1.0.1 M4 N1 next C6 N2 C7 v1.1.0 M5 v1.0.2

slide-96
SLIDE 96

Git Branches

[Figure: commit graph with branch master (C1 to C7, tags v1.0.0 and v1.1.0), branch maintenance (M1 to M5, tags v1.0.1 and v1.0.2) and branch next (N1, N2)]

maint: release maintenance branch

master: new feature branch

next: things not yet release-ready

slide-109
SLIDE 109

Git Development

[Figure: commit graph with long-lived branches master (C1 to C4), maint (M1 to M3) and next (N1 to N4), and topic branches some-bugfix (B1, B2), some-new-feature (F1, F2) and some-experimental-feature (E1, E2); M3, C3, C4, N3 and N4 are marked "merge"]

slide-110
SLIDE 110

Git collaboration model

slide-112
SLIDE 112

Unit testing

test-driven development when appropriate

e.g. before/while developing strip_accents(), write:

Example: search_engine_tests.py

    class TestStripAccents(unittest.TestCase):
        """Test for handling of UTF-8 accents."""

        def test_strip_accents(self):
            """search engine - stripping of accented letters"""
            self.assertEqual("memememe",
                             search_engine.strip_accents('mémêmëmè'))
            self.assertEqual("MEMEMEME",
                             search_engine.strip_accents('MÉMÊMËMÈ'))
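For readers who want to run a test like the one on this slide, here is a minimal sketch of a function that would satisfy it. It assumes modern Python 3 and Unicode decomposition through the standard unicodedata module. It is an illustration only, not Invenio's actual strip_accents(), whose implementation the slides do not show.

```python
import unicodedata
import unittest


def strip_accents(text):
    """Return text with combining accent marks removed (illustrative only)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep every other character unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class TestStripAccents(unittest.TestCase):
    """Test for handling of accented letters, mirroring the slide's test."""

    def test_strip_accents(self):
        self.assertEqual("memememe", strip_accents("mémêmëmè"))
        self.assertEqual("MEMEMEME", strip_accents("MÉMÊMËMÈ"))
```

Saved to a file, the test case can be run with python -m unittest. Writing the two assertions first pins down the expected behaviour before any implementation exists, which is the point the slide makes.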

slide-113
SLIDE 113

Functional testing

functional/acceptance/regression testing

testbed site (Atlantis Institute of Fictive Science)

e.g. Python mechanize module to emulate browser

Example: websearch_regression_tests.py

    class WebSearchSearchEnginePythonAPITest(unittest.TestCase):
        "Check typical search engine Python API calls on the demo data."

        def test_search_engine_python_api_for_failed_query(self):
            "websearch - search engine Python API for failed query"
            self.assertEqual([],
                             perform_request_search(p='aoeuidhtns'))

        def test_search_engine_python_api_for_successful_query(self):
            "websearch - search engine Python API for successful query"
            self.assertEqual([8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 47],
                             perform_request_search(p='ellis'))

slide-114
SLIDE 114

Web testing

sometimes we need to run tests in a real browser

e.g. pages with heavy JavaScript

using Selenium IDE extension for Firefox

record and replay browser actions

test for text existence or non-existence on pages

test for link labels and targets

Example: test_search_ellis.html

    <tr><td>open</td> <td>http://localhost</td> <td></td> </tr>
    <tr><td>type</td> <td>p</td> <td>ellis</td> </tr>
    <tr><td>clickAndWait</td> <td>action_search</td> <td></td> </tr>
    <tr><td>verifyTextPresent</td> <td>1. Thermal conductivity of dense quark matter and cooling of stars</td> <td></td> </tr>

slide-116
SLIDE 116

Designing A Search Engine

performance-driven design assumptions:

high number of selects, low number of updates

fast searching, slow indexation

cache everything cacheable

search functionality:

search for words, phrases, regular expressions

search in any field, authors, titles, etc.

index design:

forward indexes:

    word1 → [rec1, rec2, ...]
    word2 → [rec2, rec7, ...]

reverse indexes:

    rec1 → [word1, word8, ...]
    rec2 → [word1, word2, ...]

Zipf’s law on word frequency:

few words occur very often (e.g. the)

most words are infrequent (even e.g. boson)
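The index design on this slide can be sketched in a few lines of Python. The example follows the slide's naming (forward index: word to records, reverse index: record to words). The toy records, the function names and the choice of Python sets are assumptions made for illustration, not Invenio's actual storage format.

```python
from collections import defaultdict

# Toy records; the record IDs and texts are made up for illustration.
RECORDS = {
    1: "the higgs boson search",
    2: "the search for the top quark",
    7: "boson production in the detector",
}


def build_indexes(records):
    """Build a forward index (word -> record IDs) and a reverse index
    (record ID -> words), following the naming used on the slide."""
    forward = defaultdict(set)
    reverse = {}
    for recid, text in records.items():
        words = set(text.split())
        reverse[recid] = words
        for word in words:
            forward[word].add(recid)
    return dict(forward), reverse


def search_all_words(forward, query):
    """Return the sorted IDs of records containing every query word."""
    hit_sets = [forward.get(word, set()) for word in query.split()]
    if not hit_sets:
        return []
    # Intersect starting from the smallest set: frequent words such as "the"
    # have long hit lists (Zipf's law), so rarer words should prune first.
    hit_sets.sort(key=len)
    result = set(hit_sets[0])
    for hits in hit_sets[1:]:
        result &= hits
    return sorted(result)


forward_index, reverse_index = build_indexes(RECORDS)
```

A query such as search_all_words(forward_index, "the boson") reads one hit set per word and intersects them, so searching stays cheap while every record update has to touch many word entries. That matches the "fast searching, slow indexation" assumption. The reverse index is what makes it possible to find the word entries to clean up when a record changes.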

slide-117
SLIDE 117

Search Engine Under Cover

slide-118
SLIDE 118

Measuring the Performance

three important speed factors to consider:

speed of finding sets (DB Server)

speed of demarshaling sets (DB ↔ Web App Server)

speed of intersecting sets (Web App Server)

Example: speed of various parts (2002, before optimization)

    action / query:      "CERN 2002"   "of the this"
    -----------------------------------------------
    fetching              0.28 sec      0.34 sec
    demarshaling          0.78 sec      1.10 sec
    adding colls          0.37 sec      0.63 sec
    intersecting          0.64 sec      1.19 sec
    -----------------------------------------------
    total search time     2.07 sec      3.22 sec

slide-119
SLIDE 119

Optimizing Data Structures

data structures tested:

‘sorted’ (lists, Patricia trees)

‘unsorted’ (hashed sets, binary vectors)

fast prototyping: (Python, Lisp in 2002)

throw-away coding to test ideas

Example: lists vs dicts, 350K sets in 800K universe

    marshaling lists ..... 532616+532571 bytes in 1.33 sec
    demarshaling lists ... 350000+350000 items in 0.10 sec
    merging lists ........ 546965 items in 0.34 sec
    intersecting lists ... 153035 items in 0.35 sec
    marshaling dicts ..... 576491+576450 bytes in 0.87 sec
    demarshaling dicts ... 350000+350000 items in 0.36 sec
    merging dicts ........ 546965 items in 0.09 sec
    intersecting dicts ... 153035 items in 0.15 sec
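A throw-away benchmark in the spirit of this slide might look like the sketch below, written for modern Python 3. The function names, the random data and the use of the standard marshal module for serialization are assumptions; the original 2002 prototype is not shown in the slides and may have differed in every detail.

```python
import marshal
import random
import time

UNIVERSE = 800_000   # record IDs range over 0..UNIVERSE-1, as on the slide
SET_SIZE = 350_000   # size of each hit set, as on the slide


def timed(label, func):
    """Run func(), print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = func()
    print(f"{label:<22} {time.perf_counter() - start:.2f} sec")
    return result


def intersect_sorted_lists(a, b):
    """Intersect two sorted lists with a linear merge-style scan."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def run_benchmark(universe=UNIVERSE, set_size=SET_SIZE, seed=0):
    """Time marshaling and intersection for sorted lists versus dicts."""
    rng = random.Random(seed)
    list_a = sorted(rng.sample(range(universe), set_size))
    list_b = sorted(rng.sample(range(universe), set_size))
    # Dicts with dummy values play the role of hashed sets.
    dict_a = dict.fromkeys(list_a, 1)
    dict_b = dict.fromkeys(list_b, 1)

    blob = timed("marshaling lists", lambda: marshal.dumps(list_a))
    timed("demarshaling lists", lambda: marshal.loads(blob))
    from_lists = timed("intersecting lists",
                       lambda: intersect_sorted_lists(list_a, list_b))

    blob = timed("marshaling dicts", lambda: marshal.dumps(dict_a))
    timed("demarshaling dicts", lambda: marshal.loads(blob))
    from_dicts = timed("intersecting dicts",
                       lambda: [k for k in dict_a if k in dict_b])

    # Both representations must agree on the answer.
    assert sorted(from_dicts) == from_lists
    return len(from_lists)
```

Calling run_benchmark() prints one timing line per step. Absolute numbers on current hardware will differ greatly from the 2002 figures; only the relative comparison between the two representations is of interest.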

slide-120
SLIDE 120

. . . and the winner is:

binary vectors found to be the best compromise!

using Numeric Python module (in 2002)

typical search time gain: 4.0 sec → 0.2 sec (in 2002)

typical indexing time loss: 7 hours → 4 days (in 2002)

mostly sparse data modelled via mostly dense data structure?

free your mind, think critically

further optimization:

Numeric module not addressing real bits, only bytes

so home-made intbitset C extension in 2007

addressing real bits (factor of 8 already)

saving space, saving (indexing) time
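The idea of a binary vector with one real bit per record can be illustrated with plain Python integers. This is only a sketch of the technique under that assumption. It does not reproduce the API of the intbitset C extension, and all function names here are hypothetical.

```python
def to_bitvector(recids):
    """Encode a collection of non-negative record IDs as one Python int,
    where bit i is set when record i belongs to the set."""
    vector = 0
    for recid in recids:
        vector |= 1 << recid
    return vector


def from_bitvector(vector):
    """Decode a bit vector back into a sorted list of record IDs."""
    recids, position = [], 0
    while vector:
        if vector & 1:
            recids.append(position)
        vector >>= 1
        position += 1
    return recids


def dense_size_in_bytes(universe_size):
    """Bytes needed to hold one bit per record in the universe."""
    return (universe_size + 7) // 8


hits_word1 = to_bitvector([1, 2, 8, 47])
hits_word2 = to_bitvector([2, 7, 47])

both_words = hits_word1 & hits_word2     # set intersection (AND query)
either_word = hits_word1 | hits_word2    # set union (OR query)
```

Intersection and union become single bitwise operations over the whole vector, which is where the search-time gain comes from. The cost is that every word carries a vector as large as the universe, however few records it matches. Storing one bit rather than one byte per record cuts that size by the factor of 8 mentioned on the slide.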

slide-122
SLIDE 122

Splitting Web App Server and DB Server

load of CDS Web and DB servers at the split time:

split leads to efficient use of OS resources by lone, non-competing Web and DB daemon processes

slide-123
SLIDE 123

Load-Balanced Setup

useful for “LHC First Beam Day” rush situations with many concurrent visitors

Apache mod_proxy_balancer

[Figure: User → Load Balancer → App Worker 1, App Worker 2, App Worker 3 → DB Server]

slide-124
SLIDE 124

Measuring Scalability

using siege to simulate concurrent users and to measure throughput on a sample of typical URLs

Example: inspirebeta.net under gentle siege

    $ siege -d 1 -c 20 -t 1m -f inspirebeta_urls.txt
    Transactions:              1329 hits
    Availability:              100.00 %
    Elapsed time:              60.23 secs
    Data transferred:          37.12 MB
    Response time:             0.41 secs
    Transaction rate:          22.07 trans/sec
    Throughput:                0.62 MB/sec
    Concurrency:               8.96
    Successful transactions:   1329
    Failed transactions:       0
    Longest transaction:       3.05
    Shortest transaction:      0.01
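The derived figures in the siege report can be cross-checked from the raw ones with a few lines of arithmetic. The sketch below copies its inputs from the report; the variable names are illustrative, and the concurrency line is an estimate based on Little's law rather than siege's own calculation.

```python
# Figures copied from the siege report above.
transactions = 1329          # hits
elapsed_secs = 60.23
data_mb = 37.12
response_time_secs = 0.41    # mean, as rounded in the report

transaction_rate = transactions / elapsed_secs      # trans/sec
throughput = data_mb / elapsed_secs                 # MB/sec
# Little's law: mean requests in flight = arrival rate x mean response time.
estimated_concurrency = transaction_rate * response_time_secs

print(f"transaction rate: {transaction_rate:.2f} trans/sec")
print(f"throughput:       {throughput:.2f} MB/sec")
print(f"concurrency:      {estimated_concurrency:.2f} (estimate)")
```

The transaction rate and throughput reproduce the reported values after rounding. The concurrency estimate lands close to the reported 8.96, and the small gap is consistent with the response time having been rounded to two decimals.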

slide-125
SLIDE 125

Conclusions

selected lessons from building a digital library system

∼300,000 LOCs from ∼75 authors over ∼10 years

value of rapid prototyping

value of organic-growth software development model

value of coding aesthetics and minimalism

moral from selected anecdotes?

“Never Lose A Holy Curiosity” (A. Einstein)