SLIDE 1

IMAGE RETRIEVAL IN DIGITAL LIBRARIES

A LARGE SCALE MULTICOLLECTION EXPERIMENTATION OF MACHINE LEARNING TECHNIQUES

Jean-Philippe MOREUX Guillaume CHIRON (L3i, La Rochelle)

IFLA News Media Section Dresden, August 2017

SLIDE 2

Outline

  • Image Search in DLs
  • ETL (Extract, Transform, Load) approach on World War 1 theme
  • Machine Learning experimentation:
    • Image Genres Classification
    • Visual Recognition
  • Image Retrieval PoC
  • Conclusion

« L’Auto », photo lab, 1914

SLIDE 3

On gallica.bnf.fr:

  • 63% of the users consult the image collection; 85% know it exists [2017 survey]
  • 50% of the Top 500 user queries contain named entities: Person, Place, Historical Event [2016 analysis of 28M user queries]
  • For these encyclopedic queries, giving users access to iconographic resources could be a valuable service
  • But the Gallica image collection only contains 1.2M items: silence, limited number of illustrations (only 140 results for "Georges Clemenceau", 1910-1920)

Our Users are Looking for Images

Number of image documents found in Gallica for the first Top 100 queries on a named entity of type Person


Image Search in DLs

SLIDE 4

DLs are full of Images!

  • 1.2M pages manually indexed and tagged as "image" (photos, engravings, maps…)
  • Huge reservoir of potential images growing at a pace of 20M digitized pages/year

To make these assets visible to users, we need automation:

  • automatic recognition of images
  • automatic description of images


SLIDE 5

For Printed Content, OCR can help

Pages de Gloire, Feb. 1917; Le Miroir, Nov. 1918; La Science et la Vie, Dec. 1917

… to identify illustrations


SLIDE 6

And for other Materials?

  • Illuminated manuscripts and documents with no OCR: image detection algorithms
  • Video: each frame is an image

Bayerische Staatsbibliothek image-based similarity search (43M images indexed on morphological features); Google TensorFlow Object Detection API


SLIDE 7

We Have Millions of Images…

… but image retrieval is challenging…

  • Content-based image retrieval (CBIR) is still a scientific challenge
  • Heritage images are often stored in data silos of various types (drawings, engravings, photos…) which may need specific CBIRs
  • DL catalogs don’t handle image metadata (size, color, quality, etc.) at the illustration granularity


SLIDE 8

CBIR: Other Issues to Keep in Mind

  • Different image retrieval use cases must be considered:
    • Similarity search based on the selection of a source image
    • Content indexing with keywords
  • Various user needs, from mining pictures for social media reuse to the scientific study of bindings
  • Usability: DL web apps have been designed for searching catalog records and full text; they are page-based

Looking for cat & kitten vs. looking for coat of arms


SLIDE 9

Classic page flip mode for browsing heritage documents

The page paradigm is an obstacle



Dix-sept dessins de George Barbier sur le Cantique des Cantiques, 1914

SLIDE 10

Newspapers have multiple illustrations per page and double page spread illustrations

…particularly for newspapers


SLIDE 11
  • Extract-Transform-Load approach
  • On World War 1 materials: still images, newspapers, magazines, monographs, posters (1910-1920)
  • Enriched with Machine Learning techniques

Proof of Concept

Extract (from catalogs and OCR) → Transform (enrich the image metadata) → Load (image retrieval web app)


ETL approach

SLIDE 12

The Tool Bag

  • Standard tools and APIs
  • Machine Learning: Software as a Service (IBM Watson API) & pretrained models (Google TensorFlow)

Extract

  • Gallica APIs
  • OAI-PMH
  • SRU

Transform

  • Watson (IBM)
  • TensorFlow (Google)
  • IIIF
  • Tesseract

Load

  • BaseX
  • XQuery
  • IIIF
  • Masonry.js

 The glue: Perl and Python scripts
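As an illustration of the Extract step, a short Python script could build a query against the public Gallica SRU service and pull Dublin Core fields out of the XML response. The endpoint follows the public Gallica API; the `SAMPLE` response fragment below is hand-made for the sketch, not real API output.

```python
# Sketch of the Extract step: build a Gallica SRU "searchRetrieve" URL
# and parse Dublin Core fields out of the XML response.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

SRU_ENDPOINT = "https://gallica.bnf.fr/SRU"
SRW = "{http://www.loc.gov/zing/srw/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def build_sru_url(query, start=1, maximum=50):
    """Build the searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        "startRecord": start,
        "maximumRecords": maximum,
    }
    return SRU_ENDPOINT + "?" + urlencode(params)

def parse_dc_records(xml_text):
    """Return (title, date) pairs found in an SRU response body."""
    root = ET.fromstring(xml_text)
    records = []
    for data in root.iter(SRW + "recordData"):
        title = data.find(".//" + DC + "title")
        date = data.find(".//" + DC + "date")
        records.append(
            (title.text if title is not None else None,
             date.text if date is not None else None)
        )
    return records

# Hand-made sample response fragment (illustrative only).
SAMPLE = """<srw:searchRetrieveResponse
    xmlns:srw="http://www.loc.gov/zing/srw/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <srw:records><srw:record><srw:recordData><dc:dc>
    <dc:title>Le Miroir</dc:title><dc:date>1918</dc:date>
  </dc:dc></srw:recordData></srw:record></srw:records>
</srw:searchRetrieveResponse>"""
```

In a real harvest the URL would be fetched page by page (`startRecord` stepping by `maximumRecords`), with OAI-PMH used for bulk metadata instead.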


SLIDE 13

Extract

  • All the available metadata from our data sources: catalog records, images, OCR, ToC

Image MD (size, color…), catalog records, OCRed text around the image (when it exists), ToC


SLIDE 14

Extract: remarks

  • This first step is worth the effort: it gives users access to “invisible” illustrations! (invisible = deeply hidden within the printed content)
  • Challenges:
    • heterogeneity of formats, digitization practices and available metadata (e.g. image genres)
    • computationally intensive (but parallelizable)
    • noisy results for newspapers (≈50-70% of the illustrations are noise)

« Der Rosenkavalier » premiere in Dresden (Richard Strauss, Hugo von Hofmannsthal), L’Excelsior, 27/01/1911

SLIDE 15

Volumes

  • ≈300k usable illustrations (out of ≈900k extracted) from 490k pages: a bibliographic selection (WW1) plus samples of the newspapers collection
  • Over the same time period, Gallica offers 490k illustrations
  • Newspapers are (really) generous… (L’Excelsior: 90k illustrations, 3 ill./page)

 Over the entire digital collection, we can expect hundreds of millions of illustrations!  And this is just a scratch on the digital collections!

WW1 images database: sources of the images

SLIDE 16

Transform & Enrich

  • OCR around illustration (if no text is available): Tesseract
  • Topic modeling: semantic network, LDA (Latent Dirichlet Allocation)
  • Image genres classification: TensorFlow/Inception-v3 model
  • Image content recognition: Watson/Visual Recognition API



SLIDE 17

Image Genres Classification with TensorFlow

  • Machine learning approach based on Convolutional Neural Networks: Google Inception-v3 model (1,000 classes, Top-5 error rate: 3.46%)
  • Retrained (only the last layer, the “transfer learning” approach) on our ground-truth dataset (12 classes, 7,750 images)
  • Evaluated on a 1,950-image dataset
  • Retraining: ≈3-4 hours; labeling: <1s / image
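Retraining only the last layer amounts to fitting a small softmax classifier on the frozen penultimate-layer ("bottleneck") features produced by Inception-v3. A toy numpy sketch of that idea, where random 8-d clusters stand in for the real 2048-d bottleneck vectors (this is not the actual retraining script):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for frozen Inception-v3 "bottleneck" features
# (2048-d in reality; 8-d here). Two well-separated clusters
# play the role of two image genres.
X = np.vstack([rng.normal(2.0, 1.0, (100, 8)),
               rng.normal(-2.0, 1.0, (100, 8))])
y = np.array([0] * 100 + [1] * 100)
onehot = np.eye(2)[y]

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# "Retraining only the last layer": the convolutional base is frozen,
# so gradient descent updates just this one dense softmax layer.
W = np.zeros((8, 2))
b = np.zeros(2)
for _ in range(200):
    p = softmax(X @ W + b)
    grad = p - onehot                 # dLoss/dlogits for cross-entropy
    W -= 0.1 * X.T @ grad / len(X)
    b -= 0.1 * grad.mean(axis=0)

accuracy = (softmax(X @ W + b).argmax(axis=1) == y).mean()
```

Because only one dense layer is trained, retraining takes hours rather than the days or weeks a full training run would need.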


Image Genres Classification

SLIDE 18

Image Genres Classification with TensorFlow

  • Recall: 0.90
  • Accuracy: 0.90
  • The "noisy illustrations" can be removed: cover & blank pages from portfolios; text, ornaments & ads from newspapers

Better performance can be obtained with less generic models (e.g. monographs only: recall=94%) or with fully trained models (needs computing power)

SLIDE 19

Image Genres Classification: Filtering

  • Data mining raw OCR of newspapers can make you sick!

 Full-scale test on a newspaper title (6,000 ill.): 98.3% of the noisy illustrations are identified


SLIDE 20

Image Genres Classification: Q&A

  • Better performance can be obtained with less generic models (e.g. monographs only: F-measure=94%) or with fully trained models
  • Real-life use for newspapers? A 98.3% filtering rate means:
    • ≈900 noisy illustrations are missed on a 50,000-page newspaper title
    • ≈900 valuable illustrations are removed… but these can be (quickly) checked by humans!
  • A 94% classification rate means:
    • 6 illustrations out of every 100 are misclassified, but far fewer in real life, as we (sometimes) have genre metadata in our catalogs
    • Drawings or photos are classified as engravings, comics as drawings, etc. Not a big deal!
  •  Full-scale use is realistic for DLs
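The ≈900 figure can be sanity-checked with back-of-the-envelope arithmetic; the noisy-illustrations-per-page rate below is an assumption consistent with the ≈50-70% noise level reported earlier, not a measured value.

```python
# Back-of-the-envelope check of the slide's ≈900 missed-noise figure.
pages = 50_000
noisy_per_page = 1.05        # assumed average noisy illustrations per page
filtering_rate = 0.983       # share of noisy illustrations identified

noisy = pages * noisy_per_page
missed_noise = noisy * (1 - filtering_rate)  # noise that slips through
print(round(missed_noise))   # ≈900, matching the slide
```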


SLIDE 21

CBIR: Introduction

  • Historically, Content-Based Image Retrieval (CBIR) systems were designed to:
  • 1. Extract visual descriptors from an image,
  • 2. Deduce a signature from them, and…
  • 3. Search for similar images by minimizing distances in the signature space

 Flickr Similarity Search
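The three steps above can be sketched in a few lines: each image is reduced to a signature vector (a toy 4-bin color histogram here; the image names and vectors are invented for illustration), and similarity search is a nearest-neighbour lookup in the signature space.

```python
import numpy as np

# Minimal CBIR sketch: signatures are toy 4-bin color histograms.
signatures = {
    "engraving_01": np.array([0.7, 0.2, 0.1, 0.0]),
    "photo_02":     np.array([0.1, 0.2, 0.3, 0.4]),
    "map_03":       np.array([0.6, 0.3, 0.1, 0.0]),
}

def most_similar(query, db):
    """Return the db key whose signature is closest (L2) to the query."""
    return min(db, key=lambda k: np.linalg.norm(db[k] - query))

# Signature of the user's source image.
query = np.array([0.68, 0.22, 0.1, 0.0])
```

Real systems use richer descriptors (SIFT, CNN embeddings) and approximate nearest-neighbour indexes, but the query model is the same.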


Visual Recognition

SLIDE 22
  • The constraint that CBIR systems can only be queried with a source image (or a sketch drawn by the user) has a negative impact on their usability
  • Now, deep learning techniques tend to overcome these limitations, in particular thanks to visual recognition of objects in images, which enables textual queries

CBIR: Introduction


 IBM Watson Visual Recognition, Google TensorFlow Object Detection


SLIDE 23

Visual Recognition with IBM Watson

  • Visual Recognition Service API
  • Outputs pairs of class/confidence score
  • Detects objects, persons, faces, colors…

"images": [ {
    "classifiers": [ {
      "classes": [
        { "class": "armored personnel carrier", "score": 0.568,
          "type_hierarchy": "/vehicle/wheeled vehicle/armored vehicle/armored personnel carrier" },
        { "class": "armored vehicle", "score": 0.576 },
        { "class": "wheeled vehicle", "score": 0.705 },
        { "class": "vehicle", "score": 0.706 },
        { "class": "personnel carrier", "score": 0.541,
          "type_hierarchy": "/vehicle/wheeled vehicle/personnel carrier" },
        { "class": "fire engine", "score": 0.526,
          "type_hierarchy": "/vehicle/wheeled vehicle/truck/fire engine" },
        { "class": "truck", "score": 0.526 },
        { "class": "structure", "score": 0.516 },
        { "class": "Army Base", "score": 0.511,
          "type_hierarchy": "/defensive structure/Army Base" },
        { "class": "defensive structure", "score": 0.512 },
        { "class": "gas pump", "score": 0.5,
          "type_hierarchy": "/mechanical device/pump/gas pump" },
        { "class": "pump", "score": 0.5 },
        { "class": "mechanical device", "score": 0.501 },
        { "class": "black color", "score": 0.905 },
        { "class": "coal black color", "score": 0.691 }
      ] } ] } …

Top classes: black color/0.905, vehicle/0.706, coal black color/0.691, armored vehicle/0.576. Image: « Les tanks de la bataille de Cambrai, la reine d'Angleterre écoute les explications données par un officier anglais » ("The tanks of the Battle of Cambrai: the Queen of England listens to the explanations given by an English officer"), 1917
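The class/score pairs in such a response are straightforward to post-process. A small sketch, using a shortened hand-made response in the same shape as the Watson output above, keeps only the classes above a confidence threshold:

```python
import json

# Shortened, hand-made response in the shape of the Watson output above.
response = json.loads("""
{"images": [{"classifiers": [{"classes": [
  {"class": "vehicle", "score": 0.706},
  {"class": "wheeled vehicle", "score": 0.705},
  {"class": "black color", "score": 0.905},
  {"class": "gas pump", "score": 0.5}
]}]}]}
""")

def top_classes(resp, threshold=0.6):
    """Return (class, score) pairs above the threshold, highest first."""
    classes = resp["images"][0]["classifiers"][0]["classes"]
    kept = [(c["class"], c["score"]) for c in classes if c["score"] >= threshold]
    return sorted(kept, key=lambda cs: cs[1], reverse=True)
```

Low-confidence classes like "gas pump" (0.5) are exactly the anachronistic guesses discussed later, so thresholding matters.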


SLIDE 24

Experimentation on Person Detection

  • Ground truth of 2,200 images for Person detection, 600 images for Soldier detection
  • “Person”: recall=60.5%, accuracy=98.4%; with a WW1 custom classifier: recall=65%
  • “Soldier”: recall=56%, accuracy=80.5%
  • Modest rates, but we have to keep in mind that the Person and Soldier categories are not available in catalog records!
  • Keyword search on the Soldier GT ("soldier" OR "military officer" OR "gunner" OR…): recall=21% (65 images)
  • Visual recognition: recall=56% (172 images)
  • Mixed search (text + visual): recall=70% (215 images)

Soldiers moving a sculpture, 1918
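The mixed-search gain follows from simple set arithmetic: the images found by keywords and by visual recognition only partly overlap, so their union has higher recall than either alone. The hit counts are from the slide; the overlap and the implied size of the relevant set are derived, not stated.

```python
# Hit counts from the slide (Soldier ground truth).
keyword_hits = 65     # recall 21%
visual_hits = 172     # recall 56%
union_hits = 215      # recall 70% (mixed search)

# Images found by BOTH methods (inclusion-exclusion): 65 + 172 - 215.
overlap = keyword_hits + visual_hits - union_hits

# Implied size of the relevant set, consistent with all three recalls.
relevant = round(union_hits / 0.70)
```

Only ≈22 images are found by both methods, which is why combining text metadata and visual classes pays off so clearly.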


SLIDE 25

Experimentation on Soldier Detection


[Bar chart: recall for "Text MD only" vs. "Visual Recognition + custom classifier" vs. "Mixed MD"; mixed search reaches 70%]

SLIDE 26
  • Works on heritage documents: engravings, drawings, printed photography, even on "difficult" ones

Experimentation on Person Detection


SLIDE 27
  • But this experiment also exposes some limitations:
    • Generalization from a contemporary training corpus:  anachronisms
    • Generalization from a limited training corpus:  classification errors (1,000 classes is enough for encyclopedic search, not for the large spectrum of cultural heritage artifacts we curate)
    • Difficulty handling complex scenes (multiclass)

Experimentation on Person Detection

[Misclassification examples: Segway, armored vehicle, Bourgogne wine label, scene, picture frame]

SLIDE 28

Experimentation on Face Detection

  • The Watson API also performs Face and Gender detection:
    • “Face”: recall=30%, accuracy=99.8% / “Gender”: recall=22%
  • The mixed use of the two recognition APIs (Person and Face detection) improves the overall recall for Person detection from 60.5% to 65%


SLIDE 29

Experimentation on Face Detection

What are the use cases?

  • Digital Humanities: gender

studies, visual studies

  • Digital Mediation:
  • Arts and DH: Time Based Image Averaging

“Robots Reading Vogue” Ginosar et al., “A Century of Portraits. A Visual Historical Record of American High School Yearbooks”

“Gallica WW1 Portrait Gallery”

“Seb Przd, Time covers, 1923-2006”


SLIDE 30

Load (& Search)

  • In a XML database (BaseX)
  • Search with XQuery
  • Display with IIIF

Image metadata Catalog metadata Full text


Image Retrieval

SLIDE 31

Image Retrieval: the Data Deluge

  • The complexity of the search form and the large number of results it often returns reveal that searching and browsing in image databases carry specific usability issues and remain a research topic in their own right…


SLIDE 32

Retrieval: Encyclopedic Query on a Named Entity

  • Textual descriptors (metadata and OCR) are used. “Georges Clemenceau” query: 140 ill. in Gallica/Images, >1,000 in the WW1 DB
  • Caricatures can be found with the “drawing” facet


SLIDE 33

Retrieval: Image Metadata Query

  • Image descriptors are used. Search for large illustrations: maps, double page spreads, posters, comics…
  • Search for musical score covers with a red-dominant color


SLIDE 34

Retrieval: Encyclopedic Query on Concept

  • The conceptual classes extracted by the Watson API are used. A query on the superclass “vehicle” returns many instances of its subclasses (car, bicycle, airplane, airship, etc.)
  • Concepts overcome silent metadata or OCR, the multilanguage barrier, and lexical evolution
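A superclass query can exploit the `type_hierarchy` field of the Watson classes shown earlier: an image matches "vehicle" if that label appears anywhere on one of its class paths. The record layout below is an illustrative stand-in for the actual database schema.

```python
# Hypothetical superclass query over Watson-style class records.
images = [
    {"id": "ark1", "classes": [
        {"class": "armored personnel carrier", "score": 0.568,
         "type_hierarchy": "/vehicle/wheeled vehicle/armored vehicle/armored personnel carrier"}]},
    {"id": "ark2", "classes": [
        {"class": "portrait", "score": 0.8}]},
]

def matches_superclass(image, superclass):
    """True if any class of the image has `superclass` on its hierarchy path."""
    for c in image["classes"]:
        path = c.get("type_hierarchy", "/" + c["class"])
        if superclass in path.split("/"):
            return True
    return False

hits = [im["id"] for im in images if matches_superclass(im, "vehicle")]
```

This is why a query on "vehicle" retrieves cars, bicycles and airships without any of those words appearing in the metadata.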


SLIDE 35

Retrieval: Query on Concepts

  • Search for visuals of a gun inside a bunker: class=”bunker” AND class=”gun”
  • The textual metadata for this image is « Canon camouflé dans une casemate et soldat français » ("Camouflaged gun in a casemate and French soldier"). « Casemate » is an old synonym of bunker/blockhaus.  Classification overcomes language-dependent issues

SLIDE 36

Retrieval: Mixed Query

  • Conceptual classes, text and image MD are used
  • Search for visuals relating to the urban destruction following the Battle of Verdun: class=(”street” OR ”house”) AND keyword=”Verdun”


SLIDE 37

Retrieval: Mixed Query

  • Search for visuals of military vehicles used in the French colonies: class=”wheeled vehicle” AND keyword=(”sand” OR ”dune”)
  • (The image in the middle is a false positive)

« L’Aérosable », L'Aviation et l'automobilisme militaires : revue mensuelle des progrès scientifiques appliqués à la Défense nationale, 1914


SLIDE 38

Retrieval: Mixed Query

  • Study of the evolution of French soldiers’ uniforms during the conflict. The aim is to document the history of the famous red trousers, worn until the beginning of 1915
  • Based on two queries using conceptual classes (“soldier”, “officer”, etc.), record metadata (date), and an image-based criterion (“color”)

date < 31/12/1914 vs. date > 01/01/1915

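The two uniform queries can be sketched as a simple filter over the image records; the record fields below are illustrative stand-ins for the actual BaseX/XQuery layer described earlier, not the real schema.

```python
from datetime import date

# Toy image records combining conceptual classes, a record date and a
# dominant-color criterion (illustrative stand-ins for the real DB).
records = [
    {"classes": {"soldier"}, "date": date(1914, 9, 1), "color": "red"},
    {"classes": {"soldier"}, "date": date(1916, 3, 1), "color": "blue"},
    {"classes": {"horse"},   "date": date(1914, 5, 1), "color": "red"},
]

def query(recs, cls, before=None, after=None):
    """Keep records carrying class `cls` within the optional date bounds."""
    out = []
    for r in recs:
        if cls not in r["classes"]:
            continue
        if before and r["date"] > before:
            continue
        if after and r["date"] < after:
            continue
        out.append(r)
    return out

early = query(records, "soldier", before=date(1914, 12, 31))
late = query(records, "soldier", after=date(1915, 1, 1))
```

Splitting the result set on the 1914/1915 boundary is what lets the red-trousers transition show up in the image color statistics.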

SLIDE 39

Retrieval: Mixed Query

  • Same use case: evolution of aeronautical techniques during the conflict: date <= 1914 vs. date >= 1918
  • The illustrations returned by these queries could feed image-averaging approaches, which are increasingly moving beyond the artistic sphere to address other subjects and uses (e.g. automatic dating of photographs)


SLIDE 40

The IIIF Presentation API provides a way to describe the illustrations in a document using Open Annotations attached to a layer (Canvas) in the IIIF manifest.

Opening the Data


{
  "@id": "http://wellcomelibrary.org/iiif/b28047345/annos/contentAsText/a31i0",
  "@type": "oa:Annotation",
  "motivation": "oa:classifying",
  "resource": {
    "@id": "dctypes:Image",
    "label": "Picture"
  },
  "on": "http://mylibrary.org/iiif/b28047345/canvas/c31#xywh=201,1768,2081,725"
}
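Annotations in this shape can be generated programmatically for every extracted illustration. A sketch mirroring the example above (the URIs are placeholders, not real identifiers):

```python
# Sketch: attach an illustration's bounding box to a IIIF canvas as an
# Open Annotation, following the example record above.
def make_annotation(anno_id, canvas_uri, x, y, w, h, label="Picture"):
    return {
        "@id": anno_id,
        "@type": "oa:Annotation",
        "motivation": "oa:classifying",
        "resource": {"@id": "dctypes:Image", "label": label},
        # Media fragment selector: the illustration's box on the canvas.
        "on": f"{canvas_uri}#xywh={x},{y},{w},{h}",
    }

anno = make_annotation(
    "http://mylibrary.org/iiif/doc1/annos/a1",   # placeholder URI
    "http://mylibrary.org/iiif/doc1/canvas/c31", # placeholder URI
    201, 1768, 2081, 725,
)
```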

All the iconographic resources can then be used by machines: library-specific projects, data harvesting (Europeana), research, hackers/makers, social networks.

Conclusion

SLIDE 41

Conclusion

  • Unified access to all illustrations in an encyclopedic digital collection is an innovative service that meets a real need:
    • It will foster the reuse of illustrations
    • It also opens new perspectives for researchers (DH, visual studies)
  • The maturity of modern AI techniques in image content processing makes their integration into the digital library toolbox possible.
  • Their results, even imperfect, help make the large quantities of illustrations in our collections visible and searchable.


SLIDE 42

Digital Humanities focus

  • Today, the image is a new playground for DH researchers
  • Tomorrow, image datasets will be part of researchers’ daily life
  • AI tools will be free and commonplace
  • Heritage libraries will be solicited for their iconographic collections (web archives, photo collections, newspapers and magazines, etc.) for visual data mining


Note: Datasets, scripts and code are available: https://altomator.github.io/Image_Retrieval/

SLIDE 43


Portraits Gallery