Building Full Text Indexes of Web Content using Open Source Tools



SLIDE 1

Building Full Text Indexes of Web Content using Open Source Tools

Erik Hetzner

erik.hetzner@ucop.edu

UC Curation Center, California Digital Library

30 June 2012

Erik Hetzner erik.hetzner@ucop.edu (CDL) Indexing 30 June 2012


SLIDE 2

CDL’s Web Archiving System

We don’t decide what to collect. We don’t decide when to collect it. We build tools to allow curators to make those decisions.


SLIDE 3

CDL’s Web Archiving System

Vital statistics

49 public archives
19 partners
3684 web sites
489,898,652 URLs (×2)
25.5 TB (×2)


SLIDE 8

CDL’s Web Archiving System

How we organize things

Each curator creates projects
Each project contains sites
Each site contains jobs


SLIDE 9

Actually existing web archive search

Why do we always see this?


SLIDE 10

Actually existing web archive search

URL Lookup


SLIDE 11

Actually existing web archive search

NutchWAX

Web Archiving eXtensions for Nutch. Nutch is an open source web crawler, with search. Web Archiving eXtensions written by Internet Archive.


SLIDE 12

Actually existing web archive search

WAS


SLIDE 13

Actually existing web archive search

Archive-It


SLIDE 14

Actually existing web archive search

Portuguese Web Archive


SLIDE 15

Actually existing web archive search

Library of Congress


SLIDE 16

Actually existing web archive search

Google


SLIDE 17

Some of the challenges

Scale

IA collections > 2 PB
WAS collections > 50 TB


SLIDE 18

Some of the challenges

Temporal search is not easy

[ michael jackson death ]


SLIDE 19

Some of the challenges

Resources

Google’s 2011 revenue: $38 bn. UC’s 2011/12 revenue: $22 bn.


SLIDE 20

Why a new indexing system?

Deduplication

Reduce redundant storage by storing pointers back to identical, previously captured content.
. . . but how to index this?
Couldn’t figure out how to make NutchWAX do this.


SLIDE 21

Why a new indexing system?

Curator-supplied metadata

Our curators supply metadata (primarily tags) about the sites they capture.
This metadata should be indexed.
Curators should be able to modify this metadata at any time.


SLIDE 22

Why a new indexing system?

NutchWAX

. . . and besides, Nutch is aging. Nutch now focused on crawling, not search. Our usage of NutchWAX was very slow.


SLIDE 23

Why a new indexing system?

Temporal web

. . . furthermore, web archive indexing is different. We capture the same URLs, again and again. It would be nice to build a web search system that takes time into account.


SLIDE 24

weari: a WEb ARchive Indexer

We began writing a new indexing system.
We want to write as little as possible (see resources, above).
So we stitched together FOSS tools.


SLIDE 25

Tools used

Scala

weari is written in the Scala language, to interact with Pig, Solr, etc.


SLIDE 26

Tools used

Tika

We mostly need to parse HTML, but PDFs are very important to our users
Not to mention Office documents
Apache software project
Wraps parsers for different file types in a uniform interface.
Parses most common file types.
Use the same code to parse different types.


SLIDE 27

Tools used

Tika difficulties

Some files are slow to parse. Some files blow up your memory. Some file parses never return.


SLIDE 28

Tools used

Tika solutions

Don’t parse files that are too big (e.g. > 2 MB)
Fork and monitor process from the outside (Hadoop comes in handy)
Preparse everything
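The first two points can be sketched in a few lines of Python. This is only an illustration, not weari's code (which is Scala and runs under Hadoop): `parse_guarded`, `parser_cmd` and the 60-second default timeout are hypothetical names and values, and the external parser command stands in for a Tika invocation. Running the parser in a child process means a slow, memory-hungry or hanging parse can be killed without taking the indexer down with it.

```python
import os
import subprocess
import sys

MAX_BYTES = 2 * 1024 * 1024  # size cutoff from the slide (2 MB)


def parse_guarded(path, parser_cmd, max_bytes=MAX_BYTES, timeout_s=60):
    """Run an external parser on `path` in a child process.

    Returns the parser's stdout as text, or None if the file was skipped,
    the parser timed out, or the parser exited with an error.
    `parser_cmd` is a hypothetical command list; the file path is appended.
    """
    # Don't parse files that are too big.
    if os.path.getsize(path) > max_bytes:
        return None
    try:
        # Fork and monitor from the outside: subprocess.run kills the
        # child if it has not finished within `timeout_s` seconds.
        result = subprocess.run(
            parser_cmd + [path],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        # A crash (e.g. out of memory) only affects the child process.
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

A caller would store each non-None result once (the "preparse everything" step) so that later indexing runs never touch the original file again.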


SLIDE 29

{ "filename" : "CDL-20070613172954-00002-ingest1.arc.gz",
  "digest" : "DWHNMIQN3OZLG3ZW2PZQCTEUOAWCL5RJ",
  "url" : "http://medlineplus.gov/",
  "date" : 1181755806000,
  "title" : "MedlinePlus Health Information ...",
  "length" : 24655,
  "content" : "MedlinePlus Health Information ...",
  "suppliedContentType" : { "top" : "text", "sub" : "html" },
  "detectedContentType" : { "top" : "text", "sub" : "html" },
  "outlinks" : [ 623129493561446160, ... ] }
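A short Python sketch of how a consumer might read such a preparsed record. It uses a trimmed copy of the record on this slide; the helper names `capture_time` and `media_type` are hypothetical, and reading `date` as milliseconds since the Unix epoch is an assumption rather than something the slide states.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the record shown on the slide (elided fields left out).
record_json = """
{ "filename": "CDL-20070613172954-00002-ingest1.arc.gz",
  "url": "http://medlineplus.gov/",
  "date": 1181755806000,
  "length": 24655,
  "detectedContentType": { "top": "text", "sub": "html" } }
"""


def capture_time(record):
    """Interpret `date` as milliseconds since the Unix epoch (assumption)."""
    return datetime.fromtimestamp(record["date"] / 1000, tz=timezone.utc)


def media_type(record):
    """Join the detected top-level type and subtype, e.g. text/html."""
    content_type = record["detectedContentType"]
    return f'{content_type["top"]}/{content_type["sub"]}'


record = json.loads(record_json)
print(capture_time(record).isoformat())
print(media_type(record))
```

Under that reading the capture time falls on 13 June 2007, within seconds of the timestamp embedded in the ARC filename (20070613172954), which supports the millisecond interpretation.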


SLIDE 30

Tools

What is Pig?

Platform for data analysis from Apache.
Based on Hadoop: fault-tolerant distributed processing.
Can be used for ad-hoc analysis, without writing Java code.
Embraced by the Internet Archive.


SLIDE 31

Tools

Why Solr?

Why not?
Widely used.
Takes the ‘kitchen sink’ approach to features.
HathiTrust work seems to show that it can scale up to our needs.


SLIDE 32

Tools

Solr difficulties

Cannot modify documents
  Solution: use stored fields, merge
Need fast check for deduplicated content
  Solution: fetch document IDs, look up in Bloom filter
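For readers unfamiliar with the second solution, here is a minimal Bloom filter sketch in Python. It is not weari's implementation: the class, its sizing defaults and the document ID format (URL plus digest) are all assumptions made for illustration. A Bloom filter answers "definitely not seen" or "possibly seen", so a positive answer would still need confirming against the index, while a negative answer lets the indexer skip the lookup entirely.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch with no false negatives and rare false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from two 64-bit
        # halves of a single SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Hypothetical document ID format: "<url>.<digest>".
seen = BloomFilter()
seen.add("http://medlineplus.gov/.DWHNMIQN3OZLG3ZW2PZQCTEUOAWCL5RJ")
```

In this sketch the filter would be filled once from the document IDs fetched from Solr, then consulted for every incoming capture.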


SLIDE 33

Tools

Thrift

To communicate between our WAS-specific Ruby code and Scala


SLIDE 34

Tools

Hadoop File System (HDFS)

To store parsed JSON files.


SLIDE 35

Merging docs

Original

digest : MQXNCI7KA3YBSJUZVHGXY3X2KBS56444
url : http://www.googlebooksettlement.com/help/bin/answer.py?answer=134644&hl=b5
arcname : CDL-20120530062015-00000-tanager.ucop.edu-00306642.arc.gz
date : 2012-05-30T06:37:03Z


SLIDE 36

Merging docs

New

digest : MQXNCI7KA3YBSJUZVHGXY3X2KBS56444
url : http://www.googlebooksettlement.com/help/bin/answer.py?answer=134644&hl=b5
arcname : CDL-20120530062015-00001-tanager.ucop.edu-00306642.arc.gz
date : 2012-05-30T06:20:50Z


SLIDE 37

Merging docs

Merged

digest : MQXNCI7KA3YBSJUZVHGXY3X2KBS56444
url : http://www.googlebooksettlement.com/help/bin/answer.py?answer=134644&hl=b5
arcname : CDL-20120530062015-00000-tanager.ucop.edu-00306642.arc.gz,
          CDL-20120530062015-00001-tanager.ucop.edu-00306642.arc.gz
date : 2012-05-30T06:37:03Z
       2012-05-30T06:20:50Z
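The merge shown on these three slides can be sketched as a small Python function. This is an illustration under assumptions, not weari's code: documents are modelled as plain dicts, `merge_docs` and `as_list` are hypothetical names, and the choice of `arcname` and `date` as the multi-valued fields is read off the slides.

```python
def as_list(value):
    """Treat a single value and a list of values uniformly."""
    return list(value) if isinstance(value, list) else [value]


def merge_docs(original, new, multi_fields=("arcname", "date")):
    """Fold a new capture into an existing document with the same content.

    Single-valued fields (digest, url) are kept from the original; the
    multi-valued fields accumulate one entry per capture, in arrival order.
    """
    if (original["digest"], original["url"]) != (new["digest"], new["url"]):
        raise ValueError("only captures with the same digest and url can be merged")
    merged = dict(original)
    for field in multi_fields:
        values = as_list(original[field])
        for value in as_list(new[field]):
            if value not in values:
                values.append(value)
        merged[field] = values
    return merged


original = {
    "digest": "MQXNCI7KA3YBSJUZVHGXY3X2KBS56444",
    "url": "http://www.googlebooksettlement.com/help/bin/answer.py?answer=134644&hl=b5",
    "arcname": "CDL-20120530062015-00000-tanager.ucop.edu-00306642.arc.gz",
    "date": "2012-05-30T06:37:03Z",
}
new = dict(
    original,
    arcname="CDL-20120530062015-00001-tanager.ucop.edu-00306642.arc.gz",
    date="2012-05-30T06:20:50Z",
)
merged = merge_docs(original, new)
```

Because the stored content is identical, only the capture metadata grows; the parsed text is indexed once per digest.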


SLIDE 38

So far

About 200 million unique documents
4 Solr shards
2 TB of index


SLIDE 40

Next steps

Better ranking

We have not explored ranking very much
We store a Rabin fingerprint for every URL and its outlinks
Have done some basic work with Webgraph tools to calculate ranks

http://webgraph.di.unimi.it/
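To make the idea of link-based ranks concrete, here is a small PageRank sketch in Python. It is an assumption that "ranks" means something PageRank-like, and the sketch does not use Webgraph (a Java framework); the function name, the damping factor and the tiny graph are illustrative. Nodes are integers, standing in for the stored URL fingerprints.

```python
def pagerank(outlinks, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {node_id: [target_ids]}.

    Node IDs are integers, standing in for URL fingerprints.
    Returns a dict {node_id: rank}; ranks sum to 1.
    """
    nodes = set(outlinks)
    for targets in outlinks.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not outlinks.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in outlinks.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Placeholder fingerprints: pages 1 and 2 link to 3, and 3 links back to 1.
ranks = pagerank({1: [3], 2: [3], 3: [1]})
```

In this toy graph, page 3 ends up ranked highest because it receives links from both other pages, and page 2 lowest because nothing links to it.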


SLIDE 41

Next steps

Speed improvements

Currently we index about 3k jobs per day
A lot of the slowness is related to merging content
Some of the slowness is probably due to Solr tuning


SLIDE 42

weari: a WEb ARchive Indexer

Tika + HDFS + Pig + Solr = weari

http://bitbucket.org/cdl/weari

Thanks!

erik.hetzner@ucop.edu
