Building a Search Engine for the Cuban Web - Jorge Luis Betancourt




slide-1
SLIDE 1

November 16-18, 2016 • Seville, Spain

Building a Search Engine for the Cuban Web

Jorge Luis Betancourt Search/Crawl Engineer

slide-2
SLIDE 2

Who am I

Jorge Luis Betancourt González

  • Search/Crawl Engineer
  • Apache Nutch Committer & PMC
  • Apache Solr/ES enthusiast

slide-3
SLIDE 3

Agenda

  • Introduction & motivation
  • Technologies used
  • Customizations
  • Conclusions and future work
slide-4
SLIDE 4

Introduction / Motivation

Diagram labels: Cuba, Internet, Intranet

Global search engines can’t access documents hosted on the Cuban intranet.

slide-5
SLIDE 5

Writing your own web search engine from scratch?

or …
slide-6
SLIDE 6

Common search engine features

Web search: HTML & documents (PDF, DOC)

  • highlighting
  • filters (facets)
  • suggestions
  • autocorrection

Image search (size, format, color, objects)

  • thumbnails
  • filters (facets)
  • show metadata
  • match text with images

News search (alerting, notifications)

  • near real time
  • email, push, SMS
slide-7
SLIDE 7

How to fulfill these requirements?

At its core, a search engine does two things: it stores some information, and it retrieves that information when a question (a query) is received.
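
The store/query idea on this slide can be illustrated with a toy inverted index. This is only a sketch under stated assumptions: the `TinyIndex` class, its method names, and the sample documents are hypothetical, and real engines such as Lucene/Solr add analysis, ranking, and persistence on top of this.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index (hypothetical): maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def store(self, doc_id, text):
        # Keep the raw text and register every lowercased word under this document id.
        self.docs[doc_id] = text
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add(doc_id)

    def query(self, question):
        terms = re.findall(r"\w+", question.lower())
        if not terms:
            return []
        # AND semantics: keep only documents that contain every query term.
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


index = TinyIndex()
index.store("a", "Havana is the capital of Cuba")
index.store("b", "Seville is a city in Spain")
print(index.query("capital Cuba"))
```

The query uses plain set intersection, so there is no relevance ranking; that is one of the pieces a production index server provides.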

slide-8
SLIDE 8

Open Source to the rescue …

1. Crawler
2. Index server
3. Web interface

slide-9
SLIDE 9

Apache Nutch

Nutch is a mature, production-ready web crawler. It enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.

slide-10
SLIDE 10

Apache Nutch

  • Highly scalable
  • Highly extensible
  • Pluggable parsing, protocols, storage, indexing, scoring
  • Active community
  • Apache License
slide-11
SLIDE 11

Apache Solr

Total downloads: 8M+
Monthly downloads: 250,000+

  • Apache License
  • Highly modular
  • Based on Lucene
  • Great community
  • Stability / Scalability
  • Battle tested
slide-12
SLIDE 12

Back to the list of features

Web search: HTML & documents (PDF, DOC)

  • highlighting
  • filters (facets)
  • suggestions
  • autocorrection

Image search (size, format, color, objects)

  • thumbnails
  • filters (facets)
  • show metadata
  • match text with images

News search (alerting, notifications)

  • near real time
  • email, push, SMS
slide-13
SLIDE 13

Image search and thumbnails

  • Custom parser & indexer to store the image thumbnail
  • Custom parser, indexer & scoring to identify and store the text related to an image

Diagram labels: img, p, h1

slide-14
SLIDE 14

How does it work?

Diagram: a page with img, p, and h1 elements, annotated with the numbers 1 to 3.
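
One way to picture "text related with an image" is to pair each `img` with the nearest preceding `h1` and `p` text. The sketch below is an assumption, not the Nutch plugin from the talk: the `ImageContext` class, the nearest-preceding-text heuristic, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class ImageContext(HTMLParser):
    """Hypothetical helper: for every <img>, record the latest <h1> and <p> text seen so far."""

    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.last_text = {"h1": "", "p": ""}
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            # Start collecting fresh text for this heading or paragraph.
            self.current_tag = tag
            self.last_text[tag] = ""
        elif tag == "img":
            attributes = dict(attrs)
            self.images.append({
                "src": attributes.get("src", ""),
                "alt": attributes.get("alt", ""),
                "heading": self.last_text["h1"].strip(),
                "paragraph": self.last_text["p"].strip(),
            })

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None

    def handle_data(self, data):
        # Only text inside an open <h1> or <p> is kept.
        if self.current_tag:
            self.last_text[self.current_tag] += data


PAGE = (
    "<h1>Old Havana</h1>"
    "<p>The cathedral at dusk.</p>"
    '<img src="/img/cathedral.jpg" alt="Cathedral">'
)
parser = ImageContext()
parser.feed(PAGE)
parser.close()
print(parser.images)
```

The collected `alt`, `heading`, and `paragraph` strings could then be indexed as searchable fields next to the image URL; how those fields are weighted is a separate scoring decision.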

slide-15
SLIDE 15

News search (NRT & alerting)

Nutch is not really suited to this task: the batch nature of Hadoop jobs doesn’t fit this scenario well.

slide-16
SLIDE 16

Our topology

http://news-site.com

RSS

fetch, parse, index

Parses the RSS feed and outputs the news links to be processed, according to the StormCrawler (SC) protocol.
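
As a rough illustration of the RSS step, the sketch below pulls the item links out of an RSS 2.0 document using only the Python standard library. It is not the news-crawl code: the `extract_links` function and the sample feed are hypothetical, and a real topology would emit each link to the fetch stage instead of returning a list.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real crawler would download this from the news site.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example news</title>
    <link>http://news-site.com</link>
    <item><title>First story</title><link>http://news-site.com/a</link></item>
    <item><title>Second story</title><link>http://news-site.com/b</link></item>
  </channel>
</rss>"""


def extract_links(feed_xml):
    """Return the <link> of every <item> in an RSS 2.0 feed, in document order."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


print(extract_links(SAMPLE_FEED))
```

Only links inside `item` elements are returned, so the channel-level link to the site itself is skipped.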

https://github.com/commoncrawl/news-crawl

monitor

flaxsearch/luwak
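
Alerting inverts the usual search flow: queries are stored up front and every incoming news item is matched against them. The sketch below shows that idea only; it does not use luwak's actual Java API, and the `AlertRegistry` class, its methods, and the sample alerts are hypothetical.

```python
import re


class AlertRegistry:
    """Hypothetical stored-query matcher: alerts are registered once, documents stream in."""

    def __init__(self):
        self.queries = {}

    def register(self, alert_id, terms):
        # An alert fires when all of its terms appear in a document.
        self.queries[alert_id] = {term.lower() for term in terms}

    def match(self, text):
        tokens = set(re.findall(r"\w+", text.lower()))
        return sorted(
            alert_id
            for alert_id, terms in self.queries.items()
            if terms <= tokens
        )


registry = AlertRegistry()
registry.register("baseball", ["baseball", "final"])
registry.register("weather", ["hurricane"])
print(registry.match("Hurricane warning issued for the east"))
```

Each matching alert id could then be handed to a notification channel (email, push, SMS); this loop checks every stored query per document, which a dedicated library would avoid by pre-filtering the candidate queries.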

slide-17
SLIDE 17

Querying the data

Web search: HTML & documents (PDF, DOC)

  • highlighting
  • filters (facets)
  • suggestions
  • autocorrection

Image search (size, format, color, objects)

  • thumbnails
  • filters (facets)
  • show metadata
  • match text with images

News search (alerting, notifications)

  • near real time
  • email, push, SMS
slide-19
SLIDE 19

Apache Solr

  • Solr has full support for highlighting (3 implementations)
  • powerful faceting capabilities (even more in recent releases)
  • autocorrection support based on the index content
  • awesome scalability (SolrCloud, classic master-slave replication)
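
To make the highlighting, faceting, and autocorrection bullets concrete, here is a sketch that builds a Solr select URL with the standard `hl`, `facet`, and `spellcheck` request parameters. The base URL, the core name `web`, and the field names `content`, `host`, and `type` are assumptions for illustration, and `spellcheck` only has an effect if a spellcheck component is configured on the request handler.

```python
from urllib.parse import urlencode


def build_select_url(base_url, user_query):
    """Build a Solr /select URL; field names below are hypothetical."""
    params = [
        ("q", user_query),
        ("hl", "true"),            # turn highlighting on
        ("hl.fl", "content"),      # field to highlight (assumed name)
        ("facet", "true"),         # turn faceting on
        ("facet.field", "host"),   # facet fields (assumed names)
        ("facet.field", "type"),
        ("spellcheck", "true"),    # needs a configured spellcheck component
        ("wt", "json"),
    ]
    return base_url + "/select?" + urlencode(params)


url = build_select_url("http://localhost:8983/solr/web", "la habana")
print(url)
```

Passing the parameters as a list of pairs lets `facet.field` repeat, which is how Solr accepts several facet fields in one request.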

slide-20
SLIDE 20

The features, once again

Web search: HTML & documents (PDF, DOC)

  • highlighting
  • filters (facets)
  • suggestions
  • autocorrection

Image search (size, format, color, objects)

  • thumbnails
  • filters (facets)
  • show metadata
  • match text with images

News search (alerting, notifications)

  • near real time
  • email, push, SMS
slide-22
SLIDE 22

Other features - monitoring

We needed a way of monitoring our infrastructure. Without a great Internet connection you can’t send GB of logs to a cloud environment, so …

Diagram labels: (and facets), analytical tool, (and logs), (and metrics), time series store

slide-23
SLIDE 23

Other features - monitoring

Diagram labels: (and facets), analytical tool, (and logs), (and metrics), time series store, (and logs), parsing & aggregation
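
The "parsing & aggregation" step can be pictured as turning raw log lines into time-bucketed counts that a time series store can hold. The sketch below is an assumption about what such a step might look like, not the pipeline from the talk: the log format (`date time LEVEL message`), the `errors_per_minute` function, and the sample lines are hypothetical.

```python
from collections import Counter
from datetime import datetime


def errors_per_minute(lines):
    """Count ERROR lines per minute for logs shaped like 'YYYY-MM-DD HH:MM:SS LEVEL message'."""
    counts = Counter()
    for line in lines:
        parts = line.split(" ", 3)
        if len(parts) < 4:
            continue  # not enough fields, skip
        date, time, level, _message = parts
        if level != "ERROR":
            continue
        try:
            stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # malformed timestamp, skip
        counts[stamp.strftime("%Y-%m-%d %H:%M")] += 1
    return dict(counts)


LOG_LINES = [
    "2016-11-16 10:15:02 ERROR fetch failed for http://news-site.com/a",
    "2016-11-16 10:15:40 INFO fetched http://news-site.com/b",
    "2016-11-16 10:15:59 ERROR parse failed for http://news-site.com/c",
    "2016-11-16 10:16:03 ERROR fetch failed for http://news-site.com/d",
    "garbage line",
]
print(errors_per_minute(LOG_LINES))
```

Aggregating locally like this keeps the data volume small, which matters when shipping raw logs off-site is not an option.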

slide-24
SLIDE 24

Banana (Kibana port) for visualizations

slide-25
SLIDE 25

Infrastructure

Diagram: Nutch crawlers, Solr master, Solr replica, and the web front end. The connections are labeled HTTP (four links) and JAVABIN (one link), with the numbers 1 and 2.

slide-26
SLIDE 26

Some usage stats

  • fewer than 10 000 visits
  • around 600 unique visitors

slide-27
SLIDE 27

Future work

  • Apply deep learning techniques to process the raw images and combine the results with the current approach
  • Increase the number of signals that we get from our crawlers (correlate even more crawl-related events)

slide-28
SLIDE 28

Thanks

Questions?

  • jorgelbg@apache.org
  • @jorgelbg