NOVEMBER 16-18, 2016 • SEVILLE, SPAIN
Building a Search Engine for the Cuban Web
Jorge Luis Betancourt Search/Crawl Engineer
Who am I
Jorge Luis Betancourt González
Search/Crawl Engineer
Apache Nutch Committer & PMC
Apache Solr/ES enthusiast
Agenda
Introduction / Motivation
Cuba: Internet / Intranet
Global search engines can’t access documents hosted on the Cuban Intranet.
Writing your own web search engine
from scratch?
Common search engine features
Web search: HTML & documents (PDF, DOC)
Image search (size, format, color, objects)
News search (alerting, notifications)
How to fulfill these requirements?
At its core, a search engine:
stores some information (store)
retrieves this information when a question is received (query)
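The store/query pair can be illustrated with a toy inverted index. This is only a minimal sketch: the class name TinyIndex, the whitespace tokenizer and the AND matching are assumptions made for illustration, not how Solr or Nutch implement indexing.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the ids of the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> original text

    def store(self, doc_id, text):
        """Store a document and index every whitespace-separated term."""
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def query(self, question):
        """Return the sorted ids of documents containing all query terms."""
        terms = question.lower().split()
        if not terms:
            return []
        # .get() avoids creating empty postings for unknown terms.
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


index = TinyIndex()
index.store("doc-1", "Havana news today")
index.store("doc-2", "Havana weather")
print(index.query("havana news"))
```

A real engine adds analysis (stemming, stop words), ranking and persistence on top of this basic structure.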
Open Source to the rescue …
Crawler
Index server
Web interface
Apache Nutch
Nutch is a mature, production-ready web crawler. It enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are great for batch processing.
Apache Nutch
indexing, scoring,
Apache Solr
Total downloads: 8M+
Monthly downloads: 250,000+
Back to the list of features
Web search: HTML & documents (PDF, DOC)
Image search (size, format, color, objects)
News search (alerting, notifications)
Image search and thumbnails
Custom parser & indexer to store the image thumbnail
Custom parser, indexer & scoring to identify and store the text related to an image
How does it work?
[Diagram: an img element and the neighbouring p and h1 elements]
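One way to picture the idea of attaching nearby text to an image is the sketch below. It is not the Nutch plugin from the talk: the class ImageContextParser and the rule "use the most recent h1 and p before the image" are assumptions chosen for illustration, using only the Python standard library.

```python
from html.parser import HTMLParser


class ImageContextParser(HTMLParser):
    """For every <img>, record its alt text and the latest h1 / p text seen before it."""

    TEXT_TAGS = {"h1", "p"}

    def __init__(self):
        super().__init__()
        self.current_tag = None              # text tag currently open, if any
        self.last_text = {"h1": "", "p": ""}  # latest text seen per tag
        self.images = []                     # one dict per image found

    def handle_starttag(self, tag, attrs):
        if tag in self.TEXT_TAGS:
            self.current_tag = tag
            self.last_text[tag] = ""
        elif tag == "img":
            attributes = dict(attrs)
            self.images.append({
                "src": attributes.get("src") or "",
                "alt": attributes.get("alt") or "",
                "heading": self.last_text["h1"].strip(),
                "paragraph": self.last_text["p"].strip(),
            })

    def handle_endtag(self, tag):
        if tag == self.current_tag:
            self.current_tag = None

    def handle_data(self, data):
        if self.current_tag:
            self.last_text[self.current_tag] += data


page = '<h1>Old Havana</h1><p>The Capitol at dusk.</p><img src="capitol.jpg" alt="Capitol">'
parser = ImageContextParser()
parser.feed(page)
print(parser.images)
```

The collected heading and paragraph could then be indexed as extra fields of the image document, so that a text query can match images that carry no descriptive file name.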
News search (NRT & alerting)
Nutch is not really suited for this task: the batch nature of Hadoop jobs doesn’t fit well in this scenario.
Our topology
http://news-site.com
RSS
fetch parse index
Parses the RSS feed and outputs the news links to be processed according to the SC protocol.
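The RSS step can be sketched with the Python standard library. This is a simplified illustration, not the news-crawl implementation: the function name extract_news_links and the sample feed are hypothetical, and only plain RSS 2.0 item/link elements are handled.

```python
import xml.etree.ElementTree as ET


def extract_news_links(rss_xml):
    """Return the article links found in the <item> elements of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


# Hypothetical feed content for the news-site.com placeholder host.
sample_feed = """
<rss version="2.0">
  <channel>
    <title>News site</title>
    <item><title>First story</title><link>http://news-site.com/first</link></item>
    <item><title>Second story</title><link>http://news-site.com/second</link></item>
  </channel>
</rss>
"""

for url in extract_news_links(sample_feed):
    print(url)
```

Each extracted link would then go through the fetch, parse and index steps of the topology.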
https://github.com/commoncrawl/news-crawl monit
flaxsearch/luwak
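Luwak is a Java library for stored-query matching: queries are registered in advance and each incoming document is checked against them, which is what makes alerting possible. The sketch below is not Luwak's API; AlertMonitor and its methods are hypothetical names that only illustrate this reverse-search idea in Python.

```python
class AlertMonitor:
    """Keeps registered queries and reports which ones match a new document."""

    def __init__(self):
        self.queries = {}  # query id -> set of required terms

    def register(self, query_id, terms):
        """Register an alert that fires when all terms appear in a document."""
        self.queries[query_id] = {term.lower() for term in terms}

    def match(self, document_text):
        """Return the sorted ids of the registered queries matched by the document."""
        tokens = set(document_text.lower().split())
        return sorted(
            query_id
            for query_id, terms in self.queries.items()
            if terms <= tokens  # every required term is present
        )


monitor = AlertMonitor()
monitor.register("storm-alert", ["hurricane", "havana"])
monitor.register("sports-alert", ["baseball"])
print(monitor.match("Hurricane warning issued for Havana"))
```

Running the match as each news item is indexed lets notifications go out without waiting for a batch job.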
Querying the data
Web search: HTML & documents (PDF, DOC)
Image search (size, format, color, objects)
News search (alerting, notifications)
Apache Solr
releases)
replication)
The features, once again
Web search: HTML & documents (PDF, DOC)
Image search (size, format, color, objects)
News search (alerting, notifications)
Other features - monitoring
We needed a way of monitoring our infrastructure. Without a great Internet connection you can’t send GB of logs to a cloud environment, so …
(and facets) analytical tool (and logs) (and metrics) time series store
Other features - monitoring
(and facets) analytical tool (and logs) (and metrics) time series store (and logs) parsing & aggregation
Banana (Kibana port) for visualizations
Infrastructure
[Diagram: Nutch crawlers, Solr master, Solr replica and web front end, connected over HTTP and JAVABIN]
Some usage stats
Fewer than 10,000 visits
Around 600 unique visitors
Future work
Apply deep learning techniques to process the raw images and mix them with the current approach
Increase the number of signals that we get from our crawlers (correlate even more crawl-related events)
Thanks Questions?
@jorgelbg