Building a Search Engine for the Cuban Web Jorge Luis Betancourt - PowerPoint PPT Presentation

Building a Search Engine for the Cuban Web Jorge Luis Betancourt Search/Crawl Engineer November 16-18, 2016 • Seville, Spain

Who am I • Jorge Luis Betancourt González • Search/Crawl Engineer • Apache Nutch Committer & PMC • Apache Solr/ES enthusiast

Agenda • Introduction & motivation • Technologies used • Customizations • Conclusions and future work

Introduction / Motivation [Diagram: Cuba, Internet, Intranet] Global search engines can’t access documents hosted on the Cuban intranet

Writing your own web search engine from scratch? or …

Common search engine features 1. Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection 2. Image search (size, format, color, objects) • thumbnails • show metadata • filters (facets) • match text with images 3. News search (alerting, notifications) • near real time • email, push, SMS

How to fulfill these requirements? At its core, a search engine is a store and a query engine: it stores some information and retrieves that information when a question is received
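The store/query idea on this slide can be illustrated with a toy in-memory inverted index. This is only a sketch to make the concept concrete: the class name `TinySearchStore`, the AND-only query semantics, and the sample documents are assumptions for illustration, not how Nutch or Solr are implemented.

```python
from collections import defaultdict
import re


class TinySearchStore:
    """Toy store + query engine backed by an in-memory inverted index."""

    def __init__(self):
        self.docs = {}                     # doc_id -> original text
        self.postings = defaultdict(set)   # term -> set of doc_ids

    @staticmethod
    def tokenize(text):
        # Lowercase and split on non-word characters.
        return re.findall(r"\w+", text.lower())

    def store(self, doc_id, text):
        # "Stores some information": keep the text and index its terms.
        self.docs[doc_id] = text
        for term in self.tokenize(text):
            self.postings[term].add(doc_id)

    def query(self, question):
        # "Retrieves that information": AND semantics over all query terms.
        terms = self.tokenize(question)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


engine = TinySearchStore()
engine.store("a", "Universidad de La Habana")
engine.store("b", "Noticias de La Habana")
engine.store("c", "Universidad de Oriente")
result = engine.query("universidad habana")
```

A real engine adds ranking, analysis chains and persistence on top of this structure, which is what the open source components on the following slides provide.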

Open Source to the rescue … 1. crawler 2. index server 3. web interface

Apache Nutch “Nutch is a well matured, production ready Web crawler. Enables fine grained configuration, relying on Apache Hadoop™ data structures, which are great for batch processing.”

Apache Nutch • Highly scalable • Highly extensible • Pluggable parsing, protocols, storage, indexing, scoring • Active community • Apache License

Apache Solr • Total downloads: 8M+ • Monthly downloads: 250,000+ • Apache License • Great community • Highly modular • Stability / Scalability • Based on Lucene • Battle tested

Back to the list of features 1. Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection 2. Image search (size, format, color, objects) • show metadata • thumbnails • filters (facets) • match text with images 3. News search (alerting, notifications) • near real time • email, push, SMS

Image search and thumbnails • Custom parser & indexer to store the image thumbnail • Custom parser & indexer & scoring to identify and store the text related to an image (h1, p, img)

How does it work? [Diagram: numbered steps 1-3 over a page's h1, p and img elements]
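One way to picture "text related to an image" is to attach the nearest preceding `h1` and `p` text to each `img`. The sketch below does this with Python's standard `html.parser`; it is an assumption about the general idea, not the actual Nutch parser plugin (which would be written in Java). The class name, the dictionary keys and the "nearest preceding text" heuristic are all hypothetical.

```python
from html.parser import HTMLParser


class ImageContextParser(HTMLParser):
    """Collects each <img> with the most recent <h1> and <p> text seen before it."""

    def __init__(self):
        super().__init__()
        self._capture = None   # "h1" or "p" while inside that element
        self._buffer = []
        self.heading = ""      # text of the last closed <h1>
        self.paragraph = ""    # text of the last closed <p>
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._capture = tag
            self._buffer = []
        elif tag == "img":
            attributes = dict(attrs)
            self.images.append({
                "src": attributes.get("src", ""),
                "alt": attributes.get("alt") or "",
                "heading": self.heading,
                "text": self.paragraph,
            })

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            # Collapse whitespace before storing the captured text.
            text = " ".join("".join(self._buffer).split())
            if tag == "h1":
                self.heading = text
            else:
                self.paragraph = text
            self._capture = None


def extract_image_context(html):
    parser = ImageContextParser()
    parser.feed(html)
    parser.close()
    return parser.images


page = (
    "<h1>Malecón at sunset</h1>"
    "<p>The sea wall in Havana.</p>"
    '<img src="/img/malecon.jpg" alt="Malecón">'
)
images = extract_image_context(page)
```

The collected `heading`, `text` and `alt` values could then be indexed as searchable fields next to the image URL, so that a text query can match an image.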

News search (NRT & alerting) Nutch is not really suited for this task: the batch nature of Hadoop jobs doesn’t fit this scenario well

Our topology • Source: http://news-site.com • Stages: RSS, fetch, parse, index • Monitor: flaxsearch/luwak • The RSS stage parses the RSS feed and outputs the news links to be processed according to the SC protocol. • https://github.com/commoncrawl/news-crawl
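The two ideas in this topology, turning an RSS feed into links to fetch and matching incoming news against stored queries for alerting, can be sketched in a few lines. This is a simplified illustration only: luwak is a Java stored-query library and its API is not used here, and the function names, the sample feed and the "all terms must occur" matching rule are assumptions.

```python
import xml.etree.ElementTree as ET


def news_links(rss_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:
            items.append((title, link))
    return items


def matching_alerts(stored_queries, text):
    """Return names of stored queries whose terms all occur in the text."""
    words = set(text.lower().split())
    return sorted(
        name for name, terms in stored_queries.items()
        if all(term in words for term in terms)
    )


# Hypothetical feed content; the URLs follow the placeholder host on the slide.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>News site</title>
  <item><title>Hurricane season begins</title><link>http://news-site.com/a</link></item>
  <item><title>Baseball finals tonight</title><link>http://news-site.com/b</link></item>
</channel></rss>"""

# Hypothetical stored queries: alert name -> required terms.
ALERTS = {"weather": ["hurricane"], "sports": ["baseball", "finals"]}

links = news_links(SAMPLE_FEED)
triggered = {link: matching_alerts(ALERTS, title) for title, link in links}
```

Matching each new document against the registered queries, instead of running each query against the index, is what makes alerting possible as soon as a news item arrives.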

Querying the data 1. Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection 2. Image search (size, format, color, objects) • show metadata • thumbnails • filters (facets) • match text with images 3. News search (alerting, notifications) • near real time • email, push, SMS

Apache Solr • Solr has full support for highlighting (3 implementations) • powerful faceting capabilities (even more in recent releases) • autocorrection support based on the index content • awesome scalability (SolrCloud, classic master-slave replication)
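As a rough illustration, the three query-side features above map to standard Solr request parameters (`hl`, `hl.fl`, `facet`, `facet.field`, `spellcheck`, `spellcheck.collate`). The sketch below only builds the request URL. The host, the core name `web` and the field names `content`, `lang` and `type` are assumptions, and spellchecking additionally requires a spellcheck component configured on the request handler.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, user_query, facet_fields, highlight_field):
    """Build a Solr /select URL asking for highlighting, facets and spellcheck."""
    params = [
        ("q", user_query),
        ("wt", "json"),
        ("hl", "true"),                  # highlighting
        ("hl.fl", highlight_field),      # field to highlight
        ("facet", "true"),               # filters (facets)
        ("spellcheck", "true"),          # autocorrection
        ("spellcheck.collate", "true"),  # ask for a corrected full query
    ]
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "/select?" + urlencode(params)


# Hypothetical core and field names.
url = build_select_url(
    "http://localhost:8983/solr/web", "habana", ["lang", "type"], "content"
)
parsed = parse_qs(urlsplit(url).query)
```

Sending this URL with any HTTP client would return the matching documents together with `highlighting`, `facet_counts` and `spellcheck` sections in the response, provided the corresponding components are enabled.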

The features, once again 1. Web search: HTML & documents (PDF, DOC) • highlighting • suggestions • filters (facets) • autocorrection 2. Image search (size, format, color, objects) • show metadata • thumbnails • filters (facets) • match text with images 3. News search (alerting, notifications) • near real time • email, push, SMS

Other features - monitoring We needed a way of monitoring our infrastructure. Without a great Internet connection you can’t send GBs of logs to a cloud environment, so … • (and metrics) time series store • (and logs) analytical tool (and facets)

Other features - monitoring • (and logs) parsing & aggregation • (and metrics) time series store • (and logs) analytical tool (and facets)
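The "parsing & aggregation" step can be pictured as turning raw log lines into per-minute counts that a time series store can hold and facet on. The sketch below is an illustration under assumptions: the log line format (`YYYY-MM-DD HH:MM:SS LEVEL message`), the sample lines and the function name are hypothetical and do not describe the tooling used in the talk.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed log format: "2016-11-16 10:00:05 INFO some message"
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>[A-Z]+) (?P<msg>.*)$"
)


def aggregate(lines):
    """Count log events per (minute, level); unparseable lines are skipped."""
    buckets = Counter()
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        minute = ts.replace(second=0).strftime("%Y-%m-%dT%H:%M")
        buckets[(minute, match.group("level"))] += 1
    return buckets


sample = [
    "2016-11-16 10:00:05 INFO fetching http://example.cu/",
    "2016-11-16 10:00:40 ERROR fetch failed",
    "2016-11-16 10:00:50 INFO fetching http://example.cu/noticias",
    "2016-11-16 10:01:02 INFO parsing done",
    "not a log line",
]
counts = aggregate(sample)
```

Each `(minute, level)` bucket could be indexed as one small document, which keeps the data local and lets faceting answer questions such as "errors per minute" without shipping raw logs elsewhere.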

Banana (Kibana port) for visualizations

Infrastructure • Web (HTTP) • Solr 2: replica (HTTP) • Solr 1: master (HTTP, javabin) • Crawlers: Nutch

Some usage stats • fewer than 10,000 visits • around 600 unique visitors

Future work • Apply deep learning techniques to process the raw images and combine the results with the current approach • Increase the number of signals that we get from our crawlers (correlate even more crawl-related events)

Thanks. Questions? jorgelbg@apache.org • @jorgelbg
