The original vision of Nutch, 14 years later: Building an open - PowerPoint PPT Presentation

The original vision of Nutch, 14 years later: Building an open source search engine Apache Big Data Europe 2016 sylvain@sylvainzimmer.com @sylvinus

/usr/bin/whoami • Jamendo (Founder & CTO, 2004-2011) • TEDxParis (Co-founder, 2009-2012) • dotConferences (Founder, 2012-) • Pricing Assistant (Co-founder & CTO, 2012-)

"The original motivation for the Nutch project was to provide a transparent alternative to the growing power of a handful of private search services over most users’ view of the Web. However, as Nutch has been adopted with greater enthusiasm by smaller organizations, the Nutch Organization has de-emphasized operating a multi-billion-page index in the public interest." CommerceNet Labs Technical Report, Nov 2004

again?

transparency reproducibility

https://uidemo.commonsearch.org

https://explain.commonsearch.org/?q=python&g=en

Agenda • Values & tech choices • Search engine components • Challenges • Opportunities

Values & tech choices

Radical transparency • Open source (Apache License v2) • Open data • (Governance)

Privacy • Results can be tailored by language/country, but NOT by user/cookie/sessionid • \o/ Cache everything! • Tor service: http://comsearchl2zlnre.onion
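Because results depend only on the query and a coarse locale, every response is cacheable for all users. A minimal sketch of such a cache key, assuming the `q` and `g` parameter names seen in the explain URL above; the `c` (country) parameter and the `cache_key` helper are hypothetical:

```python
import hashlib
from urllib.parse import urlencode


def cache_key(query: str, lang: str, country: str = "") -> str:
    """Build a cache key from the query and coarse locale only.

    No user id, cookie, or session id is part of the key, so two users
    sending the same query in the same locale share one cache entry.
    """
    params = [
        ("q", " ".join(query.lower().split())),  # collapse case and whitespace
        ("g", lang.lower()),
        ("c", country.lower()),  # hypothetical country parameter
    ]
    canonical = urlencode(params)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Anything user-specific would have to be added to the key to affect results, which makes personalisation an explicit and visible design change.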

Participation & Pragmatism • Use high-level languages as much as possible (Python, Go) • Embrace active communities (Spark, Elasticsearch) • Use mainstream participation platforms, even if they are nonfree (GitHub, Slack)

Search engines

Crawler • Indexer • Database • Ranker • Searcher • Source: The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) http://infolab.stanford.edu/~backrub/google.html

Crawler

http://commoncrawl.org

Today at 3:30pm !

http://scrapy.org

http://github.com/cocrawler/cocrawler

Indexer

Specs • HTML parsing & analysis • Tokenization / NLP • Static rankings • Language detection • I/O from crawls to databases
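As an illustration of the tokenization step, here is a minimal sketch using only the Python standard library. It is an assumption about what a simple tokenizer could look like, not the Common Search implementation; the `tokenize` name is hypothetical:

```python
import re
import unicodedata

# \w+ matches runs of letters, digits and underscores in any script.
TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def tokenize(text: str) -> list:
    """Normalise Unicode, fold case, and split text into word tokens."""
    normalized = unicodedata.normalize("NFKC", text).casefold()
    return TOKEN_RE.findall(normalized)
```

A real indexer would likely add language-specific rules (stemming, stop words, CJK segmentation) on top of this.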

Common Search Pipeline • Doc sources (Common Crawl, WARC files, URLs, ...) • Filter plugins • Document parsing • Output plugins • Data output (Database, file, HDFS, S3, ...)
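The stages on this slide (sources, filter plugins, parsing, output plugins) can be sketched as a small generic loop. This is a hedged illustration only: `Document`, `run_pipeline`, `https_only` and `title_parser` are hypothetical names, not the Common Search plugin API.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Document:
    url: str
    html: str
    fields: dict = field(default_factory=dict)


def run_pipeline(
    source: Iterable[Document],
    filters: List[Callable[[Document], bool]],
    parse: Callable[[Document], Document],
    outputs: List[Callable[[Document], None]],
) -> int:
    """Stream documents through filter, parse and output stages.

    Returns the number of documents that reached the outputs.
    """
    count = 0
    for doc in source:
        # A document is dropped as soon as one filter rejects it.
        if not all(keep(doc) for keep in filters):
            continue
        parsed = parse(doc)
        for write in outputs:
            write(parsed)
        count += 1
    return count


def https_only(doc: Document) -> bool:
    """Example filter plugin: keep only HTTPS URLs."""
    return doc.url.startswith("https://")


def title_parser(doc: Document) -> Document:
    """Toy parser: extract the <title> text with plain string search."""
    start = doc.html.find("<title>")
    end = doc.html.find("</title>")
    title = doc.html[start + len("<title>"):end] if 0 <= start < end else ""
    doc.fields["title"] = title.strip()
    return doc
```

Usage: `run_pipeline(docs, [https_only], title_parser, [sink.append])` with `sink = []` collects the parsed documents in memory; a database or file writer would take the place of `sink.append`.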

HTML parsers • BeautifulSoup & friends • lxml • html5lib • Gumbo!

https://github.com/google/gumbo-parser

Gumbocy • Use Cython instead of ctypes • Smaller API • Tree traversal on the Cython side with basic boilerplate/visibility support https://github.com/commonsearch/gumbocy
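To show what tree traversal with basic visibility support means in practice, here is a minimal sketch that swaps in Python's standard `html.parser` instead of Gumbo or Gumbocy (whose API is not shown on the slide). The `HIDDEN_TAGS` set and the `visible_text` helper are assumptions for illustration:

```python
from html.parser import HTMLParser

# Elements whose text content is never rendered to the user.
HIDDEN_TAGS = {"script", "style", "head", "noscript", "template"}


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes that are not inside a hidden element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._hidden_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the visible text of an HTML document, space-separated."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

This ignores CSS-based hiding and boilerplate detection (navigation, footers), and `html.parser` is not a full HTML5 parser, which is one reason to prefer a spec-compliant parser such as Gumbo at crawl scale.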

https://github.com/commonsearch/urlparse4

Database(s)

http://lucene.apache.org/

Ranker

Ranking formula • rank = f( static_score , dynamic_score( query ) ) • static_score: Alexa, DMOZ, Blacklists, PageRank, ... • dynamic_score: Elasticsearch & Lucene, TF-IDF, BM25, ...
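A minimal sketch of the formula above. The slide does not say what `f` is, so the linear blend and the `static_weight` parameter below are assumptions; the dynamic part uses a textbook BM25 with commonly used default values for `k1` and `b`, written out in plain Python rather than delegated to Elasticsearch.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query terms with BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))  # count each term once per document

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(static_score, dynamic_score, static_weight=0.3):
    """Assumed form of f: a weighted sum of static and dynamic scores."""
    return static_weight * static_score + (1 - static_weight) * dynamic_score
```

The static score is query-independent and can be computed once at index time, while the dynamic score has to be computed per query; that split is what makes the formula cheap to evaluate at search time.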

https://about.commonsearch.org/developer/get-started

Today @ 4:30pm ;-)

Searcher / Frontend

Specs • Send user query to databases • Search-as-you-type • HTML & JSON endpoints • High performance
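Search-as-you-type needs fast prefix lookups. One simple way to get them, shown here only as a hedged sketch and not as how cosr-front works, is binary search over a sorted term list; the `suggest` helper is hypothetical:

```python
import bisect


def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms starting with `prefix`.

    `sorted_terms` must be sorted; the first candidate is found by
    binary search, then matches are read off sequentially.
    """
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(term)
    return matches
```

Usage: `suggest(sorted(["python", "pytest", "pypi", "java"]), "py")` returns the three terms beginning with "py" in sorted order.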

https://github.com/commonsearch/cosr-front

Crawler • Parser • Index • Ranker • Searcher • Source: The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998) http://infolab.stanford.edu/~backrub/google.html

Challenges

Funding / Scale • Frugalism • Caching • In-kind services • Individual donations / Foundation grants • General economic incentives

Spam • Email spam • Wikipedia vandalism • Algorithm complexity & scale • Given enough eyeballs, all spam is shallow?

Relevance • Exhaustivity • Rescoring • Evaluation • More at 4:30pm ;-)

More search dimensions • Realtime search • Local search • Universal search

Semantic search • Wikidata • YAGO • Conversational / Voice search

Outreach • Easy onboarding & docs • Making people care / believe

Opportunities

Decentralization • YaCy • Extremely high technical & social cost! • Transparency?

Research • More people should know how to build search engines • Spam, Relevance, Large-scale data processing • We need more open datasets!

https://about.commonsearch.org/blog/

Make the Web a better place! • SEO • Transparency • Influence of money • Public service

Questions? https://about.commonsearch.org/contributing https://github.com/commonsearch contact@commonsearch.org slack.commonsearch.org
