The original vision of Nutch, 14 years later: Building an open source search engine


slide-1
SLIDE 1

The original vision of Nutch, 14 years later: Building an open source search engine Apache Big Data Europe 2016

sylvain@sylvainzimmer.com @sylvinus

slide-2
SLIDE 2

/usr/bin/whoami

  • Jamendo (Founder & CTO, 2004-2011)
  • TEDxParis (Co-founder, 2009-2012)
  • dotConferences (Founder, 2012-)
  • Pricing Assistant (Co-founder & CTO, 2012-)
slide-3
SLIDE 3

"The original motivation for the Nutch project was to provide a transparent alternative to the growing power of a handful of private search services over most users’ view of the Web. However, as Nutch has been adopted with greater enthusiasm by smaller organizations, the Nutch Organization has de-emphasized operating a multi-billion-page index in the public interest."

CommerceNet Labs Technical Report, Nov 2004

slide-4
SLIDE 4
slide-5
SLIDE 5

again?

slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8
slide-9
SLIDE 9
slide-10
SLIDE 10

transparency, reproducibility

slide-11
SLIDE 11
slide-12
SLIDE 12

https://uidemo.commonsearch.org

slide-13
SLIDE 13

https://explain.commonsearch.org/?q=python&g=en

slide-14
SLIDE 14

Agenda

  • Values & tech choices
  • Search engine components
  • Challenges
  • Opportunities
slide-15
SLIDE 15

Values & tech choices

slide-16
SLIDE 16
slide-17
SLIDE 17

Radical transparency

  • Open source (Apache License v2)
  • Open data
  • (Governance)
slide-18
SLIDE 18

Privacy

  • Results can be tailored by language/country, but NOT by user/cookie/sessionid

  • \o/ Cache everything!
  • Tor service: http://comsearchl2zlnre.onion
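
Because results depend only on the query, language and country, every user sending the same request can share one cache entry. A minimal sketch of such a cache key, assuming a hypothetical `cache_key` helper and an arbitrary normalization (lowercasing, whitespace collapsing); neither comes from the Common Search code base:

```python
import hashlib


def cache_key(query: str, language: str, country: str) -> str:
    """Build a cache key from the only inputs allowed to shape results.

    Hypothetical helper: no user, cookie or session identifier is accepted,
    so none of them can influence what is served from the cache.
    """
    normalized = "\x1f".join([
        " ".join(query.lower().split()),  # collapse case and whitespace
        language.lower(),
        country.lower(),
    ])
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two different users sending equivalent requests share one cache entry.
key_a = cache_key("Python  tutorial", "en", "US")
key_b = cache_key("python tutorial", "EN", "us")
# A different language produces a different entry.
key_c = cache_key("python tutorial", "fr", "FR")
```

Keeping the key free of per-user inputs is what makes "cache everything" possible: the cache hit rate depends only on how often a query repeats, not on who sends it.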
slide-19
SLIDE 19

Participation & Pragmatism

  • Use high-level languages as much as possible (Python, Go)

  • Embrace active communities (Spark, Elasticsearch)
  • Use mainstream participation platforms, even if they are nonfree (GitHub, Slack)

slide-20
SLIDE 20

Search engines

slide-21
SLIDE 21

http://infolab.stanford.edu/~backrub/google.html

The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)

Crawler, Indexer, Database, Searcher, Ranker

slide-22
SLIDE 22

Crawler

slide-23
SLIDE 23

http://commoncrawl.org

slide-24
SLIDE 24

Today at 3:30pm!

slide-25
SLIDE 25

http://scrapy.org

slide-26
SLIDE 26

http://github.com/cocrawler/cocrawler

slide-27
SLIDE 27

Indexer

slide-28
SLIDE 28

Specs

  • HTML parsing & analysis
  • Tokenization / NLP
  • Static rankings
  • Language detection
  • I/O from crawls to databases
slide-29
SLIDE 29
slide-30
SLIDE 30

Common Search Pipeline

  • Doc sources: Common Crawl, WARC files, URLs, ...
  • Filter plugins
  • Document parsing
  • Output plugins
  • Data output: database, file, HDFS, S3, ...
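
The stages above can be sketched as a small plugin loop. This is an illustration only: `run_pipeline`, the dict-shaped documents and the plugin signatures are assumptions made for the example, not the actual Common Search plugin API.

```python
from typing import Callable, Iterable, List

Document = dict  # hypothetical shape: {"url": ..., "html": ...}


def run_pipeline(
    source: Iterable[Document],
    filters: List[Callable[[Document], bool]],
    parse: Callable[[Document], dict],
    outputs: List[Callable[[dict], None]],
) -> int:
    """Stream documents through filter, parse and output stages.

    Returns the number of documents that reached the output plugins.
    """
    emitted = 0
    for doc in source:
        # A document is dropped as soon as one filter plugin rejects it.
        if not all(keep(doc) for keep in filters):
            continue
        parsed = parse(doc)
        # Every output plugin receives every parsed document.
        for emit in outputs:
            emit(parsed)
        emitted += 1
    return emitted


# Toy stand-ins for a document source, a filter, a parser and an output.
docs = [
    {"url": "http://example.com/", "html": "<p>Hello</p>"},
    {"url": "http://example.com/private", "html": "<p>Secret</p>"},
]


def not_private(doc: Document) -> bool:
    return "/private" not in doc["url"]


def parse_length(doc: Document) -> dict:
    return {"url": doc["url"], "length": len(doc["html"])}


collected: List[dict] = []
count = run_pipeline(docs, [not_private], parse_length, [collected.append])
```

Since sources and outputs are plain iterables and callables in this sketch, swapping a WARC reader for a URL list, or a file writer for a database client, would not change the loop itself.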

slide-31
SLIDE 31

HTML parsers

  • BeautifulSoup & friends
  • lxml
  • html5lib
  • Gumbo!
slide-32
SLIDE 32

https://github.com/google/gumbo-parser

slide-33
SLIDE 33

Gumbocy

  • Use Cython instead of ctypes
  • Smaller API
  • Tree traversal on the Cython side with basic boilerplate/visibility support

https://github.com/commonsearch/gumbocy
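
To illustrate what "visibility support" during tree traversal means, here is a minimal sketch that uses Python's standard `html.parser` as a stand-in. It is not the Gumbocy API, and the set of hidden tags is an assumption chosen for the example.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping elements that never render as text."""

    # Assumed list of non-visible containers; a real indexer would need more.
    HIDDEN_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep text only when no hidden ancestor is currently open.
        if self._hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body><h1>Hello</h1><script>var x = 1;</script>"
    "<style>p{color:red}</style><p>World &amp; all</p></body></html>"
)
text = visible_text(page)
```

A pure-Python traversal like this one is slow at crawl scale, which is consistent with the slide's point about doing the traversal on the Cython side.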

slide-34
SLIDE 34

https://github.com/commonsearch/urlparse4

slide-35
SLIDE 35

Database(s)

slide-36
SLIDE 36
slide-37
SLIDE 37

http://lucene.apache.org/

slide-38
SLIDE 38
slide-39
SLIDE 39

Ranker

slide-40
SLIDE 40

Ranking formula

rank = f( static_score , dynamic_score( query ) )

  • static_score: Alexa, DMOZ, blacklists, PageRank, ...
  • dynamic_score: Elasticsearch & Lucene (TF-IDF, BM25, ...)
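
The slide leaves `f` unspecified. A minimal sketch of the formula, assuming a pure-Python BM25 for the dynamic score (in practice the slide attributes this part to Elasticsearch & Lucene) and a hypothetical multiplicative blend for `f`; the `weight` parameter, the toy documents and the static scores are all made up for illustration:

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(static_score, dynamic_score, weight=0.5):
    """Hypothetical f: the static score boosts the query-dependent score."""
    return dynamic_score * (1 + weight * static_score)


docs = [
    ["python", "tutorial"],
    ["python", "snake", "facts"],
    ["cooking", "recipes"],
]
static_scores = [0.9, 0.2, 0.5]  # assumed to be precomputed, in [0, 1]

dynamic = bm25_scores(["python"], docs)
ranked = sorted(
    range(len(docs)),
    key=lambda i: rank(static_scores[i], dynamic[i]),
    reverse=True,
)
```

With a multiplicative blend, a document that does not match the query keeps a rank of zero however high its static score is; an additive blend would behave differently, so the choice of `f` is itself a relevance decision.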

slide-41
SLIDE 41
slide-42
SLIDE 42

https://about.commonsearch.org/developer/get-started

slide-43
SLIDE 43

Today @ 4:30pm ;-)

slide-44
SLIDE 44

Searcher / Frontend

slide-45
SLIDE 45

Specs

  • Send user query to databases
  • Search-as-you-type
  • HTML & JSON endpoints
  • High performance
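
Search-as-you-type can be approximated with prefix lookups over a sorted list of terms. The sketch below is an assumption for illustration only (a hypothetical `complete` helper using the standard `bisect` module), not how the Common Search frontend implements it:

```python
import bisect


def complete(prefix, sorted_terms, limit=5):
    """Return up to `limit` terms starting with `prefix` from a sorted list."""
    prefix = prefix.lower()
    # Binary search for the first term that is >= prefix.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:start + limit]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match either
        matches.append(term)
    return matches


terms = sorted(["python", "pypi", "pytest", "ruby", "rust", "pyramid"])
suggestions_py = complete("py", terms)
suggestions_pyt = complete("pyt", terms)
suggestions_go = complete("go", terms)
```

Each keystroke costs one binary search plus at most `limit` comparisons, which keeps latency low enough to answer while the user is still typing.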
slide-46
SLIDE 46
slide-47
SLIDE 47

https://github.com/commonsearch/cosr-front

slide-48
SLIDE 48

http://infolab.stanford.edu/~backrub/google.html

The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)

Crawler, Parser, Index, Searcher, Ranker

slide-49
SLIDE 49

Challenges

slide-50
SLIDE 50

Funding / Scale

  • Frugalism
  • Caching
  • In-kind services
  • Individual donations / Foundation grants
  • General economic incentives
slide-51
SLIDE 51

Spam

  • Email spam
  • Wikipedia vandalism
  • Algorithm complexity & scale
  • Given enough eyeballs, all spam is shallow?
slide-52
SLIDE 52

Relevance

  • Exhaustivity
  • Rescoring
  • Evaluation
  • More at 4:30pm ;-)
slide-53
SLIDE 53

More search dimensions

  • Realtime search
  • Local search
  • Universal search
slide-54
SLIDE 54

Semantic search

  • Wikidata
  • YAGO
  • Conversational / Voice search
slide-55
SLIDE 55

Outreach

  • Easy onboarding & docs
  • Making people care / believe
slide-56
SLIDE 56

Opportunities

slide-57
SLIDE 57

Decentralization

  • YaCy
  • Extremely high technical & social cost!
  • Transparency?
slide-58
SLIDE 58

Research

  • More people should know how to build search engines

  • Spam, Relevance, Large-scale data processing
  • We need more open datasets!
slide-59
SLIDE 59

https://about.commonsearch.org/blog/

slide-60
SLIDE 60

Make the Web a better place!

  • SEO
  • Transparency
  • Influence of money
  • Public service
slide-61
SLIDE 61

Questions?

  • https://about.commonsearch.org/contributing
  • https://github.com/commonsearch
  • contact@commonsearch.org
  • slack.commonsearch.org