SLIDE 1 Ranking the Web with Spark
Apache Big Data Europe 2016
sylvain@sylvainzimmer.com @sylvinus
SLIDE 2 /usr/bin/whoami
- Jamendo (Founder & CTO, 2004-2011)
- TEDxParis (Co-founder, 2009-2012)
- dotConferences (Founder, 2012-)
- Pricing Assistant (Co-founder & CTO, 2012-)
SLIDE 3
transparency reproducibility
SLIDE 4
SLIDE 5 https://uidemo.commonsearch.org
SLIDE 6 https://explain.commonsearch.org/?q=python&g=en
SLIDE 7
Ranking
SLIDE 8 Disclaimer: IANASRE
(I Am Not A Search Relevance Engineer)
SLIDE 9
What's in a score
score = fn( doc, query, language, user, time )
SLIDE 10
What's in a score
score = fn( doc, query )
SLIDE 11
What's in a score
score = fn( static_score, dynamic_score ( query ))
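The decomposition above can be sketched as a function that blends a query-independent score with a query-dependent one. This is a minimal illustration, assuming a linear blend; the names `final_score`, `rank` and `static_weight` are hypothetical, and the slides do not say which combining function Common Search uses.

```python
from typing import Dict, List


def final_score(static_score: float, dynamic_score: float,
                static_weight: float = 0.3) -> float:
    """Blend the two scores linearly (an assumed form of fn, for illustration)."""
    if not 0.0 <= static_weight <= 1.0:
        raise ValueError("static_weight must be between 0 and 1")
    return static_weight * static_score + (1.0 - static_weight) * dynamic_score


def rank(static_scores: Dict[str, float], dynamic_scores: Dict[str, float],
         static_weight: float = 0.3) -> List[str]:
    """Return doc ids sorted by blended score, best first.

    Only docs with a dynamic score (docs that matched the query) are ranked.
    """
    blended = {
        doc: final_score(static_scores.get(doc, 0.0), dyn, static_weight)
        for doc, dyn in dynamic_scores.items()
    }
    return sorted(blended, key=blended.get, reverse=True)
```

The static part can be computed offline once per document, while the dynamic part has to be computed at query time.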
SLIDE 12
Static score
SLIDE 13 Static features
- Scopes:
  - Page: URL depth, markup stats, ...
  - Domain: Age, page count, blacklists, ...
  - WebGraph: PageRank, ...
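As an illustration of a page-scope static feature, URL depth can be computed from the URL alone. This is a sketch with hypothetical names (`url_depth`, `page_features`); it does not show how Common Search defines or weights its features.

```python
from typing import Dict
from urllib.parse import urlsplit


def url_depth(url: str) -> int:
    """Count non-empty path segments; a homepage has depth 0."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


def page_features(url: str) -> Dict[str, float]:
    """Collect a few query-independent features of a page (illustrative set)."""
    parts = urlsplit(url)
    return {
        "url_depth": float(url_depth(url)),
        "has_query_string": 1.0 if parts.query else 0.0,
        "url_length": float(len(url)),
    }
```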
SLIDE 14 http://infolab.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Crawler, Indexer, Database, Searcher, Ranker
SLIDE 15
Dynamic score
SLIDE 16 Dynamic features
- Text match: TF-IDF, BM25, proximity, topic, ...
- Query-level: number of words, popularity, ...
- Usage: clicks, dwell time, reformulations, ...
- Time
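Of the text-match features listed above, BM25 is simple to write down. The sketch below scores tokenized documents against a query using the common Okapi BM25 formula with the Lucene-style IDF variant; the function name and the defaults `k1=1.2`, `b=0.75` are conventional choices for illustration, not values taken from the talk.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document, in the order of docs."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    # Guard against a corpus made only of empty documents.
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more matches.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```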
SLIDE 17
Scoring function
SLIDE 18
Offline: Data sources (Common Crawl, Alexa top 1M, ...) -> Indexer (Python, Spark) -> words, static score -> Database (Elasticsearch)
Online: Users -> query -> Searcher (Go) -> Database (Elasticsearch) -> top 10 docs, final scores
SLIDE 19
SLIDE 20 https://explain.commonsearch.org/?q=python&g=en
SLIDE 21 Issues with this architecture
- Static & dynamic scoring are in different codebases
- No control over result diversity
- Hard to optimize
- Very dependent on Elasticsearch
SLIDE 22
Rescoring
SLIDE 23
Indexer -> words, static score, features -> Database
Users -> query -> Searcher -> Database -> top 1k docs, features -> Rescorer -> final 10 docs
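The rescoring step can be sketched as a second pass over the candidates returned by the searcher. This is a minimal illustration: `Candidate`, `rescore` and the feature names in the usage note are hypothetical, and the real rescorer's interface is not shown in the slides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Candidate:
    """A document returned by the first-pass search, with its features."""
    url: str
    features: Dict[str, float] = field(default_factory=dict)


def rescore(candidates: List[Candidate],
            score_fn: Callable[[Dict[str, float]], float],
            k: int = 10) -> List[Candidate]:
    """Re-rank the first-pass candidates with score_fn and keep the best k."""
    ranked = sorted(candidates, key=lambda c: score_fn(c.features), reverse=True)
    return ranked[:k]
```

For example, `rescore(top_1k, lambda f: 0.5 * f["bm25"] + 0.5 * f["pagerank"])` would keep the 10 best documents under an assumed linear scoring function.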
SLIDE 24 Issues with rescoring
- Latency
- Pagination
- Harder to explain
SLIDE 25
Learning to rank
SLIDE 26 LTR Model
- Features
- Training dataset
- Evaluation: NDCG, ERR, ...
- Algorithms: AdaRank, ListNet, LambdaMART, ...
- Learning with Spark!
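Of the evaluation metrics listed above, NDCG is compact enough to show in full. The sketch below uses the exponential-gain form of DCG and computes the ideal ranking from the same list of relevance labels, which is a common simplification; the function names are illustrative.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k results."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` holds the graded relevance labels of the results in the order the ranker returned them, so a perfectly ordered list scores 1.0.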
SLIDE 27 The right questions
- What do users expect?
- What features?
- How to evaluate and fine-tune in the real world?
SLIDE 28
PageRank with Spark
SLIDE 29
SLIDE 30
http://commoncrawl.org
SLIDE 31 https://github.com/commonsearch/cosr-back
SLIDE 32 Common Search Pipeline
Doc sources (Common Crawl, WARC files, URLs, ...) -> Filter plugins -> Document parsing -> Output plugins -> Data output (Database, file, HDFS, S3, ...)
SLIDE 33
Most popular Wikipedia pages
SLIDE 34
Dumping the web graph
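One way to picture this step is as extracting link edges from each parsed page. The sketch below assumes a domain-level graph and uses only the Python standard library; `domain_edges` is a hypothetical helper, not the cosr-back implementation.

```python
from html.parser import HTMLParser
from typing import List, Set, Tuple
from urllib.parse import urljoin, urlsplit


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def domain_edges(page_url: str, html: str) -> Set[Tuple[str, str]]:
    """Return (source_host, target_host) pairs for links leaving the page's host."""
    collector = _LinkCollector()
    collector.feed(html)
    source = urlsplit(page_url).hostname
    edges = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL before reading the host.
        target = urlsplit(urljoin(page_url, href)).hostname
        if source and target and target != source:
            edges.add((source, target))
    return edges
```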
SLIDE 35
Naive pyspark PageRank
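The code from this slide is not reproduced here. As a stand-in, the sketch below runs the same kind of power iteration in plain Python on an in-memory graph: each page splits its rank among its outgoing links, contributions are summed per target, and a damping factor is applied. The damping value 0.85, the uniform redistribution of rank from pages without outgoing links, and the function name are assumptions of this sketch; a Spark job would distribute the per-page contribution and per-target sum steps across the cluster.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 20) -> Dict[str, float]:
    """Compute PageRank by power iteration; ranks sum to 1."""
    # Collect every node, including pages that only appear as link targets.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}

    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        contribs = {node: 0.0 for node in nodes}
        dangling = 0.0
        # "Map" step: each page sends rank / out_degree to each target.
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = ranks[node] / len(targets)
                for target in targets:
                    contribs[target] += share
            else:
                # Pages without outgoing links spread their rank uniformly.
                dangling += ranks[node]
        # "Reduce" step: sum contributions per page and apply damping.
        ranks = {
            node: (1 - damping) / n + damping * (contribs[node] + dangling / n)
            for node in nodes
        }
    return ranks
```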
SLIDE 36
GraphFrames
SLIDE 37
SparkSQL PageRank
SLIDE 38 SparkSQL PageRank
https://github.com/commonsearch/cosr-back/blob/master/spark/jobs/pagerank.py
SLIDE 39 Tests
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_pagerank.py
SLIDE 40 https://about.commonsearch.org/developer/get-started
SLIDE 41
SLIDE 42
Top 10
SLIDE 43
SLIDE 44
Spam
SLIDE 45
SLIDE 46 Spamdexing
- Keyword stuffing, hidden text
- Scraper sites, Mirrors
- Link farms
- Splogs, Comment spam
- Domaining
- Cloaking
- Bombing
SLIDE 47
Questions?
https://about.commonsearch.org/contributing
https://github.com/commonsearch
contact@commonsearch.org
slack.commonsearch.org