Ranking the Web with Spark, Apache Big Data Europe 2016

Feb 05, 2024

Ranking the Web with Spark
Apache Big Data Europe 2016
sylvain@sylvainzimmer.com
@sylvinus
/usr/bin/whoami
• Jamendo (Founder & CTO, 2004-2011)
• TEDxParis (Co-founder, 2009-2012)
• dotConferences (Founder, 2012-)
• Pricing Assistant (Co-founder & CTO, 2012-)
transparency reproducibility
https://uidemo.commonsearch.org
https://explain.commonsearch.org/?q=python&g=en
Ranking
Disclaimer: IANASRE (I Am Not A Search Relevance Engineer)
What's in a score
score = fn( doc, query, language, user, time )
score = fn( doc, query )
score = fn( static_score, dynamic_score( query ) )
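As a rough illustration of the last form, the sketch below blends a query-independent static score with a query-dependent dynamic score. The linear blend, the `static_weight` value, and the sample documents are all assumptions made for this example. The slides do not say how Common Search combines the two.

```python
def final_score(static_score: float, dynamic_score: float,
                static_weight: float = 0.3) -> float:
    """Blend a query-independent score with a query-dependent one.

    The linear blend and the 0.3 weight are illustrative assumptions.
    """
    return static_weight * static_score + (1.0 - static_weight) * dynamic_score


# Hypothetical documents: (url, static_score, dynamic_score for one query).
docs = [
    ("a.example/python", 0.9, 0.2),
    ("b.example/python-tutorial", 0.4, 0.8),
]

# Sort by the blended score, best first.
ranked = sorted(docs, key=lambda d: final_score(d[1], d[2]), reverse=True)
```

With these made-up numbers, the document with the stronger text match ranks first even though its static score is lower.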
Static score
Static features
• Scopes:
  • Page: URL depth, markup stats, ...
  • Domain: Age, page count, blacklists, ...
  • WebGraph: PageRank, ...
Components: Crawler, Indexer, Database, Ranker, Searcher
Source: The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
http://infolab.stanford.edu/~backrub/google.html
Dynamic score
Dynamic features
• Text match: TF-IDF, BM25, proximity, topic, ...
• Query-level: number of words, popularity, ...
• Usage: clicks, dwell time, reformulations, ...
• Time
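To make the text-match feature concrete, here is a minimal sketch of Okapi BM25 in plain Python. It uses a common smoothed IDF variant with the usual defaults `k1=1.2` and `b=0.75`. The toy corpus and the function name `bm25_scores` are assumptions for illustration, not taken from the talk or from Elasticsearch.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF variant, which is never negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (assumed for illustration), already tokenized on whitespace.
corpus = [
    "python is a programming language".split(),
    "the python is a large snake".split(),
    "spark runs python and scala jobs".split(),
]
scores = bm25_scores(["python", "language"], corpus)
```

The first document matches both query terms and is the shortest, so it scores highest here.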
Scoring function
• Data sources: Common Crawl, Alexa top 1M, ...
• Offline: Indexer (Python, Spark) writes words and static score to the Database (Elasticsearch)
• Online: Searcher (Go) sends the query to the Database and receives the top 10 docs with final scores
• Users
https://explain.commonsearch.org/?q=python&g=en
Issues with this architecture
• Static & dynamic scoring are in different codebases
• No control over result diversity
• Hard to optimize
• Very dependent on Elasticsearch
Rescoring
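• Indexer writes words, static score, and features to the Database
• Searcher sends the query; the Database returns the top 1k docs with their features
• Rescorer reduces these to the final 10 docs for the Searcher
• Users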
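The rescoring step can be sketched as a function that takes the candidates returned by the database and keeps the best few under a second scoring function. Everything here is hypothetical: `rescore`, `example_fn`, the feature names, and the synthetic candidates are assumptions for illustration, not the Common Search implementation.

```python
import heapq


def rescore(candidates, rescoring_fn, k=10):
    """Re-rank database candidates and keep the best k.

    candidates: iterable of (doc_id, features) pairs, e.g. the top 1k docs.
    rescoring_fn: maps a features dict to a final score (hypothetical).
    Ties are broken by doc_id because tuples compare element-wise.
    """
    scored = ((rescoring_fn(features), doc_id) for doc_id, features in candidates)
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


def example_fn(features):
    # Assumed equal-weight blend of two made-up features.
    return 0.5 * features["static_score"] + 0.5 * features["text_match"]


# 1000 synthetic candidates standing in for the database's top 1k docs.
candidates = [
    (f"doc{i}", {"static_score": (i % 7) / 7, "text_match": (i % 11) / 11})
    for i in range(1000)
]
top10 = rescore(candidates, example_fn)
```

Using `heapq.nlargest` avoids sorting all 1k candidates when only 10 are needed, though at this size the difference is small.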
Issues with rescoring
• Latency
• Pagination
• Harder to explain
Learning to rank
LTR Model
• Features
• Training dataset
• Evaluation: NDCG, ERR, ...
• Algorithms: AdaRank, ListNet, LambdaMART, ...
• Learning with Spark!
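Of the evaluation metrics listed, NDCG is simple enough to sketch directly. The version below uses the common exponential gain `2^rel - 1` with a `log2` position discount. The graded relevance labels are made up for the example.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k results."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance labels (0 = bad, 3 = perfect), in ranked order.
good_ranking = ndcg([3, 2, 1, 0])
bad_ranking = ndcg([0, 1, 2, 3])
```

A ranking that is already in ideal order scores 1.0, and any ranking that pushes relevant documents down scores lower.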
The right questions
• What do users expect?
• What features?
• How to evaluate and fine-tune in the real world?
PageRank with Spark
http://commoncrawl.org
https://github.com/commonsearch/cosr-back
Common Search Pipeline
• Doc sources: Common Crawl, WARC files, URLs, ...
• Filter plugins
• Document parsing
• Output plugins
• Data output: Database, file, HDFS, S3, ...
Most popular Wikipedia pages
Dumping the web graph
Naive pyspark PageRank
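The code from this slide is not in the transcript. As a stand-in, here is a minimal single-machine sketch in plain Python that follows the same shape a naive pyspark version usually has: each page sends its rank divided by its outdegree to its targets, and contributions are summed per target. The damping factor of 0.85, the iteration count, and the toy graph are assumptions, and dangling pages are deliberately not handled.

```python
from collections import defaultdict


def pagerank(edges, iterations=30, damping=0.85):
    """Iterative PageRank over a list of (source, target) links.

    Simplification: every page in the toy graph has at least one outlink,
    so dangling-node handling is left out.
    """
    outlinks = defaultdict(list)
    pages = set()
    for src, dst in edges:
        outlinks[src].append(dst)
        pages.update((src, dst))

    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # "Map": each page sends rank / outdegree to every page it links to.
        contribs = defaultdict(float)
        for src, targets in outlinks.items():
            share = ranks[src] / len(targets)
            for dst in targets:
                # "Reduce": sum the contributions arriving at each target.
                contribs[dst] += share
        ranks = {
            page: (1 - damping) / n + damping * contribs[page]
            for page in pages
        }
    return ranks


# Hypothetical four-page graph; every page has at least one outlink.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
```

In pyspark the two commented steps would typically map onto a join with the link table followed by a per-key sum, which is where the shuffle cost of the naive approach comes from.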
GraphFrames
SparkSQL PageRank
SparkSQL PageRank https://github.com/commonsearch/cosr-back/blob/master/spark/jobs/pagerank.py
Tests
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_pagerank.py
https://about.commonsearch.org/developer/get-started
Top 10
Spam
Spamdexing
• Keyword stuffing, hidden text
• Scraper sites, Mirrors
• Link farms
• Splogs, Comment spam
• Domaining
• Cloaking
• Bombing
Questions?
https://about.commonsearch.org/contributing
https://github.com/commonsearch
contact@commonsearch.org
slack.commonsearch.org
