Bing Search: An Engine in the Clouds - PowerPoint PPT Presentation


  1. Bing Search: An Engine in the Clouds

  2. Locations: Munich (Rablstr., Gewürzmühlstr.) and London (Cardinal Place) • Founded in January 2009 • Offices in London and Munich • More than 70 employees in total • Close collaborations with Microsoft Research in Redmond and Cambridge • Collaborations with various MSFT Product Groups (incl. Office, Skype, Windows, Xbox etc.) • STC-E Munich: web ranking • Other STCs in Beijing, Hyderabad, and Silicon Valley

  3. Applied Research branch of MSR working on: • IT-Security • Data-privacy • Mobility • Mobile Solutions • Web-Services. Enabling acquisition of data on all classes of devices/sensors, and providing industry leading analytics and processing capabilities from the edge to cloud, leveraging: • Deep technical expertise (8 years of experience in platform and small devices) • Relationships with engineering groups (Windows, Office 365, Windows Azure) • Strategic customer scenarios (Telemetry, Stream Analytics, Early fault detection, Predictive maintenance) http://research.microsoft.com/en-us/labs/atle/default.aspx

  5. • Content Gathering • Crawling • Indexing • Matching Query Words to Content • Query modifications needed? • Ranking Results • Features used

  6. • Comprehensiveness: Serving & discovery of hundreds of billions of documents • Frequency: Optimized towards freshness & politeness • Balance: Depth vs. breadth in the processing of document content
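
The "politeness" constraint above can be illustrated with a minimal sketch. It assumes a fixed per-host delay, and the names `schedule_crawl` and `host_delay` are hypothetical; the slides do not describe how the Bing crawler actually schedules fetches.

```python
from urllib.parse import urlparse

def schedule_crawl(urls, host_delay=2.0):
    """Assign each URL the earliest fetch time that keeps two requests
    to the same host at least `host_delay` seconds apart."""
    next_free = {}  # host -> earliest time this host may be contacted again
    plan = []
    for url in urls:
        host = urlparse(url).netloc
        fetch_time = next_free.get(host, 0.0)
        next_free[host] = fetch_time + host_delay
        plan.append((fetch_time, url))
    # Fetch in time order; different hosts can be interleaved freely.
    return sorted(plan)

urls = ["http://a.example/1", "http://a.example/2", "http://b.example/1"]
plan = schedule_crawl(urls)
```

Here the second URL on `a.example` is pushed back by the delay, while the URL on `b.example` can be fetched immediately. A freshness-aware scheduler would additionally revisit frequently changing pages sooner, which this sketch leaves out.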

  7. • Result counts are misleading • Intelligent document selection matters

  8. • Balance: Depth vs. breadth in the processing of document content • Frequency: Optimized towards freshness & politeness

  9. • Most components, such as crawling and document processing, are easily parallelizable • Developed on top of a Private Cloud • Leveraging Cosmos as a highly scalable storage and computing system • Running under a highly performant Datacenter Management System

  10. • Petabyte Store and Computation System: about 62 physical petabytes stored (~275 logical petabytes stored in 2011); tens of thousands of computers across many datacenters • Massively parallel processing based on Dryad: similar to MapReduce but can represent arbitrary DAGs of computation; automatic computation placement with data • SCOPE (Structured Computation Optimized for Parallel Execution): SQL-like language with set-oriented record and column manipulation; automatically compiled and optimized for execution over Dryad • Management of hundreds of “Virtual Clusters” for computation allocation. Source: http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf
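
To make the "arbitrary DAGs of computation" point concrete, here is a small single-machine sketch in Python. It is not Dryad or SCOPE code: `run_dag`, the stage names, and the sample queries are all hypothetical, and the standard-library `graphlib` module only stands in for a real distributed scheduler.

```python
from graphlib import TopologicalSorter

def run_dag(stages, deps):
    """Run each stage after all of its upstream stages.

    stages: name -> function taking a list with the outputs of its inputs
    deps:   name -> list of upstream stage names (defines argument order)
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(name, [])]
        results[name] = stages[name](inputs)
    return results

# A diamond-shaped job: one source feeds two branches that join again.
stages = {
    "read": lambda _: ["bing search", "bing maps", "search engine"],
    "words": lambda ins: [w for q in ins[0] for w in q.split()],
    "lengths": lambda ins: [len(q) for q in ins[0]],
    "report": lambda ins: (len(set(ins[0])), max(ins[1])),
}
deps = {
    "read": [],
    "words": ["read"],
    "lengths": ["read"],
    "report": ["words", "lengths"],
}
out = run_dag(stages, deps)
```

The `report` stage joins two independent branches, a shape that a single map-then-reduce pass does not express directly and that is typically handled by chaining several jobs.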

  11. • Linguistic alterations: stemming, morphological variations, plurals • Spell corrections: approximately one third of queries contain misspellings • Synonyms: used sparingly, only in high confidence situations

  12. • Web represents Knowledge: approximately one third of queries contain misspellings; users often correct themselves; patterns of common (and not so common) misspellings are discovered • Linguistic Depth: less useful; rules like “i” before “e” except after “c” do not scale; instead develop computer models of correct spelling; statistics & usage data help
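
As an illustration of preferring statistics over hand-written rules, the following is a minimal frequency-based corrector in the style of well-known textbook examples. It is a sketch under stated assumptions, not Bing's speller: the word counts are invented, and `edits1` and `correct` are hypothetical helper names.

```python
import string
from collections import Counter

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return the most frequent known word within one edit, else the input."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word

# Invented usage counts standing in for statistics mined from logs.
counts = Counter({"receive": 50, "recipe": 30, "believe": 20})
```

With these counts, `correct("recieve", counts)` yields `"receive"` and `correct("beleive", counts)` yields `"believe"` without encoding any spelling rule: the observed frequencies decide.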

  13. • Heavy dependence on interaction logs (queries, sessions …): user behavior is king • Leveraging Cosmos for processing billions of entries: deriving query histogram; deriving query reformulation graph • Leveraging Machine Learning for ranking suggestions: training word-level/contextual-level models
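
The two derived structures named above can be sketched on a toy log. This assumes each session is simply an ordered list of query strings; the function names and the sample sessions are hypothetical, and the real pipeline runs over billions of entries on Cosmos rather than in memory.

```python
from collections import Counter

def query_histogram(sessions):
    """Count how often each query string occurs across all sessions."""
    return Counter(q for session in sessions for q in session)

def reformulation_graph(sessions):
    """Count directed edges (query -> next query) within each session."""
    edges = Counter()
    for session in sessions:
        for current, following in zip(session, session[1:]):
            if current != following:  # ignore plain repeats of the same query
                edges[(current, following)] += 1
    return edges

sessions = [
    ["bign", "bing"],
    ["bign", "bing", "bing maps"],
    ["bing"],
]
hist = query_histogram(sessions)
graph = reformulation_graph(sessions)
```

In this toy log the edge `("bign", "bing")` has weight 2, which is the kind of signal that lets a system learn that users who type the first query usually mean the second.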

  14. 1. Create lots of features (attributes) 2. Calculate them per query/document pair 3. Let the computer learn how to rank with millions/billions of examples 4. Rinse & Repeat

  15. • Do the words appear in the title of the document? • Do the words appear in the order specified? • Are the words a substantial part of the title? • Are the words excessively repeated? • Is the title uncommonly long?
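
The questions above translate naturally into per query/document features. The sketch below is one possible encoding, assuming simple lowercase word tokenization; `title_features` and the feature names are hypothetical and not taken from the slides.

```python
import re

def tokenize(text):
    """Lowercase word tokens; punctuation such as dashes is dropped."""
    return re.findall(r"\w+", text.lower())

def title_features(query, title):
    q, t = tokenize(query), tokenize(title)
    return {
        # Do the words appear in the title?
        "in_title": all(w in t for w in q),
        # Do they appear in the order specified (as a contiguous run)?
        "in_order": any(t[i:i + len(q)] == q
                        for i in range(len(t) - len(q) + 1)),
        # Are the words a substantial part of the title?
        "coverage": sum(1 for w in t if w in q) / len(t) if t else 0.0,
        # Are the words excessively repeated?
        "max_repeats": max((t.count(w) for w in q), default=0),
        # Is the title uncommonly long? (raw length; any cutoff is a choice)
        "title_length": len(t),
    }

good = title_features("spaghetti western", "Spaghetti Western - Wikipedia")
recipe = title_features("spaghetti western",
                        "Western Spaghetti Recipe – Taste of Home")
```

For the query "spaghetti western", the Wikipedia title matches in order and the query words make up two thirds of it, while the recipe title contains both words but in the wrong order.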

  16. • Good: Title: Spaghetti Western - Wikipedia • Not so good: Title: Western Spaghetti Recipe – Taste of Home • Not so good: Title: The Spaghetti Western Orchestra Tickets 2013 - The Spaghetti Western Orchestra Concert tour 2013 Tickets

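  17. [Figure: decision tree over title features. Nodes: “Words in Title in Correct Order?”, “Is Title uncommonly long?”, “Are words repeated excessively?”, each with yes/no branches; leaf scores +4, -1, +2; annotations “1000s of other features…” and “1000s of other trees…”]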

  18. • The document with highest cumulative score is ranked highest • Improvements to algorithms & features are made constantly & shipped frequently • Machine Learning allows for scalability across many dimensions
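
Ranking by cumulative score can be sketched as a sum over many small trees. The two trees below, their thresholds, and their leaf values are hypothetical illustrations; in practice they would be learned from training examples, and the feature values are written out by hand so the example stands alone.

```python
def tree_order(f):
    """Hypothetical tree: rewards in-order matches, less so in long titles."""
    if f["in_order"]:
        return 4.0 if f["title_length"] <= 8 else 2.0
    return -1.0

def tree_repeats(f):
    """Hypothetical tree: penalizes excessive repetition of query words."""
    return -2.0 if f["max_repeats"] > 1 else 2.0

TREES = [tree_order, tree_repeats]  # a real ensemble would hold thousands

def score(features):
    """Cumulative score: the sum of every tree's output."""
    return sum(tree(features) for tree in TREES)

def rank(docs):
    """Order document names by descending cumulative score."""
    return sorted(docs, key=lambda name: score(docs[name]), reverse=True)

docs = {
    "wikipedia": {"in_order": True, "title_length": 3, "max_repeats": 1},
    "recipe": {"in_order": False, "title_length": 6, "max_repeats": 1},
    "tickets": {"in_order": True, "title_length": 14, "max_repeats": 2},
}
ranking = rank(docs)
```

With these made-up trees the short, in-order title scores 6.0, the out-of-order title 1.0, and the long repetitive title 0.0. Adding a feature or a tree changes the scores without changing how they are combined.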

  19. • Titles • Images • Document Content • Visual Layout • Links • Co-occurrences of words • Clicks • Freshness • Word Frequencies • Word Proximity … and many more

  20. • Leveraging Hybrid Cloud Computing Platform • Built on top of heterogeneous computing resources: Single box; Cosmos: Map/Reduce; MPI; … • Easy deployment/management of applications (modules) • Powerful graphical layer of abstraction defining the data workflow • Batch mode that scales across dimensions such as markets • Leveraging Machine Learning as a Service

  21. Accelerating feature computation through Field Programmable Gate Arrays. Sources: http://www.altera.com/technology/system-design/articles/2014/cpu-architects.html ; http://www.enterprisetech.com/2014/09/03/microsoft-using-fpgas-speed-bing-search/

  22. • Feature Fundamentals: Which features are “universal”, which are market-specific? • Multiple Queries: When does it make sense to issue multiple (altered) queries? How does one merge the results for the user? • Anchor & Link Signals: Which links and which anchor text are informative? What features need to be leveraged in this context? • Knowledge Modeling: View the task of ranking as a translation from “query language” to “document language”

  23. • Global Ranking: everything mentioned before … for International (no English-US, no CJK); improving, evaluating and shipping rankers in hundreds of markets • Universal Ranker: One Team and One Ranker

  24. http://www.bingiton.com/
