SLIDE 1 USEing Transfer Learning in Retrieval of Statistical Data
July 24, 2019 Anton Firsov, Vladimir Bugay, Anton Karpenko – Knoema Corporation
SLIDE 2
INTRODUCTION
▪ Knoema is a global data aggregator and a search engine for data
▪ Our search operates over 3.2B time series, which are mostly numbers with limited textual metadata
▪ More than 500K analysts and researchers look for data, facts and insights at https://knoema.com every month
HOW MANY PEOPLE LIVE IN PARIS?
HOW MUCH MONEY IS SPENT ON RESEARCH IN USA
WHAT IS CHILD MORTALITY IN UGANDA
SLIDE 3
▪ Narrow domain => fewer users => less data for training models
▪ Very short documents (time series)
▪ Structure in the textual metadata (multiple fields aka dimensions, hierarchies)
▪ Complex queries for which only a collection of related time series can be an answer
SPECIFICS OF OUR DOMAIN
Paris – Population
Uganda – Under-5 mortality rate (per 1,000 live births)
United States – Basic Research Expenditures, Public Research, Million USD PPPs
SLIDE 4
COMPARISON QUERIES
CHINA VS INDIA POPULATION
SLIDE 5
RANKING QUERIES
COUNTRIES BY GDP PER CAPITA
SLIDE 6
INVERTED INDEX + ONTOLOGY + PRE/POST PROCESSING
▪ Requires a domain-specific ontology
▪ Problems with on-site data repositories
▪ A lot of heuristics and parameters => difficult to maintain
DEEP STRUCTURED SEMANTIC MODEL (DSSM) [Huang et al., 2013]
▪ Small amount of click-through data (~100K)
PREVIOUS APPROACHES
SLIDE 7
                            USE                           BERT
Full name                   Universal Sentence Encoder    Bidirectional Encoder Representations from Transformers
Authors                     [Cer et al., 2018]            [Devlin et al., 2018]
Model variation             transformer-based             base, uncased model for English language
Underlying architecture     Transformer                   Transformer
Number of attention layers  6                             12
Number of parameters        ~200M                         ~110M
Embedding size              512                           768
TRANSFER LEARNING
SLIDE 8 ARCHITECTURE
Query (Q) → USE query model ↘
                              Cosine similarity → P(D|Q)
Document (D) → USE document model ↗
SLIDE 9
q = USE_Q(Q), where Q is the query and USE_Q is the USE query model
d = USE_D(D), where D is the document (time series) and USE_D is the USE document model
sim(Q, D) = cosine(q, d) – similarity between a query Q and a document D
To efficiently calculate the probability of document D given query Q we used negative sampling:
D = {D+, D1-, D2-, ..., Dl-}, where D+ is the document that was clicked for query Q and the Dj- are random unclicked documents
P(D+ | Q) = exp(sim(Q, D+)) / Σ_{D'∈D} exp(sim(Q, D'))
loss(Q, D) = -log P(D+ | Q)
MODEL
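A minimal sketch of this scoring scheme in plain NumPy. The query/document vectors are assumed to come from the two fine-tuned encoders; here they are just arrays, and the clicked document is taken to be the first entry of the candidate list:

```python
import numpy as np

def cosine(q, d):
    # Cosine similarity between a query vector and a document vector.
    return q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

def p_clicked(q_vec, doc_vecs):
    """Softmax over similarities: P(D+|Q), with doc_vecs[0] as the
    clicked document and the remaining entries as sampled negatives."""
    sims = np.array([cosine(q_vec, d) for d in doc_vecs])
    e = np.exp(sims - sims.max())  # numerically stable softmax
    return e[0] / e.sum()

def loss(q_vec, doc_vecs):
    # Negative log-likelihood of the clicked document given the query.
    return -np.log(p_clicked(q_vec, doc_vecs))
```

Minimizing this loss pulls the clicked document's embedding toward the query embedding and pushes the sampled negatives away.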
SLIDE 10 IMPLEMENTATION
Click data → BERT/USE fine-tuning → Fine-tuned model
Timeseries + Fine-tuned model → Embeddings calculation → Embeddings
Embeddings → Indexing → Index
Query + Index → Search
SLIDE 11
▪ Training set: ~13K click-through samples
▪ CV set: ~2K click-through samples
▪ Adam optimizer with learning rate 1e-5
▪ Batch size 32 (BERT) and 128 (USE)
▪ 4 negative samples per query
▪ 600 steps
▪ Training time: <5 min on V100
TRAINING
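A quick sanity check on these settings (my arithmetic, not from the slides): 600 steps at the USE batch size of 128 amount to roughly six passes over the 13K-sample training set:

```python
train_samples = 13_000       # click-through samples (from the slide)
steps, batch_use = 600, 128  # USE settings
epochs_use = steps * batch_use / train_samples
print(round(epochs_use, 1))  # → 5.9 passes over the training data
```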
SLIDE 12
USE
- 400M timeseries
- Calculated on V100
- ~8K timeseries per second
- Total time: ~14 hours
- Cost: ~$32
- Total size: ~900 GB
BERT
- 400M timeseries
- Calculated on Google TPUv3
- Total time: ~11 hours
- Cost: ~$90
- Total size: ~1.3 TB
EMBEDDING CALCULATIONS
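The USE figures are self-consistent (a quick check, not from the slides):

```python
timeseries = 400_000_000
rate = 8_000  # timeseries per second on a V100 (USE)
hours = timeseries / rate / 3600
print(round(hours, 1))  # → 13.9, matching the reported ~14 hours
```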
SLIDE 13
▪ Using the FAISS library [Johnson et al., 2017] for approximate nearest neighbor search
▪ IVF index with 2^18 centroids and an HNSW quantizer
▪ Centroids are trained on 25M random vectors (~5h on r5.2xlarge)
▪ Product Quantization with 32 and 16 components for index size reduction
▪ Total time to build the index: ~10h
▪ Index size: ~17 GB for PQ=32 and ~11 GB for PQ=16
INDEXING
SLIDE 14
Source of clicked results    Number of clicked results
USE                          2791
Classic                      2352
Total                        5143
RESULTS
A/B test: mixed equal number of classic and USE results 18% HIGHER CTR
[Bar chart: number of clicked results, USE vs. Classic]
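The 18% figure follows directly from the table above (my arithmetic):

```python
use_clicks, classic_clicks = 2791, 2352  # from the A/B test table
lift = use_clicks / classic_clicks - 1
print(f"{lift:.1%}")  # → 18.7%
```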
SLIDE 15
AUTOMATICALLY DEDUCED SEMANTIC CLOSENESS
▪ Query: us gdp ▪ Result: United States - Gross domestic product, current prices (U.S. dollars)
QUESTIONS IN NATURAL LANGUAGE
▪ Query: how many people live in paris? ▪ Result: Paris - Population
RESULT GENERALIZATION
▪ Query: bmw theft in japan ▪ Result: Japan - Theft of Private Cars - Rate
RESULT ANALYSIS
SLIDE 16 COMPLEX QUERIES PROCESSING
▪ "china vs india population"
▪ "countries ranking by gdp"
▪ "world population density in 2017 on map"
CHATBOT (DIGITAL RESEARCH ASSISTANT)
▪ Need to keep context of the conversation
▪ Difficulties with general questions
RETRIEVAL OF STATISTICAL DATA RELEVANT TO THE TEXT (FACTFINDER)
▪ Multiple vectors per text
▪ Co-reference, ellipsis, anaphora, endophora resolution
SUPPORT OF MULTIPLE LANGUAGES
WHAT’S NEXT
SLIDE 17 Fine-tuning of pretrained deep neural net models allowed us to:
▪ Improve the results of our search engine
▪ Decrease the cost of ontology engineering
▪ Decrease resource costs (memory and CPU)
▪ Continuously, automatically and cost-effectively improve our search engine further using clickstream data
▪ Reduce the codebase and simplify its maintenance
However, some tasks, such as complex query processing, are still easier to solve with heuristics and some pre/post processing
CONCLUSION
SLIDE 18
THANK YOU FOR YOUR ATTENTION!
QUESTIONS?
AFIRSOV@KNOEMA.COM ANTON FIRSOV