
SLIDE 1

2019-02-19 Bradley Allen, Chief Architect, Elsevier

From Content Publishing to Data Solutions via Machine Learning

Presentation to Los Angeles Machine Learning Meetup

SLIDE 2

Twenty-three years ago

“It’s hard to imagine a sweeter business than publishing academic journals. The editorial content is contributed free of charge by scholars desperate to publish to get tenure. School libraries are automatic customers—professors insist

on it. ... Is the party over? It may be nearing its end. The Internet is closing in.”

Forbes, December 18, 1995

SLIDE 3

Read this, Search this, Do this

Cell Fundamentals Gray’s Anatomy ScienceDirect Scopus ClinicalKey Reaxys Sherpath Mendeley Knovel

Today: from content publishing to data solutions

SLIDE 4

‘You could use this treatment to save a life’

Clinicians

‘This article answers your questions’

Researchers

‘This is the research to invest in’

Governments

‘This is the cancer treatment you should pursue’

Pharmaceutical companies

‘This is the area you need to improve to qualify’

Nursing students

Our five main customer segments

SLIDE 5

Global research spend is growing every year.[1]

− 3.4% from 2015
− Predicted spend on research in 2016: $1.9TN

Researchers lack the tools they need to be effective.[2]

− Studies: 70-80% of research asks the wrong questions or cannot be reproduced

Life-saving drugs are expensive to develop.[3]

− $2.5BN median pharmaceutical spend per drug
− 1/20 success rate of drugs

Health providers cannot save lives without the best information.[4]

− Preventable medical errors: third largest cause of death in the US
− Heart Disease: 611k
− Cancer: 585k
− Medical Error: 225k
− Respiratory Illness: 149k

Sources: 1. Industrial Research Institute 2. The Lancet 3. Tufts 4. World Health Organization

The challenges our customers face

SLIDE 6

The assets we have at hand

Content

− Chemistry database: 500m published experimental facts
− User queries: 13m monthly users on ScienceDirect
− Books: 35,000 published books
− Drug database: 100% of drug information from pharmaceutical companies updated daily
− Research: 16% of the world’s research data and articles published by Elsevier

Technology

− 1,000 technologists employed by Elsevier
− Machine learning: over 1,000 predictive models trained on 1.5 billion electronic health care events
− Machine reading: 475m facts extracted from ScienceDirect
− Collaborative filtering: 1bn scientific articles added by 2.5m researchers analyzed daily to generate over 250m article recommendations
− Semantic enhancement: knowledge on 50m chemicals captured as 11B facts

SLIDE 7

How we think about delivering data solutions

− Determine the question (including use case and personae)
− Describe the data that needs to be produced to address the question
− If we have that data, reuse it
− If not, use the data we have to create it
− If we don’t have data we need, acquire what we’re missing

From Justin O’Beirne, “Google Maps’ Moat – How far ahead of Apple Maps is Google Maps?”, 2017-12. Retrieved from https://www.justinobeirne.com/google-maps-moat on 2018-05-31.

SLIDE 8

Breaking it down into eight simple steps

1. Market Definition: Determine target market personae & product features
2. Use Case Definition: Describe tasks performed by personae yielding use cases
3. Data & Query Specification: Describe data schemas & features to support use cases
4. Data Acquisition: Acquire content & data in multiple formats from multiple sources
5. Data Enhancement: Extract entities, attributes & relations, map entities to ontologies & taxonomies
6. Data Linking: Link extracted entities with other entities in existing enterprise data
7. Knowledge Graph Construction: Store mapped & linked data for access & discovery
8. Knowledge Delivery: Deliver query & visualisation of data

SLIDE 9

Knowledge graphs make it all hang together

I really believe that the key battleground in any industry is that of its knowledge graph. Google has it for media/advertising, Netflix has it for filmed entertainment, Uber has it for inner city transportation, Facebook has it across social media as well as messaging and the multiples speak for themselves.

Tony Askew, Founder/Partner at REV (personal communication, September 29, 2016)

SLIDE 10

The role that machine learning (ML) plays

Our goal is to drive business by enabling better outcomes through:

− Delivery of timely, appropriate advice for decision making & problem solving
− Enhanced discovery and query over massive amounts of information

We plan to achieve this by using ML to build knowledge graphs that enable the rapid development of data solutions

− Implementing entity/object extraction, relation extraction, entity disambiguation, classification, and sentiment analysis
− Based on the scientific & medical literature, experimental data, and the data exhaust associated with the practice of scientific communication & medical practice

SLIDE 11

Breaking down our ML efforts

Early wins

− Deployed systems adding value to existing products and solutions

Roofshots

− Task-specific use of ML to improve discoverability, knowledge delivery

Practicalities

− Human-in-the-loop NLP pipelines augmented with ML components to scale entity and relation extraction, entity linking for knowledge graph construction

Moonshots

− Use of multi-task learning architectures to develop a general-purpose approach to question answering from the scientific and medical literature and from experimental data

SLIDE 12

Early win: Recognizing decision graphs in medical content

ClinicalKey is Elsevier’s flagship medical reference search product

Clinicians prefer “answers” in the form of tables or flowcharts

− Eliminates need to page through retrieved content to find actionable information

ClinicalKey provides a sidebar section displaying answers, but this feature depends on very labor-intensive manual curation

Solution: automatically classify images in medical content corpus at index time

Benefits: lower cost and improved user experience


SLIDE 13

Early win: Recognizing decision graphs in medical content

Perfect fit for transfer learning approach

− Input to the classifier is an image and output is one of 8 classes: Photo, Radiological, Data graphic, Illustration, Microscopy, Flowchart, Electrophoresis, Medical decision graph
− Image dataset is augmented by producing variations of the training images by rotating, flipping, transposing, jittering, etc.
− Reusing all but the last two Dense layers of a pre-trained model (VGG-CNN, available from Caffe’s “model zoo”)
− VGG-CNN was trained on ImageNet (14 million images from the Web, 1000 general topic classes, e.g., Cat, Airplane, House)
− Last layer is a multinomial logistic regression (or softmax) classifier

Model trained on 10,167 images with a 70/30 train/test split
Achieves 93% test set accuracy

− Evaluated image + caption text model but did not get a big performance boost

Searchable image base used to support training set and model development
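The augmentation step on this slide (rotating, flipping, transposing) can be sketched in plain Python. This is only an illustration under stated assumptions: images are modeled as nested lists of pixel values, the function names are hypothetical, and jittering is left out. It is not the pipeline used for the ClinicalKey classifier.

```python
from typing import List

# Assumption: a grayscale image is a list of rows of pixel values.
Image = List[List[int]]


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in img]


def flip_vertical(img: Image) -> Image:
    """Mirror the image top to bottom."""
    return [list(row) for row in reversed(img)]


def transpose(img: Image) -> Image:
    """Swap rows and columns."""
    return [list(col) for col in zip(*img)]


def rotate_90(img: Image) -> Image:
    """Rotate 90 degrees clockwise: reverse the row order, then transpose."""
    return transpose(flip_vertical(img))


def augment(img: Image) -> List[Image]:
    """Return the original image plus its distinct simple variations."""
    candidates = [
        img,
        flip_horizontal(img),
        flip_vertical(img),
        transpose(img),
        rotate_90(img),
    ]
    unique: List[Image] = []
    for candidate in candidates:
        if candidate not in unique:  # symmetric images can yield duplicates
            unique.append(candidate)
    return unique
```

Each variation keeps the label of its source image, so a labeled training set grows without additional manual curation.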

SLIDE 14

Early win: Generating topic pages from scientific content

Take a ScienceDirect article

Take a taxonomy

Find occurrences of concepts

− Definition
− Snippet(s)
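The flow on this slide (article, taxonomy, concept occurrences, then a definition and snippets) might look roughly like the sketch below. It assumes plain dictionary-style label matching with regular expressions and a naive "X is/are ..." heuristic for definitions. The function names and the heuristic are hypothetical and are not taken from the production system.

```python
import re
from typing import Dict, List, Optional


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_topic_page(concept: str, labels: List[str], text: str) -> Dict[str, object]:
    """Collect sentences mentioning a taxonomy concept and pick a definition.

    `labels` are the surface forms of the concept (preferred label plus synonyms).
    """
    alternatives = "(?:" + "|".join(re.escape(label) for label in labels) + ")"
    mention = re.compile(r"\b" + alternatives + r"\b", re.IGNORECASE)
    # Assumed heuristic: a defining sentence has the form "<label> is/are ..."
    defining = re.compile(r"\b" + alternatives + r"\s+(?:is|are)\b", re.IGNORECASE)

    snippets = [s for s in split_sentences(text) if mention.search(s)]
    definition: Optional[str] = next((s for s in snippets if defining.search(s)), None)
    return {"concept": concept, "definition": definition, "snippets": snippets}


article = (
    "Patients often report dyspnoea. "
    "Dyspnea is shortness of breath. "
    "Chest pain is also common."
)
page = build_topic_page("Dyspnea", ["dyspnea", "dyspnoea"], article)
```

Here `page["snippets"]` holds the first two sentences and `page["definition"]` is the second one. A real taxonomy would also need disambiguation of labels shared by several concepts.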

SLIDE 15

Early win: Generating topic pages from scientific content

SLIDE 16

Roofshot: Extracting clinically useful relationships from medical content

SLIDE 17

Roofshot: Extracting clinically useful relationships from medical content

Extracted relation: Pulmonary Embolism [has Clinical Finding] Dyspnea

Three clinical symptoms were considered to be highly suggestive of PE: recent dyspnoea, recent chest pain and unusual tachycardia >75/min.
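One way to picture the output of this extraction is as a subject, predicate, object triple that keeps a pointer to its evidence sentence. The sketch below is a minimal in-memory store with hypothetical class and method names; it is not the graph store used in practice.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    evidence: str  # the sentence the relation was extracted from


class TripleStore:
    """Tiny in-memory store supporting wildcard pattern matching."""

    def __init__(self) -> None:
        self._triples: List[Triple] = []

    def add(self, triple: Triple) -> None:
        self._triples.append(triple)

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> List[Triple]:
        """Return triples matching every field that is not None."""
        return [
            t
            for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        ]


store = TripleStore()
store.add(
    Triple(
        "Pulmonary Embolism",
        "has Clinical Finding",
        "Dyspnea",
        "Three clinical symptoms were considered to be highly suggestive of PE: "
        "recent dyspnoea, recent chest pain and unusual tachycardia >75/min.",
    )
)
findings = store.match(subject="Pulmonary Embolism", predicate="has Clinical Finding")
```

Keeping the evidence sentence alongside each triple lets a curator check a predicted relation against its source text.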

SLIDE 18

Roofshot: Extracting clinically useful relationships from medical content

CNN implemented in Keras

− Input: relationship labels and syntax paths linking relation arguments, semantic tagging
− Embed in 64-dimensional space (like Word2vec)
− Compute 1-dimensional convolution to learn path structure
− Perform final softmax activation to predict one of N relations

Semantic analysis using FPE annotation

Syntactic analysis with spaCy on Apache Spark
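To make the data flow of this architecture concrete, here is a forward pass written in plain Python rather than Keras. The 64-dimensional embedding, the 1-dimensional convolution and the final softmax come from the slide. Everything else is an assumption for illustration: the vocabulary of path tokens, the kernel width, the number of filters, the ReLU and max-pooling steps, the relation labels other than "has Clinical Finding", and the random untrained weights.

```python
import math
import random
from typing import List

EMBED_DIM = 64  # from the slide
KERNEL = 3      # assumed kernel width
FILTERS = 8     # assumed number of convolution filters
# Only the first label appears in the slides; the others are hypothetical.
RELATIONS = ["has Clinical Finding", "has Treatment", "no relation"]
# Hypothetical tokens for a syntax path between two semantically tagged arguments.
VOCAB = {"<pad>": 0, "DISEASE": 1, "nsubj": 2, "suggestive": 3, "prep_of": 4, "FINDING": 5}

rng = random.Random(0)


def rand_matrix(rows: int, cols: int) -> List[List[float]]:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


embedding = rand_matrix(len(VOCAB), EMBED_DIM)               # token -> 64-d vector
conv_w = [rand_matrix(KERNEL, EMBED_DIM) for _ in range(FILTERS)]
dense_w = rand_matrix(FILTERS, len(RELATIONS))


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def predict(path_tokens: List[str]) -> List[float]:
    """Embed the path, convolve over it, pool, and score each relation."""
    ids = [VOCAB[token] for token in path_tokens]
    while len(ids) < KERNEL:  # pad short paths so one window fits
        ids.append(VOCAB["<pad>"])
    x = [embedding[i] for i in ids]

    pooled = []
    for w in conv_w:
        activations = []
        for start in range(len(x) - KERNEL + 1):
            total = sum(
                w[k][d] * x[start + k][d]
                for k in range(KERNEL)
                for d in range(EMBED_DIM)
            )
            activations.append(max(0.0, total))  # ReLU (assumed)
        pooled.append(max(activations))          # global max pooling (assumed)

    logits = [
        sum(pooled[f] * dense_w[f][r] for f in range(FILTERS))
        for r in range(len(RELATIONS))
    ]
    return softmax(logits)


probs = predict(["DISEASE", "nsubj", "suggestive", "prep_of", "FINDING"])
```

With untrained weights the probabilities are close to uniform. In a trained model the highest entry of `probs` would select the predicted relation.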

SLIDE 19

Roofshot: Extracting clinically useful relationships from medical content

SLIDE 20

Roofshot: Extracting clinically useful relationships from medical content

SLIDE 21

Roofshot: Assistants for the interpretation of pathology imagery

What we need: annotated raw images

What we have: images with their captions

− Example caption: “Notice the multiple subependymal nodules in fig 3.”

SLIDE 22

Roofshot: Assistants for the interpretation of pathology imagery

SLIDE 23

Practicality: Building continuous modeling and quality control into our deployment workflows

SLIDE 24

Practicality: Training our development squads

Objective

− Provide data and software engineers with baseline understanding of ML/DS concepts and techniques

Key Results

− Increase total Individual Contributors in Data Engineering and Software Engineering Tech Tracks who have completed Yellow Belt course from ~5% to 80%
− Increase total Individual Contributors in Data Engineering and Software Engineering Tech Tracks who have completed Green Belt course from ~2% to 50%
− Increase total Individual Contributors in Data Engineering and Software Engineering Tech Tracks who have completed Blue Belt course from <1% to 20%

Judo Belt Level: Course

− Yellow: Python for Data Science and Machine Learning Bootcamp
− Green: Data Analysis with Pandas and Python
− Blue: One of: Scala and Spark for Big Data and Machine Learning or Spark and Python for Big Data with PySpark
− Brown: Deep Learning with Tensorflow

SLIDE 25

Task, then Technology: Examples

Market Definition

− Web analytics platforms: Adobe Analytics
− Sales, marketing & CRM platforms: Salesforce

Use Case Definition

− Enterprise wikis: Confluence
− Collaboration platforms: Slack

Data & Query Specification

− Notebook-based data science platforms: Databricks, Jupyter, RunKit

Data Acquisition

− Web & LOD crawlers: Nutch
− ETL tools: Talend, AWS Glue, Apache Spark
− Workflow automation frameworks: Activiti, Apache Airflow, AWS Mechanical Turk
− Log-based data processing frameworks: Apache Flume, Logstash, Kafka, AWS Redshift

Data Enhancement

− NLP & ML packages & services: Ad hoc string or regular expression matching using standard language libraries, FPE, MedScan, OpenNLP, NLTK
− Ontology, taxonomy & data management tools: PoolParty, TopBraid Composer, Gitlab, Github

Data Linking

− Entity linking packages & services: Ad hoc string and regular expression matching using standard language libraries, NLP/ML/DL algorithms

Knowledge Graph Construction

− Graph stores (including RDF & property graph stores): GraphDB, JanusGraph, DataStax Graph, Neo4J, AWS Neptune
− Search engines: Solr, ElasticSearch, managed search services
− Linked data REST servers: Nginx, Apache, AWS Lambda

Knowledge Delivery

− Data visualization & query applications: Kibana, Tableau, D3
− Web application frameworks: Angular, React, Express

Practicality: Evaluating and selecting technologies

SLIDE 26

Moonshot: matrix factorization for relation extraction

Universal schema matrix over entity pairs and relations (surface form relations and structured relations): p = 83, r = 176

83 x 176 sparse binary-valued matrix with 366 entries

Diagram labels: Content, Open Information Extraction, Triple Extraction, Entity Resolution, Matrix Construction, Matrix Factorization (factorization model), Matrix Completion, Predicted relations, Curation, Knowledge graph, Taxonomy

Data volumes shown: 14M articles from ScienceDirect; 475M facts; 49M facts; 3.3M facts; 920K concepts from EMMeT

Example surface form relations:

− glaucoma developed many years after chronic inflammation of uveal tract
− glaucoma develop following chronic inflammation of uveal tract
− glaucoma can appear soon in family history of glaucoma
− glaucoma can appear soon in age over 40
− glaucoma the risk of functional visual field loss
− glaucoma contributing causes of functional visual field loss
− glaucoma contributed to functional visual field loss
− glaucoma is considered the second leading cause of functional visual field loss
− glaucoma remains the second leading cause of functional visual field loss

Latent factor matrix: r = 176, p = 83

83 x 176 real-valued matrix with 14,608 entries

Example rows:

− diseases 2791370 glaucoma have been documented to cause contact dermatitis 3815093 diseases
− diseases 2791370 glaucoma is assessed through evaluation 5415395 qualifier
− diseases 2791370 glaucoma progresses more rapidly than primary open-angle glaucoma 8247149 diseases
− diseases 2791370 glaucoma recommend treatment 5216597 procedures
− diseases 2791370 glaucoma supports the assumption that oxidative stress 8184588 diseases
− diseases 2791370 glaucoma is the death of retinal ganglion cells 8002088 anatomy
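The idea behind this slide, filling in a sparse binary matrix of entity pairs by relations from learned latent factors, can be illustrated with a small logistic matrix factorization. This is a toy sketch under stated assumptions, not the model behind the 83 x 176 example. The three entity pairs, the structured "causes" column, the rank, and the hyperparameters are hypothetical, and treating every unobserved cell as a negative is a simplification.

```python
import math
import random
from typing import List, Set, Tuple

# Rows: entity pairs. Columns: relations. Labels are illustrative only.
PAIRS = ["pair A", "pair B", "pair C"]
RELATIONS = [
    "contributed to",                        # surface form
    "remains the second leading cause of",   # surface form
    "causes",                                # hypothetical structured relation
    "is assessed through",                   # surface form
]
# Observed cells (row, column). Pair B shares pair A's surface forms
# but has no structured relation recorded.
OBSERVED: Set[Tuple[int, int]] = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 3)}
HELD_OUT = (1, 2)  # the cell we want the model to complete


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def train(n_rows, n_cols, observed, held_out, rank=2, epochs=500,
          lr=0.1, reg=0.01, seed=0):
    """Fit row and column factors with SGD on a logistic loss."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)
             if (i, j) != held_out]
    for _ in range(epochs):
        rng.shuffle(cells)
        for i, j in cells:
            target = 1.0 if (i, j) in observed else 0.0  # unobserved = negative
            err = sigmoid(dot(U[i], V[j])) - target
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                U[i][k] -= lr * (err * v + reg * u)
                V[j][k] -= lr * (err * u + reg * v)
    return U, V


def score(U, V, i: int, j: int) -> float:
    """Predicted probability that relation j holds for entity pair i."""
    return sigmoid(dot(U[i], V[j]))


U, V = train(len(PAIRS), len(RELATIONS), OBSERVED, HELD_OUT)
completed = score(U, V, *HELD_OUT)
```

Because pair B behaves like pair A on the surface form columns, the held-out "causes" cell for pair B should score higher than the same column for the unrelated pair C. Predictions like this are candidates for curation, not facts.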

SLIDE 27

Moonshot: Question answering (QA) as THE Problem

“Question answering (QA) is a complex natural language processing task which requires an understanding of the meaning of a text and the ability to reason over relevant facts. Most, if not all, tasks in natural language processing can be cast as a question answering problem …”

Ankit Kumar, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, Richard Socher. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. The 33rd International Conference on Machine Learning (ICML 2016), 2016

SLIDE 28

Moonshot: Question answering (QA) as THE Problem

Currently evaluating BERT (specifically, BioBERT)

Transformer-style architectures seem to be unreasonably effective

However, their efficacy with scientific and medical content is unknown

Early results are inconclusive as reproducibility is challenging

SLIDE 29

The hidden opportunity

People that want to do ML need lots and lots of high-quality content & data

− The more the better

Algorithms for ML are commodities
Data for ML is expensive
Commissioning, authoring and curating data for ML is a new digital revenue opportunity

SLIDE 30

My working hypothesis

As machines become increasingly capable of general-purpose language understanding, the burden of effort in building machine intelligence will shift from writing software to curating content.

SLIDE 31

Scholarly publishing in the next twenty years

Publishing content for people

Publishing data for machines

Diagram labels: model, learning architecture, data, context, knowledge, content, GPU

SLIDE 32

Stepping back: A personal perspective on AI

We’re in an era of tremendous expectations

− The Deep Learning Era (2010-?)

I’ve (we’ve) been here before

− The Expert Systems Era (1984-1992)

How is it different now and what does that mean for our ambitions?

SLIDE 33

Some things are the same

Expert Systems Era vs. Deep Learning Era

− Expensive hardware: Lisp Machines vs. Cloud TPUs
− Expensive people: Knowledge engineers vs. Data scientists
− Geostrategic calls to action: Japan vs. China

SLIDE 34

Some things are dramatically different

Expert Systems Era vs. Deep Learning Era

− # of computers connected to the Internet: 10^2 vs. 10^9
− FLOPS: 10^9 vs. 10^14
− Bits/USD: 10^4 vs. 10^9

SLIDE 35

We’re in a world where it is much, much easier to build and field AI applications...

A global culture of technology
(Relative) homogeneity of hardware and software platforms and packages
Ubiquity of networked hardware
Open source software packages and platforms
Open access to published results
Increased reproducibility and transferability of results

− Though not everywhere

SLIDE 36

... but we’re beginning to see hints of trouble

The huge strides of the last decade have been in machine perception, but not in machine reasoning

While AI systems are cheaper to build and easier to field, they are often still brittle

− Self-driving cars and unjustified levels of trust in automated systems
− Adversarial examples

Learning from data is yielding unanticipated consequences

− The politics and economics of dataset creation
− Biased models from biased data

On top of this, we’re witnessing the end of the Moore’s Law era

− Progress through faster, cheaper hardware may be much slower to come

And, oh yeah... IBM Watson

SLIDE 37

Should we be worried?

The Expert Systems Era was perceived as a failure...

− Fielded systems were brittle, hard to maintain, and often didn’t address real customer needs
− Projected short-term benefits for businesses did not live up to the hype

... but was it really?

− If it works, it isn’t AI anymore: Ed Feigenbaum’s story about the Pulmonary Function Advisory Expert System

So no, but we need to be aware that AI applications can involve hard problems that will take many business cycles to solve

− Speech recognition from 1970 to today
− Robot walking from Marc Raibert’s Leg Lab to Boston Dynamics

Artificial General Intelligence: a problem for another generation, perhaps one in another century

SLIDE 38

No, but we should be focused and deliberate

Always work backwards from real customer needs to define an application

− Did I mention IBM Watson?

Deal with the challenges of data acquisition and quality first

− Be proactive with respect to fairness and privacy

Design applications to mitigate brittleness

− Start simple
− Augmentation before automation
− Human-in-the-loop
− Alter the problem and environment to make the problem more manageable

Leverage your differentiating strengths

− Our leadership in editorial processes for high quality trusted content
− Our deep subject matter expertise

SLIDE 39

Summary

Knowledge graphs are a way of sharing knowledge between people and machines, and the battleground on which dominance in markets will be established

We’re using machine learning to build knowledge graphs for use by researchers, medical professionals, and students

There is much work underway and a lot yet to be done
