From Content Publishing to Data Solutions via Machine Learning
Presentation to Los Angeles Machine Learning Meetup, 2019-02-19
Bradley Allen, Chief Architect, Elsevier
Twenty-three years ago
“It’s hard to imagine a sweeter business than publishing academic journals. The editorial content is contributed free of charge by scholars desperate to publish to get tenure. School libraries are automatic customers—professors insist
on it. ... Is the party over? It may be nearing its end. The Internet is closing in.”
- Forbes, December 18, 1995
Read this. Search this. Do this.
Cell Fundamentals Gray’s Anatomy ScienceDirect Scopus ClinicalKey Reaxys Sherpath Mendeley Knovel
Today: from content publishing to data solutions
‘You could use this treatment to save a life’
Clinicians
‘This article answers your questions’
Researchers
‘This is the research to invest in’
Governments
‘This is the cancer treatment you should pursue’
Pharmaceutical companies
‘This is the area you need to improve to qualify’
Nursing students
Our five main customer segments
Global research spend is growing every year. [1]
- Predicted spend: $1.9TN on research in 2016, up 3.4% from 2015
Researchers lack the tools they need to be effective. [2]
- Studies: 70-80% of research asks the wrong questions or cannot be reproduced
Life-saving drugs are expensive to develop. [3]
- $2.5BN median pharmaceutical spend per drug
- 1/20 success rate of drugs
Health providers cannot save lives without the best information. [4]
- Preventable medical errors: third largest cause of death in the US
- Heart Disease: 611k
- Cancer: 585k
- Medical Error: 225k
- Respiratory Illness: 149k
Sources: 1. Industrial Research Institute 2. The Lancet 3. Tufts 4. World Health Organization
The challenges our customers face
The assets we have at hand
Content
- Chemistry database: 500m published experimental facts
- User queries: 13m monthly users on ScienceDirect
- Books: 35,000 published books
- Drug database: 100% of drug information from pharmaceutical companies, updated daily
- Research: 16% of the world’s research data and articles published by Elsevier
Technology
- 1,000 technologists employed by Elsevier
- Machine learning: over 1,000 predictive models trained on 1.5 billion electronic health care events
- Machine reading: 475m facts extracted from ScienceDirect
- Collaborative filtering: 1bn scientific articles added by 2.5m researchers analyzed daily to generate over 250m article recommendations
- Semantic enhancement: knowledge on 50m chemicals captured as 11B facts
How we think about delivering data solutions
- Determine the question (including use case and personae)
- Describe the data that needs to be produced to address the question
- If we have that data, reuse it
- If not, use the data we have to create it
- If we don’t have data we need, acquire what we’re missing
From Justin O’Beirne, “Google Maps’ Moat – How far ahead of Apple Maps is Google Maps?”, 2017-12. Retrieved from https://www.justinobeirne.com/google-maps-moat on 2018-05-31.
Breaking it down into eight simple steps
1. Market Definition: Determine target market personae & product features
2. Use Case Definition: Describe tasks performed by personae yielding use cases
3. Data & Query Specification: Describe data schemas & features to support use cases
4. Data Acquisition: Acquire content & data in multiple formats from multiple sources
5. Data Enhancement: Extract entities, attributes & relations, map entities to ontologies & taxonomies
6. Data Linking: Link extracted entities with other entities in existing enterprise data
7. Knowledge Graph Construction: Store mapped & linked data for access & discovery
8. Knowledge Delivery: Deliver query & visualisation of data
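A minimal sketch of how the Data Enhancement, Data Linking, Knowledge Graph Construction and Knowledge Delivery steps might hand data to one another. Everything here is an assumption made for illustration: the tiny taxonomy, the `concept:` identifiers, the `has_clinical_finding` predicate name, and the `link_entity` and `TripleStore` helpers are hypothetical and are not taken from any Elsevier system.

```python
from collections import defaultdict

# Hypothetical mini-taxonomy: surface strings mapped to concept identifiers.
TAXONOMY = {
    "pulmonary embolism": "concept:pulmonary_embolism",
    "pe": "concept:pulmonary_embolism",
    "dyspnea": "concept:dyspnea",
    "dyspnoea": "concept:dyspnea",
}


def link_entity(mention):
    """Data Linking: map a mention to a taxonomy concept, or None if unknown."""
    return TAXONOMY.get(mention.strip().lower())


class TripleStore:
    """Knowledge Graph Construction: in-memory (subject, predicate, object) triples."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        self._by_subject[subject].add((predicate, obj))

    def objects(self, subject, predicate):
        """Knowledge Delivery: all objects stored for a subject/predicate pair."""
        return sorted(o for p, o in self._by_subject[subject] if p == predicate)


# Data Enhancement output: (mention, relation, mention) tuples, hand-written here.
extracted = [
    ("PE", "has_clinical_finding", "dyspnoea"),
    ("Pulmonary Embolism", "has_clinical_finding", "Dyspnea"),
]

graph = TripleStore()
for subject_mention, relation, object_mention in extracted:
    subject, obj = link_entity(subject_mention), link_entity(object_mention)
    if subject and obj:  # keep only triples whose two ends were both linked
        graph.add(subject, relation, obj)

print(graph.objects("concept:pulmonary_embolism", "has_clinical_finding"))
```

Both extracted mentions resolve to the same pair of concepts, so the store ends up holding a single triple; that deduplication is the point of linking before storing.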
Knowledge graphs make it all hang together
I really believe that the key battleground in any industry is that of its knowledge graph. Google has it for media/advertising, Netflix has it for filmed entertainment, Uber has it for inner city transportation, Facebook has it across social media as well as messaging and the multiples speak for themselves.
Tony Askew, Founder/Partner at REV (personal communication, September 29, 2016)
The role that machine learning (ML) plays
- Our goal is to drive business by enabling better outcomes through:
− Delivery of timely, appropriate advice for decision making & problem solving
− Enhanced discovery and query over massive amounts of information
- We plan to achieve this by using ML to build knowledge graphs that enable the rapid development of data solutions
− Implementing entity/object extraction, relation extraction, entity disambiguation, classification, and sentiment analysis
− Based on the scientific & medical literature, experimental data, and the data exhaust associated with the practice of scientific communication & medical practice
Breaking down our ML efforts
- Early wins
− Deployed systems adding value to existing products and solutions
- Roofshots
− Task-specific use of ML to improve discoverability, knowledge delivery
- Practicalities
− Human-in-the-loop NLP pipelines augmented with ML components to scale entity and relation extraction, entity linking for knowledge graph construction
- Moonshots
− Use of multi-task learning architectures to develop a general-purpose approach to question answering from the scientific and medical literature and from experimental data
Early win: Recognizing decision graphs in medical content
- ClinicalKey is Elsevier’s flagship medical reference search product
- Clinicians prefer “answers” in the form of tables or flowcharts
− Eliminates need to page through retrieved content to find actionable information
- ClinicalKey provides a sidebar section displaying answers, but this feature depends on very labor-intensive manual curation
- Solution: automatically classify images in medical content corpus at index time
- Benefits: lower cost and improved user experience
- Perfect fit for transfer learning approach
− Input to the classifier is an image and output is one of 8 classes: Photo, Radiological, Data graphic, Illustration, Microscopy, Flowchart, Electrophoresis, Medical decision graph
− Image dataset is augmented by producing variations of the training images by rotating, flipping, transposing, jittering, etc.
− Reusing all but the last two Dense layers of a pre-trained model (VGG-CNN, available from Caffe’s “model zoo”)
− VGG-CNN was trained on ImageNet (14 million images from the Web, 1000 general topic classes, e.g., Cat, Airplane, House)
− Last layer is a multinomial logistic regression (or softmax) classifier
- Model trained on 10,167 images with a 70/30 train/test split
- Achieves 93% test set accuracy
− Evaluated image + caption text model but did not get a big performance boost
- Searchable image base used to support training set and model development
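To make the moving parts concrete, here is a small pure-Python sketch of two ideas from this slide: augmenting images by flipping, transposing and rotating, and training a softmax (multinomial logistic regression) layer on top of fixed features. It is not the deployed model. The feature vectors are made-up toy clusters standing in for the output of a frozen pre-trained network, only 3 toy classes are used instead of 8, and the function names, learning rate and epoch count are assumptions.

```python
import math
import random


def augment(image):
    """Return the image plus flipped, transposed and rotated variants (image = list of rows)."""
    flipped_lr = [row[::-1] for row in image]
    flipped_ud = [row[:] for row in image[::-1]]
    transposed = [list(col) for col in zip(*image)]
    rotated_cw = [list(col) for col in zip(*image[::-1])]  # 90 degrees clockwise
    return [image, flipped_lr, flipped_ud, transposed, rotated_cw]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def train_softmax_head(features, labels, n_classes, epochs=100, lr=0.5):
    """Fit only the final softmax layer; the features are treated as frozen."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    bias = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(class_scores(weights, bias, x))
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                bias[c] -= lr * grad
                for i in range(dim):
                    weights[c][i] -= lr * grad * x[i]
    return weights, bias


def predict(weights, bias, x):
    scores = class_scores(weights, bias, x)
    return scores.index(max(scores))


# Toy stand-in for frozen CNN features: three well-separated clusters (made-up data).
rng = random.Random(0)
centers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
data = [([c + rng.gauss(0, 0.1) for c in centers[label]], label)
        for label in range(3) for _ in range(20)]
rng.shuffle(data)

split = int(0.7 * len(data))  # 70/30 train/test split, as on the slide
train_set, test_set = data[:split], data[split:]
weights, bias = train_softmax_head(
    [x for x, _ in train_set], [y for _, y in train_set], n_classes=3)
accuracy = sum(predict(weights, bias, x) == y for x, y in test_set) / len(test_set)
print(f"test accuracy on toy data: {accuracy:.2f}")
print(augment([[1, 2], [3, 4]]))
```

Training only the last layer is what keeps this cheap: the expensive feature extractor is reused as-is, and the number of parameters fitted to the roughly 10,000 labeled images stays small.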
Early win: Generating topic pages from scientific content
- Take a ScienceDirect article
- Take a taxonomy
- Find occurrences of concepts
- Definition
- Snippet(s)
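The recipe on this slide (take an article, take a taxonomy, find occurrences of concepts, then surface a definition and snippets) can be sketched with plain string matching. This is a toy illustration, not the production approach: the taxonomy, the article text, the sentence splitter and the "X is a ..." definition heuristic are all assumptions made for the example.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mentions(sentence, term):
    """True if the term occurs in the sentence as a whole word, ignoring case."""
    return re.search(r"\b" + re.escape(term) + r"\b", sentence, re.IGNORECASE) is not None


def looks_like_definition(sentence, term):
    """Heuristic (an assumption): the sentence starts with '<term> is/are a/an/the'."""
    pattern = re.escape(term) + r"\s+(is|are)\s+(a|an|the)\b"
    return re.match(pattern, sentence, re.IGNORECASE) is not None


def build_topic_pages(article_text, taxonomy):
    """Map each concept to one definition sentence (if found) and its other snippets."""
    pages = {}
    for sentence in split_sentences(article_text):
        for concept, terms in taxonomy.items():
            if not any(mentions(sentence, t) for t in terms):
                continue
            page = pages.setdefault(concept, {"definition": None, "snippets": []})
            if page["definition"] is None and any(looks_like_definition(sentence, t) for t in terms):
                page["definition"] = sentence
            else:
                page["snippets"].append(sentence)
    return pages


# Hypothetical taxonomy (concept label -> synonyms) and invented article text.
taxonomy = {"Glaucoma": ["glaucoma"], "Dyspnea": ["dyspnea", "dyspnoea"]}
article = ("Glaucoma is a group of eye diseases that damage the optic nerve. "
           "Age over 40 is a commonly cited risk factor for glaucoma. "
           "Dyspnoea is a sensation of shortness of breath.")

pages = build_topic_pages(article, taxonomy)
for concept, page in pages.items():
    print(concept, "|", page["definition"], "|", page["snippets"])
```

Keeping synonyms in the taxonomy rather than in the matching code is the design choice that matters here: spelling variants such as "dyspnoea" land on the same topic page without special cases.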
Roofshot: Extracting clinically useful relationships from medical content
Extracted relation: Pulmonary Embolism, has Clinical Finding, Dyspnea
Source sentence: “Three clinical symptoms were considered to be highly suggestive of PE: recent dyspnoea, recent chest pain and unusual tachycardia >75/min.”
- CNN implemented in Keras
− Input: relationship labels and syntax paths linking relation arguments, semantic tagging
− Embed in 64-dimensional space (like Word2vec)
− Compute 1-dimensional convolution to learn path structure
− Perform final softmax activation to predict one of N relations
- Semantic analysis using FPE annotation
- Syntactic analysis with spaCy on Apache Spark
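The layer sequence described here (embed path tokens in 64 dimensions, apply a 1-dimensional convolution, finish with a softmax over N relations) can be traced with an untrained forward pass in plain Python. This is a sketch of the data flow only, not the Keras model: the weights are random, and the vocabulary, relation labels, kernel width, filter count, and the use of ReLU with global max pooling are assumptions not stated on the slide.

```python
import math
import random

EMBED_DIM = 64     # embedding size given on the slide
KERNEL_WIDTH = 3   # assumption: convolution window of 3 tokens
N_FILTERS = 8      # assumption: number of convolution filters
RELATIONS = ["has_clinical_finding", "has_treatment", "no_relation"]  # hypothetical labels
VOCAB = ["<pad>", "DISEASE", "FINDING", "nsubj", "dobj", "prep", "pobj"]  # hypothetical tokens
TOKEN_ID = {token: i for i, token in enumerate(VOCAB)}

rng = random.Random(0)


def random_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


embedding = random_matrix(len(VOCAB), EMBED_DIM)                    # token -> 64-d vector
conv_filters = random_matrix(N_FILTERS, KERNEL_WIDTH * EMBED_DIM)   # one flattened window per filter
dense = random_matrix(len(RELATIONS), N_FILTERS)                    # pooled features -> relation scores


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_relation(path_tokens):
    """Untrained forward pass: embed, convolve over the path, max-pool, softmax."""
    ids = [TOKEN_ID[t] for t in path_tokens]
    ids += [TOKEN_ID["<pad>"]] * max(0, KERNEL_WIDTH - len(ids))  # pad short paths
    vectors = [embedding[i] for i in ids]
    pooled = []
    for flt in conv_filters:
        best = 0.0  # starting at 0 combines ReLU with global max pooling
        for start in range(len(vectors) - KERNEL_WIDTH + 1):
            window = [v for vec in vectors[start:start + KERNEL_WIDTH] for v in vec]
            best = max(best, sum(w * x for w, x in zip(flt, window)))
        pooled.append(best)
    scores = [sum(w * p for w, p in zip(row, pooled)) for row in dense]
    return dict(zip(RELATIONS, softmax(scores)))


# A made-up dependency path between a disease mention and a finding mention.
probs = predict_relation(["DISEASE", "nsubj", "prep", "pobj", "FINDING"])
print(probs)
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how a variable-length path becomes a fixed-length vector before classification.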
Roofshot: Assistants for the interpretation of pathology imagery
What we need: annotated raw images
“Notice the multiple subependymal nodules in fig 3.”
What we have: images with their captions
Practicality: Building continuous modeling and quality control into our deployment workflows
Practicality: Training our development squads
Objective: Provide data and software engineers with baseline understanding of ML/DS concepts and techniques
Key Results:
- Increase total Individual Contributors in Data Engineering and Software Engineering Tech Tracks who have completed Yellow Belt course from ~5% to 80%
- Increase total Individual Contributors in Data Engineering and Software Engineering Tech Tracks who have completed Green Belt course from ~2% to 50%
- Increase total Individual Contributors in Data Engineering and Software Engineering Tech Tracks who have completed Blue Belt course from <1% to 20%
Judo Belt Level | Course
Yellow | Python for Data Science and Machine Learning Bootcamp
Green | Data Analysis with Pandas and Python
Blue | One of: Scala and Spark for Big Data and Machine Learning, or Spark and Python for Big Data with PySpark
Brown | Deep Learning with Tensorflow
Task | Technology | Examples
Market Definition | Web analytics platforms | Adobe Analytics
Market Definition | Sales, marketing & CRM platforms | Salesforce
Use Case Definition | Enterprise wikis | Confluence
Use Case Definition | Collaboration platforms | Slack
Data & Query Specification | Notebook-based data science platforms | Databricks, Jupyter, RunKit
Data Acquisition | Web & LOD crawlers | Nutch
Data Acquisition | ETL tools | Talend, AWS Glue, Apache Spark
Data Acquisition | Workflow automation frameworks | Activiti, Apache Airflow, AWS Mechanical Turk
Data Acquisition | Log-based data processing frameworks | Apache Flume, Logstash, Kafka, AWS Redshift
Data Enhancement | NLP & ML packages & services | Ad hoc string or regular expression matching using standard language libraries, FPE, MedScan, OpenNLP, NLTK
Data Enhancement | Ontology, taxonomy & data management tools | PoolParty, TopBraid Composer, Gitlab, Github
Data Linking | Entity linking packages & services | Ad hoc string and regular expression matching using standard language libraries, NLP/ML/DL algorithms
Knowledge Graph Construction | Graph stores (including RDF & property graph stores) | GraphDB, JanusGraph, DataStax Graph, Neo4J, AWS Neptune
Knowledge Graph Construction | Search engines | Solr, ElasticSearch, managed search services
Knowledge Graph Construction | Linked data REST servers | Nginx, Apache, AWS Lambda
Knowledge Delivery | Data visualization & query applications | Kibana, Tableau, D3
Knowledge Delivery | Web application frameworks | Angular, React, Express
Practicality: Evaluating and selecting technologies
Moonshot: matrix factorization for relation extraction
Universal schema: a factorization model over entity pairs and relations, covering both surface form relations and structured relations
Input matrix: p = 83, r = 176; an 83 x 176 sparse binary-valued matrix with 366 entries
Latent factor matrix: p = 83, r = 176; an 83 x 176 real-valued matrix with 14,608 entries
Pipeline labels: Content, Triple Extraction, Open Information Extraction, Entity Resolution, Taxonomy, Matrix Construction, Matrix Factorization, Matrix Completion, Predicted relations, Curation, Knowledge graph
Counts shown: 14M articles from ScienceDirect; 920K concepts from EMMeT; 475M facts; 49M facts; 3.3M facts
Example surface form relations (subject | relation | object):
glaucoma | developed many years after | chronic inflammation of uveal tract
glaucoma | develop following | chronic inflammation of uveal tract
glaucoma | can appear soon in | family history of glaucoma
glaucoma | can appear soon in | age over 40
glaucoma | the risk of | functional visual field loss
glaucoma | contributing causes of | functional visual field loss
glaucoma | contributed to | functional visual field loss
glaucoma | is considered the second leading cause of | functional visual field loss
glaucoma | remains the second leading cause of | functional visual field loss
Example rows with concept IDs and types (subject type | subject ID | subject | relation | object | object ID | object type):
diseases | 2791370 | glaucoma | have been documented to cause | contact dermatitis | 3815093 | diseases
diseases | 2791370 | glaucoma | is assessed through | evaluation | 5415395 | qualifier
diseases | 2791370 | glaucoma | progresses more rapidly than | primary open-angle glaucoma | 8247149 | diseases
diseases | 2791370 | glaucoma | recommend | treatment | 5216597 | procedures
diseases | 2791370 | glaucoma | supports the assumption that | oxidative stress | 8184588 | diseases
diseases | 2791370 | glaucoma | is the death of | retinal ganglion cells | 8002088 | anatomy
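A rough sketch of the idea behind this slide: factorize a sparse binary matrix of observed (entity pair, relation) cells into low-rank latent vectors, then read predicted scores for every cell, including unobserved ones, from the completed matrix. It also checks the slide's arithmetic (83 x 176 = 14,608 cells, of which 366 are observed). The toy 7 x 4 matrix, the logistic loss, the sampling of unobserved cells as negatives, and all hyperparameters are assumptions; the slides do not say which factorization model or training procedure was used.

```python
import math
import random

# Sizes quoted on the slide.
n_pairs, n_relations, n_observed = 83, 176, 366
total_cells = n_pairs * n_relations   # 14,608 cells in the completed matrix
density = n_observed / total_cells    # fraction of cells observed in the binary matrix


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def factorize(observed, n_rows, n_cols, rank=2, epochs=300, lr=0.1, reg=0.01, seed=0):
    """Logistic matrix factorization trained by SGD.

    Each observed cell is a positive example; for each one, a random
    unobserved cell in the same row is sampled as a negative (an assumption).
    """
    rng = random.Random(seed)
    rows = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    cols = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    unobserved = {i: [j for j in range(n_cols) if (i, j) not in observed]
                  for i in range(n_rows)}
    for _ in range(epochs):
        for i, j in sorted(observed):
            updates = [(j, 1.0)]
            if unobserved[i]:
                updates.append((rng.choice(unobserved[i]), 0.0))
            for col, target in updates:
                err = sigmoid(dot(rows[i], cols[col])) - target
                for k in range(rank):
                    r_ik, c_jk = rows[i][k], cols[col][k]
                    rows[i][k] -= lr * (err * c_jk + reg * r_ik)
                    cols[col][k] -= lr * (err * r_ik + reg * c_jk)
    return rows, cols


def complete(rows, cols):
    """Matrix completion: a real-valued score for every (row, column) cell."""
    return [[sigmoid(dot(r, c)) for c in cols] for r in rows]


# Toy matrix: 7 entity pairs x 4 relations (a made-up pattern, not the 83 x 176 matrix).
observed = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1),
            (3, 2), (3, 3), (4, 2), (4, 3), (5, 2), (5, 3), (6, 0)}
rows, cols = factorize(observed, n_rows=7, n_cols=4)
scores = complete(rows, cols)

print(f"cells: {total_cells}, observed fraction: {density:.3f}")
print("scores for entity pair 6:", [round(s, 2) for s in scores[6]])
```

In the toy data, relations 0 and 1 always co-occur, and entity pair 6 is observed only with relation 0. The score the completed matrix assigns to its unobserved relation 1 cell is the kind of predicted relation that would then go to curation, though nothing guarantees a useful ranking in a run this small.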
Moonshot: Question answering (QA) as THE Problem
“Question answering (QA) is a complex natural language processing task which requires an understanding of the meaning of a text and the ability to reason over relevant facts. Most, if not all, tasks in natural language processing can be cast as a question answering problem …”
- Ankit Kumar, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, Richard Socher. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. The 33rd International Conference on Machine Learning (ICML 2016), 2016
- Currently evaluating BERT (specifically, BioBERT)
- Transformer-style architectures seem to be unreasonably effective
- However, their efficacy with scientific and medical content is unknown
- Early results are inconclusive as reproducibility is challenging
The hidden opportunity
- People that want to do ML need lots and lots of high-quality content & data
− The more the better
- Algorithms for ML are commodities
- Data for ML is expensive
- Commissioning, authoring and curating data for ML is a new digital revenue opportunity
My working hypothesis
As machines become increasingly capable of general-purpose language understanding, the burden of effort in building machine intelligence will shift from writing software to curating content.
Scholarly publishing in the next twenty years
[Diagram: “Publishing content for people” versus “Publishing data for machines”. Labels: context, knowledge, content; model, learning architecture, data; GPU]
Stepping back: A personal perspective on AI
- We’re in an era of tremendous expectations
− The Deep Learning Era (2010-?)
- I’ve (we’ve) been here before
− The Expert Systems Era (1984-1992)
- How is it different now and what does that mean for our ambitions?
Some things are the same
 | Expert Systems Era | Deep Learning Era
Expensive hardware | Lisp Machines | Cloud TPUs
Expensive people | Knowledge engineers | Data scientists
Geostrategic calls to action | Japan | China
Some things are dramatically different
 | Expert Systems Era | Deep Learning Era
# of computers connected to the Internet | 10^2 | 10^9
FLOPS | 10^9 | 10^14
Bits/USD | 10^4 | 10^9
We’re in a world where it is much, much easier to build and field AI applications...
- A global culture of technology
- (Relative) homogeneity of hardware and software platforms and packages
- Ubiquity of networked hardware
- Open source software packages and platforms
- Open access to published results
- Increased reproducibility and transferability of results
− Though not everywhere
... but we’re beginning to see hints of trouble
- The huge strides of the last decade have been in machine perception, but not in machine reasoning
- While AI systems are cheaper to build and easier to field, they are often still brittle
− Self-driving cars and unjustified levels of trust in automated systems
− Adversarial examples
- Learning from data is yielding unanticipated consequences
− The politics and economics of dataset creation
− Biased models from biased data
- On top of this, we’re witnessing the end of the Moore’s Law era
− Progress through faster, cheaper hardware may be much slower to come
- And, oh yeah... IBM Watson
Should we be worried?
- The Expert Systems Era was perceived as a failure...
− Fielded systems were brittle, hard to maintain, and often didn’t address real customer needs
− Projected short-term benefits for businesses did not live up to the hype
- ... but was it really?
− If it works, it isn’t AI anymore: Ed Feigenbaum’s story about the Pulmonary Function Advisory Expert System
- So no, but we need to be aware that AI applications can involve hard problems that will take many business cycles to solve
− Speech recognition from 1970 to today
− Robot walking from Marc Raibert’s Leg Lab to Boston Dynamics
- Artificial General Intelligence: a problem for another generation, perhaps one in another century
No, but we should be focused and deliberate
- Always work backwards from real customer needs to define an application
− Did I mention IBM Watson?
- Deal with the challenges of data acquisition and quality first
− Be proactive with respect to fairness and privacy
- Design applications to mitigate brittleness
− Start simple
− Augmentation before automation
− Human-in-the-loop
− Alter the problem and environment to make the problem more manageable
- Leverage your differentiating strengths
− Our leadership in editorial processes for high quality trusted content
− Our deep subject matter expertise
Summary
- Knowledge graphs are a way of sharing knowledge between people and machines, and the battleground on which dominance in markets will be established
- We’re using machine learning to build knowledge graphs for use by researchers, medical professionals, and students
- There is much work underway and a lot yet to be done