Development of IBM Watson with UIMA DUCC - Eddie Epstein

Development of IBM Watson with UIMA DUCC
Eddie Epstein, eae@apache.org
Apache UIMA PMC Member and Committer
ApacheCon NA 2015

Presentation Outline
- What is DUCC
- Overview of the IBM-Jeopardy! Question-Answering System
- Interesting development problems
- Solutions embodied in DUCC
- Fast cruise through DUCC's web interface

What is DUCC
- A Linux-based cluster controller designed specifically for UIMA
- Scales out any UIMA pipeline:
  - for high throughput, or
  - for low latency
- Uses CGroups to partition user processes
- Flexible Resource Management
- Extensive Web, CLI and API interfaces

What DUCC Does
- Collection Processing Jobs
  - Scale out a UIMA pipeline into multiple threads and processes, distribute collection as work items
- Shared Services
  - Manage life cycle of services, supporting dependencies with Jobs or other Services
- Arbitrary Processes
  - Launch arbitrary singleton processes or just provide a container to work in
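
The dependency handling mentioned for Shared Services can be pictured as an ordering problem: a job or service starts only after everything it depends on is available. The following is a minimal sketch of that idea using Python's standard `graphlib`, not DUCC's actual service manager; the service and job names are hypothetical.

```python
from graphlib import TopologicalSorter


def startup_order(dependencies):
    """Return a start order in which each entry follows everything it depends on.

    `dependencies` maps a name to the set of names it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical names, for illustration only.
deps = {
    "search-service": set(),
    "scoring-service": {"search-service"},
    "experiment-job": {"search-service", "scoring-service"},
}
order = startup_order(deps)
print(order)
```

A cycle in the dependency map makes `static_order()` raise `graphlib.CycleError`, which is a reasonable point to reject a misconfigured submission.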

Motivations for DUCC
- Support Ongoing Watson Development
  - Take advantage of game playing hardware
  - Expanding development team
- Bring Functionality to Apache UIMA Community
  - Separate implementation from Watson code
  - Improve quality by targeting wide audience

Example Jeopardy Question
[Figure: the question-answering pipeline applied to the clue "IN 1698, THIS COMET DISCOVERER TOOK A SHIP CALLED THE PARAMOUR PINK ON THE FIRST PURELY SCIENTIFIC SEA VOYAGE"]
- Question Analysis: Keywords: 1698, comet, paramour, pink, …; AnswerType (comet discoverer); Date (1698); Took (discoverer, ship); Called (ship, Paramour Pink)
- Primary Search and Candidate Answer Generation
- Evidence Retrieval and Evidence Scoring (Lexical, Spatial, Temporal, Taxonomic, …), one score vector per candidate:
  - Isaac Newton [0.58 0 -1.3 … 0.97]
  - Wilhelm Tempel [0.71 1 13.4 … 0.72]
  - HMS Paramour [0.12 0 2.0 … 0.40]
  - Christiaan Huygens [0.84 1 10.6 … 0.21]
  - Halley’s Comet [0.33 0 6.3 … 0.83]
  - Edmond Halley [0.21 1 11.1 … 0.92]
  - Pink Panther [0.91 0 -8.2 … 0.61]
  - Peter Sellers [0.91 0 -1.7 … 0.60]
  - …
- Merging & Ranking: 1. Edmond Halley (0.85), 2. Christiaan Huygens (0.20), 3. Peter Sellers (0.05)

Open Source Software Critical for Watson
Runtime
- Apache UIMA
- Indri Text Search (www.lemurproject.org/indri/)
- Apache Lucene (Text Search)
- Sesame (http://aduna-software.com/technology/sesame)
- Apache ActiveMQ (used by UIMA-AS)
During Development
- Eclipse (https://eclipse.org)
- Weka (http://sourceforge.net/projects/weka/)
- Apache Hadoop

Watson’s Knowledge for Jeopardy!
- Watson has analyzed and stored the equivalent of about 1 million books (e.g., encyclopedias, dictionaries, news articles, reference texts, plays, etc.)
- Watson also uses structured sources such as WordNet and DBpedia

Watson on UIMA
[Figure: an Aggregate Analysis Engine in which a Flow Controller routes CASes through component Analysis Engines: Question Analysis, Primary Searches, Candidate Generation, Answer Scoring, Supporting Evidence Search, Deep Evidence Scoring, Final Merger]

Watson on UIMA – Data Flow
[Figure: the same Aggregate Analysis Engine (Question Analysis, Primary Searches, Candidate Generation, Answer Scoring, Supporting Evidence Search, Deep Evidence Scoring, Final Merger, Flow Controller), drawn with many more CASes between the component Analysis Engines]

Problem – One Experiment
- Average 2 hours per question
  - Wide range of times
- 28GB Java Heap on 32GB Machines
  - Large knowledge bases (e.g. Sesame in-memory store)
- ~1000 questions each
  - To get statistically relevant results
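
To make the scale concrete, the figures on this slide can be turned into a rough estimate. This is a back-of-the-envelope sketch only: it assumes perfect load balance and no startup or scheduling overhead, the cluster size of 25 machines is a hypothetical value, and 8 threads per machine is taken from the 8 cores mentioned on the following slide.

```python
def sequential_hours(questions, avg_hours_per_question):
    """Total compute time if the questions run one after another."""
    return questions * avg_hours_per_question


def ideal_parallel_hours(questions, avg_hours_per_question, machines, threads_per_machine):
    """Idealized wall-clock time: perfect balance, no overhead (an assumption)."""
    total = sequential_hours(questions, avg_hours_per_question)
    return total / (machines * threads_per_machine)


total = sequential_hours(1000, 2)                # figures from the slide
ideal = ideal_parallel_hours(1000, 2, 25, 8)     # 25 machines is hypothetical
print(total, total / 24, ideal)
```

Run sequentially, one experiment would take about 2000 hours, or roughly 83 days, which is why a single-threaded, single-machine run was not an option.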

Solution – One Experiment
- Run parallel pipelines in multiple threads
  - Share the large in-memory objects
  - Utilize the 8 cores in each machine
- Replicate processes across machines
- Dynamically feed idle threads next question
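
The pattern on this slide can be sketched in a few lines: worker threads share one large read-only object and each idle thread pulls the next question from a common queue. This is an illustration of the scheduling pattern only, not Watson or DUCC code; `run_experiment`, `answer_fn`, and the toy knowledge base are hypothetical. Watson's pipelines ran in Java, where threads execute on separate cores; in CPython, CPU-bound threads are serialized by the interpreter lock, so the sketch shows the work distribution rather than the speedup.

```python
import queue
import threading


def run_experiment(questions, knowledge_base, answer_fn, num_threads=8):
    """Answer every question with worker threads that share one knowledge base."""
    work = queue.Queue()
    for q in questions:
        work.put(q)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                q = work.get_nowait()      # an idle thread takes the next question
            except queue.Empty:
                return                     # no work left, thread exits
            answer = answer_fn(q, knowledge_base)   # shared, read-only object
            with lock:
                results[q] = answer

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Toy stand-ins for a large in-memory knowledge base and a pipeline.
kb = {"comet discoverer": "Edmond Halley"}
answers = run_experiment(
    ["comet discoverer", "unknown"],
    kb,
    lambda q, shared_kb: shared_kb.get(q, "no answer"),
)
print(answers)
```

Because questions are pulled on demand instead of being assigned up front, a thread that draws a slow question does not hold back the others, which matters given the wide range of per-question times.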

BLADE Tool (before DUCC)
[Figure: Worker Nodes, a Server, a Scheduler, and a Question List, connected by RMI and REST links]
http://domino.research.ibm.com/library/cyberdig.nsf/papers/152EF31994BDC3DC85257B1F005DE78F/$File/rc25356.pdf

UIMA DUCC - Job Model
[Figure: a Work Item Generator inspects the Collection of Input Data and passes Data Refs to replicated Analytic Pipelines, which take Raw Data and produce Analysis Results]
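
One way to picture the work item generator is as a function that inspects the collection and emits small references to the data instead of the data itself, leaving each pipeline to fetch what it needs. The sketch below assumes that a work item is a batch of document IDs; the function name, the batching, and the batch size are illustrative assumptions, not DUCC's API.

```python
def work_items(doc_ids, batch_size):
    """Yield work items, each a list of references (IDs) and never the raw data."""
    batch = []
    for doc_id in doc_ids:
        batch.append(doc_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                  # final, possibly shorter, work item
        yield batch


items = list(work_items(range(1, 8), 3))
print(items)
```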

Job Model – Core UIMA
[Figure: a Job Driver containing the Collection Reader sends QIds over HTTP to Job Processes, each running replicated CM, AE, CC pipelines. Legend: Application Code, Ducc Code]

Job Model – UIMA-AS
[Figure: the same layout as the Core UIMA job model, with a Job Driver containing the Collection Reader sending QIds over HTTP to Job Processes; here the pipelines are drawn as CM, UIMA-AS Service, CC. Legend: Application Code, Ducc Code]

Job Model – Custom
[Figure: the same layout as the Core UIMA job model, with a Job Driver containing the Collection Reader sending QIds over HTTP to Job Processes; here the pipelines are drawn as CM, Java App (Non-UIMA), CC. Legend: Application Code, Ducc Code]

Job Debugging – all_in_one
[Figure: Collection Reader and Job “processing” Code side by side in one process. Legend: Application Code, Ducc Code]
All Job code deployed in a single thread in a single process for development & debug

Problem – 15 Researchers
- Personnel evaluated by their contribution to overall accuracy
  - With exceptions, e.g. reduce “stupid answers”
- Wanted their resource “fair share” NOW

Solution – 15 Researchers
- Preempt running processes
  - Kill processes with least CPU investment
  - < 10% overhead for lost investment
- Ramp up after successful initialization
  - Saved more than preemption loses
- Allow processes to be non-preemptable
  - Reserve entire machines
  - Singleton processes (in CGroup containers)
  - Jobs
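
The preemption rule on this slide, kill the processes with the least CPU investment and never touch non-preemptable ones, can be sketched as a simple selection. This is an illustration under stated assumptions, not DUCC's resource manager: the `Proc` record, its fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    cpu_seconds: float         # CPU time invested so far
    preemptable: bool = True


def pick_victims(processes, count):
    """Choose up to `count` preemptable processes with the least CPU time invested."""
    candidates = [p for p in processes if p.preemptable]
    candidates.sort(key=lambda p: p.cpu_seconds)
    return candidates[:count]


procs = [
    Proc(1, 5400.0),
    Proc(2, 120.0),
    Proc(3, 30.0, preemptable=False),   # e.g. a reservation: never chosen
    Proc(4, 900.0),
]
victim_pids = [p.pid for p in pick_victims(procs, 2)]
print(victim_pids)
```

Choosing the youngest work first keeps the lost investment small, which is consistent with the overhead of under 10% reported on the slide.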

Watson on a 32GB Machine?
[Figure: the Aggregate Analysis Engine from "Watson on UIMA", with its Flow Controller and component Analysis Engines: Question Analysis, Primary Searches, Candidate Generation, Answer Scoring, Supporting Evidence Search, Deep Evidence Scoring, Final Merger]
No, from the start some UIMA components were shared UIMA-AS services

Performance Bottleneck (Development Mode)
[Figure: 32GB Machines, each running a ~30 GB JVM with JNI alongside its File system Buffers, all reading a 50 GB Search Index from an NFS Filesystem]

Services Improve Performance
[Figure: Indri Search instances deployed as a Shared UIMA-AS Service on 48GB Machines, with File system Buffers for the 50 GB Search Index on the NFS Filesystem; the 32GB Machines continue to run ~30 GB JVMs with JNI]

Problem – Managing Services
- Startup and number of instances manual
- Team had ~3 week sprints
  - Integrate changes and create new baseline
  - New indexes or code meant new services
  - Several baselines active concurrently
