
Development of IBM Watson with UIMA DUCC (Eddie Epstein)



  1. Development of IBM Watson with UIMA DUCC
     Eddie Epstein, eae@apache.org
     Apache UIMA PMC Member and Committer
     ApacheCon NA 2015

  2. Presentation Outline
     - What is DUCC
     - Overview of the IBM-Jeopardy! Question-Answering System
     - Interesting development problems
     - Solutions embodied in DUCC
     - Fast cruise through DUCC's web interface

  3. What is DUCC
     - A Linux-based cluster controller designed specifically for UIMA
     - Scales out any UIMA pipeline:
       - for high throughput, or
       - for low latency
     - Uses CGroups to partition user processes
     - Flexible Resource Management
     - Extensive Web, CLI and API interfaces

  4. What DUCC Does
     - Collection Processing Jobs
       - Scale out a UIMA pipeline into multiple threads and processes, distribute collection as work items
     - Shared Services
       - Manage life cycle of services, supporting dependencies with Jobs or other Services
     - Arbitrary Processes
       - Launch arbitrary singleton processes or just provide a container to work in

  5. Motivations for DUCC
     - Support Ongoing Watson Development
       - Take advantage of game playing hardware
       - Expanding development team
     - Bring Functionality to Apache UIMA Community
       - Separate implementation from Watson code
       - Improve quality by targeting wide audience

  6. Example Jeopardy Question
     Clue: IN 1698, THIS COMET DISCOVERER TOOK A SHIP CALLED THE PARAMOUR PINK ON THE FIRST PURELY SCIENTIFIC SEA VOYAGE
     - Question Analysis
       - Keywords: 1698, comet, paramour, pink, …
       - AnswerType (comet discoverer)
       - Date (1698)
       - Took (discoverer, ship)
       - Called (ship, Paramour Pink)
       - …
     - Primary Search
     - Candidate Answer Generation
     - Evidence Retrieval
     - Evidence Scoring (Taxonomic, Lexical, Spatial, Temporal, …), one feature vector per candidate:
       - Isaac Newton [0.58 0 -1.3 … 0.97]
       - Wilhelm Tempel [0.71 1 13.4 … 0.72]
       - HMS Paramour [0.12 0 2.0 … 0.40]
       - Christiaan Huygens [0.84 1 10.6 … 0.21]
       - Halley’s Comet [0.33 0 6.3 … 0.83]
       - Edmond Halley [0.21 1 11.1 … 0.92]
       - Pink Panther [0.91 0 -8.2 … 0.61]
       - Peter Sellers [0.91 0 -1.7 … 0.60]
       - …
     - Merging & Ranking
       - 1. Edmond Halley (0.85)
       - 2. Christiaan Huygens (0.20)
       - 3. Peter Sellers (0.05)

  7. Open Source Software Critical for Watson
     Runtime
     - Apache UIMA
     - Indri Text Search (www.lemurproject.org/indri/)
     - Apache Lucene (Text Search)
     - Sesame (http://aduna-software.com/technology/sesame)
     - Apache ActiveMQ (used by UIMA-AS)
     During Development
     - Eclipse (https://eclipse.org)
     - Weka (http://sourceforge.net/projects/weka/)
     - Apache Hadoop

  8. Watson’s Knowledge for Jeopardy!
     - Watson has analyzed and stored the equivalent of about 1 million books (e.g., encyclopedias, dictionaries, news articles, reference texts, plays, etc.)
     - Watson also uses structured sources such as WordNet and DBpedia

  9. Watson on UIMA
     [Diagram: an Aggregate Analysis Engine containing Analysis Engines for Question Analysis, Primary Searches, Candidate Generation, Answer Scoring, Supporting Evidence Search, Deep Evidence Scoring, and Final Merger. CASes pass between the engines, routed by a Flow Controller.]

  10. Watson on UIMA – Data Flow
     [Diagram: the same Aggregate Analysis Engine (Question Analysis, Primary Searches, Candidate Generation, Answer Scoring, Supporting Evidence Search, Deep Evidence Scoring, Final Merger, Flow Controller), now drawn with many more CASes flowing between the later stages than in the previous slide.]

  11. Problem – One Experiment
     - Average 2 hours per question
       - Wide range of times
     - 28GB Java Heap on 32GB Machines
       - Large knowledge bases (e.g. Sesame in-memory store)
     - ~1000 questions each
       - To get statistically relevant results
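
The figures on this slide imply a large compute cost per experiment. A back-of-envelope sketch, assuming ideal linear scaling and no initialization cost; the 8 threads per machine come from the next slide, while the machine count of 50 is purely hypothetical:

```python
def experiment_hours(questions, hours_per_question,
                     threads_per_machine=1, machines=1):
    """Ideal wall-clock hours for one experiment.

    Assumes perfect scaling and ignores startup time, so the result
    is a lower bound rather than a measurement.
    """
    total_cpu_hours = questions * hours_per_question
    return total_cpu_hours / (threads_per_machine * machines)


serial = experiment_hours(1000, 2)                               # one thread
one_machine = experiment_hours(1000, 2, threads_per_machine=8)   # 8 threads
cluster = experiment_hours(1000, 2, threads_per_machine=8,
                           machines=50)                          # hypothetical cluster

print(serial, one_machine, cluster)
```

Run serially, ~1000 questions at 2 hours each is about 2000 hours, which is why a single experiment needed to be spread over many threads and machines.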

  12. Solution – One Experiment
     - Run parallel pipelines in multiple threads
     - Share the large in-memory objects
     - Utilize the 8 cores in each machine
     - Replicate processes across machines
     - Dynamically feed idle threads next question
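
The pattern on this slide (several pipeline threads sharing one large in-memory object and pulling the next question as soon as they go idle) can be sketched as follows. This is an illustration only, not DUCC or Watson code; `run_experiment`, `answer_fn`, and `shared_kb` are hypothetical names. In CPython the threads would not give true CPU parallelism because of the global interpreter lock, so the sketch shows the scheduling shape rather than the speedup.

```python
import queue
import threading


def run_experiment(question_ids, answer_fn, shared_kb, num_threads=8):
    """Process every question with a pool of threads sharing one knowledge base."""
    work = queue.Queue()
    for qid in question_ids:
        work.put(qid)

    results = {}
    results_lock = threading.Lock()

    def worker():
        # Each idle thread pulls the next question; slow questions
        # do not hold up the others.
        while True:
            try:
                qid = work.get_nowait()
            except queue.Empty:
                return
            answer = answer_fn(qid, shared_kb)  # shared_kb is read, never copied
            with results_lock:
                results[qid] = answer

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because work is pulled rather than pre-assigned, the wide range of per-question times noted earlier does not leave some threads idle while others are overloaded.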

  13. BLADE Tool (before DUCC)
     [Diagram: a Question List, a Scheduler, and a Server dispatching work to multiple Worker Nodes; the connections are labeled REST and RMI.]
     http://domino.research.ibm.com/library/cyberdig.nsf/papers/152EF31994BDC3DC85257B1F005DE78F/$File/rc25356.pdf

  14. UIMA DUCC - Job Model
     [Diagram: a Collection of Input Data (Raw Data) is inspected by a Work Item Generator, which sends Data Ref’s to several parallel Analytic Pipelines; the pipelines read the Raw Data and produce Analysis Results.]

  15. Job Model – Core UIMA
     [Diagram: a Job consists of a Job Driver, which runs the Collection Reader and hands out QIds over HTTP, and multiple Job Processes, each running several pipelines of CM, AE, and CC. The legend distinguishes Application Code from Ducc Code.]
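
The division of labor in this job model can be sketched in a few functions. This is a simplified, single-process illustration under stated assumptions, not DUCC's API: CM, AE, and CC are read here in their usual UIMA sense (CAS Multiplier, Analysis Engine, CAS Consumer), the function names and the dictionary standing in for a CAS are hypothetical, and the HTTP hand-off between driver and processes is replaced by a plain loop.

```python
def collection_reader(question_ids):
    """Job Driver side: emit small work items that hold only references."""
    for qid in question_ids:
        yield {"qid": qid}


def cas_multiplier(work_item, store):
    """Job Process side: resolve the reference into the data to analyze."""
    qid = work_item["qid"]
    yield {"qid": qid, "text": store[qid]}


def analysis_engine(cas):
    """Stand-in for the analytic pipeline: annotate the CAS-like dict."""
    cas["tokens"] = cas["text"].split()
    return cas


def cas_consumer(cas, sink):
    """Persist a result for the work item."""
    sink[cas["qid"]] = len(cas["tokens"])


def run_job(question_ids, store):
    """Drive every work item through CM, AE, and CC in one process."""
    sink = {}
    for item in collection_reader(question_ids):
        for cas in cas_multiplier(item, store):
            cas_consumer(analysis_engine(cas), sink)
    return sink
```

Keeping the work items as references means the driver stays lightweight while the job processes do the heavy reading and analysis.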

  16. Job Model – UIMA-AS
     [Diagram: same layout as the Core UIMA job (Job Driver with Collection Reader, QIds over HTTP, Job Processes with CM, AE, and CC pipelines), with a UIMA-AS Service shown inside the Job Processes. The legend distinguishes Application Code from Ducc Code.]

  17. Job Model – Custom
     [Diagram: same layout as the Core UIMA job (Job Driver with Collection Reader, QIds over HTTP, Job Processes with CM, AE, and CC pipelines), with a Java App (Non-UIMA) shown inside the Job Processes. The legend distinguishes Application Code from Ducc Code.]

  18. Job Debugging – all_in_one
     [Diagram: the Collection Reader and the Job “processing” Code side by side in one process. The legend distinguishes Application Code from Ducc Code.]
     All Job code deployed in a single thread in a single process for development & debug

  19. Problem – 15 Researchers
     - Personnel evaluated by their contribution to overall accuracy
       - With exceptions, e.g. reduce “stupid answers”
     - Wanted their resource “fair share” NOW

  20. Solution – 15 Researchers
     - Preempt running processes
       - Kill processes with least CPU investment
       - < 10% overhead for lost investment
     - Ramp up after successful initialization
       - Saved more than preemption loses
     - Allow processes to be non-preemptable
       - Reserve entire machines
       - Singleton processes (in CGroup containers)
       - Jobs
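
The victim-selection rule on this slide (kill the processes with the least CPU investment, and never touch non-preemptable ones) can be sketched as below. This is a minimal illustration of the rule as stated, not DUCC's actual scheduler; the `Proc` record, its fields, and `choose_victims` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    owner: str
    cpu_seconds: float       # CPU time already invested in this process
    preemptable: bool = True


def choose_victims(procs, owner, count):
    """Pick up to `count` of an over-share owner's processes to preempt.

    Processes with the least CPU time go first, so the least work is lost.
    Non-preemptable processes are never selected.
    """
    candidates = [p for p in procs if p.owner == owner and p.preemptable]
    candidates.sort(key=lambda p: p.cpu_seconds)
    return candidates[:count]
```

For example, given an owner with processes at 120 s, 30 s, and a non-preemptable one at 5 s, asking for two victims returns the 30 s process followed by the 120 s one.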

  21. Watson on a 32GB Machine?
     [Diagram: the Aggregate Analysis Engine from the earlier slides (Question Analysis, Primary Searches, Candidate Generation, Answer Scoring, Supporting Evidence Search, Deep Evidence Scoring, Final Merger, Flow Controller).]
     No, from the start some UIMA components were shared UIMA-AS services

  22. Performance Bottleneck (Development Mode)
     [Diagram: a 50 GB Search Index on an NFS Filesystem is read by several 32GB Machines. Each machine runs a ~30 GB JVM with JNI, leaving little room for File system Buffers.]

  23. Services Improve Performance
     [Diagram: Indri Search runs as a Shared UIMA-AS Service on 48GB Machines, where File system Buffers can hold the 50 GB Search Index from the NFS Filesystem. The 32GB Machines continue to run ~30 GB JVMs with JNI.]

  24. Problem – Managing Services
     - Startup and number of instances manual
     - Team had ~3 week sprints
       - Integrate changes and create new baseline
     - New indexes or code meant new services
     - Several baselines active concurrently
