Development of IBM Watson with UIMA DUCC Eddie Epstein - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Development of IBM Watson with UIMA DUCC

Eddie Epstein eae@apache.org Apache UIMA PMC Member and Committer ApacheCon NA 2015

slide-2
SLIDE 2

Presentation Outline

• What is DUCC
• Overview of the IBM-Jeopardy! Question-Answering System
• Interesting development problems
• Solutions embodied in DUCC
• Fast cruise through DUCC's web interface

slide-3
SLIDE 3

What is DUCC

• A Linux-based cluster controller designed specifically for UIMA
• Scales out any UIMA pipeline:
  • for high throughput, or
  • for low latency
• Uses CGroups to partition user processes
• Flexible Resource Management
• Extensive Web, CLI and API interfaces

slide-4
SLIDE 4

What DUCC Does

• Collection Processing Jobs
  • Scale out a UIMA pipeline into multiple threads and processes; distribute the collection as work items
• Shared Services
  • Manage the life cycle of services, supporting dependencies with Jobs or other Services
• Arbitrary Processes
  • Launch arbitrary singleton processes, or just provide a container to work in

slide-5
SLIDE 5

Motivations for DUCC

• Support Ongoing Watson Development
  • Take advantage of game playing hardware
  • Expanding development team
• Bring Functionality to Apache UIMA Community
  • Separate implementation from Watson code
  • Improve quality by targeting wide audience

slide-6
SLIDE 6

Example Jeopardy Question

Question: IN 1698, THIS COMET DISCOVERER TOOK A SHIP CALLED THE PARAMOUR PINK ON THE FIRST PURELY SCIENTIFIC SEA VOYAGE

Question Analysis
• Keywords: 1698, comet, paramour, pink, …
• AnswerType(comet discoverer)
• Date(1698)
• Took(discoverer, ship)
• Called(ship, Paramour Pink)
• …

Primary Search and Candidate Answer Generation
• Wilhelm Tempel, HMS Paramour, Isaac Newton, Halley’s Comet, Pink Panther, Christiaan Huygens, Peter Sellers, Edmond Halley

Evidence Retrieval and Evidence Scoring (Spatial, Temporal, Lexical, Taxonomic)
• [0.58 0 -1.3 … 0.97]
• [0.71 1 13.4 … 0.72]
• [0.12 0 2.0 … 0.40]
• [0.84 1 10.6 … 0.21]
• [0.33 0 6.3 … 0.83]
• [0.21 1 11.1 … 0.92]
• [0.91 0 -8.2 … 0.61]
• [0.91 0 -1.7 … 0.60]

Merging & Ranking
  • 1. Edmond Halley (0.85)
  • 2. Christiaan Huygens (0.20)
  • 3. Peter Sellers (0.05)

slide-7
SLIDE 7

Open Source Software Critical for Watson

Runtime
• Apache UIMA
• Indri Text Search (www.lemurproject.org/indri/)
• Apache Lucene (Text Search)
• Sesame (http://aduna-software.com/technology/sesame)
• Apache ActiveMQ (used by UIMA-AS)

During Development
• Eclipse (https://eclipse.org)
• Weka (http://sourceforge.net/projects/weka/)
• Apache Hadoop

slide-8
SLIDE 8

Watson’s Knowledge for Jeopardy!

Watson has analyzed and stored the equivalent of about 1 million books (e.g., encyclopedias, dictionaries, news articles, reference texts, plays).

Watson also uses structured sources such as WordNet and DBpedia.

slide-9
SLIDE 9

Watson on UIMA

[Figure: an Aggregate Analysis Engine with a Flow Controller. CASes pass between the component Analysis Engines: Question Analysis, Primary Searches, Candidate Generation, Answer Scoring, Supporting Evidence Search, Deep Evidence Scoring, Final Merger]

slide-10
SLIDE 10

Watson on UIMA – Data Flow

[Figure: an Aggregate Analysis Engine with a Flow Controller, shown with many CASes in flight between the component Analysis Engines: Question Analysis, Primary Searches, Candidate Generation, Answer Scoring, Supporting Evidence Search, Deep Evidence Scoring, Final Merger]
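The aggregate-plus-flow-controller arrangement can be illustrated with a toy model. This is only a sketch and does not use the UIMA API: `Cas`, `make_engine`, `fixed_flow` and `run_aggregate` are hypothetical names, the engines are placeholders, and the linear routing simply follows the order in which the stages are listed on the slide, which is an assumption about the real flow.

```python
from dataclasses import dataclass, field


@dataclass
class Cas:
    """Toy stand-in for a UIMA CAS: the question plus accumulated results."""
    question: str
    annotations: dict = field(default_factory=dict)
    visited: list = field(default_factory=list)


def make_engine(name):
    """Build a placeholder analysis engine that only records that it ran."""
    def engine(cas):
        cas.annotations[name] = f"output of {name}"
    return engine


# Stage names taken from the slide; the order is assumed.
STAGES = [
    "Question Analysis", "Primary Searches", "Candidate Generation",
    "Answer Scoring", "Supporting Evidence Search",
    "Deep Evidence Scoring", "Final Merger",
]
ENGINES = {name: make_engine(name) for name in STAGES}


def fixed_flow(cas):
    """Flow controller: return the next engine name, or None when done."""
    remaining = [s for s in STAGES if s not in cas.visited]
    return remaining[0] if remaining else None


def run_aggregate(cas, flow=fixed_flow):
    """Route one CAS through the engines chosen by the flow controller."""
    while (step := flow(cas)) is not None:
        ENGINES[step](cas)
        cas.visited.append(step)
    return cas


cas = run_aggregate(Cas("IN 1698, THIS COMET DISCOVERER ..."))
print(cas.visited)
```

Keeping the routing decision in a separate function is the point of the sketch: a different `flow` callable could skip or reorder stages without touching the engines. The sketch does not model engines that emit additional CASes, which is what the many CASes on this slide suggest.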

slide-11
SLIDE 11

Problem – One Experiment

• Average 2 hours per question
  • Wide range of times
• 28GB Java Heap on 32GB Machines
  • Large knowledge bases (e.g. Sesame in-memory store)
• ~1000 questions each
  • To get statistically relevant results
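A quick back-of-the-envelope calculation shows why these numbers are a problem. The sketch below uses the slide's two figures (2 hours per question, about 1000 questions) and assumes ideal linear scaling; the 25-machine cluster size is a made-up illustration, not a number from the talk.

```python
# Back-of-the-envelope cost of one experiment, using the slide's figures.
HOURS_PER_QUESTION = 2   # average from the slide
QUESTIONS = 1000         # approximate size of one experiment


def wall_clock_hours(machines: int, threads_per_machine: int) -> float:
    """Ideal wall-clock hours, assuming perfect linear scaling (an assumption)."""
    return HOURS_PER_QUESTION * QUESTIONS / (machines * threads_per_machine)


serial_hours = wall_clock_hours(1, 1)
print(f"serial: {serial_hours:.0f} h (about {serial_hours / 24:.0f} days)")

# 8 threads per machine mirrors the 8-core machines mentioned in the talk;
# 25 machines is a hypothetical cluster size chosen for illustration.
print(f"25 machines x 8 threads: {wall_clock_hours(25, 8):.0f} h")
```

Run serially, one experiment would take about 2,000 hours, roughly 83 days. Because per-question times vary widely, real scaling would fall short of the ideal figure computed here.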

slide-12
SLIDE 12

Solution – One Experiment

• Run parallel pipelines in multiple threads
  • Share the large in-memory objects
  • Utilize the 8 cores in each machine
• Replicate processes across machines
• Dynamically feed idle threads next question
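The single-process part of this solution can be sketched as worker threads that share one large read-only object and pull the next question whenever they become idle. This is a minimal illustration, not Watson or DUCC code: `run_experiment`, `lookup` and the tiny `kb` dictionary are hypothetical, and replication across machines is not shown.

```python
import queue
import threading


def run_experiment(questions, knowledge_base, answer_fn, num_threads=8):
    """Answer every question with worker threads sharing one knowledge base."""
    work = queue.Queue()
    for q in questions:
        work.put(q)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                q = work.get_nowait()  # an idle thread pulls the next question
            except queue.Empty:
                return                 # no work left, thread exits
            answer = answer_fn(q, knowledge_base)  # shared, read-only object
            with lock:
                results[q] = answer

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Toy usage: the "knowledge base" is loaded once and shared by all threads.
kb = {"comet discoverer": "Edmond Halley"}


def lookup(question, knowledge_base):
    return knowledge_base.get(question, "unknown")


answers = run_experiment(["comet discoverer", "pink film star"], kb, lookup,
                         num_threads=2)
print(answers)
```

Pulling from a queue, rather than pre-assigning questions to threads, keeps all threads busy even when per-question times vary widely. In CPython the global interpreter lock limits CPU-bound parallelism, so this shows the work-distribution pattern only; JVM threads do run in parallel across cores.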

slide-13
SLIDE 13

BLADE Tool (before DUCC)

http://domino.research.ibm.com/library/cyberdig.nsf/papers/152EF31994BDC3DC85257B1F005DE78F/$File/rc25356.pdf

[Figure: a Scheduler Server holding the Question List, connected to multiple Worker Nodes; the links are labeled RMI and REST]

slide-14
SLIDE 14

UIMA DUCC - Job Model

[Figure: a Work Item Generator inspects the Collection of Input Data and passes data references ("Data Ref's") to three parallel Analytic Pipelines; the pipelines read the Raw Data and produce Analysis Results]

slide-15
SLIDE 15

Job Model – Core UIMA Job

[Figure: the Job Driver runs the Collection Reader and sends QIds over HTTP to the Job Processes, each running pipelines of CM, AE and CC components. Legend: Application Code, Ducc Code]

slide-16
SLIDE 16

Job Model – UIMA-AS Job

[Figure: the Job Driver runs the Collection Reader and sends QIds over HTTP to the Job Processes, shown with CM, AE and CC pipelines and a UIMA-AS Service. Legend: Application Code, Ducc Code]

slide-17
SLIDE 17

Job Model – Custom Job

[Figure: the Job Driver runs the Collection Reader and sends QIds over HTTP to the Job Processes, shown with CM, AE and CC pipelines and a Java App (Non-UIMA). Legend: Application Code, Ducc Code]

slide-18
SLIDE 18

Job Debugging – all_in_one

[Figure: the Collection Reader and the Job “processing” Code side by side. Legend: Application Code, Ducc Code]

All Job code is deployed in a single thread in a single process for development and debugging.

slide-19
SLIDE 19

Problem – 15 Researchers

• Personnel evaluated by their contribution to overall accuracy
  • With exceptions, e.g. reduce “stupid answers”
• Wanted their resource “fair share” NOW

slide-20
SLIDE 20

Solution – 15 Researchers

• Preempt running processes
  • Kill processes with least CPU investment
  • < 10% overhead for lost investment
• Ramp up after successful initialization
  • Saved more than preemption loses
• Allow processes to be non-preemptable
  • Reserve entire machines
  • Singleton processes (in CGroup containers)
  • Jobs
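The "least CPU investment" rule can be sketched as a simple selection over a user's preemptable processes. This is an illustration of the idea only; the slides do not describe DUCC's actual scheduler logic, and `Proc`, `pick_victims` and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    owner: str
    cpu_seconds: float        # work that is lost if the process is killed
    preemptable: bool = True  # False models a non-preemptable allocation


def pick_victims(procs, owner, count):
    """Pick `count` of `owner`'s preemptable processes with the least CPU invested."""
    candidates = [p for p in procs if p.owner == owner and p.preemptable]
    candidates.sort(key=lambda p: p.cpu_seconds)
    return candidates[:count]


procs = [
    Proc(1, "alice", 5400.0),
    Proc(2, "alice", 120.0),
    Proc(3, "alice", 900.0),
    Proc(4, "alice", 30.0, preemptable=False),  # never selected
]
victims = pick_victims(procs, "alice", 2)
print([p.pid for p in victims])
```

Killing the processes that have run the shortest time discards the least completed work, which is how the lost investment can stay small.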

slide-21
SLIDE 21

Watson on a 32GB Machine?

[Figure: the Aggregate Analysis Engine with its Flow Controller and component Analysis Engines: Question Analysis, Primary Searches, Candidate Generation, Answer Scoring, Supporting Evidence Search, Deep Evidence Scoring, Final Merger]

No. From the start, some UIMA components were shared UIMA-AS services.

slide-22
SLIDE 22

Performance Bottleneck (Development Mode)

[Figure: four 32GB machines, each running a ~30 GB JVM (most with JNI) alongside its file system buffers, all accessing a 50 GB Search Index on an NFS Filesystem]

slide-23
SLIDE 23

Services Improve Performance

[Figure: the 32GB machines still run ~30 GB JVMs with JNI; Indri Search runs as a Shared UIMA-AS Service on 48GB machines with their own file system buffers, in front of the 50 GB Search Index on the NFS Filesystem]

slide-24
SLIDE 24

Problem – Managing Services

• Startup and number of instances manual
• Team had ~3 week sprints
  • Integrate changes and create new baseline
  • New indexes or code meant new services
  • Several baselines active concurrently

slide-25
SLIDE 25

DUCC Services

• Service registry
• UIMA-AS or CUSTOM
  • Service “pinger” class required
  • Built-in pinger for UIMA-AS
• Always-on or start-on-demand
• Pinger interface supports autonomous instance management
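The pinger idea can be sketched as a small contract that a service manager calls periodically: a health probe plus an optional request for more or fewer instances. This is not DUCC's real (Java) pinger interface; `ServicePinger`, `QueueDepthPinger`, `manage` and all of their parameters are hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod


class ServicePinger(ABC):
    """Hypothetical pinger contract; not DUCC's actual interface."""

    @abstractmethod
    def ping(self) -> bool:
        """Return True if the service answered a health probe."""

    def desired_instances(self, current: int) -> int:
        """Optionally ask the manager for more or fewer instances."""
        return current


class QueueDepthPinger(ServicePinger):
    """Example pinger that scales on a queue-depth reading from a callable."""

    def __init__(self, read_depth, per_instance=10, max_instances=8):
        self.read_depth = read_depth
        self.per_instance = per_instance
        self.max_instances = max_instances

    def ping(self) -> bool:
        try:
            return self.read_depth() >= 0
        except Exception:
            return False  # the probe failed, report the service as down

    def desired_instances(self, current: int) -> int:
        depth = self.read_depth()
        wanted = -(-depth // self.per_instance)  # ceiling division
        return max(1, min(self.max_instances, wanted))


def manage(pinger, current):
    """One monitoring cycle of a toy service manager."""
    if not pinger.ping():
        return "restart", current
    return "ok", pinger.desired_instances(current)


print(manage(QueueDepthPinger(lambda: 35), current=1))
```

Putting the scaling decision behind the pinger lets each service supply its own notion of load, while the manager stays generic. That is one way to read "autonomous instance management" on the slide.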

slide-26
SLIDE 26

DUCC Service

[Figure: the Service Manager instantiates and queries the Service Pinger, and instantiates and monitors the Service Process that runs the Service Code. Legend: Application Code, DUCC Code]

slide-27
SLIDE 27

DUCC – Node Visualization

slide-28
SLIDE 28

DUCC Web Demo

slide-29
SLIDE 29

Backup if no Demo

slide-30
SLIDE 30

DUCC Viz Page

slide-31
SLIDE 31

DUCC Job Page

slide-32
SLIDE 32

Job Details - Processes

slide-33
SLIDE 33

Job Details - Performance

slide-34
SLIDE 34

DUCC Service Page

slide-35
SLIDE 35

DUCC Reservation Page

slide-36
SLIDE 36

Thank You