A Scalable Architecture for Extracting, Aligning, Linking, and Visualizing Multi-Int Data
Craig Knoblock & Pedro Szekely
University of Southern California
Introduction
- Massive quantities of data available for
analysis
– OSINT, HUMINT, SIGINT, MASINT, GEOINT, …
- Data is spread across multiple sources,
multiple sites and multiple formats
– Databases, text, web sites, XML, JSON, etc.
- If an analyst could exploit all of this data, it
could transform analysis
– Disruptive technology for analysis
Solution: Domain-specific Insight Graphs
- Innovative architecture
– Extracting, aligning, linking, and visualizing massive amounts of data
– Domain-specific content from structured and unstructured sources
- State-of-the art open source software
– Open architecture with flexible APIs
– Cloud-based infrastructure (HDFS, Hadoop, ElasticSearch, etc.)
Example Scenario
- Want to determine the nuclear know-how of a
given country from open source data
- Analyze the universities, academics, publications,
reports, articles within the country
Scenario Results
- Exploit the data
available from
– Web pages, publications, articles, etc.
- Produce a
knowledge graph
– Key people and connections
– Technical capabilities and how they have changed over time
DIG Pipeline
- Crawling
- Extracting
- Cleaning
- Integration
- Computing similarity
- Entity resolution
- Graph construction
- Query, analysis, and visualization
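The stages listed above can be sketched as a composed pipeline. This is a hypothetical illustration of the data flow only; the function names and record shapes are invented, not DIG's actual API.

```python
# Hypothetical sketch of the DIG pipeline as composed stages.
# Stage names mirror the slide; the bodies are placeholders.

def crawl(seeds):
    # In DIG, crawling is done with Apache Nutch; here we just echo pages.
    return [{"url": u, "html": "<html>...</html>"} for u in seeds]

def extract(pages):
    # Produce a structured record per page (fields would come from extractors).
    return [{"url": p["url"], "fields": {}} for p in pages]

def clean(records):
    # Normalization would happen here.
    return records

def integrate(records):
    # Schema alignment (Karma) would happen here.
    return records

def run_pipeline(seeds):
    # crawl -> extract -> clean -> integrate; similarity, entity
    # resolution, and graph construction would follow downstream.
    return integrate(clean(extract(crawl(seeds))))

graph_input = run_pipeline(["http://example.org/a"])
```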
Crawling
- Challenge: how to crawl just the relevant pages
- Approach:
– Uses the Apache Nutch framework for Web pages
– Uses Karma to extract pages from the deep Web
Extracting
- Need to produce a structured
representation for indexing and linking
- Highly configurable
architecture for extractors
– Learning of landmark extractors for structured data
– Trainable CRF-based extractors for unstructured data
– Uses Mechanical Turk to crowdsource training data
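A landmark extractor pulls a field value out of semi-structured HTML using "before" and "after" landmark strings. The minimal sketch below hand-writes the landmarks for illustration; in DIG these are learned automatically, and the page and function names here are invented.

```python
# Hypothetical sketch of a landmark extractor: return the text between a
# learned prefix landmark and the following suffix landmark.

def landmark_extract(html, prefix, suffix):
    """Return the text between the first prefix and the following suffix."""
    start = html.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = html.find(suffix, start)
    if end == -1:
        return None
    return html[start:end].strip()

page = '<div class="author">Jane Doe</div>'
author = landmark_extract(page, '<div class="author">', '</div>')
# author == "Jane Doe"
```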
Cleaning
- Cleaning and normalization to support
analysis and linking
– Visualization showing data distribution
– Learned transformations from examples
– Cleaning programs written in Python
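The slide notes that cleaning programs are written in Python; a minimal example of such a normalization step is sketched below. The rule set (US phone numbers, digits-only canonical form) is invented for illustration.

```python
# Illustrative cleaning/normalization rule in plain Python.
import re

def normalize_phone(raw):
    """Strip punctuation and normalize a US phone number to 10 digits."""
    digits = re.sub(r"\D", "", raw)
    # Drop a leading country code "1" so all numbers end up as 10 digits.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None

a = normalize_phone("(213) 740-4502")    # -> "2137404502"
b = normalize_phone("+1 213.740.4502")   # -> "2137404502"
```

Normalizing variant spellings to one canonical form is what makes the later similarity and linking steps reliable.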
Integration
- Need to align the data across extracted data and
structured sources
- Performed using a data integration tool called Karma
- Karma maps arbitrary sources into a shared domain vocabulary
(schema alignment)
- Uses machine learning to minimize user effort
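The core of schema alignment is renaming source-specific fields into a shared domain vocabulary. The sketch below shows that idea with a hand-made mapping; Karma learns such mappings semi-automatically, and the column names and vocabulary terms here are invented.

```python
# Hypothetical sketch of schema alignment: map source-specific column
# names into a shared vocabulary, as Karma does against an ontology.

MAPPING = {
    "auth_name": "schema:name",
    "affil": "schema:affiliation",
    "pub_title": "schema:headline",
}

def align(record, mapping):
    """Rename a source record's fields into the shared vocabulary."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}

row = {"auth_name": "J. Smith", "affil": "Example University", "extra": 1}
aligned = align(row, MAPPING)
# aligned == {"schema:name": "J. Smith",
#             "schema:affiliation": "Example University"}
```

Once every source is expressed in the same vocabulary, records from different sites can be compared field by field.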
Integration Using Karma
Similarity
- Computes similarity
across text fields and images
– Image similarity done using DeepSentiBank
– Text similarity done using Minhash/LSH
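The MinHash idea named above can be shown in a few lines: hash every character shingle of a text under many seeds, keep the minimum per seed, and estimate Jaccard similarity from the fraction of matching signature positions. This tiny pure-Python version is for illustration only, not DIG's implementation (which would also bucket signatures with LSH to avoid pairwise comparison).

```python
# Minimal MinHash sketch for text similarity.
import hashlib

def shingles(text, k=3):
    """Set of k-character shingles of the lowercased text."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_hashes=64):
    """Signature: for each seed, the minimum hash over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles("nuclear engineering department"))
b = minhash(shingles("nuclear engineering dept"))
c = minhash(shingles("completely unrelated text"))
# estimate_jaccard(a, b) is high; estimate_jaccard(a, c) is near zero
```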
Entity Resolution
- Finds matching entities
- Reference source
– Match against the reference source to disambiguate entities
– E.g., GeoNames for locations
- No reference source
– Combine entities by considering the similarity across multiple fields
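The "no reference source" case above can be sketched as a weighted combination of per-field string similarities. The weights, fields, and threshold-free score below are invented for illustration; DIG's actual matcher is more sophisticated.

```python
# Illustrative entity resolution without a reference source: combine
# similarity across several fields into one match score.
from difflib import SequenceMatcher

def field_sim(a, b):
    """Crude string similarity in [0, 1] via difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b, weights):
    """Weighted average of per-field similarities."""
    total = sum(weights.values())
    return sum(
        w * field_sim(rec_a.get(f, ""), rec_b.get(f, ""))
        for f, w in weights.items()) / total

weights = {"name": 0.6, "affiliation": 0.4}
p1 = {"name": "J. Smith", "affiliation": "Univ. of Southern California"}
p2 = {"name": "John Smith", "affiliation": "University of Southern California"}
p3 = {"name": "A. Jones", "affiliation": "Example Institute"}
# match_score(p1, p2, weights) is much higher than match_score(p1, p3, weights)
```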
Graph Construction
- Data is integrated into a graph that can be queried and analyzed
– Data stored in HDFS
– Data represented in a common language, JSON-LD
– Represented using a common terminology
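A hand-made example of the kind of JSON-LD node such a graph stores is shown below. The vocabulary, identifiers, and values are illustrative, not DIG's actual schema; each node serializes to one JSON line, which suits line-oriented storage in HDFS.

```python
# Illustrative JSON-LD node for a knowledge graph.
import json

node = {
    "@context": {"schema": "http://schema.org/"},
    "@id": "http://example.org/person/42",
    "@type": "schema:Person",
    "schema:name": "Jane Doe",
    "schema:affiliation": {"@id": "http://example.org/org/7"},
}

serialized = json.dumps(node)   # stored, e.g., as one line in HDFS
restored = json.loads(serialized)
```

The `@context` maps the shared terminology to full URIs, and `@id` links give the edges of the graph.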
Query, Analysis and Visualization
- Challenge: support efficient querying against the
graph
- Employ ElasticSearch to provide keyword querying,
faceted browsing, and aggregation queries
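An ElasticSearch request body combining the three query styles named above might look like the sketch below. The index and field names are invented; the request structure is standard ElasticSearch query DSL (a `match` query for keywords plus a `terms` aggregation for faceting).

```python
# Illustrative ElasticSearch request body: keyword query + facet counts.
query_body = {
    "query": {"match": {"text": "nuclear"}},   # keyword querying
    "aggs": {
        "by_organization": {                   # faceted browsing:
            "terms": {"field": "organization", "size": 10}  # counts per org
        }
    },
    "size": 20,                                # top 20 matching documents
}
# Sent to ElasticSearch as, e.g., POST /dig/_search with this JSON body.
```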
Query, Analysis & Visualization
- Visualization interface that provides faceted
queries, timelines, maps, etc.
Discussion
- Technology that can provide dramatic new
insights from data that is already available
- Applies to a wide range of problems
– Determining the nuclear know-how of a given country
- Technologies, key scientists, relevant organizations
– Combating human trafficking
– Understanding trends in technical areas
- E.g., Material Science
– Analyzing the competitive landscape of companies
– and many other domains with massive quantities of data
USC DIG Team
Acknowledgements
- Collaborators
– Next Century Technologies
– InferLink Inc.
– JPL
– Columbia University
- Sponsor
– DARPA
- AFRL contract number FA8750-14-C-0240
Thanks!
- More information:
– Homepage
- isi.edu/~knoblock
– DIG
- usc-isi-i2.github.io/dig
– Karma
- usc-isi-i2.github.io/karma