[PPT] - Big Data and Internet Thinking Chentao Wu Associate Professor PowerPoint Presentation

SLIDE 1

Big Data and Internet Thinking

Chentao Wu Associate Professor

Dept. of Computer Science and Engineering

wuct@cs.sjtu.edu.cn

SLIDE 2

Download lectures

ftp://public.sjtu.edu.cn
User: wuct
Password: wuct123456
http://www.cs.sjtu.edu.cn/~wuct/bdit/

SLIDE 3

Schedule

lec1: Introduction on big data, cloud computing & IoT
lec2: Parallel processing framework (e.g., MapReduce)
lec3: Advanced parallel processing techniques (e.g., YARN, Spark)

lec4: Cloud & Fog/Edge Computing
lec5: Data reliability & data consistency
lec6: Distributed file system & object-based storage
lec7: Metadata management & NoSQL Database
lec8: Big Data Analytics

SLIDE 4

Collaborators

SLIDE 5

Contents

1. Big Data Analytics

SLIDE 6

Big Data Challenges

SLIDE 7

It’s not just about the data…

1. Machine Learning/Deep Learning: Leveraging a computer’s ability to learn without being explicitly programmed to solve business problems
2. IoT (Internet of Things) & Sensor Analytics: Understanding value drivers from the ever-growing network of connected physical objects and the communication between them
3. Modeling Willingness-to-Pay: Mining product reviews to estimate willingness-to-pay for product features
4. Natural Language Processing: Understanding human speech as it is spoken through application of computer science, AI, and computational linguistics
5. Analyzing Data @ Scale: Using distributed computing and machine learning tools to analyze hundreds of gigabytes of data
6. Creating a Lake / Streaming Consumer Behavior Data: Mining social data in real time to understand when and where consumers are making choices

Big Data: Refers to the DATA only
Big Data Analytics: Methods of using Big Data to generate insight

It is important to understand the distinction between Big Data sets (large, unstructured, fast, and uncertain data) and ‘Big Data Analytics’.

SLIDE 8

It’s also about what, how, and why you use it

Big Data Analytics – the process of harnessing Big Data to yield actionable insights – is a combination of five key elements:

Decisions: The value of Big Data Analytics is driven by the unique decisions facing leaders, companies, and countries today. In turn, the type, frequency, speed, and complexity of decisions drive how Big Data Analytics is deployed.

Analytics: To leverage the variety and volume of Big Data while managing its volatility, advanced analytical approaches are necessary, such as natural language processing, network analysis, simulative modeling, artificial intelligence, etc.

Data: Big Data Analytics is about operationalizing new and more data, but it is also about data quality, data interoperability, data disaggregation, and the ability to modularize data structures to quickly absorb new data and new types of data.

Technology: To store, manage, and use Big Data often requires investments in new technologies and data processing methods, such as distributed processing (e.g., Hadoop), NoSQL storage, and Cloud computing.

Mindset & Skills: Big Data Analytics requires firm commitment to using analytics in decision-making; a decisive mentality capable of employing in-the-moment intelligence; and investment in analytical technology, resources, and skills.

SLIDE 9

Big Data Analytical Capabilities

Continuing increases in processing capacity have opened the door to a range of advanced algorithms and modeling techniques that can produce valuable insights from Big Data.

Axes: Traditional / Emerging; Structured / Unstructured

A/B/N Testing: Experiment to find the most effective variation of a website, product, etc.
Sentiment Analysis: Extract consumer reactions based on social media behavior
Complex Event Processing: Combine data sources to recognize events
Predictive Modeling: Use data to forecast or infer behavior
Regression: Discover relationships between variables
Time Series Analysis: Discover relationships over time
Classification: Organize data points into known categories
Simulation Modeling: Experiment with a system virtually
Spatial Analysis: Extract geographic or topological information
Cluster Analysis: Discover meaningful groupings of data points
Signal Analysis: Distinguish between noise and meaningful information
Visualization: Use visual representations of data to find and communicate info
Network Analysis: Discover meaningful nodes and relationships on networks
Optimization: Improve a process or function based on criteria
Deep QA: Find answers to human questions using artificial intelligence
Natural Language Processing: Extract meaning from human speech or writing

SLIDE 10

Forward-Looking vs. Rear-View Analytics

Big Data Analytics improves the speed and efficiency with which we understand the past, and opens up entirely new avenues for preparing for and adapting to the future.

What happened? Describe, summarize and analyze historical data

What should be done? Recommend ‘right’ or optimal actions or decisions

How do we adapt to change? Monitor, decide, and act autonomously or semi-autonomously

What could happen? Predict future outcomes based on the past

Observed behavior or events
Non-traditional data sources such as social listening and web crawling
Forward-looking view of current and future value
Sentiment Scoring
Graph analysis and Natural Language Processing to identify hidden relationships and themes
Dual objective models
Behavioral economics
Real-time product and service propositions (graph analysis, entity resolution on data lakes to infer present customer need)
Rapid evaluation of multiple ‘what-if’ scenarios
Optimization decisions and actions
Monitor results on a continuous basis
Dynamically adjust strategies based on changing environment and improved predictions
Agent-based and dynamic simulation models, time-series analysis

Descriptive Analytics, Predictive Analytics, Prescriptive Analytics, Continuous Analytics

Increasing Business Value

Why did it happen? Identify causes of trends and outcomes

Observed behavior or events
Non-traditional data sources such as social listening and web crawling
Statistical and regression analysis
Dynamic visualization

Diagnostic Analytics

Increasing Sophistication of Data & Analytics (Rear-view to Forward-looking)

SLIDE 11

Examples of Big Data Analytics in Action

Market Leaders are leveraging Big Data Analytics to generate value by starting with a business need and focusing on implementing actionable insights quickly and decisively.

Table columns: Company, Business Need, Data and Analytics, Impact

Business Need: Greater tailoring of credit card offers to fit customer needs
Data and Analytics: Statistical model based on public credit and demographic data to target customized products to customers
Impact: Net revenue grew at a CAGR of 32% from 1994 to 2003; prompted competitors to shift focus to data and analytics

Business Need: Data-enabled engine prognostics, monitoring, maintenance and repair
Data and Analytics: Analysis of sensor data from hundreds of sensors in 4,000 engines to identify and solve issues weeks in advance
Impact: Over 70% annual revenue from the aircraft engine division attributable to this service

Business Need: Search-to-purchase conversion by anticipating intent of a shopper’s search and delivering relevant results
Data and Analytics: Semantic search, which enables discovery using algorithms that rank results via social signals from around the web
Impact: Increases by 10-15% the likelihood that a customer will complete their purchase, translating to millions of dollars in revenue

Business Need: Transformation from subscription streaming service to original content producer
Data and Analytics: Analysis of data from 66 million subscribers’ viewing habits and preferences
Impact: Revenue and subscriber base increased by 15% and 9% respectively in 2013

Business Need: Leverage Internet of Things (IoT) by connecting machines to facilitate data-enabled prognostics, increase efficiency and reduce downtime
Data and Analytics: Launched software to help airlines and railroads move their data to the cloud and predict mechanical malfunctions, improve safety, and reduce trip cancellations and cost
Impact: Estimated 1% reduction in fuel costs, projected to save the airline industry $30 billion over 15 years

Business Need → Big Data Analytics → Impact

SLIDE 12

Big Data Analytics in Development

Big Data Analytics is making an equally impressive impact on Development interventions – allowing decision-makers to reach and serve previously neglected populations.

Table columns: Company, Business Need, Data and Analytics, Impact

Business Need: More transparent, reliable, and low-cost method to track inflation in Argentina
Data and Analytics: Web scraping of online price data used to produce price indices, and econometric analysis used to model disaggregated impacts of policies
Impact: Government statistical offices shifting to accept Big Data. Central banks using Big Data to see day-to-day volatility.

Business Need: Understand how migrants act as arbitrageurs to bring labor markets into equilibrium
Data and Analytics: Iterative analysis of call detail records (CDRs) to track movement of migrants in response to local shocks to labor demand (weather, economy, conflict, etc.)
Impact: Informing labor policy design in low-income countries to incentivize or disincentivize migratory behavior

Business Need: The city of Rio de Janeiro wanted to improve its emergency response by better predicting heavy rainfall and subsequent severe landslides and flooding
Data and Analytics: The city combines data from 30 city agencies – including weather, satellite, video, GPS, historic rainfall, and topographic survey data – in a central Operations Center
Impact: Rio has improved emergency response time by 30%, catalogued 200+ flood points, and can now predict heavy rains 48 hours in advance on a half-km basis

Business Need: Create a better ecosystem for mobile services in the agricultural sectors of Kenya, Tanzania, and Mozambique
Data and Analytics: Remote crowdsourced data gathered via cell phones used to connect farmers to markets, assess farmers’ credit worthiness, and incubate new mobile businesses with greater predictors of success
Impact: M-PESA is being used to lower costs for farmers to receive loans and perform transactions with distributors and buyers, as well as to provide geography-specific market information

Business Need → Big Data Analytics → Impact

SLIDE 13

Big Data Landscape

SLIDE 14

Creating a Big Data Organization Step 1: Be Yourself

Beginning with a clear understanding of the specific questions you intend to use Big Data Analytics to address can help guide where and which data solutions are deployed.

Value enablement / Value enhancement

Strategic, Tactical, Operational

Day-to-day operations:
Struggle to move from narrow focus on reactive operations to more proactive, comprehensive management of daily operations
High value for digitization of operational processes across program units
Often already proficient in traditional business intelligence

Enabling strategy and improving performance:
Use analytics to reduce political divergence and drive consensus
Real-time analytics to enable quick responses to events
Use data to develop personalized services
Need for more objective and higher quality data

Delivering future value:
Data-driven decision-making in real time
Use analytics to develop new programs/opportunities
Relies heavily on data supplied by others
Often struggles to move away from exclusively intuitive decision-making

SLIDE 15

Creating a Big Data Organization Step 2: Secure People & Skills

The competencies required of “data scientists” within an analytics organization or project converge from multiple skill domains.

Subject Area or Domain Expertise: Deep understanding of industry, subject area, or research domain to help determine which questions need answering and on what frequency, specificity, or geography

Computer Science & Programming: Comfort in programming across various languages, a thorough understanding of external and internal data sources, data gathering, storing, and retrieving methods which help combine disparate data sources to generate unique insights

Statistical & Mathematical: Expertise in statistical techniques, tools and languages used to run analyses that generate insights to effectively determine and communicate actionable insights

Organization-specific Information Knowledge: Organization-specific knowledge about data assets – including enterprise “metadata” – their location and appropriate business context for use in advanced analytics

SLIDE 16

Creating a Big Data Organization Step 3: Let objectives dictate structure, not vice versa

How analytics efforts or organizations are structured – whether reporting is vertically or horizontally aligned, how interconnected or autonomous separate units are, how resources and successes are shared – can influence efficiency and impact.

[Figure: three structures (Distributed Analytics, Federated Analytics, Centralized Analytics), each showing where the Analytics Competency Center, ETL, Data Warehouse, BI Applications, Metadata Repository, and Data Mart sit between LOCAL and CENTRAL.]

Objectives:
Adopt previously proven practices
Highly focused analytics support
Subject area-specific innovations
Repeatable models
Governance
Aligning analytics to organization-wide strategy

Data Warehouses, Marts, etc.:
Distributed: Deployed locally
Federated: Deployed locally; some data and models shared across groups
Centralized: Deployed and managed centrally

Analytics Tools:
Distributed: Managed locally
Federated: Managed locally, but connected to group framework
Centralized: Controlled centrally, with units having access to shared resources

Analytics Staff/Competencies:
Distributed: Placed within individual units
Federated: Placed within individual units; skills tailored to specific region or subject matter
Centralized: Placed within central analytics team, available as needed to support individual units

SLIDE 17

The ‘Hub-Spoke’ operating model often serves as a well-synchronized, connected system

[Figure: Sample Hub-Spoke Interaction Model, spanning Global Business Strategy and Local Business Operations, with labels Competency Center ‘Standardization’ and Local Adoption of Practices.]

1. Central Decision Hub
2. Competency Center (‘Standards’)
3. Centers of Excellence (Regional)
4. Local ‘Spoke’

SLIDE 18

Creating a Big Data Organization Step 4: Invest in Appropriate Infrastructure

Big Data introduces challenges related to data volume and variety, processing constraints, and new data structures that traditional data infrastructure is not equipped to support

Table columns: Objective, Considerations, Impact

Analytics Capabilities
Objective: Identify the type of analysis that will be conducted and define which analytics capabilities will be employed
Analysis Type: Dictates performance needs along with data structures and processing architecture
Analysis Flexibility: Interface could restrict the ability to perform analysis ad hoc and restrict ability to update
Analysis Structures: Support for analysis-specific data structures can improve performance and reduce analysis effort

Data Variety
Objective: Define the data set that will be used for the analysis including its sources, size, and structure
Size: Size of data sets introduces need for scalable infrastructure and performance
Structure: Variability of source data models and data set structure requires data model flexibility
Sources: Diverse sources will require scalability, model flexibility, and flexible interfaces

Application
Objective: Define the timeliness and frequency of the analysis results for reporting and downstream systems
Frequency: Frequency of analysis will dictate the processing architecture (batch or real time)
Speed: The timeliness of the analysis will impact the need for scalability and performance
Interfaces: Inbound and outbound interfaces are defined by the use of data and required flexibility

SLIDE 19

Contents

2. Architecture Design

SLIDE 20

Emerging Infrastructure Options

To harness Big Data, storage solutions must be able to support targeted analytics capabilities, data diversity and performance needs

Distributed Processing

Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware

NoSQL

Embedded and persisted storage that implement data models through document, graph, and dictionary structures

Cloud Computing

Cloud computing can improve flexibility, scalability and cost management and enable a cohesive business strategy across an organization

Traditional challenges being addressed…

Scalability Issues
Big Data set information extraction and queries require large volumes of processing cycles that can quickly scale
Data storage solutions need to provide flexible data models to better ingest unstructured and semi-structured data
Need to combine and link multiple data sources

SLIDE 21

Building an Analytics Organization: Critical Components

Distributed Processing

Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware

Introduction to Hadoop

Hadoop is based on work done by Google in the early 2000s (a combination of the Google File System (GFS) and MapReduce)
Useful for analyzing copious amounts of complex data across multiple data sources
Distributes data as it is initially stored in the system
Applications are written in high-level code
Computation happens where data is stored, whenever possible
Data is replicated multiple times on the system for increased availability and reliability

Benefits: Faster and Lower Cost Analysis; Linear Scalability; Greater flexibility

Emerging Infrastructure – Computing/Storage Options
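The MapReduce model mentioned above can be illustrated with a minimal, single-process sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample text are assumptions made for illustration, and each list element merely stands in for a data block that Hadoop would store on a separate node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk stands in for a block stored on a different node; in a real
    # cluster the map task would run on the node that already holds the block.
    emitted = []
    for chunk in chunks:
        emitted.extend(map_phase(chunk))
    return reduce_phase(shuffle(emitted))


# Hypothetical input split into two "blocks".
counts = word_count(["big data big insight", "data at scale"])
print(counts)
```

The map step only ever sees one chunk, which is what lets the computation move to where the data is stored; only the grouped intermediate pairs have to travel between nodes.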

SLIDE 22

Building an Analytics Organization: Critical Components

Emerging Infrastructure – Storage Options

NoSQL

Embedded and persisted storage that implement data models through document, graph, and dictionary structures

NoSQL - Storage Types

Storage types: Document Store, Key-Value Store, Graph Store, Columnar Store

Solution Examples

Increasing Data Complexity

Pros: Simplicity & Scalability. Cons: Lack of advanced features/queries
Pros: Scalability & Flexibility. Cons: Complexity
Pros: Easy to Use. Cons: Scalability
Pros: Graph Joins. Cons: Flexibility
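To make the four storage types more concrete, the sketch below expresses the same made-up customer data in each shape using plain Python structures. It does not use any NoSQL product's API; the record contents, key formats, and helper names are all hypothetical.

```python
import json

# Key-value store: an opaque value looked up by a single key.
key_value = {"customer:42": json.dumps({"name": "Ada", "city": "Shanghai"})}

# Document store: a nested, self-describing document that can be queried by field.
document = {
    "_id": 42,
    "name": "Ada",
    "address": {"city": "Shanghai"},
    "orders": [{"sku": "A1", "qty": 2}],
}

# Columnar store: the values of each column are kept together;
# row i is formed by the i-th entry of every column.
columnar = {
    "id": [42, 43],
    "name": ["Ada", "Bo"],
    "city": ["Shanghai", "Beijing"],
}

# Graph store: nodes plus typed edges between them.
graph = {
    "nodes": {42: {"name": "Ada"}, 43: {"name": "Bo"}},
    "edges": [(42, "FRIEND_OF", 43)],
}


def lookup(store, key):
    """Key-value access: fetch by key, then decode the opaque value."""
    return json.loads(store[key])


def column_values(table, column):
    """Columnar access: read one column without touching the others."""
    return table[column]


def neighbors(g, node, relation):
    """Graph access: follow edges of one type out of a node."""
    return [dst for src, rel, dst in g["edges"] if src == node and rel == relation]


print(lookup(key_value, "customer:42")["city"])
print(document["orders"][0]["sku"])
print(column_values(columnar, "city"))
print(neighbors(graph, 42, "FRIEND_OF"))
```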

SLIDE 23

Building an Analytics Organization: Critical Components

Emerging Infrastructure – System Options

Cloud Computing

The model is compelling; cloud computing can improve flexibility, scalability and cost management. Businesses best able to realize the potential will establish a cohesive business strategy as cloud computing can transform your entire organization — people, processes, and systems

Source: PwC, “Digital IQ Snapshot: Cloud,”; PwC, “FS Viewpoint: Clouds is the forecast”

Cloud transformation begins at the infrastructure level and leads to more agile applications, resulting in faster speed to market and more flexibility to meet client needs. The key benefits, beyond consolidation, include standardized application and development environments, resulting in better controlled and more efficient application lifecycles.

SLIDE 24

Relational Reference Architecture

SLIDE 25

Extended Relational Reference Architecture

SLIDE 26

Non-Relational Reference Architecture

SLIDE 27

Data Discovery: Non-Relational Architecture

SLIDE 28

Business Reporting: Hybrid Architecture

SLIDE 29

Contents

3. Big Data Algorithms

SLIDE 30

Key components of Mahout in Hadoop (1)

SLIDE 31

Key components of Mahout in Hadoop (2)

SLIDE 32

Key Components of Spark MLlib

SLIDE 33

Spark ML Basic Statistics

◼ Correlation: Calculating the correlation between two series of data is a common operation in statistics

➢ Pearson’s Correlation
➢ Spearman’s Correlation

SLIDE 34

Example of Popular Similarity Measurements

◆Pearson Correlation Similarity ◆Euclidean Distance Similarity ◆Cosine Measure Similarity ◆Spearman Correlation Similarity ◆Tanimoto Coefficient Similarity (Jaccard coefficient) ◆Log-Likelihood Similarity

SLIDE 35

Pearson Correlation Similarity

Data: Missing Data

SLIDE 36

On Pearson Similarity

Three problems with the Pearson Similarity:

1. It does not take into account the number of items on which two users’ preferences overlap (e.g., an overlap of only 2 items can yield a correlation of 1, while a larger overlap may score lower).

2. If two users overlap on only one item, no correlation can be computed.

3. The correlation is undefined if either series of preference values is constant (all values identical).

Adding Weighting: passing WEIGHTED as the 2nd parameter of the constructor can cause the resulting correlation to be pushed towards 1.0 or -1.0, depending on how many points are used.
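The three problems can be reproduced with a small sketch of Pearson similarity computed over the items two users have both rated. The function name, the dictionary layout, and the choice to return `None` for undefined cases are assumptions for illustration; the ratings are made up.

```python
import math


def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation over co-rated items; None when it is undefined."""
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return None  # problem 2: a single overlapping item gives no correlation
    xs = [prefs_a[item] for item in common]
    ys = [prefs_b[item] for item in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxx = sum(d * d for d in dx)
    syy = sum(d * d for d in dy)
    if sxx == 0.0 or syy == 0.0:
        return None  # problem 3: one series is constant
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / math.sqrt(sxx * syy)


# Problem 1: only two overlapping items, yet the similarity is a perfect 1.0.
two_items = pearson_similarity({"A": 1.0, "B": 2.0}, {"A": 3.0, "B": 4.5, "C": 1.0})

# Problem 2: only one overlapping item.
one_item = pearson_similarity({"A": 1.0, "B": 2.0}, {"B": 4.0, "C": 1.0})

# Problem 3: the first user gave every co-rated item the same value.
constant = pearson_similarity({"A": 3.0, "B": 3.0, "C": 3.0},
                              {"A": 1.0, "B": 4.0, "C": 5.0})

print(two_items, one_item, constant)
```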

SLIDE 37

Spearman Correlation Similarity

Pearson correlation computed on the relative ranks of the preference values

Example for ties
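As a rough illustration, Spearman correlation can be sketched as Pearson correlation applied to ranks. Giving tied values the average of the ranks they span is a common convention and is an assumption here, as are the function names and the sample values.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation; assumes neither series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman correlation: Pearson on the ranks of the values."""
    return pearson(average_ranks(xs), average_ranks(ys))


tied = average_ranks([10, 20, 20, 30])         # the two 20s share rank 2.5
monotonic = spearman([1, 2, 3, 4], [1, 4, 9, 16])
print(tied, monotonic)
```

Because only the ordering matters, any strictly increasing relationship scores 1.0, even when it is far from linear.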

SLIDE 38

Basic Spark Data Format

Data: 1.0, 0.0, 3.0

Dense vector: all values listed in order // straightforward
Sparse vector (arrays): number of parameters, location of non-zero indices, and non-zero values
Sparse vector (sequence): number of parameters, sequence of non-zero values as (index, value) pairs
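The three encodings can be mimicked in plain Python to show that they carry the same information. The helper names below are hypothetical; in Spark itself the corresponding factory methods are `Vectors.dense` and `Vectors.sparse`, which this sketch does not call.

```python
def to_sparse_arrays(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_sparse_pairs(dense):
    """Return (size, [(index, value), ...]) for the non-zero entries."""
    return len(dense), [(i, v) for i, v in enumerate(dense) if v != 0.0]


def to_dense(size, indices, values):
    """Rebuild the dense list from the array-style sparse encoding."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


data = [1.0, 0.0, 3.0]
as_arrays = to_sparse_arrays(data)   # size 3, indices [0, 2], values [1.0, 3.0]
as_pairs = to_sparse_pairs(data)     # size 3, pairs [(0, 1.0), (2, 3.0)]
round_trip = to_dense(*as_arrays)
print(as_arrays, as_pairs, round_trip)
```

The sparse forms pay off when most entries are zero, since only the non-zero positions are stored.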

SLIDE 39

Correlation Example in Spark

1.0, 0.0, 0.0, -2.0
4.0, 5.0, 0.0, 3.0
6.0, 7.0, 0.0, 8.0
9.0, 0.0, 0.0, 1.0
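A column-by-column Pearson correlation matrix for these four rows can be sketched in plain Python as follows. This is not Spark's `Correlation` API; the function names are assumptions, and the sketch simply returns NaN wherever a column is constant (the third column is all zeros), which may differ from how a library reports such entries.

```python
import math

rows = [
    [1.0, 0.0, 0.0, -2.0],
    [4.0, 5.0, 0.0, 3.0],
    [6.0, 7.0, 0.0, 8.0],
    [9.0, 0.0, 0.0, 1.0],
]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; NaN if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(data):
    """Correlate every column (feature) with every other column."""
    columns = list(zip(*data))
    return [[pearson(a, b) for b in columns] for a in columns]


matrix = correlation_matrix(rows)
for line in matrix:
    print(["%.4f" % value for value in line])
```

Each row is one observation and each column one feature, so the result is a 4 x 4 matrix of feature-to-feature correlations.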

SLIDE 40

Euclidean Distance Similarity

Similarity = 1 / ( 1 + d )
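The formula above translates directly into code. In this minimal sketch, `d` is the Euclidean distance between two equal-length preference vectors; the function name and sample vectors are assumptions.

```python
import math


def euclidean_similarity(xs, ys):
    """Map Euclidean distance d to a similarity in (0, 1] via 1 / (1 + d)."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))
    return 1.0 / (1.0 + d)


same = euclidean_similarity([5.0, 3.0], [5.0, 3.0])   # d = 0
apart = euclidean_similarity([0.0, 0.0], [3.0, 4.0])  # d = 5
print(same, apart)
```

Identical vectors give 1.0, and the value shrinks towards 0 as the vectors move apart.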

SLIDE 41

Cosine Similarity

Cosine similarity and Pearson similarity get the same results if data are normalized (mean == 0).
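The equivalence can be checked numerically. The sketch below, using made-up ratings and hypothetical function names, computes cosine similarity on the raw values, on mean-centered values, and compares both with Pearson correlation.

```python
import math


def cosine_similarity(xs, ys):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    return dot / (norm_x * norm_y)


def center(values):
    """Subtract the mean so the series has mean 0."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pearson(xs, ys):
    """Pearson correlation written in its usual covariance form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [5.0, 3.0, 2.5]
y = [2.0, 2.5, 5.0]
raw_cosine = cosine_similarity(x, y)
centered_cosine = cosine_similarity(center(x), center(y))
correlation = pearson(x, y)
print(raw_cosine, centered_cosine, correlation)
```

On the raw ratings the two measures disagree noticeably; after centering they coincide up to floating-point error.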

SLIDE 42

Caching User Similarity

Spearman Correlation Similarity is time consuming to compute. Caching is needed: remember the user-user similarities that were previously computed.
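The caching idea can be sketched as a small wrapper that stores each user-user result the first time it is computed. The class below is a hypothetical illustration, not a library class; it assumes the wrapped similarity is symmetric and that user IDs can be compared with `<=`.

```python
class CachingSimilarity:
    """Wrap a symmetric user-user similarity function with a lookup table."""

    def __init__(self, similarity_fn):
        self._fn = similarity_fn
        self._cache = {}
        self.misses = 0  # how many times the wrapped function actually ran

    def __call__(self, user_a, user_b):
        # Order the pair so (a, b) and (b, a) share one cache entry.
        key = (user_a, user_b) if user_a <= user_b else (user_b, user_a)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._fn(*key)
        return self._cache[key]


def slow_similarity(user_a, user_b):
    """Stand-in for an expensive computation such as Spearman correlation."""
    return 1.0 / (1.0 + abs(user_a - user_b))


cached = CachingSimilarity(slow_similarity)
first = cached(1, 3)
second = cached(3, 1)   # served from the cache
print(first, second, cached.misses)
```

A production cache would also need a size limit and invalidation when preferences change; both are omitted here.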

SLIDE 43

Tanimoto (Jaccard) Coefficient Similarity

Discard preference values
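Because preference values are discarded, the coefficient depends only on which items each user has expressed a preference for. A minimal sketch, with hypothetical item sets and an assumed convention of returning 0.0 when both sets are empty:

```python
def tanimoto_similarity(items_a, items_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty item sets
    return len(a & b) / len(union)


# Two of the four distinct items are shared.
similarity = tanimoto_similarity({"A", "B", "C"}, {"B", "C", "D"})
print(similarity)
```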

SLIDE 44

Log-Likelihood Similarity

Assess how unlikely it is that the overlap between the two users is just due to chance.
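One common way to quantify "unlikely to be due to chance" is the log-likelihood ratio (G²) of a 2x2 table counting items both users have, items only one has, and items neither has. The sketch below follows that formulation; the final mapping `1 - 1 / (1 + LLR)` onto a 0..1 scale, the function names, and the sample item sets are assumptions for illustration.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy: N*ln(N) minus the sum of c*ln(c)."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of counts."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


def log_likelihood_similarity(items_a, items_b, num_items):
    a, b = set(items_a), set(items_b)
    k11 = len(a & b)                 # items both users have
    k12 = len(a - b)                 # items only user A has
    k21 = len(b - a)                 # items only user B has
    k22 = num_items - len(a | b)     # items neither user has
    llr = log_likelihood_ratio(k11, k12, k21, k22)
    return 1.0 - 1.0 / (1.0 + llr)


chance_level = log_likelihood_ratio(10, 10, 10, 10)   # overlap exactly as expected
strong = log_likelihood_similarity({"A", "B", "C"}, {"A", "B", "C"}, num_items=10)
print(chance_level, strong)
```

Note that the statistic measures departure from independence in either direction, so it is usually combined with a check that the overlap is larger, not smaller, than expected.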

SLIDE 45

Performance Measurements

Using GroupLens data (http://grouplens.org): 10 million rating MovieLens dataset.

Spearman: 0.8
Tanimoto: 0.82
Log-Likelihood: 0.73
Euclidean: 0.75
Pearson (weighted): 0.77
Pearson: 0.89

SLIDE 46

Spark ML Basic Statistics

Hypothesis testing: Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant. Spark ML currently supports Pearson’s Chi-squared (χ²) tests for independence.

ChiSquareTest conducts Pearson’s independence test for every feature against the label.

SLIDE 47

Chi-Square Tests (1)

SLIDE 48

Chi-Square Tests (2)

SLIDE 49

Chi-Square Tests (3)

We would reject the null hypothesis that there is no relationship between location and type of malaria. Our data tell us there is a relationship between type of malaria and location.
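The mechanics of such a test can be sketched in plain Python. The counts below are hypothetical and are not the malaria data from the slides; the function name is an assumption, and the sketch assumes every expected count is non-zero. The critical value 3.841 is the standard χ² threshold for 1 degree of freedom at the 0.05 level.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * column_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


# Hypothetical 2x2 table: rows are two locations, columns are two categories.
observed_counts = [[10, 20],
                   [20, 10]]
statistic, dof = chi_square_statistic(observed_counts)

CRITICAL_VALUE_DOF1_ALPHA_05 = 3.841
reject_null = statistic > CRITICAL_VALUE_DOF1_ALPHA_05
print(statistic, dof, reject_null)
```

When the statistic exceeds the critical value, the null hypothesis of no relationship is rejected, which mirrors the conclusion drawn above.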

SLIDE 50

Chi-Square Tests in Spark

SLIDE 51

Spark ML Clustering

SLIDE 52

Example: Clustering

Feature Space

SLIDE 53

Clustering

SLIDE 54

Clustering – on feature plane

SLIDE 55

Clustering example

SLIDE 56

Steps on Clustering

SLIDE 57

Making Initial Cluster Centers

SLIDE 58

K-means Clustering
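As a rough illustration of the algorithm, the sketch below alternates between assigning each point to its nearest center and moving each center to the mean of its points. It is a plain-Python sketch, not Spark or Mahout code; the points, the initial centers, and the function names are hypothetical.

```python
import math


def closest(point, centers):
    """Index of the center nearest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def k_means(points, initial_centers, max_iterations=100):
    centers = [tuple(float(v) for v in c) for c in initial_centers]
    for _ in range(max_iterations):
        # Assignment step: group points by their nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(center)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    assignments = [closest(p, centers) for p in points]
    return centers, assignments


points = [(1, 1), (2, 1), (1, 2), (2, 2), (3, 3), (8, 8), (9, 8), (8, 9), (9, 9)]
centers, assignments = k_means(points, initial_centers=[(1, 1), (2, 1)])
print(centers, assignments)
```

Even with both initial centers placed inside the same group of points, the centers separate after a few iterations on this data; in general the result depends on the initial choice.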

SLIDE 59

HelloWorld Clustering Scenario Result

SLIDE 60

Testing different distance measures

SLIDE 61

Manhattan and Cosine distances

SLIDE 62

Tanimoto distance and weighted distance

SLIDE 63

Results Comparison

SLIDE 64

Sample Code of K-Means Clustering in Spark

SLIDE 65

Vectorization Example

0: Weight 1: Color 2: Size

SLIDE 66

Canopy Clustering (estimate the number of clusters)

Tell what size clusters to look for. The algorithm will find the number of clusters that have approximately that size. The algorithm uses two distance thresholds. This method prevents all points close to an already existing canopy from being the center of a new canopy.
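The two-threshold procedure can be sketched as follows, assuming a loose threshold `t1` and a tight threshold `t2` with `t1 > t2`. Points within `t1` of a center join its canopy; points within `t2` are also removed from the candidate list, so they can never become the center of a new canopy. The function name, thresholds, and points are hypothetical.

```python
import math


def canopy_clusters(points, t1, t2):
    """Return a list of (center, members) canopies using thresholds t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)       # next candidate becomes a canopy center
        members = [center]
        still_candidates = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)       # loosely close: belongs to this canopy
            if d >= t2:
                still_candidates.append(p)  # not tightly close: may still seed a canopy
        remaining = still_candidates
        canopies.append((center, members))
    return canopies


points = [(1, 1), (2, 1), (1, 2), (2, 2), (3, 3), (8, 8), (9, 8), (8, 9), (9, 9)]
canopies = canopy_clusters(points, t1=4.0, t2=3.0)
estimated_k = len(canopies)
print(estimated_k, [center for center, _ in canopies])
```

The number of canopies gives an estimate of the cluster count, and the canopy centers can serve as initial centers for K-means.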

SLIDE 67

Other Clustering Algorithms

Hierarchical clustering

SLIDE 68

Different Clustering Algorithms

https://github.com/HewlettPackard/cacti

SLIDE 69

Spark ML Classification

SLIDE 70

Spark ML Classification

SLIDE 71

Classification - definition

SLIDE 72

Classification example: using SVM to recognize a Toyota Camry

SLIDE 73

Classification example: using SVM to recognize a Toyota Camry

SLIDE 74

When to use Big Data System for Classification?

SLIDE 75

The advantage of using Big Data System for Classification

SLIDE 76

How does a classification system work?

SLIDE 77

Key Terminology for Classification

SLIDE 78

Input and Output of a classification model

SLIDE 79

Four types of values for predictor variables

SLIDE 80

Sample data that illustrates all four values

SLIDE 81

Supervised vs. Unsupervised Learning

SLIDE 82

Work flow in a typical classification project

SLIDE 83

Classification Example – Color-Fill

Position looks promising, especially the x-axis ==> predictor variable. Shape seems to be irrelevant. Target variable is “color-fill” label.

SLIDE 84

Classification Example – Color-Fill (another feature)

SLIDE 85

Fundamental classification algorithm

Example of fundamental classification algorithms:

Naive Bayesian
Complementary Naive Bayesian
Stochastic Gradient Descent (SGD)
Random Forest
Support Vector Machines

SLIDE 86

Choose algorithm

SLIDE 87

Support Vector Machine (SVM)

maximize boundary distances; remembering “support vectors”
nonlinear kernels

SLIDE 88

Example SVM code in Spark

SLIDE 89

Contents

4. Tools Support

SLIDE 90

Data Mining, Text Mining, and Natural Language Processing

Data Mining: Extraction of implicit, previously unknown, and potentially useful information from data

Text Mining: Analysis of large quantities of natural language text and detecting lexical or linguistic usage patterns to extract probably useful information

Natural Language Processing: NLP is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications.

SLIDE 91

NLP Tools

Table columns: Tool, Description, Analysis Type

Tool: OpenNLP
Description: A machine learning based toolkit for the processing of natural language text.
Analysis Type: Tokenization; Sentence segmentation; Part-of-speech tagging; Named entity extraction; Chunking, parsing; Coreference resolution

Tool: GATE
Description: A Java suite of tools that can perform natural language processing tasks for multiple languages.
Analysis Type: Information extraction; Part-of-speech tagging; Tokenizer; Sentence splitter

Tool: NLTK
Description: A suite of libraries and programs for symbolic and statistical natural language processing in Python.
Analysis Type: Information extraction; Part-of-speech tagging; Tokenizer; Word categorization; Text classification

Tool: Stanford NLP
Description: Statistical NLP toolkits for various computational linguistics problems that can be incorporated into applications with human language technology needs.
Analysis Type: Tokenization; Part-of-speech tagging; Named entity recognition; Parsing; Classification; Segmentation; Coreference resolution

Tool: LingPipe
Description: A tool kit for processing text using computational linguistics.
Analysis Type: Sentiment analysis; Entity recognition; Clustering; Topic classification; Part-of-speech tagging; Sentence detection; Disambiguation

Tool: MontyLingua
Description: A suite of libraries and programs for symbolic and statistical natural language processing for both Python and Java.
Analysis Type: Information extraction; Part-of-speech tagging; Tokenizer; Word categorization; Text generation; Stemming; Phrase chunking

Tool: Rosetta Linguistic Platform
Description: A suite of linguistic analysis components that integrate into applications for mining unstructured data.
Analysis Type: Language identification; Name, places, and key concept extraction; Name matching; Name translation

SLIDE 92

Text Mining/Analytics Tools

Table columns: Tool, Description, Analysis Type

Tool: RapidMiner
Description: An open source environment for machine learning, data mining, text mining, predictive analytics, and business analytics.
Analysis Type: Document classification; Sentiment analysis; Topic tracking; Data mining; Traditional analytics

Tool: SAS Text Miner
Description: A suite of text processing and analysis tools.
Analysis Type: Text parsing; Filtering; Feature extraction; Topic clustering

Tool: VisualText
Description: Integrated development environment for building information extraction systems, natural language processing systems, and text analyzers.
Analysis Type: Information extraction; Summarization; Categorization; Data mining; Document filtering; Natural language search

Tool: SAS Sentiment Analysis
Description: Commercial tool that is dedicated to customer sentiment analysis.
Analysis Type: Customer sentiment monitoring; Sentiment discovery

Tool: Textifier
Description: Tool for sorting large amounts of unstructured text with The Public Comment Analysis Toolkit (PCAT).
Analysis Type: Topic modeling; Information retrieval; Document analysis; Social media analysis

Tool: Infinite Insight
Description: System for automatically preparing and transforming unstructured text attributes into a structured representation.
Analysis Type: Term frequency; Term frequency inverse document frequency; Root word coding; Synonym identification; Customization of stop words; Stemming rules; Concepts merging

Tool: Clustify
Description: Software for grouping related documents into clusters, providing an overview of the document set and aiding with categorization.
Analysis Type: Document clustering

SLIDE 93

Text Mining/Analytics Tools Cont.

Table columns: Tool, Description, Analysis Type

Tool: Attensity Analyze
Description: Customer analytics applications that help analyze high volumes of customer conversations across multiple channels.
Analysis Type: Unstructured communication analysis; Sentiment analysis; Consumer profiling

Tool: ReVerb
Description: A program that automatically identifies and extracts binary relationships from English sentences.
Analysis Type: Information extraction; Topic identification; Topic linking

Tool: Open Text Summarizer
Description: Open source tool for summarizing texts.
Analysis Type: Document summarization

Tool: Open Calais
Description: Web based API that is used to analyze content and extract topics or information.
Analysis Type: Attribute/feature extraction; Fact identification

Tool: Knowledge Search
Description: Family of techniques and tools for searching and organizing large data collections.
Analysis Type: Semantic analysis

Tool: KH Coder
Description: Free software for Quantitative Content Analysis or Text Mining.
Analysis Type: Text parsing; Document search; Network analysis

SLIDE 94

Image Analytics Overview

Overview

The process of pulling relevant information from an image or sets of images for advanced classification and traditional analysis
Applies image capture, image processing, and machine learning techniques to extract, quantify, and structure image information

Advantages

Provides a method to structure, organize, and search information that is stored within images
Offers an additional data set that can be applied to understanding consumer behavior, automating business processes, and discovering knowledge in enterprise content

SLIDE 95

Image Analytics Tools

Table columns: Tool, Overview, Image Processing, Computer Vision, Machine Learning

Tool: OpenCV
Overview: Open source library of computer vision functions that is accessible via C, Java, and Python
Capabilities marked: X X X

Tool: PAXit Image Analysis
Overview: Integrated image analysis platform that provides basic feature identification functions
Capabilities marked: X X

Tool: ImageJ
Overview: Java based image processing platform that can be accessed via an API and expanded with custom plugins
Capabilities marked: X

Tool: PIL
Overview: Python image processing library
Capabilities marked: X

Tool: PyBrain
Overview: A modular machine learning library for Python
Capabilities marked: X

SLIDE 96

Audio Analytics Overview

Overview

The process of capturing audio and analyzing its features so as to extract the content and context of an event
Applies speech analysis and signal processing principles to structure audio information for analysis via NLP or traditional analytics techniques

Advantages

Provides a method for identifying events or common patterns within sound bites
Offers a way of capturing not only the content and topics within a conversation, but also the emotions and context
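
To make "analyzing its features" concrete, the following sketch computes two common signal-processing features, RMS energy and zero-crossing rate, on synthetic sine waves. The function names, the 8000 Hz sample rate, and the test tones are assumptions chosen for illustration; they are not taken from any of the tools listed in this deck.

```python
import math


def rms_energy(samples):
    """Root-mean-square amplitude: a rough measure of loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(samples) - 1)


def sine_wave(freq_hz, sample_rate=8000, duration_s=0.1):
    """Generate a synthetic tone so the example needs no audio file."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


low = sine_wave(100)    # low-pitched tone
high = sine_wave(1000)  # higher-pitched tone

print("RMS low/high:", rms_energy(low), rms_energy(high))
print("ZCR low/high:", zero_crossing_rate(low), zero_crossing_rate(high))
```

The higher tone crosses zero more often, so its zero-crossing rate is larger, while both tones have similar energy. Feature vectors of this kind are what downstream classification or pattern detection would operate on.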

SLIDE 97

Audio Analytics Tools

Tool | Overview | Audio Processing | Information Retrieval

CLAM
A C++ library that provides varying levels of audio processing and information retrieval capabilities
X X

CallMiner
A tool that is capable of translating calls to a more structured text data set and combining with other communication forms
X

Nuance
Logs calls and structures audio for text-based search and retrieval
X

yaafe
Audio feature extraction toolkit with wrappers for several languages
X

PRAAT
Multiple platform audio analysis toolkit
X

SLIDE 98

Social Network → Applications (1)

Collaboration Analysis
Analysis: Evaluate team structures, information flows among team members, and information exchanges with other teams to improve working structures
Objectives:
Identify team structures that are not effective
Identify informal organizational structures
Identify individuals/roles or groups that are influential to collaborative work environments

Content/Knowledge Management
Analysis: Evaluate how knowledge or content is diffused and accessed within an organization
Objectives:
Improve content and knowledge distribution
Identify content bottlenecks, open communication flows, and establish channels
Explore impact of new communication methods

Community Mining
Analysis: Identify groups or informal teams that share knowledge, communicate frequently, solve problems, or work together to perform specific tasks
Objectives:
Improved structures for key organizational functions
Improved information flows
Identify potential bottlenecks for organizational functions
Identify cultural patterns to build other communities

Organization Development
Analysis: Explore formal and informal organization structures and how individuals work with one another to improve the design of the organization
Objectives:
Improve hierarchy and structure of organization to better align with the informal practices
Identify team members that are effective leaders and would impact the organization if promoted
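
Two of the objectives above, finding influential individuals and finding informal groups, map onto standard graph measures. The sketch below is a minimal pure-Python version using degree centrality and connected components; the people and edges are hypothetical, and a production analysis would more likely use one of the network libraries listed on the tools slides.

```python
from collections import defaultdict

# Hypothetical "who exchanges information with whom" pairs.
EDGES = [
    ("ana", "ben"), ("ana", "cho"), ("ana", "dev"),
    ("ben", "cho"), ("dev", "eli"), ("fay", "gus"),
]


def build_graph(edges):
    """Build an undirected adjacency map from edge pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Share of the other nodes each node is directly connected to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def connected_groups(graph):
    """Return connected components as sorted lists of node names."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


graph = build_graph(EDGES)
centrality = degree_centrality(graph)
most_connected = max(centrality, key=centrality.get)
groups = connected_groups(graph)
print(most_connected, groups)
```

Degree centrality is only one possible notion of influence; measures such as betweenness would highlight people who bridge groups rather than those with the most direct contacts.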

SLIDE 99

Social Network → Applications (2)

Disaster Recovery Planning
Analysis: Assess organizational structures and communication patterns as they relate to the groups that play a role in disaster recovery plans
Objectives:
Identify communication improvements to disaster recovery teams
Identify weak links among functional groups to improve collaboration during recovery plan execution

Data/Information Dissemination
Analysis: Assess how data points or information sets originate or are distributed across the enterprise to their intended targets
Objectives:
Identify overlapping information sets and bottlenecks for information dissemination
Assess how organization structures or information architecture impact the flow of information to its targets

Fraud Detection/Prevention
Analysis: Assess the organization or external network to identify communication or collaboration patterns that align with known fraudulent activity
Objectives:
Identify network agents that collaborate with known fraudulent agents
Identify activities that align with known fraudulent behavior

Process Discovery/Improvement
Analysis: Analyze the organization structure and communication patterns to uncover process improvements or identify new processes
Objectives:
Identify process improvements through discovery of hidden process steps, communication flows, and actors
Discover undocumented or informal processes that are hidden within frequent collaboration and communication paths

Supply Chain Analysis
Analysis: Evaluate the structure of a supply network and the interactions among the entities that comprise the network to identify gaps, bottlenecks and sourcing strategies
Objectives:
Identify communication gaps that could impact dependent processes or operations
Identify strategic relationships to optimize the supply network
Identify supply nodes that create inefficiencies

SLIDE 100

Social Network → Applications (3)

Novelty/Sentiment Diffusion Analysis
Analysis: Observe how a specific topic, news article, or sentiment diffuses through a consumer network
Objectives:
Assess how target consumers/market will react to a piece of news or campaign
Evaluate how long news, data, or sentiment will be retained within a system and how far it will spread

Market Influencer Identification
Analysis: Monitor and analyze connections within social media networks to identify markets or consumers that are influential within communities
Objectives:
Identify individuals or groups that influence markets and adoption
Identify untapped markets
Identify market segments as targets for ad campaigns to improve product/service adoption

Consumer Segmentation
Analysis: Analyze the connections and consumer attributes within the target market to discover communities or groups with common characteristics
Objectives:
Improve product or service offerings based on attributes that connect the consumer market
Develop strategies to target new or existing consumers based on identified segmentation characteristics

Product or Brand Diffusion Analysis
Analysis: Analyze the flow of communication or ideas through a market segment to evaluate how a product may diffuse
Objectives:
Identify segments or individuals that will be likely early adopters
Identify incentives or campaigns that will improve product/service adoption

Recommendation Systems
Analysis: Analyze consumer network connections and common features among consumers to develop recommendations
Objectives:
Identify new feature sets for products and services
Assess new markets for selling similar or new products
Target consumers with specific products or services
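
The recommendation idea above, using network connections to target consumers with specific products, can be sketched as a simple neighbor vote: suggest items that a consumer's connections bought and the consumer has not. The consumers, products, and the `recommend` function are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical consumer network and purchase history.
FRIENDS = {
    "ana": {"ben", "cho"},
    "ben": {"ana", "cho"},
    "cho": {"ana", "ben", "dev"},
    "dev": {"cho"},
}
PURCHASES = {
    "ana": {"tent"},
    "ben": {"tent", "stove"},
    "cho": {"stove", "lantern"},
    "dev": {"lantern"},
}


def recommend(consumer, friends, purchases, top_n=2):
    """Rank products bought by a consumer's connections but not by the consumer."""
    owned = purchases.get(consumer, set())
    votes = Counter()
    for friend in friends.get(consumer, set()):
        for product in purchases.get(friend, set()) - owned:
            votes[product] += 1
    # Sort by vote count (descending), then by name so ties are deterministic.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, _ in ranked[:top_n]]


print(recommend("ana", FRIENDS, PURCHASES))
print(recommend("dev", FRIENDS, PURCHASES))
```

This uses only direct connections; weighting votes by tie strength or by shared consumer attributes would be a natural extension.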

SLIDE 101

Social Network → Tools (1)

Tool | Overview | Network Analysis | Network Visual | Network Manipulation

SNAP
A general purpose network analysis and graph mining library for C++
X X

Statnet
A package for R that provides capabilities for social network statistical analysis
X

libSNA, graphTool, networkX
Python libraries for network analysis and manipulation
X X

JUNG
Java package for network analysis and modeling
X X X

NodeXL
Excel plug-in that provides an easy to use and interactive interface to explore and visualize networks
X X

SLIDE 102

Social Network → Tools (2)

Tool | Overview | Network Analysis | Network Visual | Network Manipulation

Gephi
Interactive open source platform for network analysis and visualization
X X X

Ucinet
Commercial social network analysis tool with separate visualization component
X X

Graphviz
Open source graph visualization package
X

NetMiner
Proprietary package that provides the ability to develop and implement custom algorithms
X X X

KXEN SNA
Network analysis package that provides predictive analytics and customer MDM integration
X X X

ProM
Open source package for mining business process networks
X X X

Cytoscape
Open source tool for network modeling and analysis. Can connect to external data sources
X X X

Network Workbench
Large-Scale Network Analysis, Modeling and Visualization Toolkit for Biomedical, Social Science and Physics Research
X X X

SLIDE 103

Contents

Deep QA/Mind/Brain Systems

5

SLIDE 104

DeepQA/Mind/Brain

What is DeepQA?

DeepQA forms the core of Watson, the open domain question analysis and answering system
The DeepQA stack is comprised of a set of search, NLP, learning, and scoring algorithms
DeepQA operates on a distributed computing infrastructure that leverages MapReduce and the Unstructured Information Management Architecture (UIMA)

What is the target problem set?

Understanding the meaning and context of human language
Searching and retrieving information from a large library of unstructured information
Identifying accurate and precise answers to questions that are complex and must be sourced from a large knowledge set
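
The MapReduce pattern mentioned above can be sketched in a few lines. This is a single-process illustration of the map, shuffle, and reduce phases using word counting; it is not Hadoop's API, and the documents and function names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical mini-corpus standing in for a large unstructured library.
DOCUMENTS = [
    "watson answers questions",
    "watson searches documents",
    "questions need answers",
]


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


mapped = chain.from_iterable(map_phase(doc) for doc in DOCUMENTS)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster each `map_phase` call could run on a different machine, which is what lets the same logic scale to very large document collections.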

SLIDE 105

DeepQA Infrastructure Technology

Data Management and Search

Technology | Links

Unstructured Information Architecture: UIMA
SQL Server: MySQL, Apache Derby
Java Natural Language Toolkit: OpenNLP, Stanford NLP
Map/Reduce: Apache Hadoop
Commonsense Knowledgebase: OpenCyc, Open Mind Common Sense
Triple Store: Apache Jena, OpenAnzo
Text Search: Lucene, OpenFTS

SLIDE 106

DeepQA Infrastructure Technology

Platform and Administration

Technology | Links

Web Server: Apache
Virtualization Host: VMware, Xen
Distributed File System: Apache Hadoop, OpenAFS
File Management/Archival: rsync
OS: Fedora
Cloud Management: Extreme Cloud Administration, OpenNebula

SLIDE 107

Business Applications

Knowledge Discovery
Overview: Search internal and external unstructured/structured information assets to uncover previously unknown knowledge
Objectives:
Identify information about a subject through deep analysis of internal and external information sources
Answer questions about a business problem or trend that may be difficult to analyze within traditional data sources

E-Discovery
Overview: Search documents and communications to uncover relevant information associated with a specific topic
Objectives:
Identify business topics and trends within communication and documents
Search for non-compliance activities within internal and external data sources

Contract Evaluations
Overview: Search through single or multiple contracts to answer specific questions about the nature of the contract
Objectives:
Identify key facts or issues that comprise a contract or sets of contracts
Identify contracts or legal documents that contain similar entities or features

Relationship Management
Overview: Provide the ability to interact with consumers providing precise responses to technical and open domain questions
Objectives:
Provide a platform for automatically answering consumer questions about products or services
Reduce reliance on call centers and improve interaction with consumers

Consumer Discovery
Overview: Search consumer communications, social media, and sales information to identify opportunities and demographics
Objectives:
Identify background information about consumers
Identify consumer qualities that create risks or represent opportunities

Technical Troubleshooting
Overview: Find answers to technical and process problems through
Objectives:
Utilize unstructured data and communications to identify solutions or root causes to system and process problems

SLIDE 108

Areas for Further Research

Infrastructure/Tools and Search Technologies/Concepts

Topic | Research

Tools

Hadoop Map/Reduce: The tool is used to distribute queries, analysis, and other processing activities across multiple CPUs. Further research is required to understand the tool's architecture and how to integrate it with other toolkits (OpenNLP, UIMA, Lucene, etc.).
OpenNLP: A Java library for NLP tasks. Need to evaluate the tool's capabilities and gaps as well as how it can be incorporated into UIMA.
OpenCYC: An open common sense reasoning platform. Need to better understand the tool's role as well as how it fits with the other technologies.
UIMA: An architecture for managing unstructured data. Further research is needed to understand how to run it in parallel and how the SDK can be applied to NLP activities.
Lucene: A text search platform. Further research is needed to understand the library and how to incorporate it into UIMA.

Search

Text Search Scoring: Algorithms are used to score search results based on their alignment with the question. Further research is needed to understand what models and scoring metrics can be applied to search results at various phases of DeepQA.
Triple Store Search: Triple stores maintain data in a subject-predicate-object structure and are used for turning around quick facts. Further research is needed to understand the philosophy and technologies behind these data storage mechanisms.
Commonsense Reasoning: Research is required to understand this branch of AI, its technologies, and its role within DeepQA.
Document/Information Retrieval: Generate research on information and document retrieval practices. Technologies and algorithms need to be reviewed. Falls within a broader research topic for enterprise search.
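
The subject-predicate-object structure used by triple stores can be illustrated with a minimal in-memory sketch. The `TripleStore` class and its `match` method are hypothetical names for this example and do not reflect the API of Apache Jena or OpenAnzo; the sample facts restate points made earlier in this deck.

```python
class TripleStore:
    """Tiny in-memory store of (subject, predicate, object) facts."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self._triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        )


store = TripleStore()
store.add("DeepQA", "coreOf", "Watson")
store.add("DeepQA", "leverages", "MapReduce")
store.add("DeepQA", "leverages", "UIMA")

# Quick fact lookup: what does DeepQA leverage?
print(store.match(subject="DeepQA", predicate="leverages"))
```

Because every fact has the same three-part shape, a lookup is just a pattern with some positions left open, which is why triple stores suit quick fact retrieval.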

SLIDE 109

Areas for Further Research

Machine Learning and Natural Language Processing

Topic | Description

Machine Learning

MetaLearners: Research the concept and how they are used to evaluate learning models and assign a confidence score based on the learning models that are used to rank search results
Question Classification: Identify techniques and models that can be employed to analyze and classify questions
Search Ranking Models: Research models that are available for ranking search results based on the various search and recall techniques that are employed for a question

NLP

Logical Form Analysis: Research how SNA is used to discover logical relationships within text and produce an understanding about the information within the text
Semantic Structure Analysis: Identify tools and algorithms that are employed to uncover semantic relationships within texts/phrases and how these relationships can be applied to extract relevant information for question analysis and search
Relationship Analysis: Research techniques and tools for uncovering temporal, geospatial and spatial relationships within a knowledge set
Feature Extraction: Evaluate tools and algorithms that are used to extract features of entities from text and identify methods for structuring the data for search
Phrase Analysis: Identify algorithms and tools that can be applied to extract key phrases from text based on a search context
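
As one possible starting point for the search ranking topic, the sketch below scores candidate passages against a question with a simple TF-IDF sum. This is a generic textbook technique, not the ranking model used in DeepQA; the passages, the `rank` function, and the `log(N/df) + 1` smoothing are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical candidate passages returned by an earlier search step.
PASSAGES = {
    "p1": "hadoop distributes processing across many machines",
    "p2": "lucene is a text search library",
    "p3": "a triple store keeps subject predicate object facts",
}


def tokenize(text):
    return text.lower().split()


def rank(question, passages):
    """Order passage ids by a TF-IDF score against the question terms."""
    docs = {pid: Counter(tokenize(text)) for pid, text in passages.items()}
    n_docs = len(docs)
    scores = {}
    for pid, counts in docs.items():
        score = 0.0
        for term in set(tokenize(question)):
            df = sum(1 for c in docs.values() if term in c)
            if df == 0:
                continue  # term appears in no passage, so it cannot help
            idf = math.log(n_docs / df) + 1.0  # assumed smoothing variant
            score += counts[term] * idf
        scores[pid] = score
    # Highest score first; passage id breaks ties deterministically.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


ranking = rank("which library supports text search", PASSAGES)
print(ranking)
```

A learned ranking model would replace the hand-written score with one trained on question and answer pairs, but the interface (question in, ordered candidates out) stays the same.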

SLIDE 110