Big Data and Internet Thinking Chentao Wu Associate Professor - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Big Data and Internet Thinking

Chentao Wu Associate Professor

  • Dept. of Computer Science and Engineering

wuct@cs.sjtu.edu.cn

slide-2
SLIDE 2

Download lectures

  • ftp://public.sjtu.edu.cn
  • User: wuct
  • Password: wuct123456
  • http://www.cs.sjtu.edu.cn/~wuct/bdit/
slide-3
SLIDE 3

Schedule

  • lec1: Introduction on big data, cloud computing & IoT
  • lec2: Parallel processing framework (e.g., MapReduce)
  • lec3: Advanced parallel processing techniques (e.g., YARN, Spark)

  • lec4: Cloud & Fog/Edge Computing
  • lec5: Data reliability & data consistency
  • lec6: Distributed file system & object-based storage
  • lec7: Metadata management & NoSQL Database
  • lec8: Big Data Analytics
slide-4
SLIDE 4

Collaborators

slide-5
SLIDE 5

Contents

Big Data Analytics

1

slide-6
SLIDE 6

Big Data Challenges

slide-7
SLIDE 7

It’s not just about the data…

Big Data (refers to the DATA only) + Big Data Analytics (methods of using Big Data to generate insight)

  • 1. Machine Learning/Deep Learning: Leveraging a computer’s ability to learn without being explicitly programmed to solve business problems
  • 2. IoT (Internet of Things) & Sensor Analytics: Understanding value drivers from the ever-growing network of connected physical objects and the communication between them
  • 3. Modeling Willingness-to-Pay: Mining product reviews to estimate willingness-to-pay for product features
  • 4. Natural Language Processing: Understanding human speech as it is spoken through application of computer science, AI, and computational linguistics
  • 5. Analyzing Data @ Scale: Using distributed computing and machine learning tools to analyze hundreds of gigabytes of data
  • 6. Creating a Lake Streaming Consumer Behavior Data: Mining social data in real time to understand when and where consumers are making choices

  • It is important to understand the distinction between Big Data sets (large, unstructured, fast, and uncertain data) and ‘Big Data Analytics’.

slide-8
SLIDE 8

It’s also about what, how, and why you use it

  • Big Data Analytics, the process of harnessing Big Data to yield actionable insights, is a combination of five key elements:

  • Decisions: The value of Big Data Analytics is driven by the unique decisions facing leaders, companies, and countries today. In turn, the type, frequency, speed, and complexity of decisions drive how Big Data Analytics is deployed.
  • Analytics: To leverage the variety and volume of Big Data while managing its volatility, advanced analytical approaches are necessary, such as natural language processing, network analysis, simulative modeling, artificial intelligence, etc.
  • Data: Big Data Analytics is about operationalizing new and more data, but it is also about data quality, data interoperability, data disaggregation, and the ability to modularize data structures to quickly absorb new data and new types of data.
  • Technology: Storing, managing, and using Big Data often requires investments in new technologies and data processing methods, such as distributed processing (e.g., Hadoop), NoSQL storage, and Cloud computing.
  • Mindset & Skills: Big Data Analytics requires firm commitment to using analytics in decision-making; a decisive mentality capable of employing in-the-moment intelligence; and investment in analytical technology, resources, and skills.

slide-9
SLIDE 9

Big Data Analytical Capabilities

  • Continuing increases in processing capacity have opened the door to a range of advanced algorithms and modeling techniques that can produce valuable insights from Big Data.

Chart axes: Traditional to Emerging; Structured to Unstructured

  • A/B/N Testing: Experiment to find the most effective variation of a website, product, etc.
  • Sentiment Analysis: Extract consumer reactions based on social media behavior
  • Complex Event Processing: Combine data sources to recognize events
  • Predictive Modeling: Use data to forecast or infer behavior
  • Regression: Discover relationships between variables
  • Time Series Analysis: Discover relationships over time
  • Classification: Organize data points into known categories
  • Simulation Modeling: Experiment with a system virtually
  • Spatial Analysis: Extract geographic or topological information
  • Cluster Analysis: Discover meaningful groupings of data points
  • Signal Analysis: Distinguish between noise and meaningful information
  • Visualization: Use visual representations of data to find and communicate info
  • Network Analysis: Discover meaningful nodes and relationships on networks
  • Optimization: Improve a process or function based on criteria
  • Deep QA: Find answers to human questions using artificial intelligence
  • Natural Language Processing: Extract meaning from human speech or writing

slide-10
SLIDE 10

Forward-Looking vs. Rear-View Analytics

  • Big Data Analytics improves the speed and efficiency with which we understand the past, and opens up entirely new avenues for preparing for and adapting to the future.

Chart axes: Increasing Business Value; Increasing Sophistication of Data & Analytics (Rear-view to Forward-looking)

Descriptive Analytics: What happened?

Describe, summarize and analyze historical data

  • Observed behavior or events
  • Non-traditional data sources such as social listening and web crawling

Diagnostic Analytics: Why did it happen?

Identify causes of trends and outcomes

  • Observed behavior or events
  • Non-traditional data sources such as social listening and web crawling
  • Statistical and regression analysis
  • Dynamic visualization

Predictive Analytics: What could happen?

Predict future outcomes based on the past

  • Forward-looking view of current and future value
  • Sentiment Scoring
  • Graph analysis and Natural Language Processing to identify hidden relationships and themes

Prescriptive Analytics: What should be done?

Recommend ‘right’ or optimal actions or decisions

  • Dual objective models
  • Behavioral economics
  • Real-time product and service propositions (graph analysis, entity resolution on data lakes to infer present customer need)
  • Rapid evaluation of multiple ‘what-if’ scenarios
  • Optimization decisions and actions

Continuous Analytics: How do we adapt to change?

Monitor, decide, and act autonomously or semi-autonomously

  • Monitor results on a continuous basis
  • Dynamically adjust strategies based on changing environment and improved predictions
  • Agent-based and dynamic simulation models, time-series analysis
slide-11
SLIDE 11

Examples of Big Data Analytics in Action

  • Market Leaders are leveraging Big Data Analytics to generate value by starting with a business need and focusing on implementing actionable insights quickly and decisively.

Table columns: Company; Business Need; Data and Analytics; Impact (flow: Business Need, Big Data Analytics, Impact)

Example 1

  • Business Need: Greater tailoring of credit card offers to fit customer needs
  • Data and Analytics: Statistical model based on public credit and demographic data to target customized products to customers
  • Impact: Net revenue grew at a CAGR of 32% from 1994 to 2003; prompted competitors to shift focus to data and analytics

Example 2

  • Business Need: Data-enabled engine prognostics, monitoring, maintenance and repair
  • Data and Analytics: Analysis of sensor data from hundreds of sensors in 4,000 engines to identify and solve issues weeks in advance
  • Impact: Over 70% annual revenue from the aircraft engine division attributable to this service

Example 3

  • Business Need: Search-to-purchase conversion by anticipating intent of a shopper’s search and delivering relevant results
  • Data and Analytics: Semantic search, which enables discovery using algorithms that rank results via social signals from around the web
  • Impact: Increases by 10-15% the likelihood that a customer will complete their purchase, translating to millions of dollars in revenue

Example 4

  • Business Need: Transformation from subscription streaming service to original content producer
  • Data and Analytics: Analysis of data from 66 million subscribers’ viewing habits and preferences
  • Impact: Revenue and subscriber base increased by 15% and 9% respectively in 2013

Example 5

  • Business Need: Leverage Internet of Things (IoT) by connecting machines to facilitate data-enabled prognostics, increase efficiency and reduce downtime
  • Data and Analytics: Launched software to help airlines and railroads move their data to the cloud and predict mechanical malfunctions, improve safety, and reduce trip cancellations and cost
  • Impact: Estimated 1% reduction in fuel costs, projected to save the airline industry $30 billion over 15 years

slide-12
SLIDE 12

Big Data Analytics in Development

  • Big Data Analytics is making an equally impressive impact on Development interventions, allowing decision-makers to reach and serve previously neglected populations.

Table columns: Company; Business Need; Data and Analytics; Impact (flow: Business Need, Big Data Analytics, Impact)

Example 1

  • Business Need: More transparent, reliable, and low-cost method to track inflation in Argentina
  • Data and Analytics: Web scraping of online price data used to produce price indices, and econometric analysis used to model disaggregated impacts of policies
  • Impact: Government statistical offices shifting to accept Big Data. Central banks using Big Data to see day-to-day volatility.

Example 2

  • Business Need: Understand how migrants act as arbitrageurs to bring labor markets into equilibrium
  • Data and Analytics: Iterative analysis of call detail records (CDRs) to track movement of migrants in response to local shocks to labor demand (weather, economy, conflict, etc.)
  • Impact: Informing labor policy design in low-income countries to incentivize or disincentivize migratory behavior

Example 3

  • Business Need: The city of Rio de Janeiro wanted to improve its emergency response by better predicting heavy rainfall and subsequent severe landslides and flooding
  • Data and Analytics: The city combines data from 30 city agencies (including weather, satellite, video, GPS, historic rainfall, and topographic survey data) in a central Operations Center
  • Impact: Rio has improved emergency response time by 30%, catalogued 200+ flood points, and can now predict heavy rains 48 hours in advance on a half-km basis

Example 4

  • Business Need: Create a better ecosystem for mobile services in the agricultural sectors of Kenya, Tanzania, and Mozambique
  • Data and Analytics: Remote crowdsourced data gathered via cell phones used to connect farmers to markets, assess farmers’ credit worthiness, and incubate new mobile businesses with greater predictors of success
  • Impact: M-PESA is being used to lower costs for farmers to receive loans and perform transactions with distributors and buyers, as well as to provide geography-specific market information

slide-13
SLIDE 13

Big Data Landscape

slide-14
SLIDE 14

Creating a Big Data Organization Step 1: Be Yourself

  • Beginning with a clear understanding of the specific questions you intend to use Big Data Analytics to address can help guide where and which data solutions are deployed.

Diagram labels: Strategic, Tactical, Operational; Value enablement, Value enhancement

Day-to-day operations

  • Struggle to move from narrow focus on reactive operations to more proactive, comprehensive management of daily operations
  • High value for digitization of operational processes across program units
  • Often already proficient in traditional business intelligence

Enabling strategy and improving performance

  • Use analytics to reduce political divergence and drive consensus
  • Real-time analytics to enable quick responses to events
  • Use data to develop personalized services
  • Need for more objective and higher quality data

Delivering future value

  • Data-driven decision-making in real time
  • Use analytics to develop new programs/opportunities
  • Relies heavily on data supplied by others
  • Often struggles to move away from exclusively intuitive decision-making

slide-15
SLIDE 15

Creating a Big Data Organization Step 2: Secure People & Skills

  • The competencies required of “data scientists” within an analytics organization or project converge from multiple skill domains.

  • Organization-specific Information Knowledge: Organization-specific knowledge about data assets (including enterprise “metadata”), their location, and appropriate business context for use in advanced analytics
  • Computer Science & Programming: Comfort in programming across various languages, a thorough understanding of external and internal data sources, and data gathering, storing, and retrieving methods which help combine disparate data sources to generate unique insights
  • Statistical & Mathematical: Expertise in statistical techniques, tools and languages used to run analyses that generate insights, and to effectively determine and communicate actionable insights
  • Subject Area or Domain Expertise: Deep understanding of industry, subject area, or research domain to help determine which questions need answering and on what frequency, specificity, or geography

slide-16
SLIDE 16

Creating a Big Data Organization Step 3: Let objectives dictate structure, not vice versa

  • How analytics efforts or organizations are structured (whether reporting is vertically or horizontally aligned, how interconnected or autonomous separate units are, how resources and successes are shared) can influence efficiency and impact.

Diagram: three models (Distributed Analytics, Federated Analytics, Centralized Analytics), each showing where the Analytics Competency Center, ETL, Data Warehouse, BI Applications, Metadata Repository, and Data Mart sit (LOCAL vs. CENTRAL).

Objectives

  • Adopt previously proven practices
  • Highly focused analytics support
  • Subject area-specific innovations
  • Repeatable models
  • Governance
  • Aligning analytics to organization-wide strategy

Data Warehouses, Marts, etc.

  • Distributed: Deployed locally
  • Federated: Deployed locally; some data and models shared across groups
  • Centralized: Deployed and managed centrally

Analytics Tools

  • Distributed: Managed locally
  • Federated: Managed locally, but connected to group framework
  • Centralized: Controlled centrally, with units having access to shared resources

Analytics Staff/Competencies

  • Distributed: Placed within individual units
  • Federated: Placed within individual units; skills tailored to specific region or subject matter
  • Centralized: Placed within central analytics team, available as needed to support individual units

slide-17
SLIDE 17

The ‘Hub-Spoke’ operating model often serves as a well-synchronized, connected system

Sample Hub-Spoke Interaction Model

  • 1. Central Decision Hub
  • 2. Competency Center (‘Standards’ / ‘Standardization’)
  • 3. Centers of Excellence (Regional)
  • 4. Local ‘Spokes’

Diagram labels: Global Business Strategy; Local Business Operations; Local Adoption of Practices

slide-18
SLIDE 18

Creating a Big Data Organization Step 4: Invest in Appropriate Infrastructure

  • Big Data introduces challenges related to data volume and variety, processing constraints, and new data structures that traditional data infrastructure is not equipped to support

Table columns: Objective; Considerations; Impact

Analytics Capabilities

  • Objective: Identify the type of analysis that will be conducted and define which analytics capabilities will be employed
  • Analysis Type: Dictates performance needs along with data structures and processing architecture
  • Analysis Flexibility: Interface could restrict the ability to perform analysis ad hoc and restrict ability to update
  • Analysis Structures: Support for analysis-specific data structures can improve performance and reduce analysis effort

Data Variety

  • Objective: Define the data set that will be used for the analysis, including its sources, size, and structure
  • Size: Size of data sets introduces need for scalable infrastructure and performance
  • Structure: Variability of source data models and data set structure requires data model flexibility
  • Sources: Diverse sources will require scalability, model flexibility, and flexible interfaces

Application

  • Objective: Define the timeliness and frequency of the analysis results for reporting and downstream systems
  • Frequency: Frequency of analysis will dictate the processing architecture (batch or real time)
  • Speed: The timeliness of the analysis will impact the need for scalability and performance
  • Interfaces: Inbound and outbound interfaces are defined by the use of data and required flexibility

slide-19
SLIDE 19

Contents

Architecture Design

2

slide-20
SLIDE 20

Emerging Infrastructure Options

  • To harness Big Data, storage solutions must be able to support targeted analytics capabilities, data diversity, and performance needs

Distributed Processing

Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware

NoSQL

Embedded and persisted storage that implements data models through document, graph, and dictionary structures

Cloud Computing

Cloud computing can improve flexibility, scalability and cost management and enable a cohesive business strategy across an organization

Traditional challenges being addressed…

  • Scalability issues
  • Big Data set information extraction and queries require large volumes of processing cycles that can quickly scale
  • Data storage solutions need to provide flexible data models to better ingest unstructured and semi-structured data
  • Need to combine and link multiple data sources

slide-21
SLIDE 21

Building an Analytics Organization: Critical Components

Distributed Processing

Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware

Introduction to Hadoop

  • Hadoop is based on work done by Google in the early 2000s (combination of Google File System (GFS) and MapReduce)
  • Useful for analyzing copious amounts of complex data across multiple data sources
  • Distributes data as it is initially stored in the system
  • Applications are written in high-level code
  • Computation happens where data is stored, whenever possible
  • Data is replicated multiple times on the system for increased availability and reliability

Faster and Lower Cost Analysis Linear Scalability Greater flexibility

Emerging Infrastructure – Computing/Storage Options

slide-22
SLIDE 22

Building an Analytics Organization: Critical Components

Emerging Infrastructure – Storage Options

NoSQL

Embedded and persisted storage that implement data models through document, graph, and dictionary structures

NoSQL - Storage Types

  • Document Store
  • Key-Value Store
  • Graph Store
  • Columnar Store

Chart labels: Solution Examples; Increasing Data Complexity

  • Pros: Simplicity & Scalability. Cons: Lack of advanced features/queries
  • Pros: Scalability & Flexibility. Cons: Complexity
  • Pros: Easy to Use. Cons: Scalability
  • Pros: Graph Joins. Cons: Flexibility

slide-23
SLIDE 23

Building an Analytics Organization: Critical Components

Emerging Infrastructure – System Options

Cloud Computing

The model is compelling; cloud computing can improve flexibility, scalability and cost management. Businesses best able to realize the potential will establish a cohesive business strategy as cloud computing can transform your entire organization — people, processes, and systems

Source: PwC, “Digital IQ Snapshot: Cloud,”; PwC, “FS Viewpoint: Clouds is the forecast”

Cloud transformation begins at the infrastructure level and leads to more agile applications, resulting in faster speed to market and more flexibility to meet client needs. The key benefits, beyond consolidation, include standardized application and development environments, resulting in better controlled and more efficient application lifecycles.

slide-24
SLIDE 24

Relational Reference Architecture

slide-25
SLIDE 25

Extended Relational Reference Architecture

slide-26
SLIDE 26

Non-Relational Reference Architecture

slide-27
SLIDE 27

Data Discovery: Non-Relational Architecture

slide-28
SLIDE 28

Business Reporting: Hybrid Architecture

slide-29
SLIDE 29

Contents

Big Data Algorithms

3

slide-30
SLIDE 30

Key components of Mahout in Hadoop (1)

slide-31
SLIDE 31

Key components of Mahout in Hadoop (2)

slide-32
SLIDE 32

Key Components of Spark MLlib

slide-33
SLIDE 33

Spark ML Basic Statistics

◼ Correlation: Calculating the correlation between two series of data is a common operation in Statistics

➢ Pearson’s Correlation
➢ Spearman’s Correlation

slide-34
SLIDE 34

Example of Popular Similarity Measurements

◆Pearson Correlation Similarity ◆Euclidean Distance Similarity ◆Cosine Measure Similarity ◆Spearman Correlation Similarity ◆Tanimoto Coefficient Similarity (Jaccard coefficient) ◆Log-Likelihood Similarity

slide-35
SLIDE 35

Pearson Correlation Similarity

Data: Missing Data

slide-36
SLIDE 36

On Pearson Similarity

Three problems with the Pearson Similarity:

  • 1. It does not take into account the number of items on which two users’ preferences overlap (e.g., two overlapping items yield a correlation of 1 or -1, so overlapping on more items may not give a better score).
  • 2. If two users overlap on only one item, no correlation can be computed.
  • 3. The correlation is undefined if all preference values in either series are identical.

Adding Weighting. WEIGHTED as 2nd parameter of the constructor can cause the resulting correlation to be pushed towards 1.0, or -1.0, depending on how many points are used.
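The problems listed above can be reproduced with a small sketch. This is plain Python, not the Mahout or Spark implementation; the function name `pearson` and the sample users and ratings are made up for illustration.

```python
from math import sqrt

def pearson(prefs_a, prefs_b):
    """Pearson correlation over items both users rated; None when undefined."""
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return None  # problem 2: a single shared item gives no correlation
    xs = [prefs_a[i] for i in common]
    ys = [prefs_b[i] for i in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # problem 3: a constant series has zero variance
    return cov / sqrt(var_x * var_y)

# Hypothetical users and ratings, for illustration only.
alice = {"m1": 5.0, "m2": 3.0, "m3": 2.5}
bob = {"m1": 2.0, "m2": 2.5, "m3": 5.0}
carol = {"m1": 4.0, "m2": 1.0}
dave = {"m1": 3.0, "m2": 3.0, "m3": 3.0}
erin = {"m3": 4.0}

two_item = pearson(alice, carol)    # problem 1: only two shared items
three_item = pearson(alice, bob)    # three shared items
one_item = pearson(alice, erin)     # one shared item
constant = pearson(alice, dave)     # dave rates everything the same
```

With only two shared items the score saturates at 1 even though the evidence is thin, which is the motivation for the weighting option mentioned above.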

slide-37
SLIDE 37

Spearman Correlation Similarity

Pearson value on the relative ranks (with an example for ties)
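A minimal sketch of the idea, assuming ties are handled by giving tied values the average of their rank positions (a common convention; the slide's own tie example is not recoverable here). All function names are illustrative, and the inputs are assumed to be non-constant series.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation; assumes neither series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

ranks_with_tie = average_ranks([3.0, 1.0, 3.0, 5.0])
monotone = spearman([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 400.0, 8000.0])
```

Because only the ordering matters, any strictly increasing relationship scores 1, however non-linear the raw values are.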

slide-38
SLIDE 38

Basic Spark Data Format

Data: 1.0, 0.0, 3.0

  • Dense vector: straightforward, every value is stored
  • Sparse vector (first form): number of parameters, location of non-zero indices, and non-zero values
  • Sparse vector (second form): number of parameters, sequence of non-zero values as (index, value) pairs
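The layouts described above can be illustrated with a plain Python stand-in. This is not Spark's own vector API; the helper names `to_sparse`, `to_pairs`, and `to_dense` are hypothetical and only show what each layout stores.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def to_pairs(dense):
    """Return (size, [(index, value), ...]) for the non-zero entries."""
    return len(dense), [(i, v) for i, v in enumerate(dense) if v != 0.0]

def to_dense(size, indices, values):
    """Rebuild the full list from the index/value layout."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

dense = [1.0, 0.0, 3.0]
sparse = to_sparse(dense)      # size, non-zero indices, non-zero values
pairs = to_pairs(dense)        # size, (index, value) pairs
round_trip = to_dense(*sparse)
```

The sparse layouts pay off when most entries are zero, since only the non-zero positions are stored.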

slide-39
SLIDE 39

Correlation Example in Spark

1.0, 0.0, 0.0, -2.0
4.0, 5.0, 0.0, 3.0
6.0, 7.0, 0.0, 8.0
9.0, 0.0, 0.0, 1.0

slide-40
SLIDE 40

Euclidean Distance Similarity

Similarity = 1 / ( 1 + d )
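A minimal sketch of the formula above, assuming d is the Euclidean distance over the items both users rated. The function name and the handling of users with no shared items are assumptions made for this example.

```python
from math import sqrt

def euclidean_similarity(prefs_a, prefs_b):
    """Map Euclidean distance d over shared items to 1 / (1 + d)."""
    common = set(prefs_a) & set(prefs_b)
    if not common:
        return 0.0  # assumption: no shared items means no evidence of similarity
    d = sqrt(sum((prefs_a[i] - prefs_b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + d)

# Hypothetical ratings, for illustration only.
same = euclidean_similarity({"m1": 4.0, "m2": 2.0}, {"m1": 4.0, "m2": 2.0})
apart = euclidean_similarity({"m1": 1.0, "m2": 1.0}, {"m1": 4.0, "m2": 5.0})
```

Identical preferences give d = 0 and similarity 1; the score falls toward 0 as the distance grows.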

slide-41
SLIDE 41

Cosine Similarity

Cosine similarity and Pearson similarity get the same results if data are normalized (mean == 0).
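The equivalence stated above can be checked numerically. The sketch below uses plain Python with made-up data; the helper names are illustrative.

```python
from math import sqrt

def cosine(xs, ys):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = sqrt(sum(x * x for x in xs))
    norm_y = sqrt(sum(y * y for y in ys))
    return dot / (norm_x * norm_y)

def center(xs):
    """Subtract the mean so the series has mean 0."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

def pearson(xs, ys):
    """Pearson correlation from covariance and variances."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

xs = [5.0, 3.0, 2.5]
ys = [2.0, 2.5, 5.0]
raw_cosine = cosine(xs, ys)                       # on the raw ratings
centered_cosine = cosine(center(xs), center(ys))  # on mean-centered ratings
pearson_value = pearson(xs, ys)
```

On raw, all-positive ratings the cosine is positive even when the two users disagree; after centering it matches the Pearson value.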

slide-42
SLIDE 42

Caching User Similarity

Spearman Correlation Similarity is time consuming to compute. Use caching, which remembers user-user similarities that were previously computed.
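The caching idea can be sketched as a small wrapper around any similarity function. The class name `CachingSimilarity` and the stand-in `slow_similarity` are hypothetical, and the sketch assumes the wrapped measure is symmetric.

```python
class CachingSimilarity:
    """Wrap a similarity function and remember every user pair already scored."""

    def __init__(self, similarity):
        self._similarity = similarity
        self._cache = {}
        self.misses = 0  # how many times the wrapped function actually ran

    def __call__(self, user_a, user_b):
        # Assumes a symmetric measure, so (a, b) and (b, a) share one entry.
        key = (user_a, user_b) if user_a <= user_b else (user_b, user_a)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._similarity(*key)
        return self._cache[key]

def slow_similarity(user_a, user_b):
    """Hypothetical stand-in for an expensive measure such as Spearman."""
    return 1.0 / (1.0 + abs(len(user_a) - len(user_b)))

cached = CachingSimilarity(slow_similarity)
first = cached("alice", "bob")
second = cached("bob", "alice")  # served from the cache
```

The trade-off is memory: the cache grows with the number of distinct user pairs, so a bounded cache may be preferable for large user bases.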

slide-43
SLIDE 43

Tanimoto (Jaccard) Coefficient Similarity

Discard preference values
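A minimal sketch, assuming each user's preferences are held in a dictionary keyed by item: only the item sets matter, and the rating values are ignored. The names and data are illustrative.

```python
def tanimoto(prefs_a, prefs_b):
    """Size of the intersection of the item sets divided by the size of the union."""
    items_a, items_b = set(prefs_a), set(prefs_b)
    union = items_a | items_b
    if not union:
        return 0.0  # assumption: two empty profiles are treated as dissimilar
    return len(items_a & items_b) / len(union)

# Hypothetical ratings; the values play no role in the result.
alice = {"m1": 5.0, "m2": 1.0, "m3": 3.0}
bob = {"m2": 4.5, "m3": 0.5, "m4": 2.0}
score = tanimoto(alice, bob)  # 2 shared items out of 4 distinct items
```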

slide-44
SLIDE 44

Log-Likelihood Similarity

Assess how unlikely it is that the overlap between the two users is just due to chance.
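One way to make this concrete is the log-likelihood ratio (G²) statistic on a 2x2 table of item counts: items both users have, items only one of them has, and items neither has. The sketch below is plain Python under that assumption; the mapping of the statistic to a 0..1 score via 1 - 1/(1 + LLR) is one possible squashing, chosen here for illustration.

```python
from math import log

def log_likelihood_ratio(k11, k12, k21, k22):
    """G-squared statistic for a 2x2 table of counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for r in range(2):
        for c in range(2):
            o = observed[r][c]
            if o > 0:  # empty cells contribute nothing
                expected = rows[r] * cols[c] / total
                g2 += o * log(o / expected)
    return 2.0 * g2

def llr_similarity(items_a, items_b, n_items):
    """Squash the statistic into [0, 1); n_items is the catalogue size."""
    both = len(items_a & items_b)
    only_a = len(items_a - items_b)
    only_b = len(items_b - items_a)
    neither = n_items - both - only_a - only_b
    llr = log_likelihood_ratio(both, only_a, only_b, neither)
    return 1.0 - 1.0 / (1.0 + llr)

# Hypothetical item sets out of a catalogue of 20 items.
chance_level = llr_similarity(set(range(0, 10)), set(range(5, 15)), 20)
strong = llr_similarity(set(range(0, 10)), set(range(0, 10)), 20)
```

In the first case the overlap (5 of 10 items each, out of 20) is exactly what independence predicts, so the score is 0; in the second the overlap is far larger than chance and the score approaches 1.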

slide-45
SLIDE 45

Performance Measurements

Using GroupLens data (http://grouplens.org): the 10-million-rating MovieLens dataset.

  • Spearman: 0.8
  • Tanimoto: 0.82
  • Log-Likelihood: 0.73
  • Euclidean: 0.75
  • Pearson (weighted): 0.77
  • Pearson: 0.89
slide-46
SLIDE 46

Spark ML Basic Statistics

  • Hypothesis testing: Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant. Spark ML currently supports Pearson’s Chi-squared (χ²) tests for independence.
  • ChiSquareTest conducts Pearson’s independence test for every feature against the label.

slide-47
SLIDE 47

Chi-Square Tests (1)

slide-48
SLIDE 48

Chi-Square Tests (2)

slide-49
SLIDE 49

Chi-Square Tests (3)

We would reject the null hypothesis that there is no relationship between location and type of malaria. Our data tell us there is a relationship between type of malaria and location.
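The test behind this conclusion can be sketched in plain Python. The contingency counts below are illustrative values chosen for this example (the slide's own table is not recoverable here), and 9.488 is the standard 5% critical value of the chi-square distribution with 4 degrees of freedom.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Illustrative counts: rows are malaria types, columns are locations.
observed = [
    [31, 2, 53],
    [14, 5, 45],
    [45, 53, 2],
]
CRITICAL_5PCT_DF4 = 9.488  # chi-square critical value, alpha = 0.05, 4 d.o.f.

statistic, dof = chi_square_statistic(observed)
reject_null = statistic > CRITICAL_5PCT_DF4
```

A statistic far above the critical value means the observed counts are very unlikely under independence, so the null hypothesis of no relationship is rejected.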

slide-50
SLIDE 50

Chi-Square Tests in Spark

slide-51
SLIDE 51

Spark ML Clustering

slide-52
SLIDE 52

Example: Clustering

Feature Space

slide-53
SLIDE 53

Clustering

slide-54
SLIDE 54

Clustering – on feature plane

slide-55
SLIDE 55

Clustering example

slide-56
SLIDE 56

Steps on Clustering

slide-57
SLIDE 57

Making Initial Cluster Centers

slide-58
SLIDE 58

K-means Clustering

slide-59
SLIDE 59

HelloWorld Clustering Scenario Result

slide-60
SLIDE 60

Testing different distance measures

slide-61
SLIDE 61

Manhattan and Cosine distances

slide-62
SLIDE 62

Tanimoto distance and weighted distance

slide-63
SLIDE 63

Results Comparison

slide-64
SLIDE 64

Sample Code of K-Means Clustering in Spark

slide-65
SLIDE 65

Vectorization Example

0: Weight 1: Color 2: Size

slide-66
SLIDE 66

Canopy Clustering (estimate the number of clusters)

Tell what size clusters to look for. The algorithm will find the number of clusters that have approximately that size. The algorithm uses two distance thresholds. This method prevents all points close to an already existing canopy from being the center of a new canopy.
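A minimal sketch of the two-threshold idea, assuming a loose threshold T1 (canopy membership) and a tight threshold T2 < T1 (points this close to a center can no longer become centers). This is plain Python rather than the Mahout implementation; the function name and sample points are illustrative.

```python
from math import dist

def canopy_clustering(points, t1, t2):
    """Return a list of (center, members) canopies using thresholds t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates.pop(0)  # next remaining point becomes a center
        members = [p for p in points if dist(center, p) <= t1]
        canopies.append((center, members))
        # Points within t2 of this center may not start a new canopy.
        candidates = [p for p in candidates if dist(center, p) > t2]
    return canopies

# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0),
          (10.0, 10.0), (10.5, 10.0), (11.0, 10.0)]
canopies = canopy_clustering(points, t1=3.0, t2=1.5)
centers = [center for center, _ in canopies]
```

The number of canopies found this way can serve as an estimate of k, and the canopy centers as initial centers, for a subsequent k-means run.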

slide-67
SLIDE 67

Other Clustering Algorithms

Hierarchical clustering

slide-68
SLIDE 68

Different Clustering Algorithms

https://github.com/HewlettPackard/cacti

slide-69
SLIDE 69

Spark ML Classification

slide-70
SLIDE 70

Spark ML Classification

slide-71
SLIDE 71

Classification - definition

slide-72
SLIDE 72

Classification example: using SVM to recognize a Toyota Camry

slide-73
SLIDE 73

Classification example: using SVM to recognize a Toyota Camry

slide-74
SLIDE 74

When to use Big Data System for Classification?

slide-75
SLIDE 75

The advantage of using Big Data System for Classification

slide-76
SLIDE 76

How does a classification system work?

slide-77
SLIDE 77

Key Terminology for Classification

slide-78
SLIDE 78

Input and Output of a classification model

slide-79
SLIDE 79

Four types of values for predictor variables

slide-80
SLIDE 80

Sample data that illustrates all four values

slide-81
SLIDE 81

Supervised vs. Unsupervised Learning

slide-82
SLIDE 82

Work flow in a typical classification project

slide-83
SLIDE 83

Classification Example – Color-Fill

Position looks promising, especially the x-axis ==> predictor variable. Shape seems to be irrelevant. Target variable is “color-fill” label.

slide-84
SLIDE 84

Classification Example – Color-Fill (another feature)

slide-85
SLIDE 85

Fundamental classification algorithm

Example of fundamental classification algorithms:

  • Naive Bayesian
  • Complementary Naive Bayesian
  • Stochastic Gradient Descent (SGD)
  • Random Forest
  • Support Vector Machines
slide-86
SLIDE 86

Choose algorithm

slide-87
SLIDE 87

Support Vector Machine (SVM)

  • Maximize boundary distances; remembering “support vectors”
  • Nonlinear kernels

slide-88
SLIDE 88

Example SVM code in Spark

slide-89
SLIDE 89

Contents

Tools Support

4

slide-90
SLIDE 90

Data Mining, Text Mining, and Natural Language Processing

Extraction of implicit, previously unknown, and potentially useful information from data

Data Mining

Analysis of large quantities of natural language text and detecting lexical or linguistic usage patterns to extract probably useful information

Text Mining Natural Language Processing

NLP is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications.

slide-91
SLIDE 91

NLP Tools

Table columns: Tool; Description; Analysis Type

OpenNLP: A machine learning based toolkit for the processing of natural language text.

  • Tokenization
  • Sentence segmentation
  • Part-of-speech tagging
  • Named entity extraction
  • Chunking, parsing
  • Coreference resolution

GATE: A Java suite of tools that can perform natural language processing tasks for multiple languages.

  • Information extraction
  • Part-of-speech tagging
  • Tokenizer
  • Sentence splitter

NLTK: A suite of libraries and programs for symbolic and statistical natural language processing for Python.

  • Information extraction
  • Part-of-speech tagging
  • Tokenizer
  • Word categorization
  • Text classification

Stanford NLP: Statistical NLP toolkits for various computational linguistics problems that can be incorporated into applications with human language technology needs.

  • Tokenization
  • Part-of-speech tagging
  • Named entity recognition
  • Parsing
  • Classification
  • Segmentation
  • Coreference resolution

LingPipe: A tool kit for processing text using computational linguistics.

  • Sentiment analysis
  • Entity recognition
  • Clustering
  • Topic classification
  • Part-of-speech tagging
  • Sentence detection
  • Disambiguation

MontyLingua: A suite of libraries and programs for symbolic and statistical natural language processing for both Python and Java.

  • Information extraction
  • Part-of-speech tagging
  • Tokenizer
  • Word categorization
  • Text generation
  • Stemming
  • Phrase chunking

Rosetta Linguistic Platform: A suite of linguistic analysis components that integrate into applications for mining unstructured data.

  • Language identification
  • Name, places, and key concept extraction
  • Name matching
  • Name translation
slide-92
SLIDE 92

Text Mining/Analytics Tools

Too

  • ol

Des Descrip iption Ana nalysis is Type

Rap apid idMin iner An open source environment for machine learning, data mining, text mining, predictive analytics, and business analytics. Link

  • Document classification
  • Sentiment analysis
  • Topic tracking
  • Data mining
  • Traditional analytics

SAS S Text t Miner A suite of text processing and analysis tools. Link,

  • Text Parsing
  • Filtering
  • Feature Extraction
  • Topic Clustering

VisualT lText Integrated development environment for building information extraction systems, natural language processing systems, and text analyzers. Link

  • Information extractions
  • Summarization
  • Categorization
  • Data Mining
  • Document Filtering
  • Natural Language

Search SAS S Sentim timent t Analy alysis is Commercial tool that is dedicated to customer sentiment analysis. Link

  • Customer sentiment

monitoring

  • sentiment discovery

Textif tifie ier Tool for sorting large amounts of unstructured text with The Public Comment Analysis Toolkit (PCAT). Link

  • Topic modeling,
  • Information retrieval
  • Document analysis
  • Social media analysis

Infin nfinite Ins nsig ight System for automatically preparing and transforming unstructured text attributes into a structured representation. Link

  • Term frequency
  • Term frequency inverse
  • Document frequency
  • Root word coding
  • synonym identification
  • Customization of stop

words

  • Stemming rules
  • Concepts merging

Clusti tify fy Software for grouping related documents into clusters, providing an overview of the document set and aiding with categorization. Link

  • Document clustering
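
Several of the analysis types listed above (term frequency, inverse document frequency, customizable stop words) can be illustrated in a few lines. The following is a minimal sketch in plain Python, not the method of any tool in the table; the stop-word list, the tokenizer, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical stop-word list; real tools let users customize this.
STOP_WORDS = {"the", "a", "of", "and", "is"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (word.strip(".,;:!?").lower() for word in text.split())
    return [t for t in tokens if t and t not in stop_words]


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Term frequency times log-scaled inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


reviews = [
    "The battery life is great",
    "The screen is great",
    "Battery drains fast",
]
weights = tf_idf(reviews)
```

In this sketch a term that appears in every document receives weight zero, while a term unique to one document (such as "life" in the first review) is weighted higher than a term shared across documents.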
slide-93
SLIDE 93

Text Mining/Analytics Tools Cont.

Tool | Description | Analysis Type

Attensity Analyze: Customer analytics applications that help analyze high volumes of customer conversations across multiple channels.

  • Unstructured communication analysis
  • Sentiment analysis
  • Consumer profiling

ReVerb: A program that automatically identifies and extracts binary relationships from English sentences.

  • Information extraction
  • Topic identification
  • Topic linking

Open Text Summarizer: Open source tool for summarizing texts.

  • Document summarization

Open Calais: Web-based API that is used to analyze content and extract topics or information.

  • Attribute/feature extraction
  • Fact identification

Knowledge Search: Family of techniques and tools for searching and organizing large data collections.

  • Semantic analysis

KH Coder: Free software for quantitative content analysis or text mining.

  • Text parsing
  • Document search
  • Network analysis
slide-94
SLIDE 94

Image Analytics Overview

Overview

  • The process of pulling relevant information from an image or sets of images for advanced classification and traditional analysis
  • Applies image capture, image processing, and machine learning techniques to extract, quantify, and structure image information

Advantages

  • Provides a method to structure, organize, and search information that is stored within images
  • Offers an additional data set that can be applied to understanding consumer behavior, automating business processes, and discovering knowledge in enterprise content

slide-95
SLIDE 95

Image Analytics Tools

Tool | Overview | Image Processing | Computer Vision | Machine Learning

OpenCV: Open source library of computer vision functions that is accessible via C/C++, Java, and Python

X X X

PAXit Image Analysis: Integrated image analysis platform that provides basic feature identification functions

X X

ImageJ: Java-based image processing platform that can be accessed via an API and expanded with custom plugins

X

PIL: Python image processing library

X

PyBrain: A modular machine learning library for Python

X

slide-96
SLIDE 96

Audio Analytics Overview

Overview

  • The process of capturing audio and analyzing its features so as to extract the content and context of an event
  • Applies speech analysis and signal processing principles to structure audio information for analysis via NLP or traditional analytics techniques

Advantages

  • Provides a method for identifying events or common patterns within sound bites
  • Offers a way of capturing not only the content and topics within a conversation, but also the emotions and context

slide-97
SLIDE 97

Audio Analytics Tools

Tool | Overview | Audio Processing | Information Retrieval

CLAM: A C++ library that provides varying levels of audio processing and information retrieval capabilities

X X

CallMiner: A tool that is capable of translating calls into a more structured text data set and combining it with other communication forms

X

Nuance: Logs calls and structures audio for text-based search and retrieval

X

Yaafe: Audio feature extraction toolkit with wrappers for several languages

X

PRAAT: Multi-platform audio analysis toolkit

X

slide-98
SLIDE 98

Social Network → Applications (1)

Analysis | Objectives

Collaboration Analysis

Evaluate team structures, information flows among team members, and information exchanges with other teams to improve working structures

  • Identify team structures that are not effective
  • Identify informal organizational structures
  • Identify individuals/roles or groups that are influential to collaborative work environments

Content/Knowledge Management

Evaluate how knowledge or content is diffused and accessed within an organization

  • Improve content and knowledge distribution
  • Identify content bottlenecks, open communication flows, and establish channels
  • Explore impact of new communication methods

Community Mining

Identify groups or informal teams that share knowledge, communicate frequently, solve problems, or work together to perform specific tasks

  • Improved structures for key organizational functions
  • Improved information flows
  • Identify potential bottlenecks for organizational functions
  • Identify cultural patterns to build other communities

Organization Development

Explore formal and informal organization structures and how individuals work with one another to improve the design of the organization

  • Improve hierarchy and structure of organization to better align with the informal practices
  • Identify team members that are effective leaders and would impact the organization if promoted
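
One common way to approach objectives such as identifying influential individuals is a centrality measure over the communication graph. The sketch below computes degree centrality in plain Python; the edge list, the names in it, and the function name are hypothetical, and degree centrality is only one of several measures that could be used here.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Map each node to its number of distinct neighbors divided by (n - 1)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical "who exchanges information with whom" pairs.
communication = [
    ("ana", "ben"), ("ana", "chen"), ("ana", "dee"),
    ("ben", "chen"), ("dee", "eli"),
]
scores = degree_centrality(communication)
most_connected = max(scores, key=scores.get)
```

A high score only says that a person talks to many colleagues; measures such as betweenness would be needed to find people who bridge otherwise separate teams.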

slide-99
SLIDE 99

Social Network → Applications (2)

Analysis | Objectives

Disaster Recovery Planning

Assess organizational structures and communication patterns as they relate to the groups that play a role in disaster recovery plans

  • Identify communication improvements to disaster recovery teams
  • Identify weak links among functional groups to improve collaboration during recovery plan execution

Data/Information Dissemination

Assess how data points or information sets originate or are distributed across the enterprise to their intended targets

  • Identify overlapping information sets and bottlenecks for information dissemination
  • Assess how organization structures or information architecture impact the flow of information to its targets

Fraud Detection/Prevention

Assess the organization or external network to identify communication or collaboration patterns that align with known fraudulent activity

  • Identify network agents that collaborate with known fraudulent agents
  • Identify activities that align with known fraudulent behavior

Process Discovery/Improvement

Analyze the organization structure and communication patterns to uncover process improvements or identify new processes

  • Identify process improvements through discovery of hidden process steps, communication flows, and actors
  • Discover undocumented or informal processes that are hidden within frequent collaboration and communication paths

Supply Chain Analysis

Evaluate the structure of a supply network and the interactions among the entities that comprise the network to identify gaps, bottlenecks and sourcing strategies

  • Identify communication gaps that could impact dependent processes or operations
  • Identify strategic relationships to optimize the supply network
  • Identify supply nodes that create inefficiencies
slide-100
SLIDE 100

Social Network → Applications (3)

Analysis | Objectives

Novelty/Sentiment Diffusion Analysis

Observe how a specific topic, news article, or sentiment diffuses through a consumer network

  • Assess how target consumers/market will react to a piece of news or campaign
  • Evaluate how long news, data, or sentiment will be retained within a system and how far it will spread

Market Influencer Identification

Monitor and analyze connections within social media networks to identify markets or consumers that are influential within communities

  • Identify individuals or groups that influence markets and adoption
  • Identify untapped markets
  • Identify market segments as targets for ad campaigns to improve product/service adoption

Consumer Segmentation

Analyze the connections and consumer attributes within the target market to discover communities or groups with common characteristics

  • Improve product or service offerings based on attributes that connect the consumer market
  • Develop strategies to target new or existing consumers based on identified segmentation characteristics

Product or Brand Diffusion Analysis

Analyze the flow of communication or ideas through a market segment to evaluate how a product may diffuse

  • Identify segments or individuals that will be likely early adopters
  • Identify incentives or campaigns that will improve product/service adoption

Recommendation Systems

Analyze consumer network connections and common features among consumers to develop recommendations

  • Identify new feature sets for products and services
  • Assess new markets for selling similar or new products
  • Target consumers with specific products or services
slide-101
SLIDE 101

Social Network → Tools (1)

Tool | Overview | Network Analysis | Network Visual | Network Manipulation

SNAP: A general purpose network analysis and graph mining library for C++.

X X

Statnet: A package for R that provides capabilities for social network statistical analysis.

X

libSNA, graphTool, networkX: Python libraries for network analysis and manipulation.

X X

JUNG: Java package for network analysis and modeling.

X X X

NodeXL: Excel plug-in that provides an easy-to-use, interactive interface to explore and visualize networks.

X X

slide-102
SLIDE 102

Social Network → Tools (2)

Tool | Overview | Network Analysis | Network Visual | Network Manipulation

Gephi: Interactive open source platform for network analysis and visualization.

X X X

Ucinet: Commercial social network analysis tool with separate visualization component.

X X

Graphviz: Open source graph visualization package.

X

NetMiner: Proprietary package that provides the ability to develop and implement custom algorithms.

X X X

KXEN SNA: Network analysis package that provides predictive analytics and customer MDM integration.

X X X

ProM: Open source package for mining business process networks.

X X X

Cytoscape: Open source tool for network modeling and analysis. Can connect to external data sources.

X X X

Network Workbench: Large-scale network analysis, modeling, and visualization toolkit for biomedical, social science, and physics research.

X X X

slide-103
SLIDE 103

Contents

Deep QA/Mind/Brain Systems

5

slide-104
SLIDE 104

DeepQA/Mind/Brain

What is DeepQA?

  • DeepQA forms the core of Watson, the open domain question analysis and answering system
  • The DeepQA stack comprises a set of search, NLP, learning, and scoring algorithms
  • DeepQA operates on a distributed computing infrastructure that leverages MapReduce and the Unstructured Information Management Architecture (UIMA)

What is the target problem set?

  • Understanding the meaning and context of human language
  • Searching and retrieving information from a large library of unstructured information
  • Identifying accurate and precise answers to questions that are complex and must be sourced from a large knowledge set
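
The MapReduce pattern mentioned above can be shown without a cluster. The following single-process sketch in plain Python only mimics the three phases (map, shuffle, reduce) on a word count; it is not Hadoop code or DeepQA code, and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["who is who", "Who asked"])
```

In a real deployment each call to `map_phase` and `reduce_phase` could run on a different machine, which is what lets the work scale across a large document library.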

slide-105
SLIDE 105

DeepQA Infrastructure Technology

Data Management and Search

Technology | Links

Unstructured Information Architecture: UIMA

SQL Server: MySQL, Apache Derby

Java Natural Language Toolkit: OpenNLP, Stanford NLP

Map/Reduce: Apache Hadoop

Commonsense Knowledgebase: OpenCyc, Open Mind Common Sense

Triple Store: Apache Jena, OpenAnzo

Text Search: Lucene, Open FTS

slide-106
SLIDE 106

DeepQA Infrastructure Technology

Platform and Administration

Technology | Links

Web Server: Apache

Virtualization Host: VMware, Xen

Distributed File System: Apache Hadoop, OpenAFS

File Management/Archival: rsync

OS: Fedora

Cloud Management: Extreme Cloud Administration, OpenNebula

slide-107
SLIDE 107

Business Applications

Overview | Objectives

Knowledge Discovery

Search internal and external unstructured/structured information assets to uncover previously unknown knowledge

  • Identify information about a subject through deep analysis of internal and external information sources
  • Answer questions about a business problem or trend that may be difficult to analyze within traditional data sources

E-Discovery

Search documents and communications to uncover relevant information associated with a specific topic

  • Identify business topics and trends within communication and documents
  • Search for non-compliance activities within internal and external data sources

Contract Evaluations

Search through single or multiple contracts to answer specific questions about the nature of the contract

  • Identify key facts or issues that comprise a contract or sets of contracts
  • Identify contracts or legal documents that contain similar entities or features

Relationship Management

Provide the ability to interact with consumers, providing precise responses to technical and open domain questions

  • Provide a platform for automatically answering consumer questions about products or services
  • Reduce reliance on call centers and improve interaction with consumers

Consumer Discovery

Search consumer communications, social media, and sales information to identify opportunities and demographics

  • Identify background information about consumers
  • Identify consumer qualities that create risks or represent opportunities

Technical Troubleshooting

Find answers to technical and process problems through

  • Utilize unstructured data and communications to identify solutions or root causes to system and process problems

slide-108
SLIDE 108

Areas for Further Research

Infrastructure/Tools and Search Technologies/Concepts

Topic | Research

Tools

Hadoop Map/Reduce: The tool is used to distribute queries, analysis, and other processing activities across multiple CPUs. Further research is required to understand the tool's architecture and how to integrate it with other toolkits (OpenNLP, UIMA, Lucene, etc.).

OpenNLP: A Java library for NLP tasks. Need to evaluate the tool's capabilities and gaps as well as how it can be incorporated into UIMA.

OpenCyc: An open common sense reasoning platform. Need to better understand the tool's role as well as how it fits with the other technologies.

UIMA: An architecture for managing unstructured data. Further research is needed to understand how to run it in parallel and how the SDK can be applied to NLP activities.

Lucene: A text search platform. Further research is needed to understand the library and how to incorporate it into UIMA.

Search

Text Search Scoring: Algorithms are used to score search results based on their alignment with the question. Further research is needed to understand what models and scoring metrics can be applied to search results at various phases of DeepQA.

Triple Store Search: Triple stores maintain data in a subject-predicate-object structure and are used for turning around quick facts. Further research is needed to understand the philosophy and technologies behind these data storage mechanisms.

Commonsense Reasoning: Research is required to understand this branch of AI, its technologies, and its role within DeepQA.

Document/Information Retrieval: Generate research on information and document retrieval practices. Technologies and algorithms need to be reviewed. Falls within a broader research topic for enterprise search.
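
The subject-predicate-object structure described for triple stores can be sketched as a small in-memory class. This is a toy illustration in plain Python, not the API of Apache Jena or OpenAnzo; the class name, method names, and predicate labels are hypothetical.

```python
class TripleStore:
    """In-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        """Record one fact; duplicates are ignored."""
        self._triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple for triple in self._triples
            if all(p is None or p == v for p, v in zip(pattern, triple))
        )


store = TripleStore()
store.add("Watson", "built_by", "IBM")
store.add("Watson", "uses", "DeepQA")
store.add("DeepQA", "uses", "UIMA")

# "What does Watson use?" becomes a pattern with one wildcard.
watson_uses = store.query(subject="Watson", predicate="uses")
```

Leaving a field as `None` turns it into a wildcard, so quick factual lookups become simple pattern matches. Production triple stores add indexes and a query language such as SPARQL on top of this idea.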

slide-109
SLIDE 109

Areas for Further Research

Machine Learning and Natural Language Processing

Topic | Description

Machine Learning

MetaLearners: Research the concept and how they are used to evaluate learning models and assign a confidence score based on the learning models that are used to rank search results

Question Classification: Identify techniques and models that can be employed to analyze and classify questions

Search Ranking Models: Research which models are available for ranking search results based on the various search and recall techniques that are employed for a question

NLP

Logical Form Analysis: Research how SNA is used to discover logical relationships within text and produce an understanding of the information within the text

Semantic Structure Analysis: Identify tools and algorithms that are employed to uncover semantic relationships within texts/phrases and how these relationships can be applied to extract relevant information for question analysis and search

Relationship Analysis: Research techniques and tools for uncovering temporal, geospatial and spatial relationships within a knowledge set

Feature Extraction: Evaluate tools and algorithms that are used to extract features of entities from text and identify methods for structuring the data for search

Phrase Analysis: Identify algorithms and tools that can be applied to extract key phrases from text based on a search context

slide-110
SLIDE 110

Thank you!