Scalable Learning Technologies for Big Data Mining - PowerPoint PPT Presentation

DASFAA 2015 Hanoi Tutorial
Gerard de Melo, Tsinghua University


slide-1
SLIDE 1

DASFAA 2015 Hanoi Tutorial

Scalable Learning Technologies for Big Data Mining

Gerard de Melo, Tsinghua University

http://gerard.demelo.org

Aparna Varde, Montclair State University

http://www.montclair.edu/~vardea/


slide-2
SLIDE 2

Big Data

Images: Caixin, Corbis

Alibaba: 31 million orders per day! (2014)

slide-3
SLIDE 3

Big Data on the Web

Source: Coup Media 2013

slide-4
SLIDE 4

Big Data on the Web

Source: Coup Media 2013

slide-5
SLIDE 5

From Big Data to Knowledge

Image: Brett Ryder

slide-6
SLIDE 6

Learning from Data

            Previous Knowledge   Preparation Time   Passed Exam?
Student 1   80%                  48h                Yes
Student 2   50%                  75h                Yes
Student 3   95%                  24h                Yes
Student 4   60%                  24h                No
Student 5   80%                  10h                No

slide-7
SLIDE 7

Learning from Data

            Previous Knowledge   Preparation Time   Passed Exam?
Student 1   80%                  48h                Yes
Student 2   50%                  75h                Yes
Student 3   95%                  24h                Yes
Student 4   60%                  24h                No
Student 5   80%                  10h                No
Student A   30%                  20h                ?

slide-8
SLIDE 8

Learning from Data

            Previous Knowledge   Preparation Time   Passed Exam?
Student 1   80%                  48h                Yes
Student 2   50%                  75h                Yes
Student 3   95%                  24h                Yes
Student 4   60%                  24h                No
Student 5   80%                  10h                No
Student A   30%                  20h                ?
Student B   80%                  45h                ?

slide-9
SLIDE 9

Machine Learning

[Diagram: data with or without labels (e.g. document D1 as the feature vector 0.324, 0.739, 0.000, 0.112) → unsupervised or supervised learning → classifier model → prediction of labels for test data, e.g. “Probably Spam!”]

slide-10
SLIDE 10

Data Mining

[Diagram: the data mining pipeline. World → data acquisition → raw data (0100101…) → preprocessing + feature engineering → data with or without labels (e.g. document D1 as the feature vector 0.324, 0.739, 0.000, 0.112) → unsupervised or supervised analysis → model → prediction, visualization, analysis results → use of new knowledge]

slide-11
SLIDE 11

Problem with Classic Methods: Scalability

Image: http://www.whistlerisawesome.com/wp-content/uploads/2011/12/drinking-from-firehose.jpg

slide-12
SLIDE 12

Scaling Up

slide-13
SLIDE 13

Scaling Up: More Features

            Previous Knowledge   Preparation Time   Passed Exam?
Student 1   80%                  48h                Yes
Student 2   50%                  75h                Yes
Student 3   95%                  24h                Yes
Student 4   60%                  24h                No
Student 5   80%                  10h                No

slide-14
SLIDE 14

Scaling Up: More Features

            Previous Knowledge   Preparation Time   ...   Passed Exam?
Student 1   80%                  48h                ...   Yes
Student 2   50%                  75h                ...   Yes
Student 3   95%                  24h                ...   Yes
Student 4   60%                  24h                ...   No
Student 5   80%                  10h                ...   No

For example:

  • words and phrases mentioned in the exam response
  • Facebook likes, websites visited
  • user interaction details in online learning

Could be many millions!

slide-15
SLIDE 15

Scaling Up: More Features

            Previous Knowledge   Preparation Time   ...   Passed Exam?
Student 1   80%                  48h                ...   Yes
Student 2   50%                  75h                ...   Yes
Student 3   95%                  24h                ...   Yes
Student 4   60%                  24h                ...   No
Student 5   80%                  10h                ...   No

Classic solution: Feature Selection

slide-16
SLIDE 16

Scaling Up: More Features

            Previous Knowledge   Preparation Time   ...   Passed Exam?
Student 1   80%                  48h                ...   Yes
Student 2   50%                  75h                ...   Yes
Student 3   95%                  24h                ...   Yes
Student 4   60%                  24h                ...   No
Student 5   80%                  10h                ...   Σ ...  No

Scalable solution: buckets with sums of original features,
e.g. “Clicked on http://physics...” and “Clicked on http://icsi.berkeley...”

slide-17
SLIDE 17

Scaling Up: More Features

            F0    F1    F2    F3    ...   Fn    Passed Exam?
Student 1   ...   ...   ...   ...   ...   ...   Yes
Student 2   ...   ...   ...   ...   ...   ...   Yes
Student 3   ...   ...   ...   ...   ...   ...   Yes
Student 4   ...   ...   ...   ...   ...   ...   No
Student 5   ...   ...   ...   ...   ...   ...   No

Feature Hashing: Use a fixed feature dimensionality n. Hash the original feature ID (e.g. “Clicked on http://...”) to a bucket number in 0 to n-1. Normalize the features and use bucket-wise sums.

The small loss of precision is usually trumped by the big gains from being able to use more features.
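A minimal sketch of the hashing trick described above; the feature names, the bucket count, and the use of CRC32 as the hash are illustrative choices, not prescribed by the slides:

```python
import zlib

def hash_features(raw_features, n=16):
    """Map arbitrary string feature IDs to a fixed-size vector of bucket sums."""
    vec = [0.0] * n
    for name, value in raw_features.items():
        bucket = zlib.crc32(name.encode("utf-8")) % n  # hash feature ID to 0..n-1
        vec[bucket] += value                           # bucket-wise sum on collisions
    return vec

# hypothetical raw features for one student
x = hash_features({"Clicked on http://physics...": 1.0,
                   "prep_time_hours": 24.0}, n=8)
```

Collisions simply add up inside a bucket, which is the small precision loss the slide mentions.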

slide-18
SLIDE 18

Scaling Up: More Training Examples

Banko & Brill (2001): Word confusion experiments (e.g. “principal” vs. “principle”)

slide-19
SLIDE 19

Scaling Up: More Training Examples

Banko & Brill (2001): Word confusion experiments (e.g. “principal” vs. “principle”)

More data often trumps better algorithms.

Alon Halevy, Peter Norvig, Fernando Pereira (2009). The Unreasonable Effectiveness of Data.

slide-20
SLIDE 20

Scaling Up: More Training Examples

Léon Bottou, Learning with Large Datasets tutorial: text classification experiments

slide-21
SLIDE 21

Background: Stochastic Gradient Descent

Images: http://en.wikipedia.org/wiki/File:Hill_climb.png, http://en.wikipedia.org/wiki/Hill_climbing#mediaviewer/File:Local_maximum.png

Move towards the optimum by approximating the gradient based on one (or a small batch of) random training examples.

The stochastic nature may help us escape local optima.

Improved variants: AdaGrad (Duchi et al. 2011), AdaDelta (Zeiler 2012)
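A toy sketch of plain SGD on a one-variable least-squares problem; the data, learning rate, and step count are made up for illustration:

```python
import random

def sgd_linear(data, lr=0.01, epochs=50, seed=0):
    """Stochastic gradient descent for least squares: one random example per step."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        x, y = rng.choice(data)        # gradient approximated from a single example
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b

# noiseless toy data generated from y = 2x + 1
data = [(x, 2 * x + 1) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b = sgd_linear(data, lr=0.1, epochs=2000)
```

Each step is cheap and never touches the full dataset, which is what makes the method attractive for big data.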

slide-22
SLIDE 22

Scaling Up: More Training Examples

Recommended tool: Vowpal Wabbit, by John Langford et al. http://hunch.net/~vw/

slide-23
SLIDE 23

Scaling Up: More Training Examples

Parallelization? Use a lock-free approach to updating the weight vector components (HogWild! by Niu, Recht, et al.)

slide-24
SLIDE 24

Problem: Where to get Training Examples?

Labeled data is expensive!

  • Penn Chinese Treebank: 2 years for 4000 sentences
  • Adaptation is difficult: Wall Street Journal ≠ Novels ≠ Twitter
  • For speech recognition, we ideally need training data for each domain, voice/accent, microphone, microphone setup, social setting, etc.

http://en.wikipedia.org/wiki/File:Chronic_fatigue_syndrome.JPG

slide-25
SLIDE 25

Semi-Supervised Learning

slide-26
SLIDE 26
  • Goal: When learning a model, use unlabeled data in addition to labeled data
  • Example: Cluster-and-label
    – Run a clustering algorithm on the labeled and unlabeled data
    – Assign each cluster's majority label to the unlabeled examples of that cluster

Image: Wikipedia
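The cluster-and-label idea above can be sketched in a few lines; this toy 1-D k-means version (crude two-point initialization, made-up data) is only an illustration, not the slides' algorithm of choice:

```python
def cluster_and_label(points, labeled, k=2, iters=20):
    """Cluster-and-label: k-means over all points, then give each cluster
    the majority label of its labeled members (1-D toy version)."""
    centers = [points[0], points[-1]]            # crude 2-cluster initialization
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: abs(p - centers[c]))].append(p)
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    def cluster_of(p):
        return min(range(k), key=lambda c: abs(p - centers[c]))
    majority = {}
    for c in range(k):
        votes = [lab for p, lab in labeled.items() if cluster_of(p) == c]
        majority[c] = max(set(votes), key=votes.count) if votes else None
    return {p: majority[cluster_of(p)] for p in points}

# two labeled points, four unlabeled ones
points = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
labels = cluster_and_label(points, labeled={1.0: "No", 5.0: "Yes"})
```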

Semi-Supervised Learning

slide-27
SLIDE 27
  • Bootstrapping or Self-Training (e.g. Yarowsky 1995)
    – Use the classifier to label the unlabelled examples
    – Add the labels with the highest confidence to the training data and re-train
    – Repeat
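A sketch of the self-training loop above. The `train`/`predict` pair here is a hypothetical nearest-neighbor "classifier" whose confidence decays with distance, purely to make the loop runnable:

```python
def self_train(labeled, unlabeled, train, predict, threshold=0.9):
    """Self-training: label the unlabeled pool with the current model, move
    high-confidence predictions into the training set, then retrain."""
    labeled = dict(labeled)
    pool = set(unlabeled)
    while pool:
        model = train(labeled)
        scored = {x: predict(model, x) for x in pool}
        added = {x: lab for x, (lab, conf) in scored.items() if conf >= threshold}
        if not added:
            break                      # nothing confident left; stop
        for x, lab in added.items():
            labeled[x] = lab
            pool.discard(x)
    return labeled

# hypothetical toy classifier: label by nearest labeled point,
# confidence decaying with distance
def train(labeled):
    return labeled

def predict(model, x):
    nearest = min(model, key=lambda p: abs(p - x))
    return model[nearest], 1.0 / (1.0 + abs(nearest - x))

result = self_train({0.0: "neg", 10.0: "pos"}, [0.2, 0.4, 9.7, 9.5],
                    train, predict, threshold=0.75)
```

Note how points labeled in one round (e.g. 0.2) raise the confidence for their neighbors (e.g. 0.4) in the next round.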

Semi-Supervised Learning

slide-28
SLIDE 28
  • Co-Training (Blum & Mitchell 1998)
    – Given: multiple (ideally independent) views of the same data (e.g. left context and right context of a word)
    – Learn separate models for each view
    – Allow the different views to teach each other: model 1 can generate labels that will be helpful to improve model 2, and vice versa.

Semi-Supervised Learning

slide-29
SLIDE 29

Semi-Supervised Learning: Transductive Setting

Image: Partha Pratim Talukdar

slide-30
SLIDE 30

Semi-Supervised Learning: Transductive Setting

Image: Partha Pratim Talukdar

Algorithms: Label Propagation (Zhu et al. 2003), Adsorption (Baluja et al. 2008), Modified Adsorption (Talukdar et al. 2009)
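A minimal sketch in the spirit of label propagation (Zhu et al. 2003): unlabeled nodes repeatedly average their neighbors' label scores while seed nodes stay clamped. The graph and labels here are a made-up toy, not the cited algorithm's exact formulation:

```python
def label_propagation(graph, seeds, iters=50):
    """Iterative label propagation: each unlabeled node averages the label
    scores of its neighbors; seed nodes are clamped to their known label."""
    labels = sorted(set(seeds.values()))
    score = {v: {l: 0.0 for l in labels} for v in graph}
    for v, l in seeds.items():
        score[v][l] = 1.0
    for _ in range(iters):
        for v in graph:
            if v in seeds:
                continue               # seeds keep their label
            for l in labels:
                score[v][l] = sum(score[u][l] for u in graph[v]) / len(graph[v])
    return {v: max(score[v], key=score[v].get) for v in graph}

# tiny chain graph a - b - c - d, with only a and d labeled
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = label_propagation(graph, seeds={"a": "spam", "d": "ham"})
```

Each unlabeled node ends up with the label of the nearest seed, which is the transductive intuition behind the algorithms listed above.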

slide-31
SLIDE 31
  • Sentiment Analysis:
    – Look for Twitter tweets with emoticons like “:)”, “:(”
    – Remove the emoticons, then use the tweets as training data!

Crimson Hexagon

Distant Supervision
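The emoticon trick can be sketched as follows; the emoticon lists and example tweets are illustrative assumptions:

```python
import re

POS = [":)", ":-)", ":D"]
NEG = [":(", ":-("]

def distant_labels(tweets):
    """Turn raw tweets into (text, label) training pairs: the emoticon acts
    as a noisy distant label and is removed from the text itself."""
    data = []
    for t in tweets:
        label = ("pos" if any(e in t for e in POS)
                 else "neg" if any(e in t for e in NEG)
                 else None)
        if label is None:
            continue                   # no emoticon: not usable as training data
        clean = t
        for e in POS + NEG:
            clean = clean.replace(e, "")
        data.append((re.sub(r"\s+", " ", clean).strip(), label))
    return data

pairs = distant_labels(["great concert tonight :)",
                        "my flight got cancelled :(",
                        "just landed in Hanoi"])
```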

slide-32
SLIDE 32

Representation Learning to Better Exploit Big Data

slide-33
SLIDE 33

Representations

Image: David Warde-Farley via Bengio et al. Deep Learning Book

slide-34
SLIDE 34

Representations

Inputs Bits:

0011001…..

Images: Marc'Aurelio Ranzato

Note the sharing between classes.

slide-35
SLIDE 35

Representations

Inputs Bits:

0011001…..

Images: Marc'Aurelio Ranzato

Massive improvements in image object recognition (human-level?) and speech recognition. Good improvements in NLP and IR-related tasks.

slide-36
SLIDE 36

Example

Google's image Source: Jeff Dean, Google

slide-37
SLIDE 37

Inspiration: The Brain

Source: Alex Smola

Input: delivered via dendrites from other neurons. Processing: synapses may alter the input signals; the cell then combines all input signals. Output: if there is enough activation from the inputs, an output signal is sent through a long cable (the “axon”).

slide-38
SLIDE 38

Perceptron

Input: Features. Every feature fi gets a weight wi.

feature     weight
dog          7.2
food         3.4
bank        -7.3
delicious    1.5
train       -4.2

[Diagram: features f1-f4 with weights w1-w4 feeding into a neuron]

slide-39
SLIDE 39

Perceptron

[Diagram: features f1-f4 with weights w1-w4 feeding into a neuron that produces an output]

Activation of Neuron: multiply the feature values of an object x with the feature weights:

a(x) = Σi wi fi(x) = wᵀ f(x)

slide-40
SLIDE 40

Perceptron

[Diagram: features f1-f4 with weights w1-w4 feeding into a neuron that produces an output]

Output of Neuron: check if the activation exceeds a threshold t = -b:

output(x) = g(wᵀ f(x) + b)

e.g. g could return 1 (positive) if the activation is positive, and -1 otherwise; e.g. 1 for “spam”, -1 for “not-spam”.
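Putting the two slides together, a perceptron in a few lines. The weights come from the earlier feature table; the bias value and the binary feature vector are made-up illustrations:

```python
def perceptron_output(features, weights, bias):
    """Perceptron: the activation is the weighted feature sum; the output is
    +1 if the activation exceeds the threshold t = -bias, else -1."""
    activation = sum(w * f for w, f in zip(weights, features))  # a(x) = w^T f(x)
    return 1 if activation + bias > 0 else -1                   # g(w^T f(x) + b)

# weights from the slide's table: dog, food, bank, delicious, train
weights = [7.2, 3.4, -7.3, 1.5, -4.2]
label = perceptron_output([1, 1, 0, 1, 0], weights, bias=-5.0)  # +1 = "spam"
```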
slide-41
SLIDE 41

Decision Surfaces

Decision Trees · Linear Classifiers (Perceptron, SVM) · Kernel-based Classifiers (Kernel Perceptron, Kernel SVM) · Multi-Layer Perceptron

Images: Vibhav Gogate

The perceptron is not max-margin and yields only a straight decision surface; kernel-based classifiers and multi-layer perceptrons allow any decision surface.

slide-42
SLIDE 42

Deep Learning: Multi-Layer Perceptron

[Diagram: input layer (features f1-f4) → hidden layer (Neuron 1, Neuron 2) → output layer (one neuron) → output]

slide-43
SLIDE 43

Deep Learning: Multi-Layer Perceptron

[Diagram: input layer (features f1-f4) → hidden layer of neurons → output layer → output]

slide-44
SLIDE 44

Deep Learning: Multi-Layer Perceptron

[Diagram: input layer (features f1-f4) → hidden layer of neurons → output layer with two outputs (Output 1, Output 2)]

slide-45
SLIDE 45

Deep Learning: Multi-Layer Perceptron

Single-Layer:

output(x) = g(W f(x) + b)

Input Layer (Feature Extraction): f(x)

Three-Layer Network:

output(x) = g2(W2 g1(W1 f(x) + b1) + b2)

Four-Layer Network:

output(x) = g3(W3 g2(W2 g1(W1 f(x) + b1) + b2) + b3)
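The layered formulas above, sketched directly. The sigmoid activations and the small weight matrices are illustrative assumptions, not values from the slides:

```python
import math

def layer(W, x, b, g):
    """One layer: g(W x + b), with W given as a list of rows."""
    return [g(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

sigmoid = lambda a: 1.0 / (1.0 + math.exp(-a))

def three_layer_network(x, W1, b1, W2, b2):
    """output(x) = g2(W2 g1(W1 f(x) + b1) + b2), matching the slide's formula."""
    hidden = layer(W1, x, b1, sigmoid)      # hidden layer, g1
    return layer(W2, hidden, b2, sigmoid)   # output layer, g2

# hypothetical weights: 2 inputs -> 2 hidden units -> 1 output
out = three_layer_network([1.0, 0.5],
                          W1=[[0.4, -0.2], [0.3, 0.8]], b1=[0.0, -0.1],
                          W2=[[1.0, -1.0]], b2=[0.2])
```

A four-layer network would simply wrap one more `layer` call around the result.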
slide-46
SLIDE 46

Deep Learning: Multi-Layer Perceptron

slide-47
SLIDE 47

Deep Learning: Computing the Output

Simply evaluate the output function: for each node, compute an output based on the node inputs.

[Diagram: inputs x1, x2 → hidden nodes z1, z2, z3 → outputs y1, y2]

slide-48
SLIDE 48

Deep Learning: Training

Compute the error on the output; if it is non-zero, do a stochastic gradient step on the error function to fix it.

Backpropagation: the error is propagated back from the output nodes towards the input layer.

[Diagram: inputs x1, x2 → hidden nodes z1, z2, z3 → outputs y1, y2]

slide-49
SLIDE 49

Deep Learning: Training

Exploit the chain rule to compute the gradient.

Compute the error on the output; if it is non-zero, do a stochastic gradient step on the error function to fix it. Backpropagation: the error is propagated back from the output nodes towards the input layer.

x → y = f(x) → z = g(y)

∂z/∂x = (∂z/∂y) (∂y/∂x)

We are interested in the gradient, i.e. the partial derivatives of the output function z = g(y) with respect to all [inputs and] weights, including those at a deeper part of the network.
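A quick numeric sanity check of the chain rule with scalar functions (f and g chosen arbitrarily here): the analytic product (∂z/∂y)(∂y/∂x) should match a finite-difference estimate of ∂z/∂x.

```python
import math

# z = g(y), y = f(x); here f(x) = x^2 and g(y) = sin(y), arbitrary choices
f  = lambda x: x * x
g  = math.sin
df = lambda x: 2 * x          # dy/dx
dg = math.cos                 # dz/dy

x = 1.3
analytic = dg(f(x)) * df(x)   # chain rule: dz/dx = (dz/dy)(dy/dx)

# central finite-difference check of the same derivative
h = 1e-6
numeric = (g(f(x + h)) - g(f(x - h))) / (2 * h)
```

Backpropagation is exactly this factorization applied layer by layer, reusing each layer's local derivative.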

slide-50
SLIDE 50

DropOut Technique

Basic Idea: While training, randomly drop inputs (make the feature zero).

Effect: Training on variations of the original training data (an artificial increase of the training data size). The trained network relies less on the existence of specific features.

Reference: Hinton et al. (2012). Also: Maxout Networks by Goodfellow et al. (2013)
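A sketch of dropout applied to an input vector. Note it uses "inverted" dropout scaling (dividing survivors by 1-p at training time), a common variant rather than necessarily the cited papers' exact formulation:

```python
import random

def dropout(features, p=0.5, training=True, seed=None):
    """Dropout: while training, zero each input with probability p; scale the
    survivors by 1/(1-p) so expected activations match test time."""
    if not training:
        return list(features)          # at test time, pass inputs through unchanged
    rng = random.Random(seed)
    keep = 1.0 - p
    return [f / keep if rng.random() < keep else 0.0 for f in features]

out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, seed=42)
```

Each training pass thus sees a different "thinned" version of the input, which is the artificial data-variation effect described above.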

slide-51
SLIDE 51

Deep Learning: Convolutional Neural Networks

Image: http://torch.cogbits.com/doc/tutorials_supervised/

Reference: Yann LeCun's work

slide-52
SLIDE 52

Deep Learning: Recurrent Neural Networks

Source: Bayesian Behavior Lab, Northwestern University

slide-53
SLIDE 53

Deep Learning: Recurrent Neural Networks

Source: Bayesian Behavior Lab, Northwestern University

Then we can do backpropagation.

Challenge: vanishing/exploding gradients

slide-54
SLIDE 54

Deep Learning: Long Short-Term Memory Networks

Source: Bayesian Behavior Lab, Northwestern University

slide-55
SLIDE 55

Deep Learning: Long Short-Term Memory Networks

Deep LSTMs for Sequence-to-Sequence Learning, Sutskever et al. 2014 (Google)

slide-56
SLIDE 56

Deep Learning: Long Short-Term Memory Networks

French original: La dispute fait rage entre les grands constructeurs aéronautiques à propos de la largeur des sièges de la classe touriste sur les vols long-courriers, ouvrant la voie à une confrontation amère lors du salon aéronautique de Dubaï qui a lieu ce mois-ci.

LSTM's English translation: The dispute is raging between large aircraft manufacturers on the size of the tourist seats on the long-haul flights, leading to a bitter confrontation at the Dubai Airshow in the month of October.

Ground-truth English translation: A row has flared up between leading plane makers over the width of tourist-class seats on long-distance flights, setting the tone for a bitter confrontation at this month's Dubai Airshow.

Sutskever et al. 2014 (Google)

slide-57
SLIDE 57

Deep Learning: Neural Turing Machines

Source: Bayesian Behavior Lab, Northwestern University

slide-58
SLIDE 58

Deep Learning: Neural Turing Machines

Source: Bayesian Behavior Lab, Northwestern University

slide-59
SLIDE 59

Deep Learning: Neural Turing Machines

Source: Bayesian Behavior Lab, Northwestern University

slide-60
SLIDE 60

Deep Learning: Neural Turing Machines

Source: Bayesian Behavior Lab, Northwestern University

slide-61
SLIDE 61

Deep Learning: Neural Turing Machines

Source: Bayesian Behavior Lab, Northwestern University

slide-62
SLIDE 62

Deep Learning: Neural Turing Machines

Learning to sort! The vectors for the numbers are random.

slide-63
SLIDE 63

Big Data in Feature Engineering and Representation Learning

slide-64
SLIDE 64
  • Language Models for Autocompletion

Web Semantics: Statistics from Big Data as Features

slide-65
SLIDE 65

Source: Wang et al. An Overview of Microsoft Web N-gram Corpus and Applications

Word Segmentation

slide-66
SLIDE 66
  • NP Coordination

Source: Bansal & Klein (2011)

Parsing: Ambiguity

slide-67
SLIDE 67

Source: Bansal & Klein (2011)

Parsing: Web Semantics

slide-68
SLIDE 68
  • Lapata & Keller (2004): The Web as a Baseline (also: Bergsma et al. 2010)
  • “big fat Greek wedding” but not “fat Greek big wedding”

Source: Shane Bergsma

Adjective Ordering

slide-69
SLIDE 69

Source: Bansal & Klein 2012

Coreference Resolution

slide-70
SLIDE 70

Source: Bansal & Klein 2012

Coreference Resolution

slide-71
SLIDE 71
  • Data Sparsity: e.g. most words are rare (in the “long tail”) → missing in training data
  • Solution (Blitzer et al. 2006, Koo & Collins 2008, Huang & Yates 2009, etc.):
    – Cluster together similar features
    – Use the clustered features instead of / in addition to the original features

Brown Corpus Source: Baroni & Evert

Distributional Semantics

slide-72
SLIDE 72

Even worse: Arnold Schwarzenegger

Spelling Correction

slide-73
SLIDE 73

Vector Representations

[Diagram: words such as petronia, sparrow, bird, parched, arid, dry plotted as points in a vector space, with similar words close together]

Put words into a vector space (e.g. with d = 300 dimensions).

slide-74
SLIDE 74

Word Vector Representations

Tomas Mikolov et al. In Proc. ICLR 2013.

Available from https://code.google.com/p/word2vec/
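A toy illustration of why vector spaces help: word similarity becomes cosine similarity between vectors. The 3-d vectors below are made up; real word2vec vectors have hundreds of dimensions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# tiny made-up vectors; real models learn these from corpus co-occurrences
vec = {
    "sparrow": [0.9, 0.1, 0.0],
    "bird":    [0.8, 0.2, 0.1],
    "arid":    [0.0, 0.9, 0.4],
    "dry":     [0.1, 0.8, 0.5],
}
sim_bird = cosine(vec["sparrow"], vec["bird"])
sim_dry  = cosine(vec["sparrow"], vec["dry"])
```

Here "sparrow" ends up closer to "bird" than to "dry", mirroring the clusters in the diagram above.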

slide-75
SLIDE 75

Wikipedia

slide-76
SLIDE 76
  • Exploit the edit history, especially on the Simple English Wikipedia
  • “collaborate” → “work together”; “stands for” → “is the same as”

Text Simplification

slide-77
SLIDE 77

Answering Questions

IBM's Jeopardy!-winning Watson system


slide-78
SLIDE 78

Knowledge Integration

slide-79
SLIDE 79

UWN/MENTA

A multilingual extension of WordNet, covering word senses and taxonomical information for over 200 languages. www.lexvo.org/uwn/

slide-80
SLIDE 80

WebChild: Common-Sense Knowledge

AAAI 2014, WSDM 2014, AAAI 2011

slide-81
SLIDE 81

Challenge: From Really Big Data to Real Insights

Image: Brett Ryder

slide-82
SLIDE 82

Big Data Mining in Practice

slide-83
SLIDE 83

Gerard de Melo (Tsinghua University, Beijing, China), Aparna Varde (Montclair State University, NJ, USA). DASFAA, Hanoi, Vietnam, April 2015

1

slide-84
SLIDE 84
  • Dr. Aparna Varde

2

slide-85
SLIDE 85

 Internet-based computing: shared resources, software & data provided on demand, like the electricity grid

 Follows a pay-as-you-go model

3

slide-86
SLIDE 86

 Several technologies, e.g., MapReduce & Hadoop  MapReduce: Data-parallel programming model for

clusters of commodity machines

  • Pioneered by Google
  • Processes 20 PB of data per day

 Hadoop: Open-source framework, distributed

storage and processing of very large data sets

  • HDFS (Hadoop Distributed File System) for storage
  • MapReduce for processing
  • Developed by Apache

4

slide-87
SLIDE 87
  • Scalability

– To large data volumes – Scan 100 TB on 1 node @ 50 MB/s = 24 days – Scan on 1000-node cluster = 35 minutes

  • Cost-efficiency

– Commodity nodes (cheap, but unreliable) – Commodity network (low bandwidth) – Automatic fault-tolerance (fewer admins) – Easy to use (fewer programmers)

5

slide-88
SLIDE 88

Data type: key-value records

Map function: (K_in, V_in) → list(K_inter, V_inter)

Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)

6

slide-89
SLIDE 89

MapReduce Example

Input:
  the quick brown fox
  the fox ate the mouse
  how now brown cow

Map output (one (word, 1) pair per word):
  (the,1) (quick,1) (brown,1) (fox,1)
  (the,1) (fox,1) (ate,1) (the,1) (mouse,1)
  (how,1) (now,1) (brown,1) (cow,1)

Reduce output (after shuffle & sort):
  ate,1  brown,2  cow,1  fox,2  how,1  mouse,1  now,1  quick,1  the,3

Input → Map → Shuffle & Sort → Reduce → Output
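The word-count example can be mimicked in-process: `map_fn` and `reduce_fn` below mirror the Map and Reduce signatures from the previous slide, with a dictionary standing in for the shuffle & sort phase:

```python
from collections import defaultdict

def map_fn(line):
    """Map: (K_in, V_in) -> list of intermediate (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: (K_inter, list(V_inter)) -> (K_out, V_out)."""
    return (word, sum(counts))

def word_count(lines):
    shuffled = defaultdict(list)          # shuffle & sort: group values by key
    for line in lines:
        for word, one in map_fn(line):
            shuffled[word].append(one)
    return dict(reduce_fn(w, c) for w, c in shuffled.items())

counts = word_count(["the quick brown fox",
                     "the fox ate the mouse",
                     "how now brown cow"])
```

In a real cluster the grouping happens across machines, but the per-key logic is exactly this.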

7

slide-90
SLIDE 90

 40 nodes/rack, 1000-4000 nodes in cluster  1 Gbps bandwidth in rack, 8 Gbps out of rack  Node specs (Facebook):

8-16 cores, 32 GB RAM, 8×1.5 TB disks, no RAID

Aggregation switch Rack switch

8

slide-91
SLIDE 91

 Files split into 128MB blocks  Blocks replicated across

several data nodes (often 3)

 Name node stores metadata

(file names, locations, etc)

 Optimized for large files,

sequential reads

 Files are append-only

[Diagram: name node holding the metadata for File1; its four blocks are each replicated three times across the data nodes]

9

slide-92
SLIDE 92

Hive:

A relational database layer on Hadoop, developed at Facebook

Provides an SQL-like query language

10

slide-93
SLIDE 93

Supports table partitioning,

complex data types, sampling, some query optimization

These help discover knowledge

by various tasks, e.g.,

  • Search for relevant terms
  • Operations such as word count
  • Aggregates like MIN, AVG

11

slide-94
SLIDE 94

/* Find documents of the enron table with word frequencies within the range of 75 and 80 */
SELECT DISTINCT D.DocID
FROM docword_enron D
WHERE D.count > 75 AND D.count < 80
LIMIT 10;
OK
1853…
11578
16653
Time taken: 64.788 seconds

12

slide-95
SLIDE 95

/* Create a view to find the count for WordID=90 and DocID=40, for the nips table */
CREATE VIEW Word_Freq AS
SELECT D.DocID, D.WordID, V.word, D.count
FROM docword_Nips D
JOIN vocabNips V
  ON D.WordID = V.WordID AND D.DocId = 40 AND D.WordId = 90;
OK
Time taken: 1.244 seconds

13

slide-96
SLIDE 96

/* Find documents which use the word "rational" from the nips table */
SELECT D.DocID, V.word
FROM docword_Nips D
JOIN vocabnips V
  ON D.wordID = V.wordID AND V.word = "rational"
LIMIT 10;
OK
434 rational
275 rational
158 rational
….
290 rational
422 rational
Time taken: 98.706 seconds

14

slide-97
SLIDE 97

/* Find the average frequency of all words in the enron table */
SELECT AVG(count) FROM docWord_enron;
OK
1.728152608060543
Time taken: 68.2 seconds

15

slide-98
SLIDE 98

Query Execution Time for HQL & MySQL on big data sets Similar claims for other SQL packages

16

slide-99
SLIDE 99

17

Server Storage Capacity Max Storage per instance

slide-100
SLIDE 100

18

slide-101
SLIDE 101

 Hive supports rich data types: Map, Array & Struct; Complex types  It supports queries with SQL Filters, Joins, Group By, Order By etc.  Here is when (original) Hive users miss SQL….

 No support in Hive to update data after insert  No (or little) support in Hive for relational semantics (e.g., ACID)  No "delete from" command in Hive - only bulk delete is possible  No concept of primary and foreign keys in Hive

19

slide-102
SLIDE 102

 Ensure the dataset is already compliant with integrity constraints before load

 Ensure that only compliant data rows are loaded, using SELECT & a temporary staging table

 Check for referential constraints using an equi-join, and query only those rows that comply

20

slide-103
SLIDE 103

Providing more power than Hive, Hadoop & MR

21

slide-104
SLIDE 104

Cloudera’s Impala: more efficient SQL-compliant analytic database

Hortonworks’ Stinger: driving the future of Hive with enterprise SQL at Hadoop scale

Apache’s Mahout: machine learning algorithms for Big Data

Spark: lightning-fast framework for Big Data

MLlib: machine learning library of Spark

MLbase: platform base for MLlib in Spark

22

slide-105
SLIDE 105

 Fully integrated, state-of-the-art analytic

D/B to leverage the flexibility & scalability of Hadoop

 Combines benefits

  • Hadoop: flexibility, scalability, cost-effectiveness
  • SQL: performance, usability, semantics

23

slide-106
SLIDE 106

MPP: Massively Parallel Processing

slide-107
SLIDE 107

NDV: function for approximate counting

  • Table with 1 billion rows

COUNT(DISTINCT):
  • precise answer
  • slow for large-scale data

NDV() function:
  • approximate result
  • much faster
25

slide-108
SLIDE 108

Hardware Configuration

  • Generates less CPU load than Hive
  • Typical performance gains: 3x-4x
  • Impala cannot go faster than hardware permits!

Query Complexity

  • Single-table aggregation queries : less gain
  • Queries with at least one join: gains of 7-45X

Main Memory as Cache

  • Data accessed by query is in cache, speedup is more
  • Typical gains with cache: 20x-90x

26

slide-109
SLIDE 109

No: there are many viable use cases for MR & Hive

  • Long-running data transformation workloads & traditional DW frameworks
  • Complex analytics on limited, structured data sets

Impala is a complement to these approaches

  • Supports cases with very large data sets
  • Especially to get focused result sets quickly

27

slide-110
SLIDE 110

Drive future of Hive with enterprise SQL at

Hadoop scale

3 main objectives

  • Speed: Sub-second query response times
  • Scale: From GB to TB & PB
  • SQL: Transactions & SQL:2011 analytics for Hive

28

slide-111
SLIDE 111

 Wider use cases with modifications to data  BEGIN, COMMIT, ROLLBACK for multi-stmt transactions

29

slide-112
SLIDE 112

Hybrid engine with LLAP (Live Long and Process)

  • Caching & data reuse across queries
  • Multi-threaded execution
  • High throughput I/O
  • Granular column level security

30

slide-113
SLIDE 113

Common Table Expressions Sub-queries: correlated & uncorrelated Rollup, Cube, Standard Aggregates Inner, Outer, Semi & Cross Joins Non Equi-Joins Set Functions: Union, Except & Intersect Most sub-queries, nested and otherwise

31

slide-114
SLIDE 114

32

slide-115
SLIDE 115

Supervised and Unsupervised Learning Algorithms

33

slide-116
SLIDE 116

ML algorithms on distributed frameworks

good for mining big data on the cloud

Supervised Learning: e.g. Classification
Unsupervised Learning: e.g. Clustering

The word “mahout” means “elephant rider” in Hindi (from India): an interesting analogy, given Hadoop's elephant mascot.

34

slide-117
SLIDE 117

Clustering (Unsupervised)

  • K-means, Fuzzy k-means, Streaming k-means etc.

Classification (Supervised)

  • Random Forest, Naïve Bayes etc.

Collaborative Filtering (Semi-Supervised)

  • Item Based, Matrix Factorization etc.

Dimensionality Reduction (For Learning)

  • SVD, PCA etc.

Others

  • LDA for Topic Models, Sparse TF-IDF Vectors from Text etc.

35

slide-118
SLIDE 118

Input: Big Data from Emails

  • Goal: automatically classify text in various categories

Prior to classifying text, TF-IDF applied

  • Term Frequency – Inverse Document Frequency
  • TF-IDF increases with frequency of word in doc, offset by frequency of word in corpus

Naïve Bayes used for classification

  • Simple classifier based on posterior probability
  • Assumes each attribute is distributed independently of the others
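The TF-IDF definition above can be computed directly: the weight grows with a term's frequency in a document and is offset by how common the term is across the corpus. A minimal sketch on a toy "email" corpus (made up here; real libraries differ in their smoothing choices):

```python
import math

# Toy corpus of tokenized "emails" (invented for illustration).
docs = [
    "mahout clustering with mahout".split(),
    "hive query hive".split(),
    "clustering query".split(),
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                # term frequency in doc
    df = sum(1 for d in corpus if term in d)       # document frequency
    idf = math.log(len(corpus) / df)               # unsmoothed IDF
    return tf * idf

# "mahout" is frequent in doc 0 but rare in the corpus -> high weight;
# "clustering" appears in two of three docs -> lower weight.
w_mahout = tf_idf("mahout", docs[0], docs)
w_cluster = tf_idf("clustering", docs[0], docs)
print(w_mahout > w_cluster)  # True
```

These weighted vectors are what the Naïve Bayes classifier described above consumes as features.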

36

slide-119
SLIDE 119

Training Data → Pre-Process → Training Algorithm → Model

Building the Model

Historical data with reference decisions:

  • Collected a set of e-mails organized in directories labeled with predictor categories: Mahout, Hive, Other
  • Stored email as text in HDFS on an EC2 virtual server

Using Apache Mahout:

  • Convert text files to HDFS Sequential File format
  • Create TF-IDF weighted Sparse Vectors
  • Build and evaluate the model with Mahout’s implementation of the Naïve Bayes Classifier

Classification Model which takes as input vectorized text documents and assigns one of three document topics:

  • Mahout
  • Hive
  • Other

37

slide-120
SLIDE 120

New Data → Model → Output

Using Model to Classify New Data

  • Store a set of new email documents as text in HDFS on an EC2 virtual server
  • Pre-process using Apache Mahout’s Libraries
  • Use the existing model to predict topics for new text files

This was implemented in Java:

  • With Apache Mahout Libraries
  • Apache Maven to manage dependencies and build the project
  • The executable JAR file is submitted with the project documentation
  • The program works with data files stored in HDFS

For each input document, the model returns one of the following categories:

  • Mahout
  • Hive
  • Other

38

slide-121
SLIDE 121

Text Classification - Results

Example (Input → Output): Predicted Category: Hive

Possible uses:

  • Automatic email sorting
  • Automatic news classification
  • Topic modeling

39

slide-122
SLIDE 122

 Lightning fast processing for Big Data
 Open-source cluster computing developed in AMPLab at UC Berkeley
 Advanced DAG execution engine for cyclic data flow & in-memory computing
 Very well-suited for large-scale Machine Learning

40

slide-123
SLIDE 123

Spark runs much faster than Hadoop MapReduce:

  • Up to 100x faster in memory
  • Up to 10x faster on disk

slide-124
SLIDE 124

42

slide-125
SLIDE 125

Uses TimSort (derived from merge-sort & insertion-sort), often faster than quick-sort on real-world, partially ordered data

Exploits cache locality due to in-memory computing

Fault-tolerant when scaling, well-designed for failure-recovery

Deploys the power of the cloud through enhanced network & I/O throughput
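Incidentally, Python's built-in sort is also TimSort, so it can illustrate one useful property of merge-sort-derived algorithms that quick-sort lacks: stability, meaning records with equal keys keep their original order. The records below are made up.

```python
# Timsort (Python's built-in sort) is stable: ties preserve input order.
records = [("b", 2), ("a", 1), ("b", 1), ("a", 2)]
by_key = sorted(records, key=lambda r: r[0])
print(by_key)  # [('a', 1), ('a', 2), ('b', 2), ('b', 1)]
```

Stability matters in data pipelines: a multi-key sort can be built by sorting on the secondary key first, then on the primary key.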

43

slide-126
SLIDE 126

 Spark Core: Distributed task dispatching, scheduling, basic I/O
 Spark SQL: Has SchemaRDD (built on Resilient Distributed Datasets); SQL support with CLI, ODBC/JDBC
 Spark Streaming: Uses Spark’s fast scheduling for stream analytics; code written for batch analytics can be used for streams
 MLlib: Distributed machine learning framework (often cited as ~10x faster than Mahout)
 GraphX: Distributed graph processing framework with API

44

slide-127
SLIDE 127

 MLbase: tools & interfaces to bridge the gap b/w operational & investigative ML

 Platform to support MLlib, the distributed Machine Learning Library on top of Spark

45

slide-128
SLIDE 128

 MLlib: Distributed ML library for classification, regression, clustering & collaborative filtering

 MLI: API for feature extraction, algorithm development, high-level ML programming abstractions

 ML Optimizer: Simplifies ML problems for end users by automating model selection

46

slide-129
SLIDE 129

 Classification

  • Support Vector Machines (SVM), Naïve Bayes, decision trees

 Regression

  • Linear regression, regression trees

 Collaborative Filtering

  • Alternating Least Squares (ALS)

 Clustering

  • k-means

 Optimization

  • Stochastic gradient descent (SGD), Limited-memory BFGS (L-BFGS)

 Dimensionality Reduction

  • Singular value decomposition (SVD), principal component analysis (PCA)
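To make the clustering entry concrete, here is one Lloyd iteration of k-means (the algorithm behind MLlib's clustering) in plain Python on 1-D toy data; MLlib of course distributes these steps across a cluster, and real runs repeat them until convergence.

```python
# Toy 1-D data with two obvious groups; values are invented.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers = [0.0, 5.0]  # arbitrary initial centers

# Assignment step: each point joins its nearest center.
clusters = {0: [], 1: []}
for p in points:
    idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
    clusters[idx].append(p)

# Update step: each center moves to the mean of its cluster.
centers = [sum(c) / len(c) for c in clusters.values()]
print(centers)  # [1.5, 10.0]
```

In the distributed setting the assignment step is a map over partitions of the data and the update step is a reduce, which is why k-means parallelizes so naturally.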

47

slide-130
SLIDE 130

  • Easily interpretable
  • Ensembles are top performers
  • Support for categorical variables
  • Can handle missing data
  • Distributed decision trees scale well to massive datasets
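The interpretability point is easy to see with a tiny hand-written tree on the exam data from the opening slides: the whole model is a couple of readable rules. This is a sketch, not a learned tree; real decision-tree algorithms search for the best splits automatically.

```python
# Toy data from the tutorial's opening example:
# (previous_knowledge, prep_hours) -> passed?
data = [
    (0.80, 48, True), (0.50, 75, True), (0.95, 24, True),
    (0.60, 24, False), (0.80, 10, False),
]

def tree_predict(knowledge, hours):
    # Hand-crafted depth-two rule, readable at a glance; a learned
    # tree would discover similar thresholds by optimizing a split
    # criterion such as Gini impurity.
    return hours >= 24 and knowledge > 0.60 or hours >= 75

correct = sum(tree_predict(k, h) == y for k, h, y in data)
print(correct, "of", len(data))  # 5 of 5
```

Random forests, as in Mahout and MLlib, train many such trees on random subsets of data and features, then vote, trading some interpretability for accuracy.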

48

slide-131
SLIDE 131

Uses a modified version of k-means

Feature extraction & selection

  • Extraction: requires a lot of time and tools
  • Selection: requires domain expertise
  • Wrong selection of features: bad quality clusters

Glassbeam’s SCALAR platform

  • SPL (Semiotic Parsing Language)
  • Makes feature engineering easier & faster

49

slide-132
SLIDE 132

50

slide-133
SLIDE 133

 High-dimensional data: Not all features are important for building the model & answering the questions

 Many applications: Reduce dimensions before building the model

 MLlib: 2 algorithms for dimensionality reduction

  • Principal Component Analysis (PCA)
  • Singular Value Decomposition (SVD)

51

slide-134
SLIDE 134

52

slide-135
SLIDE 135

53

slide-136
SLIDE 136

 Grow into a unified platform for data scientists
 Reduce time to market with platforms like Glassbeam’s SCALAR for feature engineering
 Include more ML algorithms
 Introduce enhanced filters
 Improve visualization for better performance

54

slide-137
SLIDE 137

Processing Streaming Data on the Cloud

55

slide-138
SLIDE 138

Apache Storm

  • Reliably process unbounded streams; do for real-time what Hadoop did for batch processing

Apache Flink

  • Fast & reliable large-scale data processing engine with batch & stream based alternatives

56

slide-139
SLIDE 139

 Integrates with queueing & database systems

 Nimbus node

  • Upload computations
  • Distribute code on cluster
  • Launch workers on cluster
  • Monitor computation

 ZooKeeper nodes

  • Coordinate the Storm cluster

 Supervisor nodes

  • Interact w/ Nimbus through ZooKeeper
  • Start & stop workers w/ signals from Nimbus

57

slide-140
SLIDE 140

 Tuple: ordered list of elements, e.g., a “4-tuple” (7, 1, 3, 7)

 Stream: unbounded sequence of tuples

 Spout: source of streams in a computation (e.g. the Twitter API)

 Bolt: processes input streams & produces output streams, running functions over the data

 Topology: the overall calculation, as a network of spouts and bolts
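The spout/bolt/topology vocabulary can be sketched with Python generators: a spout emits a stream of tuples, bolts transform it, and the chained pipeline is the topology. The sentences are made up, and real Storm runs each component distributed and in parallel rather than in-process like this.

```python
def sentence_spout():
    """Spout: emits a (finite, for the demo) stream of sentences."""
    for s in ["storm processes streams", "streams of tuples"]:
        yield s

def split_bolt(stream):
    """Bolt: splits each sentence into word tuples."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Bolt: aggregates word counts from the incoming stream."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Topology: spout -> split bolt -> count bolt.
counts = count_bolt(split_bolt(sentence_spout()))
print(counts["streams"])  # 2
```

This word-count topology is the canonical Storm example; in actual Storm the counts would be emitted continuously rather than returned at the end, since the input stream is unbounded.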

58

slide-141
SLIDE 141

Exploits in-memory data streaming & adds iterative processing into the system

Makes the system super fast for data-intensive & iterative jobs

59

Performance Comparison

slide-142
SLIDE 142

 Requires few config parameters

 Built-in optimizer finds the best way to run a program

 Supports all Hadoop I/O & data types

 Runs MR operators unmodified & faster

 Reads data from HDFS
60

Execution of Flink

slide-143
SLIDE 143

Summary and Ongoing Work

61

slide-144
SLIDE 144

 MapReduce & Hadoop: Pioneering technologies
 Hive: SQL-like, good for querying
 Impala: Complementary to Hive, overcomes its drawbacks
 Stinger: Drives the future of Hive w/ advanced SQL semantics
 Mahout: ML algorithms (supervised & unsupervised)
 Spark: Framework more efficient & scalable than Hadoop
 MLlib: Machine Learning Library of Spark
 MLbase: Platform supporting MLlib
 Storm: Stream processing for cloud & big data
 Flink: Both stream & batch processing

62

slide-145
SLIDE 145

 Store & process Big Data?

  • MR & Hadoop - Classical technologies
  • Spark – Very large data, Fast & scalable

 Query over Big Data?

  • Hive – Fundamental, SQL-like
  • Impala – More advanced alternative
  • Stinger - Making Hive itself better

 ML supervised / unsupervised?

  • Mahout - Cloud based ML for big data
  • MLlib - Super large data sets, super fast

 Mine over streaming big data?

  • Only streams – Storm
  • Batch & Streams – Flink

63

slide-146
SLIDE 146

Big Data Skills

  • Cloud Technology
  • Deep Learning
  • Business Perspectives
  • Scientific Domains

Salary $100k+

  • Big Data Programmers
  • Big Data Analysts

University Programs & concentrations

  • Data Analytics
  • Data Science

64

slide-147
SLIDE 147

 Big Data Concentration being developed in the CS Dept

  • http://cs.montclair.edu/
  • Data Mining, Remote Sensing, HCI, Parallel Computing, Bioinformatics, Software Engineering

 Global Education Programs available for visiting and exchange students

  • http://www.montclair.edu/global-education/

 Please contact me for details

65

slide-148
SLIDE 148

 Include more cloud intelligence in big data analytics

 Further bridge the gap between Hive & SQL

 Add features from standalone ML packages to Mahout, MLlib …

 Extend big data capabilities to focus more on PB & higher scales

 Enhance mining of big data streams w/ cloud services & deep learning

 Address security & privacy issues on a deeper level for cloud & big data

 Build lucrative applications w/ scalable technologies for big data

 Conduct domain-specific research, e.g., Cloud & GIS, Cloud & Green Computing

66

slide-149
SLIDE 149

[1] M. Bansal, D. Klein. Coreference Semantics from Web Features. In Proceedings of ACL 2012.
[2] R. Bekkerman, M. Bilenko, J. Langford (Eds.). Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2011.
[3] T. Brants, A. Franz. Web 1T 5-Gram Version 1. Linguistic Data Consortium, 2006.
[4] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research (JMLR), 2011.
[5] G. de Melo. Exploiting Web Data Sources for Advanced NLP. In Proceedings of COLING 2012.
[6] G. de Melo, K. Hose. Searching the Web of Data. In Proceedings of ECIR 2013. Springer LNCS.
[7] J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In USENIX OSDI 2004, San Francisco, CA, pp. 137-149.
[8] C. Engle, A. Lupher, R. Xin, M. Zaharia, M. Franklin, S. Shenker, I. Stoica. Shark: Fast Data Analysis Using Coarse-Grained Distributed Memory. In Proceedings of SIGMOD 2013, ACM, pp. 689-692.
[9] A. Ghoting, P. Kambadur, E. Pednault, R. Kannan. NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce. In Proceedings of KDD 2011, ACM, New York, NY, USA, pp. 334-342.
[10] K. Hammond, A. Varde. Cloud-Based Predictive Analytics. In ICDM-13 KDCloud Workshop, Dallas, Texas, December 2013.

67

slide-150
SLIDE 150

[11] K. Hose, R. Schenkel, M. Theobald, G. Weikum. Database Foundations for Scalable RDF Processing. Reasoning Web 2011, pp. 202-249.
[12] R. Kiros, R. Salakhutdinov, R. Zemel. Multimodal Neural Language Models. In Proceedings of ICML 2014.
[13] T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. Franklin, M. Jordan. MLbase: A Distributed Machine Learning System. In Conference on Innovative Data Systems Research, 2013.
[14] Q. V. Le, T. Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of ICML 2014.
[15] J. Leskovec, A. Rajaraman, J. Ullman. Mining of Massive Datasets. Cambridge University Press, 2011.
[16] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean. Distributed Representations of Words and Phrases and Their Compositionality. In NIPS 26, pp. 3111-3119, 2013.
[17] R. Nayak, P. Senellart, F. Suchanek, A. Varde. Discovering Interesting Information with Advances in Web Technology. SIGKDD Explorations 14(2): 63-81, 2012.
[18] M. Riondato, J. DeBrabant, R. Fonseca, E. Upfal. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. In Proceedings of CIKM 2012, ACM, pp. 85-94.
[19] F. Suchanek, A. Varde, R. Nayak, P. Senellart. The Hidden Web, XML and the Semantic Web: Scientific Data Management Perspectives. EDBT 2011, Uppsala, Sweden, pp. 534-537.

68

slide-151
SLIDE 151

[20] N. Tandon, G. de Melo, G. Weikum. Deriving a Web-Scale Common Sense Fact Database. In Proceedings of AAAI 2011.
[21] J. Tancer, A. Varde. The Deployment of MML for Data Analytics over the Cloud. In ICDM-11 KDCloud Workshop, December 2011, Vancouver, Canada, pp. 188-195.
[22] A. Thusoo, J. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy. Hive: A Petabyte Scale Data Warehouse Using Hadoop. 2009.
[23] J. Turian, L. Ratinov, Y. Bengio. Word Representations: A Simple and General Method for Semi-Supervised Learning. In Proceedings of ACL 2010.
[24] A. Varde, F. Suchanek, R. Nayak, P. Senellart. Knowledge Discovery over the Deep Web, Semantic Web and XML. DASFAA 2009, Brisbane, Australia, pp. 784-788.
[25] T. White. Hadoop: The Definitive Guide. O’Reilly, 2011.
[26] http://flink.incubator.apache.org
[27] https://github.com/twitter/scalding
[28] http://hortonworks.com/labs/stinger/
[29] http://linkeddata.org/
[30] http://mahout.apache.org/
[31] https://storm.apache.org

69

slide-152
SLIDE 152

Contact Information Gerard de Melo (gdm@demelo.org) [http://gerard.demelo.org] Aparna Varde (vardea@montclair.edu) [http://www.montclair.edu/~vardea]

70