SLIDE 1

Data Mining Techniques

CS 6220 - Section 3 - Fall 2016

Lecture 21: Review

Jan-Willem van de Meent

SLIDE 2

Schedule

SLIDE 3

Topics for Exam

Pre-Midterm

  • Probability
  • Information Theory
  • Linear Regression
  • Classification
  • Clustering

Post-Midterm

  • Topic Models
  • Dimensionality Reduction
  • Recommender Systems
  • Association Rules
  • Link Analysis
  • Time Series
  • Social Networks
SLIDE 4

Post-Midterm Topics

SLIDE 5

Topic Models

  • Bag of words representations of documents
  • Multinomial mixture models
  • Latent Dirichlet Allocation
  • Generative model
  • Expectation Maximization (PLSA/PLSI)
  • Variational inference (high level)
  • Perplexity
  • Extensions (high level)
    ▪ Dynamic Topic Models
    ▪ Supervised LDA
    ▪ Ideal Point Topic Models
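As a reference point for the generative-model bullet, here is a minimal sketch of how documents are sampled under LDA. The dimensions and hyperparameter values are illustrative choices, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D, N = 3, 20, 5, 30        # topics, vocabulary size, documents, words per document
alpha, beta = 0.1, 0.01          # symmetric Dirichlet hyperparameters (illustrative)

# topic-word distributions phi_k ~ Dirichlet(beta)
phi = rng.dirichlet(beta * np.ones(V), size=K)

documents = []
for d in range(D):
    theta = rng.dirichlet(alpha * np.ones(K))   # per-document topic proportions
    words = []
    for n in range(N):
        z = rng.choice(K, p=theta)              # topic assignment for word n
        w = rng.choice(V, p=phi[z])             # word drawn from that topic
        words.append(w)
    documents.append(words)

print(documents[0])
```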
SLIDE 6

Dimensionality Reduction

Principal Component Analysis

  • Interpretation as minimization of reconstruction error
  • Interpretation as maximization of captured variance
  • Interpretation as EM in generative model
  • Computation using eigenvalue decomposition
  • Computation using SVD
  • Applications (high-level)
    ▪ Eigenfaces
    ▪ Latent Semantic Analysis
    ▪ Relationship to LDA
    ▪ Multi-task learning
  • Kernel PCA
    ▪ Direct method vs modular method
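To make the eigendecomposition/SVD bullets concrete, a minimal numpy sketch (toy data, illustrative variable names) that computes principal components from the SVD of the centered data and uses them for projection and reconstruction:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))              # toy data: 100 points in 5 dimensions

Xc = X - X.mean(axis=0)                    # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
W = Vt[:k]                                 # top-k principal directions (rows)
var_captured = S[:k] ** 2 / (X.shape[0] - 1)   # variance captured by each component
Z = Xc @ W.T                               # low-dimensional projection

X_hat = Z @ W + X.mean(axis=0)             # reconstruction from k components
print(var_captured, np.mean((X - X_hat) ** 2))  # captured variance, reconstruction error
```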

SLIDE 7

Dimensionality Reduction

  • Canonical Correlation Analysis
    ▪ Objective
    ▪ Relationship to PCA
  • Regularized CCA
    ▪ Motivation
    ▪ Objective

  • Singular Value Decomposition
  • Definition
  • Complexity
  • Relationship to PCA
  • Random Projections
  • Johnson-Lindenstrauss Lemma
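For the Johnson-Lindenstrauss bullet, a minimal sketch of a Gaussian random projection; the data and target dimension are illustrative. Pairwise distances are approximately preserved after projection.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 200, 1000, 50                    # points, original dimension, target dimension
X = rng.normal(size=(n, d))

R = rng.normal(size=(d, k)) / np.sqrt(k)   # scaled Gaussian random projection matrix
Y = X @ R                                  # projected data

# distances before vs after projection for one pair of points
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))
```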
SLIDE 8

Dimensionality Reduction

  • Stochastic Neighbor Embeddings
  • Similarity definition in original space
  • Similarity definition in lower dimensional space
  • Definition of objective in terms of KL divergence
  • Gradient of objective
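A rough sketch of the SNE ingredients listed above: Gaussian similarities in the original space, similarities in the embedding, and the KL objective. For brevity it uses a single shared bandwidth rather than the per-point calibration a real implementation would use.

```python
import numpy as np

def pairwise_sq_dists(X):
    # squared Euclidean distances between all pairs of rows
    sq = np.sum(X ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * X @ X.T

def similarities(X, sigma=1.0):
    # row-normalized Gaussian similarities p_{j|i} (single sigma for simplicity)
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum(axis=1, keepdims=True)

def kl_objective(P, Q, eps=1e-12):
    # sum of KL(P_i || Q_i) over all points, the SNE objective
    return np.sum(P * np.log((P + eps) / (Q + eps)))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))   # original data
Y = rng.normal(size=(50, 2))    # a (random) low-dimensional embedding

print(kl_objective(similarities(X), similarities(Y)))
```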
SLIDE 9

Recommender Systems

  • Motivation: The long tail of product popularity
  • Content-based filtering
  • Formulation as a regression problem
  • User and item bias
  • Temporal effects
  • Matrix Factorization
  • Formulation of recommender systems as matrix factorization

  • Solution through alternating least squares
  • Solution through stochastic gradient descent
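A compact sketch of matrix factorization solved with alternating least squares. For brevity the toy rating matrix is fully observed; a real recommender would restrict each least-squares problem to the observed entries.

```python
import numpy as np

rng = np.random.default_rng(4)
R = rng.integers(1, 6, size=(20, 15)).astype(float)   # toy user-item rating matrix

k, lam, sweeps = 3, 0.1, 20                            # latent dimension, ridge penalty, iterations
U = rng.normal(scale=0.1, size=(R.shape[0], k))        # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))        # item factors

for _ in range(sweeps):
    # fix V, solve a ridge-regularized least squares for the user factors
    U = R @ V @ np.linalg.inv(V.T @ V + lam * np.eye(k))
    # fix U, solve the same problem for the item factors
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(k))

print(np.mean((R - U @ V.T) ** 2))                     # reconstruction error
```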
SLIDE 10

Recommender Systems

  • Collaborative filtering
  • (user, user) vs (item, item) similarity

▪ pros and cons of each approach

  • Parzen-window CF
  • Similarity measures

▪ Pearson correlation coefficient
  ▪ Regularization for small support
  ▪ Regularization for small neighborhood
▪ Jaccard similarity
  ▪ Regularization
▪ Observed/expected ratio
  ▪ Regularization
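To pin down the Pearson bullet, a small sketch of (user, user) similarity computed on co-rated items, with a simple shrinkage toward zero when the support is small; the shrinkage constant is illustrative.

```python
import numpy as np

def pearson_similarity(ru, rv, shrinkage=10.0):
    # ru, rv: rating vectors with np.nan marking unrated items
    both = ~np.isnan(ru) & ~np.isnan(rv)      # items rated by both users
    n = both.sum()
    if n < 2:
        return 0.0
    u = ru[both] - ru[both].mean()
    v = rv[both] - rv[both].mean()
    denom = np.sqrt((u ** 2).sum() * (v ** 2).sum())
    if denom == 0:
        return 0.0
    sim = (u * v).sum() / denom
    return sim * n / (n + shrinkage)          # shrink similarities with small support toward 0

ru = np.array([5, 3, np.nan, 4, 1])
rv = np.array([4, np.nan, 2, 5, 2])
print(pearson_similarity(ru, rv))
```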

SLIDE 11

Association Rules

  • Problem formulation and examples
  • Customer purchasing
  • Plagiarism detection
  • Frequent Itemset
  • Definition of (fractional) support
  • Association Rules
  • Confidence
  • Measures of interest

▪ Added value
▪ Mutual information
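A short sketch of the basic quantities on a toy transaction list: fractional support of an itemset, confidence of a rule A ⇒ B, and added value as a measure of interest.

```python
transactions = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "diapers"},
    {"milk", "bread", "diapers"},
]

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B):
    # conf(A => B) = support(A ∪ B) / support(A)
    return support(A | B) / support(A)

def added_value(A, B):
    # how much knowing A raises the probability of B
    return confidence(A, B) - support(B)

A, B = {"milk"}, {"bread"}
print(support(A | B), confidence(A, B), added_value(A, B))
```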

SLIDE 12

Association Rules

  • A-priori
  • Base principle
  • Algorithm
  • Self-joining and pruning of candidate sets
  • Maximal vs closed itemsets
  • Hash tree implementation for subset matching
  • I/O and memory limited steps
  • PCY method for reducing candidate sets
  • FP-Growth
  • FP-tree construction
  • Pattern mining using conditional FP-trees
  • Performance of A-priori vs FP-growth
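A minimal sketch of the A-priori level-wise loop: count single items, then repeatedly self-join the frequent (k−1)-itemsets and prune candidates that have an infrequent subset. The toy data and in-memory counting are purely for illustration; the I/O-limited passes over a large transaction file are what the complexity bullets above refer to.

```python
from itertools import combinations

transactions = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "diapers"},
    {"milk", "bread", "diapers"},
]
min_support = 2   # absolute support threshold (illustrative)

def frequent_itemsets(transactions, min_support):
    items = {i for t in transactions for i in t}
    # L1: frequent single items
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_support}
    frequent = list(level)
    k = 2
    while level:
        # self-join: unions of frequent (k-1)-itemsets that produce size-k candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # count support of the surviving candidates
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_support}
        frequent.extend(level)
        k += 1
    return frequent

print(frequent_itemsets(transactions, min_support))
```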
SLIDE 13

Aside: PCY vs PFP (parallel FP-Growth)

I asked an actual expert: Matteo Riondato.

Question: I notice that Spark MLlib ships PFP as its main algorithm, and that you benchmark against it as well. That said, I can imagine there might be different regimes where these algorithms are applicable. For example, you look at large numbers of transactions (order 10^7) but relatively small numbers of frequent items (10^3-10^4). The MMDS authors seem to emphasize the case where you cannot hold counts for all candidate pairs in memory, which presumably means numbers of items of order 10^5-10^6. Is it the case that once you are doing this at Walmart or Amazon scale, you in practice have to switch to PCY variants?

Answer: Hi Jan, this is a good question. In my opinion, it is not true that if you have millions of items then you need to use PCY variants. FP-Growth and many of its variants are most likely going to perform better anyway, because available implementations have been seriously optimized. They are not really creating and storing pairs of candidates anyway, so that is not really the problem. Hope this helps, Matteo

SLIDE 14

Link Analysis

  • Recursive formulation

▪ Interpretation of links as weighted votes
▪ Interpretation as equilibrium condition in population model for surfers (inflow equal to outflow)
▪ Interpretation as visit frequency of random surfer

  • Probabilistic model
  • Stochastic matrices
  • Power iteration
  • Dead ends (and fix)
  • Spider traps (and fix)
  • PageRank Equation

▪ Extension to topic-specific PageRank
▪ Extension to TrustRank
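To anchor the power-iteration bullet, a small PageRank sketch with teleportation on a toy three-page graph (β = 0.85 is an illustrative damping factor):

```python
import numpy as np

# column-stochastic link matrix: M[j, i] = 1/out-degree(i) if page i links to page j
M = np.array([
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.5, 0.5, 0.0],
])
n = M.shape[0]
beta = 0.85                                  # probability of following a link

r = np.ones(n) / n                           # start from the uniform distribution
for _ in range(100):
    # teleportation term deals with spider traps (dead ends would also need their
    # leaked probability mass redistributed)
    r_new = beta * M @ r + (1 - beta) / n
    if np.abs(r_new - r).sum() < 1e-10:
        break
    r = r_new

print(r)
```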

SLIDE 15

Time Series

  • Time series smoothing
  • Moving average
  • Exponential
  • Definition of a stationary time series
  • Autocorrelation
  • AR(p), MA(q), ARMA(p,q) and ARIMA(p,d,q) models
  • Hidden Markov Models
  • Relationship of dynamics to random surfer in PageRank
  • Relationship to mixture models
  • Forward-backward algorithm (see notes)
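A quick sketch of the two smoothing methods listed above on a toy series; the window size and smoothing factor are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.cumsum(rng.normal(size=100))          # toy time series (a random walk)

# moving average with window w
w = 5
moving_avg = np.convolve(x, np.ones(w) / w, mode="valid")

# exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}
alpha = 0.3
s = np.empty_like(x)
s[0] = x[0]
for t in range(1, len(x)):
    s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]

print(moving_avg[:3], s[:3])
```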
SLIDE 16

Social Networks

  • Centrality measures
    ▪ Betweenness
    ▪ Closeness
    ▪ Degree
  • Girvan-Newman algorithm for clustering
    ▪ Calculating betweenness
    ▪ Selecting the number of clusters using modularity
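For the modularity bullet, a minimal sketch of Newman's modularity Q = (1/2m) Σ_ij (A_ij − k_i k_j / 2m) δ(c_i, c_j) for a given community assignment; the graph (two triangles joined by an edge) is a toy example.

```python
import numpy as np

# adjacency matrix of a small undirected graph: two triangles joined by one edge
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
communities = np.array([0, 0, 0, 1, 1, 1])   # candidate cluster assignment

k = A.sum(axis=1)                            # node degrees
two_m = A.sum()                              # 2m = total degree
same = communities[:, None] == communities[None, :]
Q = np.sum((A - np.outer(k, k) / two_m) * same) / two_m
print(Q)
```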
SLIDE 17

Social Networks

  • Spectral clustering
  • Graph cuts
  • Normalized cuts
  • Laplacian Matrix

▪ Definition in terms of Adjacency and Degree matrix
▪ Properties of eigenvectors
  ▪ Eigenvalues are ≥ 0
  ▪ First eigenvector: eigenvalue is 0, eigenvector is [1 … 1]^T
  ▪ Second eigenvector (Fiedler vector): elements sum to 0, eigenvalue is the normalized sum of squared edge distances
  • Use of first eigenvector to find normalized cut
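To make the Laplacian bullets concrete, a small sketch that builds L = D − A for the same kind of toy graph, checks that the smallest eigenvalue is 0 with a constant eigenvector, and splits the nodes on the sign of the Fiedler vector:

```python
import numpy as np

# two triangles joined by a single edge
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

D = np.diag(A.sum(axis=1))             # degree matrix
L = D - A                              # (unnormalized) graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order, all >= 0
print(eigvals)                         # the first is ~0, with an (almost) constant eigenvector

fiedler = eigvecs[:, 1]                # second eigenvector (Fiedler vector), entries sum to ~0
print(fiedler > 0)                     # sign pattern suggests the partition / cut
```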
SLIDE 18

Pre-Midterm Topics

SLIDE 19

Conjugate Distributions

Binomial: probability of m heads in N flips
$$\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m}\, \mu^{m} (1 - \mu)^{N - m}$$

Beta: probability for bias $\mu$
$$\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, \mu^{a - 1} (1 - \mu)^{b - 1}$$

SLIDE 20

Conjugate Distributions

Posterior probability for μ given flips
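As a reminder of the standard Beta-Binomial conjugacy result (stated here from the general formula, not copied verbatim from the slide), multiplying the Binomial likelihood by the Beta prior gives a Beta posterior:

$$p(\mu \mid m, N, a, b) \propto \mathrm{Bin}(m \mid N, \mu)\,\mathrm{Beta}(\mu \mid a, b) \quad\Longrightarrow\quad p(\mu \mid m, N, a, b) = \mathrm{Beta}(\mu \mid a + m,\ b + N - m)$$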

SLIDE 21

Information Theoretic Measures

Measures covered: KL divergence, mutual information, perplexity, entropy, and the perplexity of a model.

Perplexity of a distribution $p$:
$$\mathrm{Per}(p) = 2^{-\sum_x p(x)\,\log_2 p(x)}$$

Perplexity of a model $q$ on held-out data $y_1, \dots, y_N$:
$$\mathrm{Per}(q) = 2^{-\frac{1}{N}\sum_{n=1}^{N} \log_2 q(y_n)}$$

In terms of the empirical distribution and the cross-entropy:
$$\hat{p}(y) = \frac{1}{N}\sum_{n=1}^{N} \mathbb{I}[y_n = y], \qquad H(\hat{p}, q) = -\sum_y \hat{p}(y)\,\log q(y), \qquad \mathrm{Per}(q) = e^{H(\hat{p}, q)}$$

SLIDE 22

Loss Functions

Squared loss (Linear Regression, $y \in \mathbb{R}$):
$$\tfrac{1}{2}\,(w^\top x - y)^2$$

Zero-one loss (Perceptron, $y \in \{-1, +1\}$):
$$\tfrac{1}{4}\,(\mathrm{Sign}(w^\top x) - y)^2$$

Logistic loss (Logistic Regression, $y \in \{-1, +1\}$):
$$\log\!\left(1 + \exp(-y\, w^\top x)\right)$$

Hinge loss (Soft SVMs, $y \in \{-1, +1\}$):
$$\max\{0,\ 1 - y\, w^\top x\}$$
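A compact numerical check of the four losses above (illustrative weight vector and data point; labels in {−1, +1} except for squared loss):

```python
import numpy as np

def squared_loss(w, x, y):     # linear regression, y in R
    return 0.5 * (w @ x - y) ** 2

def zero_one_loss(w, x, y):    # perceptron, y in {-1, +1}
    return 0.25 * (np.sign(w @ x) - y) ** 2

def logistic_loss(w, x, y):    # logistic regression, y in {-1, +1}
    return np.log(1 + np.exp(-y * (w @ x)))

def hinge_loss(w, x, y):       # soft-margin SVM, y in {-1, +1}
    return max(0.0, 1 - y * (w @ x))

w = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
print(squared_loss(w, x, 0.3), zero_one_loss(w, x, 1),
      logistic_loss(w, x, 1), hinge_loss(w, x, 1))
```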

SLIDE 23

Bias-Variance Trade-Off

Variance of what exactly?

Error on test set

SLIDE 24

Bias-Variance Trade-Off

$$
\begin{aligned}
E_y[(y - f(x))^2 \mid x]
&= E_y[(y - \bar{y} + \bar{y} - f(x))^2 \mid x] \\
&= E_y[(y - \bar{y})^2 \mid x] + E_y[(\bar{y} - f(x))^2 \mid x] + 2\, E_y[(y - \bar{y})(\bar{y} - f(x)) \mid x] \\
&= E_y[(y - \bar{y})^2 \mid x] + E_y[(\bar{y} - f(x))^2 \mid x] + 2\,(\bar{y} - f(x))\, E_y[(y - \bar{y}) \mid x] \\
&= E_y[(y - \bar{y})^2 \mid x] + (\bar{y} - f(x))^2
\end{aligned}
$$

Squared loss of a classifier $= E_y[(y - \bar{y})^2 \mid x] + (\bar{y} - f(x))^2$, where $\bar{y} = E_y[y \mid x]$ is the expected value of $y$ given $x$ (the prediction of a classifier that outputs the expected value for $y$).

SLIDE 25

Bias-Variance Trade-Off

Training data: $T = \{(x_i, y_i) \mid i = 1, \dots, N\}$

Classifier/regressor: $f_T = \arg\min_f \sum_{i=1}^{N} L(y_i, f(x_i))$

Expected value for $y$: $\bar{y} = E_y[y \mid x]$

Expected prediction: $\bar{f}(x) = E_T[f_T(x)]$

Bias-Variance Decomposition:
$$
\begin{aligned}
E_{y,T}[(y - f_T(x))^2 \mid x]
&= E_y[(y - \bar{y})^2 \mid x] + E_T[(\bar{f}(x) - f_T(x))^2 \mid x] + (\bar{y} - \bar{f}(x))^2 \\
&= \mathrm{var}_y(y \mid x) + \mathrm{var}_T(f_T(x)) + \mathrm{bias}(f_T(x))^2
\end{aligned}
$$

SLIDE 26

Bagging and Boosting

$$F^{\mathrm{bag}}_{T}(x) = \frac{1}{B}\sum_{b=1}^{B} f_{T_b}(x) \qquad\qquad F^{\mathrm{boost}}(x) = \frac{1}{B}\sum_{b=1}^{B} \alpha_b\, f_{w_b}(x)$$

Bagging
  • Sample B datasets T_b at random with replacement from the full data T
  • Train classifiers independently on each dataset and average the results
  • Decreases variance (i.e. overfitting); does not affect bias (i.e. accuracy)

Boosting
  • Sequential training
  • Assign higher weight to previously misclassified data points
  • Combines weighted weak learners (high bias) into a strong learner (low bias)
  • Also some reduction of variance (in later iterations)
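Finally, a minimal sketch of the bagging procedure described above, using hand-rolled decision stumps as the base learners (illustrative toy code, not the course's reference implementation):

```python
import numpy as np

rng = np.random.default_rng(6)

# toy binary classification data with labels in {-1, +1}
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

def fit_stump(X, y):
    # pick the (feature, threshold, sign) stump with the lowest training error
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                err = np.mean(np.where(X[:, j] > thr, sign, -sign) != y)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def predict_stump(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] > thr, sign, -sign)

# bagging: train B stumps on bootstrap resamples and average their votes
B = 25
stumps = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    stumps.append(fit_stump(X[idx], y[idx]))

votes = np.mean([predict_stump(s, X) for s in stumps], axis=0)
F_bag = np.where(votes > 0, 1, -1)
print("training accuracy:", np.mean(F_bag == y))
```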