
Clustering: k-means, Expectation-Maximization
Ethics: Ethical Questions in AI

Based partly on: M desJardins, T Oates, P Matuszek, RJ Mooney: www.cs.utexas.edu/~mooney/cs388/slides/TextClustering.ppt, and other sources as noted


What is Clustering?

  • Given some instances of data: group them such that
  • Examples within a group are similar
  • Examples in different groups are different
  • These groups are clusters
  • A kind of unsupervised learning – the instances do not include a class attribute.

Clustering Example

[Figure: scatter plot of unlabeled data points]

A Different Example

  • How would you group
  • ‘The price of crude oil has increased significantly’
  • ‘Demand for crude oil outstrips supply’
  • ‘Some people do not like the flavor of olive oil’
  • ‘The food was very oily’
  • ‘Crude oil is in short supply’
  • ‘Oil platforms extract oil’
  • ‘Canola oil is supposed to be healthy’
  • ‘Iraq has significant oil reserves’
  • ‘There are different types of cooking oil’

A note: you might or might not know how many clusters to look for.


Another Example

[Figure: a further clustering example]


Some Example Uses

Clustering Basics

  • Collect examples
  • Compute similarity among examples according to some metric
  • Group examples together such that:
  • 1. Examples within a cluster are similar
  • 2. Examples in different clusters are different
  • Summarize each cluster
  • Sometimes: assign new instances to the cluster they are most similar to

Measures of Similarity

  • To do clustering we need some measure of similarity.
  • This is basically our “critic”
  • Computed over a vector of values representing instances
  • Types of values depend on domain:
  • Documents: bag of words, linguistic features
  • Purchases: cost, purchaser data, item data
  • Census data: most of what is collected
  • Multiple different measures exist

Measures of Similarity

  • Semantic similarity (but that’s hard)
  • For example, olive oil/crude oil
  • Similar attribute counts
  • Number of attributes with the same value
  • Appropriate for large, sparse vectors
  • Bag-of-Words: BoW (see the sketch after this list)
  • More complex vector comparisons:
  • Euclidean Distance
  • Cosine Similarity
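For a concrete feel for the similar-attribute-counts idea, here is a minimal Python sketch on two of the oil sentences from earlier, using an illustrative bag-of-words set representation (the lowercase/whitespace tokenization is a simplifying assumption):

```python
# Bag-of-words as a set of word attributes; similarity = number of shared words.
def bow(sentence):
    return set(sentence.lower().split())

s1 = bow('The price of crude oil has increased significantly')
s2 = bow('Crude oil is in short supply')
s3 = bow('Canola oil is supposed to be healthy')

print(len(s1 & s2))  # 2 shared attributes: 'crude', 'oil'
print(len(s1 & s3))  # 1 shared attribute: 'oil'
```

Even this crude count already pulls the two crude-oil sentences together more strongly than the cooking-oil one.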

Euclidean Distance

  • Euclidean distance: the square root of the squared differences between two instances, summed across each feature:

dist(xi, xj) = sqrt((xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xin - xjn)^2)

  • Squared differences give more weight to larger differences
  • dist([1,2], [3,8]) = sqrt((1-3)^2 + (2-8)^2) = sqrt((-2)^2 + (-6)^2) = sqrt(4 + 36) = sqrt(40) ≈ 6.3

Euclidean

  • Calculate differences
  • Ears: pointy?
  • Muzzle: how many inches long?
  • Tail: how many inches long?

dist(x1, x2) = sqrt((0-1)^2 + (3-1)^2 + (2-4)^2) = sqrt(9) = 3
dist(x1, x3) = sqrt((0-0)^2 + (3-3)^2 + (2-3)^2) = sqrt(1) = 1
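A minimal sketch of the calculation, assuming the three feature vectors implied by the arithmetic above:

```python
import math

def euclidean(xi, xj):
    # Square root of the squared differences, summed across each feature.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

# Feature order: ears pointy? (0/1), muzzle inches, tail inches.
x1, x2, x3 = [0, 3, 2], [1, 1, 4], [0, 3, 3]
print(euclidean(x1, x2))          # 3.0
print(euclidean(x1, x3))          # 1.0
print(euclidean([1, 2], [3, 8]))  # ~6.32, as on the previous slide
```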


Based on home.iitk.ac.in/~mfelixor/Files/non-numeric-Clustering-seminar.ppt, with thanks

Cosine Similarity

  • A measure of similarity between vectors
  • Find cosine of the angle between them
  • Cosine = 1 when angle = 0
  • Cosine < 1 otherwise
  • As the angle θ between the vectors shrinks, the cosine approaches 1
  • Meaning: the two vectors are getting closer
  • Meaning: the similarity of whatever is represented by the vectors increases

  • Vectors can have any number of dimensions

[Figure: vectors <x1,y1> and <x2,y2> in the x-y plane, with angle θ between them]

Cosine Similarity

[Figure: points A(3,2), B(1,4), C(3,3) plotted on Muzzle and Tail axes]

Most similar?
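A small sketch answering the question, computing the pairwise cosines for the plotted points (values rounded in the comments):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

A, B, C = (3, 2), (1, 4), (3, 3)
print(cosine(A, C))  # ~0.98 <- highest: A and C point in nearly the same direction
print(cosine(B, C))  # ~0.86
print(cosine(A, B))  # ~0.74
```

By cosine similarity, A and C are the most similar pair.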

Euclidean Distance vs Cosine Similarity vs Other

  • Cosine Similarity:
  • Measures relative proportions of various features
  • Ignores magnitude
  • When all the correlated dimensions between two vectors are in proportion, you get maximum similarity

  • Euclidean Distance:
  • Measures actual distance between two points
  • More concerned with absolutes
  • Often similar in practice, especially on high dimensional data
  • Consider meaning of features/feature vectors for your domain

Justin Washtell @ semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/
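To make the contrast concrete, a quick self-contained sketch (the vectors are illustrative): two vectors with the same proportions are maximally similar under cosine while remaining far apart under Euclidean distance.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u, v = [1, 2], [10, 20]   # same proportions, very different magnitudes
print(cosine(u, v))       # 1.0  -> maximally similar in direction
print(euclidean(u, v))    # ~20.1 -> far apart in absolute terms
```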

Clustering Algorithms

  • Flat:
  • k-means
  • Hierarchical:
  • Bottom up
  • Top down (not common)
  • Probabilistic:
  • Expectation Maximization (E-M)

Partitioning (Flat) Algorithms

  • Partitioning method
  • Construct a partition of n instances into a set of k clusters
  • Given: a set of documents and the number k
  • Find: a partition of k clusters that optimizes the chosen partitioning criterion

  • Globally optimal: exhaustively enumerate all partitions.
  • Usually too expensive.
  • Effective heuristic methods: k-means algorithm.

www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt

k-means Clustering

  • Simplest partitioning (flat) method, widely used
  • Create clusters based on a centroid; each instance is assigned to the closest centroid

  • K is given as a parameter
  • Heuristic and iterative

k-means Algorithm

  • 1. Choose k (the number of clusters)
  • 2. Randomly choose k instances to center clusters on
  • 3. Assign each point to the centroid it's closest to, forming clusters
  • 4. Recalculate centroids of the new clusters
  • 5. Reassign points based on the new centroids
  • 6. Iterate steps 3–5 until convergence (no point is reassigned) or a fixed number of iterations
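A minimal NumPy sketch of these steps (an illustration, not code from the original slides; Euclidean distance is assumed for the assignment step):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose k instances as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 3/5: assign each point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate centroids of the new clusters
        # (an empty cluster keeps its old centroid).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # Step 6: converged once no centroid moves (so no point is reassigned).
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [0.5, 1.5], [8.0, 8.0], [9.0, 8.5]])
print(kmeans(X, k=2))  # two clear groups
```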


k-means Example (k=2)

www.youtube.com/watch?v=5I3Ei69I40s

  • 1. randomly place centroids
  • 2. iteratively:
  • assign points to closest centroid, forming clusters
  • calculate centroids of new clusters
  • 3. until convergence

This (happens to be) a pretty good random initialization!

k-means

  • Tradeoff between having more clusters (better focus within each cluster) and having too many clusters.
  • Overfitting is a possibility with too many!
  • Results depend on random seed selection.
  • Some seeds can result in slow convergence or convergence to poor clusters

  • Algorithm is sensitive to outliers
  • Data points that are very far from other data points
  • Could be errors, special cases, …

www.csee.umbc.edu/~nicholas/676/MRSslides/lecture17-clustering.ppt

Problem: Bad Initial Seeds

datasciencelab.wordpress.com/2014/01/15/improved-seeding-for-clustering-with-k-means/

Advantages

  • Easy to understand, implement
  • Most popular clustering algorithm
  • Efficient, almost linear
  • Time complexity: O(tkn)
  • n = number of data points
  • k = number of clusters
  • t = number of iterations
  • In practice, performs well (especially on text)

Disadvantages

  • Must choose k beforehand
  • Bad k → bad clusters
  • Sometimes we don't know k
  • Sensitive to initialization
  • One fix: run several times with different random centers and look for agreement (see the sketch after this list)
  • Sensitive to outliers, irrelevant features
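One common version of that fix, sketched with scikit-learn's KMeans (assuming scikit-learn is available): `n_init` reruns the algorithm from several random centers and keeps the run with the lowest within-cluster sum of squares.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [0.5, 1.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])

# n_init=10: ten k-means runs from different random centers;
# the best run (lowest inertia) is kept.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_, km.inertia_)
```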


Evaluation of k-means

Expectation Maximization Clustering

  • Expectation-Maximization is a core ML algorithm
  • Not just for clustering!
  • Basic idea: assign instances to clusters probabilistically rather than absolutely
  • Instead of assigning membership in a group, learn a probability function for each group
  • Instead of absolute assignments, output is the probability of each instance being in each cluster



EM Clustering Algorithm

  • Goal: maximize overall probability of data
  • Iterate between:
  • Expectation: estimate the probability that each instance belongs to each cluster
  • Maximization: recalculate the parameters of the probability distribution for each cluster

  • Until convergence or iteration limit.
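A minimal sketch of the two steps for a one-dimensional mixture of two Gaussians; the Gaussian model, the initialization, and the fixed iteration count are illustrative assumptions, not the only choices:

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize each cluster's parameters: mean, variance, mixing weight.
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    weight = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: estimate P(cluster j | instance i) for every instance.
        dens = weight * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: recalculate each cluster's parameters from the soft labels.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        weight = nk / len(x)
    return mu, var, weight, resp  # resp[i, j] is the soft cluster membership

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
mu, var, weight, resp = em_gmm_1d(x)
print(np.sort(mu))  # means land near 0 and 5
```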


Expectation Maximization (EM)

  • Probabilistic method for soft clustering
  • Idea: learn k classifications from unlabeled data
  • Assumes k clusters:{c1, c2,… ck}
  • “Soft” version of k-means
  • Assumes a probabilistic model of categories (such as Naive Bayes)
  • Allows computing P(ci | I) for each category ci, for a given instance I

(Slightly) More Formally

  • Iteratively learn a probabilistic categorization model from unsupervised data
  • Initially assume random assignment of examples to categories
  • "Randomly label" data
  • Learn an initial probabilistic model by estimating model parameters θ from the randomly labeled data

  • Iterate until convergence:
  • Expectation (E-step):
  • Compute P(ci | I) for each instance (example) given the current model
  • Probabilistically re-label the examples based on these posterior probability estimates
  • Maximization (M-step): Re-estimate model parameters, θ, from re-labeled data

EM Walkthrough

Initialize: assign random probabilistic labels to the unlabeled data.

https://www.mathworks.com/matlabcentral/fileexchange/24867-gaussian-mixture-model-m

Initialize: give the soft-labeled training data to a probabilistic learner.

Initialize: produce a probabilistic classifier.


E step: relabel the unlabeled data using the trained classifier.

M step: retrain the classifier on the relabeled data.

Continue EM iterations until the probabilistic labels on the unlabeled data converge.

EM Summary

  • Basically a probabilistic k-means.
  • Has many of same advantages and disadvantages
  • Results are easy to understand
  • Have to choose k ahead of time
  • Useful in domains where we want the likelihood that an instance belongs to more than one cluster
  • Natural language processing, for instance