Big Data Management & Analytics – EXERCISE 8: TEXT PROCESSING, PCA


SLIDE 1

Big Data Management & Analytics

EXERCISE 8 – TEXT PROCESSING, PCA

21st of December, 2015

Sabrina Friedl, LMU Munich


SLIDE 2

Principal Component Analysis (PCA)

REVISION AND EXAMPLE


SLIDE 3

Goals of PCA


Find a lower-dimensional representation of data to:

  • Detect hidden correlations
  • Remove (summarize) redundant, irrelevant or noisy features
  • Facilitate interpretation and visualization (visualization is only possible for a few dimensions)
  • Make storage and processing of data easier

[Figure: the same data shown in d = 3 and in a reduced d = 2 representation]

SLIDE 4

Idea of PCA

A good data representation retains the main differences between data points but eliminates irrelevant variance

  • Given matrix X: n data points with d dimensions (features)
  • Find k directions (linear combinations of dimensions) with the highest variance = principal components: v1, v2, ... vk
  • Project the data points onto these directions
  • General form: XP = Y

(n x d) * (d x k) = (n x k)

X = raw data matrix, P = (v1, v2, ... vk) transformation matrix, Y = k-dimensional representation of X
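As a quick check of the shapes involved, here is a minimal NumPy sketch of the projection step; the matrices are random stand-ins (in real PCA, P holds eigenvectors of the covariance matrix, as the next slides show):

```python
import numpy as np

n, d, k = 100, 5, 2        # n data points, d original features, k target dimensions

X = np.random.randn(n, d)  # stand-in raw data matrix (n x d)
P = np.random.randn(d, k)  # stand-in transformation matrix (d x k)

Y = X @ P                  # project: (n x d) * (d x k) = (n x k)
print(Y.shape)             # (100, 2)
```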

SLIDE 5

PCA – Graphical Intuition

[Figure: center the data, then transform by P]

SLIDE 6

How to get Principal Components?


Calculate the eigenvalues and eigenvectors of the covariance matrix

Σ = COV(X, X) describes the pairwise covariance between all features. For a centered data matrix X with µ = 0 we can calculate the covariance matrix as:

Σ = (1/n) XᵀX

Sigma here is the name of the matrix, not the sum symbol!
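A small NumPy sketch of this formula; the data matrix is a random stand-in, and np.cov is called with bias=True so that it also divides by n rather than n − 1:

```python
import numpy as np

X = np.random.randn(100, 3)   # stand-in data: n = 100 points, d = 3 features
X = X - X.mean(axis=0)        # center so that each feature has mean 0

Sigma = (X.T @ X) / len(X)    # covariance matrix: (1/n) X^T X, shape (d x d)

# np.cov with rowvar=False treats rows as observations; bias=True divides by n
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```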

SLIDE 7

Eigenvalues and Eigenvectors

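As a reminder of the definitions used on the following slides: an eigenvector v of Σ satisfies Σv = λv for some scalar λ, its eigenvalue. A minimal NumPy check on a hand-picked symmetric matrix, using np.linalg.eigh (appropriate here because covariance matrices are symmetric):

```python
import numpy as np

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])            # small symmetric (covariance-like) matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order,
                                          # eigenvectors as columns

v, lam = eigvecs[:, 0], eigvals[0]
assert np.allclose(Sigma @ v, lam * v)    # defining property: Sigma v = lambda v
```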

SLIDE 8

Dimension Reduction


For π‘œ dimensions of π‘Œ we get π‘œ eigevalues and eigenvectors. The transformation matrix is then constructed by putting the eigenvectors as columns into a matrix: T = 𝑀1, 𝑀2, … π‘€π‘œ Eigendecomposition: Ξ£ = π‘ˆΙ…π‘ˆπ‘ˆ To get a k-dimensional representation Y of (centered) data X we take only the first k eigenvectors (principal components) of T and call this matrix P. We calculate: 𝒀𝑸 = Y To transform back: Z = π‘π‘„π‘ˆ

Ξ£ = covariance matrix T = (v1, v2,... vn) transformation matrix Ι… = diagonalised matrix with eigenvalues on diagonal
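A minimal NumPy sketch of this slide, with a random stand-in data matrix: eigendecompose Σ, sort the eigenvectors by decreasing eigenvalue, keep the first k as P, project, and transform back:

```python
import numpy as np

X = np.random.randn(100, 4)
X = X - X.mean(axis=0)              # PCA assumes centered data

Sigma = (X.T @ X) / len(X)          # covariance matrix (d x d)
eigvals, T = np.linalg.eigh(Sigma)  # columns of T are eigenvectors

order = np.argsort(eigvals)[::-1]   # sort by decreasing eigenvalue
T = T[:, order]

k = 2
P = T[:, :k]                        # first k principal components (d x k)
Y = X @ P                           # k-dimensional representation (n x k)
X_back = Y @ P.T                    # back-transform: approximates X when k < d
print(Y.shape, X_back.shape)        # (100, 2) (100, 4)
```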

SLIDE 9

PCA – Summary of Steps

1. Center the data X: xⱼ − µⱼ
2. Calculate the covariance matrix: Σ = (1/n) XᵀX
3. Calculate the eigenvalues and eigenvectors of Σ
  • Calculate the eigenvalues λ by finding the zeros of the characteristic polynomial: det(Σ − λI) = 0
  • Calculate the eigenvectors by solving (Σ − λI)v = 0
4. Select the k eigenvectors with the biggest eigenvalues and create P = (v1, v2, ... vk)
5. Transform the original (n x d) matrix X to an (n x k) representation: XP = Y
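Putting the five steps together, a minimal end-to-end sketch in NumPy (the function name pca and the toy input are illustrative, not from the exercise):

```python
import numpy as np

def pca(X, k):
    """Reduce X (n x d) to a k-dimensional representation, following steps 1-5."""
    X_centered = X - X.mean(axis=0)               # 1. center the data
    Sigma = (X_centered.T @ X_centered) / len(X)  # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # 3. eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]             # 4. biggest eigenvalues first
    P = eigvecs[:, order[:k]]                     #    P = (v1, ..., vk)
    return X_centered @ P                         # 5. XP = Y, shape (n x k)

Y = pca(np.random.randn(50, 10), k=3)
print(Y.shape)  # (50, 3)
```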

SLIDE 10

Useful links

  • KDD II script: http://www.dbs.ifi.lmu.de/Lehre/KDD_II/WS1516/skript/KDD2-2-HDData.DimensionalityReduction.pdf
  • A tutorial about PCA: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
