Structural-Factor Modeling of Big Dependent Data Ruey S. Tsay Booth - PowerPoint PPT Presentation

Structural-Factor Modeling of Big Dependent Data Ruey S. Tsay Booth School of Business, University of Chicago January 11, 2019 Joint with: Zhaoxing Gao R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 1 / 56

Table of Contents Some Big Data Examples Available Statistical Methods Approximate Factor Model and Its limitations The Proposed Model and Methodology Theoretical Properties Numerical Results Conclusions R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 2 / 56

Data tsunami Information and technology have revolutionized data collection. Millions of surveillance video cameras and billions of Internet searches and social media chats and tweets produce massive data that contain vital information about security, public health, consumer preference, business sentiments, economic health, among others. Billions of prescriptions and enormous amount of genetic and genomic information provide critical data on health and precision medicine. R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 3 / 56

Big data are ubiquitous ’There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days’ – Eric Schmidt, Former CEO of Google R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 4 / 56

What is Big Data? Large and Complex data: Structured data ( n and p are both large); Unstructured data (email, text, web, videos) Biological Sci.: Genomics, Medicine, Genetics, Neuroscience Engineering: Machine Learning, Computer Vision, Networks Social Sci.: Economics, Business and Digital Humanities Natural Sci.: Meteorology, Earth Science, Astronomy Characterizes contemporary scientific and decision problems R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 5 / 56

Examples: Biological Sciences Bioinformatics: disease classification/predicting clinical outcomes using microarray data or proteomic data Association studies between phenotypes and SNPs (eQTL) Detecting activated voxels after stimuli in Neuroscience R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 6 / 56

Example: Machine Learning Document or text classification: E-mail spam Computer vision, object classification (images, curves) Social media and Internet Online learning and recommendation Surveillance videos and network security R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 7 / 56

Example: Finance, Economics and Business Data: Stock, currency, derivative, commodity, high-frequency trades, macroeconomic, unstructured news and texts, consumers’ confidence and business sentiments in social media and Internet US Unemployment Rate: 1976.1 to 2018.8 49 Industry Portfolios: 1926.7 to 2018.4 40 15 30 20 urate 10 10 0 5 −20 0 100 200 300 400 500 0 2000 4000 6000 8000 10000 12000 Time Time Social media contains useful information on economic health, consumer confidence and preference, suppliers and demands Retail sales provide useful information on public health, economic health, consumer confidence and preference, etc. R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 8 / 56

Example: Finance, Economics and Business Risk and portfolio management: Managing 2K stocks involves 2m elements in covariance Credit: Default probability depends on firm specific attributes, market conditions, macroeconomic variables, feedback effects of firms, etc. Predicting Housing prices: 1000 neighborhoods require 1m parameters using, e.g. VAR(1), X t = AX t − 1 + ε t . R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 9 / 56

What can Big Data do? Hold great promises for understanding Heterogeneity: personalized medicine or services Commonality: in presence of large variations (noises) Dependence: financial data series from large pools of variables, factors, genes, environments, and their interactions as well as latent factors. R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 10 / 56

Available statistical methods (TS) Focus on sparsity 1 LASSO: Tibshirani (1996) Group Lasso: Yuan and Lin (2006) Elastic net: Zou and Hastie (2005) SCAD: Fan and Li (2001) Fused LASSO: Tibshirani et al. (2005) Focus on dimension reduction 2 PCA: Pearson (1901) CCA: Box and Tiao (1977) SCM: Tiao and Tsay (1989) Factor model: Pe˜ na and Box (1987), Bai and Ng (2002), Stock and Watson (2005), Lam and Yao (2011, 2012) etc. R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 11 / 56

Approximate factor model (Econ. & Finance) The model: y t = Ax t + ε t , (1) where { y 1 , ..., y n } with y t = ( y 1 t , ..., y pt ) ′ ∈ R p which are observable A ∈ R p × r , x t ∈ R r are unknown ε t is the idiosyncratic component The goal Estimate the loading matrix A Recover the factor process x t Estimate the number of common factors r R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 12 / 56

Available methods Principal Component Analysis (PCA): Bai and Ng (2002, 1 Econometrica ), Bai (2003, Econometrica )... ε t is not necessarily white noise � n � t = � P � D � P ′ , � A = � P r � x t = � � Σ y = n − 1 D 1 / 2 D − 1 / 2 y t y ′ P ′ , � r y t r r t =1 ε t = y t − � x t = ( I p − � P r � � A � P ′ r ) y t Eigen-analysis on Auto-covariances: Lam, Yao and Bathia (2011, 2 Biometrika ), Lam and Yao (2012, AoS ) Assume ε t is vector white noise k 0 n � � ′ � Γ k � � k with � Γ k = n − 1 M = y t y ′ t − k , k 0 is fixed Γ k =1 t = k +1 A contains the eigenvectors of � � M corresponding to top r eigenvalues x t = � ε t = y t − � x t = ( I p − � A � A ′ y t , � A ′ ) y t � A � R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 13 / 56

Some fundamental issues PCA may fail if signal-to-noise ratio is low. For the analysis of high-dimensional financial series, as the market and economic information accumulates, the noise is often increasing faster than the signal; PCA cannot distinguish signal and noises in some sense. For example, some components in � x t might be white noises; x t in Lam and Yao (2011) includes the noise components. When � the largest eigenvalues of the noise covariance are diverging, the resulting estimators would deteriorate.; The information criterion in Bai and Ng (2002) and the ratio-based method in Lam and Yao (2011) may also fail if the largest eigenvalues of the covariance matrix of the noise are diverging. The sample covariance matrix of the estimated noises is singular if r > 0 . R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 14 / 56

Contributions of the proposed method Address the aforementioned issues from a different perspective 1 Provide a new model to understand the mechanism of factor 2 models Propose a Projected PCA to eliminate the (diverging) effect of the 3 idiosyncratic terms A new way to identify the number of factors, which is more reliable 4 than the information criterion and ratio-based method. R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 15 / 56

Setting Assume y t with E y t = 0 and admits a latent structure: � f t � � f t � y t = L = [ L 1 , L 2 ] = L 1 f t + L 2 ε t , (2) ε t ε t � f t � L ∈ R p × p is a full rank loading matrix, implying L − 1 y t = , ε t f t = ( f 1 t , . . . , f rt ) ′ is a r -dimensional factor process, ε t = ( ε 1 t , . . . , ε vt ) ′ is a v -dimensional white noise vector, r is a small and fixed nonnegative integer. Cov ( f t ) = I r , Cov ( ε t ) = I v , Cov ( f t , ε t ) = 0 , and no linear combination of f t is serially uncorrelated. R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 16 / 56

Where does this come from? t − m ) ′ for Canonical Correlation Analysis (CCA): Let η t = ( y ′ t − 1 , ..., y ′ sufficient large m , Σ y = Cov ( y t ) , Σ η = Cov ( η t ) and Σ yη = Cov ( y t , η t ) L − 1 contains the eigenvectors of Σ − 1 y Σ yη Σ − 1 η Σ ′ yη associated with its descending eigenvalues, L − 1 y t has uncorrelated components and their correlation with the past lagged variables are in decreasing order Assume the top r eigenvalues are non-zero L − 1 y t = ( f ′ t , ε ′ t ) ′ See Tiao and Tsay (1989, JRSSB): Include all finite-order VARMA models. R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 17 / 56

Why CCA does not always work in practice? A natural method is to adopt the CCA at the sample level to recover the latent structure. But, the sample covariance matrix is not consistent to the population one in high dimension; the sample precision matrix is not consistent to the population one in high dimension. For instance, when the dimension is greater than the sample size. R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 18 / 56

Structural-Factor Modeling of Big Dependent Data Ruey S. Tsay Booth - PowerPoint PPT Presentation

Structural-Factor Modeling of Big Dependent Data Ruey S. Tsay Booth School of Business, University of Chicago January 11, 2019 Joint with: Zhaoxing Gao R. Tsay (U Chicago) Factor-Modeling of Big Dependent Data January 11, 2019 1 / 56 Table

Why Dependent Origination? So what is dependent origination? Dependent on ignorance, there

Confirmatory Factor Analysis and Exploratory-Confirmatory Factor Analysis Maximum

Machine Learning Anders Holst SICS Big Data Analytics Analysis Big Data Big Value Big Data

Verilog HDL:Digital Design and Modeling Chapter 9 Structural Modeling Chapter 9 Structural

Triadic Factor Analysis Cynthia Glodeanu Institute of Algebra, TU Dresden October 19, 2010.

Time- -dependent Similarity Measure dependent Similarity Measure Time Time-dependent Similarity

Big Data Algorithms with Medical Applications Yixin Chen Outline Challenges to big data

Analysis of Big Dependent Data in Economics and Finance Ruey S. Tsay Booth Shool of Business,

Model the WAIS-III IQ Scale Erin Buchanan Professor DataCamp Structural Equation Modeling with

Dependent Eligibility Audit Dependent Eligibility Audit Purpose: The dependent eligibility audit

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara CS535 Big Data | Computer Science Department

COMP9313: Big Data Management Introduction to Big Data Management What is big data? Tweeted by

Attribute Grammars intermediate syntax semantics representation Language Implementation 2

Certainty Factor certainty factor CF (is the certainty factor in the hypothesis H due to

(IHBG) Competitive NOFA Training Rating Factor 3: Soundness of Approach 1 Rating Factor 3

Predicting condition specific transcription factors for target gene. Kaur Alasoo 19.09.2012

Introduction ECN 102: Analysis of Economic Data Winter, 2011 J. Parman (UC-Davis) Analysis of

dedate then regulatory network gere neuroscience lo Primary decomposition model wiring diagram

The Stuxnet Worm Babak Yadegari and Paul Mueller CSc 566: Computer Security April 25, 2012

Pathways to Discovering Supernova Neutrinos Thomas D. P . Edwards , Sebastian Baum, Bradley J.

Poverty in Canada: Unidimensional and Multidimensional Measures Presented by: Lori J Curtis, PhD

Overview Preterm Birth The Persistent Dilemma of Preterm Delivery Prevalence Etiology Current

Bleeding in Dialysis Patients Dr. Birnbaumer has no financial disclosures Diane M.

Applying IPFIX to Network Measurement and Management presented