Deep Residual Learning for Portfolio Optimization: With Attention - PowerPoint PPT Presentation

Deep Residual Learning for Portfolio Optimization: With Attention and Switching Modules Jeff Wang, Ph.D. Prepared for NYU FRE Seminar. March 7th, 2019 1 / 48

Overview ◮ Study model driven portfolio management strategies ◮ Construct long/short portfolio from dataset of approx. 2000 individual stocks. ◮ Standard momentum and reversal predictors/features from Jagadeesh and Titman (1993), and Takeuchi and Lee (2013). ◮ Probability of next month’s normalized return higher/lower than median value. ◮ Attention Enhanced Residual Network ◮ Optimize the magnitude of non-linearity in the model. ◮ Strike a balance between linear and complex non-linear models. ◮ Proposed network can control over-fitting. ◮ Evaluate portfolio performance against linear model and complex non-linear ANN. ◮ Deep Residual Switching Network ◮ Switching module automatically sense changes in stock market conditions. ◮ Proposed network switch between market anomalies of momentum and reversal. ◮ Examine dynamic behavior of switching module as market conditions change. ◮ Evaluate portfolio performance against Attention Enhanced ResNet. 2 / 48

Part One: Attention Enhanced Residual Network Figure 1: Fully connected hidden layer representation of multi-layer feedforward network. 3 / 48

� � � � Given input vector X , let n ∈ 1 , 2 , ..., N , i , j ∈ 1 , 2 , 3 , ..., D , and f (0) ( X ) = X . ◮ Pre-activation at hidden layer n , j W ( n ) i , j · f ( n − 1) ( X ) j + b ( n ) z ( n ) ( X ) i = � i ◮ Equivalently in Matrix Form, z ( n ) ( X ) = W ( n ) · f ( n − 1) ( X ) + b ( n ) ◮ Activation at hidden layer n , f ( n ) ( X ) = σ ( z ( n ) ( X )) = σ ( W ( n ) · f ( n − 1) ( X ) + b ( n ) ) ◮ Output layer n = N + 1, F ( X ) = f ( N +1) ( X ) = Φ( z ( N +1) ( X )) � ⊺ � exp ( z ( N +1) exp ( z ( N +1) ) ) ◮ Φ( z ( N +1) ( X )) = 1 ) , ..., c c exp ( z ( N +1) c exp ( z ( N +1) � � ) c c W ( n ) i , j , b ( n ) ◮ F ( X ) c = p ( y = c | X ; Θ), Θ = � � i 4 / 48

Universal Approximators Multilayer Network with ReLu Activation Function ◮ “Multilayer feedforward network can approximate any continuous function arbitrarily well if and only if the network’s countinuous activation function is not polynomial.” ◮ ReLu: Unbounded activation function in the form σ ( x ) = max (0 , x ). Definition loc ( R n ) is dense in C ( R n ) if for every function A set F of functions in L ∞ g ∈ C ( R n ) and for every compact set K ⊂ R n , there exists a sequence of functions f j ∈ F such that f →∞ || g − f j || L ∞ ( K ) = 0 . lim Theorem (Leshno et al., 1993) Let σ ∈ M, where M denotes the set of functions which are in L ∞ loc (Ω) . � σ ( w · x + b ) : w ∈ R n , b ∈ R � Σ n = span Then Σ n is dense in C ( R n ) if and only if σ is not an algebraic polynomial (a.e.). 5 / 48

ANN and Over-fitting Deep learning applied to financial data. ◮ Artificial Neural Network (ANN) can approximate non-linear continuous functions arbitrarily well. ◮ Financial markets offer non-linear relationships. ◮ Financial datasets are large, and ANN thrives with big datasets. When the ANN goes deeper. ◮ Hidden layers mixes information from input vectors. ◮ Information from input data get saturated. ◮ Hidden units fit noises in financial data. May reduce over-fitting with weight regularization and dropout. ◮ Quite difficult to control, especially for very deep networks. 6 / 48

Over-fitting and Generalization Power ◮ Generalization error decomposes into bias and variance. ◮ Variance: does model vary for another training dataset. ◮ Bias: closeness of average model to the true model F ∗ . Figure 2: Bias Variance Trade-Off. 7 / 48

Residual Learning: Referenced Mapping ◮ Network architecture that references mapping. ◮ Unreferenced Mapping of ANN: ◮ Y = F ( X , Θ) ◮ Underlying mapping fit by a few stacked layers. ◮ Referenced Residual Mapping (He et al., 2016): ◮ R ( X , Θ) = F ( X , Θ) − X ◮ Y = R ( X , Θ) + X 8 / 48

Residual Block Figure 3: Fully connected hidden layer representation of multi-layer feedforward network. 9 / 48

◮ Let n ∈ � � � � , f (0) ( X ) = X 1 , 2 , ..., N , i , j ∈ 1 , 2 , 3 , ..., D ◮ z ( n ) ( X ) = W ( n ) · f ( n − 1) ( X ) + b ( n ) ◮ f ( n ) ( X ) = σ ( z ( n ) ( X )) ◮ z ( n +1) ( X ) = W ( n +1) · f ( n ) ( X ) + b ( n +1) ◮ z ( n +1) ( X ) + f ( n − 1) ( X ) ◮ f ( n +1) ( X ) = σ ( z ( n +1) ( X ) + f ( n − 1) ( X )) ◮ f ( n +1) ( X ) = σ ( W ( n +1) · f ( n ) ( X ) + b ( n +1) + f ( n − 1) ( X )) In the deeper layers of residual learning system, with regularization weight decay, W ( n +1) − → 0 and b ( n +1) − → 0, and with ReLU activation function σ , we have, ◮ f ( n +1) ( X ) − → σ ( f ( n − 1) ( X )) ◮ f ( n +1) ( X ) − → f ( n − 1) ( X ) 10 / 48

Residual Learning: Referenced Mapping ◮ Residual Block ◮ Identity function is easy for residual blocks to learn. ◮ Improves performance with each additional residual block. ◮ If it cannot improve performance, simply transform via identity function. ◮ Preserves structure of input features. ◮ Concept behind residual learning is cross-fertilizing and hopeful for algorithmic portfolio management. ◮ He et al., 2016. Deep residual learning for image recognition. 11 / 48

12 / 48

Attention Module ◮ Attention Module ◮ Naturally extend to residual block to guide feature learning. ◮ Estimate soft weights learned from inputs of residual block. ◮ Enhances feature representations at selected focal points. ◮ Attention enhanced features improve predictive properties of the proposed network. ◮ Residual Mapping: R(X,Θ) = F ( X , Θ) − X ◮ Y = R(X,Θ) + X ◮ ◮ Attention Enhanced Residual Mapping: Y = ( R(X,Θ) + X ) · M ( X , Θ) ◮ Y = ( R(X,Θ) + W s · X ) · M ( X , Θ) ◮ 13 / 48

Attention Enhanced Residual Block Figure 5: Representation of Attention Enhanced Residual Block, “+” denotes element-wise addition, σ denotes leaky-relu activation function, and “X” denotes element-wise product. The short circuit occurs before σ activation, and attention mask is applied after σ activation. 14 / 48

Attention Enhanced Residual Block ◮ z a , ( n ) ( X ) = W a , ( n ) · f ( n − 1) ( X ) + b a , ( n +1) ◮ f a , ( n ) ( X ) = σ ( z a , ( n ) ( X )) ◮ z a , ( n +1) ( X ) = W a , ( n +1) · f a , ( n ) ( X ) + b a , ( n +1) ◮ f a , ( n +1) ( X ) = Φ( z a , ( n +1) ( X )) where, � ⊺ � exp ( z a , ( n +1) exp ( z a , ( n +1) ) ) ◮ Φ( z a , ( n +1) ( X )) = ) , ..., c 1 c exp ( z a , ( n +1) c exp ( z a , ( n +1) � � ) c c ◮ f ( n +1) ( X ) = [ σ ( z ( n +1) ( X ) + f ( n − 1) ( X ))] · [Φ( z a , ( n +1) ( X ))] 15 / 48

16 / 48 Figure 6:

Objective Function ◮ Objective function minimizes the error between the estimated conditional probability and the correct target label is formulated as the following cross-entropy loss with weight regularization: m y ( m ) · log F ( x ( m ) ; Θ) + (1 − y ( m ) ) · log (1 − F ( x ( m ) ; Θ)) + λ � − 1 n || Θ || 2 ◮ argmin � m F Θ W ( n ) i , j , b ( n ) ◮ Θ = � � ; || · || F is Frobenius Norm. i ◮ Cross-entropy loss speeds up convergence when trained with gradient descent algorithm. ◮ Cross-entropy loss function also has the nice property that imposes a heavy penalty if p ( y = 1 | X ; Θ) = 0 when the true target label is y=1, and vice versa. 17 / 48

Optimization Algorithm ◮ Adaptive Moment (ADAM) algo combines Momentum and RMS prop. ◮ The ADAM algorithm have been shown to work well across a wide range of deep learning architectures. ◮ Cost contours: ADAM damps out oscillations in gradients that prevents the use of large learning rate. ◮ Momentum: speed ups training in horizontal direction. ◮ RMS Prop: Slow down learning in vertical direction. ◮ ADAM is appropriate for noisy financial data. ◮ Kingma and Ba., 2015. ADAM: A Method For Stochastic Optimization. 18 / 48

ADAM 19 / 48

Experiment Setting ◮ Model Input ◮ 33 features in total. 20 normalized past daily returns, 12 normalized monthly returns for month t − 2 through t − 13, and an indicator variable for the month of January. ◮ Target Output ◮ Label individual stocks with normalized monthly return above the median as 1, and below the median as 0. ◮ Strategy ◮ Over the broad universe of US equities (approx. 2000 tickers), estimate the probability of each stock’s next month’s normalized return being higher or lower than median. ◮ Rank estimated probabilities for all stocks in the trading universe (or by industry groups), then construct long/short portfolio of stocks with estimated probability in the top/bottom decile. ◮ Long signal: p i > p ∗ , p ∗ : threshold for the top decile. ◮ Short signal: p i < p ∗∗ , p ∗∗ : threshold for the bottom decile. 20 / 48

Table 1: Trading Universe Categorized by GICS Industry Group as of January 3, 2017. 21 / 48

Deep Residual Learning for Portfolio Optimization: With Attention - PowerPoint PPT Presentation

Deep Residual Learning for Portfolio Optimization: With Attention and Switching Modules Jeff Wang, Ph.D. Prepared for NYU FRE Seminar. March 7th, 2019 1 / 48 Overview Study model driven portfolio management strategies Construct

Pipeline Strategies and conversations behind securing a Residual Bequest Agenda 1. Why Residual?

An Overview of Deep Residual Learning Semih Yagcioglu 01.03.2016 Deep Residual Learning

Clarifying Residual Flow s for Surface Water Takes August 2017 Clarifying Residual Flow s

Hao Su July 6, 2017 Outline Overview of 3D deep learning 3D deep learning algorithms

All You Want To Know About CNNs Yukun Zhu Deep Learning Deep Learning Image from

Deep Neural Networks and Deep Reinforcement Learning Deep Learning, Goodfellow, Bengio and

AMMI Introduction to Deep Learning 6.5. Residual networks Fran cois Fleuret

Deep learning Optimization and Regularization in deep networks Hamid Beigy Sharif university of

SPOT Farm East (Elveden) 2016 Residual Herbicide Demonstration Report Background The urea

Residual Flows for Invertible Generative Modeling Ricky T. Q. Chen, Jens Behrmann, David

Residual Networks (ResNet) Residual Networks (ResNet) In [1]: import d2l from mxnet import gluon,

Lecture 3 Residual Analysis + Generalized Linear Models Colin Rundel 1/23/2017 1 Residual

Residual modular Galois representations and their images Samuele Anni University of Warwick

SESSION 8: VALUING RESIDUAL CLAIMS (EQUITY) Valuing Equity Equity represents a residual

Lecture 9: Residual Analysis Instructor: Prof. Shuai Huang Industrial and Systems Engineering

Lecture 3 Residual Analysis + Generalized Linear Models Colin Rundel 1/23/2018 1 Residual

CSC321 Lecture 17: ResNets and Attention Roger Grosse Roger Grosse CSC321 Lecture 17: ResNets

Residual Analysis Inferences about a regression model are valid only under assumptions about the

Linear Regression Part 2: Residuals and Errors INFO-1301, Quantitative Reasoning 1 University of

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Introduction to General and Generalized Linear Models Generalized Linear Models - part II Henrik

Iterative Techniques in Matrix Algebra Relaxation Techniques for Solving Linear Systems Numerical

The vanishing gradient problem revisited: Highway and residual connections CS 6956: Deep

Augmenting Paths Math 482, Lecture 25 Misha Lavrov April 3, 2020 The greedy algorithm