Predicting Risk from Financial Reports with Regression


slide-1
SLIDE 1

Predicting Risk from Financial Reports with Regression

Shimon Kogan, University of Texas at Austin Dimitry Levin, Carnegie Mellon University Bryan R. Routledge, Carnegie Mellon University Jacob S. Sagi, Vanderbilt University Noah A. Smith, Carnegie Mellon University

slide-2
SLIDE 2

Talk In A Nutshell

financial risk = f(financial report)

  • financial risk: volatility of returns
  • financial report: Form 10-K, Item 7
  • f: SV regression

slide-3
SLIDE 3

What This Talk Isn’t and Is

  • New statistical models for NLP ...
  • Exciting text domains like political blogs ...
  • Advances in applications like translation and summarization ...

slide-4
SLIDE 4

What This Talk Isn’t and Is

  • New statistical models for NLP ...
  • Exciting text domains like political blogs ...
  • Advances in applications like translation and summarization ...

Shay Cohen, 10:40 am yesterday; Tae Yano, 10:40 am tomorrow; Ashish Venugopal, right now; André Martins, 11 am Thursday

slide-6
SLIDE 6

What This Talk Isn’t and Is

New statistical models for NLP ...

Bag of terms representation and SVR model.

Exciting text domains like political blogs ...

Boring (to read) text domain of financial reports.

Advances in applications like translation and summarization ...

Under-explored application: forecasting.

slide-7
SLIDE 7

See Also ...

  • Lavrenko et al. (2000), Koppel and Shtrimberg (2004), and others: prices

  • Blei and McAuliffe (2007): popularity
  • Lerman et al. (2008): prediction markets
slide-8
SLIDE 8

Outline

  • Mini-lesson in finance
  • A new text-driven forecasting task
  • Regression models trained on text
  • Experimental results and analysis
  • Outlook
slide-9
SLIDE 9

Finance

Allocation of wealth (e.g., money) across time and risk (states of nature).

slide-10
SLIDE 10

Finance

From an NLP perspective: crucial information about your investments that’s buried in documents you’d rather not read.

slide-11
SLIDE 11

financial risk = f(financial report)

slide-12
SLIDE 12

financial risk = f(financial report)

  • financial risk: volatility of returns
slide-13
SLIDE 13
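What is Risk?

  • Return on day t:

\[ r_t = \frac{\text{closing price}_t + \text{dividends}_t}{\text{closing price}_{t-1}} - 1 \]

  • Sample standard deviation from day t - τ to day t:

\[ v_{[t-\tau,\,t]} = \sqrt{\frac{\sum_{i=0}^{\tau} (r_{t-i} - \bar{r})^2}{\tau}} \]

    where \bar{r} is the mean return over the same window.

  • This is called measured volatility.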
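The two definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `daily_returns` and `measured_volatility` and the toy prices are assumptions made here, and the window is simply whatever list of returns is passed in.

```python
import math

def daily_returns(closing_prices, dividends):
    """r_t = (close_t + dividends_t) / close_{t-1} - 1, for t = 1 .. len - 1."""
    return [
        (closing_prices[t] + dividends[t]) / closing_prices[t - 1] - 1.0
        for t in range(1, len(closing_prices))
    ]

def measured_volatility(returns):
    """Sample standard deviation of a window holding tau + 1 returns.

    The summed squared deviations are divided by tau, i.e. the number
    of returns in the window minus one.
    """
    tau = len(returns) - 1
    if tau < 1:
        raise ValueError("need at least two returns")
    mean = sum(returns) / len(returns)
    return math.sqrt(sum((r - mean) ** 2 for r in returns) / tau)

# Toy prices and dividends, made up for illustration only.
prices = [100.0, 102.0, 101.0, 103.0, 102.0]
divs = [0.0, 0.0, 0.5, 0.0, 0.0]
rets = daily_returns(prices, divs)
vol = measured_volatility(rets)
```

In practice the window would cover roughly one year of trading days, matching the one-year horizons used later in the talk.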

slide-14
SLIDE 14

Why Not Predict Returns, Get Rich, Retire Early?

  • Hard: predicting a stock’s performance.
  • To predict returns, we would need to find new information.
  • Our reports probably don’t contain new information (10-Ks do not precede big price changes).

slide-15
SLIDE 15

Will This Talk Make Anyone Rich?

  • Some people think you can exploit accurate volatility predictions.
  • I’m not really qualified to give financial advice.
  • Consulting to portfolio/wealth managers is a huge industry.

slide-16
SLIDE 16

So Then Why Do Finance Researchers Care?

  • Models of economics and finance treat information simplistically.
  • No notion of extracting information from large amounts of raw data.
  • These reports are produced at huge expense. Are they worth it?
slide-17
SLIDE 17

Important Property of Volatility

  • Autoregressive conditional heteroscedasticity: volatility tends to be stable (over horizons like ours).

  • v[t - τ, t] is a strong predictor of v[t, t + τ]
  • This is our strong baseline.
slide-18
SLIDE 18

financial risk = f(financial report)

  • financial risk: volatility of returns
  • financial report: Form 10-K, Item 7

slide-19
SLIDE 19

Form 10-K, Item 7

General Motors Corp. March 5, 2009

Item 7. Management’s Discussion and Analysis of Financial Condition and Results of Operations

Overview

We are primarily engaged in the worldwide production and marketing of cars and trucks. We operate in two businesses, consisting of our automotive operations, which we also refer to as Automotive, GM Automotive or GMA, that includes our four automotive segments consisting of GMNA, GME, GMLAAM and GMAP, and our financing and insurance operations (FIO). Our finance and insurance operations are primarily conducted through GMAC, a wholly-owned subsidiary through November 2006. On November 30, 2006, we sold a 51% controlling ownership interest in GMAC to a consortium of investors. After the sale, we have accounted for our 49% ownership interest in GMAC under the equity method. GMAC provides a broad range of financial services, including consumer vehicle financing, automotive dealership and other commercial financing, residential mortgage services, automobile service contracts, personal automobile insurance coverage and selected commercial insurance coverage.

Automotive Industry

In 2008, the global automotive industry has been severely affected by the deepening global credit crisis, volatile oil prices and the recession in North America and Western Europe, decreases in the employment rate and lack of consumer confidence. The industry continued to show growth in Eastern Europe, the LAAM region and in Asia Pacific, although the growth in these areas moderated from previous levels and is beginning to show the effects of the credit market crisis which began in the United States and has since spread to Western Europe and the rest of the world. Global industry vehicle sales to retail and fleet customers were 67.1 million units in 2008, representing a 5.1% decrease compared to 2007. We expect industry sales to be approximately 57.5 million units in 2009.

slide-20
SLIDE 20

Our Corpus

  • Edgar database at http://www.sec.gov
  • 26,806 examples of Item 7, 1996-2006
  • 247.7 million words in total
  • http://www.ark.cs.cmu.edu/10K
slide-21
SLIDE 21

“Annotation”

  • For each report at time t, we gathered:
      • “Historical” volatility: v[t - 1y, t]
      • “Future” volatility: v[t, t + 1y]
  • Source: Center for Research in Security Prices U.S. Stocks Databases

slide-22
SLIDE 22

Methodology

  • Input: Item 7 and/or historical volatility
  • Output: predicted future volatility
  • Test on (input, output) pairs from year Y
  • Train on (input, output) from years < Y
  • Evaluation: MSE of (log) volatility
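As a rough sketch of this protocol, the snippet below splits records by year and scores a history-only predictor with the MSE of log volatility. The record layout, the toy numbers, and the helper names `split_by_year` and `mse_log_volatility` are assumptions for illustration; the baseline shown simply reuses historical volatility as the prediction.

```python
import math

# Hypothetical records: (filing year, historical volatility, future volatility).
records = [
    (2000, 0.020, 0.025),
    (2001, 0.030, 0.028),
    (2002, 0.040, 0.035),
    (2002, 0.010, 0.012),
]

def split_by_year(records, test_year):
    """Train on years strictly before test_year, test on test_year itself."""
    train = [r for r in records if r[0] < test_year]
    test = [r for r in records if r[0] == test_year]
    return train, test

def mse_log_volatility(predicted, actual):
    """Mean squared error between log predictions and log true values."""
    errors = [(math.log(p) - math.log(a)) ** 2 for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)

train, test = split_by_year(records, 2002)

# History-only predictor: future volatility is assumed equal to historical volatility.
baseline_predictions = [hist for _, hist, _ in test]
actual = [future for _, _, future in test]
baseline_mse = mse_log_volatility(baseline_predictions, actual)
```

Any learned model can be scored the same way by swapping in its predictions for `baseline_predictions`.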
slide-23
SLIDE 23

financial risk = f(financial report)

  • financial risk: volatility of returns
  • financial report: Form 10-K, Item 7
  • f: SV regression

slide-24
SLIDE 24

Support-Vector Regression

(Drucker et al., 1997)

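  • Predicted future volatility is a function of a document (Item 7), d, and a weight vector w:

\[ \hat{v} = f(d; w) \]

  • The training criterion:

\[ \min_{w \in \mathbb{R}^d} \; \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{i=1}^{N} \max\bigl(0,\; |v_i - f(d_i; w)| - \epsilon\bigr) \]

    The first term regularizes w. Each term in the sum is zero when the prediction is within ε of the correct value.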
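To make the training criterion concrete, here is a small sketch that evaluates it for a given weight vector. It assumes a linear model in which f(d; w) is the dot product of a feature vector h(d) with w; the helper names and the toy numbers are illustrative only, and no optimizer is included.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def epsilon_insensitive_loss(target, prediction, epsilon):
    """Zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(target - prediction) - epsilon)

def svr_objective(w, features, targets, C, epsilon):
    """0.5 * ||w||^2 + (C / N) * sum of epsilon-insensitive losses."""
    n = len(targets)
    regularizer = 0.5 * dot(w, w)
    loss = sum(
        epsilon_insensitive_loss(v, dot(h, w), epsilon)
        for h, v in zip(features, targets)
    )
    return regularizer + (C / n) * loss

# Toy example: the first prediction falls inside the tube, the second does not.
w = [1.0, 0.0]
features = [[1.0, 2.0], [3.0, 1.0]]
targets = [1.05, 2.0]
objective = svr_objective(w, features, targets, C=2.0, epsilon=0.1)
```

Here the first document contributes no loss because its error (0.05) is below ε, which is the behaviour the slide labels "prediction within ε of correct".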

slide-25
SLIDE 25

Representation

  • Vector-space model (tf, tfidf, etc.)
  • So far, unigrams and bigrams
  • Linear kernel (for interpretability)

\[ f(d; w) = h(d)^\top w = \sum_{i=1}^{N} \alpha_i K(d, d_i) = \sum_{i=1}^{N} \alpha_i \, h(d)^\top h(d_i) \]

\[ w = \sum_{i=1}^{N} \alpha_i \, h(d_i) \]

slide-26
SLIDE 26

Representation

  • Vector-space model (tf, tfidf, etc.)
  • So far, unigrams and bigrams
  • Linear kernel (for interpretability)

\[ w = \sum_{i=1}^{N} \alpha_i \, h(d_i) \]

dual:

\[ f(d; w) = h(d)^\top w = \sum_{i=1}^{N} \alpha_i K(d, d_i) = \sum_{i=1}^{N} \alpha_i \, h(d)^\top h(d_i) \]
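With a linear kernel, the dual form and the explicit weight vector give the same prediction, which is what makes the weights inspectable. The check below is a toy sketch: the dual coefficients and feature vectors are made-up numbers, not values from a trained model.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def primal_weights(alphas, train_vectors):
    """w = sum_i alpha_i * h(d_i)."""
    dim = len(train_vectors[0])
    return [
        sum(a * h[j] for a, h in zip(alphas, train_vectors))
        for j in range(dim)
    ]

def predict_dual(h_d, alphas, train_vectors):
    """f(d) = sum_i alpha_i * K(d, d_i), with the linear kernel K = dot product."""
    return sum(a * dot(h_d, h_i) for a, h_i in zip(alphas, train_vectors))

# Hypothetical dual coefficients and training feature vectors.
alphas = [0.5, -0.25, 1.0]
train_vectors = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]
h_d = [1.0, 2.0, 3.0]

w = primal_weights(alphas, train_vectors)
primal_prediction = dot(h_d, w)
dual_prediction = predict_dual(h_d, alphas, train_vectors)
```

Because each entry of `w` lines up with one term (unigram or bigram), the largest positive and negative entries can be read off directly.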

slide-27
SLIDE 27

Experiment

  • Test on year Y.
  • Train on (Y - 5, Y - 4, Y - 3, Y - 2, Y - 1).
  • Six such splits.
  • Compare history-only baseline, text-only SVR, combined SVR.
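The split scheme can be enumerated mechanically. The sketch below assumes the corpus years 1996 to 2006 stated earlier in the talk; the function name `sliding_splits` is made up for this example.

```python
def sliding_splits(first_year, last_year, window=5):
    """Pair each test year Y with the training years Y - window .. Y - 1."""
    splits = []
    for test_year in range(first_year + window, last_year + 1):
        train_years = list(range(test_year - window, test_year))
        splits.append((train_years, test_year))
    return splits

splits = sliding_splits(1996, 2006)
for train_years, test_year in splits:
    print(f"train {train_years[0]}-{train_years[-1]}, test {test_year}")
```

With a five-year window over 1996 to 2006, this yields the six test years 2001 through 2006.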

slide-28
SLIDE 28

MSE of Log-Volatility

[Figure: MSE of log-volatility (axis ticks 0.120 to 0.210) for test years 2001-2006 and the micro-average, comparing History, Text, and Text + History.]

Using “log(1+freq.)” representation on all unigrams and bigrams. See paper.

lower is better
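One plausible reading of the “log(1+freq.)” representation over unigrams and bigrams is sketched below. Treating "freq." as a raw count, splitting on whitespace, and the function name `log_frequency_features` are all assumptions here; the paper should be consulted for the exact definition.

```python
import math
from collections import Counter

def log_frequency_features(tokens):
    """Map each unigram and bigram to log(1 + count) within one document."""
    unigrams = list(tokens)
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    counts = Counter(unigrams + bigrams)
    return {term: math.log(1 + count) for term, count in counts.items()}

# Toy document, tokenized by whitespace for illustration.
tokens = "net loss and going concern and net loss".split()
features = log_frequency_features(tokens)
```

The logarithm damps the effect of very frequent terms, so a term that appears 100 times does not get 100 times the feature value of a term that appears once.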

slide-29
SLIDE 29

Dominant Weights (2000-4)

high volatility words          low volatility words
loss              0.025        net income            -0.021
net loss          0.017        rate                  -0.017
year #            0.016        properties            -0.014
expenses          0.015        dividends             -0.013
going concern     0.014        lower interest        -0.012
a going           0.013        critical accounting   -0.012
administrative    0.013        insurance             -0.011
personnel         0.013        distributions         -0.011

slide-31
SLIDE 31

Changes Over Time

[Figure: average length of Item 7 by year, ‘96 through ‘06 (axis ticks 3,250 to 13,000).]

slide-32
SLIDE 32

2002

  • Enron and other accounting scandals
  • Sarbanes-Oxley Act of 2002
  • Longer reports
  • Are the reports more informative after 2002? Because of Sarbanes-Oxley?

slide-33
SLIDE 33

Changes In w

[Figure: change from previous weights (axis ticks 50 to 62) for training windows ‘97-’01 through ‘01-’05.]

Measured in L1 distance; based on unigram model with “log(1 + freq.)” representation.
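The L1 distance between two weight vectors can be computed as below. This sketch stores weights as term-to-weight dictionaries and treats a term missing from one model as having weight zero; both choices, and the toy weights, are assumptions for illustration.

```python
def l1_distance(w_old, w_new):
    """Sum of absolute weight differences over the union of terms."""
    terms = set(w_old) | set(w_new)
    return sum(abs(w_new.get(t, 0.0) - w_old.get(t, 0.0)) for t in terms)

# Hypothetical weights from two consecutive training windows.
w_previous = {"loss": 0.5, "rate": -0.25}
w_current = {"loss": 0.25, "margin": 1.0}
change = l1_distance(w_previous, w_current)
```

A larger value means the model trained on the later window relies on noticeably different terms, or weighs the same terms differently.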

slide-34
SLIDE 34

Language Over Time

[Figure: weight w and ave. term frequency over training windows 96-00 through 01-05, for the terms “accounting policies” and “estimates”.]
slide-35
SLIDE 35

Language Over Time

[Figure: weight w and ave. term frequency over training windows 96-00 through 01-05, for the terms “reit” (“Real Estate Investment Trust”) and “mortgages”.]

slide-36
SLIDE 36

Language Over Time

[Figure: weight w and ave. term frequency over training windows 96-00 through 01-05, for the terms “higher margin” and “lower margin”.]
slide-37
SLIDE 37

Delisting

  • Rare (4%) event: delisting due to dissolution after bankruptcy, merger, violation of rules.
  • bulletin, creditors, dip, otc, court

[Figure: precision at 10 and precision at 100 (axis ticks 25 to 100) for years 01 through 06.]
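Precision at k, the metric named in the chart, can be sketched as follows. It assumes each firm gets some model score where higher means "more likely to be delisted"; how those scores were produced is not specified on the slide, and the toy values are made up.

```python
def precision_at_k(scores, delisted, k):
    """Fraction of the k highest-scoring firms that were actually delisted."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]
    return sum(1 for i in top if delisted[i]) / k

# Hypothetical scores and outcomes for five firms.
scores = [0.9, 0.1, 0.8, 0.4, 0.7]
delisted = [True, False, False, True, True]
p_at_3 = precision_at_k(scores, delisted, 3)
```

For a rare event like delisting, precision among the top-ranked firms is more informative than overall accuracy, since predicting "not delisted" for everyone would already be right about 96% of the time.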

slide-38
SLIDE 38

Conclusions

  • Text-driven forecasting of volatility, by regression.
  • Works nearly as well as strong history predictor.
  • Often works better in combination.
  • Suggestion of effects of legislation on a real-world text-generating process.

slide-39
SLIDE 39

Future Work

  • Measuring the effect of Sarbanes-Oxley
  • Other predictions
  • Other text representations
  • Other datasets
slide-40
SLIDE 40

Future Work (Text-Driven Forecasting)

  • Application for NLP: techniques that use text to make real-world predictions.
  • Many potential domains (finance, politics, government, sales, ...)

  • There’s lots of room for improvement!