CS7646 - Machine Learning for Trading

ML4T Mini Course 1 : Manipulating Financial Data in Python

01-04 Statistical analysis of time series

  • Bollinger bands: Add bands 2 std above and below the rolling mean. Watch for excursions 2 stds away from the mean: when the price falls back inside from above the upper band, sell; when it comes back inside from below the lower band, buy.
  • Daily returns: Important for financial analysis. How much the price went up/down on a given day.
    • Formula: daily_return[t] = (price[t] / price[t-1]) - 1
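
A minimal pandas sketch of this formula (assuming df is a DataFrame of prices indexed by date):

daily_returns = df / df.shift(1) - 1  # price[t] / price[t-1] - 1 for every day at once
daily_returns.iloc[0] = 0             # the first day has no prior day, so set it to 0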

01-06 Histograms and scatter plots

Histograms:

  • Histograms: distribution of numerical data
  • Gaussian distribution: normal distribution, bell-curve
  • Kurtosis: Describes the tails of the distribution: how tail occurrences compare to a Gaussian distribution.
    • "Fat tails": Positive kurtosis. More occurrences in the tails than a Gaussian distribution.
    • "Skinny tails": Negative kurtosis. Fewer occurrences in the tails.

Comparing Histograms (in trading domain)

  • Flatter bell curve = lower return, higher volatility

Adding lines to plots

plt.axvline(value, color='w', linestyle='dashed', linewidth=2)

Scatter plots

  • Scatter plots let you fit a line via linear regression; the fitted line has a slope and an intercept.
  • Beta: steepness of scatter plot slope. Shows how reactive the stock is to the market. Steeper line = higher beta.
    • Value of 1 = When market goes up 1%, stock also goes up 1%
    • Value of 2 = When market goes up 1%, stock goes up 2%
  • Alpha: Where the line intercepts the vertical axis. Positive alpha means the stock tends to do better than what its exposure to the market (e.g. the S&P 500) alone would predict.
  • Slope != correlation. Correlation measures how tightly the points fit the line, and ranges from -1 to 1.

Coding Scatter Plots

# Compute daily returns
daily_returns = compute_daily_returns(df)

# Scatter plot of XOM versus SPY
daily_returns.plot(kind='scatter', x='SPY', y='XOM')

# Fit a line; for a polynomial of degree 1 it is y = mx + b
# np.polyfit returns the coefficients highest degree first: slope (m), then intercept (b)
# 'm' is beta (slope), 'b' is alpha (intercept)
beta_XOM, alpha_XOM = np.polyfit(daily_returns['SPY'], daily_returns['XOM'], 1)

# Plot the line
# For every value of x (SPY in this case), find value of y using mx+b 
plt.plot(
  daily_returns['SPY'], 
  beta_XOM * daily_returns['SPY'] + alpha_XOM, 
  '-', 
  color = 'r'
)
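
A quick way to check how tightly the points fit the line (as opposed to its slope) is pandas' built-in correlation on the same daily_returns frame:

daily_returns.corr(method='pearson')  # correlation matrix; e.g. the SPY/XOM entry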

Why kurtosis matters: the 2008 crash. Investment banks built bonds based on mortgages, assumed the distribution of returns for these mortgages was Gaussian, and concluded the bonds had a low probability of default. Two mistakes:

  1. Assumed the return of each mortgage was independent of the others.
  2. Assumed returns would be normally distributed.

01-07 Sharpe ratio and other portfolio statistics

  1. Start with prices. Columns are symbols, rows are prices at each date.
  2. Normalize so the first day is 1.0 (normed).
  3. Multiply normed by the allocations; gives the relative value of each symbol over time.
  4. Multiply by the starting investment (start_val), which gives the dollar value of each position over time.
  5. Sum each row (across symbols) to get the total value of the portfolio for each date; see the sketch below.
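
A short sketch of these five steps (assuming prices is a DataFrame of adjusted closes, allocs a list of allocations summing to 1.0, and start_val the starting investment):

normed = prices / prices.iloc[0]    # step 2: every symbol starts at 1.0
alloced = normed * allocs           # step 3: relative value of each symbol
pos_vals = alloced * start_val      # step 4: dollar value of each position
port_val = pos_vals.sum(axis=1)     # step 5: total portfolio value per date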

Key stats of portfolios (using daily returns)

  1. Cumulative return: How much the value of portfolio increased from the beginning to the end.
  2. Avg daily return: Average of daily returns
  3. Std daily return: Standard deviation of daily returns
  4. Sharpe ratio: From wiki, "measures the performance of an investment compared to a risk-free asset, after adjusting for its risk".

Sharpe ratio

Definition: Metric that adjusts return for risk.

  • Lower risk is better
  • Higher return is better
  • Consider the risk-free rate of return, i.e. what you would earn just putting the money in a bank.

Formula

"Ex ante" formulation (looking forward):

$$\text{Sharpe Ratio}=\frac{E[R_{p}-R_{f}]}{std[R_{p}]}$$
  • E: expected value
  • Rp: portfolio return
  • Rf: risk-free return
  • Denominator: std dev of portfolio return (volatility)

We divide by volatility, so the higher the volatility, the lower the ratio. Higher portfolio return pushes the ratio up; a higher risk-free rate pushes it down.

Code:

sharpe_ratio = np.mean(daily_rets - daily_rf) / np.std(daily_rets)

What is risk free rate (daily_rf)?

  1. London Inter-bank Offered Rate (LIBOR)
  2. 3mo treasury bill
  3. 0% (in current economy) - traditional shortcut

Deriving a risk-free rate per sample from an annual rate:

# annual_rf_rate: annual risk-free rate (e.g. 0.05 for 5%)
# samples_per_year: sampling frequency; 252 for daily
daily_rf = (1 + annual_rf_rate) ** (1 / samples_per_year) - 1

Adjustment factor

  • The Sharpe ratio (SR) can vary widely depending on the sampling rate (annual, weekly, daily, etc).
    • SR was originally an annual measure, so you need to adjust it when sampling at a higher frequency.
    • The adjustment factor is called "K".
    • Formula: SR_annualized = K * SR, where K = sqrt(samples_per_year)
  • Daily, weekly, monthly adjustment factors:
    • If sampling at a daily rate: daily_k = np.sqrt(252)
    • If sampling at a weekly rate: weekly_k = np.sqrt(52)
    • If sampling at a monthly rate: monthly_k = np.sqrt(12)

Adding this into the final formula:

daily_k = np.sqrt(252)
sharpe_ratio_daily = daily_k * np.mean(daily_rets - daily_rf) / np.std(daily_rets)

Quiz: What is Sharpe ratio

Given:

  • 60 days of data
  • average daily return = 10 bps ("bips", basis points) = 0.001
  • daily risk free rate = 2 bps = 0.0002
  • std dev of daily returns = 10 bps = 0.001

Answer: ≈ 12.7 (note: the 60 days of data doesn't enter the calculation; K depends on the sampling frequency, 252 for daily)

sharpe_ratio = np.sqrt(252) * (0.001 - 0.0002) / 0.001  # ≈ 12.7

01-08 Optimizers: Building a parameterized model

What's an optimizer?

Optimizers: algorithms that can:

  • Find minimum values of functions
  • Build parameterized models based on data
  • Refine allocations to stocks in portfolios

How to use:

  1. Define a function to minimize (e.g. f(x) = x**2 + 0.5)
    • The minimizer calls this function many times, searching for the x that yields the minimum value.
  2. Provide an initial guess
    • Can be a random value; the optimizer repeatedly calls the function and narrows in on the solution.
  3. Call the optimizer

Minimizer in Python

import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as spo


def f(X):
    """Given a scalar X, return some value (a real number)."""
    Y = (X - 1.5) ** 2 + 0.5
    print("X = {}, Y = {}".format(X, Y))  # for tracing
    return Y


def test_run():
    Xguess = 2.0
    min_result = spo.minimize(f, Xguess, method='SLSQP', options={'disp': True})
    print("Minima found at:")
    print("X = {}, Y = {}".format(min_result.x, min_result.fun))

    # Plot function values, mark minima
    Xplot = np.linspace(0.5, 2.5, 21)
    Yplot = f(Xplot)
    plt.plot(Xplot, Yplot)
    plt.plot(min_result.x, min_result.fun, 'ro')
    plt.title("Minima of an objective function")
    plt.show()

How to defeat a minimizer

Functions with the following features (shown in the lecture's examples) are hard for a minimizer to solve:

  • Flat regions: hard for the optimizer to perform gradient descent.
  • Multiple local minima: the optimizer can converge to the wrong one.
  • Discontinuities: the gradient is undefined at the jump.

Convex problems: Easiest class of problems for minimizer to solve

Convex function: A function is convex if, for any two points on its graph, the line segment connecting them lies above or on the graph.

Building a parameterized model

Given f(x) = C0 * x + C1, task is to find the slope (C0) and intercept (C1) that best fits the data.

  • The minimizer will vary the two coefficients to minimize something. What equation should it minimize, i.e. what is a good error metric?

Answer:

Key is to remove negative values so that errors above and below the line don't cancel out:

  1. Take absolute value of distance of each data point from the line
  2. Take the square of distance of each data point from the line

Fitting a line given data points (code)

"""Minimize an objective function using SciPy."""

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.optimize as spo


def error(line, data):  # error function
    """Compute error between given line model and observed data.

    Parameters
    ----------
    line: tuple/list/array (C0, C1) where C0 is slope and C1 is Y-intercept
    data: 2D array where each row is a point (x, y)

    Returns error as a single real value.
    """
    # Metric: Sum of squared Y-axis differences
    err = np.sum((data[:, 1] - (line[0] * data[:, 0] + line[1])) ** 2)
    return err


def fit_line(data, error_func):
    """Fit a line to given data, using a supplied error function.

    Parameters
    ----------
    data: 2D array where each row is a point (X0, Y)
    error_func: function that computes the error between a line and observed data

    Returns line that minimizes the error function.
    """
    # Generate initial guess for line model
    l = np.float32([0, np.mean(data[:, 1])])  # slope = 0, intercept = mean(y values)

    # Plot initial guess (optional)
    x_ends = np.float32([-5, 5])
    plt.plot(x_ends, l[0] * x_ends + l[1], "m--", linewidth=2.0, label="Initial guess")

    # Call optimizer to minimize error function
    result = spo.minimize(error_func, l, args=(data,), method="SLSQP", options={"disp": True})
    return result.x


def test_run():
    # Define original line
    l_orig = np.float32([4, 2])
    print("Original line: C0 = {}, C1 = {}".format(l_orig[0], l_orig[1]))
    Xorig = np.linspace(0, 10, 21)
    Yorig = l_orig[0] * Xorig + l_orig[1]
    plt.plot(Xorig, Yorig, "b--", linewidth=2.0, label="Original line")

    # Generate noisy data points
    noise_sigma = 3.0
    noise = np.random.normal(0, noise_sigma, Yorig.shape)
    data = np.asarray([Xorig, Yorig + noise]).T
    plt.plot(data[:, 0], data[:, 1], "go", label="Data points")

    # Try to fit a line to this data
    l_fit = fit_line(data, error)
    print("Fitted line: C0 = {}, C1 = {}".format(l_fit[0], l_fit[1]))
    plt.plot(
        data[:, 0], l_fit[0] * data[:, 0] + l_fit[1], "r--", linewidth=2.0, label="Fitted Line"
    )

    # Add a legend and show plot
    plt.legend(loc="upper right")
    plt.show()


if __name__ == "__main__":
    test_run()


01-09 Optimizers: How to optimize a portfolio

How to optimize: Framing the problem

  1. Provide a function to minimize, f(X), where X is an array of allocations. Since optimizers minimize, return the negative of the metric to maximize (e.g. -Sharpe ratio).
  2. Provide an initial guess for X
  3. Call the optimizer
    • Must set ranges and constraints.
    • Ranges: limits on the values each element of X can take (e.g. between 0 and 1). Ranges shrink the search space significantly, making optimization easier.
    • Constraints: properties of X that must hold, e.g. the allocations must sum to 1.0. See the sketch below.
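
A runnable sketch of this framing (not the course's exact code): it maximizes a daily Sharpe ratio (risk-free rate assumed 0) over random stand-in returns, minimizing the negative Sharpe since optimizers minimize.

import numpy as np
import scipy.optimize as spo

def neg_sharpe(allocs, daily_rets):
    # daily_rets: 2D array, one column of daily returns per symbol (assumed input)
    port_rets = daily_rets @ allocs
    return -np.mean(port_rets) / np.std(port_rets)  # negative Sharpe, rf assumed 0

n = 4
daily_rets = np.random.normal(0.0005, 0.01, (252, n))       # stand-in data for the sketch
guess = np.ones(n) / n                                      # equal allocations to start
bounds = [(0.0, 1.0)] * n                                   # range: each allocation in [0, 1]
cons = ({'type': 'eq', 'fun': lambda x: 1.0 - np.sum(x)},)  # constraint: allocations sum to 1.0
result = spo.minimize(neg_sharpe, guess, args=(daily_rets,),
                      method='SLSQP', bounds=bounds, constraints=cons)
print(result.x)  # allocations that maximize the (daily) Sharpe ratio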

ML4T Mini Course 2.1 : Computational Investing

02-01 So you want to be a hedge fund manager?

Type of Funds

Three broad classes

  • ETF
    • Buy/sell like stocks
    • Baskets of stocks
    • Transparent, liquid
    • Tied to an index (S&P 500, etc)
  • Mutual fund
    • Buy/sell like stocks
    • Quarterly disclosure
    • Less transparent
  • Hedge fund
    • Buy/sell by agreement
    • No disclosure
    • Not transparent (Never disclose what they're holding)

Terms:

  1. Liquid:
    • Trades just like a stock (buy/sell, down to single shares). Easy to trade, with lots of volume traded.
  2. Capitalization ("Cap"):
    • How much a company is worth according to the number of shares outstanding multiplied by the price of the stock.
    • market cap = number of shares outstanding * price

Incentives for fund managers

Assets Under Management (AUM): How much money is being managed by the fund.

How they are compensated:

  • ETFs: Compensated by expense ratio (0.01% - 1.00%)
  • Mutual Funds: Compensated by expense ratio (0.5% - 3.00%)
  • Hedge Funds: Compensated by "two and twenty" rule, ie. 2% of AUM + 20% of profits
    • The 2% is assessed on AUM, typically some measure between initial AUM and final AUM (after profits/losses)

Expense ratios incentivize AUM accumulation, while "two and twenty" incentivizes profits and risk taking.

How hedge funds attract investors

Types of investors:

  1. Individuals
    • Hedge funds typically limited to ~100 investors. Incentive to pull in investors with large funds.
  2. Institutions
    • Retirement funds, university foundations, etc.
  3. Funds of funds
    • Group individuals or institutions.

Why invest in a fund:

  1. Track record: Profitable for 5 years
  2. Simulation + story: Make a case that your strategy is stronger than others.
  3. Good portfolio fit: If investor is looking for small cap growth stocks

Hedge fund goals and metrics

Goals:

  1. Beat a benchmark (in both profit and loss)
    • Benchmark you choose depends on your expertise.
  2. Absolute return: Provide positive returns, no matter what.
    • Make long/shorts (positive bets + negative bets)
    • Not as profitable as "beat a benchmark", but more resilient to market downturns.

Metrics:

  1. Cumulative return: How much money is made over time.
    • (val[-1] / val[0]) - 1
  2. Volatility: How rapidly prices go up/down
    • daily_rets.std()
  3. Risk / Reward
    • Sharpe Ratio, sqrt(trading_days) * mean(daily_rets - risk_free_rate) / daily_rets.std()

Computing inside a hedge fund

  1. Compares target portfolio to live portfolio
  2. To move live portfolio towards target portfolio, orders are placed to the market.
    • Makes orders incrementally so as not to increase/reduce price.

How target portfolios are made


  1. N-day forecast predicts (with ML) what stock prices are going to be at some point in the future. Informs portfolio optimizer which balances the target portfolio.
  2. Other considerations:
    • historical price data
    • current portfolio

How N-day forecast is made

  1. Forecasting algorithm (ML model): Decides what future prices look like.
  2. Model fed by information feed (proprietary info) + historical data.

02-02 Market Mechanics

What is in an order?

When you make a stock order, the order is sent to a stockbroker, which then executes the order.

How orders enter the market and come back

Components of an order:

  1. Buy or sell
  2. Stock symbol
  3. Number of shares
  4. Limit or market
    • Market: Accept the price of the market
    • Limit: Threshold set to buy or sell.
  5. Price (if Limit): The threshold price to buy or sell.

Examples:

  • BUY,IBM,100,LIMIT,99.95
  • SELL,GOOG,150,MARKET

The order book

Each exchange keeps an order book for every stock that is bought or sold.

  • ASK: Request to sell
  • BID: Request to buy

  • More sell orders push the stock price down; more buy orders push it up.

How orders affect the order book

  • Executed prices may change, depending on available orders at each price point.

How orders get to the exchange

  1. Broker receives your online order
  2. Broker checks each stock exchange for that order and executes at the exchange with the best price.
  3. Order enters the exchange, gets executed and price comes back to broker, then to user.
    • Fees are incurred as orders enter exchanges.
  4. If the broker has both a buyer and a seller, the broker can execute the order internally without going to the exchange. However, the price must match the price in the exchange.
    • No exchange fees are incurred.
  5. Dark pool: Intermediary between brokers and exchanges.
    • Pays brokers to see orders before they enter the market.
    • If it sees an advantageous trade, the dark pool may take it, and the order never enters the exchange.
    • Can span multiple brokers.
    • No exchange fees are incurred.
  • Most (~80%) of orders never make it to the exchange.


How hedge funds exploit market mechanics

Colocated servers: Hedge funds have servers located on the premises of the stock exchange, giving them faster access to the live order book.

  • A colocated server may see orders in ~0.3 microseconds, whereas a normal user may see them after ~12 ms.

Order book exploit

  1. HF observes order book.
  2. HF buys stock.
  3. User clicks "buy"
  4. Meanwhile price goes up
  5. HF sells you the stock.
    • Only held stock for few milliseconds, but earns profit from it.


Geographic arbitrage exploit

  1. Fact: Different stock exchanges may show slightly different prices for the same stock.
  2. A hedge fund's colocated servers at both exchanges are connected by a high-speed link and observe these discrepancies.
  3. When a difference occurs, the HF buys the stock on the lower-priced exchange and sells it on the higher-priced one, making a profit.
    • These trades happen very fast, and each typically earns only a fraction of a cent.

Additional order types

All of these order types are facilitated by the broker:

  • "Stop loss": When stock drops to certain price, sell.
  • "Stop gain": When stock increases to certain price, sell.
  • "Trailing stop": Combo of stop loss and statistically determined price value.
  • "Selling short": Take a negative position on a stock.

Short selling

Brokers take care of shorting.


  • If IBM is selling at $100 and you short 100 shares, then exit at $90, you make a profit of $1000 ($10 difference * 100 shares).

What can go wrong:

  • If the price goes up.
    • If IBM is selling at $100 and you short 100 shares, then exit at $110, you lose $1000 (-$10 difference * 100 shares)
  • If no one wants to buy your shorted shares

02-03 What is a company worth?

Q: What is a company worth that makes $1/year?

A: Between $10-25, depending on interest rates. The value of a future $1 decreases over time, and the sum of the individual $1's generated every year converges to a finite value.

Rank of best value of $1

  1. $1 right now: Immediate spending power.
  2. $1 government bond over year: More reliable than one person.
  3. A random person's promise of a $1 in a year: Most unreliable.

Any promise of a fixed amount paid in the future is worth less than that amount today, and the further out the payment, the less it is worth.

How to calculate company value (ATTENTION):

  • Intrinsic value: Based on the value of the company as estimated by its future dividends: "How much in dividends am I going to get in the future?"
  • Book value: Assets that the company owns.
  • Market cap: Value of stock on the market.

Many market strategies look for deviations between these three values:

  • If intrinsic value (ie dividends) is decreasing but market cap is high, trader may short the stock, and vice versa.
  • If stock price approaches book value, it's most likely to bottom out there.

Value of future dollar - Interest Rates

present_value = future_value / ((1 + interest_rate) ** i)

  • The divisor (1 + interest_rate)^i grows as i (time) becomes larger, so the present value shrinks.

Some calculations:

  • IR of 1% on a dollar: $1 / (1 + 0.01) ≈ $0.99
    • Pay 99 cents now to get $1 back in a year
  • IR of 5% on a dollar: $1 / (1 + 0.05) ≈ $0.95
    • Pay 95 cents now to get $1 back in a year

Company Value 1: Intrinsic Value

Interest Rates and Dividends

  • Interest rates reflect how risky a company is. The riskier you think the company is, the higher the interest rate needs to be, i.e. you need to expect the company to pay you more in the future to compensate for the chance it won't pay back your investment.
  • Discount Rate: Higher if you trust the company less, lower if the company is reliable. This is the rate of return you expect to be paid per time period (usually a year), typically in the form of dividends.
  • To value a company by intrinsic value, compute the present value of all the future dividends it is going to pay you.

How to calculate intrinsic value:

  • FV: Future value, dividend paid at a regular interval
  • DR: Discount rate, amount of interest rate you expect to be paid back (e.g. 5%).

Intrinsic Value Equation: Intrinsic Value = Future Value / Discount Rate
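
This follows from discounting each future dividend and summing the infinite geometric series:

$$\text{IV}=\sum_{i=1}^{\infty}\frac{FV}{(1+DR)^{i}}=\frac{FV}{DR}$$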


Company Value 2: Book Value

Book Value: "Total assets (not including intangible assets) minus liabilities"

Company Value 3: Market Capitalization

Market Cap Equation: Market Cap = Number of Shares * Price

Why information affects stock price

  • How it affects intrinsic value: Company-specific news; potential for reduced future dividends.
  • How it affects book value: Sector-specific news; affects asset values.
  • How it affects market cap: Market-wide news; affects stock prices broadly.

Quiz: Would you buy this stock?

  • Q: Calc book value - 10 airplanes at $10M each, brand name at $10M, $20M loan liability
    • A: $80M ($100M in tangible assets - $20M liabilities; the intangible brand is excluded)
  • Q: Calc intrinsic value - $1M dividends/year with 5% discount rate
    • A: $20M ($1M / 0.05)
  • Q: Calc market cap - 1M shares outstanding, $75 stock price
    • A: $75M

Q: Would you buy this company for market cap ($75M)?

A: Yes, because its book value is currently $80M which is more than $75M. The company is undervalued.

  • IRL, market cap rarely drops below book value, because a company trading below book value is vulnerable to a predatory takeover: buy all the stock and sell off the assets for a profit.

02-04 The Capital Assets Pricing Model (CAPM)

Portfolio

Definition: A portfolio is a weighted set of assets.

  • Each asset i has a weight w_i representing its portion of the portfolio
  • The absolute weights of all assets sum to 1.0
  • Portfolio equation: sum(abs(w_i)) = 1.0
    • abs because you can short stocks (short positions have negative weights).

Calculating the return of a portfolio at time t:

t = time
w_i = weight of asset i
r_i(t) = return of asset i at time t

portfolio_return(t) = sum_i(w_i * r_i(t))
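
A tiny numpy check of this formula with made-up numbers:

import numpy as np

W = np.array([0.6, 0.4])           # weights, summing to 1.0
R = np.array([0.02, -0.01])        # each asset's return on day t
portfolio_return = np.sum(W * R)   # 0.6*0.02 + 0.4*(-0.01) = 0.008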

The Market Portfolio

  • "Market": Refers to market index, e.g. S&P500 (US), FTA (UK), TOPIX (JP)
  • "Cap Weighted": Individual weight of each stock in the portfolio is set according to that stock's market cap.
    • Equation: w_i = mktcap_i / sum_j(mktcap_j)
  • "Sectors": Industries like energy / tech / manufacturing / finance, etc.
    • Pos/neg news can affect one sector but not others.

The CAPM Equation

Memorize and understand this for exam

Equation - A particular stock's return at time t is the sum of:

  1. the beta of that stock times the market's return at t, and
  2. the alpha of that stock at t.

r_i(t) = beta_i * r_m(t) + alpha_i(t)

Tenets:

  1. A significant portion of a stock's return is due to the market; the extent to which the market affects the stock is called beta.
    • Each stock has a different beta.
  2. The remaining portion of a stock's return is called alpha, or the residual. CAPM's expected value of alpha is zero.

What is beta and alpha?


Beta and alpha are calculated by plotting a stock's daily returns against the market's (e.g. S&P 500) daily returns and fitting a line.

Beta: Slope of the line for a particular stock's scatterplot fit to a line. Higher slope equals higher beta.

  • Higher beta means the stock is more swayed by market moves (good in an upswing). Lower beta means it is less swayed by the market (good in a downturn).

Alpha: The y-intercept of the fitted line. CAPM expects this to be zero, but that is not always the case.

CAPM vs Active Management

  • "Passive": Buy index and hold.
  • "Active": Pick individual stocks. Compare to index, managers may put higher weights (overweight) on certain stock and lower weight (underweight) on certain stocks.

Both passive and active managers are affected by beta. Where they differ is how they treat alpha.

  • CAPM says alpha is random and unpredictable, with an expected value of zero.
  • Active managers believe they can predict alpha, i.e. do better than flipping a coin.

CAPM for Portfolios

Equation:

w_i = weight of stock i

portfolio_return(t) = sum_i(w_i * (beta_i * r_m(t) + alpha_i(t)))

CAPM simplifies this because it only cares about beta, assuming alpha is zero:

portfolio_return(t) = beta_p * r_m(t), where beta_p = sum_i(w_i * beta_i)

Implications of CAPM

  1. Expected value of alpha = 0
  2. Only way to beat the market is to choose beta.
    • Choose high beta in upward markets
    • Choose low beta in down markets
    • But, efficient markets hypothesis (EMH) says you can't predict the market.
  3. CAPM's final word: You can't beat the market.

Bonus: Arbitrage Pricing Theory (APT) (not used in exam)

Observed by Stephen Ross, 1976.

Idea: CAPM uses a single beta against one broad market; what about considering multiple sectors using multiple betas?

  • By breaking out the beta by different sectors, you can get a better forecast.

From the 60 Minutes Archive: Rigged (Will be in exam)

Video: "Is the stock market rigged?" (60 Minutes)

  • High frequency traders are forecasting your trades, buying it first and selling it back to you for fractions of a cent more.

02-05 How hedge funds use the CAPM

Two stock scenario


Given a two-stock scenario, what are the returns for each?

# Formula: return = beta * market_return + alpha

# Stock A
beta_a = 1.0
market_return = 0.1
alpha_a = 0.01
position_a = 50
return_a = beta_a * market_return + alpha_a # 0.11
return_a_dollar = return_a * position_a # 5.5

# Stock B
beta_b = 2.0
alpha_b = -0.01
position_b = -50
return_b = beta_b * market_return + alpha_b # 0.19
return_b_dollar = return_b * position_b # -9.5
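# Portfolio total: 5.5 + (-9.5) = -4.0, a net loss despite perfect forecasts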

Takeaway: Even with perfect alpha and beta we can still lose money.

Two stock CAPM math


In code:

weight_a = 0.5
alpha_a = 0.01
weight_b = -0.5
alpha_b = -0.01
beta_a = 1.0
beta_b = 2.0

portfolio_return = (weight_a * beta_a + weight_b * beta_b) * market_return \
    + weight_a * alpha_a + weight_b * alpha_b

# Plugging in the numbers: portfolio_return = -0.5 * market_return + 0.01

Takeaway:

  • We have information about the alphas and weights, but we don't know what the market return will be. Can we eliminate market return altogether?
  • To eliminate market return from the equation, choose weights so that the overall portfolio beta is zero.

How to remove market risk


In code:

# Goal: make the portfolio beta zero
beta_a = 1.0
beta_b = 2.0

# portfolio beta = weight_a * beta_a + weight_b * beta_b = 0

# 1. Solve for weight_a in terms of weight_b
#    weight_a = -(beta_b / beta_a) * weight_b = -2 * weight_b

# 2. The absolute weights must sum to 1
#    abs(weight_a) + abs(weight_b) = 1

# 3. Substitute step 1 into step 2
#    abs(-2 * weight_b) + abs(weight_b) = 1
#    3 * abs(weight_b) = 1
#    abs(weight_b) = 1/3
weight_b = -1/3  # negative because stock B is the short position

# 4. Back-substitute into step 1
weight_a = -2 * weight_b  # = 2/3
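
A quick numpy check that these weights zero out the portfolio's market exposure:

import numpy as np

weights = np.array([2/3, -1/3])
betas = np.array([1.0, 2.0])
print(np.dot(weights, betas))  # 0.0: the portfolio beta is zero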

CAPM for hedge funds summary

Assuming:

  • Forecasts exist for the alpha and beta of each stock.

CAPM enables:

  • Reducing market risk (portfolio beta) to zero
  • This is done by finding appropriate weights for each individual stock.

02-06 Technical vs Fundamental Analysis

Characteristics of technical analysis

What it is:

  • Only looks at price and volume (not fundamentals like earnings, dividends, etc)
  • Use above data to compute statistics (indicators) on a time series.
  • Indicators are heuristics that suggests a buy/sell opportunity.

Why it may work:

  • There is information in price (reflects sentiments of buyer/sellers, momentum, etc)
  • Heuristics work.
Technical          | Fundamental
------------------ | ------------------------
Moving average     | Price-to-earnings ratio
% change in volume | Intrinsic value

When is technical analysis effective?

Rules of thumb:

  • Individual indicators are weakly predictive
    • The more people trade on a given indicator, the less predictive that indicator becomes.
  • Combining indicators adds value
  • Look for contrasts (choose stocks that contrasts each other, or contrasts the market).
  • Works better over shorter time periods


  1. Technical analysis is more valuable with shorter time spans.
    • Has high value in high frequency trading (Decision speed)
    • Better done by computers.
  2. Fundamental analysis is more valuable with longer time spans.
    • Has high value when needing to make a trading decision that will last a long time (Decision complexity).
    • Better done by humans.

Indicators

Most common:

  1. momentum
  2. simple moving average (SMA)
  3. bollinger bands

Indicator 1: Momentum

Definition: How much the price changed over a given time period.

- Price goes up = positive momentum
- Price goes down = negative momentum
- Steepness of the slope indicates momentum strength.


For ML, we need to convert this into quantitative data.

In pseudocode:

def momentum(price, day, n):
  """Calculate momentum of a stock; typically ranges between -0.5 and +0.5.
  price: array of prices
  day: index of the trading day
  n: number of days to look back
  """
  price_on_day = price[day]
  price_n_days_back = price[day - n]
  return price_on_day / price_n_days_back - 1

Indicator 2: Simple Moving Average (SMA)

Definition: The average of a stock's price over an n-day window, rolled across the time period. Produces a smoothed price chart that lags price movement.

Trading signals:

  • Traders use it to detect buy/sell when current price crosses through the simple moving average.
    • If prices rises upward through SMA, buy.
    • If prices dives downward through SMA, sell.
    • Strong momentum + crossing SMA can be a strong indicator.
  • Traders use it as a proxy for value. When there is large deviation, you expect to see it correct itself.
    • If current price is much higher than SMA, sell.
    • If current price is much lower than SMA, buy.

In pseudocode:

def simple_moving_average(price, day, n):
  """Calculate diff between SMA and current price. Returns ratio of current 
  price's deviation from SMA. Ranges between -.5 to +.5.
  price: price of stock
  day: day of trade
  n: number of days back
  """
  price_on_day = price[day]
  price_n_days_back = price[day - n:day]
  n_window_mean = price_n_days_back.mean()

  return price_on_day / n_window_mean - 1


Indicator 3: Bollinger Bands

BBs adjust the SMA trading trigger based on the volatility of a time window.

  • In periods of low volatility, the bands tighten around the SMA.
  • In periods of high volatility, the bands widen.
  • Calculated by setting 2 bands, 2 standard deviations above and below the SMA.
  • Stock prices will typically stay between the Bollinger bands.

Trading signals:

  • Prices goes from outside upper band to inside = SELL
  • Prices goes from outside lower band to inside = BUY

In pseudocode:

def bollinger_band(price, simple_moving_avg, day, n):
  """Calculate the Bollinger value as a trading signal. Values greater than +1
  (or less than -1) indicate the current price is above (below) the bands.
  """
  price_on_day = price[day]
  sma_on_day = simple_moving_avg[day]
  two_std = 2 * std(price[day - n:day])  # rolling std over the same n-day window
  return (price_on_day - sma_on_day) / two_std

Normalization

Problem statement: Each indicator ranges over a different scale. Values for all indicators need to be normalized to a common range (e.g. -1 to 1) so that an ML algorithm weights each indicator fairly. Otherwise, indicators with larger values will overwhelm the others.

How to do it:

normed = (values - values.mean()) / values.std()
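
For example, with several indicator columns in a pandas DataFrame (the column names and random values here are stand-ins for illustration):

import numpy as np
import pandas as pd

# Hypothetical indicator values on three very different scales
indicators = pd.DataFrame({
    'momentum': np.random.normal(0, 0.1, 100),
    'sma_ratio': np.random.normal(0, 0.05, 100),
    'pe_ratio': np.random.normal(20, 5, 100),
})
normed = (indicators - indicators.mean()) / indicators.std()  # each column now on a comparable scale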

END OF EXAM 1 LECTURES


ML4T Mini Course 2.2 : Computational Investing

02-07 Dealing with Data

How data is aggregated

  • Tick: a successful buy/sell transaction
    • Ticks are usually aggregated into minute/hour blocks.

Given a minute/hour block of ticks, we have the following stats:

  • Open: First transaction within time period
  • High: Highest transaction price
  • Low: Lowest transaction price
  • Close: Last transaction within time period
  • Volume: Amount of shares traded within time period

Stock Splits

  • Stock splits result in a reduced price per share.
    • If each share is divided into two, the price of each new share drops to half.
  • Stocks are split to reduce the share price.
    • People like to buy stocks in 100-share blocks. Same for options.
    • If the price is too high, it becomes prohibitively expensive to buy 100 shares.
  • Problem: a learning algorithm may incorrectly interpret the price drop as a shorting opportunity.
    • Solution: Use adjusted close as training data. When a stock split happens, divide all historical prices prior to it by the split ratio (if split 2-for-1, divide past data by 2):
def adj_stock_split(split_ratio: int, historical_stock_prices: np.ndarray) -> np.ndarray:
  return historical_stock_prices / split_ratio


Dividends

Dividends cause a price drop proportional to the dividend, on the day it is paid out.

  • When a dividend is announced, the price typically rises toward stock price + dividend amount (e.g. $1 per share).
  • Once the dividend is paid, the stock price drops by the dividend amount.


Adjusting for dividends:

Adjust all prior prices down by the proportion of the dividend.

def adj_dividend(dividend_cash: float, price_on_ex_date: float,
                 historical_stock_prices: np.ndarray) -> np.ndarray:
  # price_on_ex_date: the stock price when the dividend is paid (e.g. $100)
  # Scale every earlier price down by the dividend's proportion of that price
  return historical_stock_prices * (1 - dividend_cash / price_on_ex_date)

  • If a $100 stock pays a $1 dividend, adjust all prices prior to the dividend down by 1% ($1 dividend / $100 stock price), all the way back in history.
  • FYI: The most recent day has the same adjusted and actual closing price.
  • FYI: Adjusted closing prices depend on when you collected the data.

Quiz: For a $100 stock paying a $1 dividend, what is the stock price the day before the dividend, and the day the dividend is paid?

  • Stock price before the dividend: $101
  • Stock price the day the dividend is paid: $100

Survivor bias

Key point: There is a built-in bias in using today's roster of stocks for historical analysis. Some stocks failed and disappeared along the way, and that isn't accounted for.

Solution: Use survivor-bias-free data (typically not free).

02-08 Efficient Markets Hypothesis

EMH assumptions

  1. There are a large number of investors interacting in the market.
    • Because these investors operate simultaneously, prices adjust quickly to any new information.
  2. New information arrives randomly.
    • Random intervals/cadence; investors pay attention to it, and thus prices adjust quickly.
  3. Prices reflect all available information.

Where information come from

  • Price/Volume: Basis of technical analysis
  • Fundamental: Quarterly reports, basis of fundamental analysis
  • Exogenous: Info about the world that affects the company (COVID vs airline stock)
  • Company insiders: Private information within company, least accessible.

3 Forms of the EMH

  • Weak: Future prices can't be predicted by analyzing historical prices
    • Prohibits technical analysis.
    • Allows fundamental analysis and insider info, however.
  • Semi-strong: Prices adjust rapidly to new public information.
    • Prohibits technical and fundamental analysis.
    • Allows profiting from insider information.
  • Strong: Prices reflect all information public and private
    • Impossible to make money other than holding an index fund.


Is the EMH Correct?

  • If it is, we can't beat the market.
  • Strong version of EMH is the least true, especially due to insider trading.

02-09 The Fundamental Law of active portfolio management

Grinold's Fundamental Law

Deriving performance by skill and breadth:

  • Skill: Can pick stocks well.
  • Breadth: Opportunities to act on above skill.
    • For example, how many stocks you're investing in.

Formula: Performance = Skill * sqrt(Breadth)

Information Ratio: The degree to which a portfolio manager exceeds market performance.

  • Similar to Sharpe Ratio ("Sharpe ratio of excess returns")
  • Formula: Information Ratio = Information Coefficient * sqrt(Breadth)

Coin Flipping Casino - Which bet is better?


Central Question: Is it better to bet all on one coin flip, or bet on multiple coin flips at once?

Q: Which bet is better?

  1. Bet 1: 1000 tokens on one table, 0 on 999 other tables
  2. Bet 2: 1 token on each of 1000 tables
  3. Both are equivalent.

Answer: 1 token on each of 1000 tables

  • Expected return for each of these bets are the same
  • There is less risk (ie bigger Sharpe ratio) when distributing bets over multiple events, rather than all on one
  • Bet 2 has much lower risk for the same expected return

Coin Flip Casino - Calculating Reward, ie Expected Return

  • Single Bet Scenario: 0.51 * 1000 + 0.49 * -1000 = $20
  • Multi-Bet Scenario: 1000 * (0.51*1 + 0.49 * -1) = $20
  • IMPORTANT: Both bets' expected returns are the same
    • So why is multi-bet better? Risk mitigation.

Coin Flip Casino - Calculating Risk

Chance of losing everything:

  • Single bet scenario: 0.49
  • Multi-bet scenario: 0.49 * 0.49 * 0.49 ... * 0.49 = 0.49^1000
    • The per-table chances of losing everything multiply across all 1000 bets.
    • Results in a vanishingly small number. Much less risk than one coin toss.

Standard deviation of individual bets:

  • Single bet scenario: 31.62
    • Outcome is either 1000/-1000 for one table and zero for all others:
    • stdev(1000, 0, 0, ... 0) = 31.62
    • stdev(-1000, 0, 0, ... 0) = 31.62
  • Multi-Bet scenario: 1.0
    • Outcome is either +1 or -1, hence stdev is 1.

Reward/Risk Ratio (Sharpe ratio)

  • Formula: Expected Return / Risk (stdev)
  • Single bet case: $20 / $31.62 = 0.63
  • Multi-bet case: $20 / $1 = 20

Takeaway: The multi-bet strategy has a much higher reward/risk ratio than the single-bet strategy

How this relates to fundamental law

  • As we spread out our bets (increase breadth), we increase our Sharpe ratio
  • The Sharpe ratio improves by the square root of the number of bets

SR_single = 0.63
SR_multi = 20
num_bets = 1000

# This mirrors the fundamental law:
# Performance = Skill * sqrt(Breadth)
SR_multi = SR_single * sqrt(num_bets)

Think of it like:

  • SR_single is equal to how good you are at predicting future return of the stock
  • num_bets is equal to how much you spread your bets, ie. breadth.

Lessons Learned (from casino experiment)

  1. Higher alpha generates a higher Sharpe ratio
  2. More execution opportunities provides a higher Sharpe ratio
  3. The Sharpe ratio grows as the square root of breadth

IR, IC and Breadth


  • Information Ratio (IR): a Sharpe ratio of the excess, skill-based return (alpha): the numerator (mean of alpha) is the reward, the denominator (stdev of alpha) is the risk.

  • Information Coefficient (IC): Correlation of forecasts to returns.

  • Breadth (BR): Number of trading opportunities per year.

Fundamental Law


  • A portfolio manager can focus on improving skill (IC) or on finding more opportunities (breadth).
  • Breadth only contributes as a square root, so the gains from added breadth taper off.
  • Skill is very hard to improve, hence most managers look to increase breadth.

Simons vs Buffett

  • Tip: Set up the Information Ratio equation for both, then solve for the unknown x on Simons' side.

02-10 Portfolio optimization and the efficient frontier

What is risk?

Risk: Volatility, or standard deviation of historical daily returns

Visualizing return vs risk

  • X/Y graph (risk on x, return on y), scatter plot of multiple stocks
  • Build a portfolio by combining multiple assets, weighing each asset by the allocation it represents within the portfolio.
  • The combined portfolio blends the properties of each of its stocks.
  • Its risk and return land somewhere in the middle of its components'.

Building a portfolio


Can we do better? Yes, by accounting for covariance

  • The classic method is to adjust allocation weights to move the portfolio's return/risk within the region spanned by the selected stocks.
  • Accounting for covariance (researched by Harry Markowitz) lets the same selected stocks achieve an even higher return for less risk.

Why covariance matters


  • Chart shows stocks ABC, DEF, GHI all earning returns of 10%
  • Chart shows ABC and DEF moving in tandem, while GHI moves in the inverse direction.

What's the best portfolio we can build with these stocks?

Answer: Blend anti-correlated assets together

  • ABC/DEF have a positive correlation of 0.9
  • ABC/GHI have a negative correlation of -0.9
  • Blending ABC/DEF has no advantage since they move together with the same volatility
  • Blending the stocks (25% ABC, 25% DEF, 50% GHI) nets the same return with lower volatility

  • Combining anti-correlated assets allows you to build a much lower-risk portfolio.

  • Ideally, we look for anti-correlation in the short-term and positive correlation in the long term.

Mean Variance Optimization (MVO)

Definition: Taking a potential set of assets and figuring out how they should be blended together by looking at their covariance among other things.

Inputs

  1. Expected return - profit or loss that an investor anticipates on an investment that has known historical rates of return (RoR)
  2. Volatility - historical stdev of price
  3. Covariance - matrix of correlation of daily returns with other stocks
  4. Target return - the most important input; can be set anywhere between the minimum and maximum returns possible among the assets.

Output

  • Array of asset weights for portfolio that achieves target return but minimizes risk


The Efficient Frontier

  • For any particular return level, there is an optimal portfolio.
  • A dot can be placed at the optimal portfolio for each given return level.
  • Plotting the dots for all return levels traces a line: this is what is called the efficient frontier.

Meaning of EF: Anything inside the frontier is suboptimal

  • There are no portfolios above the line, and any portfolio inside the frontier is suboptimal (i.e. assuming more risk than necessary for less return than is possible).
  • Mostly theoretical, but people like to plot their portfolio to see where it sits relative to the assets they are using.

Characteristics of EF:

  1. As you reduce the target return, the curve eventually bends back: you start to gain risk while losing return (we don't want that).
  2. Everything above the point where the curve bends back is the "efficient frontier".
  3. The tangent line from the origin to the frontier touches it at the maximum-Sharpe-ratio portfolio.


Time Series Data Video


You can't train on all trading dates; that would be cheating (peeking into the future). There are ways to get around this.


Rollforward cross validation

  • Backtest to validate the model: roll back time, train on n days of data, forecast the price x days ahead, decide what position to take, and act on the next day.
  • With stocks, only roll-forward validation is valid.

Ways backtesting can go wrong

  • In sample backtesting: Backtesting over the same data you used to train.
  • Survivor bias: From wiki: "The logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. This can lead to some false conclusions in several different ways. It is a form of selection bias."
    • Example: Just using companies that exist in S&P 500 today, rather than X-years ago. Performance metrics will change once you add in those companies that did not survive.
    • How to avoid: Use historic index membership. Use these indices as your universe for testing.
  • Ignoring market impact: The act of trading affects price. Historical data doesn't include your trades, and therefore isn't an accurate representation of the price you would get.
    • This is doubly important if you're trading a large number of shares

Intro to exchange-traded options

Options: A contract which gives the buyer the right, but not the obligation, to buy or sell the underlying stock* at a specific price** on or before*** the expiration date.

  • *There are also options on underlyings other than stocks (currencies, futures)
  • **Strike price: the set price at which the derivative contract can be bought or sold when it is exercised.
  • ***Specific to US-style stock options (European-style options can't be exercised prior to the expiration date)

Example options board, AAPL stock


  1. Calls: Expiration dates of the listed call options
  2. Root: Underlying security the option is tied to
  3. Strike: Price you're guaranteed to be able to buy the stock at, if you choose to exercise the option.
  4. Last: Last price the option traded at that day before the market closed.
  5. Net: Change from the previous close.
  6. Bid/Ask Size: Number of open orders at the moment.
  7. Open interest: How many total positions of this option are open in the market right now.
  • Option contracts are always for 100 shares, no more no less.
  • Profit if exercised = (spot price - strike price) * 100 shares - premium paid (Last * 100).
  • A call is a bet that the price will rise above the strike price by the expiration date.
  • (IMPORTANT): Buying an option buys the right; you're not obligated to exercise it.

Pros/cons of options vs buying stock


Pros:

  • Leverage: you can make a fair profit with much less money than through a stock purchase.
  • Capped downside: you can't lose more than the premium paid up front.

Cons:

  • Premium is lost money.
  • Options have expiration dates.
  • You don't own the stock.

"Moneyness" of options: Why do some options cost more?

Intrinsic value - The difference between the underlying spot price and the option's strike price, for an "in-the-money" option (an option whose intrinsic value is currently positive).

  • "Out-of-the-money" options (for calls, strike price above the current stock price) have an intrinsic value of zero.

Time value of an option: The excess premium cost of an option beyond its intrinsic value, attributable to time-to-expiration.

  • The more time between now and the expiration date, the higher the likelihood the stock will reach the strike price before expiry, and thus the higher the time value.

Time decay: The rate at which an option is currently losing its time value. Also called theta.

  • The rate increases as the option nears its expiration date.
  • Active options traders typically close open positions at least 2 weeks before the expiration date to avoid losses from time decay.
  • Black-Scholes model: equation used to determine what price an option contract should trade at, at any given moment.

Options profit-loss curve

BUY CALL


  • When buying a call, you pay a premium up-front. You need the stock price to rise above the strike price plus the premium paid to break even.
  • Once the stock is past break-even, profit keeps growing as the price rises (no ceiling).
  • The maximum loss on a BUY CALL is the premium paid up front.

BUY PUT


  • Buy and pay a premium as with BUY CALL, but you're buying the option to sell the stock at the strike price. If you don't own the stock, exercising amounts to buying at the (lower) spot price and selling at the (higher) strike price.
  • If the stock price stays above the strike price, you lose the premium.
  • For every dollar the stock price sits below the strike price at expiration, you make a profit. The ceiling is the stock price going to $0.

Writing

Definition: Selling someone an option. Until the expiration date, they can force you to buy or sell the stock at the strike price, whether or not you want to, no matter what the underlying stock's price is.

  • 90% of stock options that are bought are never exercised (!)

WRITE CALL


Definition: Selling someone the right to buy an underlying stock from you at a strike price. You immediately pocket the premium, but until expiration date the buyer can buy your stock at the strike price.

  • You want the spot price to stay below strike price. As long as spot price is below strike price, you pocket the premium.

  • Best case: Buyer doesn't exercise. You keep premium + your stock.

  • Worst case: Buyer exercises the option. You're forced to make an unprofitable trade. Risky: the maximum loss is unlimited, based on how high the stock price goes. Typically, you'll make a COVERED CALL to prevent this from happening.

WRITE PUT


WRITE PUT: Give someone else the option to sell us the stock at the strike price, should they choose to do so.

  • Safer than WRITE CALL because there is a floor on how much money you can lose: the worst case is the spot price reaching $0 at expiration.
  • You want the spot price to stay above the strike price. As long as it does, you pocket the premium.

More complex options strategies

COVERED CALL

From investopedia

...financial transaction in which the investor selling call options owns an equivalent amount of the underlying security. To execute this, an investor who holds a long position in an asset then writes (sells) call options on that same asset to generate an income stream. The investor's long position in the asset is the cover because it means the seller can deliver the shares if the buyer of the call option chooses to exercise.

3 basic possibilities after you make a covered call:

  1. Stock ends above strike: "called away"; you sell at the strike price and forgo the profit above it.
    • profit: strike - purchase price + premium
    • loss: you don't have the stock anymore
  2. Stock ends up, but below strike: perfect, option not exercised.
    • profit: current price - purchase price + premium
    • loss: none, you still have the stock
  3. Stock goes down: option not exercised, you still have the stock.
    • loss: same as #2, but the premium partially offsets it.
    • The premium offset protects you during downturns because you're still making money from the premium even if the stock price goes down.

MARRIED PUT (not as important, probably not in the exam)

From investopedia

...an investor, holding a long position in a stock, purchases an at-the-money put option on the same stock to protect against depreciation in the stock's price. The benefit is that the investor can lose a small but limited amount of money on the stock in the worst scenario, yet still participates in any gains from price appreciation. The downside is that the put option costs a premium and it is usually significant.

Another pro is that you can do this instead of selling the stock to avoid incurring capital gains tax.

Butterfly Spread

From investopedia

The term butterfly spread refers to an options strategy that combines bull and bear spreads with a fixed risk and capped profit. These spreads are intended as a market-neutral strategy and pay off the most if the underlying asset does not move prior to option expiration. They involve either four calls, four puts, or a combination of puts and calls with three strike prices.

  • The wider the butterfly, the more premium you'll pay.
  • The spreads can be offset lower/higher, called a "broken butterfly spread".

Example:

  • AAPL is at 111
  • Buy options at 105 and 115, write 2 options at 110 (all CALLs)
  • Premium: -7.16 + (2 * 2.73) - 0.53 = -2.23
  • Cost to enter the butterfly: $223 (2.23 * 100)

Loss-profit curve (P/L upside down)


Interview with Tammer Kamel (Quandl)

  • Quandl provides clean financial data to hedge funds. Provides historical data to train a model.
  • What time is data available: Updated after the close.
  • Infrastructure: Cloud (Amazon), NoSQL database, API on top. Website built on Ruby on Rails.
  • Why NoSQL: huge database; relational DBs were too slow.
  • Challenge: maintaining the accuracy of 12 million data sets.
  • Timestamps: provide both original and revised values to prevent lookahead bias.
  • Flaws in data and mitigation: remove the distance between end-users and data publishers; vendors offer competing products.

  • Jump diffusion: a model to capture six-sigma events: mix normal distributions (with varying deviations) and draw from them with different probabilities, producing fat tails.

  • Strategies: yield-curve arbitrage; model the swap curve with a two- or three-factor model.
  • There was a lot less inter-stock correlation in the 90s.
  • Why alpha evaporated: the number of participants executing similar strategies. The 90s had a more volatile market and more day-traders.
  • Any alpha left? Don't look in the same places. Find new information sources (e.g. count trucks coming out of a factory in China).
  • Combining different data sources adds value.
  • Bollinger bands alone don't work; you need to mix indicators.

3 rules of thumb:

  1. Use something that makes theoretical sense; don't use neural nets.
  2. Empirically test your strategy.
  3. Beware of complexity.

ML4T Mini-Course 3.1: ML Algorithms for Trading

03-01 How ML is used at a hedge fund

The ML Problem

  • What ML does: ingests historical data through an ML algorithm to generate a model.
  • At runtime, an observation X is pushed through the model and the model returns an inference y
    • X: Price momentum, Bollinger value, current price
    • y: Future price, future return

Supervised Regression Learning

  • "supervised": provide ML algorithm X (observations) and Y (correct value)
  • "regression": numerical prediction
  • "learning": training with data

Types of supervised regression learning algorithms:

  1. Linear regression (parametric): Finds parameters for a model
    • Takes data, manipulates it to come up with parameters, then throws the data away.
  2. K-nearest neighbor (KNN) (instance-based):
    • Keeps historical data (X,y pairs) and during inference time it consults the data.
  3. Decision trees:
    • Stores a tree structure, during inference pushes query through the tree according to factors of the data.
    • Each node represents a question (usually "is x greater than or less than this other value").
    • Final node (leaf) that query ends at is the regression value that is returned.
  4. Decision forests:
    • Lots of decision trees taken together; query each one and combine the results.

How to do supervised learning with stock data

Steps:

  1. Get historical data. Data may have multiple dimensions, i.e. features (Bollinger value, P/E ratio, price, etc)
  2. Roll back time to the first data point and record the X values (features) there.
  3. Look into the future (e.g. 5 days ahead) to get the y value.
    • That future date is also called the "forecast date"
  4. Pair the X from step 2 with the y from step 3 and save the pair.
  5. Move forward one day, get the next X, y pair, and save it.
  6. Repeat until you reach the last available y. (The final few X rows have no future y and go unused.)
  7. Use the X, y pairs to build the model.

Example at a fintech company

Steps:

  1. Select X (features)
    • PE ratio, Bollinger value
  2. Select y (target)
    • Change in price, future price
  3. Consider breadth and depth of data.
    • breadth: how far back in time?
    • depth: "stock universe" how many symbols will you use to train the system.
  4. Train
  5. Predict

Confidence and Back Test Score

  • Confidence: Standard deviation among the K nearest neighbors' values. The greater the variance, the less confident we are in the prediction.
  • Back test score: Calculate how accurate model predictions were for the past X number of months. The more accurate, the higher the ranking.

More on Backtesting

General idea: Give the algorithm a limited amount of data to learn from, then have it make predictions into the future.

How to backtest:

  1. Slice a period of historical data for model to train on.
  2. Train model, and let it make a forecast.
  3. On the basis of forecast, place orders (short and long)
  4. Put the orders into a trading simulator and see how portfolio performs.
  5. Train over a new period of historical data, make a new forecast and put into trading simulator.
  6. Rinse and repeat.

Backtest reporting

Example report from QuantDesk platform:

  • Plots the historical value of our forecasted portfolio against a benchmark like the S&P 500.
  • Returns a table of stats, like absolute return, std dev, Sharpe.

Problems with regression

  1. Forecasts are noisy and uncertain.
  2. Challenging to estimate confidence.
    • SOP is to use std dev, but that alone is not that strong of an indicator.
  3. Not clear on how long you should hold a position and how you should allocate to that position.

Solution: Use policy-learning reinforcement learning (taught later in the class)

Project 3 - Scoping the problem

Goal: Use several ML algorithms on same set of training + test data and report findings.

Steps:

  1. Get historical data (X and y) between 2009 and 2010 to train models
  2. Use data from 2010 to 2012 as test data X for ML model to forecast y.
  3. Generate an order.txt to push through market simulator.
  4. Record how the strategy performs and measure its Sharpe ratio.
  5. Repeat for other ML algorithms.

03-02 Regression

Example: Parametric regression of rain & change in barometric pressure. Goal: Create a model that predicts amount of rain (mm) based on change in barometric pressure (mm)

Parametric Regression

  • The most basic approach is linear regression, y = mx + b
  • You can add more parameters to turn it into polynomial regression, y = m2*x**2 + m1*x + b
  • All of these are parametric models, where the parameters represent the model

K-nearest Neighbor (KNN)

  • Given a value on the x-axis (e.g. -5mm barometric pressure) get K nearest data points and calculate the mean of their y-values.

Kernel regression

  • Similar to KNN, but additionally weights each data point by its distance to the query point (i.e. the x value being predicted).

Parametric vs Non-parametric

Parametric and non-parametric approaches differ in how they treat the data. Parametric models throw away the data and keep only the parameters. Instance-based models keep the data and compute predictions at inference time.

Best use case for each:

  1. Parametric works well when the data follows a well-defined trajectory. A smooth curve is easier to capture with a mathematical equation.
    • With a smooth curve we are biased, meaning we have an initial guess of the form of the equation.
  2. Non-parametric works well with data that is hard to model mathematically.
    • With non-linear structure we are unbiased: we don't have a clue about the underlying equation. Better for complex patterns.

Training vs Inference Speed

  1. Parametric is slow to train, fast to infer. Inference is fast because it doesn't keep the data.
  2. Non-parametric is fast to train, but potentially slow to infer (depending on data size). Very large datasets won't work well.
  3. Non-parametric can be easily updated (just add more data), but parametric model usually requires a full re-training.

Training and Testing

Out of sample testing: Procedure of separating testing data from training data.

  • Split results in 4 datasets: X_train, y_train, X_test and y_test
  • Training is done with X_train and y_train, testing is done with the other two.
  • Accuracy is measured by how close the trained model's predictions on X_test come to y_test.

  • For stocks, you want to train on old data, and test on new data.
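For example, a chronological 60/40 split (the split ratio here is an assumption):

split = int(len(X) * 0.6)
X_train, y_train = X[:split], y[:split]   # older data for training
X_test, y_test = X[split:], y[split:]     # newer data for testing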


Project 3: API

For LinReg:

learner = LinRegLearner()
learner.train(Xtrain, Ytrain)
Y = learner.query(Xtest) ## Compare Y to Ytest


class LinRegLearner:
  def __init__(self):
    pass

  def train(self, X, Y):
    """Take X and Y and fit a line, storing params m and b"""
    self.m, self.b = favorite_linreg(X, Y)  # Can use a SciPy or NumPy linreg algo

  def query(self, X):
    """Predict Y given X"""
    Y = self.m * X + self.b
    return Y

For KNN:

learner = KNNLearner(K=3)
learner.train(Xtrain, Ytrain)
Y = learner.query(Xtest) ## Compare Y to Ytest

# Use same structure for KNN as LinReg
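A minimal KNNLearner sketch following that structure (Euclidean distance and simple averaging are assumptions):

import numpy as np

class KNNLearner:
  def __init__(self, K=3):
    self.K = K

  def train(self, X, Y):
    # Instance-based: just store the data
    self.X, self.Y = X, Y

  def query(self, X):
    # For each query point, average the y-values of the K nearest training points
    preds = []
    for x in X:
      dist = np.linalg.norm(self.X - x, axis=1)
      idx = np.argsort(dist)[:self.K]
      preds.append(self.Y[idx].mean())
    return np.array(preds)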

03-03 Assessing a learning algorithm

Closer look at KNNs

[Figure: KNN fit to the data set]

  • Good: Doesn't overfit the data.
  • Bad: The beginning and end of the fit lie flat because the set of nearest neighbors stops changing beyond the edges of the data.

Effect of changing K values (KNNs)

[Figure: effect of K on the KNN fit]

  • the smaller the K, the more overfitting occurs
  • the larger the K, the more underfitting occurs

Effect of changing D values (Parametric)

[Figure: effect of polynomial degree D on the fit]

  • the larger the number of params, the more overfitting occurs
  • the smaller the number of params, the more underfitting occurs
  • Parametric models, unlike KNN, have the ability to extrapolate beyond the first/last data points.

RMS Error

RMS error calculates the average error, ie distance between Y_test (ground truth) and Y_predict (hypothesis).

  • Squaring the differences emphasizes larger errors.
  • Make sure to test on the out of sample data set (not the train set).
  • Out of sample error is usually higher than in sample error
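In code:

import numpy as np
rmse = np.sqrt(((y_test - y_predict) ** 2).mean())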

[Figure: RMS error formula]

Cross Validation

Cross validation trains and tests multiple times on the same data set, each time putting aside a different slice of the data as the test data.

  • This is often used to overcome data sparsity, ie. not enough data to do a meaningful 60/40 split.

Roll-forward cross validation is a variant of cross validation, where the training set is always before the test set. This is important for financial data, because we don't want to peek into the future.

  • Roll-forward essentially functions like a window function, where the train/test split is incrementally moving forward in time.
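A sketch of generating roll-forward splits (the window sizes are assumptions):

def roll_forward_splits(n, train_size, test_size):
    # Training indices always come before testing indices: no peeking into the future
    start = 0
    while start + train_size + test_size <= n:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size   # slide the window forward in time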

Metric 2: Correlation

  • Correlation quantifies how close two data sets are related.
  • Use numpy's implementation np.corrcoef() to calculate correlation between Ytest and Ypredict.
  • Returns value between -1 and 1. -1 = Negative relation, 1 = Positive relation, 0 = No relation
  • Correlation is NOT the slope of the line. Correlation is showing how well aligned one dataset's points are with another.
  • It is actually an oval vs circle comparison. Tight oval = high correlation (either pos or neg), big circle means little correlation.
  • RMS error and correlation have an inverse relation (one goes up, other goes down)
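In code (np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is the number we want):

import numpy as np
corr = np.corrcoef(y_test, y_predict)[0, 1]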

Overfitting

Overfitting in parameterized models

[Figure: overfitting, in-sample vs out-of-sample error as degrees of freedom increase]

Overfitting = in-sample error still decreasing while out-of-sample error starts increasing

Overfitting in non-parametrized models

As K decreases, the more overfitting occurs. Thus the diagram of in-sample vs out-of-sample error looks like the overfit parameterized model's, but mirrored, since decreasing K plays the role of increasing the number of parameters.

Other considerations

Cost benefits of param vs non-param models:

03-04 Ensemble Learners - Bagging and Boosting

An ensemble learner is not a new algorithm; it combines several different algorithms/models.

Why use them:

  • Lower error
  • Less overfitting. Why? Each algorithm has an innate bias (e.g. linear regression is biased towards a linear trajectory), and using several together tends to cancel out the individual biases.

[Figure: ensemble of learners]

How to combine model outputs:

  1. Classification: Have each model's output vote on the best label.
  2. Regression: Use the mean of all of them.

How to build an ensemble

Example: Combining LinReg and KNN.

  1. Train several parameterized polynomials of differing degrees (1, 2, 3, etc).
  2. Train several KNN models using different subsets of data.

Bootstrap aggregating - "Bagging"

How to do bagging with regression models:

  1. Create m "bags", each containing n' training examples. Data is chosen randomly with replacement; n' should be at most ~60% of the total N.
  2. Use each "bag" to train a model, resulting in m models.
  3. Test each model with the same Xtest data.
  4. Take the mean of all the resulting Y values.
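A sketch of those steps (the function names are illustrative, not the project API):

import numpy as np

def bag_predict(make_learner, Xtrain, Ytrain, Xtest, m=20, frac=0.6):
    n_prime = int(len(Xtrain) * frac)
    predictions = []
    for _ in range(m):
        # Step 1: sample n' rows randomly with replacement
        idx = np.random.randint(0, len(Xtrain), size=n_prime)
        learner = make_learner()                   # step 2: one model per bag
        learner.train(Xtrain[idx], Ytrain[idx])
        predictions.append(learner.query(Xtest))   # step 3: same test data for all
    return np.mean(predictions, axis=0)            # step 4: mean of all Y values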

Bagging example with KNN

[Figure: bagged KNN fit]

Boosting: Ada Boost

How to ada(ptive) boost:

  1. Create an initial "bag" (~60% of data) and train a model.
  2. Use all of the training data to test the bagged model.
  3. Assign weights to each training data point based on how much error it caused.
  4. Collect training data for second bag, giving preference to those that resulted in high error in the first bag.
  5. Continue until you reach m number of bags.
  • AdaBoost is more likely to overfit than bagging, since it is choosing specific data points that were modeled poorly.

Using Bagging and Boosting in Project 3

  • Make sure to use the same API as the single learners. The caller shouldn't need to know whether bagging/boosting is happening inside the class.

Decision Trees

Components of a decision tree

  • Factors: Features (e.g. size, cost, etc.), represented as X1, X2, ..., XN
  • Labels: Y, derived from the factors. Can be either classification labels or regression values.
  • Decision Nodes: Contain a factor and a "split value", ie. whether the input data meets a condition or not. Binary decisions (e.g. Xi <= SplitVal)
    • Root node: Parent node
    • Leaves: Terminal node
  • Outgoing Edges: Paths (left, right) pointing to the next node.

Each row looks like: X0, X1, X2 ... XN, Y

Decision Tree: Data Structure

Wine tasting data set: DT graphical view

[Figure: decision tree, graphical view]

Wine tasting data set: DT tabular view

Data structure of the decision tree as a numpy array

[Figure: decision tree, tabular view]

Headers:

  • node: ID for Factor
  • Factor: Features or Leaf label
  • SplitVal: Value that determines returning a Left or Right value.
    • If node is a Leaf, contains the Y value.
  • Left & Right: Row indices in the matrix where the left and right subtrees begin.
    • If node is a Leaf, both are NaN.

Notes:

  • Root node is always the first row.

Decision Tree: Algorithm (JR Quinlan)

[Figure: decision tree algorithm]

Recursive algorithm. Make sure to terminate to prevent infinite loops.

Steps:

  1. Set termination condition first. Important for recursion.
    • If only one row of data or all data labels are the same, return label.
  2. Determine best feature i to split on
  3. Choose particular value to set SplitVal.
    • Standard method: SplitVal = np.median(data[:, i])
  4. Gather data to build Left and Right trees recursively
    lefttree = build_tree(data[data[:, i] <= SplitVal])
    righttree = build_tree(data[data[:, i] > SplitVal])
  5. Compose the root node.
    • root = [i, SplitVal, 1, lefttree.shape[0] + 1]
    • 1 is specifying relative index of left tree (next row). lefttree.shape[0] + 1 is specifying right tree index (after left tree ends).
  6. Append root, lefttree and righttree together as a new numpy array.
    • append in numpy makes a new ndarray.
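A sketch of this algorithm in numpy, using correlation to pick the feature (as in the project); edge cases such as constant columns are glossed over:

import numpy as np

def build_tree(data):
    # data rows: [X0 ... Xn-1, Y]; tree rows: [factor, SplitVal, left, right]
    if data.shape[0] == 1 or np.all(data[:, -1] == data[0, -1]):
        return np.array([[-1, data[0, -1], np.nan, np.nan]])   # leaf: factor = -1
    # Best feature = highest absolute correlation with Y
    corrs = [abs(np.corrcoef(data[:, i], data[:, -1])[0, 1])
             for i in range(data.shape[1] - 1)]
    i = int(np.argmax(corrs))
    split_val = np.median(data[:, i])
    left_mask = data[:, i] <= split_val
    if left_mask.all():                    # median failed to split: make a leaf
        return np.array([[-1, data[:, -1].mean(), np.nan, np.nan]])
    lefttree = build_tree(data[left_mask])
    righttree = build_tree(data[~left_mask])
    root = np.array([[i, split_val, 1, lefttree.shape[0] + 1]])  # relative indices
    return np.vstack((root, lefttree, righttree))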

How to determine "best" feature in a decision tree

  • Goal: Divide and conquer
  • Group data into most similar groups. Choose factor that splits the data set most efficiently.

Approaches:

  • Information gain: Entropy (most common)
  • Information gain: Correlation (use for project)
    • Look at each factor in the data and see how well correlated it is with labels.
    • Factor that is most strongly correlated is going to help us make the split most effectively.
  • Information gain: Gini index

Steps:

  1. Calculate correlation of each factor with Y values
  2. Factor with the highest correlation will be the parent node.
    • SplitVal will be the median value of the factor values (e.g. all data points of factor X11).
    • Left will be the index position right after the parent node
    • Right will be the relative index where the right subtree (values above the median) begins.
  3. Recurse step 2, splitting the group in half over and over
    • Once you get to the last 2 nodes, you'll likely get multiple factors with a correlation of 1. If so, just deterministically use the first such factor.
  4. Once you get to the leaf nodes, set SplitVal to the left leaf value and the subsequent index as the right leaf value.
  5. Pop back up the recursion of the left tree, and do the right tree.
    • Left/Right values are relative indices. Why? It makes the tree easier to assemble recursively.

Random Tree Algorithm (A Cutler)

[Figure: random tree algorithm]

  • Random tree algorithm, unlike decision tree, doesn't try to determine the best factor to split on but chooses it randomly.
  • Rather than computing the median like a DT, an RT picks 2 random rows and uses the mean of their values in the chosen factor column as the SplitVal.
  • The major effect of doing this is that it makes this algorithm much faster than a DT.

Random Forest = Multiple RTs

IMPORTANT: This learner is best used in a "Random Forest" set up, ie. "bagging" multiple RT learners. RT doesn't work well on its own, its strength is realized only when using them in an ensemble.

Several ways to make a random forest:

  1. Each tree is computed with random numbers
  2. Data used to create the tree is randomly sampled (with replacement)

Strengths & Weaknesses of Decision Trees

Cons:

  • More computationally expensive to train (RandomForest > RandomTree > DecisionTree > LinearRegression > KNN)
  • Middle of the road in query cost. LinearRegression is fastest, KNN is slowest. DTs typically use binary decisions, so querying is fast, but not as fast as LinReg, which only evaluates a few parameters and a bias.

Pros:

  • You don't need to normalize your data. For learners like KNN, you need to normalize your feature values else the ones with the greater values/range will overwhelm the others.

END OF EXAM 1


ML4T Mini-Course 3.2: ML Algorithms for Trading

Reinforcement Learning

Definition: Reinforcement learners create policies that provide specific direction on which action to take.

The RL Problem

[Figure: the RL problem]

  • E: Environment
  • S: State of the environment.
  • Pi: Policy, aids the robot in figuring out which action to take. (Can be a simple look-up table)
  • R: Reward received after each action the robot takes.
  • Q: Algorithm that tries to find the policy Pi that will maximize reward over time.

How it works (in robotics)

  1. Robot is placed in an environment.
  2. Robot observes the environment; the observed data is saved as state s.
  3. Robot has a policy, ie. a strategy that an agent uses in pursuit of goals. Robot takes in state s, processes it with its policy, and takes an action a.
  4. Action a affects the environment. The environment then transitions to a new state via the transition function T, which takes in the previous state + action and produces the new state.
  5. The new state is fed back to the robot.

Policy is created by rewards

  • Every time a robot is in a state and takes an action, a particular reward is given. The robot keeps track of the rewards it receives.
  • Robot's objective is to take actions that maximize its reward.
  • Robot has an algorithm that takes all information gained over time to figure out what the optimal policy should be.
  • Policy could just be a lookup table.

Trading as an RL problem

[Figure: trading mapped to the RL problem]

In terms of trading:

  • Environment: Stock market
  • Actions: Buy/sell/hold
  • State: Factors about stocks that we might observe and know about.
  • Reward: Return we get for making the proper trades.

Markov Decision Problems

What a Markov decision problem (MDP) is comprised of:

  • Set of states s
  • Set of actions a
  • Transition function T[s,a,s']
    • T is 3D matrix: Record in each of its cells the probability that if we are in state s and we take action a, we will end up in state s'.
    • Given T[s,a,s'], the sum of all the next states we might end up in has to sum to one. s' is a distribution that sums to one.
  • Reward function R[s,a]: Returns reward given s and a.

Objective: Find policy Pi(s) that will maximize reward.

  • The best policy is written Pi*(s) (with an asterisk)

Algorithms used to find this optimal policy:

  1. policy iteration
  2. value iteration
  • In this class, we can't use these directly because we don't know the transition function T[s,a,s'] or the reward function R[s,a] beforehand. We need to interact with the environment to learn them.

Unknown Transitions and Rewards

Most of the time we don't have either transition or rewards function. Hence, we have to interact with the real world, observe what happens, and work with that data to try to build a policy.

Experience tuple:

  • Each interaction with the environment creates an experience tuple, <s, a, s', r>
  • Each iteration creates a new tuple, where the previous s' becomes the new s.
  • Do this over and over. We can use these experience tuples to find policy Pi

How to use experience tuple to create a policy:

  1. Model-based RL: Look at this data over time and build models for transition function T[s,a,s'] and reward function R[s,a]. Fills in model by looking statistically at these transitions.
    • For T build a matrix that counts each instance of [s,a,s'] found in each experience tuple.
    • For R build a matrix that counts each instance [s,a,r] found in each experience tuple.
    • Finally, use value/policy iteration to solve the problem.
  2. Model-free RL: Type we're going to use is Q-learning
    • Develops policy just directly by looking at the data.

What to optimize?

Example: Robot in a maze.

Optimization depends on the "horizon", ie how many moves the robot should take.

  • With infinite horizon, we're trying to maximize sum of all rewards over infinity time (i=infinity).
  • With finite horizon, we're trying to maximize sum of rewards over n time in the future (i=n).

Discounted reward: Devalues rewards gained in the future, similar to interest rate concept. Method used in Q-learning.

  • Each reward is multiplied by a power of gamma: discounted reward = sum over i of gamma^(i-1) * r_i. The factor gamma^(i-1) shrinks as i grows, so rewards further in the future count less.
  • Gamma is between 0 and 1. Closer to zero: value immediate rewards. Closer to 1: value future rewards (gamma = 1 is the same as the infinite horizon).
  • If gamma is .95, each step in the future is worth 5% less than the immediate reward.
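A quick numeric check: with gamma = 0.95, three rewards of 1.0 are worth 1 + 0.95 + 0.9025, not 3:

gamma = 0.95
rewards = [1.0, 1.0, 1.0]
discounted = sum(gamma ** i * r for i, r in enumerate(rewards))  # ~2.85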

[Figure: comparison of reward horizons]

Quiz: Which gets $1M?

[Figure: quiz, which reward scheme gets $1M?]

RL Summary

[Figure: RL summary]

Q-Learning

What is Q?

Definition: Q is the value of taking action a in state s

Sum of 2 components:

  1. immediate reward, plus
  2. discounted reward

How to use (given a Q-table exists):

Pi(s) = argmax(Q[s,a], a)
  • Policy is iterated until we converge to the optimal policy.

Procedure - Big Picture

[Figure: Q-learning procedure, big picture]

Update Rule

Two parts to update rule:

  1. What is the old value that we used to have
  2. What is our improved estimate
    • alpha is the learning rate
    • gamma is the discount rate. Lower value of gamma means that we value immediate rewards. Higher gamma means future rewards are worth as much as immediate reward.
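In code, the standard Q-learning update for one experience tuple <s, a, s', r> (assuming Q is a 2-D numpy array indexed by [state, action], and alpha/gamma are defined):

best_future = Q[s_prime].max()   # improved estimate of the future value
Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * best_future)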

[Figure: Q-learning update rule]

Two finer points

  • Success depends on exploration. Accomplished by adding randomness.
    • Step 1: Flip a coin whether or not to take random action.
    • Step 2: Flip a coin to decide what action to take.
  • Successively reduce randomness over each iteration.
    • Forces system to explore early on.

Trading Problem: State

Good states to know

[Figure: good state factors for trading]

Creating the state

  • State is an integer
  • How to do it:
    • Discretize each factor: Convert real number into an integer
    • Combine each discretized factor into the overall state integer.

Discretizing

Discretizing is converting a real number into an integer across a limited scale (ie. convert numbers from 0-25 to 0-9)

How to do it:

steps = 10                      # how many groups you want (for 0-9, steps = 10)
stepsize = len(data) // steps   # integer step size
data.sort()
for i in range(0, steps):
    threshold[i] = data[(i + 1) * stepsize - 1]   # -1 keeps the last index in bounds
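To discretize a new value later, find where it falls among the thresholds (a sketch using numpy; value is a hypothetical input):

import numpy as np
state = min(int(np.searchsorted(threshold, value)), steps - 1)  # integer in [0, steps-1]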

Dyna

Dyna: Big Picture

  • Algorithm developed by Rich Sutton in order to speed up Q-learning
  • Q-learning is model free. Does not rely on T (transition matrix) nor R (rewards matrix)
  • T: Probability that if we are in state s and we take action a, will end up in s prime
  • R: Expected reward if we are in state s and take action a
  • "Hallucinating experiences"
    • Add logic that enables us to learn models of T and R, then hallucinate an experience
    • Finally, update the Q-table
    • Repeat this many times (100-200), then resume interaction with the real world
  • For each experience with the real world, we have 100-200 updates of our model using Dyna.

How to hallucinate an experience

  1. Randomly select an s
  2. Randomly select an a
  3. Infer our new state s prime by looking at T
  4. Infer r (immediate reward) by looking at the R table.

[Figure: Dyna-Q algorithm]

Learning T

  • Remember: T[s, a, s'] represents probability that if we are in state s, take a, we will end up in s'
  • To learn a model of T, we observe how these transitions occur. Ie. have an experience with the real world, get [s, a, s'], and count how many times each transition happened. This is called the T-count, or Tc.
  • How:
    • Initialize all T-count values to be a small number (avoid Divide-by-zero)
    • Start Q-learning, each time we interact with real world we observe [s, a, s'].
    • Increment that location in our T-count matrix.

How to evaluate T?

(Need to review this again)

[Figure: evaluating T]
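What the lecture is after: normalize the counts so that, for each (s, a), the probabilities over s' sum to one. A sketch, assuming Tc is a 3-D numpy array:

T = Tc / Tc.sum(axis=2, keepdims=True)   # T[s, a, s'] = Tc[s, a, s'] / sum_i Tc[s, a, i]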

Learning R

  • R[s,a] is expected reward for s,a
  • r is the immediate reward
  • Want to update the model every time we have a real experience

Formula:

alpha = 0.2
R'[s,a] = (1 - alpha) * R[s,a] + alpha * r
  • alpha * r weights the immediate reward r, our newest estimate of what the value should be.
  • This is added to the old value weighted by (1 - alpha).
  • We weigh our old value more than the newer value in order to converge more slowly.

Appendix

From Reinforcement Learning, Richard Sutton (2018): [Figure: Dyna-Q pseudocode]

From RL Course by David Silver

Why Discount?

  1. Mathematically convenient to discount rewards
  2. Avoids infinite returns in cyclic Markov processes
  3. Uncertainty about the future may not be fully represented
  4. If reward is financial, immediate rewards may earn more interest than delayed rewards.
  5. Animal/human behavior shows preference for immediate rewards.
  6. It is sometimes possible to use undiscounted Markov reward processes (ie. gamma = 1), IF all sequences terminate.

Dyna Architecture

[Figure: Dyna architecture]

Q-learner Trader Project Overview

Mapping RL to the market strategy problem

  1. States
    • Holding
    • Indicators
  2. Actions
    • Long
    • Nothing
    • Short
  3. Reward
    • Return

Discretizing states

Each indicator's values need to be discretized. The discretized integers are then concatenated together to form the state.

Policy pseudocode

For training

while not converged:
  X = calculate_indicators()
  query_set_state(X)            # set the initial state without updating the Q-table
  for each day:
    reward = calc_reward()
    action = query(X, reward)   # update the Q-table and get the next action
    # Take a LONG/SHORT/CASH position
    simulate_trade(action)
    add_action_to_dataframe()
  # How to tell if converged?
  # If the policy (Q-values) stops changing
  check_if_converged()

For testing, do the same as above but just query state and implement the action:

X = calculate_indicators()
query_set_state(X)
for each day:
  action = query_set_state(X)
  simulate_trade(action)
  add_action_to_dataframe()
  X = new_state

Biggest trap: Overfitting to data
