daily_return[time] = (price[time] / price[time - 1]) - 1
Adding lines to plots
plt.axvline(value, color='w', linestyle='dashed', linewidth=2)
# Compute daily returns
daily_returns = compute_daily_returns(df)
# Scatter plot
daily_returns.plot(kind='scatter', x='SPY', y='GLD')
# Draw a line; for a polynomial of degree 1 it is y = mx + b
# np.polyfit returns 1) the slope coefficient (m) and 2) the intercept (b)
# 'm' is beta (slope), 'b' is alpha
beta_XOM, alpha_XOM = np.polyfit(daily_returns['SPY'], daily_returns['XOM'], 1)
# Plot the line
# For every value of x (SPY in this case), find value of y using mx+b
plt.plot(
    daily_returns['SPY'],
    beta_XOM * daily_returns['SPY'] + alpha_XOM,
    '-',
    color='r',
)
Why kurtosis matters: the 2008 crash. Investment banks built bonds based on mortgages. They assumed the distribution of returns for these mortgages was Gaussian, so they thought the bonds had a low probability of default. Two mistakes:
Computing daily portfolio value:
1. Start with `prices`: columns are symbols, rows are stock prices at each date.
2. Normalize `prices` by the first row to get `normed`.
3. Multiply `normed` by the allocations, giving the relative value of each symbol over time.
4. Multiply by the initial investment (`start_val`), which gives the actual value for each symbol over time.
5. Sum across symbols to get the total portfolio value per day.
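A minimal pandas sketch of these steps (my reconstruction; assumes `prices` is a DataFrame of adjusted closes and `allocs` is a list of allocations summing to 1.0):

```python
import pandas as pd

def portfolio_value(prices: pd.DataFrame, allocs, start_val=1.0):
    normed = prices / prices.iloc[0]   # step 2: normalize to the first day
    alloced = normed * allocs          # step 3: apply allocations
    pos_vals = alloced * start_val     # step 4: scale by initial investment
    return pos_vals.sum(axis=1)        # step 5: total portfolio value per day
```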
Definition: the Sharpe ratio is a metric that adjusts return for risk.
"Ex ante" formulation (looking forward):
$$\text{Sharpe Ratio}=\frac{E[R_{p}-R_{f}]}{std[R_{p}]}$$

We divide by volatility, so the higher the volatility, the lower the ratio. A higher portfolio return raises the ratio; a higher risk-free rate lowers it.
Code:
sharpe_ratio = np.mean(daily_rets - daily_rf) / np.std(daily_rets)
What is the risk-free rate (`daily_rf`)?
Coding the risk-free rate adjusted by the sampling rate:
# start_value: starting value used to calculate the risk-free rate (typically 1.0)
# rf_rate: annual risk-free rate
# sample_rate: sample rate to adjust the Sharpe ratio. Daily is 252.
adj_rf_rate = (start_value + rf_rate) ** (1 / sample_rate) - 1  # note: plus, not minus
SR_annualized = K * SR, where K = sqrt(samples_per_year)
daily_k = np.sqrt(252)
weekly_k = np.sqrt(52)
monthly_k = np.sqrt(12)
You'll need to add this to the final formula:
daily_k = np.sqrt(252)
sharpe_ratio_daily = daily_k * np.mean(daily_rets - daily_rf) / np.std(daily_rets)
Given: mean daily return = 0.001, daily risk-free rate = 0.0002, std of daily returns = 0.001.
Answer: 12.7
daily_k = np.sqrt(252)
daily_k * (0.001 - 0.0002) / 0.001  # ≈ 12.7
Optimizers: algorithms that can:
How to use:
1. Provide a function to minimize, e.g. f(x) = x**2 + 0.5
2. Provide an initial guess for x
3. Call the optimizer

import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as spo

def f(X):
    """Given a scalar X, return some value (a real number)."""
    Y = (X - 1.5)**2 + 0.5
    print("X = {}, Y = {}".format(X, Y))  # for tracing
    return Y

def test_run():
    Xguess = 2.0
    min_result = spo.minimize(f, Xguess, method='SLSQP', options={'disp': True})
    print("Minima found at:")
    print("X = {}, Y = {}".format(min_result.x, min_result.fun))

    # Plot function values, mark minima
    Xplot = np.linspace(0.5, 2.5, 21)
    Yplot = f(Xplot)
    plt.plot(Xplot, Yplot)
    plt.plot(min_result.x, min_result.fun, 'ro')
    plt.title("Minima of an objective function")
    plt.show()
1, 2, and 4 are hard functions to solve because:
Convex function: pick any two points on the graph; the function is convex if the line segment drawn between them lies above (or on) the graph.
Given f(x) = C0 * x + C1, the task is to find the slope (C0) and intercept (C1) that best fit the data.
Key is to remove negative values (hence squaring the errors):
"""Minimize an objective function using SciPy."""
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.optimize as spo
def error(line, data): # error function
"""Compute error between given line model and observed data.
Parameters
----------
line: tuple/list/array (C0, C1) where C0 is slope and C1 is Y-intercept
data: 2D array where each row is a point (x, y)
Returns error as a single real value.
"""
# Metric: Sum of squared Y-axis differences
err = np.sum((data[:, 1] - (line[0] * data[:, 0] + line[1])) ** 2)
return err
def fit_line(data, error_func):
"""Fit a line to given data, using a supplied error function.
Parameters
----------
data: 2D array where each row is a point (X0, Y)
error_func: function that computes the error between a line and observed data
Returns line that minimizes the error function.
"""
# Generate initial guess for line model
l = np.float32([0, np.mean(data[:, 1])]) # slope = 0, intercept = mean(y values)
# Plot initial guess (optional)
x_ends = np.float32([-5, 5])
plt.plot(x_ends, l[0] * x_ends + l[1], "m--", linewidth=2.0, label="Initial guess")
# Call optimizer to minimize error function
result = spo.minimize(error_func, l, args=(data,), method="SLSQP", options={"disp": True})
return result.x
def test_run():
# Define original line
l_orig = np.float32([4, 2])
print("Original line: C0 = {}, C1 = {}".format(l_orig[0], l_orig[1]))
Xorig = np.linspace(0, 10, 21)
Yorig = l_orig[0] * Xorig + l_orig[1]
plt.plot(Xorig, Yorig, "b--", linewidth=2.0, label="Original line")
# Generate noisy data points
noise_sigma = 3.0
noise = np.random.normal(0, noise_sigma, Yorig.shape)
data = np.asarray([Xorig, Yorig + noise]).T
plt.plot(data[:, 0], data[:, 1], "go", label="Data points")
# Try to fit a line to this data
l_fit = fit_line(data, error)
print("Fitted line: C0 = {}, C1 = {}".format(l_fit[0], l_fit[1]))
plt.plot(
data[:, 0], l_fit[0] * data[:, 0] + l_fit[1], "r--", linewidth=2.0, label="Fitted Line"
)
# Add a legend and show plot
plt.legend(loc="upper right")
plt.show()
if __name__ == "__main__":
test_run()
Framing portfolio optimization for the optimizer:
- Define `f(X)`, where `X` is the array of allocations and the function result is the Sharpe ratio (negated, so that minimizing `f` maximizes Sharpe).
- Ranges: limits on the values `X` should take (ie. between 0 and 1). Ranges limit the search area significantly, making it easier to optimize.
- Constraints: properties of `X` that must be `True` (e.g. the allocations must sum to 1.0).
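A minimal sketch of this framing with SciPy (my reconstruction; `neg_sharpe` and `optimize_allocations` are hypothetical names, and `prices` is assumed to be a 2D numpy array of shape (days, symbols)):

```python
import numpy as np
import scipy.optimize as spo

def neg_sharpe(allocs, prices):
    """Negative Sharpe ratio, so that minimizing it maximizes Sharpe."""
    normed = prices / prices[0]
    port_val = (normed * allocs).sum(axis=1)
    daily_rets = port_val[1:] / port_val[:-1] - 1
    return -np.sqrt(252) * daily_rets.mean() / daily_rets.std()

def optimize_allocations(prices):
    n = prices.shape[1]
    guess = np.ones(n) / n                    # equal allocations to start
    bounds = [(0.0, 1.0)] * n                 # ranges: each allocation in [0, 1]
    constraints = ({'type': 'eq',             # constraint: allocations sum to 1.0
                    'fun': lambda allocs: 1.0 - np.sum(allocs)},)
    result = spo.minimize(neg_sharpe, guess, args=(prices,), method='SLSQP',
                          bounds=bounds, constraints=constraints)
    return result.x
```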
Three broad classes of funds:
Terms:
Assets Under Management (AUM): How much money is being managed by the fund.
How they are compensated:
Expense-ratio funds are incentivized to accumulate AUM, while "two and twenty" funds are incentivized toward profits and risk taking.
Types of investors:
Why invest in a fund:
Goals:
Metrics:
- Cumulative return: (val[-1] / val[0]) - 1
- Volatility: daily_rets.std()
- Sharpe ratio: sqrt(trading_days) * mean(daily_rets - risk_free_rate) / daily_rets.std()
Inside a hedge fund:
- The `N-day forecast` predicts (with ML) what stock prices are going to be at some point in the future. It informs the `portfolio optimizer`, which balances the target portfolio.
- The `N-day forecast` is made from `historical price data`.
- The trading algorithm also draws on an `information feed` (proprietary info) + historical data.

When you make a stock order, the order is sent to a stockbroker, which then executes the order.
Components of an order:
Examples:
BUY,IBM,100,LIMIT,99.95
SELL,GOOG,150,MARKET
Each exchange keeps an order book for every stock that is bought or sold.
- `ASK`: request to sell
- `BID`: request to buy

With more sell orders, the stock price goes down; with more buy orders, it goes up.
Colocated servers: Hedge funds have servers located on the premises of the stock exchange. This means they have faster access to the active order book.
All of these order types are facilitated by the broker:
Brokers take care of shorting.
What can go wrong:
A: Between $10 and $25, depending on interest rates. The value of a future $1 decreases over time, and the sum of the individual $1's generated every year converges to a finite value.
Any promise of a given amount paid in the future devalues over time.
Many market strategies look for deviations among these three values:
present_value = future_value / ((1 + interest_rate) ** time)
Present value shrinks as i (time) becomes larger. Some calculations:
$1 / (1 + 0.01) = $0.99
$1 / (1 + 0.05) = $0.95
How to calculate intrinsic value:
- `FV`: future value, a dividend paid at a regular interval
- `DR`: discount rate, the interest rate you expect to be paid back (e.g. 5%)

Intrinsic Value Equation:
Intrinsic Value = Future Value / Discount Rate
For example, a $1 dividend paid every year at DR = 5% gives an intrinsic value of $1 / 0.05 = $20.
Book Value: "Total assets (not including intangible assets) minus liabilities"
Market Cap Equation: Market Cap = Number of Shares * Price
Q: Would you buy this company for market cap ($75M)?
A: Yes, because its book value is currently $80M which is more than $75M. The company is undervalued.
Definition: A portfolio is a weighted set of assets.
Each asset i has a weight w_i, and the weights sum to one:
1.0 = sum(abs(w_i))
We use abs because you can short stocks (negative weights).

Calculating returns of a portfolio:
t = time
W = weight of each asset
R = return of each asset at time t
portfolio_return[t] = sum(W * R[t])
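A quick worked example (hypothetical numbers: 75% long asset A returning +1%, 25% short asset B returning -2%):

```python
import numpy as np

W = np.array([0.75, -0.25])  # weights; negative weight = short position
R = np.array([0.01, -0.02])  # each asset's return at time t
port_ret = (W * R).sum()     # 0.0075 + 0.005 = 0.0125, i.e. +1.25%
```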
For a market-cap-weighted portfolio (e.g. the S&P 500): w_i = mktcap_i / sum(all mktcaps)
Memorize and understand this for the exam.
Equation - a particular stock's return at time t is the sum of a market component and a stock-specific component:
stock_return(t) = beta(i) * market_return(t) + alpha(i, t)
Tenets:
Beta and alpha are calculated by plotting a stock's daily returns against a market's (e.g. S&P 500) returns for the same days and fitting a line.
Beta: Slope of the line for a particular stock's scatterplot fit to a line. Higher slope equals higher beta.
Alpha: y-intercept of the stock. CAPM expects this to be zero, but not always the case.
Both passive and active managers are affected by beta. Where they differ is how they treat alpha.
Equation:
W = weights for each stock
i = specific stock
portfolio_return(t) = sum over i of W(i) * (beta(i) * market_return(t) + alpha(i, t))
CAPM simplifies this because it only cares about beta and assumes alpha is zero:
portfolio_return(t) = beta(portfolio) * market_return(t)
Observed by Stephen Ross, 1976.
Idea: CAPM only looks at one market, what about considering multiple sectors using multiple betas?
(See the "Is the stock market rigged?" video.)
Given a two-stock scenario, what are the returns for each?
# Formula: return = beta * market_return + alpha
# Stock A
beta_a = 1.0
market_return = 0.1
alpha_a = 0.01
position_a = 50
return_a = beta_a * market_return + alpha_a # 0.11
return_a_dollar = return_a * position_a # 5.5
# Stock B
beta_b = 2.0
alpha_b = -0.01
position_b = -50
return_b = beta_b * market_return + alpha_b # 0.19
return_b_dollar = return_b * position_b # -9.5
Takeaway: Even with perfect alpha and beta we can still lose money.
In code:
weight_a = 0.5
alpha_a = 0.01
weight_b = -0.5
alpha_b = -0.01
beta_a = 1.0
beta_b = 2.0
portfolio_return = ((weight_a * beta_a + weight_b * beta_b) * market_return
                    + weight_a * alpha_a + weight_b * alpha_b)
# With these numbers: portfolio_return = -0.5 * market_return + 0.01
Takeaway:
In code:
# Goal: Make beta product equal to zero
beta_product = 0
beta_a = 1.0
beta_b = 2.0
beta_product = weight_a * beta_a + weight_b * beta_b
# 1. Solve beta_product = 0 for weight_a (negative because it's a short hedge)
weight_a = -(beta_b / beta_a) * weight_b
weight_a = -2 * weight_b
# 2. Sum of both weights equal to 1
abs(weight_a) + abs(weight_b) = 1
# 3. Replace weight_a with point 1
abs(-2 * weight_b) + abs(weight_b) = 1
3*abs(weight_b) = 1
abs(weight_b) = 1/3
weight_b = -1/3 # because weight_b is a short
# 4. Use calc from 1 using derived weight_b
weight_a = -2 * -1/3 # 2/3
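A quick check of the derived weights (using the betas and alphas from the scenario above):

```python
beta_a, beta_b = 1.0, 2.0
weight_a, weight_b = 2/3, -1/3

# Portfolio beta is zero: market exposure has been removed.
assert abs(weight_a * beta_a + weight_b * beta_b) < 1e-12

# The remaining return comes only from alpha.
alpha_a, alpha_b = 0.01, -0.01
portfolio_alpha = weight_a * alpha_a + weight_b * alpha_b  # ≈ 0.01
```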
Assuming:
CAPM enables:
What it is:
Why it may work:
| Technical | Fundamental |
|---|---|
| Moving average | Price-to-earnings ratio |
| % change in volume | Intrinsic value |
Rules of thumb:
Most common:
Definition: how much the price changed over a given time period.
- Price goes up = positive momentum
- Price goes down = negative momentum
- Steepness of the slope indicates momentum strength.

For ML, we need to convert this into quantitative data.
In pseudocode:
def momentum(price, day, n):
    """Calculate momentum of a stock; typically ranges between -0.5 and +0.5.

    price: price series of the stock
    day: day of trade
    n: number of days back
    """
    price_on_day = price[day]
    price_n_days_back = price[day - n]
    return price_on_day / price_n_days_back - 1
Definition: Takes an average of a stock's price over an n-day window and rolls that over a time period. Returns a smooth price chart and lags in movement.
Trading signals:
In pseudocode:
def simple_moving_average(price, day, n):
    """Calculate the difference between the current price and the SMA.
    Returns the ratio of the current price's deviation from the SMA;
    ranges between -0.5 and +0.5.

    price: price series of the stock
    day: day of trade
    n: number of days back
    """
    price_on_day = price[day]
    price_n_days_back = price[day - n:day]
    n_window_mean = price_n_days_back.mean()
    return price_on_day / n_window_mean - 1
BBs adjust the SMA trigger based on the volatility of a time window.
Trading signals:
In pseudocode:
def bollinger_band(price, simple_moving_avg, day, n):
    """Calculate the trading-signal value. Values beyond +1/-1 indicate the
    current price is above/below the Bollinger bands.

    n: window size used for the rolling standard deviation
    """
    price_on_day = price[day]
    sma_on_day = simple_moving_avg[day]
    two_std = 2 * price[day - n:day].std()  # std over the window, not a single price
    return (price_on_day - sma_on_day) / two_std
Problem statement: each indicator's values span different ranges. Values for all indicators need to be normalized to a common range (e.g. -1 to 1) so that an ML algorithm weights each indicator fairly. Otherwise, indicators with larger values will overwhelm the others.
How to do it:
normed = (values - values.mean()) / values.std()
Given a minute/hour block of ticks, we have the following stats:
def adj_stock_split(split_ratio: int, historical_stock_prices: np.ndarray) -> np.ndarray:
    return historical_stock_prices / split_ratio
Dividends cause a price drop proportional to the amount of the dividend on the day it is paid out.
Adjust all prior prices down by the proportion of the dividend.
def adj_dividend(dividend_cash: float, historical_stock_prices: np.ndarray) -> np.ndarray:
    # Scale prior prices down by the dividend's proportion of the price
    # on the payout date (assumed here to be the last price in the array).
    payout_price = historical_stock_prices[-1]
    return historical_stock_prices * (1 - dividend_cash / payout_price)
Quiz: What is the stock price just before the dividend, and the price on the day the dividend is paid?
Key point: there is a built-in bias in using today's stock universe for backtests. Some stocks failed and disappeared, and we didn't account for those.
Solution: use survivorship-bias-free data (typically not free).
Deriving performance by skill and breadth:
Formula: Performance = Skill * sqrt(Breadth)
Information Ratio: the degree to which a portfolio manager exceeds market performance.
Information Ratio = Information Coefficient * sqrt(Breadth)
Central Question: Is it better to bet all on one coin flip, or bet on multiple coin flips at once?
Q: Which bet is better?
Answer: 1 token on each of 1000 tables
Expected value, single bet: 0.51 * 1000 + 0.49 * -1000 = $20
Expected value, multi-bet: 1000 * (0.51 * 1 + 0.49 * -1) = $20
Risk (chance of losing everything):
- Single bet: 0.49
- Multi-bet: 0.49 * 0.49 * 0.49 ... 0.49 = 0.49^1000

Risk (standard deviation of per-table outcomes):
- Single bet: stdev(1000, 0, 0, ... 0) = stdev(-1000, 0, 0, ... 0) = 31.62
- Multi-bet: stdev of the individual ±1 outcomes = 1.0

Risk-adjusted reward:
- Single bet: $20 / $31.62 = 0.63
- Multi-bet: $20 / $1 = 20
Takeaway: Multi-bet strategy has a much higher risk/reward ratio than single-bet strategy
SR_single = 0.63
SR_multi = 20
num_bets = 1000
# This looks a lot like the Information Coefficient
# Performance = Skill * sqrt(Breadth)
SR_multi = SR_single * sqrt(num_bets)
Think of it like:
- `SR_single` is how good you are at predicting the future return of a stock.
- `num_bets` is how much you spread your bets, i.e. breadth.
- Higher `alpha` generates a higher Sharpe ratio.
- The Information Ratio (IR) measures `alpha` (i.e. skill): the numerator is the reward, the denominator is the risk.
Information Coefficient (IC): correlation of forecasts to returns.
To compare managers, set their information ratios equal and solve for the unknown breadth `x` on Simons's side.

Risk: volatility, or standard deviation of historical daily returns.
Answer: blend anti-correlated assets together.
Blending stocks (25% each in ABC and DEF, 50% in GHI) nets the same return but lower volatility.
Combining anti-correlated assets gives you a much lower-risk portfolio.
Definition: Taking a potential set of assets and figuring out how they should be blended together by looking at their covariance among other things.
Meaning of EF: Anything inside the frontier is suboptimal
Characteristics of EF:
Can't train on all trading dates; that would be cheating (peeking into the future). There are ways to get around this.
Options: Contract which gives the buyer the right, but not the obligation, to buy or sell the underlying stock* at a specific price** on or before*** the expiration date.
Good situation:
Pros:
Cons:
Intrinsic value - the difference between the option strike price and the underlying spot price, for an "in-the-money" option (an option that currently has positive intrinsic value).
Time value of an option: the excess premium cost of an option beyond its intrinsic value, attributed to time-to-expiration.
Time decay: The rate at which an option is currently losing its time value. Also called theta.
Definition: when you sell (write) someone an option, then until the expiration date they can force you to buy or sell them the stock at the strike price, whether or not you want to and no matter what the price of the underlying stock is.
Definition: Selling someone the right to buy an underlying stock from you at a strike price. You immediately pocket the premium, but until expiration date the buyer can buy your stock at the strike price.
You want the spot price to stay below strike price. As long as spot price is below strike price, you pocket the premium.
Best case: Buyer doesn't exercise. You keep premium + your stock.
WRITE PUT: Give someone else the option to sell us the stock at the strike price, should they choose to do so.
From investopedia
...financial transaction in which the investor selling call options owns an equivalent amount of the underlying security. To execute this, an investor who holds a long position in an asset then writes (sells) call options on that same asset to generate an income stream. The investor's long position in the asset is the cover because it means the seller can deliver the shares if the buyer of the call option chooses to exercise.
3 basic possibilities after you make a covered call:
From investopedia
...an investor, holding a long position in a stock, purchases an at-the-money put option on the same stock to protect against depreciation in the stock's price. The benefit is that the investor can lose a small but limited amount of money on the stock in the worst scenario, yet still participates in any gains from price appreciation. The downside is that the put option costs a premium and it is usually significant.
Another pro is that you can do this instead of selling the stock to avoid incurring capital gains tax.
From investopedia
The term butterfly spread refers to an options strategy that combines bull and bear spreads with a fixed risk and capped profit. These spreads are intended as a market-neutral strategy and pay off the most if the underlying asset does not move prior to option expiration. They involve either four calls, four puts, or a combination of puts and calls with three strike prices.
Example:
Loss-profit curve (P/L upside down)
Flaws in data, and mitigations: reduce the distance between the end user and the data publishers; use vendors offering competing products.
Jump diffusion: a model to capture six-sigma events. Mix normal distributions (with varying deviations) and draw from them with different probabilities.
3 rules of thumb:
- `X` is pushed through the model and the model returns an inference `y`.
- Example `X`: price momentum, Bollinger value, current price
- Example `y`: future price, future returns
- Training uses `X` (observations) and `Y` (correct values).

Types of supervised regression learning algorithms:
- Linear regression (parametric)
- K nearest neighbors (KNN, instance-based)
- Decision trees (split on questions like "is `x` greater than or less than this other value")
- Decision forests (ensembles of decision trees)

How to build the training data from historical stock data:
1. Select a date in the history.
2. Compute the `X` values (ie. features) there.
3. Look ahead N days to find the corresponding `y` value.
4. Pair the `X` from step 2 with the `y` in step 3 and save to data.
5. Roll forward one day and repeat, saving each `X` and `y` pairing to data.
6. Train on the `y` data. (Leave some leftover `X` data not used, for testing.)
7. Use the `X`, `y` pairs to build the model.

Steps:
1. Select `X` (features).
2. Select `y` (target).

Confidence and Back Test Score
General idea: give the algorithm a limited amount of data to learn from, then have it make predictions into the future.
How to backtest:
Example report from QuantDesk platform:
Solution: use policy-learning reinforcement learning (taught later in the class).
Goal: Use several ML algorithms on same set of training + test data and report findings.
Steps:
1. Use data (`X` and `y`) between 2009 and 2010 to train the models.
2. Feed new `X` to each ML model to forecast `y`.
3. Generate `order.txt` to push through the market simulator.

Example: parametric regression of rain vs. change in barometric pressure. Goal: create a model that predicts the amount of rain (mm) based on the change in barometric pressure (mm).
- Linear (parametric): y = m*x + b
- Polynomial (parametric): y = m2*x**2 + m1*x + b
- KNN (instance-based): find the `K` nearest data points and calculate the mean of their y-values.

Parametric and non-parametric models differ in how they treat the data. Parametric models throw away the data and use only the parameters. Instance-based models keep the data and calculate predictions at inference time.
Out-of-sample testing: the procedure of separating testing data from training data.
- Split the data into `X_train`, `y_train`, `X_test`, and `y_test`.
- Training is done with `X_train` and `y_train`; testing is done with the other two.
- Accuracy is measured by whether the trained model's prediction, given `X_test`, is equivalent to `y_test`.
For stocks, you want to train on old data, and test on new data.
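A minimal sketch of such a time-ordered split (assuming `X` and `y` are numpy arrays ordered by date; the 60/40 ratio is an arbitrary choice):

```python
split = int(0.6 * len(X))                # train on the older 60%
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]    # test on the newer 40%
```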
For LinReg:
learner = LinRegLearner()
learner.train(Xtrain, Ytrain)
Y = learner.query(Xtest) ## Compare Y to Ytest
class LinRegLearner:
    def __init__(self):
        pass

    def train(self, X, Y):
        """Take X and Y and fit a line, returning params m and b."""
        self.m, self.b = favorite_linreg(X, Y)  # can use a SciPy or NumPy linreg algo

    def query(self, X):
        """Predict Y given X."""
        Y = self.m * X + self.b
        return Y
For KNN:
learner = KNNLearner(K=3)
learner.train(Xtrain, Ytrain)
Y = learner.query(Xtest) ## Compare Y to Ytest
# Use same structure for KNN as LinReg
- The smaller the `K`, the more overfitting occurs.
- The larger the `K`, the more underfitting occurs.

RMS error measures the average error, i.e. the distance between `Y_test` (ground truth) and `Y_predict` (hypothesis).
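As a formula (the standard RMSE definition, added here for reference):

$$RMSE = \sqrt{\frac{\sum_{i}(Y_{test,i} - Y_{predict,i})^{2}}{N}}$$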
Cross validation: training and testing are run multiple times on the same data set, but each time a different slice of the data is put aside as the test data.
Roll-forward cross validation is a variant where the training set always comes before the test set. This matters for financial data, because we don't want to peek into the future.
Use `np.corrcoef()` to calculate the correlation between `Ytest` and `Ypredict`. Correlation ranges between `-1` and `1`: `-1` = negative relation, `1` = positive relation, `0` = no relation.

Overfitting = in-sample error decreasing while out-of-sample error increases.
As K decreases, overfitting increases. Thus the diagram of in-sample vs. out-of-sample error looks like the overfit parameterized-model diagram flipped 180 degrees.
Cost benefits of param vs non-param models:
Ensemble learners are not a new algorithm; they combine several different algorithms/models.
Why use them:
How to combine model outputs:
- The models' output `Y` values vote on which answer is best.

Example: combining LinReg and KNN together.
How to do bagging with regression models:
1. Create `m` "bags", each containing `n'` training examples. Data is chosen randomly with replacement. The total `n'` in each bag should be less than ~60% of `N`.
2. Train `m` models, one on each bag.
3. Query all `m` models with the `Xtest` data.
4. Take the mean of the `m` models' output `Y` values. (A minimal sketch follows.)
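A minimal bagging sketch (my reconstruction; assumes each learner exposes the `train`/`query` interface used by the LinReg and KNN examples above):

```python
import numpy as np

class BagLearner:
    def __init__(self, learner_factory, m=20, bag_frac=0.6):
        self.learners = [learner_factory() for _ in range(m)]
        self.bag_frac = bag_frac

    def train(self, X, y):
        n_prime = int(self.bag_frac * X.shape[0])
        for learner in self.learners:
            # Sample n' rows randomly *with replacement* for this bag.
            idx = np.random.randint(0, X.shape[0], size=n_prime)
            learner.train(X[idx], y[idx])

    def query(self, X):
        # Mean of the m models' outputs.
        return np.mean([learner.query(X) for learner in self.learners], axis=0)
```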
How to ada(ptive) boost:
1. Build `m` bags as before, but bias the random sampling: after each model is trained, test it on the training data, and make the instances it predicts poorly more likely to be drawn into the next bag.

Decision trees:
- Each data row has factors `X1`, `X2` ... `XN` and a `Y` derived from the factors. Can be used for both classification and regression.
- Each decision node tests a condition of the form (`Xi <= SplitVal`).
)Each row looks like: X0, X1, X2 ... XN, Y
Data structure of the decision tree as a numpy array.
Headers:
- `node`: ID/row index of the node
- `Factor`: the feature to split on, or the `Leaf` label
- `SplitVal`: value that determines whether the `Left` or `Right` branch is taken. For a `Leaf`, contains the `Y` value.
- `Left` & `Right`: where in the matrix each subtree begins. For a `Leaf`, both are `NaN`.

Notes:
- The `Root` node is always the first row.

Recursive algorithm. Make sure to terminate to prevent infinite loops.
Steps:
1. Choose the factor `i` to split on.
2. Choose the `SplitVal`, e.g. `data[:, i].median()`.
3. Build the `Left` and `Right` trees recursively:
lefttree = build_tree(data[data[:, i] <= SplitVal])
righttree = build_tree(data[data[:, i] > SplitVal])
root = [i, SplitVal, 1, lefttree.shape[0] + 1]
- The `1` specifies the relative index of the left tree (the next row); `lefttree.shape[0] + 1` specifies the right tree's index (just after the left tree ends).
4. Append `root`, `lefttree`, and `righttree` together as a new numpy array (`append` in numpy makes a new ndarray). (A fuller sketch follows.)
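A fuller sketch pulling the steps above together (my reconstruction; `best_factor` here is a simple correlation-based choice, and `Factor = -1` marks a leaf):

```python
import numpy as np

def best_factor(data):
    # Pick the factor most correlated (in absolute value) with Y.
    corrs = [abs(np.corrcoef(data[:, i], data[:, -1])[0, 1])
             for i in range(data.shape[1] - 1)]
    return int(np.argmax(corrs))

def build_tree(data):
    """Rows are [X0 .. XN, Y]; returns rows of [Factor, SplitVal, Left, Right]."""
    if data.shape[0] == 1 or np.all(data[:, -1] == data[0, -1]):
        return np.array([[-1, data[0, -1], np.nan, np.nan]])        # leaf
    i = best_factor(data)
    split_val = np.median(data[:, i])
    left_mask = data[:, i] <= split_val
    if left_mask.all() or (~left_mask).all():                       # degenerate split
        return np.array([[-1, data[:, -1].mean(), np.nan, np.nan]])
    lefttree = build_tree(data[left_mask])
    righttree = build_tree(data[~left_mask])
    root = np.array([[i, split_val, 1, lefttree.shape[0] + 1]])
    return np.vstack([root, lefttree, righttree])
```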
Approaches:
Steps:
1. If the remaining rows can't be split, return a leaf with the mean of their `Y` values.
2. Pick a random factor (e.g. `X11`) and a random `SplitVal`.
3. `Left` will be the index position right after the parent node; `Right` will be the first node index after the left subtree (the higher sorted median values).
4. Check for degenerate splits where all rows land on one side. If so, just deterministically use the first factor.
5. For a two-leaf split, set `SplitVal` to the left leaf value and the subsequent index as the right leaf value.
6. `Left` and `Right` values are relative indices. Why? Easier to build the tree.

IMPORTANT: This learner is best used in a "Random Forest" setup, i.e. "bagging" multiple RT learners. RT doesn't work well on its own; its strength is realized only when using it in an ensemble.
Several ways to make a random forest:
Cons:
Pros:
END OF EXAM 1
Definition: Reinforcement learners create policies that provide specific direction on which action to take.
- `E`: environment
- `S`: state of the environment
- `Pi`: policy, aids the robot in figuring out what action to take (can be a simple look-up table)
- `R`: reward received after each action the robot takes
- `Q`: algorithm that tries to find the `Pi` that will maximize its reward over time

The loop: the robot observes state `s`, processes it with its policy, and takes an action `a`. The action `a` affects the environment, which then transitions to a new state. `T` is a transition function that takes in the previous state + action and moves to the new state.

In terms of trading:
What Markov decision problems (MDPs) are comprised of:
- States `s`
- Actions `a`
- Transition function `T[s,a,s']`: the probability that if we are in state `s` and we take action `a`, we will end up in state `s'`. For a given `s` and `a`, the probabilities over all the next states we might end up in have to sum to one, i.e. `s'` is drawn from a distribution that sums to one.
- Reward function `R[s,a]`: returns the reward given `s` and `a`.

Objective: find the policy `Pi(s)` that will maximize reward.
Algorithms used to find this optimal policy (e.g. policy iteration, value iteration) require knowing the transition function `T[s,a,s']` or the rewards function `R[s,a]` beforehand. We need to interact with the environment to create them. Most of the time we don't have either the transition or the rewards function; hence we have to interact with the real world, observe what happens, and work with that data to try to build a policy.
Experience tuple: `<s, a, s', r>`. Interacting with the environment produces a series of these tuples, where each `s'` becomes the new `s` of the next tuple.

How to use experience tuples to create a policy `Pi`:
- Model-based: build models of the transition function `T[s,a,s']` and reward function `R[s,a]`. Fill in the models by looking statistically at the observed transitions: for `T`, build a matrix that counts each instance of `[s,a,s']` found in the experience tuples; for `R`, build a matrix that counts each instance of `[s,a,r]`.
- Model-free (e.g. Q-learning): build a policy directly from the experience tuples.

Example: robot in a maze.
Optimization depends on the "horizon", ie. how many moves the robot should take:
- Infinite horizon: sum rewards over all future steps (`i = infinity`).
- Finite horizon: sum rewards up to some time `n` in the future (`i = n`).

Discounted reward: devalues rewards gained in the future, similar to the interest-rate concept. This is the method used in Q-learning. Each future reward is scaled by a `gamma` factor whose exponent grows as time `i` increases: sum for `i = 1` to infinity of `gamma^(i-1) * r_i`.

Quiz: Which gets $1M?
Definition: `Q[s,a]` is the value of taking action `a` in state `s`. It is the sum of 2 components: the immediate reward and the discounted future reward.
How to use (given a Q-table exists):
Pi(s) = argmax(Q[s,a], a)
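A minimal numpy sketch of acting from a Q-table and applying the standard tabular update (`alpha` and `gamma` are discussed just below; the values here are illustrative):

```python
import numpy as np

# Q is a 2D array indexed [state, action].
def policy(Q, s):
    return np.argmax(Q[s])  # Pi(s) = argmax over a of Q[s, a]

def update(Q, s, a, s_prime, r, alpha=0.2, gamma=0.9):
    # Blend the old value with the improved estimate:
    # immediate reward + discounted value of the best next action.
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_prime]))
```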
Two parts to the update rule: the old value, and the improved estimate.
- `alpha` is the learning rate.
- `gamma` is the discount rate. A lower gamma means we value immediate rewards; a higher gamma means future rewards are worth nearly as much as immediate rewards.

Good states to know
Discretizing converts a real number into an integer across a limited scale (ie. convert numbers from 0-25 to 0-9).
How to do it:
steps = 10                     # how many groups you want (for 0-9, steps = 10)
stepsize = len(data) // steps
data.sort()
threshold = [0.0] * steps
for i in range(0, steps):
    threshold[i] = data[min((i + 1) * stepsize, len(data) - 1)]  # clamp the last index
Dyna-Q:
- Q-learning knows neither `T` (transition matrix) nor `R` (rewards matrix).
- `T[s,a,s']`: probability that if we are in state `s` and we take action `a`, we will end up in `s'`.
- `R[s,a]`: expected reward if we are in state `s` and take action `a`.
- Dyna learns models of `T` and `R`, then hallucinates experiences to speed up learning.

How to hallucinate an experience:
1. Pick a random `s`.
2. Pick a random `a`.
3. Infer `s'` by looking at `T`.
4. Infer `r` (the immediate reward) by looking at the `R` table.
Learning `T`:
- `T[s,a,s']` represents the probability that if we are in state `s` and take action `a`, we will end up in `s'`.
- While interacting, observe each transition `[s, a, s']` and count how many times it happened. This is called the T-count, or `Tc[s, a, s']`.

How to evaluate `T`? (Need to review this again.)
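A sketch of the usual normalization (my reconstruction of the step flagged for review; assumes `Tc` is a 3D numpy count array indexed `[s, a, s']`, initialized with a small value such as 0.00001 to avoid division by zero):

```python
# Turn counts into probabilities: each T[s, a, :] sums to one.
T = Tc / Tc.sum(axis=2, keepdims=True)
```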
Learning `R`:
- `R[s,a]` is the expected reward for `(s, a)`.
- `r` is the immediate reward.

Formula:
alpha = 0.2
R'[s,a] = (1 - alpha) * R[s,a] + alpha * r

The `alpha * r` term blends in the immediate reward, our new best estimate of what the value should be.

From Reinforcement Learning, Richard Sutton (2018)
From RL Course by David Silver
Why Discount?
Dyna Architecture
Each indicator's values need to be discretized; the discretized integers are then concatenated together to create the state.
For training:
while not converged:
    X = calculate_indicators()
    query_set_state(X)
    for each day:
        reward = calc_reward()
        action = query(X, reward)
        simulate_trade(action)  # take a LONG/SHORT/CASH position
        add_action_to_dataframe()
    # How to tell if converged?
    # If the policy (Q-values) is not changing
    check_if_converged()
For testing, do the same as above but just query the state and implement the action:
X = calculate_indicators()
query_set_state(X)
for each day:
    action = query_set_state(X)
    simulate_trade(action)
    add_action_to_dataframe()
    X = new_state
Biggest trap: Overfitting to data