Takeaways:
Lagrange vs Leibniz Notation
Solution: $\frac{\Delta y}{\Delta x}$
Deriving the slope at (1, 1):
Formula:
Deriving the slope at (0.5, 0.2):
Formula:
Calculating the slope between (1,1) and (1.5, 2/3):
Deriving the slope at the limit of (1,1):
Formula:
Pattern for power functions:
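A quick sympy check of this pattern (my addition, not part of the original notes):

```python
# Quick check of the power rule pattern: d/dx x^n = n*x^(n-1)
from sympy import symbols, diff

x = symbols('x')
for n in [2, 3, 5, -1]:
    print(f"d/dx x^{n} =", diff(x**n, x))
# prints 2*x, 3*x**2, 5*x**4, -1/x**2 — each matches n*x**(n-1)
```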
Takeaways:
Using an example at (1,1):
Using an example at (2, 4):
Convergence of $e$:
Intuition using bank interest as an example:
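A small numeric sketch of that intuition (my own check): compounding $1 at 100% annual interest over $n$ periods gives $(1+\frac{1}{n})^n$, which approaches $e$ as $n$ grows.

```python
# (1 + 1/n)^n approaches e as compounding becomes more frequent
import math

for n in [1, 12, 365, 10_000, 1_000_000]:
    print(n, (1 + 1/n) ** n)
print("e =", math.e)  # 2.718281828...
```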
Proving derivative using secants:
Non-differentiable functions: Functions where you can't calculate a derivative at every point
Scalar rule:
Intuition using $y=2x^2$:
This happens for any function, not just quadratics
Given a function:
$$ f = g + h $$

its derivative is:

$$ f' = g' + h' $$

Intuition (using the example of a child running inside a moving boat):
Given a function:
$$ f = gh $$

Its derivative is:

$$ f' = g'h + gh' $$

Intuition (using the example of the square space of a house):
Given a function $h(t)$, suppose you want to apply another function $g$ to it:
$$ g(h(t)) $$

To get the derivative of this composite of functions with respect to $t$, you use:

$$ \frac{d}{dt} g(h(t)) = \frac{dg}{dh} \cdot \frac{dh}{dt} $$

Above is Leibniz notation; in Lagrange notation it is:

$$ \frac{d}{dt} g(h(t)) = g'(h(t)) \cdot h'(t) $$

It is called the chain rule because you can keep chaining functions and take the derivative using the same idea:

$$ \frac{d}{dt} f(g(h(t))) = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dt} $$

In Lagrange notation:

$$ \frac{d}{dt} f(g(h(t))) = f'(g(h(t))) \cdot g'(h(t)) \cdot h'(t) $$

Intuition of the chain rule using temperature change wrt height and time:
In plot form:
Q: What is the derivative of $f(x) = e^{2x}$?
Answer: $2e^{2x}$
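A quick sympy check of this chain-rule result (my addition):

```python
# Verify d/dx e^(2x) = 2*e^(2x)
from sympy import symbols, exp, diff

x = symbols('x')
print(diff(exp(2 * x), x))  # 2*exp(2*x)
```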
Why do we care about derivatives? Because they allow us to find the optima of a function (minima/maxima).
But, multiple minima can exist:
Goal: Find where to place the house, at distance $x$ from the origin, so that connecting to the power grid costs the least.
The cost function is quadratic, so to optimize it we set its derivative to zero, which gives:

$$ x = \frac{a+b}{2} $$

With three power lines, the cost function becomes the sum of the cost to connect to each power line:
Minimizing the cost is similar to the two power line problem:
$$ x = \frac{a+b+c}{3} $$

Generalizing the power line problem, we arrive at the squared loss:
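Quick sanity check (mine, with made-up positions) that the three-power-line optimum above really is the mean of the positions:

```python
# Minimize the squared-distance cost (x-a)^2 + (x-b)^2 + (x-c)^2;
# the optimum should land at the mean (a+b+c)/3.
from sympy import symbols, diff, solve

x = symbols('x')
a, b, c = 1, 2, 6          # hypothetical power line positions
cost = (x - a)**2 + (x - b)**2 + (x - c)**2
print(solve(diff(cost, x), x))   # [3] == (1+2+6)/3
```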
"Let's play a game. I'm going to toss a coin 10 times and we're going to look at the results. If the results are seven heads followed by three tails, then you win a lot of money. If they're not, then you don't win any money."
- The catch: You can use a biased coin
- Q: How to pick the best biased coin?
- A: Heads: 0.7, Tails: 0.3
Q: How do we find the best biased coin for this game, i.e., optimize the probability of winning?
The hard way:
Easier way using the log of $g(p)$:
Why the logarithm?
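(One reason: the log turns the product $p^7(1-p)^3$ into a sum, whose derivative is much simpler.) A sympy sketch of both routes, my own check, assuming $g(p) = p^7(1-p)^3$ is the probability of winning:

```python
# Maximize g(p) = p^7 (1-p)^3 directly and via its logarithm
from sympy import symbols, diff, solve, log

p = symbols('p', positive=True)
g = p**7 * (1 - p)**3

print(solve(diff(g, p), p))        # the hard way: critical points include p = 7/10 (plus the boundary p = 1)
print(solve(diff(log(g), p), p))   # the easier way: the log derivative gives only p = 7/10
```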
Definition: Given a function of 2 variables, the slope of the function is represented by a plane rather than a line.
Task: given $f(x,y) = x^2 + y^2$, find the partial derivatives of $f$ with respect to $x$ and $y$.
Another example: Given $f(x,y) = 3x^2y^3$, find partial derivatives of $f$ with respect to $x$.
Now we can find the partial derivative of $f$ with respect to $y$:
$$ \frac{\partial f}{\partial y} = 3(x^2)(3y^2) = 9x^2y^2 $$

The gradient of $f(x,y)$ at $(2,3)$ is:
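A quick sympy check of that gradient (my addition, assuming the function is still $f(x,y) = 3x^2y^3$):

```python
# Gradient of f(x, y) = 3*x^2*y^3 evaluated at (2, 3)
from sympy import symbols, diff

x, y = symbols('x y')
f = 3 * x**2 * y**3
grad = [diff(f, x), diff(f, y)]              # [6*x*y**3, 9*x**2*y**2]
print([g.subs({x: 2, y: 3}) for g in grad])  # [324, 324]
```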
Sauna example, but in two dimensions:
Steps: Given $f(x,y)$,
Problem: Three power lines sit at given positions. You want to find the optimal placement of a straight fiber line so that the total cost of connecting it to the three power lines is minimized.
Goal: Minimize sum of squares cost
How to calculate the function you want to minimize:
Now we can minimize this by calculating the partial derivative with respect to $m$ and $b$
$$ \frac{\partial E}{\partial m} = 28m + 12b - 42 $$

$$ \frac{\partial E}{\partial b} = 6b + 12m - 20 $$

```python
from sympy import *
x, y, z = symbols('x y z')
print("Q1")
display(diff(x**2*y+3*x**2, x))
print("\nQ2")
print("Partial derivative of x")
display(diff(x*y**2 + 2*x + 3*y, x))
print("Partial derivative of y")
display(diff(x*y**2 + 2*x + 3*y, y))
print("\nQ3")
f = x**2 + 2*y**2 + 8*y
df_dx = diff(f, x)
df_dy = diff(f, y)
print("Partial derivative of x")
display(df_dx)
print("Partial derivative of y")
display(df_dy)
print("Now find where both would equal to zero, which is (0, -2). Then plug and chug")
_x, _y = (0, -2)
print("Answer:", _x**2 + 2*_y**2 + 8*_y)
print("\nQ4")
print("Partial derivative of x")
display(diff(x**2 + 2*x*y*z + z**2, x))
print("Partial derivative of y")
display(diff(x**2 + 2*x*y*z + z**2, y))
print("Partial derivative of z")
display(diff(x**2 + 2*x*y*z + z**2, z))
```

```python
from sympy.solvers import solve
f = diff(E**x - log(x), x)
display(f)
display(solve(x**2 - 1, x))
```
Idea: new point = old point - slope
But moving by the full amount of the slope can result in a big step, which can overshoot the minimum. Instead, we can multiply the slope by a "learning rate" to decrease the step size.
Learning rate: ensures that the steps we take are small enough for the algorithm to converge to the minimum.
You can apply this learning rate to the slope and subtract it from the current position. This is gradient descent:
$$ x_1 = x_0 - \alpha f'(x_0) $$

Implementation:
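A minimal sketch of that update rule in plain Python (my own implementation, not the course's notebook), using the $f(x) = e^x - \log(x)$ example that follows:

```python
# Gradient descent on f(x) = e^x - log(x), whose derivative is f'(x) = e^x - 1/x
import math

def f_prime(x):
    return math.exp(x) - 1 / x

x = 0.05          # starting point
alpha = 0.005     # learning rate (small, so the steps don't overshoot)
for _ in range(1000):
    x = x - alpha * f_prime(x)   # x_{k+1} = x_k - alpha * f'(x_k)

print(x)  # converges near 0.567, where e^x = 1/x
```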
In action using $f(x) = e^x - \log(x)$:
Significance: "Notice something interesting, which is that in this algorithm, you never need it to solve $e^x - \frac{1}{x} = 0$. You never need to solve for the derivative zero. You only need to know the derivative and then apply it in the algorithm when you're taking the updating step."
Takeaway: There is no rule for picking the best learning rate.
Drawbacks of gradient descent:
Using direction of greatest descent:
Example using the sauna problem, starting at x=0.5, y=0.6:
Steps:
Implementation:
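I don't have the sauna function written out here, so below is a generic two-variable sketch (my own) using $f(x,y) = x^2 + y^2$ as a stand-in; the sauna example would just swap in its own partial derivatives.

```python
# Two-variable gradient descent sketch on f(x, y) = x^2 + y^2
def grad(x, y):
    return 2 * x, 2 * y      # (df/dx, df/dy)

x, y = 0.5, 0.6              # starting point from the example above
alpha = 0.1                  # learning rate
for _ in range(100):
    dx, dy = grad(x, y)
    x, y = x - alpha * dx, y - alpha * dy

print(x, y)  # approaches the minimum at (0, 0)
```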
Takeaway: Start at different places, and choose the best local minimum.
Linear regression with more points:
TV ad budget vs number of sales problem:
Takeaways:
Steps:
Harder Qs:
Prediction Function:
For regression, we use the Mean Squared Error:
$$ L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2 $$

Goal: Find $w_n$, $b$ that give $\hat{y}$ with the least error.
To find optimal values for $w_n$, $b$, you use gradient descent:
Use the chain rule to find the partial derivatives. The example uses two features, with weights $w_1$ and $w_2$:
Now we need to work out each partial derivative in the chain from the original functions:
(Note to self: the derivative of $\frac{1}{2}(y-\hat{y})^2$ with respect to $\hat{y}$ is $-(y-\hat{y})$ because the chain rule brings the exponent 2 down, cancelling the $\frac{1}{2}$, and multiplies by the derivative of the inner term, $\frac{\partial}{\partial \hat{y}}(y-\hat{y}) = -1$.)
Then we can plug them back into the chain rule:
Finally, we plug the derivatives into the update rule:
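A sketch of what that update looks like for a single training example with two features (my own code; variable names are illustrative):

```python
# One gradient descent step for MSE loss L = 1/2 (y - y_hat)^2
# with prediction y_hat = w1*x1 + w2*x2 + b
def mse_step(w1, w2, b, x1, x2, y, alpha=0.01):
    y_hat = w1 * x1 + w2 * x2 + b
    error = y_hat - y              # dL/dy_hat = -(y - y_hat) = (y_hat - y)
    w1 -= alpha * error * x1       # dL/dw1 = (y_hat - y) * x1
    w2 -= alpha * error * x2       # dL/dw2 = (y_hat - y) * x2
    b  -= alpha * error            # dL/db  = (y_hat - y)
    return w1, w2, b

print(mse_step(0.0, 0.0, 0.0, x1=1.0, x2=2.0, y=3.0))
```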
Activation function:
Sigmoid function:
$$ \sigma(z) = \frac{1}{1+e^{-z}} $$

Takeaway:
Sigmoid function can be represented as:
$$ \sigma(z)=(1+e^{-z})^{-1} = \frac{1}{1+e^{-z}} $$

and the Sigmoid function's derivative is:
$$ \frac{d}{dz} \sigma(z) = \sigma(z) (1-\sigma(z)) $$

Getting there is complicated:
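One way to get there (a sketch of the algebra, written out since the notes only state the result):

$$ \begin{align} \frac{d}{dz}\sigma(z) &= \frac{d}{dz}(1+e^{-z})^{-1} \\ &= -(1+e^{-z})^{-2} \cdot (-e^{-z}) \\ &= \frac{e^{-z}}{(1+e^{-z})^2} \\ &= \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} \\ &= \sigma(z)\,(1-\sigma(z)) \end{align} $$

where the last step uses $1-\sigma(z) = \frac{e^{-z}}{1+e^{-z}}$.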
With classification, log loss is the loss function to calculate the error.
Loss function for log loss uses natural logarithm:
$$ L(y, \hat{y}) = -y \ln(\hat{y}) - (1-y) \ln(1-\hat{y}) $$

Takeaway:
Classification goal: Find optimal values for $w_n$, $b$ to minimize the log loss
Once we know the log loss after forward propagation, we can take a gradient descent step to find the best weights and bias.
Focusing on one feature $w_1$, the idea is to reduce the loss by adjusting $w_1$'s weight. This requires applying the chain rule to get from the derivative of the loss function to the derivative of $w_1$.
Basically we are trying to figure out:
The nitty gritty:
Deriving the loss wrt $\hat{y}$ ($\frac{\partial L}{\partial \hat{y}}$):
Deriving $\hat{y}$ wrt each weight/bias:
Now multiply partial derivatives together for each:
Notice that all of the sigmoid derivatives ($\hat{y}(1-\hat{y})$) cancel out:
And we end up with simple derivatives:
Update step for weights and bias:
Repeat this until log loss is optimally minimized
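A compact sketch of those update steps for one training example (my own code, assuming a single sigmoid neuron with two features, as in the derivation above):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# One gradient descent step for a single sigmoid neuron with log loss.
# After the chain rule, dL/dw_i = (y_hat - y) * x_i and dL/db = (y_hat - y).
def log_loss_step(w1, w2, b, x1, x2, y, alpha=0.1):
    y_hat = sigmoid(w1 * x1 + w2 * x2 + b)   # forward propagation
    error = y_hat - y
    w1 -= alpha * error * x1
    w2 -= alpha * error * x2
    b  -= alpha * error
    return w1, w2, b

print(log_loss_step(0.1, -0.2, 0.0, x1=1.0, x2=2.0, y=1))
```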
A 2,2,1 neural network example:
Partial derivatives tell us exactly in what direction to move each one of the weights and biases in order to reduce the loss:
The chain rule wrt weight $w_{11}$:
Calculate each derivative and update step:
The chain rule wrt bias $b_1$:
Calculate each derivative and update step:
Summary of updating first layer of a 2-2-1 network:
Chain rule for second layer:
Summary of updating second layer of a 2-2-1 network:
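Putting the two layers together, a minimal numpy sketch of one forward/backward pass for the 2-2-1 sigmoid network (my own code; the weight names and shapes are my assumptions, not necessarily the course's notation):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One forward/backward pass for a 2-2-1 sigmoid network with log loss.
# Shapes: x (2,), W1 (2,2), b1 (2,), W2 (2,), b2 scalar.
def backprop_step(x, y, W1, b1, W2, b2, alpha=0.1):
    # forward propagation
    a1 = sigmoid(W1 @ x + b1)            # hidden layer activations
    y_hat = sigmoid(W2 @ a1 + b2)        # output prediction

    # backward propagation (chain rule)
    dz2 = y_hat - y                      # dL/dz2 after the sigmoid/log-loss cancellation
    dW2, db2 = dz2 * a1, dz2
    dz1 = (dz2 * W2) * a1 * (1 - a1)     # error propagated to the hidden layer
    dW1, db1 = np.outer(dz1, x), dz1

    # gradient descent updates
    return (W1 - alpha * dW1, b1 - alpha * db1,
            W2 - alpha * dW2, b2 - alpha * db2)

# example usage with random weights
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=2), 0.0
print(backprop_step(np.array([1.0, 2.0]), 1, W1, b1, W2, b2))
```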
Using the following example (3 layer NN):
Just like before, we get derivatives for each weight/bias using the chain rule:
Significance: Newton's method is an alternative to gradient descent.
Formula to iterate on to find the next step (generalized to $x_k$):
$$ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} $$

Problem: Note that Newton's method only finds the zero of a function, not the minimum. How do we use it for optimization?
Answer: Apply Newton's method to the derivative $f'$ instead of $f$; the minimum of $f$ is a zero of $f'$, so the update becomes $x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$.
Summary of difference between Newton's method and using it for optimization:
Using the example:
Let's find the first and second derivatives of $g(x)$:
Now we can plug in x ($x_0=0.05$) and the first/second derivatives:
$$ \begin{align} x_1 &= x_0 - \frac{g'(x_0)}{g''(x_0)} \\ &= 0.05 - \frac{e^{0.05}-\frac{1}{0.05}}{e^{0.05}+\frac{1}{0.05^2}} \\ &= 0.097 \end{align} $$

...and iterate $k$ times until $x_k$ converges to the minimum (i.e., until $g'(x_k)$ is approximately zero).
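A quick sketch of this iteration in code (mine), reproducing the first step above and then continuing toward convergence:

```python
import math

# Newton's method for minimization applied to g(x) = e^x - log(x):
# x_{k+1} = x_k - g'(x_k) / g''(x_k)
def g_prime(x):
    return math.exp(x) - 1 / x

def g_double_prime(x):
    return math.exp(x) + 1 / x**2

x = 0.05
for k in range(6):
    x = x - g_prime(x) / g_double_prime(x)
    print(k + 1, round(x, 4))   # first step gives ~0.097, converging toward ~0.567
```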
Notation in Leibniz and Lagrange:
Using a car's distance, velocity and acceleration:
Key points:
Hessian: Matrix of second derivatives
Example function: $f(x,y) = 2x^2+3y^2-xy$:
What do these second derivatives mean?
Another example, to get the Hessian matrix of $f(x,y)=x^2+y^2$:
Hessian matrix: Matrix of the second derivatives
Notation:
Applied to the $f(x,y) = 2x^2+3y^2-xy$ problem:
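A sympy check of that Hessian (my addition):

```python
# Hessian of f(x, y) = 2x^2 + 3y^2 - xy
from sympy import symbols, hessian

x, y = symbols('x y')
f = 2 * x**2 + 3 * y**2 - x * y
print(hessian(f, (x, y)))   # Matrix([[4, -1], [-1, 6]])
```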
General formula:
Takeaway:
Formula:
$$ \begin{bmatrix}x_{k+1} \\ y_{k+1}\end{bmatrix}=\begin{bmatrix}x_k \\ y_k\end{bmatrix}-H^{-1}(x_k,y_k)\,\nabla f(x_k,y_k) $$

Newton's method applied to a two-variable case:
Goal: Find the minimum of a two-variable function: $$ f(x,y) = x^4 + 0.8y^4 + 4x^2 + 2y^2 - xy - 0.2x^2y $$
Steps:
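A minimal sketch of those steps in Python (my own code, using sympy for the gradient and Hessian of the $f$ above; the starting point is arbitrary):

```python
# Newton's method in two variables for
# f(x, y) = x^4 + 0.8y^4 + 4x^2 + 2y^2 - xy - 0.2x^2*y
from sympy import symbols, Matrix, hessian

x, y = symbols('x y')
f = x**4 + 0.8 * y**4 + 4 * x**2 + 2 * y**2 - x * y - 0.2 * x**2 * y

grad = Matrix([f.diff(x), f.diff(y)])
H = hessian(f, (x, y))

point = Matrix([4, 4])                    # arbitrary starting point
for _ in range(10):
    subs = {x: point[0], y: point[1]}
    point = point - H.subs(subs).inv() * grad.subs(subs)

print(point)   # approaches the minimum near the origin
```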
Pros:
Cons: