Derivatives

Motivation

  • Problem: Measuring velocity of car at given time
  • Solution: Estimate average velocity given 2 points (secant), average = rise / run
    • Secant: A line that intersects a curve at a minimum of two distinct points

Derivatives and Tangents

  • Solution to above problem: Move the two points so close together that you can't tell the distance between them (i.e. take the limit as the run goes to zero).
    • Derivative: Slope of the tangent line at a specific point of a function, e.g. $dx/dt$
    • Instantaneous rate of change: Slope of the tangent line

Slopes, maxima and minima

Zero Slope

Maxima / Minima

Takeaways:

  • You can find the maxima / minima of a function by finding where the slope is zero (i.e. the tangent line is horizontal)

Derivatives and Notation

Lagrange vs Leibniz Notation

Some Common Derivatives

Constant (horizontal line)

Line (diagonal straight line)

Solution: $\frac{\Delta y}{\Delta x}$

Quadratics

Deriving the slope at (1, 1):

Formula:

Higher-degree polynomials

Cubed

Deriving the slope at (0.5, 0.2):

Formula:

Other Power Functions

Calculating the slope between (1,1) and (1.5, 2/3):

Deriving the slope at the limit of (1,1):

Formula:

Summary

Pattern for power functions:

  1. The exponent becomes a multiplicative factor in the derivative
  2. Subtract one from the exponent to get the new exponent (see the sympy check below)
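
A quick sympy sanity check of this power-rule pattern, $\frac{d}{dx}x^n = nx^{n-1}$ (sympy is the same tool used in the quiz code further down):

In [ ]:
from sympy import symbols, diff, Rational

x = symbols('x')

# d/dx x^n = n * x^(n-1), checked for a few sample exponents
for n in [2, 3, -1, Rational(1, 2)]:
    print(n, diff(x**n, x))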

Common Derivatives

Inverse Function

What's an inverse?

  • Undoes the original function $f(x)$ by applying $g(x)$.
  • For $x$ becoming $x^2$, $g(x)$ applies a square root to $x^2$
  • Notation on the right: $f^{-1}$ (notice the $-1$ superscript on $f$)

Derivative of the Inverse

Takeaways:

  1. $f(x)$ and $g(x)$ mirror each other
  2. Their slopes, $\frac{\Delta f}{\Delta x}$ and $\frac{\Delta g}{\Delta y}$ essentially mirror each other
  3. Therefore, the derivative of $g$ ($g'(y)$) is the reciprocal of the derivative of $f$ ($f'(x)$): $$ g'(y) = \frac{1}{f'(x)} $$

Using an example at (1,1):

Using an example (2, 4):
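
A minimal sympy check of this relationship for $f(x)=x^2$ and its inverse $g(y)=\sqrt{y}$ at the point $(2, 4)$:

In [ ]:
from sympy import symbols, diff, sqrt

x, y = symbols('x y', positive=True)

f = x**2        # original function
g = sqrt(y)     # its inverse

f_prime = diff(f, x)    # 2*x
g_prime = diff(g, y)    # 1/(2*sqrt(y))

# At the point (2, 4): g'(4) should equal 1 / f'(2)
print(g_prime.subs(y, 4))        # 1/4
print(1 / f_prime.subs(x, 2))    # 1/4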

Derivative of Trigonometric Functions

Sine

Cosine

Meaning of the Exponential $(e)$

$$ e = 2.71828182 $$

Convergence of $e$:

  • Another way to define $e$ is as the limit of $(1 + \frac{1}{n})^n$
  • As $n$ goes to infinity, the expression approaches $e$ (see the numeric check below)
  • The derivative of $e^x$ is itself ($e^x$)
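
A small numerical sketch of that convergence:

In [ ]:
import math

# (1 + 1/n)^n approaches e = 2.71828... as n grows
for n in [1, 10, 100, 10_000, 1_000_000]:
    print(n, (1 + 1/n) ** n)

print("e =", math.e)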

Intuition using bank interest as an example:

Derivative of $e^x$ is $e^x$

Proving derivative using secants:

Derivative of $log(x)$

Properties of logarithm

  • The logarithm of $x$ ($\log(x)$, natural log here) is the inverse function of $e^x$
  • The derivative of $f(x) = e^x$ at $(2, 7.39)$ is $e^2$
  • The derivative of its inverse at $(7.39, 2)$, i.e. $\log(y)$, is $\frac{1}{e^2}$
  • In general, the derivative of $\log(y)$ is $\frac{1}{y}$

Existence of the Derivative

Non-differentiable functions: Functions where you can't calculate a derivative at every point

  • For a function to be differentiable, the derivative has to exist for every point in the interval
  • Non-differentiable functions have at least one point where no unique tangent exists (e.g. a corner, where many different lines could pass for a tangent).
  • Examples: absolute value ($|x|$), piecewise functions with jumps or corners, functions with a vertical tangent ($f(x)=x^\frac{1}{3}$ at $x=0$)

Properties of Derivative

Multiplication by a Scalar

Scalar rule:

  • Given $f = cg$ where $c$ is a constant: $$ f' = cg' $$

Intuition using $y=2x^2$:

  • "If the function multiplies by 2, all these slopes get multiplied by 2. Now, if you take the limit, if you let the point on the right move towards the point in the left, you still have that all the slopes multiply by 2."
  • "If we take a function and multiply it by 2, the derivative gets multiplied by 2. This happens for every value."

This happens for any function, not just quadratics

The Sum Rule

Given a function:

$$ f = g + h $$

its derivative is:

$$ f' = g' + h' $$

Intuition (using example of child running inside a moving boat):

  • Velocity of the moving boat and running child sum together:
  • The slope of the two, then, is going to be the sum of the two slopes:

The Product Rule

Given a function:

$$ f = gh $$

Its derivative is:

$$ f' = g'h + gh' $$
  • Mnemonic: "Hammer" one of $g$ and $h$ at a time (i.e. take its derivative) while keeping the other, then sum the two terms to get the derivative of $f$ (see the sympy check below)
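
A quick sympy sanity check of the product rule, using $g(x)=x^2$ and $h(x)=\sin(x)$ as an arbitrary example:

In [ ]:
from sympy import symbols, sin, diff, simplify

x = symbols('x')
g = x**2
h = sin(x)

lhs = diff(g * h, x)                     # derivative of the product
rhs = diff(g, x) * h + g * diff(h, x)    # g'h + gh'

print(simplify(lhs - rhs))               # 0, so both sides agree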

Intuition (using example of square space of a house):

The Chain Rule

Given a function $h(t)$, suppose you want to apply another function $g$ to it:

$$ g(h(t)) $$

To get the derivative of this composite of functions with respect to $t$, you use:

  1. $\frac{dh}{dt}$, the derivative of $h$ with respect to $t$
  2. $\frac{dg}{dh}$, the derivative of $g$ with respect to $h$
  3. The derivative of the composition is simply the product of these two.
$$ \frac{d}{dt} g(h(t)) = \frac{dg}{dh} \cdot \frac{dh}{dt} $$

Above is Leibniz notation; in Lagrange notation it is:

$$ \frac{d}{dt} g(h(t)) = g'(h(t)) \cdot h'(t) $$

It is called the chain rule because you can keep chaining functions and take its derivative using the same idea:

$$ \frac{d}{dt} f(g(h(t))) = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dt} $$

in Lagrange notation:

$$ \frac{d}{dt} f(g(h(t))) = f'(g(h(t))) \cdot g'(h(t)) \cdot h'(t) $$

Intuition of chain rule using temperature change wrt height and time:

  • Given temp change wrt height and height change wrt time, you can figure out temperature change wrt time:

In plot form:

Chain Rule Quiz

Q: What is the derivative of $f(x) = e^{2x}$?

Answer: $2e^{2x}$

  • By the chain rule: $f(x) = g(2x)$ where $g(x) = e^x$, and the derivative of $2x$ is 2, so $f'(x) = 2e^{2x}$ (checked with sympy below)
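
Checking the quiz answer with sympy:

In [ ]:
from sympy import symbols, exp, diff

x = symbols('x')
print(diff(exp(2*x), x))    # 2*exp(2*x)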

Quiz 1

Optimization

Why do we care about derivatives? Because it allows us to find the optima of a function (minima/maxima)

But, multiple minima can exist:

Two Power Line Problem

Goal: Find where to put the home, i.e. its distance $x$ from the origin, so that connecting to the power grid has the lowest cost.

  • Formalized: Find distance $x$ to minimize the cost function
  • Total cost of connecting to both power lines is: $$ (x-a)^2 + (x-b)^2 $$

The cost function is quadratic, so to optimize it we set its derivative to zero, which gives:

$$ x = \frac{a+b}{2} $$

Three Power Line Problem

With three power lines, the cost function becomes the sum of the cost of connecting to each power line:

  • This is the total area of the three squares, ie cost.
$$ (x-a)^2 + (x-b)^2 + (x-c)^2 $$

Minimizing the cost is similar to the two power line problem:

$$ x = \frac{a+b+c}{3} $$
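
A sympy sketch confirming both results, using symbolic positions $a$, $b$, $c$:

In [ ]:
from sympy import symbols, diff, solve

x, a, b, c = symbols('x a b c')

# two power lines
cost2 = (x - a)**2 + (x - b)**2
print(solve(diff(cost2, x), x))    # [a/2 + b/2]

# three power lines
cost3 = (x - a)**2 + (x - b)**2 + (x - c)**2
print(solve(diff(cost3, x), x))    # [a/3 + b/3 + c/3]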

The Square Loss

Generalizing the power line problem, we come to the square loss:

The Log-loss

Example: Coin Toss

"Let's play a game. I'm going to toss a coin 10 times and we're going to look at the results. If the results are seven heads followed by three tails, then you win a lot of money. If they're not, then you don't win any money."

  • The catch: You can use a biased coin
  • Q: How to pick the best biased coin?
  • A: Heads: 0.7, Tail: 0.3

Q: How to find the best biased coin for this game, optimize the probability of winning?

  • Chances of winning (given heads is $p$ and tails is $1-p$): $$ p^7(1-p)^3=g(p) $$
  • Goal is to maximize $g(p)$. Here's how:

The hard way:

Easier way using the log of $g(p)$:

  • As a matter of fact, what we really want is the negative of the logarithm of $g(p)$. That is called the log loss, and it's a very useful loss function in machine learning.
  • The reason we take $-\log g(p)$ instead of $\log g(p)$ is that the logarithm of $p$ is a negative number when $p$ is between zero and one.
  • We want the log loss to be a positive number, and instead of maximizing $\log g(p)$ we minimize $-\log g(p)$ (see the sympy sketch below).
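
A sympy sketch of both routes: maximizing $g(p)$ directly, and maximizing $\log g(p)$ instead, give the same $p$:

In [ ]:
from sympy import symbols, log, diff, solve

p = symbols('p', positive=True)
g = p**7 * (1 - p)**3

# hard way: set g'(p) = 0 directly (critical points include p = 7/10)
print(solve(diff(g, p), p))

# easier way: maximize log g(p) = 7*log(p) + 3*log(1 - p)
log_g = 7*log(p) + 3*log(1 - p)
print(solve(diff(log_g, p), p))    # [7/10]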

Relationship with ML

Why the logarithm?

  1. Derivative of products is hard, derivative of sums is easy
  2. Product of lots of tiny things is tiny
    • Imagine if you have a probability that is the product of 1,000 probabilities. That's the product of 1,000 numbers, between 0 and 1. That number could be really small and maybe a computer is not able to handle a number so small.
    • On the other hand, if it's the logarithm, the logarithm of a very small number is a very large negative number and computers can handle those numbers.

Quiz 2

Gradients and Gradient Descent

Introduction to Tangent Planes

Definition: Given a function of 2 variables, the slope at a point is represented by a tangent plane rather than a tangent line.

  • This plane is found by taking the tangent line in each variable's direction; the tangent plane contains both tangent lines.

Partial Derivatives

Task is, given $f(x,y) = x^2 + y^2$, find partial derivatives of $f$ with respect to $x$ and $y$

  1. Treat one variable as a constant, and it becomes a one variable function. We can take a derivative of the existing variable. Note that the derivative of a constant is zero: $$ \frac{\partial f}{\partial x} = 2x + 0 $$
  2. Apply the same to the other variable ($y$): $$ \frac{\partial f}{\partial y} = 0 + 2y $$

Another example: Given $f(x,y) = 3x^2y^3$, find partial derivatives of $f$ with respect to $x$.

  1. Treat y as constant, ie block it off and ignore
  2. Treat 3 as a constant coefficient, ie. leave alone
  3. Derive $x^2$, which is $2x$
$$ \frac{\partial f}{\partial x} = 3(2x) y^3 = 6xy^3 $$

Now we can find the partial derivative of $f$ with respect to $y$:

$$ \frac{\partial f}{\partial y} = 3(x^2)(3y^2) = 9x^2y^2 $$
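
Verifying both partial derivatives with sympy:

In [ ]:
from sympy import symbols, diff

x, y = symbols('x y')
f = 3 * x**2 * y**3

print(diff(f, x))    # 6*x*y**3
print(diff(f, y))    # 9*x**2*y**2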

Gradients

  • Top right is a vector containing partial derivatives with respect to $x$ and $y$.
  • Bottom right, nabla f ($\nabla f$) is the collection of all the partial derivatives with respect to all the variables in the function.

The gradient of $f(x,y)$ at (2,3) is:

Gradients and maxima/minima

  • The gradient is pretty useful to minimize functions of two or more variables in the same way as the derivative was useful to minimize functions of one variable.
  • The way to find the minima is to set the derivative equal to zero and solve
  • Basically, to find the minimums and maximums, all you have to do is set all the partial derivatives to 0 and solve that system of equations.
  • You can apply this to many more variables, ie a function of 12 variables needs 12 slopes to be 0.

Optimization with gradients: An example

Sauna example, but in two dimensions:

Steps: Given $f(x,y)$,

  1. Find partial derivatives with respect to x and y
  2. For each, set the equation equal to zero
  3. For each variable, find what constant will result in zero (see green arrows)
  4. Choose candidate points and check if local maxima/minima

Optimization using gradients - Analytical method

Problem: There are three positions of the power lines. You want to find the optimal place for a fiber line connection that goes in a straight line in such a way that you reduce the total cost of connecting to the three power lines.

  • You can connect to each power line by connecting a wire to touch the fiber line, but the wire needs to be parallel to the y-axis.
  • Cost for connection is the square of the length of the wires
  • Function is $y=mx+b$

Goal: Minimize sum of squares cost

How to calculate the function you want to minimize:

  1. Find the sum of squares cost for each point
    • $(1,2) = (m+b-2)^2$
    • $(2,5) = (2m+b-5)^2$
    • $(3,3) = (3m+b-3)^2$
  2. Total cost is their sum: $$ (m+b-2)^2 + (2m+b-2)^2 + (3m+b-3)^2 $$
  3. Expand, and join similar terms. This is now the cost function: $$ E(m,b) = 14m^2 + 3b^2 + 38 + 12mb - 42m - 20b $$

Now we can minimize this by calculating the partial derivative with respect to $m$ and $b$

$$ \frac{\partial E}{\partial m} = 28m + 12b - 42 $$

$$ \frac{\partial E}{\partial b} = 6b + 12m - 20 $$
  1. Set both to zero and find zero point of a corresponding variable
    1. Multiply partial b equation by 2, then subtract it from partial m equation, resulting in $4m-2=0$
    2. $4m-2=0$ is equivalent to $m=2/4$ or $0.5$
    3. Now plug and chug $m=0.5$ into the second equation, $6b + 12(0.5) -20 = 0$, resulting in $b=\frac{7}{3}$

  2. The minimized cost is attained at $m=\frac{1}{2}$ and $b=\frac{7}{3}$ (i.e. $y=\frac{1}{2}x+\frac{7}{3}$), and equals approximately 4.167 (verified with sympy below)
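
A sympy sketch solving the same system and evaluating the cost:

In [ ]:
from sympy import symbols, diff, solve

m, b = symbols('m b')
E = (m + b - 2)**2 + (2*m + b - 5)**2 + (3*m + b - 3)**2

sol = solve([diff(E, m), diff(E, b)], [m, b])
print(sol)                     # {m: 1/2, b: 7/3}
print(E.subs(sol).evalf())     # about 4.17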

Quiz 3: Partial Derivatives and Gradient

Code for Quiz 3 answers

In [29]:
from sympy import *

x, y, z = symbols('x y z')

print("Q1")
display(diff(x**2*y+3*x**2, x))

print("\nQ2")
print("Partial derivative of x")
display(diff(x*y**2 + 2*x + 3*y, x))
print("Partial derivative of y")
display(diff(x*y**2 + 2*x + 3*y, y))


print("\nQ3")
f = x**2 + 2*y**2 + 8*y
df_dx = diff(f, x)
df_dy = diff(f, y)
print("Partial derivative of x")
display(df_dx)
print("Partial derivative of y")
display(df_dy)
print("Now find where both would equal to zero, which is (0, -2). Then plug and chug")
_x, _y = (0, -2)
print("Answer:", _x**2 + 2*_y**2 + 8*_y)

print("\nQ4")
print("Partial derivative of x")
display(diff(x**2 + 2*x*y*z + z**2, x))
print("Partial derivative of y")
display(diff(x**2 + 2*x*y*z + z**2, y))
print("Partial derivative of z")
display(diff(x**2 + 2*x*y*z + z**2, z))
Q1
$\displaystyle 2 x y + 6 x$
Q2
Partial derivative of x
$\displaystyle y^{2} + 2$
Partial derivative of y
$\displaystyle 2 x y + 3$
Q3
Partial derivative of x
$\displaystyle 2 x$
Partial derivative of y
$\displaystyle 4 y + 8$
Now find where both would equal to zero, which is (0, -2). Then plug and chug
Answer: -8

Q4
Partial derivative of x
$\displaystyle 2 x + 2 y z$
Partial derivative of y
$\displaystyle 2 x z$
Partial derivative of z
$\displaystyle 2 x y + 2 z$

Optimization using Gradient Descent in One Variable

Hard-to-optimize functions

For some functions it is hard to find where the derivative equals zero analytically. Instead, we can use gradient descent to inch towards the optimum.

Derivative in sympy:

In [42]:
from sympy.solvers import solve

# derivative of f(x) = e^x - log(x); solving f'(x) = 0 by hand is not straightforward
f = diff(E**x - log(x), x)

display(f)

# solve handles simpler equations easily, e.g. x^2 - 1 = 0
display(solve(x**2 - 1, x))
$\displaystyle e^{x} - \frac{1}{x}$
[-1, 1]

Using derivative to decide direction

Idea: new point = old point - slope

But moving by the full amount of the slope can be a big step, which can overshoot the minimum. Instead, we can apply a "learning rate" to the slope to decrease the step size.

Learning rate: a factor that ensures the steps we perform are small enough for the algorithm to converge to the minimum.

You can apply this learning rate to the slope and subtract it from the current position. This is gradient descent:

$$ x_1 = x_0 - \alpha f'(x_0) $$

Gradient Descent

Implementation:
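
A minimal sketch in plain Python for $f(x) = e^x - \log(x)$; the starting point and learning rate are assumptions:

In [ ]:
import math

def f_prime(x):
    # derivative of f(x) = e^x - log(x)
    return math.exp(x) - 1 / x

x = 0.05        # assumed starting point
alpha = 0.005   # assumed learning rate

for _ in range(1000):
    x = x - alpha * f_prime(x)    # update step: x_new = x_old - alpha * f'(x_old)

print(x)    # approaches the minimum near 0.567, where e^x = 1/x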

In action using $f(x) = e^x - log(x)$:

Significance: "Notice something interesting, which is that in this algorithm, you never need it to solve $e^x - \frac{1}{x} = 0$. You never need to solve for the derivative zero. You only need to know the derivative and then apply it in the algorithm when you're taking the updating step."

What's a good learning rate?

Takeaway: There is no rule for picking the best learning rate.

Drawbacks of gradient descent:

  • local minima

Optimization using Gradient Descent in Two Variables

Using direction of greatest descent:

  • new position = $(x_0, y_0) - \alpha \nabla f(x_0, y_0)$

Example using the sauna problem, starting at x=0.5, y=0.6:

Steps:

  1. Find the partial derivative wrt both x and y and put them together in a vector (middle of first slide)
  2. Plug x=0.5 and y=0.6 into this vector to get the gradient (bottom of first slide)
  3. Move against the gradient, scaled by the learning rate (i.e. 0.05): $(x_0, y_0) - 0.05\nabla f(0.5, 0.6)$
  4. Repeat until you reach the minima

Implementation:
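
The sauna function itself isn't written out in these notes, so as a stand-in here is the same idea applied to the power-line cost $E(m,b)$ from earlier (starting point and learning rate are assumptions):

In [ ]:
def grad_E(m, b):
    # partial derivatives of E(m, b) = 14m^2 + 3b^2 + 38 + 12mb - 42m - 20b
    return 28*m + 12*b - 42, 12*m + 6*b - 20

m, b = 0.0, 0.0    # assumed starting point
alpha = 0.01       # assumed learning rate

for _ in range(5000):
    dm, db = grad_E(m, b)
    m, b = m - alpha * dm, b - alpha * db

print(m, b)    # approaches m = 0.5, b = 7/3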

Gradient descent with local minima

Takeaway: Start at different places, and choose the best local minima.

Optimization using Gradient Descent - Least squares with multiple observations

Linear Regression: Gradient Descent

Linear regression with more points:

TV ad budget vs number of sales problem:

  • Goal: predict sales wrt TV budget
  • Tool: Linear regression
  • Data: Multiple observations of budget & sales

Takeaways:

  • height is cost of error (right)
  • adjust $m$ and $b$ to minimize the cost function

Steps:

  1. Choose a random starting point and calculate its error (i.e. height)
  2. Get the loss of all the points $(x_n, y_n)$ and calculate their average (actually scaled by $\frac{1}{2n}$ for mathematical convenience)
  3. Apply gradient descent (see the sketch below)
    • Given $m_0$ and $b_0$, subtract the gradient of the loss scaled by the learning rate: $\alpha \nabla L(m_0, b_0)$
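
A numpy sketch of these steps; the tiny dataset, learning rate, and iteration count are made up for illustration:

In [ ]:
import numpy as np

# tiny made-up dataset: x = TV budget, y = sales
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.5, 6.0, 8.5])

m, b = 0.0, 0.0    # starting point
alpha = 0.01       # assumed learning rate
n = len(x)

for _ in range(10_000):
    y_hat = m * x + b
    # gradients of L(m, b) = (1 / 2n) * sum((y_hat - y)^2)
    dL_dm = np.sum((y_hat - y) * x) / n
    dL_db = np.sum(y_hat - y) / n
    m, b = m - alpha * dL_dm, b - alpha * dL_db

print(m, b)    # fitted slope and intercept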

Quiz 4: Partial Derivatives and Gradient Descent

Harder Qs:

Optimization in Neural Networks and Newton's Method

Regression with a Perceptron

Perceptron

Prediction Function:

  • $w$: weights
  • $x$: features
  • $b$: bias
$$ \hat{y} = \sum^N_{i=1} w_i x_i + b $$

Loss Function

For regression, we use the Mean Squared Error:

$$ L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2 $$

Goal: Find $w_n$, $b$ that give $\hat{y}$ with the least error.

Gradient Descent

To find optimal values for $w_n$, $b$, you use gradient descent:

  1. Set initial values for $w_n$ and $b$
  2. Find the partial derivatives of the loss with respect to each
  3. Update each parameter by subtracting its partial derivative scaled by the learning rate

Chain Rule

Use the chain rule to find the partial derivatives. The example uses two features, with weights $w_1$ and $w_2$:

Now we need to solve the partial derivatives of each wrt the original functions:

(Note to self: the derivative of $\frac{1}{2}(y-\hat{y})^2$ wrt $\hat{y}$ is $-(y-\hat{y})$ because of the chain rule: $\frac{1}{2} \cdot 2(y-\hat{y}) \cdot \frac{\partial}{\partial \hat{y}}(y-\hat{y}) = (y-\hat{y}) \cdot (-1)$)

Then we can plug them back into the chain rule:

Finally, we plug the derivatives into the update rule:

Classification with a Perceptron

Activation function:

  • Takes a continuous number and compresses it to a value between 0 and 1.

Sigmoid function:

$$ \sigma (z) = \frac{1}{1+e^{-z}} $$

Derivative of a Sigmoid Function

Takeaway:

Sigmoid function can be represented as:

$$ \sigma(z)=(1+e^{-z})^{-1} = \frac{1}{1+e^{-z}} $$

and the Sigmoid function's derivative is:

$$ \frac{d}{dz} \sigma(z) = \sigma(z) (1-\sigma(z)) $$

Getting there is complicated:
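
A condensed version of the derivation (standard chain rule applied to $(1+e^{-z})^{-1}$):

$$ \begin{align} \frac{d}{dz}\sigma(z) &= \frac{d}{dz}(1+e^{-z})^{-1} \\ &= -(1+e^{-z})^{-2}\cdot(-e^{-z}) \\ &= \frac{e^{-z}}{(1+e^{-z})^2} \\ &= \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} \\ &= \sigma(z)\big(1-\sigma(z)\big) \end{align} $$

since $\frac{e^{-z}}{1+e^{-z}} = \frac{(1+e^{-z})-1}{1+e^{-z}} = 1-\sigma(z)$.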

Gradient Descent with a Perceptron

With classification, log loss is the loss function to calculate the error.

Loss function for log loss uses natural logarithm:

$$ L(y, \hat{y}) = -y \ln(\hat{y}) - (1-y) \ln(1-\hat{y}) $$

Takeaway:

  • $L(y,\hat{y})$ is large if $y$ and $\hat{y}$ are far away from each other and small if they are close.
  • If y is $0$
    • log loss is small when y-hat $\mapsto$ 0
    • log loss is large when y-hat $\mapsto$ 1
  • If y is $1$
    • log loss is small when y-hat $\mapsto$ 1
    • log loss is large when y-hat $\mapsto$ 0

Classification goal: Find optimal values for $w_n$, $b$ to minimize the log loss

Calculating the Derivatives

Once we know the log loss after forward propagation, we can take a gradient descent step to find the best weights and bias.

Focusing on one feature $w_1$, the idea is to reduce the loss by adjusting $w_1$'s weight. This requires applying the chain rule to get from the derivative of the loss function to the derivative of $w_1$.

Basically we are trying to figure out:

  1. How much $\hat{y}$ influences the loss ($\frac{\partial L}{\partial \hat{y}}$), and
  2. How much $w_1$ influences $\hat{y}$ ($\frac{\partial \hat{y}}{\partial w_1}$). Same with the bias.

The nitty gritty:

Deriving the loss wrt $\hat{y}$ ($\frac{\partial L}{\partial \hat{y}}$):

Deriving $\hat{y}$ wrt each weight/bias:

Now multiply partial derivatives together for each:

Notice that all of the sigmoid derivatives ($\hat{y}(1-\hat{y})$) cancel out:

And we end up with simple derivatives:

Gradient descent update steps

Update step for weights and bias:

Repeat this until log loss is optimally minimized
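
A numpy sketch of this training loop for a single perceptron with sigmoid activation and log loss; the toy dataset and learning rate are assumptions:

In [ ]:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy data: two features per row, binary labels (AND-like)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 0., 0., 1.])

w = np.zeros(2)
b = 0.0
alpha = 0.5    # assumed learning rate

for _ in range(2000):
    y_hat = sigmoid(X @ w + b)
    # after the sigmoid-derivative cancellation, dL/dw_i = (y_hat - y) * x_i
    dw = X.T @ (y_hat - y) / len(y)
    db = np.mean(y_hat - y)
    w, b = w - alpha * dw, b - alpha * db

print(w, b)
print(sigmoid(X @ w + b).round(2))    # predictions move toward [0, 0, 0, 1]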

Classification with a Neural Network

A 2-2-1 neural network example:

Partial Derivatives

Partial derivatives tell us exactly in what direction to move each one of the weights and biases in order to reduce the loss:

  1. $\frac{\partial L}{\partial w_{ij}}$ = how each first-layer weight affects the loss
  2. $\frac{\partial L}{\partial b_i}$ = how each first-layer bias affects the loss
  3. $\frac{\partial L}{\partial w_i}$ = how each second-layer weight affects the loss
  4. $\frac{\partial L}{\partial b}$ = how the second-layer bias affects the loss

Neural Network - Minimizing Log Loss

The chain rule wrt weight $w_{11}$:

Calculate each derivative and update step:

The chain rule wrt bias $b_1$:

Calculate each derivative and update step:

Summary of updating first layer of a 2-2-1 network:

Chain rule for second layer:

Summary of updating second layer of a 2-2-1 network:

Gradient Descent and Backpropagation

Using the following example (3 layer NN):

Just like before, we get derivatives for each weight/bias using the chain rule:

Newton's Method

Significance: Newton's method is an alternate method to gradient descent.

  • Main idea: Take the tangent at a starting point of the function (i.e. $x_0$), find where that tangent line crosses the x-axis (i.e. $x_1$), then take the tangent there and repeat.

Formula to iterate on to find the next step (generalized to $x_k$):

$$ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} $$

Newton's Method for Optimization

Problem: Note that Newton's method only finds the zero of a function, not the minimum. How can it be used for optimization?

Answer:

  1. Remember that when you want to minimize a function $g(x)$, you actually have to find the zeros of its derivative $g'(x)$
  2. Newton's method finds the zeros of a function, so apply it to $g'(x)$ rather than to $g(x)$
  3. In other words, if you let $f(x)$ be the derivative $g'(x)$, then by finding the zeros of $f(x)$ you are minimizing $g(x)$
  4. The $f'(x)$ in the formula is then the derivative of $g'(x)$, i.e. $g''(x)$, the second derivative

Summary of difference between Newton's method and using it for optimization:

Example of Newton's Method for Optimization

Using the example:

  • Function: $g(x) = e^x - \log(x)$
  • $x_0 = 0.05$
  • Remember that Newton's method is: $$ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} $$

Let's find derivatives wrt $g(x)$:

  1. First derivative of $g(x)$ - Let us consider this $f(x)$ $$ g'(x) = e^x-\frac{1}{x} $$
  2. Second derivative of $g(x)$ - Let us consider this $f'(x)$ $$ (g'(x))'=e^x + \frac{1}{x^2} $$

Now we can plug in x ($x_0=0.05$) and the first/second derivatives:

$$ \begin{align} x_1 &= x_0 - \frac{g'(x_0)}{(g'(x_0))'} \\ &= 0.05 - \frac{e^{0.05}-\frac{1}{0.05}}{e^{0.05}+\frac{1}{0.05^2}} \\ &=0.097 \end{align} $$

...and iterate until $g'(x_k)$ is close to zero, i.e. $x_k$ has converged to the minimum (see the sketch below).
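
A minimal sketch of this iteration in plain Python, starting from the $x_0 = 0.05$ used above:

In [ ]:
import math

def g_prime(x):           # plays the role of f(x)
    return math.exp(x) - 1 / x

def g_double_prime(x):    # plays the role of f'(x)
    return math.exp(x) + 1 / x**2

x = 0.05
for k in range(8):
    x = x - g_prime(x) / g_double_prime(x)
    print(k + 1, x)
# converges quickly to about 0.567, where g'(x) = 0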

The Second Derivative

Notation in Leibniz and Lagrange:

Significance

  • Second derivative gives us the curvature of a function
  • Curvature: A measure of the amount by which a curve deviates from being a straight line.
  • We can detect whether a critical point is a minimum or maximum, plus the direction of curvature:
    • If $f''(x) > 0$, it is a (local) minimum and the curve is concave up
    • If $f''(x) < 0$, it is a (local) maximum and the curve is concave down
    • If $f''(x) = 0$, the test is inconclusive
  • First vs second derivative characteristics
    1. $f'(x) > 0$ = increasing function
    2. $f'(x) < 0$ = decreasing function
    3. $f''(x) > 0$ = concave up (smile)
    4. $f''(x) < 0$ = concave down (frown)

Car Example

Using a car's distance, velocity and acceleration:

  • Velocity is 1st derivative
  • Acceleration is 2nd derivative

Key points:

  • $v(t)$ is derivative of $d(t)$
  • $a(t)$ is the second derivative of $d(t)$
  • If $v(t)$ increases linearly, $a(t)$ is a constant
  • If $v(t)$ is a constant, $a(t)$ is zero

The Hessian

Hessian: Matrix of second derivatives

  • When we want to optimize $f$ with many variables, we use multivariable Newton's method which uses the Hessian.

Comparison between $f$ with 1 vs 2 variables

Example using multivariable function

Example function: $f(x,y) = 2x^2+3y^2-xy$:

  • Each output on the right of the image is a second derivative with respect to one variable (e.g. $x$).
  • Each represents the rate of change of a rate of change

What do these second derivatives mean?

  1. "The first two are the rate of change with respect to x and then with respect to x, and also with respect to y and with respect to y. These work exactly like in one variable, because pretty much if you have f as a function of only x, you can take f prime prime of x, and the same thing with y."
  2. "The last two are the change in the slope along one coordinate axis with respect to tiny changes along the orthogonal coordinate axis."
    • These two are the same in most cases.

Another example, to get the Hessian matrix of $f(x,y)=x^2+y^2$:

  1. Get the first derivative wrt x and y: $$ f'_x(x,y) = 2x\\ f'_y(x,y) = 2y $$
  2. Get the second derivative wrt x and y: $$ \nabla f'_x(x,y) = \begin{bmatrix}2\\0\end{bmatrix}\\ \nabla f'_y(x,y) = \begin{bmatrix}0\\2\end{bmatrix} $$
  3. Pile (stack) the transposed vectors to form the Hessian (see the sympy check below): $$ \begin{bmatrix}2&0\\0&2\end{bmatrix} $$
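
sympy can compute this directly; a quick check of both example functions above:

In [ ]:
from sympy import symbols, hessian

x, y = symbols('x y')

print(hessian(x**2 + y**2, (x, y)))               # Matrix([[2, 0], [0, 2]])
print(hessian(2*x**2 + 3*y**2 - x*y, (x, y)))     # Matrix([[4, -1], [-1, 6]])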

Hessian Matrix

Hessian matrix: Matrix of the second derivatives

Notation:

Applied to the $f(x,y) = 2x^2+3y^2-xy$ problem:

General formula

Hessians and Concavity

Takeaway:

  1. Given a Hessian matrix evaluated at a critical point (e.g. $H(0,0)$), if the eigenvalues are all positive, then:
    1. $f$ is concave up
    2. (0,0) is the minimum
    3. Matrix is "positive definite"
  2. Same as above, if eigenvalues are all negative, then:
    1. $f$ is concave down
    2. (0,0) is the maximum
    3. Matrix is "negative definite"
  3. Same as above, if eigenvalues are a mix of positive and negative, then:
    1. The point is neither a maximum nor a minimum
    2. Matrix is neither positive nor negative definite
    3. Called a "saddle point" in the 2-variable case

Hessian Summary

Newton's Method for Two Variables

Formula:

$$ \begin{bmatrix}x_{k+1} \\ y_{k+1}\end{bmatrix}=\begin{bmatrix}x_k \\ y_k\end{bmatrix}-H^{-1}(x_k,y_k)\nabla f(x_k,y_k) $$

Newton's method applied to a two variable case:

  1. We already learned solving it with one variable: $$ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} $$
  2. With two variables, the denominator becomes the inverse Hessian, and the numerator becomes the gradient.
    • Note: Order matters here! You can't matrix-multiply a 2x1 by a 2x2; it must be 2x2 times 2x1.

Example

Goal: Find the minimum of a concave-up, two-variable function: $$ f(x,y) = x^4 + 0.8y^4 + 4x^2 + 2y^2 - xy - 0.2x^2y $$

Steps:

  1. Find the first and second derivatives wrt x and y
  2. First derivative is the gradient, second is the Hessian
  3. Start at random point and find its gradient and Hessian
  4. Solve using Newton's method for 2 variables
  5. Iterate until reaching close to zero (reached in 8 steps; see the sketch after this list): $$\begin{bmatrix}4.15 \times 10^{-17} \\ -2.05 \times 10^{-17}\end{bmatrix}$$
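
A sketch of these steps using sympy for the derivatives and numpy for the iteration (the starting point is an assumption):

In [ ]:
import numpy as np
from sympy import symbols, hessian, Matrix, lambdify

x, y = symbols('x y')
f = x**4 + 0.8*y**4 + 4*x**2 + 2*y**2 - x*y - 0.2*x**2*y

grad = lambdify((x, y), Matrix([f.diff(x), f.diff(y)]), 'numpy')
hess = lambdify((x, y), hessian(f, (x, y)), 'numpy')

p = np.array([2.0, 2.0])    # assumed starting point
for k in range(10):
    g = np.array(grad(*p), dtype=float).flatten()
    H = np.array(hess(*p), dtype=float)
    p = p - np.linalg.solve(H, g)    # H^{-1} * gradient, without forming the inverse
    print(k + 1, p)
# p converges toward the minimum near (0, 0)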

Pros vs Cons of Newton's Method vs Gradient Descent

Pros:

  • Fewer parameters (no learning rate)
  • Converges faster

Cons:

  • Need to calculate second derivative. Can be expensive if there are a lot of parameters (ie variables/features)

Quiz
