Optimization Algorithm
Uses:
Given $x=\{1, \ldots, 100\}$, optimize the following functions
ANSWER:
"""Solve Quiz via Generate & Test"""
from math import sin
import matplotlib.pyplot as plt
def fn_1(x: int):
    return (x % 6) ** 2 % 7 - sin(x)

def fn_2(x: int):
    return -(x**4) + 1000 * x**3 - 20 * x**2 + 4 * x - 6
# Generate and Test to get max value of f(x)
fn_1_x = list(range(1, 101))  # x = {1, ..., 100}
fn_2_x = list(range(-500, 1000))  # Substitute R with a large finite range
result_fn_1 = tuple(fn_1(i) for i in fn_1_x)
result_fn_2 = tuple(fn_2(i) for i in fn_2_x)
fn_1_max = max(result_fn_1)
fn_2_max = max(result_fn_2)
fn_1_max_x = result_fn_1.index(fn_1_max) + 1
fn_2_max_x = result_fn_2.index(fn_2_max) - 500 # Offset index by -500
# Print maximum x and its value
print("Q1 - x:", fn_1_max_x, "value:", fn_1_max)
print("Q2 - x:", fn_2_max_x, "value:", fn_2_max)
# Plot it
fig, ax = plt.subplots(1, 2, figsize=(10, 3))
ax[0].plot(fn_1_x, result_fn_1)
ax[0].set_title("Function 1 and Optimal x")
ax[0].axvline(x=fn_1_max_x, color="r")
ax[1].plot(fn_2_x, result_fn_2, color="red")
ax[1].set_title("Function 2 and Optimal x")
ax[1].axvline(x=fn_2_max_x, color="b");
Takeaway: We can use the methods below, but each has caveats:
These assumptions don't hold if:
The solution is to use Randomized Optimization
Algorithm (steepest ascent hill climbing):
Takeaways: Well-behaved problem; this problem has only one optimum.
Once a local optimum is reached, try again starting from a randomly chosen $x$ (random-restart hill climbing).
If the range of the input space that leads to the global optimum is narrow (i.e., the green zone in the image), then there is a high chance that every random restart lands at a point that only reaches the local optimum (i.e., the yellow zone).
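The random-restart scheme above can be sketched as follows (a minimal sketch, reusing `fn_1` from the quiz above as the objective; the $\pm 1$ neighborhood and restart count are my own illustrative choices):

```python
import random
from math import sin

def hill_climb(f, x0, neighbors):
    """Steepest-ascent hill climbing: move to the best neighbor until none improves."""
    x = x0
    while True:
        best = max(neighbors(x), key=f)
        if f(best) <= f(x):
            return x  # local optimum: no neighbor is strictly better
        x = best

def random_restart_hill_climb(f, domain, neighbors, restarts=20, seed=0):
    """Run hill climbing from several random starting points; keep the best endpoint."""
    rng = random.Random(seed)
    ends = [hill_climb(f, rng.choice(domain), neighbors) for _ in range(restarts)]
    return max(ends, key=f)

f = lambda x: (x % 6) ** 2 % 7 - sin(x)          # fn_1 from the quiz
domain = list(range(1, 101))
neighbors = lambda x: [v for v in (x - 1, x + 1) if 1 <= v <= 100]
x_best = random_restart_hill_climb(f, domain, neighbors)
print("best x:", x_best, "f(x):", f(x_best))
```

Whether the 20 restarts actually find the global optimum depends on how wide its basin is relative to the domain, which is exactly the caveat above.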
This is a complicated answer, watch video for full info: https://www.youtube.com/watch?v=Ts9vLVxyqdY
Takeaway: The success of random-restart hill climbing depends a lot on the size of the basin of attraction (the region of points that lead toward the global optimum). If it is narrow, the method won't do much better than trying every point in $X$.
Idea: "Don't always improve (exploit) - sometimes you need to search (explore) with the hope of finding something even better".
From CMU notes:
The algorithm is basically hill-climbing except instead of picking the best move, it picks a random move. If the selected move improves the solution, then it is always accepted. Otherwise, the algorithm makes the move anyway with some probability less than 1. The probability decreases exponentially with the "badness" of the move, which is the amount $\Delta E$ by which the solution is worsened.
Simulated annealing is typically used in discrete, but very large, configuration spaces, such as the set of possible orders of cities in the Traveling Salesman problem and in VLSI routing.
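A minimal sketch of this acceptance rule (the exponential cooling schedule, $\pm 1$ step proposal, and constants are illustrative choices of mine; `fn_1` from the quiz above is reused as the objective):

```python
import math
import random
from math import sin

def simulated_annealing(f, x0, neighbor, t0=10.0, cooling=0.95, steps=500, seed=0):
    """Maximize f: always accept improvements; accept a worse move with
    probability exp(delta_e / T), which shrinks as T cools toward zero."""
    rng = random.Random(seed)
    x, t = x0, t0
    best = x0
    for _ in range(steps):
        x_new = neighbor(x, rng)
        delta_e = f(x_new) - f(x)
        if delta_e > 0 or rng.random() < math.exp(delta_e / t):
            x = x_new
        if f(x) > f(best):
            best = x            # remember the best point seen so far
        t *= cooling            # exponential cooling schedule
    return best

f = lambda x: (x % 6) ** 2 % 7 - sin(x)                      # fn_1 from the quiz
step = lambda x, rng: min(max(x + rng.choice((-1, 1)), 1), 100)
x_best = simulated_annealing(f, x0=1, neighbor=step)
```

Early on (high $T$) the walk accepts almost anything (explore); as $T \to 0$ it degenerates into hill climbing (exploit).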
Key Ideas:
Reference:
Terminology:
Takeaways:
Takeaways:
Reference: De Bonet, Isbell, Viola, "MIMIC: Finding Optima by Estimating Probability Densities", 1997
How? Directly model a probability distribution and successively refine this model over time. This then conveys the structure of the search space and its best points.
Challenges:
* Example of joint probability distribution (ref):
NOTE: I got lost on the math. Revisit this lecture to refresh: https://www.youtube.com/watch?v=R-Mf9-tKC5o.
Takeaway: The best dependency tree we can find is the one that minimizes the sum over features of each feature's entropy conditioned on its parent.
Motivation: MIMIC itself doesn't care which probability distribution is used; the user must pick the one best suited to the problem.
Q: Given 3 problems and 3 probability distributions you might use, see which distribution works best for each problem. A:
Takeaway: Each probability distribution represents a structure, i.e., each $x$'s relationship to the others.
The overall structure of this section of the course is to look at unsupervised learning (UL) algorithms that are each designed to solve different problems.
Basic Clustering Problem:
Q: Given $n$ points (objects) and $k$ clusters, what is its running time?
Reference: Notes from ISYE6501
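As a concrete reference point for the running-time question: each k-means iteration compares all $n$ points against all $k$ centers, i.e. $O(nkd)$ distance work per iteration for $d$-dimensional points. A minimal sketch (Lloyd's algorithm; the toy data and fixed iteration count are my own illustrative choices):

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm) on tuples; O(n * k * d) work per iteration."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each of the n points is compared against all k centers.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its assigned points.
        for j, members in enumerate(clusters):
            if members:
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, clusters

pts = [(0.0,), (0.5,), (1.0,), (10.0,), (10.5,), (11.0,)]
centers, clusters = kmeans(pts, k=2)
```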
TODO: I don't understand what is meant by "selecting K Gaussians" here, it seems related to the Gaussian Process but I don't understand what that is either.
Motivation: If one $x$ is exactly between 2 clusters, it may sometimes be included in partition A and sometimes in partition B. Can we put ambiguous examples into both clusters?
Assume the data was generated by the following method:
Task: Find a hypothesis of means $h = \{ \mu_1 \ldots \mu_k\}$ that maximize the probability of the data (Maximum Likelihood).
From George's Notes:
$$ x = \{x, z_1, z_2 \ldots z_k\} $$

Single Max Likelihood Gaussian: Suppose $k = 1$ for the simplest possible case. What Gaussian maximizes the likelihood of some collection of points? Conveniently (and intuitively), it's simply the Gaussian with $\mu$ set to the mean of the points!
Extending to Many Ks: With k possible sources for each point, we’ll introduce hidden variables for each point that represent which cluster they came from. They’re hidden because, obviously, we don’t know them: if we knew that information we wouldn’t be trying to cluster them. Now each point x is actually coupled with the probabilities of coming from the clusters:
Idea: Move back and forth between two steps: Expectation (adjust the cluster probabilities $z_i$) and Maximization (re-estimate the chosen means $\mu$).
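A sketch of this back-and-forth for a 1-D mixture of $k$ Gaussians, simplified by fixing unit variance and uniform priors (the data and starting means are illustrative, not from the lecture):

```python
import math

def em_means(xs, mus, iters=20):
    """EM for a k-Gaussian mixture with fixed unit variance and uniform priors."""
    for _ in range(iters):
        # E-step: soft-assign each point, P(z=j | x) proportional to exp(-(x - mu_j)^2 / 2)
        resp = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / 2) for m in mus]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: re-estimate each mean as the responsibility-weighted average of the points
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
               for j in range(len(mus))]
    return mus

xs = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
mus = em_means(xs, mus=[0.5, 4.5])
```

Unlike k-means' hard assignments, every point contributes (a little) to every mean, which is exactly how ambiguous points end up "in both clusters".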
Terminology:
From George's Notes:
Richness: For any given set of data assigned to clusters $C$, there is some distance matrix $D$ that causes your clustering algo $P_D$ to produce that clustering.
Scale-invariance: Scaling distances doesn't change the clustering (x100, different units, etc).
Consistency: Shrinking distance of points within a cluster AND increasing distance between clusters doesn't change the clustering.
Answer:
Impossibility Theorem by Kleinberg:
No clustering scheme can achieve all three of richness, scale-invariance, and consistency.
Types:
Motivations:
Quiz: Given a set of $N$ features and a function $f$ that returns a score, how hard is the problem of choosing the best subset of size $m$ from the $N$ features?
Definition: Search combinations of features using a heuristic, then pass the chosen subset to the learner ($L$).
Pros/Cons: It is fast, but you don't get feedback from the learner; it ignores the learning problem.
Technique:
Definition: Search combinations of features and for each combo, get feedback from the learner.
Pros/Cons: It is much slower (up to $2^N$ subsets), but since it gets feedback from the learner on each iteration, it accounts for model bias and learning.
Technique:
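A minimal wrapper sketch, assuming forward selection with leave-one-out 1-NN accuracy as the learner feedback (both choices, and the toy data, are my own illustrations, not the lecture's prescription):

```python
def loo_1nn_accuracy(X, y, feats):
    """Leave-one-out accuracy of a 1-NN learner using only the given feature indices."""
    correct = 0
    for i in range(len(X)):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in feats))
        correct += y[nearest] == y[i]
    return correct / len(X)

def forward_select(X, y, n_features):
    """Wrapper method: greedily add the feature that most improves the learner's score."""
    selected = []
    remaining = list(range(n_features))
    while remaining:
        best = max(remaining, key=lambda f: loo_1nn_accuracy(X, y, selected + [f]))
        if selected and loo_1nn_accuracy(X, y, selected + [best]) <= loo_1nn_accuracy(X, y, selected):
            break  # no remaining feature improves the learner's score
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [(0, 9), (1, 3), (2, 7), (8, 2), (9, 8), (10, 1)]
y = [0, 0, 0, 1, 1, 1]
selected = forward_select(X, y, n_features=2)
```

Note how every candidate subset requires retraining/scoring the learner, which is where the wrapper's cost comes from.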
Q: What's the smallest set of features sufficient to achieve zero training error, for a decision tree (DT) versus a neural network (NN)?
Relevance is a way to measure how useful a feature is to learn a problem.
Usefulness: A feature’s usefulness is measured by its effect on a particular predictor rather than on the Bayes’ optimal classifier
What is it?: Feature transformation is the pre-processing of features into a smaller, more compact feature set that still retains as much relevant and useful information as the original.
Motivation: Example - Information Retrieval
Problem: Given a database and a search term, we want to return the most relevant set of documents. What features do we use? We can use words, but there are many words which can lead to the curse of dimensionality. Words can also be ambiguous (e.g. "book" as a noun vs verb).
Solution: We can use feature transformation to group relevant words together, which ends up creating a more compact feature set.
Reference:
Principal Components Analysis (PCA) is a technique that projects the features onto a ranked set of new "component" features, where the ranking is determined by the amount of variance each component explains.
The first PC of a feature set represents the direction of maximum variance. Each subsequent PC is orthogonal to the previous ones and captures the next direction of maximal variance.
An eigenvector of a matrix $X$ is a vector $v$ whose direction is unchanged by $X$: $Xv = \lambda v$, where the scalar $\lambda$ is the corresponding eigenvalue. In PCA, the eigenvectors of the feature covariance matrix are ranked by their eigenvalues, and each principal component is the projection of $X$ onto one of these eigenvectors (e.g., $Xv_1 \ldots Xv_n$ are the principal components).
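A small sketch of PCA via eigendecomposition of the covariance matrix (assumes numpy is available; the correlated toy data is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 2 features, strongly correlated, so one component should explain most variance.
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2 * x1 + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                  # center the features
cov = Xc.T @ Xc / (len(X) - 1)           # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]        # rank eigenvectors by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
components = Xc @ eigvecs                # principal components: data projected onto eigenvectors
explained = eigvals / eigvals.sum()      # fraction of variance each component explains
```

Because the eigenvectors are orthogonal, the resulting components are uncorrelated with each other.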
Independent Components Analysis (ICA) is slightly different from PCA: instead of finding vectors that maximize variance (and capture correlation), it aims to maximize independence. This means that if feature set $X$ is transformed into $Y$ via ICA, the features $Y_i$ share minimal/zero mutual information with one another while still retaining all of the original information.
From George's Notes:
At a high level, given some features $\{x_1, x_2, \ldots\}$, ICA aims to find new features $\{y_1, y_2, \ldots\}$ such that $y_i \perp y_j$ and $I(y_i, y_j) = 0$ (that is, the pairs of vectors are mutually independent and share minimal mutual information). It also tries to maximize the mutual information between the $x_i$'s and $y_j$'s.
Takeaways:
Motivation: In general, every input/output vector (in an ML context) can be thought of as a probability density function*. Information theory (IT) is a mathematical framework that lets us compare these density functions to ask questions like (from George's notes):
- Which of our $x_i$'s will give us the most information about the output?
- Are these input feature vectors similar? This is called mutual information.
- Does this feature have any information? This is called entropy.
* Mathematical function that describes the probability of an event occurring
Entropy is the measure of uncertainty associated with a random variable.
Entropy in general is expressed as follows, where $P(x)$ is the probability of outcome $x$:
$$ h(X) = -\sum_{x \in X} P(x) \log_2 P(x) $$
Mutual Information gives us information about the relationship between variables (e.g., given rain, the probability of thunder). We can measure this in a few ways:
Joint Entropy measures the randomness of a set of variables taken together ($h$ represents entropy):
$$ h(X,Y) = -\sum_{x \in X, y \in Y} P(x,y) \log_2 P(x,y) $$

Conditional Entropy measures the randomness of one variable given another variable:
$$ h(Y|X) = -\sum_{x \in X, y \in Y} P(x,y) \log_2 P(y|x) $$

Mutual Information (MI) measures the mutual dependence of random variables, i.e., the amount of information X and Y share. High MI means X and Y are highly dependent; zero MI means they are independent.
$$ I(X,Y) = h(Y) - h(Y|X) $$

(TODO: This was tricky, revisit entropy calculation)
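These definitions can be sanity-checked numerically; a small sketch (the helper functions and the two example joint distributions are mine):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mutual_information(joint):
    """I(X;Y) = h(X) + h(Y) - h(X,Y), for a joint distribution given as a 2-D table."""
    px = [sum(row) for row in joint]                  # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]            # marginal of Y (column sums)
    hxy = entropy([p for row in joint for p in row])  # joint entropy
    return entropy(px) + entropy(py) - hxy

# Two independent fair coins: the joint is uniform over 4 outcomes.
indep = [[0.25, 0.25], [0.25, 0.25]]
# Two fully dependent fair coins (A always equals B): only 2 outcomes possible.
dep = [[0.5, 0.0], [0.0, 0.5]]
```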
Quiz: Compute the following entropies, where A and B are two independent fair coins ($P = 0.5$):

```python
import math
-2*(0.25*math.log(0.25, 2))
-4*(0.25*math.log(0.25, 2))
-4*(0.25*math.log(0.5, 2))
```

Now compute the same where A and B are two fully dependent coins ($P = 0.5$):

```python
0.5/0.5
-2*(0.25*math.log(0.25, 2))
-2*(0.5*math.log(0.5, 2))
-2*(0.5*math.log(1, 2))
```
(python)KL Divergence measures the distance between two distributions. Mutual information is special case of KL divergence.
$$ D_{KL}(p \,\|\, q) = \int p(x) \log_2 \frac{p(x)}{q(x)} \, dx $$

Significance: In SL our goal is to model data $X$ with an unknown distribution $p(x)$. We can sample this data to get a known distribution $q(x)$. We can use KL divergence as a substitute for the least-squares formula for fitting.
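A discrete-form sketch of KL divergence (the two coin distributions are illustrative; the sum replaces the integral for discrete variables):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for discrete distributions (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # "true" distribution: a fair coin
q = [0.9, 0.1]   # model distribution: a biased coin
```

Note the asymmetry: $D_{KL}(p \| q) \ne D_{KL}(q \| p)$ in general, which is why it is not a true distance.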