Per-word cross-entropy: Cross-entropy averaged over all the words in a sequence.
Perplexity: Geometric mean of the inverse probability of a sequence of words.
Intuition: Perplexity of a discrete uniform distribution of $k$ events is $k$.
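A quick sketch tying these together (the token probabilities are made-up toy values): perplexity is the exponential of the per-word cross-entropy, so a uniform distribution over $k$ events gives perplexity $k$.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(per-word cross-entropy) = geometric mean of inverse probabilities."""
    n = len(token_probs)
    cross_entropy = -sum(math.log(p) for p in token_probs) / n  # per-word cross-entropy (nats)
    return math.exp(cross_entropy)

# Uniform distribution over k = 4 events -> perplexity is 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```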
Teacher forcing: During training, the model is fed the true target sequence, not its own generated output. At each time step it receives the correct token from the target sequence as input, rather than its own generated token from the previous time step.
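A schematic sketch of the difference (the `decoder`, `loss_fn`, and `argmax` callables are placeholders, not any specific library API): with teacher forcing the input at step $t$ is the ground-truth token $y_{t-1}$; at inference time the model must consume its own previous prediction.

```python
def train_step_teacher_forcing(decoder, target_tokens, state, loss_fn):
    """Teacher forcing: feed the ground-truth previous token at every step."""
    total_loss = 0.0
    for t in range(1, len(target_tokens)):
        logits, state = decoder(target_tokens[t - 1], state)  # true token, not the model's output
        total_loss += loss_fn(logits, target_tokens[t])
    return total_loss

def generate_free_running(decoder, start_token, state, steps, argmax):
    """Free-running generation: the model consumes its own previous prediction."""
    token, output = start_token, []
    for _ in range(steps):
        logits, state = decoder(token, state)
        token = argmax(logits)  # model's own prediction fed back in
        output.append(token)
    return output
```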
Key Idea: Masked language modeling (MLM) is a pretraining task: a task that is not the final task itself, but one that helps us obtain better initial parameters for modeling the final task.
Key Idea: Pre-training on multiple languages allows the model to perform downstream tasks in multiple languages.
Definition: Knowledge distillation is a technique in machine learning where a smaller model (student) learns to mimic the predictions of a larger, more complex model (teacher) by transferring its knowledge. The goal is to compress the teacher model's knowledge into a smaller model, allowing for efficient deployment and reduced computational requirements while maintaining or improving performance.
How:
Losses:
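A common formulation of the losses (Hinton et al.'s soft-target distillation; the lecture's exact losses may differ): the student minimizes a weighted sum of the usual cross-entropy on hard labels and a KL term against the teacher's temperature-softened distribution.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label, T=2.0, alpha=0.5):
    """alpha * CE(hard labels) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    p_student_T = softmax(student_logits, T)
    p_teacher_T = softmax(teacher_logits, T)
    hard_loss = -np.log(softmax(student_logits)[true_label])                       # standard cross-entropy
    soft_loss = np.sum(p_teacher_T * (np.log(p_teacher_T) - np.log(p_student_T)))  # KL divergence
    return alpha * hard_loss + (1 - alpha) * (T ** 2) * soft_loss
```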
Distributional semantics: A word's meaning is given by the words that frequently appear close by.
"A Neural Probabilistic Language Model" Bengio, 2003
"A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning" Collobert & Weston, 2008 & "Natural Language Processing (Almost from Scratch)" Collobert et al., 2011
"Efficient Estimation of Word Representations in Vector Space" Mikolov et al., 2013 (Word2vec paper)
Significance: Very efficient to compute, because word vectors are the only parameters learned in the model.
Significance: A word and its context form a positive training sample; a random word in that same context gives a negative training sample.
Significance: Use words to predict their context words.
Defining $P(w_{t+j} \mid w_t; \theta)$
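For reference, in the skip-gram model (Mikolov et al., 2013) this probability is a softmax over dot products of "center" and "context" word vectors, $P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$; with negative sampling, the full softmax is replaced by logistic losses on the positive pair and a few sampled negative pairs. A minimal NumPy sketch of that loss (vector names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_neg_sampling_loss(v_center, u_context, u_negatives):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)."""
    pos = -np.log(sigmoid(u_context @ v_center))
    neg = -np.sum(np.log(sigmoid(-(u_negatives @ v_center))))
    return pos + neg

rng = np.random.default_rng(0)
v_c, u_o, u_neg = rng.normal(size=8), rng.normal(size=8), rng.normal(size=(5, 8))
print(skipgram_neg_sampling_loss(v_c, u_o, u_neg))
```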
GloVe: Global Vectors - Generates word embeddings by capturing global statistical patterns of word co-occurrences in large text corpora.
fastText: sub-word embeddings - Word representations generated by considering smaller textual units like character n-grams, enabling the model to capture morphological and semantic similarities for words, even for those not seen during training. Available for more than 200 languages.
(ELMo, BERT covered in later lectures)
Common ways to use embeddings:
2 categories: Intrinsic and extrinsic.
Intrinsic - Evaluated on a specific / intermediate subtask.
Extrinsic - Evaluated on a real task (e.g., text classification).
The classic "man:woman, king:?" task.
Examples of graph data:
Idea: Given a matrix with people as rows and items as columns, predict whether someone will like an item they have not yet rated.
Definitions:
Why graph embeddings:
Example:
(from Bordes et al, 2013 - https://papers.nips.cc/paper_files/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf)
High-level overview:
Optimized negative sampling
TagSpace task: Given input text (e.g. restaurant review), predict hashtag.
PageSpace task: Given user-page pairs, recommend relevant pages.
More info on StarSpace (Github link)
Example for PageSpace:
Example for VideoSpace:
Idea: Embed everything into a shared embedding space.
Power of Universal Behavioral Features
Applications of world2vec:
Summary of world2vec:
Technical details:
Handling large-scale computations:
Idea: Partition the graph using matrix blocking; nodes are divided uniformly into $N$ shards.
ChatGPT:
Matrix blocking involves dividing the adjacency matrix (or other relevant matrices) into smaller blocks and processing these blocks separately. By breaking down the large matrix into smaller, manageable chunks, parallel processing and memory optimization techniques can be applied, making computations more efficient and scalable for large-scale graphs in GNNs.
Ideas:
Nickel and Kiela, 2017, NIPS
ChatGPT summary of paper:
The paper "Poincaré Embeddings for Learning Hierarchical Representations" by Nickel and Kiela introduces a method for learning hierarchical representations of data using hyperbolic space, specifically the Poincaré ball model. Traditional embedding methods struggle with capturing hierarchical relationships effectively. The authors propose Poincaré Embeddings, a technique that maps data points into hyperbolic space, preserving hierarchical structures better than Euclidean embeddings. This approach utilizes the Poincaré ball model's unique properties to represent data hierarchies more accurately, making it particularly useful for tasks involving structured data, such as taxonomies and ontologies. The paper demonstrates that Poincaré Embeddings outperform existing methods in capturing hierarchical relationships and offer a valuable tool for various applications requiring the understanding of hierarchical structures within data.
What makes hyperbolic space special?
Hyperbolic space's inherent curvature allows it to naturally model hierarchical structures, making it a better choice for representing complex, nested relationships in various applications.
In hierarchical structures, such as trees or graphs, entities often have varying degrees of similarity or relatedness. Hyperbolic space can capture these hierarchical relationships by assigning more space (distance) to represent the relationships between entities that are farther apart in the hierarchy.
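The distance that makes this work (from Nickel & Kiela, 2017) is the Poincaré distance on the open unit ball, $d(u, v) = \operatorname{arcosh}\left(1 + 2\frac{\lVert u - v\rVert^2}{(1 - \lVert u\rVert^2)(1 - \lVert v\rVert^2)}\right)$: distances blow up near the boundary of the ball, which is where the leaves of a hierarchy end up. A small NumPy sketch:

```python
import numpy as np

def poincare_distance(u, v):
    """Poincare distance on the open unit ball (Nickel & Kiela, 2017)."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    num = np.sum((u - v) ** 2)
    den = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * num / den)

# Points near the origin behave almost Euclidean; points near the boundary are very far apart.
print(poincare_distance([0.0, 0.0], [0.1, 0.0]))    # small
print(poincare_distance([0.95, 0.0], [0.0, 0.95]))  # large, despite Euclidean distance ~1.34
```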
Bolukbasi et al, "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings", NeurIPS 2016
Key Ideas:
Relation to Attention:
Softmax attention over vectors
In MLP:
In Softmax Attention:
Key ideas:
Using softmax:
Notes:
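A minimal NumPy sketch of softmax attention over a set of vectors (names and shapes are my own, not from the slides): the controller/query scores each input vector, the scores are softmaxed, and the output is the attention-weighted average of the inputs.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_attention(query, inputs):
    """query: (d,), inputs: (n, d). Returns the weighted average of inputs and the weights."""
    scores = inputs @ query    # one scalar score per input vector
    weights = softmax(scores)  # soft, differentiable selection
    return weights @ inputs, weights

x = np.random.randn(5, 8)  # 5 input vectors of dimension 8
q = np.random.randn(8)     # controller / query state
output, attn = softmax_attention(q, x)
```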
Key Idea: Reasoning over a set of inputs. Attention is placed on different tokens across different inputs to perform inductive reasoning.
Including Geometric Information
Key Idea: If the input has an underlying geometry (i.e., structure), we can include that as an additional encoding.
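One standard instance of such an encoding for sequences (the sinusoidal positional encoding of Vaswani et al., 2017; the lecture may use a different variant), sketched in NumPy:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...). Assumes even d_model."""
    assert d_model % 2 == 0
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe  # added to the token embeddings before the first attention layer
```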
Key Ideas: 3 things that make transformers special:
Key Idea: Self-attention expands on the controller state of softmax attention by computing one for every input.
Transformers diagram:
Key Idea: Multi-head attention splits the controller states into $N$ chunks ("heads"), runs self-attention on each separately, then recombines them with a fully connected layer.
Multi-head attention diagram:
Multi-head attention: Math diagram
(This was skipped over in lecture)
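A compact NumPy sketch of the split-attend-recombine pattern described above (dimensions, scaling, and random weights are placeholders, not the lecture's exact math):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d). W_q/W_k/W_v/W_o: (d, d). Split into heads, attend, recombine."""
    n, d = X.shape
    d_head = d // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)  # (heads, n, d_head)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    out = softmax(scores, axis=-1) @ Vh                    # (heads, n, d_head)
    out = out.transpose(1, 0, 2).reshape(n, d)             # concatenate the heads
    return out @ W_o                                       # final fully connected mix

d, heads, n = 16, 4, 6
X = np.random.randn(n, d)
Ws = [np.random.randn(d, d) / np.sqrt(d) for _ in range(4)]
Y = multi_head_attention(X, *Ws, n_heads=heads)
```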
Transformers can operate on arbitrary sets of vectors or graphs, but they were first introduced for handling text.
Some special considerations for text:
Key Ideas:
Key Ideas:
First step of decoding
N-steps of decoding
Key Idea: The key and value matrices/vectors drive the self-attention mechanism.
Inputs:
Computation:
Attention V1 with query vectors as input:
Q: Difference between hard vs soft attention?
Key Idea: Don't let vectors "look ahead" in the sequence.
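A minimal sketch of how this masking is usually done (assuming an $n \times n$ matrix of float attention scores): before the softmax, set scores for positions $j > i$ to $-\infty$ so token $i$ can only attend to positions $\le i$.

```python
import numpy as np

def causal_mask(scores):
    """scores: (n, n) float attention scores. Disallow attending to future positions."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    masked = scores.copy()
    masked[mask] = -np.inf                            # softmax gives these positions weight 0
    return masked
```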
Key Idea: Use $H$ independent attention heads in parallel. Process multiple chunks of attention.
Key Idea: Stack transformer blocks to get a contextualized understanding of the input.
Key Idea: Remove the decoder structure; just encode the input into a rich, contextualized representation.
GLUE benchmark as of 2023/10/22
Q: How is it different from Word2Vec?
Key Idea: Jointly process images and language data using attention + transformers to respond to multimodal tasks.
Key Idea: At each decoding step, we keep $n$ candidates, where $n$ is the "beam width".
Formula: Maximize the probability of the sequence of tokens:
$$ \arg\max_y \prod\limits_{t=1}^{T_y} P(y_t \mid x, y_1, \ldots, y_{t-1}) $$
Which can also be written as the probability of the sequence $y$ given $x$:
$$ P(y_1, \ldots, y_{T_y} \mid x) = P(y_1 \mid x) \, P(y_2 \mid x, y_1) \cdots P(y_{T_y} \mid x, y_1, \ldots, y_{T_y - 1}) $$
From GT lecture - beam search over a full sentence
Key Idea: Multiplying the probabilities of a long sequence can lead to numerical underflow. Working with the log prevents this.
New objective function - sum of the logs of the probabilities: $$ \arg\max_y \sum\limits_{t=1}^{T_y} \log P(y_t \mid x, y_1, \ldots, y_{t-1}) $$
Key Idea: Longer sequences contribute more log-probability terms, each of which is negative, so their sum is smaller than that of shorter sequences. This biases the objective function toward shorter sequences.
Solution: Normalize by the length of the sequence, which reduces the penalty on longer sequences.
Beam search takes the top-K sentences and evaluates them all using the objective function.
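A sketch of the length-normalized score used to compare beam candidates (the exponent $\alpha$ is a hyperparameter; $\alpha \approx 0.7$ is a common choice, not a value from the lecture):

```python
import math

def beam_score(log_probs, alpha=0.7):
    """Sum of token log-probabilities, divided by length^alpha to reduce the short-sequence bias."""
    T = len(log_probs)
    return sum(log_probs) / (T ** alpha)

# The longer candidate is no longer penalised just for having more (negative) terms.
short = [math.log(0.5), math.log(0.5)]
long_ = [math.log(0.7)] * 5
print(beam_score(short), beam_score(long_))  # the longer, higher-confidence sequence now wins
```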
What's expensive:
Strategies:
Key Idea: Prune the less likely tokens given the sequence, and only project onto that reduced subset.
Solutions:
Key Idea: Algorithmically create your vocabulary rather than handling every single unique token.
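A minimal sketch of one common way to do this, the core byte-pair encoding (BPE) loop of Sennrich et al., 2016 (the toy corpus is made up): repeatedly count adjacent symbol pairs and merge the most frequent pair into a new vocabulary symbol.

```python
from collections import Counter

def merge_word(symbols, pair):
    """Replace every adjacent occurrence of `pair` in the symbol tuple by the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def bpe_merges(word_freqs, num_merges):
    """word_freqs: {tuple_of_symbols: count}. Learn `num_merges` merge rules."""
    vocab, merges = dict(word_freqs), []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_word(s, best): f for s, f in vocab.items()}
    return merges

toy = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
print(bpe_merges(toy, 4))  # e.g. [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```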
Key Idea: Dropping layers during training and inference
More details in Fan et al, "Reducing Transformer Depth on Demand with Structured Dropout"
Key Idea: Speed up computation by operating in a lower-precision domain.
Example: fbgemm
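A toy NumPy sketch of the idea behind int8 quantization (this illustrates the general scale/zero-point scheme, not FBGEMM's actual API): map float values to 8-bit integers, do the cheap integer work, and dequantize.

```python
import numpy as np

def quantize_int8(x):
    """Affine-quantize a float array to int8 with a per-tensor scale and zero point."""
    x = np.asarray(x, dtype=np.float32)
    scale = (x.max() - x.min()) / 255.0 or 1.0
    zero_point = np.round(-x.min() / scale) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize_int8(w)
print(np.abs(w - dequantize(q, s, z)).max())  # small quantization error
```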
Key Idea: Predict all of the tokens at once.
Key Idea: The Listen-Attend-Spell system needs the entire audio stream in order to decode, so it can't be used in a streaming context.
Key Idea: Collapse multiple audio frames (i.e., the sequence) into letters instead of phonemes (as hybrid systems do). This inherently creates an alignment.
Key Idea: Handle consecutive outputs of the same letter by using a "blank" symbol.
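A minimal sketch of the CTC collapsing rule (the blank symbol "-" is my own notation): merge repeated letters, then drop the blanks.

```python
from itertools import groupby

BLANK = "-"

def ctc_collapse(frame_outputs):
    """Collapse repeats, then remove blanks: ['h','h','-','e','l','-','l','o'] -> 'hello'."""
    deduped = [symbol for symbol, _ in groupby(frame_outputs)]  # merge consecutive repeats
    return "".join(s for s in deduped if s != BLANK)            # drop blanks

print(ctc_collapse(list("hh-ee-l-ll-oo")))  # -> 'hello'
```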
Issue: CTC and LAS make a big conditional independence assumption: each frame's output doesn't depend on the transcript produced so far.
Key Idea: The objective is to maximize the sum of the probabilities of all alignments given the audio features.
Alignment in a lattice:
Key Idea: The probability of an alignment (a path in the lattice) is the product of the probabilities of all its edges.
Subset of hypotheses at $t=1$ during decoding
Key Idea: Any hypothesis extended from a given hypothesis has a lower probability than the hypothesis it was extended from.
Key Idea: Get a candidate final transcript by making the locally optimal choice at each distinct audio embedding $t$ and text embedding $u$ (see the sketch after the steps below).
Loop through, get different softmax probabilities:
Step 1:
Step 4:
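A rough sketch of greedy RNN-T decoding (the `joint` function here is a mock standing in for the real joint network, which combines the audio encoding at frame $t$ with the prediction-network state at text position $u$): at each $(t, u)$ take the argmax; emitting blank advances to the next audio frame, emitting a token advances the text position.

```python
import numpy as np

BLANK = 0        # index of the blank symbol (assumption for this sketch)
VOCAB_SIZE = 5
T_FRAMES = 6

rng = np.random.default_rng(0)
# Mock joint-network outputs: log-probabilities for each (audio frame, text state) pair.
log_probs = np.log(rng.dirichlet(np.ones(VOCAB_SIZE), size=(T_FRAMES, 4)))

def joint(t, u):
    """Mock joint network; a real one would use the encoder and prediction-network states."""
    return log_probs[t, min(u, 3)]

def greedy_rnnt_decode(max_symbols_per_frame=3):
    hypothesis, t, u, emitted_this_frame = [], 0, 0, 0
    while t < T_FRAMES:
        token = int(np.argmax(joint(t, u)))
        if token == BLANK or emitted_this_frame >= max_symbols_per_frame:
            t += 1                     # blank (or cap hit): consume the next audio frame
            emitted_this_frame = 0
        else:
            hypothesis.append(token)   # non-blank: extend the transcript, stay on this frame
            u += 1
            emitted_this_frame += 1
    return hypothesis

print(greedy_rnnt_decode())
```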
Key Idea: Obtain $n$ candidate hypotheses at time frame $t+1$ before moving on from decoding at $t$. This expands the search space compared to greedy decoding, but without an infinite number of candidates; $n$ is a hyperparameter.
Important: The probabilities at each time step are multiplied together to get the total probability of a sequence.
Key Idea: Use utterance-specific context words along with the audio during RNN-T training or inference to improve recognition of rare words.
Implicit boosting: Change training to make the model aware of context words.
Key Idea: Create a biasing module using a trie built over all the context words; if the unfinished word in the decoding history matches a prefix in the trie, boost that hypothesis (see the sketch below).
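A toy sketch of the trie check (the word list and boost value are made up): build a trie over the context words, and when the current partial word in the decoding history is a prefix of some context word, add a boost to that hypothesis's score.

```python
def build_trie(words):
    """Nested-dict trie over context words, e.g. {'a': {'n': {'n': {'a': {'$': True}}}}}."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # marks the end of a complete context word
    return root

def matches_prefix(trie, partial_word):
    """True if the unfinished word so far is a prefix of some context word."""
    node = trie
    for ch in partial_word:
        if ch not in node:
            return False
        node = node[ch]
    return True

context_trie = build_trie(["anna", "anshuman", "zurich"])  # hypothetical contact names
score = -4.2                                               # hypothetical hypothesis log-score
if matches_prefix(context_trie, "ans"):                    # partial word in the hypothesis
    score += 1.5                                           # hypothetical boost
```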
Explicit boosting: Boost hypotheses that contain contextual words during decoding (greedy or beam).
Key Idea: Build a personalized LM using a WFST and boost hypotheses containing a context word from the personalized LM. No need to change training.
Key Idea: Project context words into a new embedding space (context embeddings), then use attention to attend between this embedding space and the output from the RNN-T.
References and Links: