CS7643 Deep Learning - Module 3 (Structured Neural Representations)

Lesson 11: Introduction to Structured Representations

Lesson 12: Language Models

Evaluating LM Performance

Per-word cross-entropy: Cross-entropy averaged over all the words in a sequence.

  • Reference distribution (P*) is the empirical distribution of the words in the sequence.

Perplexity: Geometric mean of the inverse probability of a sequence of words.

  • The formula can be rewritten in terms of the per-word cross-entropy: the log of the perplexity equals the per-word cross-entropy.

Intuition: Perplexity of a discrete uniform distribution of $k$ events is $k$.

  • Flip a fair coin: 1/2 chance of heads. Perplexity is 2.
  • Roll a fair die: 1/6 chance of each face. Perplexity is 6.
  • The lower the perplexity, the better the LM (see the sketch below).
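
A minimal numeric sketch of the relationship between per-word cross-entropy and perplexity, using the fair-coin intuition above (the probabilities are made up for illustration):

```python
import numpy as np

# `probs` are the (hypothetical) probabilities the LM assigned to each
# observed word in a sequence -- here, a fair coin "flipped" three times.
probs = np.array([0.5, 0.5, 0.5])

cross_entropy = -np.mean(np.log2(probs))   # per-word cross-entropy in bits
perplexity = 2 ** cross_entropy            # geometric mean of inverse probabilities

print(cross_entropy)   # 1.0 bit per word
print(perplexity)      # 2.0 -- matches the fair-coin intuition
```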

Teacher forcing: During training, the model is fed the true target sequence rather than its own generated output. That is, at each time step the model receives the correct token from the target sequence as input, instead of the token it generated at the previous time step.
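
A minimal sketch of teacher forcing with a toy GRU language model; the vocabulary size, token ids, and module layout below are hypothetical, not from the lecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim, hidden_dim = 10, 8, 16
embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.GRUCell(emb_dim, hidden_dim)
to_logits = nn.Linear(hidden_dim, vocab_size)

target = torch.tensor([1, 4, 2, 7, 0])    # ground-truth token ids
h = torch.zeros(1, hidden_dim)
loss = torch.tensor(0.0)
for t in range(len(target) - 1):
    # Teacher forcing: feed the TRUE token target[t], not the model's own
    # prediction from the previous step.
    h = rnn(embed(target[t:t + 1]), h)
    logits = to_logits(h)
    # The model is trained to predict the next true token target[t+1].
    loss = loss + F.cross_entropy(logits, target[t + 1:t + 2])
loss = loss / (len(target) - 1)
```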

Masked Language Models

Key Idea: Masked language modeling (MLM) is a pretraining task: it is not the final task itself, but it helps us obtain better initial parameters for the model we will use on the final task.

Cross-lingual Masked Language Modeling

Key Idea: Pre-training on multiple languages allows the model to perform downstream tasks in multiple languages.

Knowledge distillation (KD)

Definition: Knowledge distillation is a technique in machine learning where a smaller model (student) learns to mimic the predictions of a larger, more complex model (teacher) by transferring its knowledge. The goal is to compress the teacher model's knowledge into a smaller model, allowing for efficient deployment and reduced computational requirements while maintaining or improving performance.

How:

  • Soft labels from the teacher give more signal than the hard labels from the target alone.

KD Loss

Losses:

  1. Teacher-student loss: Cross-entropy (or KL divergence) between the teacher's and the student's prediction scores.
  2. Student loss: Cross-entropy between the student's predictions and the ground truth.
  3. Combined loss: The two are added together (typically as a weighted sum) to get the combined loss, as sketched below.
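
A minimal sketch of the combined KD loss; the temperature `T` and mixing weight `alpha` are hypothetical hyperparameters, not values from the notes:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Teacher-student loss: KL divergence between temperature-softened distributions.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Student loss: ordinary cross-entropy against the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    # Combined loss: weighted sum of the two terms.
    return alpha * distill + (1 - alpha) * ce

student_logits = torch.randn(4, 10)   # batch of 4, 10 classes (toy values)
teacher_logits = torch.randn(4, 10)
labels = torch.tensor([1, 3, 0, 7])
print(kd_loss(student_logits, teacher_logits, labels))
```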


Lesson 13: Embeddings

Word Embeddings

Distributional semantics: A word's meaning is given by the words that frequently appear close-by

  • When a word $w$ appears in a text, its context is the set of words that appear nearby (within a fixed-size window)
  • Use many contexts of $w$ to build up a representation of $w$.

Neural Word Embeddings

"A Neural Probabilistic Language Model" Bengio, 2003

  • Associates each word in the vocabulary with a distributed word feature vector.
  • Uses a neural network model to predict the next word.

"A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning" Collobert & Weston, 2008 & "Natural Language Processing (Almost from Scratch)" Collobert et al., 2011

  • Used CNNs to create word embeddings
  • The word vectors and the other NN parameters are trained on several tasks (language modeling, named entity recognition, POS tagging) that share the same word vectors; this was shown to be useful for all tasks.
  • Looked at nearest neighbors of the vectors and found that nearby words share syntactic and semantic properties.

"Efficient Estimation of Word Representations in Vector Space" Mikolov et al., 2013 (Word2vec paper)

Significance: Very efficient to compute, because word vectors are the only parameters learned in the model.

Collobert & Weston vectors

Significance: A word and its context is a positive training sample. A random word in that sample context gives a negative training sample.

  • NN built on top of word vectors scores the difference between two examples. Penalizes score of negative example.
  • Negative words are drawn at random from vocabulary.
  • Penalizes using a margin loss (like an SVM) - maximize the margin between the positive and negative example (sketched below).
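
A minimal sketch of the margin (hinge) ranking loss idea, with made-up scores for a positive and a negative window:

```python
# `score_pos` is the network's score for a real (word, context) window and
# `score_neg` the score for the same window with the word replaced by a random
# one; both values here are invented for illustration.
def margin_loss(score_pos, score_neg, margin=1.0):
    # Zero loss once the positive window outscores the negative by the margin.
    return max(0.0, margin - score_pos + score_neg)

print(margin_loss(score_pos=2.5, score_neg=0.4))  # 0.0 (margin satisfied)
print(margin_loss(score_pos=0.3, score_neg=0.2))  # 0.9 (penalized)
```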

Word2vec: Skip-gram model

Significance: Use words to predict their context words.

  • Context = fixed window of size $2m$

  • $w$ is the center word; the example uses a context window of size 2.
  • Its vector representation is used to predict what appears in the context of $w$.
  • Shift the context window to the next center word ($w+1$) and do the same thing.

Skip-gram Objective Function

  • Objective function: Average negative log-likelihood.
  • Theta $\theta$ are parameters to be optimized (word vectors) - simpler than earlier models, fast to compute.

Defining $P(w_{t+j} \mid w_t ; \theta)$

  • 2 sets of vectors for each word:
    1. $u_w$ when $w$ is center word
    2. $v_o$ when $o$ is context word
  • When inner product is bigger, the $w$ is more likely to appear with context word $o$ and vice versa.
  • Uses softmax - probability sums up to 1.
  • Do SGD on the softmax function
  • Problem: Large vocab size - the softmax denominator sums over the whole vocabulary, which is computationally expensive (see the sketch after this list). Solutions:
    1. Hierarchical softmax
    2. Negative sampling
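
A minimal sketch of the skip-gram softmax $P(o \mid w)$ with the notation above ($u_w$ for center words, $v_o$ for context words), using toy random vectors; it makes explicit why the full softmax is expensive:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50
U = rng.normal(size=(vocab_size, dim))   # center-word vectors u_w
V = rng.normal(size=(vocab_size, dim))   # context-word vectors v_o

def p_context_given_center(o, w):
    scores = V @ U[w]                    # inner product of u_w with every context vector
    scores -= scores.max()               # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the whole vocab
    return probs[o]

# The denominator sums over the entire vocabulary -- this is the expensive part
# that hierarchical softmax or negative sampling avoids.
print(p_context_given_center(o=42, w=7))
```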

Other word embeddings

GloVe: Global Vectors - Generates word embeddings by capturing global statistical patterns of word co-occurrences in large text corpora.

fastText: sub-word embeddings - Word representations generated by considering smaller textual units like character n-grams, enabling the model to capture morphological and semantic similarities for words, even for those not seen during training. Available for more than 200 languages.

(Elmo, BERT covered in later lectures)

Common ways to use embeddings:

  • Use it as features in downstream ML models
  • Initialization of NN models for NLP tasks (RNN / LSTM)

How to evaluate word embeddings

2 categories: Intrinsic and extrinsic.

Intrinsic - evaluated on a specific / intermediate subtask.

  • Fast to compute
  • Not clear if really helpful unless correlation to real task is established

Extrinsic - Evaluation on a real task (e.g. text classification)

  • Takes longer to compute
  • Unclear whether the subsystem itself or its interaction with the rest of the system is the problem
  • If replacing exactly one subsystem with another improves accuracy - then it is better.

Intrinsic eval - Word Analogy Task

The classic "man:woman, king:?" task.

  • Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions.
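
A minimal sketch of the analogy evaluation via vector arithmetic plus cosine similarity; the embeddings here are random stand-ins, so only trained vectors would actually return "queen":

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["man", "woman", "king", "queen", "apple"]
E = {w: rng.normal(size=50) for w in vocab}   # toy embeddings

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = E["king"] - E["man"] + E["woman"]
# Exclude the three query words themselves, as is standard for this task.
candidates = [w for w in vocab if w not in ("man", "woman", "king")]
answer = max(candidates, key=lambda w: cosine(query, E[w]))
print(answer)   # with trained embeddings this should be "queen"
```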

Graph Embeddings

Examples of graph data:

Graph embedding & Matrix completion

Idea: Given a matrix of rows (people) and items, predict if someone will like an unknown item.

Definitions:

  • Embedding: A learned map from entities to vectors of numbers that encodes similarity.
  • Word embedding: Learn features such that the features of a word are similar to those surrounding it.
  • Graph embedding: Learn features such that connected nodes are more similar than unconnected nodes
  • These embeddings are learned by optimizing with SGD.

Why graph embeddings:

  1. Can be used as task-agnostic entity representations.
  2. Features are useful on downstream tasks without much data
  3. Nearest neighbors are semantically meaningful.

Multi-relation Graph

  • Different colors represent different relations.
  • Minimize margin loss similar to Collobert & Weston word embeddings (positive vs negative samples).
  • Score for edge - similarity between source embedding $\theta_s$ and transformed version of the destination embedding $\theta_r + \theta_d$ (former is relation vector, latter is destination vector).
  • Negative examples created by taking real edge and replacing src / destination with a random node.
  • Loss can be optimized using SGD

Example:

  • We want vector $B$ and transformed vector of $C$ to be close, while $B$'s distance with $A$ and $E$ to be large.
  • Do the same thing for $B$ to $C$ and negative edges $A$ to $C$ and $D$ to $C$.

Type of relations

  1. Identity: All edge types share the identity transformation; node vectors are compared directly.
  2. Translator: A vector represents each edge type; add it to the node vector.
  3. Affine: An affine transformation (a matrix $A$ plus a translator vector $\delta$) for each edge type; multiply $A$ with the node vector and add the translator.
    • Allows more complicated transformations than the translator model.
    • More parameters to learn.
  4. Diagonal: A vector $b$ for each edge type; multiply each dimension of $b$ with the corresponding dimension of the node vector $x$ (all four sketched below).
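
A minimal sketch of the four relation operators applied to a destination-node embedding, with toy vectors and matrices:

```python
import numpy as np

dim = 4
x = np.arange(dim, dtype=float)          # destination node embedding
delta = np.ones(dim)                     # translation vector for an edge type
A = np.eye(dim) * 2.0                    # affine matrix for an edge type
b = np.array([1.0, 0.5, 2.0, 0.1])       # diagonal vector for an edge type

identity = x                             # 1. Identity: no transformation
translation = x + delta                  # 2. Translator: add the relation vector
affine = A @ x + delta                   # 3. Affine: matrix multiply, then translate
diagonal = b * x                         # 4. Diagonal: elementwise scaling

# The edge score is then a similarity (e.g. a dot product) between the source
# embedding and the transformed destination embedding.
source = np.ones(dim)                    # toy source node embedding
print(source @ translation)
```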

Embedding a Knowledge Base

(from Bordes et al, 2013 - https://papers.nips.cc/paper_files/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf)

High-level overview:

  1. Subgraph from Freebase (developed by Google, like WikiData), with entities (e.g. Clooney) and relations between them
  2. We can train a graph embedding on the KG; then for a question (e.g. "Who did Clooney marry in 1987"), we also have an embedding of the question through the word-embedding table. We can then detect entities in the question and find the corresponding entity in the Freebase subgraph.
  3. Relevant entity for the Q is found through entity embeddings.

WikiData

  • Huge graph: 55 million entities, 3 billion edges, 5000 different relations.
  • Visualization: Top 500 most popular entities via t-SNE show meaningful clusters (countries, journals, jobs).

Optimized negative sampling

  • In order to train embeddings quickly on a big graph like WikiData, we need to optimize.
  • Problem: Due to large scale of data, most of training time dominated by computing scores for "fake edges" during negative sampling.
  • Solution 1: Corrupt a sub-batch of 100 edges with the same set of random nodes.
    • Pro: Reduces random sampling cost (100 edges = reduce RAM by 100x)
    • If similarity metric is dot product, a batch of negatives can be computed via matmul.
  • Solution 2 (a bit unrelated): Distributed training of graph embeddings
    • Pro: Enables us to train giant graph embeddings.
  • More info on PyTorch BigGraph (Github link)

Applications, world2vec

TagSpace / PageSpace

TagSpace task: Given input text (e.g. restaurant review), predict hashtag.

  • Implementation: Like word embeddings - In addition to input text, have edges between words and hashtags. Calculate likelihood of hashtags appearing in same post and learn embeddings for both of them. Use embeddings to label posts and do hashtag clustering.

PageSpace task: Given user-page pairs, recommend relevant pages.

  • Implementation: Like predicting missing edges in graph. Embed pages to form a space of page embeddings. Use embeddings to cluster pages or recommend pages to users.
  • Pro: Don't need to embed every user. Addresses cold-start problem.
    • Cold-start: New user without much information, harder to train good quality embeddings.

More info on StarSpace (Github link)

Example for PageSpace:

  • Nearest neighbor for NYT gives similar newspaper companies.
  • Faiss: A library for efficient similarity search and clustering of dense vectors.

Example for VideoSpace:

  • Use multiple embeddings for classification / recommendation.

World2Vec

Idea: Embed everything into embedding space.

  • Useful for understanding which posts users are interested in and which topic a post belongs to.
  • Can be used for tasks with little labeled data, e.g. given 10 videos in French, find more videos in French.

Power of Universal Behavioral Features

Applications of world2vec:

  • Users: Cluster "bad actors"
  • Groups: Identify "for sale" groups
  • Pages: Recommendation, page category prediction, spam / hate page
  • Domains: Domain type prediction, identify mis-information.
  • Note: Graph embeddings don't know what they're looking for, so still need to add a model on top of the features. But can train with any kind of ML algo on top of these features (NN, RF) and can combine it with other features.

Summary of world2vec:

  1. Use unsupervised graph embeddings + convert them to node embeddings.
  2. Combine with other features (content, counters, etc)
  3. Use it as extra feature for downstream tasks (ranking, class, clustering, etc).

Technical details:

  1. Preprocessing: Preparing the graph (id, src, dest, relation type)
  2. PyTorch-BigGraph workflow: More preprocessing (sharding), training; the output is a database of IDs and their embeddings.
  3. Downstream tasks: Nearest neighbor retrieval, visualization, use as features, etc.

Handling large-scale computations:

Idea: Partition the graph using matrix blocking. Nodes divided uniformly into N shards.

  • Pro: Handle one chunk at a time - only 2 partitions of the model held in memory, others swapped to disk. Saves memory usage.
  • Pro: Buckets with disjoint partitions can be trained simultaneously without synchronizing model params.
  • Because shared parameters are relatively small compared to the size of the embedding tables, training scales linearly with the number of machines used (40 machines, 40x speedup).

ChatGPT:

Matrix blocking involves dividing the adjacency matrix (or other relevant matrices) into smaller blocks and processing these blocks separately. By breaking down the large matrix into smaller, manageable chunks, parallel processing and memory optimization techniques can be applied, making computations more efficient and scalable for large-scale graphs in GNNs.

Embeddings: Additional Topics

Ideas:

  • Moving beyond euclidean distance

Hyperbolic Embeddings

Nickel and Kiela, 2017, NIPS

ChatGPT summary of paper:

The paper "Poincaré Embeddings for Learning Hierarchical Representations" by Nickel and Kiela introduces a method for learning hierarchical representations of data using hyperbolic space, specifically the Poincaré ball model. Traditional embedding methods struggle with capturing hierarchical relationships effectively. The authors propose Poincaré Embeddings, a technique that maps data points into hyperbolic space, preserving hierarchical structures better than Euclidean embeddings. This approach utilizes the Poincaré ball model's unique properties to represent data hierarchies more accurately, making it particularly useful for tasks involving structured data, such as taxonomies and ontologies. The paper demonstrates that Poincaré Embeddings outperform existing methods in capturing hierarchical relationships and offer a valuable tool for various applications requiring the understanding of hierarchical structures within data.

What makes hyperbolic space special?

Hyperbolic space's inherent curvature properties allow it to naturally model hierarchical structures, making it a better choice for representing complex, nested relationships in various applications

In hierarchical structures, such as trees or graphs, entities often have varying degrees of similarity or relatedness. Hyperbolic space can capture these hierarchical relationships by assigning more space (distance) to represent the relationships between entities that are farther apart in the hierarchy.

Bias in Word Embeddings

Bolukbasi et al, "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings", NeurIPS 2016

  • Paper
  • Key idea: Datasets contain bias, and models trained on them will amplify existing discriminatory and unfair behaviors.
  • Paper shows biased analogies.
  • Debias technique:
    1. Identify gender subspace with gendered words.
    2. Project words onto this subspace.
    3. Subtract those projections from the original word.
    4. Problem - Not that effective, bias pervades the word embedding space and isn't just a local property of a few words.

Lesson 14: Neural Attention Models

Softmax and Preview of Attention

Key Ideas:

  • There are lots of ways to implement attention
  • This one uses softmax function
  • Softmax does not depend on the order of its inputs: permuting the logits simply permutes the outputs.
  • If logits are doubled, we would put more mass on the maximum output.

Relation to Attention:

  • Given a set of vector $\left\{ u_1 \ldots u_N \right\}$ and a query vector $q$
    • Take inner product between each vector and $q$
  • We can select the most similar vector to $q$ via $p = \text{softmax}(Uq)$
  • This gives a method of differentiably selecting a vector from a set.
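
A minimal sketch of this differentiable selection, $p = \text{softmax}(Uq)$, with toy vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 8))             # set of vectors u_1 ... u_N (rows)
q = U[2] + 0.1 * rng.normal(size=8)     # query vector, close to u_3

logits = U @ q                          # inner product of each u_i with q
p = np.exp(logits - logits.max())
p /= p.sum()                            # p = softmax(Uq)

print(p.round(3))                       # most of the mass lands on index 2
print(p @ U)                            # softly "selected" vector (weighted average)
```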

Softmax attention over vectors

Softmax Attention and MLP Outputs

In MLP:

  • In traditional MLP, the $q$ is the last hidden state. $\left\{ u_1 \ldots u_N \right\}$ is the embeddings of class labels.
  • Samples from the distribution correspond to labels (ie. outputs)

In Softmax Attention:

  • $q$ is an internal hidden state. $\left\{ u_1 \ldots u_N \right\}$ is the embeddings of an input (e.g. previous layer).
  • Distribution correspond to a summary of $\left\{ u_1 \ldots u_N \right\}$

Attention

Key ideas:

  • Attention is a distribution over the input that depends on the computational state and the inputs themselves.
  • Attention can be either hard or soft:
    1. Hard: Samples are drawn from the distribution over the input (ie. explicit)
    2. Soft: Distribution is used directly as a weighted average
  • Attention is typically implemented using softmax.
  • "Saccade" - eye movement directing attention.
  • Attention in NLP - alignment in machine translation.

Current approach to attention

Using softmax:

  1. Input to the layer a set of vectors $u_N$
  2. Inner product each with a controller $q$, which represents the layer's state.
  3. Take softmax of that set of numbers which gives $a_N$
  4. Output is sum of inputs with the weights given by the softmax.

Notes:

  • Attention with "keys" and "values": The input representation is split into key-value pairs.
    • Keys: Used to find softmax weights.
    • Values: Used to compute the output.
  • For Attention layers, representational power grows with the size of the input.
    • Why? With softmax, the hidden state grows with the input.

Multi-layer Soft Attention

Key Idea: Reasoning over a set of inputs. Attentions placed on different tokens across different inputs to perform inductive reasoning.

Including Geometric Information

Key Idea: If input has an underlying geometry (ie. structure) we can include that as an additional encoding.

  • Ex: For sequential data, use position encoding:
    1. For each input $m_j$ add to it a vector $l(j)$
    2. $l(j)$ can be fixed during training, or learned.
    3. e.g. sinusoid of varying frequency in each coordinate.
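
A minimal sketch of a sinusoidal position encoding $l(j)$ added to each input $m_j$; the exact frequency schedule (the common $10000^{2i/d}$ form) is an assumption:

```python
import numpy as np

def position_encoding(num_positions, dim):
    pos = np.arange(num_positions)[:, None]        # position j
    i = np.arange(dim // 2)[None, :]
    freq = 1.0 / (10000 ** (2 * i / dim))          # sinusoids of varying frequency
    enc = np.zeros((num_positions, dim))
    enc[:, 0::2] = np.sin(pos * freq)
    enc[:, 1::2] = np.cos(pos * freq)
    return enc

m = np.random.randn(10, 16)                        # 10 inputs m_j, dimension 16
m_with_pos = m + position_encoding(10, 16)         # add l(j) to each m_j
```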

Transformers

Key Ideas: 3 things that make transformers special:

  1. Multi-query hidden-state propagation ("self-attention")
  2. Multi-head attention
  3. Network stability techniques: Residual connections, LayerNorm

Self-attention

Key Idea: Self-attention expands on controller state of softmax attention by including it for every input.

  • The size of the controller state increases with the size of the input, similar to the hidden states

Transformers diagram:

  • $U^j$ are individual controller states. Each is updated with softmax attention.

Multi-head Attention

Key Idea: Multi-head attention splits controller states into $N$ chunks ("heads"), operates self-attention separately, then recombines them with a FC layer.

Multi-head attention diagram:

  1. Left nodes: Input controller states
  2. $G_i$: "Projections", take full state and map it down to smaller sub-states.
  3. $\text{Attention}(G_i)$: On each $G_i$, we run attention to get attention weights.
  4. Compute a weighted sum within each chunk using its attention weights.
  5. Take a FC network to recombine all the chunks.

Multi-head attention: Math diagram

  1. Single-head attention weights (bottom-left): Take inner product of each input with its controller state and take softmax
  2. Multi-head attention weights (bottom-right): Project controller state to $L$ different chunks and compute $L$ different set of weights.
  3. Single-head layer output (top-left): Sum weights against inputs
  4. Multi-head layer output (top-right): Sum each head independently, then apply FC network to recombine.
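
A minimal sketch of the multi-head computation; for brevity the projections $G_i$ are simple chunks (slices) of the state, matching the "chunks" description above, whereas real transformers use learned projection matrices:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, D, H = 6, 16, 4                      # 6 controller states, dim 16, 4 heads (toy sizes)
X = rng.normal(size=(N, D))             # controller states / inputs
W_out = rng.normal(size=(D, D))         # FC layer that recombines the heads

heads = []
for h in range(H):
    G = X[:, h * (D // H):(h + 1) * (D // H)]   # "projection": slice out a sub-state
    A = softmax(G @ G.T)                        # attention weights within this head
    heads.append(A @ G)                         # weighted sum within the head
Y = np.concatenate(heads, axis=-1) @ W_out       # recombine chunks with the FC layer
```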

Network Stability - Residual connections, LayerNorm

(This was skipped over in lecture)

Transformers for Text

Although transformers can operate on any set of vectors or graph, they were first introduced for handling text.

Some special considerations for text:

  1. Position encoding depending on the location of a token in the text
  2. For LMs: "causal attention" - Masking out weights that don't go from left to right.
  3. Training code outputs a prediction at each token and takes their gradients simultaneously.
    • Important, because you can parallelize transformer training. Big advantage over RNNs.
      • Note that this has hard-dependency on causal attention.
    • For Masked LMs (BERT) you don't need causal attention.

Guest lecture: Attention, Transformers, BERT and ViLBERT

Sequence-to-Sequence with RNNs

Key Ideas:

  • $c$ represents global context. Remains constant throughout sequence.
  • $s_i$ represents hidden states at each time step.
  • Problem: Input sequence bottlenecked through fixed-size vector

Sequence-to-Sequence with RNNs and Attention

Key Ideas:

  • Each hidden input ($h_i$) is combined with the initial decoder state $s_0$ to produce similarity scores ($e_{ij}$, "alignment scores" in the slide). These scores are unnormalized at this point.
    • Each score is a scalar.
    • The function used here is an MLP, but in practice a simple similarity such as cosine similarity works well.
  • We then take a softmax to get a probability distribution ($a_{ij}$). These are the "attention weights".
  • We compute the context vector $c_1$ as a linear combination of the attention weights ($a_{1,i}$) with the hidden states ($h_i$).
  • $c_1$ is used to create first output token $y_1$ plus new decoder state $s_1$.
  • Important: We don't supervise its attention, but rather let it learn this via backpropagation.
$$ \begin{align} \text{Alignment scores} &= e_{t,i} = f_{att}(s_{t-1} , h_i) \\ \text{Attention weights (softmax)} &= a_{t,i} = \frac{\exp(e_{t,i})}{\sum_{i'} \exp(e_{t,i'})}, \quad \sum_i a_{t,i} = 1 \\ \text{Context vector} &= c_t = \sum_i a_{t,i} h_i \\ \text{New decoder state} &= s_t = g_U (y_{t-1}, s_{t-1}, c_t) \end{align} $$

First step of decoding

N-steps of decoding

  • Attention shifts from relevant part of input to generate first token to those relevant for second token.

Attention Weights Visualized

Self-Attention - Generalizing Attention layers

Key Idea: Key and value matrices / vectors drive the self-attention mechanism.

  • Query vector asks, "which parts of the input sequence should I attend to?"
  • Key + value vectors construct a set of attention scores indicating how much focus each token in the input sequence should receive.
    • Key vectors capture properties of input tokens
    • Value vectors store data used to construct output sequence.
  • When computing attention scores, the query vector is compared with key vectors. Output determines how much focus each token should place on corresponding value vectors during the weighted sum operation.

Inputs:

  • State vector ($s_i$) and hidden vectors ($h_i$) are renamed to query vectors ($Q$) and input vectors ($X$).
    • $Q$: query vectors (Shape: $N_Q \cdot D_Q$)
    • $X$: input vectors (Shape: $N_X \cdot D_X$)
  • NEW: Key / Value matrices
    1. Key matrix: $W_k$ (Shape: $D_X \cdot D_Q$)
      • Keys compare inputs to queries.
    2. Value matrix: $W_v$ (Shape: $D_X \cdot D_V$)
      • Values represent knowledge returned back to the decoder.
    • These matrices are learned during training.
  • NEW: Query matrix: $W_Q$ (Shape: $D_X \cdot D_Q$)
  • Key + Value + Query matrices are linear projections of the input into different spaces

Computation:

  • The similarity function changed from $f_{att}$ to a scaled dot product: the dot product is divided by the square root of the query dimension (see slide).
  • Attention weights are still the same softmax
  • Output vectors incorporate attention weights ($A$) as well as input vector ($X$)
  • NEW: Key / value vectors
    1. Key vectors: $K = XW_k$ (Shape: $N_x \cdot D_Q$)
    2. Value vectors: $V = XW_V$ (Shape: $N_x \cdot D_V$)
  • NEW: Query vectors: $Q = XW_Q$
$$\begin{align} \text{Similarities} &= E = QX^T, \quad E_{i,j} = \frac{Q_i \cdot X_j}{\sqrt{D_Q}} \\ \text{Attention weights} &= A = \text{softmax}(E) \\ \text{Output vectors} &= Y = AX, \quad Y_i = \sum_j A_{i,j}X_j \\ \text{Key vectors} &= K = XW_K \\ \text{Value vectors} &= V = XW_V \\ \text{Query vectors} &= Q = XW_Q \end{align} $$
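
A minimal sketch of self-attention with learned query/key/value projections and the scaled dot product; shapes and weights below are toy values:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N_X, D_X, D_Q, D_V = 5, 8, 8, 8
X = rng.normal(size=(N_X, D_X))          # input vectors
W_Q = rng.normal(size=(D_X, D_Q))        # in practice these projections are learned
W_K = rng.normal(size=(D_X, D_Q))
W_V = rng.normal(size=(D_X, D_V))

Q = X @ W_Q                              # query vectors
K = X @ W_K                              # key vectors
V = X @ W_V                              # value vectors

E = Q @ K.T / np.sqrt(D_Q)               # scaled dot-product similarities
A = softmax(E, axis=-1)                  # attention weights
Y = A @ V                                # output vectors
```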

Simple Attention

Attention V1 with query vectors as input:

Self-attention

  • Key Idea: Get rid of Q vectors as input.
  • Only takes input vectors X, outputs Y.
  • Linear projection from X → Q is learned
    • Replace Q inputs with Q matrices that are now learned.
  • The rest is similar to V1
  • One query per input vector
  • Self-attention layer is Permutation Equivariant
    • If the inputs are permuted, the outputs are permuted the same way but keep the same values.
    • Significance: Self-attention doesn’t “know” the order of the vectors it is processing! Need to encode this.
    • Encoding (E) can be a lookup table or fixed function.

Q: Difference between hard vs soft attention?

  • Hard-attention selects argmax of softmax probability distribution, set that value to one. Doesn't pass the entire distribution.
  • "Hard" because it is not differentiable (the argmax yields a one-hot vector, i.e. a bunch of zeros).
  • Soft-attention uses the whole probability distribution to create output.

Masked Self-Attention Layer

Key Idea: Don't let vectors "look ahead" in the sequence.

  • Used for language modeling (predicting next token)
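
A minimal sketch of the causal mask: positions to the right of the diagonal get $-\infty$ before the softmax, so each token attends only to itself and earlier tokens (toy 4-token sequence):

```python
import numpy as np

N = 4
E = np.random.randn(N, N)                          # raw attention scores (toy values)
mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # True above the diagonal = "future"
E[mask] = -np.inf                                  # future positions get -inf

A = np.exp(E - E.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)                 # row-wise softmax
print(A.round(2))                                  # upper triangle is all zeros
```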

Multi-head Self-Attention Layer

Key Idea: Use $H$ independent attention heads in parallel. Process multiple chunks of attention.

  • Run them all in parallel, and then concatenate them downstream.

Summary: 3 ways of processing sequences

Transformers

Key Idea: Stack transformer blocks to get contextualized understanding of the input.

  • Encoder-Decoder structure is pretty much the same as RNNs we've seen before.

BERT

Key Idea: Remove decoder structure. Just encodes input into a rich, contextualized representation.

  • Stack a series of Transformer encoder blocks.
  • Trained in an unsupervised manner via "Masked Language Modeling" and "Next Sentence Prediction".
  • Take massive amounts of sentences and learn contextualized understanding of language.

GLUE benchmark as of 2023/10/22

  • BERT variants have surpassed the human baseline.

Q: How is it different from Word2Vec?

  • BERT uses attention mechanisms.

ViLBERT

Key Idea: Jointly process images and language data using attention + transformers to respond to multimodal tasks.

Lesson 15: Neural Machine Translation

Key Idea: At each decoding step, we keep $n$ candidates based on the "beam width".

Formula: Maximize the probability of sequence of tokens:

$$ argmax_y \prod\limits_{t=1}^{T_y} \text{ } P(y_t | x, y_1, \ldots y_{t-1}) $$

Which can also be represented as probability of sequence $y$ given $x$:

$$ P(y_1 \ldots y_{T_y} | x) = P(y_1|x) P(y_2|x,y_1) \ldots P(y_{T_y}|x,y_1 \ldots y_{T_y - 1}) $$

From GT lecture - beam search over a full sentence

Using the log

Key Idea: Multiplying the probabilities of a long sequence can lead to numerical underflow. Using logs prevents this.

New objective function - sum of log probabilities: $$ argmax_y \sum\limits_{t=1}^{T_y} \log P(y_t | x, y_1, \ldots y_{t-1}) $$

Length Normalization

Key Idea: Longer sequences accumulate more (negative) log-probability terms, so their total score is lower than that of shorter ones. This biases the objective function toward shorter sequences.

Solution: Normalize by the length of the sequence, which reduces the penalty on longer sequences.

  • Usually has an $\alpha$ param to control strength of normalization.
$$ argmax_y \frac{1}{T_y^\alpha} \sum\limits_{t=1}^{T_y} \log P(y_t | x, y_1, \ldots y_{t-1}) $$

Beam search keeps the top-$K$ sentences and evaluates them all using this objective function (see the sketch below).
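
A minimal sketch of length-normalized scoring for two hypothetical beam candidates (the log-probabilities are made up):

```python
def score(log_probs, alpha=0.7):
    # Sum of log P(y_t | x, y_<t), divided by (length ** alpha).
    return sum(log_probs) / (len(log_probs) ** alpha)

short = [-1.0, -1.2]                  # 2-token candidate
long = [-0.6, -0.7, -0.8, -0.7]       # 4-token candidate

print(sum(short), sum(long))          # raw sums favor the shorter candidate
print(score(short), score(long))      # normalized scores favor the longer one
```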

Inference Efficiency

What's expensive:

  1. Step-by-step computation (auto-regressive inference)
  2. Output projection: $O(\text{vocab size} \times \text{output length} \times \text{beam size})$
  3. Deeper models with large parameters

Strategies:

  1. Smaller vocabularies
  2. More efficient computation
  3. Reduce depth / increase parallelism

Vocabulary Reduction

Key Idea: Drop tokens that are unlikely given the source sequence and compute the output projection only over that subset.

Solutions:

  1. IBM alignment - Use statistical techniques to model the probability of one word being translated into another (alignment) using Expectation Maximization.
  2. Lexical probability - probability of a particular word or phrase in the target language given a specific word or phrase in the source language

Byte Pair Encoding

Key Idea: Algorithmically create your vocabulary rather than handling every single unique token.
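
A minimal sketch of one BPE-style merge step on a toy corpus; real BPE repeats this until a target vocabulary size is reached:

```python
from collections import Counter

corpus = [list("lower"), list("lowest"), list("newer")]   # toy corpus, split into characters

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs across all words.
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

pair = most_frequent_pair(corpus)   # ('w', 'e') in this toy corpus
print(pair)
print(merge(corpus, pair))
```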

Layer Drop

Key Idea: Dropping layers during training and inference

More details in Fan et al, "Reducing Transformer Depth on Demand with Structured Dropout"

Hardware Consideration

  1. Batched vs on-demand: average efficiency can be increased by translating multiple inputs at once.
  2. Parallel computation will benefit more from GPUs.
    • Wider models benefit more from this than deeper models.
    • Autoregressive seq-to-seq generation doesn't benefit as much.

Quantization

Key Idea: Speed up computation by operating in lower precision domain.

  • Quantization during training is also possible (e.g. FP16 on Nvidia GPUs).

Example: fbgemm

Non-Autoregressive Machine Translation

Key Idea: Predict all of the tokens at once.

  • Challenge: Harder to rely on prior tokens of sequence.

Lesson 16: Automated Speech Recognition

Modularized (Hybrid) ASR

Key Idea: Uses acoustic model to generate sound units, which the decoder maps to pronunciation model (phonemes -> words) plus language model (words -> sentence).

Acoustic Model

ASR and Machine Translation

Key Idea: A Listen-Attend-Spell system needs the entire audio stream in order to decode, so it can't be used in a streaming context.

ASR as a Seq-to-Seq Problem

Key Idea: Collapse multiple audio frames (ie. sequence) into letters instead of phonemes (like hybrid). Inherently creates alignment.

  • At each audio frame, the neural net generates the probability of emitting an output symbol, i.e. a letter.
  • LSTM, bi-LSTM and Transformers are commonly used audio encoders.

Connectionist Temporal Classification (CTC)

Key Idea: Handle continuous output of the same letter by using a "blank" symbol.
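
A minimal sketch of CTC-style collapsing: repeated symbols are merged and the blank symbol is removed (the frame outputs are made up, with "-" standing in for the blank):

```python
def ctc_collapse(frames, blank="-"):
    # Merge repeats, then drop blanks; a blank between two identical letters
    # preserves genuine doubled letters like "ll".
    out = []
    prev = None
    for s in frames:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

print(ctc_collapse(list("hh-e-ll-lloo")))   # "hello"
```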

Conditional Independence Assumption

Issue: CTC and LAS make a strong conditional independence assumption: each frame's output doesn't depend on the transcript produced so far.

RNN-T

Components

Training Objective

Key Idea: Objective is to maximize sum of probabilities of alignment given audio features.

  • We do not know the alignment ahead of time; it is summed over as part of the probability.

Alignment during training

Alignment in a lattice:

  • $t$ = audio encoder step, $u$ = text predictor step
  • Move to the next time step when the predictor emits a "blank" symbol.
  • Multiple different alignment paths can occur.

RNN-T Lattice and Training

Key Idea: Probability of alignment (path in lattice) is the product of all probabilities of edges.

  • Goal: Gradient descent to increase probabilities of valid alignments.

Decoding

Subset of hypotheses at $t=1$ during decoding

  • Challenge is that generating hypotheses can continue forever.
Extend Hypothesis

Key Idea: Any hypothesis which is extended from a given hypothesis would have lower probability than the hypothesis from which it was extended.

Greedy Decoding

Key Idea: Get a candidate final transcript by making local optimal choice at each distinct audio embedding $t$ and text embedding $u$.

Loop through the lattice, reading off the softmax probabilities at each step (the slides illustrate Steps 1 and 4).

Beam Search Decoding

Key Idea: Obtain $n$ candidate hypotheses at time frame $t+1$ before finishing decoding at frame $t$. This expands the search space relative to greedy decoding while avoiding an unbounded number of candidates; $n$ is a hyperparameter.

Important: The probabilities at each time step are multiplied together to get the total probability of a sequence.

RNN-T Personalization

Key Idea: Use utterance specific context words along with audio during RNN-T training or inference to improve recognition of rare words.

Biasing Module

Implicit boosting: Change training to make model aware of context words

Key Idea: Create a biasing module using a trie of all context words; if the unfinished word in the decoding history matches a prefix in the trie, boost it.

Shallow Fusion

Explicit boosting: Boost hypotheses that contain contextual words during decoding (greedy or beam).

Key Idea: Build a personalized LM using WFST, to boost hypothesis of a context word in the personalized LM. No need to change training.

Attention Module

Key Idea: Project context words into a new embedding space (context embeddings), then use attention to attend between this embedding space and output from RNN-T.

  • Passes weighted sum of contextual embeddings into the joiner layer, rather than predicted text.
  • Can use other modalities like audio or video.
  • Evaluation results show loss in precision but higher recall.
