COS 484

Natural Language Processing

Written live in classes with Prof. Karthik Narasimhan.

Last updated: 02/05.

Lecture 1

Introduction & Language Models

Natural language processing (NLP) is a rapidly evolving field that is becoming increasingly central to our daily lives as chatbots become more widely integrated.

NLP is an interdisciplinary field, building computer programs to analyze, understand and generate human language - either spoken or written.

Why is language difficult to understand? It can be ambiguous; there are dialects and accents; context and dependencies matter; there could be humour, sarcasm or irony etc.

The roadmap for this course is in four parts as below:

  • $N$-gram language models; text classification; word embeddings; neural nets for language.
  • Sequence models; recurrent neural nets; attention; transformers.
  • Large language models (pre-training, adaptation, post-training); language agents.
  • Systems for LLMs; evaluations and data; reasoning.

Language models

A language model is a probabilistic model of a sequence of words.

The joint probability distribution of words $w_1,w_2,\dots,w_n$ is $P(w_1,w_2,\dots,w_n)$, i.e. how likely is a given phrase, sentence, paragraph or even document.

To make computations more tractable, we apply the chain rule for probabilities, conditioning each word on the previous words.

\[P(w_1,\dots,w_n) = P(w_1)P(w_2\mid w_1)P(w_3\mid w_2,w_1)\times\dots\times P(w_n\mid w_1,\dots,w_{n-1})\]

Language models are everywhere - a common example being auto-complete on your phone, or auto-completion of query suggestions in a search engine.

For estimating probabilities, we can use the simple MLE of

\[P(w_3 \mid w_1,w_2) = \frac{\text{count}(w_1 w_2 w_3)}{\text{count}(w_1w_2)}.\]
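
As a minimal illustration of this count-based estimate, here is a Python sketch that tabulates trigram and bigram counts from a toy corpus (the corpus and names are made up for illustration):

```python
from collections import Counter

# Toy corpus; in practice this would be a large collection of text.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count trigrams (w1, w2, w3) and their bigram prefixes (w1, w2).
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_mle(w1, w2, w3):
    """MLE estimate P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)."""
    if bigram_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(p_mle("the", "cat", "sat"))  # 0.5: "the cat" is followed by "sat" once and "ate" once
```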

It is common to introduce a Markov assumption, which means using only the recent past to predict the next word, in order to simplify computations.

This reduces the number of parameters to estimate, at the cost of some modelling capacity.

For instance, the $k$-th order Markov assumption considers only the last $k$ words (or fewer) when conditioning.

\[\therefore \hspace{5pt} P(w_i\mid w_1,w_2,\dots,w_{i-1}) \approx P(w_i \mid w_{i-k},\dots,w_{i-1})\] \[\implies P(w_1,\dots,w_n)\approx \prod_{i}P(w_i\mid w_{i-k},\dots,w_{i-1})\]

This above is also called a $k$-gram model. In general a unigram model is $P(w_1,\dots,w_n)=\prod_{i=1}^nP(w_i)$; a bigram model is $P(w_1,\dots,w_n)=\prod_{i=1}^nP(w_i\mid w_{i-1})$, and so on.

In order to generate from a language model, we draw words from the underlying $k$-gram probability distribution.

For example, in a trigram model, $P(w_1,\dots,w_n)=\prod_{i=1}^nP(w_i\mid w_{i-2}, w_{i-1})$:

  • Generate the first word $w_1\sim P(w)$.
  • Generate the second word $w_2\sim P(w\mid w_1)$.
  • Generate the third word $w_3\sim P(w\mid w_1,w_2)$.
  • Generate the fourth word $w_4\sim P(w\mid w_2,w_3)$ etc.

Instead of simply sampling from the distribution, there are other generation methods. One is greedy generation, which chooses the most likely next word, so in a vocabulary $V$, given words $w_1,w_2$, $w_3=\arg\max_{w\in V}P(w\mid w_1,w_2)$.

There is also top-$k$ or top-$p$ sampling, which chooses a word from the $k$ most likely words, or from the smallest set of most likely words whose cumulative probability mass is at least $p$.
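
A rough sketch of these decoding rules, given an entirely made-up next-word distribution - the three strategies differ only in which subset of words we pick or sample from:

```python
import random

# Hypothetical next-word distribution P(w | w1, w2), for illustration only.
next_word_probs = {"mat": 0.4, "floor": 0.3, "roof": 0.15, "moon": 0.1, "banana": 0.05}

def greedy(probs):
    """Pick the single most likely next word."""
    return max(probs, key=probs.get)

def top_k_sample(probs, k=2):
    """Sample from the k most likely words (weights are renormalized automatically)."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words, ps = zip(*top)
    return random.choices(words, weights=ps)[0]

def top_p_sample(probs, p=0.8):
    """Sample from the smallest set of words whose cumulative probability is >= p."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, cum = [], 0.0
    for w, pw in items:
        chosen.append((w, pw))
        cum += pw
        if cum >= p:
            break
    words, ps = zip(*chosen)
    return random.choices(words, weights=ps)[0]

print(greedy(next_word_probs), top_k_sample(next_word_probs), top_p_sample(next_word_probs))
```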

Evaluating a language model

In order to evaluate a language model, we introduce an evaluation metric known as perplexity. It is a measure of how well an LM predicts the next word.

For a test corpus with words $w_1,\dots,w_n$, the perplexity is $\text{ppl} = P(w_1,\dots,w_n)^{-1/n}$. This can equivalently be expressed using logs.

\[\begin{align*} \text{ppl} = 2^x,\text{ where }x&=-\frac{1}{n}\log_2P(w_1,\dots,w_n) \\ &= \underbrace{-\frac{1}{n}\sum_{i=1}^n\log_2 P(w_i\mid w_1,\dots,w_{i-1})}_{\text{cross-entropy}} \end{align*}\]

If our $k$-gram model (with vocabulary $V$) has uniform probability, i.e. $P(w\mid w_{i-k},\dots,w_{i-1})=\frac{1}{\lvert V\rvert}$ $\forall w\in V$, then its perplexity is $\text{ppl} = 2^{-\frac{1}{n}\cdot n\log_2(1/\lvert V\rvert)} = \lvert V\rvert$.
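
A minimal sketch of this computation for a bigram-style model, assuming we already have some conditional probability function `prob(prev, w)` (the function and toy setup are hypothetical); the uniform-model sanity check at the end reproduces the $\lvert V\rvert$ result above:

```python
import math

def perplexity(test_words, prob):
    """ppl = 2^(-1/n * sum_i log2 P(w_i | history)), here with a bigram model
    prob(prev_word, word); start-of-sentence handling is ignored for simplicity."""
    log_prob = sum(math.log2(prob(test_words[i - 1], test_words[i]))
                   for i in range(1, len(test_words)))
    return 2 ** (-log_prob / (len(test_words) - 1))

# Sanity check: a uniform model over a vocabulary of size |V| has perplexity |V|.
V = 1000
uniform = lambda prev, w: 1.0 / V
print(perplexity(["a"] * 50, uniform))  # 1000.0
```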


Lecture 2

Smoothing

Smoothing is a technique that handles sparsity by ensuring all probabilities in our model are non-zero.

When we have sparse statistics, i.e. some words have a probability of zero of following a certain sequence, we can ‘steal’ probability mass from the words with non-zero probability and allocate it to the zero-count words, so the model generalizes better.

The simplest form of smoothing is Laplace smoothing, AKA add-$\alpha$ smoothing, which quite literally just adds $\alpha$ to all counts and then renormalizes.

Before smoothing, the MLE for bigrams is (where $C$ is the usual ‘$\text{count}$’ function),

\[P(w_i\mid w_{i-1}) = \frac{C(w_{i-1},w_i)}{C(w_{i-1})}.\]

After smoothing, this becomes,

\[P(w_i\mid w_{i-1}) = \frac{C(w_{i-1},w_i)+\alpha}{C(w_{i-1})+\alpha\lvert V\rvert}.\]
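
A minimal sketch of add-$\alpha$ smoothing for bigrams (toy corpus for illustration); note that the unseen bigram still gets non-zero probability:

```python
from collections import Counter

corpus = "the cat sat on the mat".split()   # toy corpus
vocab = set(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def p_smoothed(prev, w, alpha=1.0):
    """P(w | prev) = (C(prev, w) + alpha) / (C(prev) + alpha * |V|)."""
    return (bigram_counts[(prev, w)] + alpha) / (unigram_counts[prev] + alpha * len(vocab))

print(p_smoothed("the", "cat"))  # seen bigram:   (1 + 1) / (2 + 5) ≈ 0.29
print(p_smoothed("the", "sat"))  # unseen bigram: (0 + 1) / (2 + 5) ≈ 0.14
```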

Text Classification

e.g. given an email, classify whether or not it is spam.

Formally, the inputs to the classification model are a document $d$ and a set of $m$ classes $C$. Our model outputs the predicted class $c\in C$ for document $d$.

For example, sentiment classification of positive/negative for a document.

While we can have rule-based text classification (e.g. if the email address ends in a specific URL, output spam), supervised learning is more commonly used nowadays - let the machine figure out the best patterns using data.

The inputs to the supervised learning process are:

  • Set of $m$ classes $C$.
  • Set of $n$ ‘labeled’ documents: $\{(d_1,c_1), (d_2,c_2),\dots,(d_n,c_n)\}$, $d_i\in\mathcal{D}$, $c_i\in C$.

The output is a trained classifier $F:\mathcal{D}\to C$.

Types of supervised classifiers include naive Bayes, logistic regression, support vector machines, and neural networks; they differ in what the function $F$ is and how the model ‘learns’ that function $F$.

Naive Bayes classifier

A simple classification model that makes use of Bayes’ rule. For document $d$ and class $c$,

\[P(c\mid d) = \frac{P(c)P(d\mid c)}{P(d)}.\]

MAP is ‘maximum a posteriori’ estimate, i.e. the most likely class.

\[\begin{align*} c_{MAP} &= \arg\max_{c\in C} P(c\mid d) = \arg\max_{c\in C} \frac{P(d\mid c)P(c)}{P(d)} \\ &= \arg\max_{c\in C} P(d\mid c)P(c) \end{align*}\]

This is the prior probability of the class (which class is most likely to occur in general), multiplied by the probability of generating document $d$ from that class.

The next question is: how do we calculate/represent $P(d\mid c)$, where $d=w_1,w_2,\dots,w_K$?

One option would be to model the entire sequence of words $P(w_1,\dots,w_K\mid c)$, but there are far too many possible sequences to estimate this directly!

So, alternatively we use a bag of words:

\[P(w_1,w_2,\dots,w_K\mid c) = P(w_1\mid c)P(w_2\mid c)\dots P(w_K\mid c),\]

where we assume the position of each word doesn’t matter, and the probability of each word is conditionally independent of the other words given class $c$.

Thus, to predict with naive Bayes, we get, \(c_{MAP} = \arg\max_{c\in C} P(c)\prod_{i=1}^K P(w_i\mid c).\)

(Sometimes, you can $\log$ the expression to be maximized to change the product to a sum.)

To estimate the probabilities, given a set of $n$ ‘labeled’ documents $\{(d_1,c_1),\dots,(d_n,c_n)\}$, we form the maximum likelihood estimates:

\[\hat{P}(c_j) = \frac{\text{Count}(c_j)}{n},\quad \hat{P}(w_i\mid c_j) = \frac{\text{Count}(w_i, c_j)}{\sum_{w\in V}\text{Count}(w,c_j)}\]

However, this leads to a data sparsity problem. What if \(\text{count('fantastic', positive)}=0 \implies \hat{P}(\text{'fantastic'}\mid\text{positive})=0\)? Then the whole product in the $\arg\max$ for $c=\text{positive}$ becomes $0$. The solution to this is smoothing! Hence replace $\hat{P}(w_i\mid c_j)$ with,

\[\hat{P}(w_i\mid c_j) = \frac{\text{Count}(w_i, c_j) + \alpha}{\sum_{w\in V}\text{Count}(w,c_j) + \alpha\lvert V\rvert}.\]

Thus, for prediction, given document $d=(w_1,\dots,w_K)$,

\[c_{MAP} = \arg\max_{c} \hat{P}(c)\prod_{i=1}^K\hat{P}(w_i\mid c).\]
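
Putting the estimation and prediction steps together, here is a minimal naive Bayes sketch with add-$\alpha$ smoothing and log probabilities (the tiny labeled dataset is made up for illustration):

```python
import math
from collections import Counter, defaultdict

# Toy labeled documents (illustrative only).
train = [("fun couple love love", "pos"), ("couple fly fast fun fun", "pos"),
         ("fast furious shoot", "neg"), ("furious shoot shoot fun", "neg"),
         ("fly fast shoot love", "neg")]

vocab = {w for doc, _ in train for w in doc.split()}
class_counts = Counter(c for _, c in train)
word_counts = defaultdict(Counter)              # word_counts[c][w] = Count(w, c)
for doc, c in train:
    word_counts[c].update(doc.split())

def predict(doc, alpha=1.0):
    """argmax_c [ log P_hat(c) + sum_i log P_hat(w_i | c) ] with add-alpha smoothing."""
    scores = {}
    for c in class_counts:
        score = math.log(class_counts[c] / len(train))              # log prior
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        for w in doc.split():
            if w in vocab:                                          # skip unknown words
                score += math.log((word_counts[c][w] + alpha) / denom)
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("fast couple shoot fly"))
```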

The pros of naive Bayes are that it’s fast, has low storage requirements, and works well with very small amounts of training data. However, the cons are that the independence assumption is too strong, and it doesn’t work well when the classes are highly imbalanced.

Binary naive Bayes: for tasks like sentiment, word occurrence may be more important than frequency, e.g. the occurrence of the word ‘fantastic’ tells us a lot; the fact that it occurs $5$ times may not tell us much. Hence, the solution is to clip the word count at $1$ in every document. N.B.: counts can still be 2, 3 etc., since binarization is within-document (the clipped counts are then summed across documents)!

Logistic regression

Popular alternative to naive Bayes in text classification.

Naive Bayes is a generative model ($\arg\max_{c\in C} P(d\mid c)P(c)$), whereas logistic regression is a discriminative model ($\arg\max_{c\in C} P(c\mid d)$).

Discriminative classifiers just try to distinguish (or discriminate) class $A$ from class $B$; they don’t care about modelling each class thoroughly. E.g. to distinguish dogs vs. cats, the model might just learn that dogs have collars and cats do not (and ignore everything else).

The overall process for discriminative classifiers, given a set of labelled documents $\{(d_i,y_i)\}_{i=1}^n$, where $y_i\in\{0,1\}$ (binary) or $y_i\in\{1,\dots,m\}$ (multinomial), is:

  1. Convert $d_i$ into a feature representation $x_i$ (e.g. a bag of words).
  2. Classification function to compute $\hat{y}$ using $P(\hat{y}\mid x)$, e.g. sigmoid or softmax.
  3. Loss function for learning, like cross-entropy.
  4. Optimization algorithm, like stochastic gradient descent.

In the train phase, we learn the parameters of the model to minimize the loss function on the training set, and in the test phase, we apply the parameters to predict the class given a new input $x$ (the feature representation of a test document $d$).
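
As a concrete (toy) instance of these four pieces for binary sentiment, here is a small NumPy sketch - bag-of-words features, a sigmoid classification function, cross-entropy loss, and (full-batch) gradient descent; the vocabulary, documents and hyper-parameters are made up:

```python
import numpy as np

# 1. Feature representation: bag-of-words counts over a fixed vocabulary.
vocab = ["good", "great", "bad", "boring", "plot"]
def featurize(doc):
    return np.array([doc.split().count(w) for w in vocab], dtype=float)

docs = ["good good plot", "great plot", "bad boring plot", "boring bad"]
labels = np.array([1, 1, 0, 0])                     # 1 = positive, 0 = negative
X = np.stack([featurize(d) for d in docs])

# 2. Classification function: P(y=1 | x) = sigmoid(w.x + b).
w, b = np.zeros(len(vocab)), 0.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 3 & 4. Cross-entropy loss, minimized with (full-batch) gradient descent.
lr = 0.5
for step in range(200):
    p = sigmoid(X @ w + b)                          # predicted probabilities
    grad_w = X.T @ (p - labels) / len(labels)       # d(loss)/dw
    grad_b = np.mean(p - labels)
    w -= lr * grad_w
    b -= lr * grad_b

print(sigmoid(featurize("good plot") @ w + b))      # close to 1 (positive)
```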


Lecture 3

Word Embeddings

We have previously seen that we use ‘counts’ as a way to represent words, i.e. the $C$ function, as seen in $n$-gram models, and naive Bayes.

However, this is somewhat suboptimal, since we aren’t actually capturing the meaning of the word this way; each word, up till now, has just been a string in the vocabulary list.

Instead of representing words as indices, we can represent their meaning by writing them as vectors. This lets us generalize to similar, but unseen, words.

We can uncover more about the meaning of a word by looking at the words around it. Thus, we can represent a word’s context using vectors!

Distributional hypothesis: If $A$ and $B$ have almost identical environments, we say that they are synonyms.

One solution is to use a word-word co-occurrence matrix, in which each $ij$ entry counts how often word $i$ appears next to word $j$ (e.g. within $4$ words to the left/right). Note that this is a sparse matrix, since most entries are $0$.

In this way each word can be represented by the corresponding row vector. In order to measure similarity between the two vectors, we use the cosine metric, $\cos (\mathbf{u},\mathbf{v}) = (\mathbf{u}\cdot\mathbf{v})/(\lvert\lvert \mathbf{u}\rvert\rvert\cdot\lvert\lvert\mathbf{v}\rvert\rvert )$.

However, a raw frequency count is a bad representation! This is because overly frequent words like ‘the’ or ‘it’ will appear next to every word, and they are not very informative about the context (i.e. we don’t want the non-informative words to count towards the meaning).

The solution to this is to use a weighted function instead of raw counts. We use pointwise mutual information (PMI),

\[PMI(x,y) = \log_2\frac{P(x,y)}{P(x)P(y)}.\]

This is essentially asking: do events $x$ and $y$ co-occur more or less often than if they were independent? The denominator downweights or penalizes overly frequent words.

The vectors in the word-word co-occurrence matrix are long (due to vocabulary size) and sparse (mostly $0$’s). As an alternative, we want to represent words as short ($50$-$300$ dimensional) and dense vectors; in fact, this is the basis for modern NLP systems.

In order to get dense vectors, we perform a singular value decomposition of the count matrix. This way, we can approximate the full matrix by keeping only the top $k$ (e.g. $100$) singular values.
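
A compact NumPy sketch of this whole pipeline - co-occurrence counts, a (positive) PMI re-weighting, then truncated SVD to get short dense vectors compared with cosine similarity. The toy corpus, window size and $k$ are arbitrary choices for illustration:

```python
import numpy as np

# Toy corpus and window size; purely illustrative.
corpus = "i like deep learning i like nlp i enjoy flying".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
window = 1

# Word-word co-occurrence counts (word i appears within `window` words of word j).
C = np.zeros((len(vocab), len(vocab)))
for t, w in enumerate(corpus):
    for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
        if j != t:
            C[idx[w], idx[corpus[j]]] += 1

# Positive PMI weighting (a common variant): PPMI = max(0, log2 P(x,y) / (P(x)P(y))).
total = C.sum()
P_xy = C / total
P_x = C.sum(axis=1, keepdims=True) / total
P_y = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(P_xy / (P_x * P_y))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Dense vectors: keep only the top-k singular values of the (P)PMI matrix.
U, S, Vt = np.linalg.svd(ppmi)
k = 2
dense = U[:, :k] * S[:k]          # each row is a k-dimensional word vector

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(dense[idx["deep"]], dense[idx["nlp"]]))
```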

word2vec

While an SVD of the count matrix is a count-based method, prediction-based methods can also be used, such as word2vec (AKA word embeddings).

Vectors are created by training a classifier to predict whether a particular word $c$ (‘pie’) is likely to appear in the context of a word $w$ (‘cherry’).

The basic property of word2vec is: similar words have similar vectors.

The goal is to learn vectors from text for representing words. The inputs are a large text corpus, a vocabulary $V$, and a vector dimension $d$ (e.g. $300$).

The output is $f:V\to\mathbb{R}^d$.

Skip-gram: Assume we have a large corpus $w_1,w_2,\dots,w_T\in V$. The key idea is to use each word to predict other words in its context, where a context is defined as a fixed window of size $2m$ (e.g. $m=2$). We train a classifier to do this.

Given this training data (i.e. our corpus of text), our goal is to maximize the probability of words occurring next to each other, as they do in our corpus.

The objective is: for each position $t=1,\dots,T$, predict the context words within a window of size $m$, given the center word $w_t$.

\[\mathcal{L}(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m, j \ne 0}} P\!\left(w_{t+j} \mid w_t; \theta\right)\]

This is equivalent to minimizing the (average) negative log-likelihood.

\[J(\theta) = -\frac{1}{T} \log \mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m, j \ne 0}} \log P\!\left(w_{t+j} \mid w_t; \theta\right)\]

Essentially, find the $\theta$’s which minimize this negative log likelihood.

We define $P(w_{t+j}\mid w_t;\theta)$, using the inner product $\mathbf{u}_a\cdot\mathbf{v}_b$ to measure how likely word $a$ appears with context word $b$.

\[P\!\left(w_{t+j} \mid w_t\right) = \frac{\exp\!\left(\mathbf{u}_{w_t} \cdot \mathbf{v}_{w_{t+j}}\right)} {\sum_{k \in V} \exp\!\left(\mathbf{u}_{w_t} \cdot \mathbf{v}_k\right)}\]

Note that in this formulation, we don’t care about the classification task itself; instead, the parameters (vectors) that optimize the training objective end up being very good word representations.

A natural question is: why do we need two vectors for each word? Because a word is unlikely to appear in its own context window, e.g. $P(\mathrm{dog} \mid\mathrm{dog})$ should be small. So if we used one set of vectors only, the model would essentially need to minimize the vector’s dot product with itself,

\[\mathbf{u}_{\mathrm{dog}}\cdot\mathbf{u}_{\mathrm{dog}}.\]

To train such a model, we need to compute the gradient $\nabla_\theta J(\theta)$, where $J$ is defined as above, and then use stochastic gradient descent.
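
A minimal PyTorch sketch of skip-gram with the full softmax above (no tricks like negative sampling); the toy corpus, window size, dimension and learning rate are all made up:

```python
import torch
import torch.nn as nn

# Toy corpus and vocabulary (illustrative).
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
m, d = 2, 16                                    # window size, embedding dim

# Two embedding tables: u for center words, v for context words.
U = nn.Embedding(len(vocab), d)
V = nn.Embedding(len(vocab), d)
opt = torch.optim.SGD(list(U.parameters()) + list(V.parameters()), lr=0.05)

# (center, context) training pairs from a window of size 2m around each position.
pairs = [(idx[corpus[t]], idx[corpus[t + j]])
         for t in range(len(corpus))
         for j in range(-m, m + 1)
         if j != 0 and 0 <= t + j < len(corpus)]
centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([o for _, o in pairs])

for epoch in range(100):
    # logits[i, k] = u_{center_i} . v_k  ->  softmax over the whole vocabulary
    logits = U(centers) @ V.weight.t()
    loss = nn.functional.cross_entropy(logits, contexts)   # -log P(context | center)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Similar words end up with similar (center) vectors.
emb = U.weight.detach()
print(emb[idx["cat"]] @ emb[idx["dog"]])
```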

These word embeddings have some nice properties, e.g. $v_{\text{man}} - v_{\text{woman}} \approx v_{\text{king}} - v_{\text{queen}}$. They capture the directions of semantics which make sense.


Lecture 4

Neural Networks for NLP

Last time we saw a very simple, elegant technique that gives nice embeddings of words into vectors. For instance, these can capture the semantics of a word much better than just its index.

The neural nets we are likely to encounter in NLP are feed-forward NNs, recurrent NNs, and transformers (maybe convolutional NNs too, but less so, since these are more used for images).

History: In 2011, the Collobert and Weston paper ‘NLP (Almost) from Scratch’ showed how feedforward NNs can replace ‘feature engineering’ (i.e. deciding which words matter, interpreting them etc.). In 2014, deep learning started working for language: Kim (2014) and Kalchbrenner et al. (2014) showed ConvNets work for sentence classification; Chen and Manning (2014) showed feedforward neural networks work well for dependency parsing.

Earlier we did feature engineering, deriving the ‘counts’ and ‘values’ from the words in a document, but we now want to move to deep learning, where we let the model learn good features and representations, as opposed to us ‘teaching’ the features and putting them in the model.

Feed-forward neural networks

The units are connected with no cycles. The outputs from units in each layer are passed to units in the next higher layer. No outputs are passed back to lower layers.

We have an input layer, some hidden layer(s), and an output layer. There is no backward passing in this network.

The layers are fully connected, i.e. all the units of one layer are connected to every unit of the next layer.

Each hidden layer takes its input, multiplies it by the weights (plus a bias term), and applies an activation function. Examples of activation functions are the sigmoid, $f(z) = \frac{1}{1+e^{-z}}$; ReLU, $f(z)=\max(0,z)$; and tanh, $f(z)=\frac{e^{2z}-1}{e^{2z}+1}$. Note that activation functions are non-linear; if they were linear, then everything would just collapse into a linear model, and we wouldn’t need multiple layers.

In matrix notation, if $\mathbf{x}\in\mathbb{R}^d$ is the input, hidden layer 1 is $\mathbf{h}_1=f(\mathbf{W}^{(1)}\mathbf{x}+\mathbf{b}^{(1)})$. Similarly, hidden layer 2 is $\mathbf{h}_2=f(\mathbf{W}^{(2)}\mathbf{h}_1+\mathbf{b}^{(2)})$, and if it’s only a 2-hidden-layer network then the output layer is $\mathbf{y}=\mathbf{W}^{(o)}\mathbf{h}_2$, where $\mathbf{W}^{(o)}$ is the weight matrix going to the output layer.

Using feedforward NNs for multi-class classification, we can apply the softmax function to the output $\mathbf{y}=\mathbf{W}^{(o)}\mathbf{h}_2$.

\[\text{softmax}(\mathbf{y})_k = \frac{\exp(y_k)}{\sum_{j=1}^C\exp(y_j)}\]

The training loss from this is,

\[\min_{\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \mathbf{W}^{(o)}} -\sum_{(\mathbf{x},y)\in D}\log\mathbf{\hat{y}}_y\]

Take every pair $(\mathbf{x},y)$ in your training set $D$, and compare the model’s predicted distribution $\mathbf{\hat{y}}=\text{softmax}(\mathbf{y})$ to the actual label $y$. This can be written as the negative log-likelihood, as above. Training this (i.e. optimizing the above minimum) can be done with stochastic gradient descent!

Neural networks are actually difficult to optimize; SGD is only guaranteed to converge to a local minimum, so initializations and optimizers matter a lot.

To optimize the neural network, we use back-propagation. First we do forward propagation from the input to the output layer of the network and compute the loss $L=-\log\mathbf{\hat{y}}_y$. Then we go backwards: using backpropagation, we compute the gradient of the loss with respect to the edges/weights of the network, and we move our weights $\mathbf{W}^{(i)}$ in a direction that decreases the loss. These days, this can be done easily in PyTorch.
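
In PyTorch, the forward pass, loss, and backward pass from this paragraph look roughly like the following sketch (the layer sizes, batch, and hyper-parameters are arbitrary placeholders):

```python
import torch
import torch.nn as nn

d, h, C = 300, 128, 3                               # input dim, hidden size, num classes

# Two hidden layers + output layer, matching h1 = f(W1 x + b1), etc.
model = nn.Sequential(
    nn.Linear(d, h), nn.ReLU(),
    nn.Linear(h, h), nn.ReLU(),
    nn.Linear(h, C),                                 # y = W^(o) h2  (logits)
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                      # softmax + negative log-likelihood

# One (random, illustrative) mini-batch of feature vectors x and labels y.
x = torch.randn(32, d)
y = torch.randint(0, C, (32,))

logits = model(x)            # forward propagation
loss = loss_fn(logits, y)    # L = -log softmax(y)_y, averaged over the batch
opt.zero_grad()
loss.backward()              # back-propagation: gradients w.r.t. all weights
opt.step()                   # move the weights to decrease the loss
```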

We compare image inputs vs. text inputs: images are fixed-size inputs with continuous values (e.g. $64\times64$ pixels), whereas text is a variable-length input with discrete words; hence we need to convert the words into vectors before feeding them into our model - word embeddings!

Neural networks for text classification

Input is $w_1,\dots,w_K\in V$, and the output is $y\in C$ (e.g. $C=\lbrace\text{positive, negative, neutral}\rbrace$).

Solution one: construct a feature vector $\mathbf{x}$ (where each $x_i$ is a hand-designed feature) from the input and simply feed that vector to a neural network, instead of to a logistic regression classifier.

How can we feed a variable-length input to a neural network classifier? Solution two: take the word embeddings of all the words and aggregate them into a single vector through some pooling function, e.g. the mean,

\[\mathbf{x}_{\text{mean}} = \frac{1}{K}\sum_{i=1}^K e(w_i),\]

where $e(w_i)$ is the word embedding of $w_i$; other pooling choices include the sum or max etc. N.B.: each input can have a different $K$.

In this solution, we are not hand-designing features - essentially, we treat the word embeddings as the features, without engineering anything ourselves (as in solution one).

In practice, this pooling solution with word embeddings is used.

The word embeddings $\mathbf{E}\in\mathbb{R}^{\lvert V\rvert\times d}$ can be treated as parameters too, and optimized using SGD. This is because we could otherwise have vectors like $\mathbf{v}(\text{‘good’})\approx\mathbf{v}(\text{‘bad’})$ (the words appear in similar contexts), which isn’t good for tasks like sentiment.
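
A minimal PyTorch sketch of this pooling classifier, where both the embedding table $\mathbf{E}$ and the classifier weights are trainable parameters (the tiny vocabulary and sizes are made up):

```python
import torch
import torch.nn as nn

vocab = {"the": 0, "movie": 1, "was": 2, "great": 3, "boring": 4}   # illustrative
d, C = 50, 3                                   # embedding dim, number of classes

embed = nn.Embedding(len(vocab), d)            # E in R^{|V| x d}, also trained by SGD
classifier = nn.Linear(d, C)

def forward(words):
    ids = torch.tensor([vocab[w] for w in words])
    x_mean = embed(ids).mean(dim=0)            # pool the K embeddings into one d-vector
    return classifier(x_mean)                  # class logits

print(forward("the movie was great".split()))  # inputs of any length K work
```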

Feedforward neural language models

With the $n$-gram model, our goal was to model sequences of words with probability distributions. But we saw that a limitation of the $n$-gram model is that it can’t handle long histories.

If we use a $4$-gram, $5$-gram language model, it will become too sparse to estimate the probabilities.

\[P(w \mid \text{students opened their}) = \frac{\text{count(students opened their } w\text{)}}{\text{count(students opened their)}}\]

However, a lot of contexts are similar, and simply counting them won’t generalize. And the number of probabilities that we need to estimate grows exponentially with the window size!

The key idea for estimating these probabilities better was: can we just use a neural network to fit the probability distribution of a language?

Feedforward neural language models approximate the probability based on the previous $m$ (e.g. 5) words - $m$ is a hyper-parameter here.

\[P(x_1,x_2,\dots,x_n)\approx \prod_{i=1}^n P(x_i\mid x_{i-m},\dots, x_{i-1})\]

We just use a neural network for classification here. This becomes a $\lvert V\rvert$-way classification problem; instead of $C$ classes we have $\lvert V\rvert$ words.

For example, suppose we want $P(\text{mat} \mid \text{the cat sat on the})$. Let $d$ be the word embedding size and $h$ the hidden layer size.

We first concatenate the vectors in order to create a bigger vector (instead of taking the mean etc.).

The input layer ($m=5$) is

\[\mathbf{x} = [e(\text{the});\ e(\text{cat});\ e(\text{sat});\ e(\text{on});\ e(\text{the})] \in \mathbb{R}^{md}.\]

The hidden layer is $\mathbf{h}=\text{tanh}(\mathbf{W}\mathbf{x}+\mathbf{b})\in\mathbb{R}^h$.

The output layer is $\mathbf{z} = \mathbf{U}\mathbf{h}\in\mathbb{R}^{\lvert V\rvert}$ (here $\mathbf{U}$ is another weight matrix).

\[\therefore\hspace{5pt} P(w = i \mid \text{the cat sat on the}) = \operatorname{softmax}_i(z) = \frac{e^{z_i}}{\sum_k e^{z_k}}.\]

The dimensions of the weights $\mathbf{W}$ and $\mathbf{U}$ here are, $\mathbf{W}\in\mathbb{R}^{h\times md}$, $\mathbf{U}\in\mathbb{R}^{\lvert V\rvert\times h}$.
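
A small PyTorch sketch of this forward pass, with $\mathbf{W}\in\mathbb{R}^{h\times md}$ and $\mathbf{U}\in\mathbb{R}^{\lvert V\rvert\times h}$ as above (the sizes and example word ids are placeholders):

```python
import torch
import torch.nn as nn

V_size, d, h, m = 10000, 100, 256, 5          # vocab size, embedding dim, hidden size, context

embed = nn.Embedding(V_size, d)
W = nn.Linear(m * d, h)                       # W in R^{h x md} (plus bias b)
U = nn.Linear(h, V_size, bias=False)          # U in R^{|V| x h}

def next_word_distribution(context_ids):
    """context_ids: the m previous word ids, e.g. 'the cat sat on the'."""
    x = embed(torch.tensor(context_ids)).reshape(-1)   # concatenate -> R^{md}
    hidden = torch.tanh(W(x))                          # h = tanh(Wx + b)
    z = U(hidden)                                      # z in R^{|V|}
    return torch.softmax(z, dim=-1)                    # P(w | previous m words)

probs = next_word_distribution([1, 2, 3, 4, 1])        # illustrative word ids
print(probs.shape, probs.sum())                        # torch.Size([10000]), ~1.0
```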

How do we train this feedforward neural language model? We use a lot of raw text to create training examples and run gradient-descent optimization. The key idea here is that we just need raw text to train on - no labels are needed - and the nice part is that so much raw text is freely available online.

The limitations of this are that $\mathbf{W}$ scales linearly with the context size $m$, and the size of the context is fixed. Since we are concatenating embeddings, increasing the context becomes computationally expensive (so we can’t really make it too large).

Better solutions include recurrent NNs and transformers.