MicDZ's Blog

CS224N Notes - Lecture 2

CS224N Notes - Lecture 2



Bag of Words Model

The model makes the same prediction at each position.

Optimization: Gradient Descent

Idea: from current value of θ\theta, calculate the gradient of J(θ)J(\theta), and move in the opposite direction of the gradient. Repeat until convergence.


where α\alpha is the learning rate.

Stochastic Gradient Descent

Problem: J(θ)J(\theta) is the sum of the loss over all training examples. θJ(θ)\nabla_{\theta}J(\theta) is very expensive to compute.

Solution: Stochastic Gradient Descent. At each step, sample a random training example and update θ\theta based on the loss of that example.

The gradient of the loss of a single example is a noisy estimate of the true gradient. But it’s much cheaper to compute. And it’s often good enough.

Running SGD with Word Vectors

vcJt(θ)=[0vlikeulike0]R2dV\nabla_{v_c}J_t(\theta) = \begin{bmatrix} 0\\ \vdots\\ \nabla_{v_{\text{like}}}\\ \vdots\\ \nabla_{u_{\text{like}}}\\ 0\\ \vdots\\ \end{bmatrix}\in \mathbb{R}^{2dV}

Word2vec: more details

Why we use two vectors to describe one word?

A word may occur more than once in a sentence. If we use the same vector for each occurrence, the math will be sophisticated.

Two model variants:

  • Skip-gram: predict context words (outside words) given center word
  • Continuous Bag of Words (CBOW): predict center word from (bag of) context words

Skip-gram with Negative Sampling

Additional efficiency: hierarchical softmax and negative sampling.

  • Hierarchical softmax
  • Negative sampling: train binary logistic regressions for a true pair (center word, context word) and a number of noise pairs.

Overall objective function:

Jneg-sample(u0,v0,θ)=logσ(uoTvc)i=1kEjP(w)[logσ(ujTvc)]J_{\text{neg-sample}}(u_0,v_0,\theta) = -\log\sigma(u_o^Tv_c) - \sum_{i=1}^k\mathbb E_{j\sim P(w)}[\log\sigma(-u_j^Tv_c)]

where σ(x)=11+ex\sigma(x)=\frac{1}{1+e^{-x}} is the sigmoid function. EjP(w)\mathbb E_{j\sim P(w)} means the expectation over the noise distribution P(w)P(w). Sample with P(w)=U(w)3/4/ZP(w)=U(w)^{3/4}/Z, where U(w)U(w) is the unigram distribution raised to the 3/4 power, and ZZ is the normalization constant. Taking 3/4 power has the effect of dampening the differences between the common and rare words.

Trick: the sigmoid function is symmetric, so minimize the ujTvc-u_j^Tv_c is equivalent to maximize σ(ujTvc)\sigma(u_j^Tv_c).

Use a Co-occurrence Matrix

Consider a large corpus of text. For each word ww, count how often it appears in some window of words around it. This is the co-occurrence matrix.

It contains two kinds of information:

  • Window length: how many words on each side of ww to consider
  • Symmetric: the co-occurrence matrix is symmetric because the co-occurrence of uu to vv is the same as the co-occurrence of vv to uu.

Problem: The matrix is very large and very sparse. Thus, the results tend to be noisy and less robust.

Solution: Low-dimensional vectors.
Idea: store “most” of the important information in a fixed, small number of dimensions. (Usually 25-1000)

We could use SVD (Single Value Dimensionality) to reduce the dimensionality of the matrix.


where UU and VV are orthogonal matrices, and SS is a diagonal matrix. The columns of UU are the left singular vectors, and the columns of VV are the right singular vectors. The diagonal of SS contains the singular values. (Refer to linear algebra)

We could delete some small singular values and their corresponding columns in UU and VV to reduce the dimensionality of the matrix.

To deal with the most frequent words, we could use a model called COALS.

Towards GloVe: Count Based vs. Direct Prediction

Count Based:


  • Fast training
  • Efficient use of statistics


  • Primarily used to capture word similarity
  • Disproprtionate importance given to large counts.

Direct Prediction:


  • Generate improved task performance
  • Can capture complex patterns beyond word similarity
  • Inefficient usage of statistics
    Only look at one center word at a time, and only generating a few negative samples for each center word.
  • Related to the size of the corpus

Encoding Meaning in Vector Differences

Cucial insight: ratios of co-occurrence can encode meaning components.


x=solidx=\text{solid} x=gasx=\text{gas} x=waterx=\text{water} x=randomx=\text{random}
P(xice)P(x|\text{ice}) large small large small
P(xsteam)P(x|\text{steam}) small large large small
P(xice)P(xsteam)\frac{P(x|\text{ice})}{P(x|\text{steam})} large small 1\sim 1 1\sim 1

We use a log-bilinear model to capture the ratios of co-occurrence probabilities:

wiwj=logP(ij)w_i\cdot w_j = \log P(i|j)

with vector differences:

wx(wawb)=logP(xa)P(xb)w_x\cdot(w_a-w_b) = \log\frac{P(x|a)}{P(x|b)}

GloVe: Global Vectors for Word Representation

GloVe is a model for distributed word representation. It is a log-bilinear model with a weighted least squares objective.

wiwj=logPijw_i\cdot w_j=\log P_{ij}

J(θ)=i,j=1Vf(Xij)(wiTwj~+bi+bj~logXij)2J(\theta) = \sum_{i,j=1}^V f(X_{ij})(w_i^T\tilde{w_j}+b_i+\tilde{b_j}-\log X_{ij})^2

f function is used to dampen the influence of very frequent words. It is usually f(x)=(x/xmax)αf(x)=(x/x_{\text{max}})^\alpha.


  • Fast training
  • Scalable to large corpora
  • Good performance even with small corpus and small vectors

How to Evaluate Word Vectors?

Intrinsic evaluation:

  • Evaluate the word vectors directly on a specific/intermeiate task.
  • Fast to compute
  • Helps to understand the model
  • Not clear if really helpful for the final task, unless the correlation to the final task is established.

Extrinsic evaluation:

  • Evaluate the word vectors on a real task
  • Can take a long time to compute accuracy
  • Unclear if the subsystem is the problem or its interaction or other subsystems.
  • If replacing exactly one subsystem with another improves the performance, then we find the subsystem is the problem.

Intrinsic Example: Analogy Task

We could use a big collection of word analogy task. For example, given a:b, find c:d. More pecifically, man is to woman like king is to what? The answer is queen.

The math language is:

d=argmaxwVocabcos(w,wawb+wc)d=\arg\max_{w\in Vocab}\cos(w, w_a-w_b+w_c)

where cos(x,y)=xyxy\cos(x,y)=\frac{x\cdot y}{\|x\|\|y\|} is the cosine similarity.

One trick is to discarding the input words from the search.

Analogy Evalutaion and Hyperparameters

More data helps and Wikipedia is a good source.

300 dimensions is a good choice for the dimensionality of the word vectors.

Intrinsic Example: Human Judgements

Example dataset: WordSim353.

Evaluate word vector distances and their correlation with human judgement.

Word Senses and Word Sense Ambiguity

Most words have lots of meanings. For example, “bank” could mean a financial institution, or the side of a river.

Idea: Cluster word windows around words, retrain with each word assigned to multiple different clusters.

However, this method is complex. First of all, it has to learn words senses and then learning word vectors in terms of the word senses. Cutting the meaning of a word into defferent sensors making the differences overlap and it’s often not clear which ones to use.

Solution: Linear Algebraic Structure of Word Senses, with Applications to Polysemy

Different senses of a word reside in a linear superposition (weighted sum) in standard word embeddings like word2vec.

vbank=i=1kαivbankiv_{\text{bank}} = \sum_{i=1}^k \alpha_i v_{\text{bank}_i}

where αi=fif\alpha_i=\frac{f_i}{\sum f}, for frequency fif_i of sense ii.

Surprising result: beacaus of ideas from sparse coding, we can easily separate out the senses (providing they are relatively common).

, — 2024年2月27日