The model makes the same prediction at each position.

Idea: from current value of $\theta$, calculate the gradient of $J(\theta)$, and move in the opposite direction of the gradient. Repeat until convergence.

$\theta^{\text{new}}=\theta^{\text{old}}-\alpha\nabla_{\theta}J(\theta)$

where $\alpha$ is the learning rate.

Problem: $J(\theta)$ is the sum of the loss over all training examples. $\nabla_{\theta}J(\theta)$ is very expensive to compute.

Solution: Stochastic Gradient Descent. At each step, sample a random training example and update $\theta$ based on the loss of that example.

The gradient of the loss of a single example is a noisy estimate of the true gradient. But it’s much cheaper to compute. And it’s often good enough.
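A minimal sketch of the SGD update rule, using a toy least-squares loss as a stand-in for the word2vec objective (all names here are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: J(theta) is the sum over examples of (x_i . theta - y_i)^2.
X = rng.normal(size=(1000, 5))
true_theta = np.arange(1.0, 6.0)
y = X @ true_theta

theta = np.zeros(5)
alpha = 0.01  # learning rate

for step in range(5000):
    i = rng.integers(len(X))                  # sample one random training example
    grad = 2 * (X[i] @ theta - y[i]) * X[i]   # gradient of that single example's loss
    theta -= alpha * grad                     # theta_new = theta_old - alpha * grad

print(np.round(theta, 2))
```

Each step uses a noisy one-example gradient, yet the iterates still approach `true_theta`.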

The gradient of a single window is very sparse: only the vectors of the words that actually appear in the window (here, "like") have nonzero entries.

$\nabla_{\theta}J_t(\theta) = \begin{bmatrix} 0\\ \vdots\\ \nabla_{v_{\text{like}}}\\ \vdots\\ \nabla_{u_{\text{like}}}\\ 0\\ \vdots\\ \end{bmatrix}\in \mathbb{R}^{2dV}$

Why do we use two vectors per word?

A word may occur more than once in a sentence, and can serve as both center word and context word in the same window. If we used a single vector for both roles, the gradients would become much messier; with separate vectors the math stays simple, and the two vectors can be averaged at the end.

Two model variants:

- Skip-gram: predict context words (outside words) given center word
- Continuous Bag of Words (CBOW): predict center word from (bag of) context words

Additional efficiency: hierarchical softmax and negative sampling.

- Hierarchical softmax: arrange the vocabulary in a binary tree so that computing a word's probability costs $O(\log V)$ instead of $O(V)$.
- Negative sampling: train binary logistic regressions for a true pair (center word, context word) and a number of noise pairs.

Overall objective function:

$J_{\text{neg-sample}}(u_o,v_c,\theta) = -\log\sigma(u_o^Tv_c) - \sum_{i=1}^k\mathbb E_{j\sim P(w)}[\log\sigma(-u_j^Tv_c)]$

where $\sigma(x)=\frac{1}{1+e^{-x}}$ is the sigmoid function, and $\mathbb E_{j\sim P(w)}$ denotes expectation over the noise distribution $P(w)$. Sample with $P(w)=U(w)^{3/4}/Z$, where $U(w)$ is the unigram distribution and $Z$ is the normalization constant. Raising to the 3/4 power dampens the differences between common and rare words.

Trick: the sigmoid satisfies $\sigma(-x)=1-\sigma(x)$, so maximizing $\log\sigma(-u_j^Tv_c)$ for a noise pair is equivalent to minimizing $\sigma(u_j^Tv_c)$, i.e., pushing the noise pair's score down.
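A sketch of the negative-sampling loss and the noise distribution in NumPy, with hypothetical unigram counts and randomly initialized vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(v_c, u_o, U_neg):
    """J = -log sigma(u_o . v_c) - sum_j log sigma(-u_j . v_c)."""
    return -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-U_neg @ v_c)))

# Noise distribution P(w) = U(w)^{3/4} / Z over hypothetical unigram counts.
counts = np.array([1000.0, 100.0, 10.0, 1.0])
P = counts ** 0.75
P /= P.sum()  # Z normalizes; the 3/4 power dampens common-vs-rare differences

d, k = 8, 5
U = rng.normal(scale=0.1, size=(len(counts), d))  # outside-word vectors
v_c = rng.normal(scale=0.1, size=d)               # center-word vector
u_o = U[0]                                        # the true outside word
U_neg = U[rng.choice(len(counts), size=k, p=P)]   # k sampled noise words
loss = neg_sample_loss(v_c, u_o, U_neg)
print(loss)
```

Note how the dampening works: the most common word is 1000x more frequent than the rarest, but only about 178x more likely to be sampled.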

Consider a large corpus of text. For each word $w$, count how often every other word appears within a window around $w$. These counts form the co-occurrence matrix.

Two design choices:

- Window length: how many words on each side of $w$ to count
- Symmetry: the matrix is symmetric, since $u$ co-occurring with $v$ is the same event as $v$ co-occurring with $u$
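A minimal sketch of building such a matrix from a toy tokenized corpus (the `cooccurrence` helper is ours, not from any library):

```python
import numpy as np

def cooccurrence(corpus, window=2):
    """Symmetric co-occurrence counts with the given window length.

    corpus: list of tokenized sentences (lists of words).
    Returns (sorted vocab list, V x V count matrix).
    """
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    X[idx[w], idx[sent[j]]] += 1
    return vocab, X

vocab, X = cooccurrence([["i", "like", "deep", "learning"],
                         ["i", "like", "nlp"]], window=1)
print(vocab)
print(X)
```

The resulting matrix is symmetric, and "i"/"like" co-occur twice because they are adjacent in both sentences.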

Problem: The matrix is very large and very sparse. Thus, the results tend to be noisy and less robust.

Solution: Low-dimensional vectors.

Idea: store "most" of the important information in a fixed, small number of dimensions (usually 25–1000).

We could use SVD (Singular Value Decomposition) to reduce the dimensionality of the matrix.

$X=USV^T$

where $U$ and $V$ are orthogonal matrices, and $S$ is a diagonal matrix. The columns of $U$ are the left singular vectors, and the columns of $V$ are the right singular vectors. The diagonal of $S$ contains the singular values. *(Refer to linear algebra)*

We could delete some small singular values and their corresponding columns in $U$ and $V$ to reduce the dimensionality of the matrix.
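A sketch of this truncation in NumPy, using a random symmetric matrix as a stand-in for real co-occurrence counts:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in co-occurrence matrix for a vocabulary of 50 words.
X = rng.poisson(2.0, size=(50, 50)).astype(float)
X = (X + X.T) / 2  # co-occurrence counts are symmetric

U, S, Vt = np.linalg.svd(X)       # X = U S V^T, singular values sorted descending
k = 10                            # keep only the k largest singular values
X_k = U[:, :k] * S[:k] @ Vt[:k]   # rank-k reconstruction of X
word_vectors = U[:, :k] * S[:k]   # rows: k-dimensional word vectors

# By the Eckart-Young theorem, X_k is the best rank-k approximation
# of X in Frobenius norm.
rel_err = np.linalg.norm(X - X_k) / np.linalg.norm(X)
print(rel_err)
```

Deleting small singular values keeps the reconstruction error low while shrinking each word's representation from 50 to 10 dimensions.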

To deal with overly frequent words (e.g., function words), we could scale the raw counts, as in the COALS model.

Pros of count-based methods (e.g., SVD, COALS):

- Fast training
- Efficient use of statistics

Cons:

- Primarily used to capture word similarity
- Disproportionate importance given to large counts

Pros of direct-prediction methods (e.g., Skip-gram, CBOW):

- Generate improved task performance
- Can capture complex patterns beyond word similarity

Cons:

- Inefficient use of statistics: each update looks at only one center word and generates only a few negative samples for it
- Training time scales with the size of the corpus

Crucial insight: ratios of co-occurrence probabilities can encode meaning components.

Example:

| | $x=\text{solid}$ | $x=\text{gas}$ | $x=\text{water}$ | $x=\text{random}$ |
|---|---|---|---|---|
| $P(x\mid\text{ice})$ | large | small | large | small |
| $P(x\mid\text{steam})$ | small | large | large | small |
| $\frac{P(x\mid\text{ice})}{P(x\mid\text{steam})}$ | large | small | $\sim 1$ | $\sim 1$ |

We use a log-bilinear model to capture the ratios of co-occurrence probabilities:

$w_i\cdot w_j = \log P(i|j)$

with vector differences:

$w_x\cdot(w_a-w_b) = \log\frac{P(x|a)}{P(x|b)}$

GloVe is a model for distributed word representation. It is a log-bilinear model with a weighted least squares objective.

$w_i\cdot w_j=\log P_{ij}$

$J(\theta) = \sum_{i,j=1}^V f(X_{ij})(w_i^T\tilde{w_j}+b_i+\tilde{b_j}-\log X_{ij})^2$

The weighting function $f$ dampens the influence of very frequent co-occurrences: $f(x)=(x/x_{\text{max}})^\alpha$ for $x<x_{\text{max}}$, and $f(x)=1$ otherwise.
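A sketch of the weighting function and the objective in NumPy, using random vectors and counts just to show the computation:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: dampens very frequent co-occurrences, capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """J = sum_{ij, X_ij > 0} f(X_ij) (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    i, j = np.nonzero(X)  # only nonzero co-occurrences contribute
    diff = (W[i] * W_tilde[j]).sum(axis=1) + b[i] + b_tilde[j] - np.log(X[i, j])
    return np.sum(f(X[i, j]) * diff ** 2)

rng = np.random.default_rng(0)
V, d = 20, 8
X = rng.poisson(1.0, size=(V, V)).astype(float)   # stand-in co-occurrence counts
W = rng.normal(scale=0.1, size=(V, d))            # word vectors
W_t = rng.normal(scale=0.1, size=(V, d))          # "tilde" context vectors
b, b_t = np.zeros(V), np.zeros(V)                 # bias terms
loss = glove_loss(W, W_t, b, b_t, X)
print(loss)
```

Summing only over nonzero entries is what makes training fast: the objective touches the (sparse) co-occurrence statistics directly rather than every window in the corpus.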

Pros:

- Fast training
- Scalable to large corpora
- Good performance even with a small corpus and small vectors

**Intrinsic evaluation:**

- Evaluate the word vectors directly on a specific intermediate task (e.g., word analogies)
- Fast to compute
- Helps to understand the model
- Not clear if really helpful for the final task, unless the correlation to the final task is established.

**Extrinsic evaluation:**

- Evaluate the word vectors on a real task
- Can take a long time to compute accuracy
- Unclear whether the subsystem itself is the problem, or its interaction with other subsystems
- If replacing exactly one subsystem with another improves performance, then that subsystem was the problem

We could use a big collection of word analogy tasks: given a:b, find d such that c is to d as a is to b. More specifically: man is to woman as king is to what? The answer is queen.

In math:

$d=\arg\max_{w\in \text{Vocab}}\cos(w,\; w_b-w_a+w_c)$

where $\cos(x,y)=\frac{x\cdot y}{\|x\|\|y\|}$ is the cosine similarity.
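A sketch with hand-crafted toy vectors chosen so the offsets line up (real embeddings are learned, of course; `analogy` is our own helper):

```python
import numpy as np

def analogy(a, b, c, vectors, vocab):
    """Return the word d maximizing cos(w, w_b - w_a + w_c)."""
    target = vectors[vocab[b]] - vectors[vocab[a]] + vectors[vocab[c]]
    sims = vectors @ target / (np.linalg.norm(vectors, axis=1)
                               * np.linalg.norm(target))
    for w in (a, b, c):          # the trick: discard the input words
        sims[vocab[w]] = -np.inf
    inv = {i: w for w, i in vocab.items()}
    return inv[int(np.argmax(sims))]

# Toy vectors: the second coordinate encodes "female", the third "royal".
vocab = {"man": 0, "woman": 1, "king": 2, "queen": 3, "apple": 4}
vectors = np.array([[1.0, 0.0, 0.0],
                    [1.0, 1.0, 0.0],
                    [1.0, 0.0, 1.0],
                    [1.0, 1.0, 1.0],
                    [0.0, 0.0, 5.0]])
print(analogy("man", "woman", "king", vectors, vocab))  # prints "queen"
```

Here $w_{\text{woman}} - w_{\text{man}} + w_{\text{king}}$ lands exactly on $w_{\text{queen}}$, so cosine similarity picks it out.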

One trick is to discard the input words $a$, $b$, $c$ from the search.

More data helps and Wikipedia is a good source.

300 dimensions is a good choice for the dimensionality of the word vectors.

Example dataset: WordSim353.

Evaluate word vector distances and their correlation with human judgement.

Most words have lots of meanings. For example, “bank” could mean a financial institution, or the side of a river.

Idea: Cluster word windows around words, retrain with each word assigned to multiple different clusters.

However, this method is complex: it first has to learn the word senses and then learn word vectors in terms of those senses. Moreover, cutting a word's meaning into discrete senses is imperfect: the senses overlap, and it is often unclear which one applies.

Different senses of a word reside in a linear superposition (weighted sum) in standard word embeddings like word2vec.

$v_{\text{bank}} = \sum_{i=1}^k \alpha_i v_{\text{bank}_i}$

where $\alpha_i=\frac{f_i}{\sum f}$, for frequency $f_i$ of sense $i$.
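A tiny numeric illustration with made-up sense vectors and frequencies:

```python
import numpy as np

# Hypothetical sense vectors for "bank" and their corpus frequencies.
senses = np.array([[1.0, 0.0, 0.0],   # bank_1: financial institution
                   [0.0, 1.0, 0.0]])  # bank_2: side of a river
freqs = np.array([300.0, 100.0])

alpha = freqs / freqs.sum()   # alpha_i = f_i / sum(f)
v_bank = alpha @ senses       # weighted sum (linear superposition) of senses
print(v_bank)                 # prints [0.75 0.25 0.  ]
```

The standard embedding for "bank" is pulled toward its more frequent sense, but the rarer sense still contributes.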

Surprising result: because of ideas from sparse coding, we can easily separate out the senses (provided they are relatively common).
