*Language is glorious chaos.*

How do we represent the meaning of a word?

In traditional NLP, we regard words as discrete symbols, represented by one-hot vectors $[0\ 0\ 0\ 1\ \ldots]$.

Any two distinct one-hot vectors are orthogonal, so there is no natural notion of similarity between them.

The main idea is: A word’s meaning is given by the words that “frequently” appear close-by.

“*You shall know a word by the company it keeps*.” (John Rupert Firth) The meaning of a word can be understood by examining the contexts in which it is used and the words it commonly co-occurs with. This idea emphasizes considering language holistically rather than in isolation.

This is one of the most successful ideas of modern statistical NLP.

We build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.

$\text{banking}=\begin{bmatrix}0.286\\0.792\\-0.532\\ \ldots\end{bmatrix}$

Word vectors are also called word embeddings or (neural) word representations.

Idea:

- We have a large corpus (“body”) of text
- Every word in a fixed vocabulary is represented by a vector
- Go through each position $t$ in the text, which has a center word $c$ and context (“outside”) words $o$
- Use the similarity of the word vectors for $c$ and $o$ to calculate the probability of $o$ given $c$
- Keep adjusting the word vectors to maximize this probability

Data likelihood:

$L(\theta) = \prod_{t=1}^T \prod_{-m\leq j\leq m, j\neq 0} P(w_{t+j}|w_t;\theta)$

Here $\theta$ denotes the parameters of the model, $w_t$ is the center word, $w_{t+j}$ are the context words, and $m$ is the size of the context window.

Maximizing a product of probabilities directly is awkward, so instead we minimize the average negative log-likelihood: the logarithm turns the product into a sum, and the negation turns maximization into minimization.

$J(\theta) = -\frac{1}{T}\sum_{t=1}^T\sum_{-m\leq j\leq m, j\neq 0}\log P(w_{t+j}|w_t;\theta)$
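This objective can be sketched over a toy corpus as follows (all names, sizes, and the random initialization here are hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = ["the", "cat", "sat", "on", "the", "mat"]
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
d, m = 4, 2  # embedding size and window size (toy values)

V = rng.normal(size=(len(vocab), d))  # center vectors v_w
U = rng.normal(size=(len(vocab), d))  # context vectors u_w

def log_prob(o, c):
    """log P(o|c) under the softmax model."""
    scores = U @ V[c]  # u_w^T v_c for every word w
    return scores[o] - np.log(np.exp(scores).sum())

# J(theta): average negative log-likelihood over all (center, context) pairs
T = len(corpus)
J = 0.0
for t in range(T):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < T:
            J -= log_prob(idx[corpus[t + j]], idx[corpus[t]])
J /= T
```

Training would then adjust `U` and `V` to drive `J` down; since every probability is at most 1, `J` is always non-negative.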

In order to calculate the $P(w_{t+j}|w_t;\theta)$, we use two vectors for each word $w$:

- $v_w$ when $w$ is a center word
- $u_w$ when $w$ is a context word

Using a softmax function, we can calculate the probability of a context word given a center word:

$P(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w\in Vocab}\exp(u_w^T v_c)}$

where $Vocab$ is the vocabulary of the corpus.
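A minimal numpy sketch of this probability, using a hypothetical toy vocabulary with random vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d = 5, 3                  # toy sizes
U = rng.normal(size=(vocab_size, d))  # context vectors u_w
v_c = rng.normal(size=d)              # center vector v_c

scores = U @ v_c                           # u_w^T v_c for every word w
p = np.exp(scores) / np.exp(scores).sum()  # P(w|c) over the whole vocabulary
```

`p[o]` is then $P(o|c)$ for any context word index `o`.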

A softmax function maps $\mathbb{R}^n \rightarrow (0,1)^n$:

$\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^n\exp(x_j)}=p_i$

The softmax function maps arbitrary values $x_i$ to a probability distribution $p_i$.

- **max** because it amplifies the largest input and suppresses the smaller ones
- **soft** because it still assigns some probability to even the smallest input

It is worth noting that the softmax function does not return a single value, but a distribution of probabilities over the entire vocabulary.
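A short numpy implementation of the function, with the standard max-subtraction trick for numerical stability (shifting all inputs by a constant leaves the softmax unchanged but prevents overflow in `exp`):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))  # shift is a no-op mathematically
    return z / z.sum()
```

For example, `softmax(np.array([1.0, 2.0, 3.0]))` puts the largest probability on the last entry while still giving the first a nonzero share.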

Since every word has two vectors, after training we typically average (or sum) $u_w$ and $v_w$ to obtain a single vector per word, called its “word vector” or “word embedding”.

With d-dimensional word vectors, the model has $2\times |V|\times d$ parameters.

$\theta = (v_{w_1}, u_{w_1}, v_{w_2}, u_{w_2}, \ldots)\in\mathbb{R}^{2\times |V|\times d}$
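As a quick sanity check on the parameter count, with hypothetical toy sizes:

```python
vocab_size, d = 5_000, 100           # hypothetical |V| and embedding dimension
num_params = 2 * vocab_size * d      # one u and one v vector per word
```

With these numbers the model already has a million parameters, which is why word2vec scales training with tricks like negative sampling.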

$\begin{aligned} \frac{\partial}{\partial v_c}\log P(o|c) &= \frac{\partial}{\partial v_c}\log\exp(u_o^T v_c)-\frac{\partial}{\partial v_c}\log\sum_{w\in Vocab}\exp(u_w^T v_c) \\ &= u_o - \frac{\sum_{w\in Vocab}\exp(u_w^T v_c)u_w}{\sum_{w\in Vocab}\exp(u_w^T v_c)} \\ &= u_o - \sum_{w\in Vocab}P(w|c)u_w \\ &= \text{observed} - \text{expected} \end{aligned}$

The result makes perfect sense: the gradient is the difference between the observed context vector and the model’s expected context vector.
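A quick numerical sanity check of the “observed minus expected” gradient, comparing the closed form against central finite differences on toy random data (all names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, d = 6, 4
U = rng.normal(size=(vocab_size, d))  # context vectors u_w
v_c = rng.normal(size=d)              # center vector v_c
o = 2                                 # index of the observed context word

def log_p(v):
    """log P(o|c) as a function of the center vector."""
    s = U @ v
    return s[o] - np.log(np.exp(s).sum())

# analytic gradient: u_o - sum_w P(w|c) u_w  ("observed - expected")
s = U @ v_c
p = np.exp(s) / np.exp(s).sum()
grad = U[o] - p @ U

# finite-difference approximation along each coordinate axis
eps = 1e-6
num = np.array([(log_p(v_c + eps * e) - log_p(v_c - eps * e)) / (2 * eps)
                for e in np.eye(d)])
```

The two gradients agree to numerical precision, confirming the derivation.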

It’s important to be cognizant of the biases (gender, race, sexual orientation etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.

```python
pprint.pprint(wv_from_bin.most_similar(
```

Running the above code gives the following results:

```python
[('reputation', 0.5250176787376404),
```

The results show that the word vectors are biased. One explanation of how bias enters the vectors is that they are trained to predict a word from its context, so biases present in the training contexts are absorbed into the vectors. For example, if “doctor” co-occurs more often with “he” than with “she” in the corpus, the vector for “doctor” ends up closer to the vector for “he” than to the vector for “she”.

One way to debias word vectors is to remove the bias component from them. A common approach is to preselect a group of potentially biased words and then *neutralize* and *equalize* their vectors. Neutralizing removes a word vector’s component along the bias direction; equalizing makes pairs of words (such as “he” and “she”) differ only along that direction, so they become equidistant from every neutralized word. After debiasing, for example, “doctor” is equally associated with “he” and “she”.
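The neutralize-and-equalize idea can be sketched as follows. This is a simplified toy variant assuming a single known bias direction `g` (e.g. a he-she axis), not the exact published algorithm:

```python
import numpy as np

def neutralize(w, g):
    """Remove the component of word vector w along bias direction g,
    leaving w orthogonal to g."""
    g = g / np.linalg.norm(g)
    return w - (w @ g) * g

def equalize(w1, w2, g):
    """Make a pair (e.g. 'he'/'she') share everything except an
    equal-and-opposite component along g, so both are equidistant
    from any neutralized word."""
    g = g / np.linalg.norm(g)
    mu = (w1 + w2) / 2
    mu_perp = mu - (mu @ g) * g          # shared, bias-free part
    lam = abs((w1 - w2) @ g) / 2         # equal magnitude along g
    return mu_perp + lam * g, mu_perp - lam * g

# hypothetical toy data
rng = np.random.default_rng(3)
g = rng.normal(size=5)                   # assumed bias direction
w = rng.normal(size=5)                   # some occupation word, say
he, she = rng.normal(size=5), rng.normal(size=5)

w_debiased = neutralize(w, g)
he_eq, she_eq = equalize(he, she, g)
```

After these steps `w_debiased` has no component along `g`, and its dot product with `he_eq` equals its dot product with `she_eq`, i.e. the neutralized word is equally associated with both.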

All articles on this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated; please credit the source when republishing.