• One-hot vs. distributed representation
• CBOW and skip-gram
• Huffman improvement
• Negative sampling

## Why distributed representations

word2vec is a word embedding method. Before introducing it, let's first look at what an embedding is.

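Mathematically, an embedding is a mapping f: X -> Y, i.e. a function, that is injective (one-to-one: distinct elements of X map to distinct elements of Y, so every Y in the image corresponds to exactly one X) and structure-preserving (for example, if X1 < X2 in the space that X belongs to, then Y1 < Y2 in the space that Y belongs to after the mapping). A word embedding therefore maps each word into another space, using a mapping that is injective and structure-preserving.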

In other words, word embedding means finding a mapping or function that produces a representation in a new space; that representation is the word representation.
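To make the contrast between one-hot and distributed representations concrete, here is a minimal sketch. The three-word vocabulary and the 2-dimensional embedding values are hypothetical and hand-picked for illustration (they are not trained). Any two distinct one-hot vectors are orthogonal, so they carry no notion of similarity, while dense vectors can place related words close together.

```python
import math

vocab = ["king", "queen", "apple"]  # hypothetical toy vocabulary

def one_hot(word):
    """Return a one-hot vector of length len(vocab) for `word`."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical dense embeddings, chosen by hand (NOT learned by word2vec).
embedding = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
    "apple": [0.1, 0.9],
}

# Distinct one-hot vectors are always orthogonal.
one_hot_sim = cosine(one_hot("king"), one_hot("queen"))

# Dense vectors can encode that "king" is closer to "queen" than to "apple".
dense_sim_close = cosine(embedding["king"], embedding["queen"])
dense_sim_far = cosine(embedding["king"], embedding["apple"])
```

With these values, `one_hot_sim` is exactly zero, while `dense_sim_close` is larger than `dense_sim_far`.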

## Skip-Gram

### Training

$\mathcal{L} = -\sum_{-m \leq j \leq m,\ j \neq 0} \text{log}\, P(w^{(i+j)} \mid w^{(i)})\tag{3-5}$

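Equation 3-5 can also be viewed as a cross-entropy loss. Taking the label $$\bf y^{(i+j)}$$ to be a one-hot vector over the vocabulary, and writing $$w_a$$ for the $$a$$-th word in the vocabulary:

$L({\bf W}) = - \sum_{-m \leq j \leq m,\ j \neq 0} \sum_{a=1}^V y_a^{(i+j)} \log P(w_a \mid w^{(i)})\tag{3-9}$

Each $$\bf y^{(i+j)}$$ has exactly one entry equal to 1 (the entry for the true context word $$w^{(i+j)}$$) and all other entries equal to 0, so only the term $$\log P(w^{(i+j)} \mid w^{(i)})$$ survives in the inner sum, and 3-9 is equivalent to 3-5.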
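The equivalence between the two forms of the loss can be checked numerically. The sketch below assumes the usual softmax parameterization, $$P(w_o \mid w_c) \propto \exp({\bf u_o^T v_c})$$; the toy vectors, the vocabulary size of 4, and the function names are all hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy parameters: V = 4 words, 2-dimensional vectors.
center_vecs = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1], [-0.2, 0.4]]
context_vecs = [[0.2, 0.0], [0.1, 0.3], [-0.1, 0.2], [0.4, -0.3]]

def skipgram_nll(center, context_ids):
    """Eq. 3-5: sum of -log P(context word | center word) over the window."""
    probs = softmax([dot(u, center_vecs[center]) for u in context_vecs])
    return -sum(math.log(probs[o]) for o in context_ids)

def skipgram_cross_entropy(center, context_ids):
    """Eq. 3-9: the same loss written with explicit one-hot labels."""
    probs = softmax([dot(u, center_vecs[center]) for u in context_vecs])
    V = len(probs)
    loss = 0.0
    for o in context_ids:
        y = [1.0 if a == o else 0.0 for a in range(V)]  # one-hot label
        loss -= sum(y[a] * math.log(probs[a]) for a in range(V))
    return loss

# Center word id 2 with context word ids 0, 1 and 3.
nll = skipgram_nll(2, [0, 1, 3])
ce = skipgram_cross_entropy(2, [0, 1, 3])
```

Both functions return the same value up to floating-point rounding, because the one-hot label zeroes out every term except the one for the observed context word.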

## CBOW model

CBOW (continuous bag-of-words) uses the background (context) words within distance d before and after a center word to predict the probability of that center word.
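A minimal sketch of this prediction step, assuming the context vectors are averaged into $$\bf \bar v$$ and scored against every center-word vector $$\bf u_k$$ with a softmax (the form used in the training equations of this section). The toy vectors and the function name `cbow_probs` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_probs(context_ids, context_vecs, center_vecs):
    """Return P(w_k | context) for every word k in the vocabulary."""
    C = len(context_ids)
    dim = len(context_vecs[0])
    # Average the context (background) word vectors into v_bar.
    v_bar = [sum(context_vecs[i][d] for i in context_ids) / C
             for d in range(dim)]
    # Score v_bar against every center-word vector, then normalize.
    return softmax([dot(v_bar, u_k) for u_k in center_vecs])

# Hypothetical toy parameters: V = 4 words, 2-dimensional vectors.
context_vecs = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1], [-0.2, 0.4]]
center_vecs = [[0.2, 0.0], [0.1, 0.3], [-0.1, 0.2], [0.4, -0.3]]

probs = cbow_probs([0, 1, 3], context_vecs, center_vecs)
predicted = max(range(len(probs)), key=lambda k: probs[k])
```

`probs` is a distribution over the whole vocabulary, which is why the full softmax costs time proportional to the vocabulary size at every step.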

The CBOW model can be represented by the following one-layer neural network:

### Training

Taking the logarithm of the softmax probability gives:

$\log P(w_j \mid w_1,\dots,w_C) = {\bf {\bar v}^T u_j} - \log \sum_{k = 1}^{V}\exp({\bf {\bar v}^T u_k})\tag{4-3}$

Differentiating 4-3 with respect to the center word vector $$\bf u_j$$ gives:

$\begin{aligned} \frac{\partial \log P}{\partial {\bf u}_j} &= {\bf {\bar v}} - \frac{\exp({\bf {\bar v}^T u_j}){\bf {\bar v}}}{\sum_{a=1}^{V} \exp({\bf {\bar v}^T u_a})}\\ &= {\bf {\bar v}} - P(w_j \mid w_1,\dots,w_C)\, {\bf {\bar v}} \end{aligned}\tag{4-4}$

For any background word vector $$\bf v_i$$, differentiating 4-3 with respect to it gives:

$\begin{aligned} \frac{\partial \log P}{\partial {\bf v}_i} &= \frac{1}{C} {\bf u_j} - \frac{\sum_{b=1}^{V}\exp({\bf {\bar v}^T u_b})\frac{1}{C} {\bf u_b} }{\sum_{a=1}^{V} \exp({\bf {\bar v}^T u_a})}\\ &= \frac{1}{C} \left({\bf u_j} - \sum_{b=1}^V P(w_b \mid w_1,\dots,w_C)\, {\bf u_b} \right) \end{aligned}\tag{4-5}$
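The two gradients can be sanity-checked against central finite differences. This is only a sketch: the toy vectors, indices, and function names are hypothetical, and it assumes the context words are distinct so that each $$\bf v_i$$ contributes exactly $$\frac{1}{C}$$ to $$\bf \bar v$$.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cbow_log_prob(j, context_ids, v, u):
    """Return (log P(w_j | context), v_bar, probs) following eq. 4-3."""
    C = len(context_ids)
    dim = len(u[0])
    v_bar = [sum(v[i][d] for i in context_ids) / C for d in range(dim)]
    scores = [dot(v_bar, u_k) for u_k in u]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    probs = [math.exp(s - log_z) for s in scores]
    return scores[j] - log_z, v_bar, probs

def analytic_grads(j, context_ids, v, u):
    """Gradients from eq. 4-4 (w.r.t. u_j) and eq. 4-5 (w.r.t. any context v_i)."""
    _, v_bar, probs = cbow_log_prob(j, context_ids, v, u)
    C = len(context_ids)
    dim = len(v_bar)
    grad_u_j = [(1.0 - probs[j]) * x for x in v_bar]
    expected_u = [sum(probs[b] * u[b][d] for b in range(len(u)))
                  for d in range(dim)]
    grad_v_i = [(u[j][d] - expected_u[d]) / C for d in range(dim)]
    return grad_u_j, grad_v_i

def numeric_grad(row, f, eps=1e-6):
    """Central finite differences of f() with respect to the entries of `row`."""
    grad = []
    for d in range(len(row)):
        old = row[d]
        row[d] = old + eps
        plus = f()
        row[d] = old - eps
        minus = f()
        row[d] = old  # restore the original value
        grad.append((plus - minus) / (2 * eps))
    return grad

# Hypothetical toy parameters: V = 4 words, 2-dimensional vectors.
v = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1], [-0.2, 0.4]]   # background vectors
u = [[0.2, 0.0], [0.1, 0.3], [-0.1, 0.2], [0.4, -0.3]]   # center vectors
context_ids, j, i = [0, 1, 3], 2, 1

grad_u_j, grad_v_i = analytic_grads(j, context_ids, v, u)
f = lambda: cbow_log_prob(j, context_ids, v, u)[0]
num_u_j = numeric_grad(u[j], f)
num_v_i = numeric_grad(v[i], f)
max_err = max(abs(a - n)
              for a, n in zip(grad_u_j + grad_v_i, num_u_j + num_v_i))
```

`max_err` should be tiny, which indicates that the closed forms agree with the numerical derivatives. Note that 4-5 gives the same gradient for every context word, since each enters the average with the same weight.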

## Speeding up computation

### Negative sampling

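• Positive examples: each pair formed by the center word $$w_c$$ and one of the background words $$w_i$$ within the window is a positive example (the same as before).
• Negative examples: keep the same center word $$w_c$$, randomly draw k words from the vocabulary, and pair the center word with each of them; all k resulting pairs are labeled as negative examples. Here k is typically 5 to 20. One subtle detail: suppose the center word is fox and the window size is 2, and the word drawn from the vocabulary is brown, which actually appears before fox in the sentence; the pair is still counted as a negative example.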
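The construction of positive and negative pairs described above can be sketched as follows. The sentence, the function name `make_pairs`, and the uniform draw from the vocabulary are assumptions made for illustration; k negatives are drawn per center word, following the description above.

```python
import random

def make_pairs(tokens, center_pos, window, k, vocab, rng):
    """Return (center, other, label) triples: label 1 = positive, 0 = negative."""
    center = tokens[center_pos]
    lo = max(0, center_pos - window)
    hi = min(len(tokens), center_pos + window + 1)
    # Positive pairs: the center word with each word inside the window.
    positives = [(center, tokens[p], 1)
                 for p in range(lo, hi) if p != center_pos]
    # Negative pairs: k words drawn at random from the vocabulary.
    # A drawn word may also occur in the window (e.g. "brown" for "fox");
    # it is still labeled as a negative example.
    negatives = [(center, rng.choice(vocab), 0) for _ in range(k)]
    return positives + negatives

tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))
rng = random.Random(0)  # fixed seed so the sketch is reproducible

pairs = make_pairs(tokens, center_pos=3, window=2, k=5, vocab=vocab, rng=rng)
positive_pairs = [p for p in pairs if p[2] == 1]
negative_pairs = [p for p in pairs if p[2] == 0]
```

With `fox` as the center word and a window of 2, the positive pairs are (fox, quick), (fox, brown), (fox, jumps) and (fox, over), followed by five randomly drawn negative pairs.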

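• If words are chosen with equal probability, the sample is not representative of English text.
• If words are drawn according to their frequency, stop words such as "the" are sampled with very high probability, but ("fox", "the") tells us very little, because "the" appears in a great many contexts.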
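The trade-off between the two choices above can be made concrete with a small sketch. The word counts are hypothetical. The third distribution raises the counts to the power 0.75, which is the compromise used in the original word2vec work; that exponent is not stated in the text above and is included here only as a commonly used middle ground.

```python
from collections import Counter

def sampling_distribution(counts, power):
    """Sampling probability proportional to count ** power."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

# Hypothetical corpus counts for three words.
counts = Counter({"the": 1000, "fox": 10, "delightful": 1})

uniform = sampling_distribution(counts, 0.0)    # every word equally likely
by_freq = sampling_distribution(counts, 1.0)    # proportional to frequency
smoothed = sampling_distribution(counts, 0.75)  # assumption: 3/4 power
```

Under `by_freq`, almost every draw is "the". The smoothed distribution still favors frequent words, but shifts some probability mass toward rare words such as "delightful".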

### Subsampling

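• Negative sampling constructs the loss function by considering mutually independent events that involve both positive and negative examples. The cost of computing the gradient at each training step is linear in the number of sampled negative examples.
• Hierarchical softmax uses a Huffman tree and constructs the loss function from the path from the root node to a leaf node. The cost of computing the gradient at each training step is logarithmic in the vocabulary size.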
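As a sketch of why the negative-sampling cost is linear in the number of negatives, the loss below treats the positive pair and each of the k negative pairs as independent binary events scored with a sigmoid. The sigmoid form and all names are assumptions for illustration; the text above does not spell out the loss.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def neg_sampling_loss(v_c, u_pos, u_negs):
    """-log P(positive happens) - sum over negatives of log P(negative does not)."""
    loss = -log_sigmoid(dot(u_pos, v_c))     # one dot product for the positive
    for u_k in u_negs:                       # one dot product per negative
        loss -= log_sigmoid(-dot(u_k, v_c))
    return loss

# Hypothetical 2-dimensional vectors.
v_c = [0.5, -0.2]
u_pos = [0.4, 0.1]
u_negs = [[-0.3, 0.2], [0.1, 0.6], [0.0, -0.5]]

loss = neg_sampling_loss(v_c, u_pos, u_negs)

# With all-zero vectors every sigmoid equals 0.5, so the loss is (1 + k) * log 2.
zero_loss = neg_sampling_loss([0.0, 0.0], [0.0, 0.0], [[0.0, 0.0]] * 3)
```

Each step needs only k + 1 dot products, regardless of the vocabulary size, whereas the full softmax needs one per vocabulary word.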

## CBOW vs. Skip-gram

In CBOW the vectors from the context words are averaged before predicting the center word. In skip-gram there is no averaging of embedding vectors. It seems like the model can learn better representations for the rare words when their vectors are not averaged with the other context words in the process of making the predictions.

As we know, CBOW is learning to predict the word by the context. Or maximize the probability of the target word by looking at the context. And this happens to be a problem for rare words. For example, given the context yesterday was really [...] day CBOW model will tell you that most probably the word is beautiful or nice. Words like delightful will get much less attention of the model, because it is designed to predict the most probable word. Rare words will be smoothed over a lot of examples with more frequent words.

On the other hand, the skip-gram is designed to predict the context. Given the word delightful it must understand it and tell us, that there is huge probability, the context is yesterday was really [...] day, or some other relevant context. With skip-gram the word delightful will not try to compete with word beautiful but instead, delightful+context pairs will be treated as new observations. Because of this, skip-gram will need more data so it will learn to understand even rare words.

from https://stats.stackexchange.com/a/261440

