2020-05-01

Large-Scale Unsupervised Pre-trained Language Models and Applications (Part 1)

Subword Modeling

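Using words as the basic unit of a model has several problems:

The vocabulary is finite. We usually fix its size somewhere between 50k and 300k, and any word outside it can only be represented as UNK
Zipf's law: given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on: the rank-frequency distribution is an inverse relation.
The number of parameters is too large: 100K words * 300 dimensions = 30M parameters for the embedding layer alone
In many languages, English for example, a word is often composed of several subwords
For Chinese, many commonly used models take the words produced by word segmentation as their basic unit, so they have the same problems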
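The fixed vocabulary, the UNK fallback, and the embedding parameter count can be made concrete with a small sketch. The toy corpus, the `<UNK>` token name, and the helper names `build_vocab` and `encode` are assumptions made for illustration only.

```python
from collections import Counter

def build_vocab(tokens, max_size):
    """Keep the max_size - 1 most frequent words; index 0 is reserved for UNK."""
    counts = Counter(tokens)
    itos = ["<UNK>"] + [w for w, _ in counts.most_common(max_size - 1)]
    return {w: i for i, w in enumerate(itos)}

def encode(tokens, vocab):
    """Map tokens to indices; words outside the vocabulary fall back to UNK."""
    return [vocab.get(t, vocab["<UNK>"]) for t in tokens]

# Toy corpus (hypothetical): the=4, cat=3, sat=2, on=1, mat=1
corpus = "the cat sat on the mat the cat sat the cat".split()
vocab = build_vocab(corpus, max_size=4)  # <UNK> plus the 3 most frequent words
ids = encode("the dog sat on".split(), vocab)  # "dog" and "on" both become UNK

# Embedding-layer size from the notes: 100K words x 300 dimensions.
embedding_params = 100_000 * 300
```

Both the unseen word "dog" and the rare word "on" collapse onto the same UNK index, which is exactly the information loss that subword models try to avoid.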

Possible solutions:

Use subword information, for example characters as the basic unit of language (Char-CNN)
Use wordpieces

Solution: character-level modeling

Use characters as the model's basic input unit

Ling et al., Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation

Use a BiLSTM to encode the characters of a word together into one word representation

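Yoon Kim et al., Character-Aware Neural Language Models

Based on the model architecture diagram from the paper, consider the following questions:

What is the dimension of the character embedding? 4
How many character 4-gram filters (filter size = 4) are there? The red ones: 5 filters
max-over-time pooling: the 3-gram filters give 4 dimensions, the 2-gram filters 3 dimensions, and the 4-gram filters 5 dimensions
Why do filters of different lengths (kernel sizes) produce feature maps of different lengths? The feature map length is seq_length - kernel_size + 1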
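A minimal pure-Python sketch of the two ideas in these questions: an unpadded convolution whose feature map has length seq_length - kernel_size + 1, followed by max-over-time pooling that keeps one value per filter. The weights are random, the bias and nonlinearity are left out, and the helper names are hypothetical, so this is only a shape illustration and not the model from the paper.

```python
import random

def rand_matrix(rows, cols):
    """Random matrix standing in for learned parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def conv1d(char_embs, kernel):
    """Slide one filter over the character sequence (no padding, stride 1).

    char_embs: seq_length rows of size emb_dim
    kernel:    kernel_size rows of size emb_dim
    Returns a feature map of length seq_length - kernel_size + 1.
    """
    k = len(kernel)
    feature_map = []
    for start in range(len(char_embs) - k + 1):
        window = char_embs[start:start + k]
        feature_map.append(sum(w * x
                               for w_row, x_row in zip(kernel, window)
                               for w, x in zip(w_row, x_row)))
    return feature_map

def char_cnn(char_embs, filter_bank):
    """Max-over-time pooling: keep the largest response of each filter."""
    return [max(conv1d(char_embs, kernel)) for kernel in filter_bank]

random.seed(0)
emb_dim, seq_length = 4, 9  # a 9-character word, 4-dimensional character embeddings
char_embs = rand_matrix(seq_length, emb_dim)

# Filter counts as read from the notes: 3 of width 2, 4 of width 3, 5 of width 4.
filters = {2: 3, 3: 4, 4: 5}
filter_bank = [rand_matrix(width, emb_dim)
               for width, count in filters.items() for _ in range(count)]

map_lengths = {width: len(conv1d(char_embs, rand_matrix(width, emb_dim)))
               for width in filters}
word_vector = char_cnn(char_embs, filter_bank)
```

The feature maps have different lengths per kernel size, but pooling reduces each of them to a single number, so the word vector always has one dimension per filter (3 + 4 + 5 = 12 here) regardless of word length.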

fastText

Similar to word2vec, but each word is represented as the sum of its character n-gram embeddings plus its word embedding
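A rough sketch of this composition, assuming the word is wrapped in the boundary markers `<` and `>` and that n ranges from 3 to 6. The plain dictionary lookup tables and the function names are hypothetical simplifications and do not reflect how the fastText library stores n-grams internally.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

def word_vector(word, ngram_emb, word_emb, dim):
    """Sum the word's own embedding and the embeddings of its n-grams.

    Entries missing from the (hypothetical) lookup tables count as zero vectors.
    """
    zero = [0.0] * dim
    vectors = [word_emb.get(word, zero)]
    vectors += [ngram_emb.get(gram, zero) for gram in char_ngrams(word)]
    return [sum(component) for component in zip(*vectors)]

trigrams = char_ngrams("where", 3, 3)

# Tiny hand-written tables, only to show the summation.
vec = word_vector("where",
                  ngram_emb={"<wh": [1.0, 2.0]},
                  word_emb={"where": [0.5, 0.5]},
                  dim=2)
```

Because an unseen word still shares n-grams with known words, it gets a non-trivial vector even when its own word embedding is missing.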

Solution: use subwords as the model's basic unit

Botha & Blunsom (2014): Compositional Morphology for Word Representations and Language Modelling

subword embedding

Byte Pair Encoding (you need to know what BPE is)

Neural Machine Translation of Rare Words with Subword Units

For an explanation of BPE, see the articles below

https://www.cnblogs.com/huangyc/p/10223075.html

https://leimao.github.io/blog/Byte-Pair-Encoding/

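First define all possible base characters (abcde…)
Then loop: count the most frequent adjacent pair of symbols and add the merged pair to the candidate symbols (the basic units)

a, b, c, d, …, z, A, B, …, Z, …, !, @, ?, st, est, lo, low, …

Controlling the vocabulary size

We only need to fix the number of iterations: with 30,000 iterations, the vocabulary has 30,000 + (number of characters in the original alphabet) entries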
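The merge loop can be sketched in a few lines. This is a simplified illustration on a toy word-frequency table: it leaves out the end-of-word symbol used in the paper, breaks ties by first occurrence, and the function names are hypothetical.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Return the learned merges in order; num_merges caps vocabulary growth."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Toy word frequencies, chosen only for illustration.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=3)

# Vocabulary size = base characters + one new symbol per merge.
alphabet = {ch for word in word_freqs for ch in word}
vocab_size = len(alphabet) + len(merges)
```

On this toy table the first two merges build `est`, and later merges start building `low`. The number of merges is the only knob: each iteration adds exactly one symbol to the vocabulary.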

happiest

h a p p i est

LSTM

emb(h), emb(a), emb(p), emb(p), emb(i), emb(est)

happ, iest

emb(happ), emb(iest)
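The two segmentations of happiest above correspond to a small and a larger merge table. The sketch below replays a list of merges in the order they were learned; both merge lists are hand-written assumptions chosen to reproduce the example, not merges learned from real data.

```python
def apply_bpe(word, merges):
    """Segment a word by replaying merges in the order they were learned."""
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge tables: few merges (small vocabulary) vs more merges.
few_merges = [("e", "s"), ("es", "t")]
more_merges = few_merges + [("h", "a"), ("p", "p"), ("ha", "pp"), ("i", "est")]

small_vocab_split = apply_bpe("happiest", few_merges)
large_vocab_split = apply_bpe("happiest", more_merges)
```

Each resulting symbol is then looked up in the embedding table, so a larger merge table gives shorter input sequences to the LSTM at the cost of a larger vocabulary.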

https://www.aclweb.org/anthology/P16-1162.pdf

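Chinese word vectors

Meng et al., Is Word Segmentation Necessary for Deep Learning of Chinese Representations?

In short, the authors found in their experiments that Chinese word segmentation does not help with language modeling, text classification, machine translation, or sentence matching; feeding single characters directly into the model gives better results.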

We benchmark neural word-based models which rely on word segmentation against neural char-based models which do not involve word segmentation in four end-to-end NLP benchmark tasks: language modeling, machine translation, sentence matching/paraphrase and text classification. Through direct comparisons between these two types of models, we find that char-based models consistently outperform word-based models.

word-based models are more vulnerable to data sparsity and the presence of out-of-vocabulary (OOV) words, and thus more prone to overfitting
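The OOV point can be illustrated with a toy comparison of word-level and character-level units. The hand-segmented sentences below are made up for this sketch and have nothing to do with the paper's experiments.

```python
def oov_rate(train_units, test_units):
    """Fraction of test units that never occur in the training data."""
    seen = set(train_units)
    unseen = [u for u in test_units if u not in seen]
    return len(unseen) / len(test_units)

def to_words(sentence):
    """Word-based units: rely on the (here hand-made) segmentation."""
    return sentence.split()

def to_chars(sentence):
    """Char-based units: ignore segmentation, one unit per character."""
    return [ch for ch in sentence if not ch.isspace()]

# Hypothetical toy data; spaces mark word boundaries.
train = ["我 喜欢 机器学习", "他 研究 语言模型"]
test = "他 喜欢 语言 学习"

word_oov = oov_rate([w for s in train for w in to_words(s)], to_words(test))
char_oov = oov_rate([c for s in train for c in to_chars(s)], to_chars(test))
```

Here 语言 and 学习 only appear inside longer training words, so half of the test words are OOV at the word level, while every test character has been seen during training.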

Jiwei Li

https://nlp.stanford.edu/~bdlijiwei/

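Chinese word segmentation tools

Students are encouraged to try the following tools in their own projects

Peking University Chinese word segmentation tool (pkuseg)
https://github.com/lancopku/pkuseg-python
Report by 机器之心 (Synced): https://www.jiqizhixin.com/articles/2019-01-09-12
Tsinghua University word segmentation tool (THULAC): https://github.com/thunlp/THULAC-Python
jieba (结巴): https://github.com/fxsjy/jieba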

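Pre-trained sentence/document vectors

Since we have word vectors, can we go one step further and encode a sentence, or even a whole document, as a single vector?

In earlier lectures we already touched on some sentence-level tasks, such as text classification, which typically assigns one or several sentences to a class. Such models are usually implemented by first encoding the text into some representation, for example averaged word embeddings, the concatenated end states of a bidirectional LSTM, or a CNN.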

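Text classification

The text is turned into a vector in some way
WORDAVG
LSTM
CNN
Finally a linear layer: 300-dimensional sentence vector -> 2 classes (sentiment classification)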
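A minimal pure-Python sketch of this pipeline using the WORDAVG encoder: average the word embeddings into a 300-dimensional sentence vector, then apply a linear layer with two outputs. All parameters are random and untrained, and the toy vocabulary is an assumption; in practice this would be written with a deep learning framework and trained with a cross-entropy loss.

```python
import math
import random

random.seed(0)
EMB_DIM, NUM_CLASSES = 300, 2

def random_vector(n):
    return [random.gauss(0.0, 0.1) for _ in range(n)]

# Hypothetical toy vocabulary with randomly initialised (untrained) parameters.
embeddings = {w: random_vector(EMB_DIM)
              for w in ["<UNK>", "this", "movie", "is", "great", "bad"]}
weight = [random_vector(EMB_DIM) for _ in range(NUM_CLASSES)]  # shape (2, 300)
bias = [0.0] * NUM_CLASSES

def word_avg(tokens):
    """WORDAVG encoder: the sentence vector is the mean of its word embeddings."""
    vectors = [embeddings.get(t, embeddings["<UNK>"]) for t in tokens]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def linear(x, weight, bias):
    """Linear layer: one score (logit) per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def softmax(logits):
    """Turn logits into class probabilities (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

sentence_vector = word_avg("this movie is great".split())
probs = softmax(linear(sentence_vector, weight, bias))
```

Swapping `word_avg` for an LSTM or CNN encoder changes how the sentence vector is computed, but the final linear layer stays the same.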

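Cat images / dog images

Image -> ResNet -> 2048-dimensional vector -> linear layer with weight shape (2, 2048) -> 2-dimensional vector, binary cross entropy loss

ResNet pre-trained model

Text -> TextResNet -> 2048-dimensional vector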

apply to any downstream tasks

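TextResNet: an LSTM model

Different tasks (for example different kinds of text classification, such as sentiment classification and topic classification) have different final outputs, but they often share a similar or even identical encoding layer. If we can pre-train a very good encoding layer, the burden on the downstream model is reduced to some extent. Much of this idea comes from work in image processing. For example, across many image tasks people found that taking a deep CNN pre-trained on ImageNet (such as ResNet) and replacing only the final output layer with the one they need often works very well, and makes it possible to train a high-quality model from a small amount of data.

A series of works has emerged in the area of pre-training sentence/text vectors. Below we pick some representative ones for study and reference.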
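The idea of one shared encoder with task-specific output layers can be sketched as follows. The `PretrainedEncoder` class is a stand-in with random parameters, the 8-dimensional output replaces the 2048 dimensions mentioned earlier to keep the example small, and the task names and class counts are assumptions.

```python
import random

random.seed(0)
ENC_DIM = 8  # stands in for the 2048-dimensional encoder output

def make_linear(out_dim, in_dim):
    """Randomly initialised weight matrix for a task-specific output layer."""
    return [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def apply_linear(weight, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

class PretrainedEncoder:
    """Stand-in for a pre-trained text encoder; its parameters stay fixed."""

    def __init__(self, vocab):
        self.emb = {tok: [random.gauss(0.0, 1.0) for _ in range(ENC_DIM)]
                    for tok in vocab}

    def __call__(self, tokens):
        # Average the embeddings of known tokens into one text vector.
        vectors = [self.emb[t] for t in tokens if t in self.emb]
        return [sum(col) / len(vectors) for col in zip(*vectors)]

encoder = PretrainedEncoder(["good", "bad", "sports", "news"])

# Only these small heads differ between tasks.
heads = {
    "sentiment": make_linear(2, ENC_DIM),  # positive / negative
    "topic": make_linear(4, ENC_DIM),      # four hypothetical topics
}

features = encoder(["good", "news"])  # computed once, shared by both tasks
logits = {task: apply_linear(weight, features) for task, weight in heads.items()}
```

Only the head has to be learned for a new task, which is why a good pre-trained encoder helps most when labeled data is scarce.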