Language Model

Learning objectives

  • Learn what a language model is and how to train one
  • Learn the basics of torchtext
    • building a vocabulary
    • word to index and index to word
  • Learn some basic models in torch.nn
    • Linear
    • RNN
    • LSTM
    • GRU
  • RNN training tricks
    • Gradient Clipping
  • How to save and load a model

We will use torchtext to build the vocabulary and then read the data in batch format. Please read the README on your own to learn torchtext.

First, get familiar with the torchtext library: see the torchtext introduction and usage tutorial.

In [1]:

import torchtext
from torchtext.vocab import Vectors
import torch
import numpy as np
import random

USE_CUDA = torch.cuda.is_available()

# To make experiments reproducible, we usually fix every random seed to a constant value
random.seed(53113)
np.random.seed(53113)
torch.manual_seed(53113)
if USE_CUDA:
    torch.cuda.manual_seed(53113)

BATCH_SIZE = 32         # number of sentences per batch
EMBEDDING_SIZE = 650    # dimensionality of each word vector
MAX_VOCAB_SIZE = 50000  # maximum number of words in the vocabulary
  • We continue to use text8 from last time as our training, validation, and test data
  • torchtext provides the LanguageModelingDataset class to handle language-modeling datasets
  • BPTTIterator produces consecutive, coherent chunks of text

In [2]:

TEXT = torchtext.data.Field(lower=True)
# A Field object describes how we want to preprocess the text; here every word is lowercased

train, val, test = \
    torchtext.datasets.LanguageModelingDataset.splits(
        path=".",
        train="text8.train.txt",
        validation="text8.dev.txt",
        test="text8.test.txt",
        text_field=TEXT)
# torchtext provides the LanguageModelingDataset class to handle language-modeling datasets

TEXT.build_vocab(train, max_size=MAX_VOCAB_SIZE)
# build_vocab builds a vocabulary of the most frequent words from the training set;
# max_size caps the total number of words.
print("vocabulary size: {}".format(len(TEXT.vocab)))
vocabulary size: 50002

In [4]:

test

Out[4]:

<torchtext.data.example.Example at 0x121738b00>

In [9]:

print(TEXT.vocab.itos[0:50])
# Words are ordered from most to least frequent; two special tokens are added:
# <unk> for unknown words and <pad> for padding.
print("------"*10)
print(list(TEXT.vocab.stoi.items())[0:50])
['<unk>', '<pad>', 'the', 'of', 'and', 'one', 'in', 'a', 'to', 'zero', 'nine', 'two', 'is', 'as', 'eight', 'for', 's', 'five', 'three', 'was', 'by', 'that', 'four', 'six', 'seven', 'with', 'on', 'are', 'it', 'from', 'or', 'his', 'an', 'be', 'this', 'he', 'at', 'which', 'not', 'also', 'have', 'were', 'has', 'but', 'other', 'their', 'its', 'first', 'they', 'had']
------------------------------------------------------------
[('<unk>', 0), ('<pad>', 1), ('the', 2), ('of', 3), ('and', 4), ('one', 5), ('in', 6), ('a', 7), ('to', 8), ('zero', 9), ('nine', 10), ('two', 11), ('is', 12), ('as', 13), ('eight', 14), ('for', 15), ('s', 16), ('five', 17), ('three', 18), ('was', 19), ('by', 20), ('that', 21), ('four', 22), ('six', 23), ('seven', 24), ('with', 25), ('on', 26), ('are', 27), ('it', 28), ('from', 29), ('or', 30), ('his', 31), ('an', 32), ('be', 33), ('this', 34), ('he', 35), ('at', 36), ('which', 37), ('not', 38), ('also', 39), ('have', 40), ('were', 41), ('has', 42), ('but', 43), ('other', 44), ('their', 45), ('its', 46), ('first', 47), ('they', 48), ('had', 49)]

In [10]:

VOCAB_SIZE = len(TEXT.vocab)  # 50002
train_iter, val_iter, test_iter = \
    torchtext.data.BPTTIterator.splits(
        (train, val, test),
        batch_size=BATCH_SIZE,
        device=-1,  # -1 means CPU; newer versions prefer a torch.device or string (hence the warnings below)
        bptt_len=50,  # how many time steps backpropagation through time unrolls,
                      # i.e. how many words of one sample are fed into the model
        repeat=False,
        shuffle=True)
# BPTTIterator produces consecutive, coherent chunks of text.
# BPTT stands for back propagation through time.
'''
Iterator: the standard iterator.

BucketIterator: compared with the standard iterator, it groups samples of similar
length into the same batch. Text batches are usually padded to the length of the
longest sequence in the batch, so when sample lengths vary a lot, BucketIterator
wastes much less padding. In addition, the fix_length argument of Field can be
used to truncate or pad every sample to a fixed length.

BPTTIterator: an iterator based on BPTT (back propagation through time),
typically used for language models.
'''
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.

Out[10]:

'\nIterator: the standard iterator.\n\nBucketIterator: compared with the standard iterator, it groups samples of similar\nlength into the same batch. Text batches are usually padded to the length of the\nlongest sequence in the batch, so when sample lengths vary a lot, BucketIterator\nwastes much less padding. In addition, the fix_length argument of Field can be\nused to truncate or pad every sample to a fixed length.\n\nBPTTIterator: an iterator based on BPTT (back propagation through time),\ntypically used for language models.\n'

In [11]:

print(next(iter(train_iter)))  # shape of one training batch
print(next(iter(val_iter)))    # shape of one validation batch
print(next(iter(test_iter)))   # shape of one test batch
[torchtext.data.batch.Batch of size 32]
[.text]:[torch.LongTensor of size 50x32]
[.target]:[torch.LongTensor of size 50x32]

[torchtext.data.batch.Batch of size 32]
[.text]:[torch.LongTensor of size 50x32]
[.target]:[torch.LongTensor of size 50x32]

[torchtext.data.batch.Batch of size 32]
[.text]:[torch.LongTensor of size 50x32]
[.target]:[torch.LongTensor of size 50x32]

The model's input is a sequence of words and its output is also a sequence of words, shifted by one position, because the goal of a language model is to predict the next word given the previous words.
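
To make the one-position shift concrete, here is a minimal sketch (my own illustration, independent of torchtext) of how a token stream is turned into input/target pairs:

In [ ]:

# every target chunk is the input chunk shifted one word to the left,
# which is exactly the .text / .target relationship shown in the batches below
tokens = "anarchism originated as a term of abuse first used against".split()
bptt_len = 5

for start in range(0, len(tokens) - 1, bptt_len):
    print("input :", tokens[start:start + bptt_len])
    print("target:", tokens[start + 1:start + 1 + bptt_len])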

In [12]:

it = iter(train_iter)
batch = next(it)
print(" ".join([TEXT.vocab.itos[i] for i in batch.text[:,1].data]))    # print one input sentence
print(" ".join([TEXT.vocab.itos[i] for i in batch.target[:,1].data]))  # print one target sentence
combine in pairs and then group into trios of pairs which are the smallest visible units of matter this parallels with the structure of modern atomic theory in which pairs or triplets of supposedly fundamental quarks combine to create most typical forms of matter they had also suggested the possibility
in pairs and then group into trios of pairs which are the smallest visible units of matter this parallels with the structure of modern atomic theory in which pairs or triplets of supposedly fundamental quarks combine to create most typical forms of matter they had also suggested the possibility of

In [13]:

for j in range(5):
    # walking across the columns of a single batch: the columns do not join up into continuous text
    print(j)
    print(" ".join([TEXT.vocab.itos[i] for i in batch.text[:,j].data]))
    print(j)
    print(" ".join([TEXT.vocab.itos[i] for i in batch.target[:,j].data]))
0
anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans <unk> of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the
0
originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans <unk> of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization
1
combine in pairs and then group into trios of pairs which are the smallest visible units of matter this parallels with the structure of modern atomic theory in which pairs or triplets of supposedly fundamental quarks combine to create most typical forms of matter they had also suggested the possibility
1
in pairs and then group into trios of pairs which are the smallest visible units of matter this parallels with the structure of modern atomic theory in which pairs or triplets of supposedly fundamental quarks combine to create most typical forms of matter they had also suggested the possibility of
2
culture few living ainu settlements exist many authentic ainu villages advertised in hokkaido are simply tourist attractions language the ainu language is significantly different from japanese in its syntax phonology morphology and vocabulary although there have been attempts to show that they are related the vast majority of modern scholars
2
few living ainu settlements exist many authentic ainu villages advertised in hokkaido are simply tourist attractions language the ainu language is significantly different from japanese in its syntax phonology morphology and vocabulary although there have been attempts to show that they are related the vast majority of modern scholars reject
3
zero the apple iie card an expansion card for the lc line of macintosh computers was released essentially a miniaturized apple iie computer on a card utilizing the mega ii chip from the apple iigs it allowed the macintosh to run eight bit apple iie software through hardware emulation although
3
the apple iie card an expansion card for the lc line of macintosh computers was released essentially a miniaturized apple iie computer on a card utilizing the mega ii chip from the apple iigs it allowed the macintosh to run eight bit apple iie software through hardware emulation although video
4
in papers have been written arguing that the anthropic principle would explain the physical constants such as the fine structure constant the number of dimensions in the universe and the cosmological constant the three primary versions of the principle as stated by john d barrow and frank j <unk> one
4
papers have been written arguing that the anthropic principle would explain the physical constants such as the fine structure constant the number of dimensions in the universe and the cosmological constant the three primary versions of the principle as stated by john d barrow and frank j <unk> one nine

In [14]:

for i in range(5):
    # taking the same column (column 2) from consecutive batches: across batches the text of a given
    # column is continuous, because BPTTIterator splits the corpus into BATCH_SIZE contiguous streams,
    # one stream per column
    batch = next(it)
    print(i)
    print(" ".join([TEXT.vocab.itos[i] for i in batch.text[:,2].data]))
    print(i)
    print(" ".join([TEXT.vocab.itos[i] for i in batch.target[:,2].data]))
0
reject that the relationship goes beyond contact i e mutual borrowing of words between japanese and ainu in fact no attempt to show a relationship with ainu to any other language has gained wide acceptance and ainu is currently considered to be a language isolate culture traditional ainu culture is
0
that the relationship goes beyond contact i e mutual borrowing of words between japanese and ainu in fact no attempt to show a relationship with ainu to any other language has gained wide acceptance and ainu is currently considered to be a language isolate culture traditional ainu culture is quite
1
quite different from japanese culture never shaving after a certain age the men had full beards and <unk> men and women alike cut their hair level with the shoulders at the sides of the head but trimmed it <unk> behind the women tattooed their mouths arms <unk> and sometimes their
1
different from japanese culture never shaving after a certain age the men had full beards and <unk> men and women alike cut their hair level with the shoulders at the sides of the head but trimmed it <unk> behind the women tattooed their mouths arms <unk> and sometimes their <unk>
2
<unk> starting at the onset of puberty the soot deposited on a pot hung over a fire of birch bark was used for colour their traditional dress is a robe spun from the bark of the elm tree it has long sleeves reaches nearly to the feet is folded round
2
starting at the onset of puberty the soot deposited on a pot hung over a fire of birch bark was used for colour their traditional dress is a robe spun from the bark of the elm tree it has long sleeves reaches nearly to the feet is folded round the
3
the body and is tied with a girdle of the same material women also wear an <unk> of japanese cloth in winter the skins of animals were worn with <unk> of <unk> and boots made from the skin of dogs or salmon both sexes are fond of earrings which are
3
body and is tied with a girdle of the same material women also wear an <unk> of japanese cloth in winter the skins of animals were worn with <unk> of <unk> and boots made from the skin of dogs or salmon both sexes are fond of earrings which are said
4
said to have been made of grapevine in former times as also are bead necklaces called <unk> which the women prized highly their traditional cuisine consists of the flesh of bear fox wolf badger ox or horse as well as fish fowl millet vegetables herbs and roots they never ate
4
to have been made of grapevine in former times as also are bead necklaces called <unk> which the women prized highly their traditional cuisine consists of the flesh of bear fox wolf badger ox or horse as well as fish fowl millet vegetables herbs and roots they never ate raw

Define the model

  • Inherit from nn.Module
  • Implement the __init__ function
  • Implement the forward function
  • Define any other helper functions the model needs

In [15]:

import torch
import torch.nn as nn


class RNNModel(nn.Module):
    """ A simple recurrent neural network. """

    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5):
        # rnn_type: which recurrent layer to use, 'LSTM' or 'GRU' (plain RNNs go through the else branch)
        # ntoken: VOCAB_SIZE = 50002
        # ninp: EMBEDDING_SIZE = 650, the input (embedding) dimension
        # nhid: the hidden dimension; I set it to 1000 to keep it distinct from ninp
        # nlayers: how many recurrent layers are stacked vertically

        ''' The model contains the following layers:
        - a word embedding layer
        - a recurrent layer (RNN, LSTM, or GRU)
        - a linear layer from the hidden state to the output vocabulary
        - a dropout layer for regularization
        '''
        super(RNNModel, self).__init__()
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        # the input embedding layer, which maps each word to a word vector

        if rnn_type in ['LSTM', 'GRU']:  # the comments below use LSTM as the example
            self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
            # getattr(nn, rnn_type) is equivalent to nn.LSTM / nn.GRU
            # nlayers is the number of stacked layers; there is also a bidirectional
            # argument (default False) for a bidirectional LSTM
        else:
            try:
                nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]
            except KeyError:
                raise ValueError("""An invalid option for `--model` was supplied,
                                 options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
            self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)
        # the final fully connected layer, mapping the hidden state (1000) to the vocabulary (50002)

        self.init_weights()

        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, input, hidden):
        ''' Forward pass:
        - word embedding
        - feed the embeddings through the recurrent network
        - a linear layer maps the hidden states to the output vocabulary
        '''
        # input.shape = seq_length * batch = torch.Size([50, 32])
        # if you prefer the 32 * 50 layout, pass batch_first=True when building the LSTM
        # hidden = (nlayers * 32 * hidden_size, nlayers * 32 * hidden_size)
        # hidden is a tuple holding the initial hidden state h and the initial cell state c;
        # both have the same shape and must be initialized beforehand, and hidden_size is the
        # same thing as nhid above (= 1000).
        emb = self.drop(self.encoder(input))
        # emb.shape = torch.Size([50, 32, 650]): each word index is looked up in the
        # (50002, 650) embedding table
        output, hidden = self.rnn(emb, hidden)
        # output.shape = 50 * 32 * hidden_size: the hidden state at every time step
        # hidden is a tuple holding the final hidden state h and the final cell state c,
        # each of shape nlayers * 32 * hidden_size

        output = self.drop(output)
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        # the output has to be flattened to 2-D for the fully connected layer, so the first two
        # dimensions are merged into (50*32, hidden_size)
        # decoded.shape = (50*32, hidden_size) x (hidden_size, 50002) = torch.Size([1600, 50002])

        return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden
        # we need to know which word is predicted at every position, so the output is reshaped
        # back to (50, 32, 50002)
        # hidden = (h: 2 * 32 * 1000, c: 2 * 32 * 1000)

    def init_hidden(self, bsz, requires_grad=True):
        # initialize the hidden state
        weight = next(self.parameters())
        # weight = torch.Size([50002, 650]) is simply the first of the model's parameters;
        # self.parameters() is a generator, and for this LSTM the parameter shapes are:
        # print(list(iter(self.parameters())))
        # torch.Size([50002, 650])
        # torch.Size([4000, 650])
        # torch.Size([4000, 1000])
        # torch.Size([4000])  <- bias
        # torch.Size([4000])
        # torch.Size([4000, 1000])
        # torch.Size([4000, 1000])
        # torch.Size([4000])
        # torch.Size([4000])
        # torch.Size([50002, 1000])
        # torch.Size([50002])
        if self.rnn_type == 'LSTM':
            return (weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad),
                    weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad))
            # return = (2 * 32 * 1000, 2 * 32 * 1000)
            # weight.new_zeros is used so that the new zero tensors share the dtype and device
            # of the model parameters.
            # Note that hidden is not a model parameter and is never updated by the optimizer,
            # just like the input data x.
        else:
            return weight.new_zeros((self.nlayers, bsz, self.nhid), requires_grad=requires_grad)
            # a GRU has no separate cell state c, so only a single tensor is returned
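
A side note on weight.new_zeros in init_hidden above: it is just a convenient way to create a zero tensor that automatically lives on the same device and has the same dtype as the model's parameters. A tiny self-contained check (my own sketch, not part of the original notebook):

In [ ]:

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)            # stands in for the model's first parameter tensor
weight = next(emb.parameters())
h0 = weight.new_zeros((2, 3, 5))     # shape (nlayers, batch, nhid), all zeros
print(h0.dtype == weight.dtype)      # True: inherits the parameters' dtype
print(h0.device == weight.device)    # True: inherits the parameters' device (cpu or cuda)
print(h0.requires_grad)              # False unless requires_grad=True is passed, as init_hidden does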

Initialize a model

In [16]:

nhid = 1000  # a hidden size I chose myself, to keep it distinct from EMBEDDING_SIZE = 650
model = RNNModel("LSTM", VOCAB_SIZE, EMBEDDING_SIZE, nhid, 2, dropout=0.5)
if USE_CUDA:
    model = model.cuda()

In [17]:

model

Out[17]:

RNNModel(
  (drop): Dropout(p=0.5)
  (encoder): Embedding(50002, 650)
  (rnn): LSTM(650, 1000, num_layers=2, dropout=0.5)
  (decoder): Linear(in_features=1000, out_features=50002, bias=True)
)

In [23]:

list(model.parameters())[0].shape

Out[23]:

torch.Size([50002, 650])
  • First we define the code for evaluating the model.
  • Evaluation follows essentially the same logic as training; the only difference is that we only need the forward pass, not the backward pass.

In [68]:

# Read the training loop further below first, then come back to this evaluate function
def evaluate(model, data):
    model.eval()  # evaluation mode
    total_loss = 0.
    it = iter(data)
    total_count = 0.
    with torch.no_grad():
        hidden = model.init_hidden(BATCH_SIZE, requires_grad=False)
        # In both training and evaluation the hidden state is initialized to zeros;
        # hidden is not a model parameter. At this point model.parameters() already
        # holds the trained weights.
        for i, batch in enumerate(it):
            data, target = batch.text, batch.target
            # take the input and the target of the validation batch, i.e. features and labels
            if USE_CUDA:
                data, target = data.cuda(), target.cuda()
            hidden = repackage_hidden(hidden)  # detach the hidden state from the old graph
            with torch.no_grad():  # no gradient updates are needed during evaluation
                output, hidden = model(data, hidden)
                # one forward pass through the model (its forward method), returning the predictions
            loss = loss_fn(output.view(-1, VOCAB_SIZE), target.view(-1))
            # cross-entropy loss

            total_count += np.multiply(*data.size())
            # the cross-entropy above is averaged over tokens, so we also accumulate totals:
            # total_count counts the words in the validation set, 50 words per sample and
            # 32 samples per batch, so np.multiply(*data.size()) = 50 * 32 = 1600
            total_loss += loss.item() * np.multiply(*data.size())
            # average batch loss times the number of words in the batch = total loss of the batch

    loss = total_loss / total_count  # total validation loss divided by the total number of words
    model.train()  # back to training mode
    return loss

In [9]:

import torch
import numpy as np
a = torch.ones((5,3))
print(a.size())
np.multiply(*a.size())
torch.Size([5, 3])

Out[9]:

15

We need to define the following helper function, which separates a hidden state from the computation-graph history that produced it.

In [69]:

def repackage_hidden(h):
    """Wraps hidden states in new Tensors, to detach them from their history."""
    if isinstance(h, torch.Tensor):
        # the GRU case: a single hidden tensor
        # (checks whether h is a torch.Tensor)
        return h.detach()  # detach() cuts the graph: h keeps its values but becomes the start of a new graph
    else:
        # the LSTM case: two hidden tensors (h, c) packed in a tuple
        return tuple(repackage_hidden(v) for v in h)

Define the loss function and optimizer

In [70]:

loss_fn = nn.CrossEntropyLoss()  # cross-entropy loss
learning_rate = 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.5)
# every call to scheduler.step() multiplies the learning rate by 0.5, i.e. halves it
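
A quick sketch (my own, using a throwaway parameter, not part of the original notebook) of what ExponentialLR with gamma=0.5 does: each call to scheduler.step() multiplies the optimizer's learning rate by 0.5.

In [ ]:

import torch

param = torch.nn.Parameter(torch.zeros(1))   # a throwaway parameter just for the demo
opt = torch.optim.Adam([param], lr=0.001)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, 0.5)

for _ in range(3):
    opt.step()                               # an (empty) optimizer step, to keep the usual call order
    sched.step()                             # learning rate *= 0.5
    print(opt.param_groups[0]["lr"])         # 0.0005, 0.00025, 0.000125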

Train the model:

  • A model is usually trained for several epochs
  • In every epoch, all the data is split into batches
  • The input and target of every batch are wrapped as cuda tensors
  • Forward pass: predict the next word at every position of the input sentences
  • Compute the cross-entropy loss between the predictions and the true next words
  • Zero the model's current gradients
  • Backward pass
  • Gradient clipping, to prevent exploding gradients
  • Update the model parameters
  • Every so many iterations, print the training loss of the current iteration and evaluate the model on the validation set

In [13]:

import copy
GRAD_CLIP = 1.
NUM_EPOCHS = 2

val_losses = []
for epoch in range(NUM_EPOCHS):
    model.train()  # training mode
    it = iter(train_iter)
    # iter() builds an iterator; train_iter is itself iterable, so iter() is optional here
    hidden = model.init_hidden(BATCH_SIZE)
    # the zero-initialized hidden state, with the right shape
    for i, batch in enumerate(it):
        data, target = batch.text, batch.target
        # take the input and the target of the training batch, i.e. features and labels
        if USE_CUDA:
            data, target = data.cuda(), target.cuda()
        hidden = repackage_hidden(hidden)
        # In a language model, the hidden state produced by one batch is fed into the next batch.
        # Since there are many batches, keeping the whole history would make the computation graph
        # huge and the backward pass would run out of memory. So after every batch we cut the graph
        # and keep only the values of the hidden state. Only language models do this; translation
        # models, for example, do not. repackage_hidden is the helper that cuts the graph.
        model.zero_grad()  # zero the gradients, otherwise they accumulate across iterations
        output, hidden = model(data, hidden)
        # output = (50, 32, 50002)
        loss = loss_fn(output.view(-1, VOCAB_SIZE), target.view(-1))
        # output.view(-1, VOCAB_SIZE) = (1600, 50002)
        # target.view(-1) = (1600); for PyTorch's cross-entropy formula see:
        # https://blog.csdn.net/geter_CS/article/details/84857220
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
        # prevents exploding gradients: whenever the gradient norm exceeds GRAD_CLIP,
        # the gradients are rescaled down to that threshold
        optimizer.step()
        if i % 1000 == 0:
            print("epoch", epoch, "iter", i, "loss", loss.item())

        if i % 10000 == 0:
            val_loss = evaluate(model, val_iter)

            if len(val_losses) == 0 or val_loss < min(val_losses):
                # if the loss is lower than every previous one, save the model
                print("best model, val loss: ", val_loss)
                torch.save(model.state_dict(), "lm-best.th")
            else:
                # otherwise the loss did not improve, so adjust the learning rate
                scheduler.step()  # halve the learning rate
                optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
                # after adjusting the learning rate, rebuild the optimizer for the following iterations
            val_losses.append(val_loss)  # record the validation loss every 10000 iterations
epoch 0 iter 0 loss 10.821578979492188
best model, val loss: 10.782116411285918
epoch 0 iter 1000 loss 6.5122528076171875
epoch 0 iter 2000 loss 6.3599748611450195
epoch 0 iter 3000 loss 6.13856315612793
epoch 0 iter 4000 loss 5.473214626312256
epoch 0 iter 5000 loss 5.901871204376221
epoch 0 iter 6000 loss 5.85321569442749
epoch 0 iter 7000 loss 5.636535167694092
epoch 0 iter 8000 loss 5.7489800453186035
epoch 0 iter 9000 loss 5.464158058166504
epoch 0 iter 10000 loss 5.554863452911377
best model, val loss: 5.264891533569864
epoch 0 iter 11000 loss 5.703625202178955
epoch 0 iter 12000 loss 5.6448974609375
epoch 0 iter 13000 loss 5.372857570648193
epoch 0 iter 14000 loss 5.2639479637146
epoch 1 iter 0 loss 5.696778297424316
best model, val loss: 5.124550380139679
epoch 1 iter 1000 loss 5.534722805023193
epoch 1 iter 2000 loss 5.599489212036133
epoch 1 iter 3000 loss 5.459986686706543
epoch 1 iter 4000 loss 4.927192211151123
epoch 1 iter 5000 loss 5.435710906982422
epoch 1 iter 6000 loss 5.4059576988220215
epoch 1 iter 7000 loss 5.308575630187988
epoch 1 iter 8000 loss 5.405811786651611
epoch 1 iter 9000 loss 5.1389055252075195
epoch 1 iter 10000 loss 5.226413726806641
best model, val loss: 4.946829228873176
epoch 1 iter 11000 loss 5.379891395568848
epoch 1 iter 12000 loss 5.360724925994873
epoch 1 iter 13000 loss 5.176026344299316
epoch 1 iter 14000 loss 5.110936641693115

In [ ]:

# load the saved model parameters
best_model = RNNModel("LSTM", VOCAB_SIZE, EMBEDDING_SIZE, nhid, 2, dropout=0.5)
if USE_CUDA:
    best_model = best_model.cuda()
best_model.load_state_dict(torch.load("lm-best.th"))
# load the parameters into best_model
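
For reference, a hedged sketch of my own (file names are only examples) of the two usual ways to save and load a model in PyTorch; this notebook uses the first one:

In [ ]:

# 1) save / load only the parameters (the state_dict), as this notebook does;
#    map_location lets parameters saved on a GPU be loaded on a CPU-only machine
torch.save(model.state_dict(), "lm-best.th")
best_model.load_state_dict(torch.load("lm-best.th", map_location="cpu"))

# 2) save / load the entire module object (relies on pickle and on the class definition being importable)
torch.save(model, "lm-full.pt")          # "lm-full.pt" is only an example file name
restored = torch.load("lm-full.pt")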

Compute the perplexity of the best model on the validation data

In [15]:

val_loss = evaluate(best_model, val_iter)
print("perplexity: ", np.exp(val_loss))
# perplexity, the standard evaluation metric for language models, is the exponential of the
# average per-word cross-entropy loss, hence np.exp(val_loss)
perplexity:  140.72803934425724
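
Since perplexity is just the exponential of the average per-word cross-entropy, we can sanity-check the number above against the best validation loss printed during training:

In [ ]:

import numpy as np

avg_loss = 4.946829228873176   # best validation cross-entropy from the training log above
print(np.exp(avg_loss))        # ~140.7, matching the perplexity printed here: the model is roughly
                               # as uncertain as choosing uniformly among ~141 candidate words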

Compute the perplexity of the best model on the test data

In [16]:

test_loss = evaluate(best_model, test_iter)
print("perplexity: ", np.exp(test_loss))
perplexity:  178.54742013696125

Generate some sentences with the trained model.

In [18]:

hidden = best_model.init_hidden(1)  # batch_size = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
input = torch.randint(VOCAB_SIZE, (1, 1), dtype=torch.long).to(device)
# (1, 1) makes a 2-D tensor with one row and one column; the random value is below VOCAB_SIZE = 50002,
# so input is a single randomly chosen word index
words = []
for i in range(100):
    output, hidden = best_model(input, hidden)
    # output.shape = 1 * 1 * 50002
    # hidden = (2 * 1 * 1000, 2 * 1 * 1000)
    word_weights = output.squeeze().exp().cpu()
    # .exp() does two things: it amplifies the larger scores, and it turns negative scores into
    # positive weights, which torch.multinomial below requires
    word_idx = torch.multinomial(word_weights, 1)[0]
    # sample an index with probability proportional to word_weights, so likelier words are picked more often;
    # see this post on torch.multinomial: https://blog.csdn.net/monchin/article/details/79787621
    # always picking the most probable word would generate the same repetitive sentences every time
    input.fill_(word_idx)  # the predicted index word_idx becomes the input of the next iteration
    word = TEXT.vocab.itos[word_idx]  # look up the word for word_idx
    words.append(word)
print(" ".join(words))
s influence clinton decision de gaulle is himself sappho s iv one family banquet was made published by paul <unk> and by a persuaded to prevent arcane of animate poverty based at copernicus bachelor in search services and in a cruise corps references eds the robin series july four one nine zero eight summer gutenberg one nine six four births one nine two eight deaths timeline of this method by the fourth amendment the german ioc known for his <unk> from <unk> one eight nine eight one seven eight nine management was established in one nine seven zero they had
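
As the comment in the loop notes, greedy decoding (always taking the most probable word) tends to repeat itself, which is why multinomial sampling is used. A small variation worth trying (my own addition, not in the original notebook) is a temperature on the sampling step inside the loop above:

In [ ]:

# drop-in replacement for the two sampling lines inside the generation loop;
# `output` comes from the loop above, `temperature` is my own addition
temperature = 0.8                                             # <1 sharpens the distribution, >1 flattens it
word_weights = output.squeeze().div(temperature).exp().cpu()
word_idx = torch.multinomial(word_weights, 1)[0]
# greedy decoding for comparison (this is what tends to loop on the same phrases):
# word_idx = torch.argmax(output.squeeze()).item()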

In [42]:

torch.randint(50002, (1, 1))

Out[42]:

tensor([[11293]])
