seq2seq

Seq2Seq, Attention

In this notebook we will reproduce (as closely as possible) Luong's attention model.

Because our dataset is very small (a bit over ten thousand training sentences), the trained model will not be very good. If you want to train a better model, see the material below.

Further reading

Slides

Paper

PyTorch code

More on Machine Translation

  • Beam Search
  • Pointer network (text summarization)
  • Copy mechanism (text summarization)
  • Coverage loss
  • ConvSeq2Seq
  • Transformer
  • Tensor2Tensor

TODO

  • Try segmenting the Chinese side into words instead of single characters (see the jieba sketch after the data-loading cell below)

NER

In [137]:

import os
import sys
import math
from collections import Counter  # token counter
import numpy as np
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

import nltk

Reading the Chinese and English data

  • For English, we tokenize with nltk's word tokenizer and lowercase everything
  • For Chinese, we simply use individual characters as the basic units

In [138]:

def load_data(in_file):
    cn = []
    en = []
    num_examples = 0
    with open(in_file, 'r') as f:
        for line in f:
            # print(line)  # Anyone can do that. 任何人都可以做到。
            line = line.strip().split("\t")  # each line is an English/Chinese pair separated by a tab
            # print(line)  # ['Anyone can do that.', '任何人都可以做到。']
            en.append(["BOS"] + nltk.word_tokenize(line[0].lower()) + ["EOS"])
            # BOS: beginning of sequence, EOS: end of sequence
            # split chinese sentence into characters
            cn.append(["BOS"] + [c for c in line[1]] + ["EOS"])
            # Chinese is split character by character; you could also try a word segmenter
    return en, cn

train_file = "nmt/en-cn/train.txt"
dev_file = "nmt/en-cn/dev.txt"
train_en, train_cn = load_data(train_file)
dev_en, dev_cn = load_data(dev_file)
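The comment above mentions trying a word segmenter for the Chinese side (also listed in the TODO). Below is a minimal sketch of what a word-level loader could look like; it assumes the third-party jieba package is installed and is not used anywhere else in this notebook.

import nltk
import jieba  # third-party Chinese segmenter, assumed installed via `pip install jieba`

def load_data_word_level(in_file):
    # same as load_data, but segments the Chinese side into words with jieba
    cn, en = [], []
    with open(in_file, 'r') as f:
        for line in f:
            eng, chi = line.strip().split("\t")
            en.append(["BOS"] + nltk.word_tokenize(eng.lower()) + ["EOS"])
            cn.append(["BOS"] + jieba.lcut(chi) + ["EOS"])
    return en, cn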

In [0]:

print(train_en[:10])
[['BOS', 'anyone', 'can', 'do', 'that', '.', 'EOS'], ['BOS', 'how', 'about', 'another', 'piece', 'of', 'cake', '?', 'EOS'], ['BOS', 'she', 'married', 'him', '.', 'EOS'], ['BOS', 'i', 'do', "n't", 'like', 'learning', 'irregular', 'verbs', '.', 'EOS'], ['BOS', 'it', "'s", 'a', 'whole', 'new', 'ball', 'game', 'for', 'me', '.', 'EOS'], ['BOS', 'he', "'s", 'sleeping', 'like', 'a', 'baby', '.', 'EOS'], ['BOS', 'he', 'can', 'play', 'both', 'tennis', 'and', 'baseball', '.', 'EOS'], ['BOS', 'we', 'should', 'cancel', 'the', 'hike', '.', 'EOS'], ['BOS', 'he', 'is', 'good', 'at', 'dealing', 'with', 'children', '.', 'EOS'], ['BOS', 'she', 'will', 'do', 'her', 'best', 'to', 'be', 'here', 'on', 'time', '.', 'EOS']]

In [0]:

print(train_cn[:10])
[['BOS', '任', '何', '人', '都', '可', '以', '做', '到', '。', 'EOS'], ['BOS', '要', '不', '要', '再', '來', '一', '塊', '蛋', '糕', '?', 'EOS'], ['BOS', '她', '嫁', '给', '了', '他', '。', 'EOS'], ['BOS', '我', '不', '喜', '欢', '学', '习', '不', '规', '则', '动', '词', '。', 'EOS'], ['BOS', '這', '對', '我', '來', '說', '是', '個', '全', '新', '的', '球', '類', '遊', '戲', '。', 'EOS'], ['BOS', '他', '正', '睡', '着', ',', '像', '个', '婴', '儿', '一', '样', '。', 'EOS'], ['BOS', '他', '既', '会', '打', '网', '球', ',', '又', '会', '打', '棒', '球', '。', 'EOS'], ['BOS', '我', '們', '應', '該', '取', '消', '這', '次', '遠', '足', '。', 'EOS'], ['BOS', '他', '擅', '長', '應', '付', '小', '孩', '子', '。', 'EOS'], ['BOS', '她', '会', '尽', '量', '按', '时', '赶', '来', '的', '。', 'EOS']]

Building the vocabulary

In [139]:

UNK_IDX = 0
PAD_IDX = 1
def build_dict(sentences, max_words=50000):
    word_count = Counter()
    for sentence in sentences:
        for s in sentence:
            word_count[s] += 1  # word_count acts like a dict of token -> frequency
    ls = word_count.most_common(max_words)
    # keep at most the 50000 most frequent tokens; the number is arbitrary and the
    # actual number of distinct tokens here is far below 50000
    print(len(ls))  # train_en: 5491
    total_words = len(ls) + 2
    # the extra 2 slots are reserved for "UNK" and "PAD"
    # ls = [('BOS', 14533), ('EOS', 14533), ('.', 12521), ('i', 4045), .......
    word_dict = {w[0]: index+2 for index, w in enumerate(ls)}
    # indices start at 2 because 0 and 1 are reserved for "UNK" and "PAD"
    word_dict["UNK"] = UNK_IDX
    word_dict["PAD"] = PAD_IDX
    return word_dict, total_words

en_dict, en_total_words = build_dict(train_en)
cn_dict, cn_total_words = build_dict(train_cn)
inv_en_dict = {v: k for k, v in en_dict.items()}
# invert the dictionary: map indices back to tokens
inv_cn_dict = {v: k for k, v in cn_dict.items()}
5491
3193

In [1]:

# print(en_dict)
# print(en_total_words)

In [3]:

print(cn_dict)
print(cn_total_words)

In [4]:

print(inv_en_dict)

In [5]:

print(inv_cn_dict)

Converting all tokens to indices

In [140]:

def encode(en_sentences, cn_sentences, en_dict, cn_dict, sort_by_len=True):
    '''
    Encode the sequences.
    '''
    length = len(en_sentences)
    # en_sentences = [['BOS', 'anyone', 'can', 'do', 'that', '.', 'EOS'], ....

    out_en_sentences = [[en_dict.get(w, 0) for w in sent] for sent in en_sentences]
    # out_en_sentences = [[2, 328, 43, 14, 28, 4, 3], ....
    # .get(w, 0) returns the index of w, or 0 (UNK) if it is missing. Because the corpus
    # is small, every training token happens to get a non-zero index here.

    out_cn_sentences = [[cn_dict.get(w, 0) for w in sent] for sent in cn_sentences]

    # sort sentences by English lengths
    def len_argsort(seq):
        return sorted(range(len(seq)), key=lambda x: len(seq[x]))
        # sorted() with a custom key: sort the indices 0..len(seq)-1 by the length of seq[x]

    # sort the Chinese and English sentences in the same order
    if sort_by_len:
        sorted_index = len_argsort(out_en_sentences)
        # print(sorted_index)
        # sorted_index = [63, 1544, 1917, 2650, 3998, 6240, 6294, 6703, ....
        # the first indices point to the shortest sentences

        out_en_sentences = [out_en_sentences[i] for i in sorted_index]
        # print(out_en_sentences)
        # out_en_sentences = [[2, 475, 4, 3], [2, 1318, 126, 3], [2, 1707, 126, 3], ......

        out_cn_sentences = [out_cn_sentences[i] for i in sorted_index]

    return out_en_sentences, out_cn_sentences

train_en, train_cn = encode(train_en, train_cn, en_dict, cn_dict)
dev_en, dev_cn = encode(dev_en, dev_cn, en_dict, cn_dict)
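To see what len_argsort does, here is a tiny standalone example with made-up index sequences:

seq = [[2, 7, 9, 4, 3], [2, 5, 3], [2, 8, 8, 3]]
order = sorted(range(len(seq)), key=lambda x: len(seq[x]))
print(order)                    # [1, 2, 0]: the index of the shortest sentence comes first
print([seq[i] for i in order])  # [[2, 5, 3], [2, 8, 8, 3], [2, 7, 9, 4, 3]]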

In [6]:

k=10000
print(" ".join([inv_cn_dict[i] for i in train_cn[k]])) #通过inv字典获取单词
print(" ".join([inv_en_dict[i] for i in train_en[k]]))
BOS 他 来 这 里 的 目 的 是 什 么 ? EOS
BOS for what purpose did he come here ? EOS

Splitting the sentences into batches

In [0]:

print(np.arange(0, 100, 15))
print(np.arange(0, 15))
[ 0 15 30 45 60 75 90]
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]

In [141]:

def get_minibatches(n, minibatch_size, shuffle=True):
    idx_list = np.arange(0, n, minibatch_size)  # [0, minibatch_size, 2*minibatch_size, ...]
    if shuffle:
        np.random.shuffle(idx_list)  # shuffle the batch start positions
    minibatches = []
    for idx in idx_list:
        minibatches.append(np.arange(idx, min(idx + minibatch_size, n)))
        # collect all batches in one big list
    return minibatches

In [10]:

get_minibatches(100, 15)  # the batch order is randomly shuffled

Out[10]:

[array([75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89]),
array([45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]),
array([30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44]),
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]),
array([15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]),
array([60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74]),
array([90, 91, 92, 93, 94, 95, 96, 97, 98, 99])]

In [142]:

def prepare_data(seqs):
    # seqs = [[2, 12, 167, 23, 114, 5, 27, 1755, 4, 3], ........
    lengths = [len(seq) for seq in seqs]  # length of every sentence in the batch
    n_samples = len(seqs)  # number of sentences in the batch
    max_len = np.max(lengths)  # length of the longest sentence, used as the padding width
    x = np.zeros((n_samples, max_len)).astype('int32')
    # initialize an all-zero matrix and fill it row by row below
    # print(x.shape)  # (64, max sentence length)

    x_lengths = np.array(lengths).astype("int32")
    # print(x_lengths)
    # In the printed batch below, the English sentences all share the same length within
    # a batch while the Chinese ones vary: English is the source, Chinese is the target.

    for idx, seq in enumerate(seqs):
        # take each sentence of the batch together with its row index
        x[idx, :lengths[idx]] = seq
        # copy the sentence into row idx; the remaining positions stay 0 (padding)

    return x, x_lengths  # x_mask

def gen_examples(en_sentences, cn_sentences, batch_size):
    minibatches = get_minibatches(len(en_sentences), batch_size)
    all_ex = []
    for minibatch in minibatches:
        mb_en_sentences = [en_sentences[t] for t in minibatch]
        # pick the sentences of this (shuffled) batch; only the batch order is shuffled,
        # the sentences inside a batch keep their sorted order
        # print(mb_en_sentences)

        mb_cn_sentences = [cn_sentences[t] for t in minibatch]
        mb_x, mb_x_len = prepare_data(mb_en_sentences)
        # mb_x has shape (64, max sentence length); mb_x_len holds each sentence's length
        mb_y, mb_y_len = prepare_data(mb_cn_sentences)

        all_ex.append((mb_x, mb_x_len, mb_y, mb_y_len))
        # collect all batches: each tuple is (English batch, English lengths,
        # Chinese batch, Chinese lengths), and the list of tuples is the whole dataset

    return all_ex

batch_size = 64
train_data = gen_examples(train_en, train_cn, batch_size)
random.shuffle(train_data)
dev_data = gen_examples(dev_en, dev_cn, batch_size)

In [28]:

train_data[0]

Out[28]:

(array([[   2,   12,  707,   23,    7,  295,    4,    3],
[ 2, 12, 120, 1207, 517, 604, 4, 3],
[ 2, 8, 90, 433, 64, 1470, 126, 3],
[ 2, 12, 144, 46, 9, 94, 4, 3],
[ 2, 25, 10, 9, 535, 639, 4, 3],
[ 2, 25, 10, 64, 377, 2512, 4, 3],
[ 2, 12, 43, 309, 9, 96, 4, 3],
[ 2, 43, 328, 1475, 25, 469, 11, 3],
[ 2, 82, 1043, 34, 1991, 2514, 4, 3],
[ 2, 5, 54, 7, 181, 1694, 4, 3],
[ 2, 30, 51, 472, 6, 294, 11, 3],
[ 2, 5, 241, 16, 65, 551, 4, 3],
[ 2, 14, 8, 36, 2516, 680, 11, 3],
[ 2, 8, 30, 9, 66, 333, 4, 3],
[ 2, 12, 10, 34, 40, 777, 4, 3],
[ 2, 29, 54, 9, 138, 1633, 4, 3],
[ 2, 43, 8, 309, 9, 96, 11, 3],
[ 2, 47, 12, 39, 59, 190, 11, 3],
[ 2, 29, 85, 14, 150, 221, 4, 3],
[ 2, 12, 70, 37, 36, 242, 4, 3],
[ 2, 5, 239, 64, 2521, 1696, 4, 3],
[ 2, 5, 14, 13, 36, 314, 4, 3],
[ 2, 5, 234, 7, 45, 44, 4, 3],
[ 2, 5, 76, 226, 17, 621, 4, 3],
[ 2, 29, 180, 9, 269, 266, 4, 3],
[ 2, 85, 5, 22, 6, 708, 11, 3],
[ 2, 6, 788, 48, 37, 889, 4, 3],
[ 2, 8, 63, 124, 45, 95, 4, 3],
[ 2, 921, 10, 21, 640, 350, 4, 3],
[ 2, 52, 10, 6, 296, 44, 11, 3],
[ 2, 681, 10, 190, 24, 146, 11, 3],
[ 2, 19, 1480, 838, 7, 596, 4, 3],
[ 2, 29, 90, 472, 2036, 132, 4, 3],
[ 2, 8, 90, 9, 66, 645, 4, 3],
[ 2, 5, 192, 257, 7, 684, 4, 3],
[ 2, 5, 68, 36, 384, 1686, 4, 3],
[ 2, 12, 10, 120, 38, 23, 4, 3],
[ 2, 18, 47, 965, 106, 112, 4, 3],
[ 2, 8, 30, 37, 9, 250, 4, 3],
[ 2, 31, 20, 129, 20, 900, 11, 3],
[ 2, 29, 519, 118, 2044, 1313, 4, 3],
[ 2, 29, 22, 6, 294, 229, 4, 3],
[ 2, 25, 189, 1056, 335, 151, 4, 3],
[ 2, 8, 67, 89, 57, 887, 4, 3],
[ 2, 41, 8, 72, 59, 362, 11, 3],
[ 2, 51, 923, 2534, 26, 364, 4, 3],
[ 2, 22, 8, 1209, 914, 834, 11, 3],
[ 2, 19, 48, 9, 1127, 847, 4, 3],
[ 2, 25, 224, 70, 13, 425, 4, 3],
[ 2, 19, 949, 62, 1112, 657, 4, 3],
[ 2, 87, 10, 6, 751, 443, 11, 3],
[ 2, 19, 144, 99, 9, 539, 4, 3],
[ 2, 19, 599, 242, 117, 103, 4, 3],
[ 2, 14, 8, 22, 9, 386, 11, 3],
[ 2, 16, 20, 60, 7, 45, 4, 3],
[ 2, 25, 145, 133, 10, 1974, 4, 3],
[ 2, 25, 10, 426, 17, 343, 4, 3],
[ 2, 5, 22, 239, 6, 461, 4, 3],
[ 2, 14, 13, 8, 162, 242, 11, 3],
[ 2, 8, 67, 13, 159, 59, 4, 3],
[ 2, 140, 3452, 1220, 33, 601, 4, 3],
[ 2, 5, 79, 1937, 35, 232, 4, 3],
[ 2, 18, 1612, 35, 779, 926, 4, 3],
[ 2, 12, 197, 599, 6, 632, 4, 3]], dtype=int32),
array([8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
dtype=int32),
array([[ 2, 9, 793, ..., 0, 0, 0],
[ 2, 9, 504, ..., 0, 0, 0],
[ 2, 8, 114, ..., 0, 0, 0],
...,
[ 2, 5, 154, ..., 0, 0, 0],
[ 2, 214, 171, ..., 838, 4, 3],
[ 2, 9, 74, ..., 0, 0, 0]], dtype=int32),
array([10, 12, 9, 10, 8, 10, 7, 13, 17, 8, 11, 10, 11, 9, 9, 12, 8,
12, 10, 9, 14, 9, 9, 6, 9, 10, 9, 10, 13, 11, 14, 13, 14, 8,
8, 10, 10, 9, 8, 7, 14, 12, 13, 13, 13, 12, 13, 8, 11, 11, 10,
12, 10, 9, 6, 10, 8, 11, 9, 11, 10, 12, 21, 9], dtype=int32))

A version without attention

Below is a simpler encoder-decoder model without attention.

In [143]:

class PlainEncoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, dropout=0.2):
        # for English: vocab_size=5493, hidden_size=100, dropout=0.2
        super(PlainEncoder, self).__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        # here hidden_size doubles as embedding_dim, the dimension of a word vector
        # torch.nn.Embedding(num_embeddings, embedding_dim, .....)
        # hidden_size = 100

        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        # first argument: input_size, the number of input features
        # second argument: hidden_size, the number of hidden features

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, lengths):
        # x: all word indices of the batch; lengths: length of each sentence in the batch.
        # The lengths are needed because sentences differ in length and we want the
        # last real hidden state of each one.
        # print(x.shape, lengths); x.shape = torch.Size([64, 10])
        # lengths = tensor([10, 10, 10, ..... 10, 10, 10])

        sorted_len, sorted_idx = lengths.sort(0, descending=True)
        # sort by length, longest first (descending=True);
        # returns the sorted lengths and the indices into the unsorted batch
        # sorted_idx = tensor([41, 40, 46, 45, ...... 19, 18, 63])
        # sorted_len = tensor([10, 10, 10, ..... 10, 10, 10])

        x_sorted = x[sorted_idx.long()]  # the sentences, reordered by length

        embedded = self.dropout(self.embed(x_sorted))
        # print(embedded.shape) = torch.Size([64, 10, 100])
        # tensor([[[-0.6312, -0.9863, -0.3123, ..., -0.7384, 0.9230, -0.4311],....

        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, sorted_len.long().cpu().data.numpy(), batch_first=True)
        # pack_padded_sequence is what lets the RNN handle sentences of different lengths,
        # see https://www.cnblogs.com/sbj123456789/p/9834018.html

        packed_out, hid = self.rnn(packed_embedded)
        # hid.shape = torch.Size([1, 64, 100])

        out, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)
        # out.shape = torch.Size([64, 10, 100])

        _, original_idx = sorted_idx.sort(0, descending=False)
        out = out[original_idx.long()].contiguous()
        hid = hid[:, original_idx.long()].contiguous()
        # restore the original order of the batch
        # out.shape = torch.Size([64, 10, 100])
        # hid.shape = torch.Size([1, 64, 100])

        return out, hid[[-1]]  # with num_layers > 1 this keeps only the last layer's hidden state
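pack_padded_sequence / pad_packed_sequence can be confusing, so here is a tiny standalone sketch (made-up numbers) of the round trip they perform on a padded batch:

import torch
import torch.nn as nn

# a padded batch of 2 sequences with lengths 3 and 2, feature size 1
batch = torch.tensor([[[1.], [2.], [3.]],
                      [[4.], [5.], [0.]]])   # the trailing 0 is padding
lengths = [3, 2]                             # must be sorted in decreasing order

packed = nn.utils.rnn.pack_padded_sequence(batch, lengths, batch_first=True)
# packed.data keeps only the real timesteps, timestep-major: [1., 4., 2., 5., 3.]
unpacked, unpacked_len = nn.utils.rnn.pad_packed_sequence(packed, batch_first=True)
print(unpacked.squeeze(-1))  # tensor([[1., 2., 3.], [4., 5., 0.]])
print(unpacked_len)          # tensor([3, 2])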

In [124]:

class PlainDecoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, dropout=0.2):
        super(PlainDecoder, self).__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, y, y_lengths, hid):
        # print(y.shape) = torch.Size([64, 12])
        # print(hid.shape) = torch.Size([1, 64, 100])
        # y and y_lengths are the Chinese side
        sorted_len, sorted_idx = y_lengths.sort(0, descending=True)
        y_sorted = y[sorted_idx.long()]
        hid = hid[:, sorted_idx.long()]  # the hidden state has to be reordered as well

        y_sorted = self.dropout(self.embed(y_sorted))
        # batch_size, output_length, embed_size

        packed_seq = nn.utils.rnn.pack_padded_sequence(y_sorted, sorted_len.long().cpu().data.numpy(), batch_first=True)
        out, hid = self.rnn(packed_seq, hid)  # pass the encoder's hidden state as the initial state
        # print(hid.shape) = torch.Size([1, 64, 100])
        unpacked, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
        _, original_idx = sorted_idx.sort(0, descending=False)
        output_seq = unpacked[original_idx.long()].contiguous()
        # print(output_seq.shape) = torch.Size([64, 12, 100])
        hid = hid[:, original_idx.long()].contiguous()
        # print(hid.shape) = torch.Size([1, 64, 100])
        output = F.log_softmax(self.out(output_seq), -1)
        # print(output.shape) = torch.Size([64, 12, 3195])

        return output, hid

In [144]:

class PlainSeq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        # encoder is an instance of PlainEncoder above
        # decoder is an instance of PlainDecoder above
        super(PlainSeq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    # chain the two models together
    def forward(self, x, x_lengths, y, y_lengths):
        encoder_out, hid = self.encoder(x, x_lengths)
        # self.encoder(x, x_lengths) calls PlainEncoder.forward
        # and returns its out and hid

        output, hid = self.decoder(y=y, y_lengths=y_lengths, hid=hid)
        # self.decoder(...) calls PlainDecoder.forward

        return output, None

    def translate(self, x, x_lengths, y, max_length=10):
        # x: one source sentence as indices
        # x_lengths: its length
        # y: the index of "BOS" (=2), used as the first decoder input

        encoder_out, hid = self.encoder(x, x_lengths)
        preds = []
        batch_size = x.shape[0]
        attns = []
        for i in range(max_length):
            output, hid = self.decoder(y=y,
                                       y_lengths=torch.ones(batch_size).long().to(y.device),
                                       hid=hid)

            # the first step feeds BOS; afterwards y is updated so that the next input
            # is the previously predicted word (greedy decoding)
            y = output.max(2)[1].view(batch_size, 1)
            preds.append(y)

        return torch.cat(preds, 1), None

In [145]:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dropout = 0.2
hidden_size = 100

# pass in the English and Chinese vocabulary sizes
encoder = PlainEncoder(vocab_size=en_total_words,
                       hidden_size=hidden_size,
                       dropout=dropout)
decoder = PlainDecoder(vocab_size=cn_total_words,
                       hidden_size=hidden_size,
                       dropout=dropout)
model = PlainSeq2Seq(encoder, decoder)

In [146]:

# masked cross entropy loss
class LanguageModelCriterion(nn.Module):
    def __init__(self):
        super(LanguageModelCriterion, self).__init__()

    def forward(self, input, target, mask):
        # target = tensor([[5, 108, 8, 4, 3, 0, 0, 0, 0, 0, 0, 0], ....
        # mask   = tensor([[1, 1,   1, 1, 1, 0, 0, 0, 0, 0, 0, 0], .....
        # print(input.shape, target.shape, mask.shape)
        # torch.Size([64, 12, 3195]) torch.Size([64, 12]) torch.Size([64, 12])

        # input: (batch_size * seq_len) * vocab_size
        input = input.contiguous().view(-1, input.size(2))

        # target: (batch_size * seq_len) * 1 = 768 * 1
        target = target.contiguous().view(-1, 1)
        mask = mask.contiguous().view(-1, 1)
        # print(-input.gather(1, target))
        output = -input.gather(1, target) * mask
        # this is the cross-entropy loss: input already holds F.log_softmax values and
        # gather picks out the log-probability of the target word at each position
        # (see https://blog.csdn.net/edogawachia/article/details/80515038 for .gather)
        # output.shape = torch.Size([768, 1])
        # the mask zeroes out the padded positions, which gather would otherwise count

        output = torch.sum(output) / torch.sum(mask)
        # average loss per (non-padded) word

        return output
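A tiny, made-up example of what gather plus the mask computes (vocabulary of 3 words, one sentence of length 2 where the second position is padding):

import torch

log_probs = torch.log(torch.tensor([[[0.7, 0.2, 0.1],      # position 1
                                     [0.1, 0.8, 0.1]]]))   # position 2 (padding)
target = torch.tensor([[0, 1]])   # gold indices; position 2 is padding
mask = torch.tensor([[1., 0.]])   # padding contributes nothing

flat = log_probs.view(-1, 3)                   # (2, 3)
picked = -flat.gather(1, target.view(-1, 1))   # -log p(gold word) at each position
loss = (picked * mask.view(-1, 1)).sum() / mask.sum()
print(loss)  # tensor(0.3567) == -log(0.7): only the unmasked position counts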

In [147]:

model = model.to(device)
loss_fn = LanguageModelCriterion().to(device)
optimizer = torch.optim.Adam(model.parameters())

In [151]:

def train(model, data, num_epochs=2):
    for epoch in range(num_epochs):
        model.train()
        total_num_words = total_loss = 0.
        for it, (mb_x, mb_x_len, mb_y, mb_y_len) in enumerate(data):
            # (English batch, English lengths, Chinese batch, Chinese lengths)

            mb_x = torch.from_numpy(mb_x).to(device).long()
            mb_x_len = torch.from_numpy(mb_x_len).to(device).long()

            # the first n-1 words are the decoder input and the last n-1 words are the
            # expected output, because each word is used to predict the next one
            mb_input = torch.from_numpy(mb_y[:, :-1]).to(device).long()
            mb_output = torch.from_numpy(mb_y[:, 1:]).to(device).long()

            mb_y_len = torch.from_numpy(mb_y_len-1).to(device).long()
            # both input and output are one token shorter than the original sentence

            mb_y_len[mb_y_len<=0] = 1

            mb_pred, attn = model(mb_x, mb_x_len, mb_input, mb_y_len)
            # the two return values of PlainSeq2Seq.forward

            mb_out_mask = torch.arange(mb_y_len.max().item(), device=device)[None, :] < mb_y_len[:, None]
            # mb_out_mask = tensor([[1, 1, 1, ..., 0, 0, 0], [1, 1, 1, ..., 0, 0, 0], ...
            # mb_out_mask.shape = (64, max target length). Broadcasting compares every
            # position index against every sentence length, so padded positions become 0
            # (see the small sketch after this cell). This mb_out_mask is the mask
            # argument of LanguageModelCriterion.

            mb_out_mask = mb_out_mask.float()

            loss = loss_fn(mb_pred, mb_output, mb_out_mask)

            num_words = torch.sum(mb_y_len).item()
            # number of (non-padded) words in this batch

            total_loss += loss.item() * num_words
            # loss is an average per word, so multiply by the word count to accumulate

            total_num_words += num_words
            # total word count

            # update the model
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 5.)
            # clip the gradient norm to keep gradients from exploding

            optimizer.step()

            if it % 100 == 0:
                print("Epoch", epoch, "iteration", it, "loss", loss.item())


        print("Epoch", epoch, "Training loss", total_loss/total_num_words)
        if epoch % 5 == 0:
            evaluate(model, dev_data)  # evaluate the model
train(model, train_data, num_epochs=2)
Epoch 0 iteration 0 loss 4.277793884277344
Epoch 0 iteration 100 loss 3.5520756244659424
Epoch 0 iteration 200 loss 3.483494997024536
Epoch 0 Training loss 3.6435126089915557
Evaluation loss 3.698509503997669
Epoch 1 iteration 0 loss 4.158623218536377
Epoch 1 iteration 100 loss 3.412541389465332
Epoch 1 iteration 200 loss 3.3976175785064697
Epoch 1 Training loss 3.5087569079050698
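The mb_out_mask line in train() relies on broadcasting; here is a minimal sketch with made-up lengths showing how it marks real tokens with 1 and padding with 0:

import torch

lengths = torch.tensor([3, 1, 2])                # pretend these are the target lengths
positions = torch.arange(lengths.max().item())   # tensor([0, 1, 2])
mask = positions[None, :] < lengths[:, None]     # (1, 3) vs (3, 1) broadcast to (3, 3)
print(mask.int())
# tensor([[1, 1, 1],
#         [1, 0, 0],
#         [1, 1, 0]])  -> position index < sentence length marks the real tokens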

In [135]:

def evaluate(model, data):
    model.eval()
    total_num_words = total_loss = 0.
    with torch.no_grad():  # no model updates, so no gradients needed
        for it, (mb_x, mb_x_len, mb_y, mb_y_len) in enumerate(data):
            mb_x = torch.from_numpy(mb_x).to(device).long()
            mb_x_len = torch.from_numpy(mb_x_len).to(device).long()
            mb_input = torch.from_numpy(mb_y[:, :-1]).to(device).long()
            mb_output = torch.from_numpy(mb_y[:, 1:]).to(device).long()
            mb_y_len = torch.from_numpy(mb_y_len-1).to(device).long()
            mb_y_len[mb_y_len<=0] = 1

            mb_pred, attn = model(mb_x, mb_x_len, mb_input, mb_y_len)

            mb_out_mask = torch.arange(mb_y_len.max().item(), device=device)[None, :] < mb_y_len[:, None]
            mb_out_mask = mb_out_mask.float()

            loss = loss_fn(mb_pred, mb_output, mb_out_mask)

            num_words = torch.sum(mb_y_len).item()
            total_loss += loss.item() * num_words
            total_num_words += num_words
    print("Evaluation loss", total_loss/total_num_words)

In [ ]:

# translate a few sentences to see how the model does
def translate_dev(i):
    # pick a dev sentence
    en_sent = " ".join([inv_en_dict[w] for w in dev_en[i]])
    print(en_sent)
    cn_sent = " ".join([inv_cn_dict[w] for w in dev_cn[i]])
    print("".join(cn_sent))

    mb_x = torch.from_numpy(np.array(dev_en[i]).reshape(1, -1)).long().to(device)
    # add a batch dimension and convert to a tensor

    mb_x_len = torch.from_numpy(np.array([len(dev_en[i])])).long().to(device)
    # the sentence length, as a tensor

    bos = torch.Tensor([[cn_dict["BOS"]]]).long().to(device)
    # bos = tensor([[2]])

    translation, attn = model.translate(mb_x, mb_x_len, bos)
    # bos is passed in as the first decoder input
    # translation = tensor([[ 8, 6, 11, 25, 22, 57, 10, 5, 6, 4]])

    translation = [inv_cn_dict[i] for i in translation.data.cpu().numpy().reshape(-1)]
    trans = []
    for word in translation:
        if word != "EOS":  # turn the indices back into characters, stopping at EOS
            trans.append(word)
        else:
            break
    print("".join(trans))

for i in range(100,120):
    translate_dev(i)
    print()

With all the data processing done, we now build the seq2seq model with attention.

Encoder

  • The Encoder's job is to pass the input tokens through an embedding layer and a GRU, producing hidden states that later serve as the context vectors

The comments below first lay out how this works.

In [0]:

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, enc_hidden_size, dec_hidden_size, dropout=0.2):
        super(Encoder, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)

        self.rnn = nn.GRU(embed_size, enc_hidden_size, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(enc_hidden_size * 2, dec_hidden_size)

    def forward(self, x, lengths):
        sorted_len, sorted_idx = lengths.sort(0, descending=True)
        x_sorted = x[sorted_idx.long()]
        embedded = self.dropout(self.embed(x_sorted))

        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, sorted_len.long().cpu().data.numpy(), batch_first=True)
        packed_out, hid = self.rnn(packed_embedded)
        out, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)
        _, original_idx = sorted_idx.sort(0, descending=False)
        out = out[original_idx.long()].contiguous()
        hid = hid[:, original_idx.long()].contiguous()

        # hid[-2] and hid[-1] are the final forward and backward hidden states;
        # concatenate them and project down to the decoder's hidden size
        hid = torch.cat([hid[-2], hid[-1]], dim=1)
        hid = torch.tanh(self.fc(hid)).unsqueeze(0)

        return out, hid
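A quick standalone shape check (random inputs, sizes matching the 100-dimensional setting used later) of what a bidirectional GRU returns and why hid[-2] and hid[-1] are concatenated:

import torch
import torch.nn as nn

rnn = nn.GRU(input_size=100, hidden_size=100, batch_first=True, bidirectional=True)
x = torch.randn(64, 10, 100)   # 64 sentences, 10 steps, embedding size 100
out, hid = rnn(x)
print(out.shape)   # torch.Size([64, 10, 200]): both directions concatenated per step
print(hid.shape)   # torch.Size([2, 64, 100]): final forward and backward states
print(torch.cat([hid[-2], hid[-1]], dim=1).shape)  # torch.Size([64, 200]) -> fc -> (1, 64, 100)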

Luong Attention

  • Compute the output from the context vectors and the current decoder hidden states

In [0]:

class Attention(nn.Module):
    def __init__(self, enc_hidden_size, dec_hidden_size):
        super(Attention, self).__init__()

        self.enc_hidden_size = enc_hidden_size
        self.dec_hidden_size = dec_hidden_size

        self.linear_in = nn.Linear(enc_hidden_size*2, dec_hidden_size, bias=False)
        self.linear_out = nn.Linear(enc_hidden_size*2 + dec_hidden_size, dec_hidden_size)

    def forward(self, output, context, mask):
        # output: batch_size, output_len, dec_hidden_size
        # context: batch_size, context_len, 2*enc_hidden_size

        batch_size = output.size(0)
        output_len = output.size(1)
        input_len = context.size(1)

        context_in = self.linear_in(context.view(batch_size*input_len, -1)).view(
            batch_size, input_len, -1)  # batch_size, context_len, dec_hidden_size

        # context_in.transpose(1,2): batch_size, dec_hidden_size, context_len
        # output: batch_size, output_len, dec_hidden_size
        attn = torch.bmm(output, context_in.transpose(1,2))
        # batch_size, output_len, context_len

        attn.data.masked_fill_(mask, -1e6)
        # masked_fill_ (in place) pushes the scores at padded positions to a large
        # negative value, so softmax gives them essentially zero weight

        attn = F.softmax(attn, dim=2)
        # batch_size, output_len, context_len

        context = torch.bmm(attn, context)
        # batch_size, output_len, 2*enc_hidden_size

        output = torch.cat((context, output), dim=2)  # batch_size, output_len, 2*enc_hidden_size + dec_hidden_size

        output = output.view(batch_size*output_len, -1)
        output = torch.tanh(self.linear_out(output))
        output = output.view(batch_size, output_len, -1)
        return output, attn
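The code above is Luong's "general" attention: bmm scores every (decoder step, encoder step) pair, padded source positions are masked out, and a softmax over the source positions gives the attention weights. A tiny made-up sketch of the score/mask/softmax step:

import torch
import torch.nn.functional as F

output = torch.randn(1, 2, 4)        # batch 1, 2 decoder steps, dec_hidden_size 4
context_in = torch.randn(1, 3, 4)    # 3 encoder steps, already projected to size 4
mask = torch.tensor([[[False, False, True],    # pretend the 3rd source position is padding
                      [False, False, True]]])

scores = torch.bmm(output, context_in.transpose(1, 2))  # (1, 2, 3)
scores = scores.masked_fill(mask, -1e6)
weights = F.softmax(scores, dim=2)
print(weights.sum(dim=2))  # tensor([[1., 1.]]): weights over source positions sum to 1
print(weights[0, :, 2])    # ~0 for the masked (padded) position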

Decoder

  • The decoder decides the next output word based on what has been translated so far and the context vectors

In [0]:

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, enc_hidden_size, dec_hidden_size, dropout=0.2):
        super(Decoder, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.attention = Attention(enc_hidden_size, dec_hidden_size)
        self.rnn = nn.GRU(embed_size, dec_hidden_size, batch_first=True)
        self.out = nn.Linear(dec_hidden_size, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def create_mask(self, x_len, y_len):
        # a mask of shape (batch_size, max_x_len, max_y_len); in the call below,
        # x is the target side and y is the source (context) side
        device = x_len.device
        max_x_len = x_len.max()
        max_y_len = y_len.max()
        x_mask = torch.arange(max_x_len, device=device)[None, :] < x_len[:, None]
        y_mask = torch.arange(max_y_len, device=device)[None, :] < y_len[:, None]
        mask = ~(x_mask[:, :, None] & y_mask[:, None, :])
        # True marks the padded positions that attention should ignore
        return mask

    def forward(self, ctx, ctx_lengths, y, y_lengths, hid):
        sorted_len, sorted_idx = y_lengths.sort(0, descending=True)
        y_sorted = y[sorted_idx.long()]
        hid = hid[:, sorted_idx.long()]

        y_sorted = self.dropout(self.embed(y_sorted))  # batch_size, output_length, embed_size

        packed_seq = nn.utils.rnn.pack_padded_sequence(y_sorted, sorted_len.long().cpu().data.numpy(), batch_first=True)
        out, hid = self.rnn(packed_seq, hid)
        unpacked, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
        _, original_idx = sorted_idx.sort(0, descending=False)
        output_seq = unpacked[original_idx.long()].contiguous()
        hid = hid[:, original_idx.long()].contiguous()

        mask = self.create_mask(y_lengths, ctx_lengths)

        output, attn = self.attention(output_seq, ctx, mask)
        output = F.log_softmax(self.out(output), -1)

        return output, hid, attn

Seq2Seq

  • Finally we build the Seq2Seq model, chaining the encoder, attention and decoder together

In [0]:

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, x, x_lengths, y, y_lengths):
        encoder_out, hid = self.encoder(x, x_lengths)
        output, hid, attn = self.decoder(ctx=encoder_out,
                                         ctx_lengths=x_lengths,
                                         y=y,
                                         y_lengths=y_lengths,
                                         hid=hid)
        return output, attn

    def translate(self, x, x_lengths, y, max_length=100):
        encoder_out, hid = self.encoder(x, x_lengths)
        preds = []
        batch_size = x.shape[0]
        attns = []
        for i in range(max_length):
            output, hid, attn = self.decoder(ctx=encoder_out,
                                             ctx_lengths=x_lengths,
                                             y=y,
                                             y_lengths=torch.ones(batch_size).long().to(y.device),
                                             hid=hid)
            y = output.max(2)[1].view(batch_size, 1)
            preds.append(y)
            attns.append(attn)
        return torch.cat(preds, 1), torch.cat(attns, 1)

Training

In [0]:

dropout = 0.2
embed_size = hidden_size = 100
encoder = Encoder(vocab_size=en_total_words,
                  embed_size=embed_size,
                  enc_hidden_size=hidden_size,
                  dec_hidden_size=hidden_size,
                  dropout=dropout)
decoder = Decoder(vocab_size=cn_total_words,
                  embed_size=embed_size,
                  enc_hidden_size=hidden_size,
                  dec_hidden_size=hidden_size,
                  dropout=dropout)
model = Seq2Seq(encoder, decoder)
model = model.to(device)
loss_fn = LanguageModelCriterion().to(device)
optimizer = torch.optim.Adam(model.parameters())

In [2]:

train(model, train_data, num_epochs=30)

In [0]:

for i in range(100,120):
    translate_dev(i)
    print()
BOS you have nice skin . EOS
BOS 你 的 皮 膚 真 好 。 EOS
你好害怕。

BOS you 're UNK correct . EOS
BOS 你 部 分 正 确 。 EOS
你是全子的声音。

BOS everyone admired his courage . EOS
BOS 每 個 人 都 佩 服 他 的 勇 氣 。 EOS
他的袋子是他的勇氣。

BOS what time is it ? EOS
BOS 几 点 了 ? EOS
多少时间是什么?

BOS i 'm free tonight . EOS
BOS 我 今 晚 有 空 。 EOS
我今晚有空。

BOS here is your book . EOS
BOS 這 是 你 的 書 。 EOS
这儿是你的书。

BOS they are at lunch . EOS
BOS 他 们 在 吃 午 饭 。 EOS
他们在午餐。

BOS this chair is UNK . EOS
BOS 這 把 椅 子 很 UNK 。 EOS
這些花一下是正在的。

BOS it 's pretty heavy . EOS
BOS 它 真 重 。 EOS
它很美的脚。

BOS many attended his funeral . EOS
BOS 很 多 人 都 参 加 了 他 的 葬 礼 。 EOS
多多衛年轻地了他。

BOS training will be provided . EOS
BOS 会 有 训 练 。 EOS
别将被付錢。

BOS someone is watching you . EOS
BOS 有 人 在 看 著 你 。 EOS
有人看你。

BOS i slapped his face . EOS
BOS 我 摑 了 他 的 臉 。 EOS
我把他的臉抱歉。

BOS i like UNK music . EOS
BOS 我 喜 歡 流 行 音 樂 。 EOS
我喜歡音樂。

BOS tom had no children . EOS
BOS T o m 沒 有 孩 子 。 EOS
汤姆没有照顧孩子。

BOS please lock the door . EOS
BOS 請 把 門 鎖 上 。 EOS
请把門開門。

BOS tom has calmed down . EOS
BOS 汤 姆 冷 静 下 来 了 。 EOS
汤姆在做了。

BOS please speak more loudly . EOS
BOS 請 說 大 聲 一 點 兒 。 EOS
請說更多。

BOS keep next sunday free . EOS
BOS 把 下 周 日 空 出 来 。 EOS
繼續下週一下一步。

BOS i made a mistake . EOS
BOS 我 犯 了 一 個 錯 。 EOS
我做了一件事。