Sentiment analysis: annotated code

Sentiment Analysis

Step 1: Load the IMDb movie review dataset, which only comes with a training set and a test set

  • An important concept in TorchText is the Field. A Field determines how your data will be processed. In our sentiment classification task, the data we work with consists of the raw text strings and one of two sentiments, "pos" or "neg".

  • The arguments of a Field specify how the data should be processed.

  • We use a TEXT field to define how the movie reviews are processed, and a LABEL field to handle the two sentiment classes.

  • Our TEXT field is created with tokenize='spacy', which means the spaCy tokenizer is used to tokenize the English sentences (see the short illustration after the links below). If we do not pass the tokenize argument, the default is to split the string on whitespace.

  • Install spaCy:

    pip install -U spacy
    python -m spacy download en
  • LABEL is defined with a LabelField, a special kind of Field used for handling labels. We will explain the dtype argument later.

  • For more about Fields, see https://github.com/pytorch/text/blob/master/torchtext/data/field.py

  • As before, we set the random seeds so that the experiments are reproducible.

  • TorchText supports many common NLP datasets.

  • The code below automatically downloads the IMDb dataset and splits it into train/test torchtext.datasets objects. The data is processed by the Fields defined above. The IMDb dataset contains 50,000 movie reviews, each labeled as positive or negative.

First, get familiar with the spaCy library: spaCy introduction and tutorial.
Then get familiar with the torchtext library: torchtext introduction and tutorial. Beginners should read that first; otherwise the code below will be hard to follow.
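As a quick illustration (not part of the original notebook) of why the tokenizer matters, here is how whitespace splitting and the spaCy tokenizer differ on the same sentence. This assumes the 'en' model downloaded above is installed.

import spacy

nlp = spacy.load('en')   # the English model downloaded above
sentence = "This movie isn't great, is it?"
print(sentence.split())
# ['This', 'movie', "isn't", 'great,', 'is', 'it?']            <- punctuation stays attached
print([tok.text for tok in nlp.tokenizer(sentence)])
# ['This', 'movie', 'is', "n't", 'great', ',', 'is', 'it', '?'] <- punctuation and clitics separated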

In [ ]:

!ls

In [4]:

import torch
from torchtext import data

SEED = 1234

torch.manual_seed(SEED)      # set the random seed for the CPU
torch.cuda.manual_seed(SEED) # set the random seed for the GPU
torch.backends.cudnn.deterministic = True  # make cuDNN use deterministic algorithms so runs are reproducible

# First, we create two Field objects; they describe how we intend to preprocess the text data.
TEXT = data.Field(tokenize='spacy')
# torchtext.data.Field defines how a field (text field, label field) is processed.
# tokenize='spacy' uses the spaCy English tokenizer (similar to NLTK); if the tokenize argument
# is not passed, the default is simply to split the string on whitespace.
LABEL = data.LabelField(dtype=torch.float)
# LabelField is a subclass of Field specialized for handling labels.

In [2]:

from torchtext import datasets
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
# Download and load the IMDb movie review dataset
downloading aclImdb_v1.tar.gz
aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:03<00:00, 22.8MB/s]

In [3]:

print(vars(train_data.examples[0]))  # take a look at what one example in the dataset looks like
{'text': ['This', 'movie', 'is', 'visually', 'stunning', '.', 'Who', 'cares', 'if', 'she', 'can', 'act', 'or', 'not', '.', 'Each', 'scene', 'is', 'a', 'work', 'of', 'art', 'composed', 'and', 'captured', 'by', 'John', 'Derek', '.', 'The', 'locations', ',', 'set', 'designs', ',', 'and', 'costumes', 'function', 'perfectly', 'to', 'convey', 'what', 'is', 'found', 'in', 'a', 'love', 'story', 'comprised', 'of', 'beauty', ',', 'youth', 'and', 'wealth', '.', 'In', 'some', 'ways', 'I', 'would', 'like', 'to', 'see', 'this', 'movie', 'as', 'a', 'tribute', 'to', 'John', 'and', 'Bo', 'Derek', "'s", 'story', '.', 'And', '...', 'this', 'commentary', 'would', 'not', 'be', 'complete', 'without', 'mentioning', 'Anthony', 'Quinn', "'s", 'role', 'as', 'father', ',', 'mentor', ',', 'lover', ',', 'and', 'his', 'portrayal', 'of', 'a', 'man', ',', 'of', 'men', ',', 'lost', 'to', 'a', 'bygone', 'era', 'when', 'men', 'were', 'men', '.', 'There', 'are', 'some', 'of', 'us', 'who', 'find', 'value', 'in', 'strength', 'and', 'direction', 'wrapped', 'in', 'a', 'confidence', 'that', 'contributes', 'to', 'a', 'sense', 'of', 'confidence', ',', 'containment', ',', 'and', 'security', '.', 'Yes', ',', 'they', 'do', 'not', 'make', 'men', 'like', 'that', 'anymore', '!', 'But', ',', 'then', 'how', 'often', 'do', 'you', 'find', 'women', 'who', 'are', 'made', 'like', 'Bo', 'Derek', '.'], 'label': 'pos'}

Step 2: Split the training set into a training set and a validation set

  • Since we currently only have the train/test splits, we need to create a new validation set. We can do this with .split().
  • The default split is 70/30; passing split_ratio changes the proportions, e.g. split_ratio=0.8 means 80% of the data goes to the training set and 20% to the validation set.
  • We also pass the random_state argument to make sure we get the same split every time.

In [4]:

import random
train_data, valid_data = train_data.split(random_state=random.seed(SEED))  # default split_ratio=0.7

In [5]:

print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')
Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000

Step 3: Build the vocabulary from the training set, i.e. map each word to an integer index

  • Next we need to create the vocabulary. The vocabulary maps each word one-to-one to an integer index.
  • We build the vocabulary from the 25k most frequent words, using the max_size argument.
  • All other words are represented by <unk>.

In [6]:

# TEXT.build_vocab(train_data, max_size=25000)
# LABEL.build_vocab(train_data)
TEXT.build_vocab(train_data, max_size=25000, vectors="glove.6B.100d", unk_init=torch.Tensor.normal_)
# For each word in the vocabulary of our corpus, its pretrained vector is looked up and used to
# build the Vocab of the corpus. The pretrained vectors come from GloVe; each word is a 100-dim vector.
# GloVe was trained on a much larger corpus than our movie reviews, so its vectors are a good
# initialization; the word vectors will still be fine-tuned during training.
LABEL.build_vocab(train_data)
.vector_cache/glove.6B.zip: 862MB [00:23, 36.0MB/s]                               
100%|█████████▉| 399597/400000 [00:25<00:00, 16569.01it/s]

In [7]:

print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")
Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2

In [8]:

print(list(LABEL.vocab.stoi.items()))  # there are only two label values
print(list(TEXT.vocab.stoi.items())[:20])
# The more frequent a word is in the corpus, the smaller its index;
# the first two entries default to <unk> and <pad>.
print("------"*10)
print(TEXT.vocab.freqs.most_common(20))
# Note that <unk> and <pad> do not appear in the frequency counts.
[('neg', 0), ('pos', 1)]
[('<unk>', 0), ('<pad>', 1), ('the', 2), (',', 3), ('.', 4), ('a', 5), ('and', 6), ('of', 7), ('to', 8), ('is', 9), ('in', 10), ('I', 11), ('it', 12), ('that', 13), ('"', 14), ("'s", 15), ('this', 16), ('-', 17), ('/><br', 18), ('was', 19)]
------------------------------------------------------------
[('the', 201815), (',', 192511), ('.', 165127), ('a', 109096), ('and', 108875), ('of', 100402), ('to', 93905), ('is', 76001), ('in', 61097), ('I', 54439), ('it', 53649), ('that', 49325), ('"', 44431), ("'s", 43359), ('this', 42423), ('-', 37142), ('/><br', 35613), ('was', 34947), ('as', 30412), ('movie', 29873)]

In [9]:

print(TEXT.vocab.itos[:10])  # look at the TEXT vocabulary (index-to-string list)
['<unk>', '<pad>', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is']

Step 4: Create the iterators; each iteration returns a batch of examples

  • The last step of data preparation is creating the iterators. Each iteration returns a batch of examples.
  • We will use a BucketIterator. A BucketIterator puts sentences of similar length into the same batch, so that each batch does not contain too much padding.
  • Strictly speaking, the model code in this notebook has one issue: the <pad> tokens are fed into the model as ordinary inputs during training. A better approach is to mask out the outputs produced by <pad> inside the model (see the sketch after this list). Here we keep things simple and use <pad> as a model input as well; since there is not that much padding, the models still perform reasonably well.
  • If we have a GPU, we can also specify that the tensors returned by each iteration should live on the GPU.
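Below is a minimal sketch (not used in this notebook) of how the <pad> positions could be masked out before averaging, assuming embedded has shape [seq_len, batch_size, emb_dim], text has shape [seq_len, batch_size], and pad_idx is the index of the <pad> token.

import torch

def masked_average(embedded, text, pad_idx):
    # embedded: [seq_len, batch_size, emb_dim], text: [seq_len, batch_size]
    mask = (text != pad_idx).unsqueeze(-1).float()  # 1.0 for real tokens, 0.0 for <pad>
    summed = (embedded * mask).sum(dim=0)           # [batch_size, emb_dim]; pads contribute nothing
    lengths = mask.sum(dim=0).clamp(min=1)          # number of real tokens per example
    return summed / lengths                         # average over the non-pad positions only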

In [11]:

BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Split the examples into batches, putting examples of similar length into the same batch
# and padding the shorter ones.
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device)

'''
Iterator: the standard iterator.

BucketIterator: compared with the standard iterator, it groups examples of similar length
into the same batch. Since each batch is usually padded to the length of its longest sequence,
this greatly reduces the amount of padding when example lengths vary a lot.
The fix_length argument of Field can also be used to truncate or pad examples to a fixed length.

BPTTIterator: an iterator based on BPTT (back-propagation through time), mainly used for language models.
'''

Out[11]:

'\nIterator: the standard iterator.\n\nBucketIterator: compared with the standard iterator, it groups examples of similar length\ninto the same batch. Since each batch is usually padded to the length of its longest sequence,\nthis greatly reduces the amount of padding when example lengths vary a lot.\nThe fix_length argument of Field can also be used to truncate or pad examples to a fixed length.\n\nBPTTIterator: an iterator based on BPTT (back-propagation through time), mainly used for language models.\n'

In [12]:

print(next(iter(train_iterator)).label.shape)
print(next(iter(train_iterator)).text.shape)
# Running this cell again shows that the (padded) review length changes from batch to batch.
# Below, text has shape 983 * 64, where 983 is the padded length of the reviews in this batch.
torch.Size([64])
torch.Size([983, 64])

In [13]:

# Take one review out of a batch
batch = next(iter(train_iterator))
print(batch.text.shape)
print([TEXT.vocab.itos[i] for i in batch.text[:,0]])
# The padded length of this batch is 1077, so the review ends with many <pad> tokens.
torch.Size([1077, 64])
['It', 'was', 'interesting', 'to', 'see', 'how', 'accurate', 'the', 'writing', 'was', 'on', 'the', 'geek', 'buzz', 'words', ',', 'yet', 'very', 'naive', 'on', 'the', 'corporate', 'world', '.', 'The', 'Justice', 'Department', 'would', 'catch', 'more', 'of', 'the', 'big', '<unk>', 'giants', 'if', 'they', 'did', 'such', 'naive', 'things', 'to', 'win', '.', 'The', 'real', 'corporate', 'world', 'is', 'much', 'more', 'subtle', 'and', 'interesting', ',', 'yet', 'every', 'bit', 'as', 'sinister', '.', 'I', 'seriously', 'doubt', 'ANY', '<unk>', 'would', 'actually', 'kill', 'someone', 'directly', ';', 'even', 'the', '<unk>', 'is', 'more', '<unk>', 'these', 'days', '.', 'In', 'the', 'real', 'world', ',', 'they', 'do', 'kill', 'people', 'with', '<unk>', ',', 'pollution', ',', '<unk>', ',', '<unk>', ',', 'etc', '.', 'This', 'movie', 'must', 'have', 'been', 'developed', 'by', 'some', 'garage', 'geeks', ',', 'I', 'think', ',', 'and', 'the', 'studios', 'did', "n't", 'know', 'the', 'difference', '.', 'They', 'just', 'wanted', 'something', 'to', 'capitalize', 'on', 'the', 'Microsoft', '<unk>', 'case', 'in', 'the', 'news', '.', '<pad>', '<pad>', '<pad>', ...]
(the remaining positions are all '<pad>', padding the review up to the batch length of 1077)


Step 5: Build the Word Averaging model

The Word Averaging model

  • We first introduce a simple Word Averaging model. The model is very simple: every word is mapped to a word embedding vector by an Embedding layer, all the word vectors of a sentence are averaged to give a vector representation of the whole sentence, and this sentence vector is passed through a Linear layer to do the classification.

  • We use avg_pool2d to do the average pooling. The goal is to average the sentence-length dimension down to 1 while keeping the embedding dimension.

  • The kernel size of avg_pool2d is (embedded.shape[1], 1), so the sentence-length dimension gets squashed; the small shape check below illustrates this.
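A tiny shape check (illustrative only, with made-up sizes, not part of the original notebook): avg_pool2d with a kernel that covers the whole sentence-length dimension is just the mean over that dimension.

import torch
import torch.nn.functional as F

embedded = torch.randn(64, 983, 100)                     # [batch_size, seq_len, embedding_dim]
pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1))  # kernel (seq_len, 1) covers the whole sentence
print(pooled.shape)                                      # torch.Size([64, 1, 100])
print(torch.allclose(pooled.squeeze(1), embedded.mean(dim=1), atol=1e-6))  # True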

In [5]:

import torch.nn as nn
import torch.nn.functional as F

class WordAVGModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        # vocab_size = vocabulary size = 25002, embedding_dim = dimension of each word vector = 100
        # padding_idx: the embedding at this index (the <pad> token) gets no gradient,
        # so it stays at whatever it is initialized to

        self.fc = nn.Linear(embedding_dim, output_dim)
        # output_dim: the output dimension; a single number is enough, so output_dim = 1

    def forward(self, text):
        # text.shape = [seq_len, batch_size]; text is one batch of data,
        # where seq_len is the (padded) number of words in a review
        embedded = self.embedding(text)
        # embedded = [seq_len, batch_size, embedding_dim]
        embedded = embedded.permute(1, 0, 2)
        # reorder to [batch_size, seq_len, embedding_dim]

        pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)
        # [batch_size, embedding_dim]: the seq_len dimension is averaged down to 1 and squeezed away

        return self.fc(pooled)
        # (batch_size, embedding_dim) x (embedding_dim, output_dim) = (batch_size, output_dim)

In [6]:

INPUT_DIM = len(TEXT.vocab)  # 25002
EMBEDDING_DIM = 100
OUTPUT_DIM = 1  # a single score: above a threshold means positive, below means negative
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
# TEXT.pad_token = '<pad>'
# PAD_IDX = 1, the index of <pad>

model = WordAVGModel(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM, PAD_IDX)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-d9889c88c56d> in <module>
----> 1 INPUT_DIM = len(TEXT.vocab)  # 25002
2 EMBEDDING_DIM = 100
3 OUTPUT_DIM = 1  # a single score: above a threshold means positive, below means negative
4 PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
5 # TEXT.pad_token = '<pad>'

AttributeError: 'Field' object has no attribute 'vocab'

(This error apparently comes from re-running the cell in a fresh kernel before TEXT.build_vocab had been executed; when the cells are run in order, the model is created without problems.)

In [16]:

TEXT.pad_token

Out[16]:

'<pad>'

In [17]:

def count_parameters(model):  # count the trainable parameters (just a utility)
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
# the f-string calls the function inside the braces
The model has 2,500,301 trainable parameters

Step 6: Initialize the parameters

In [18]:

# Initialize the embedding weights with the GloVe vectors
pretrained_embeddings = TEXT.vocab.vectors  # the GloVe embedding matrix built above
model.embedding.weight.data.copy_(pretrained_embeddings)  # the trailing _ means an in-place copy, no assignment needed
# The vectors loaded with vectors="glove.6B.100d" are used as the initial parameters: 25002 * 100 values.

Out[18]:

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.1419,  0.0282,  0.2185,  ..., -0.1100, -0.1250,  0.0282],
        [-0.3326, -0.9215,  0.9239,  ...,  0.5057, -1.2898,  0.1782],
        [-0.8304,  0.3732,  0.0726,  ..., -0.0122,  0.2313, -0.2783]])

In [19]:

UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]  # UNK_IDX = 0

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
# The vocabulary has 25002 words; the first two, <unk> and <pad>, are not in GloVe,
# so we explicitly initialize them to zero vectors of EMBEDDING_DIM dimensions.

Step 7: Train the model

In [20]:

import torch.optim as optim

optimizer = optim.Adam(model.parameters())  # define the optimizer
criterion = nn.BCEWithLogitsLoss()  # define the loss: binary cross-entropy with a built-in sigmoid
# On nn.BCEWithLogitsLoss(), see: https://blog.csdn.net/qq_22210253/article/details/85222093
model = model.to(device)          # move the model to the GPU (if available)
criterion = criterion.to(device)  # move the loss to the GPU (if available)
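A quick illustration (not from the notebook) of what BCEWithLogitsLoss does: it applies the sigmoid internally and then computes binary cross-entropy. This is why the model can output raw scores (logits) and why the labels need dtype=torch.float.

import torch
import torch.nn as nn

logits = torch.tensor([0.8, -1.2, 0.3])   # raw model outputs
targets = torch.tensor([1.0, 0.0, 1.0])   # float labels, as produced by LabelField(dtype=torch.float)

loss_a = nn.BCEWithLogitsLoss()(logits, targets)       # sigmoid applied internally (numerically stabler)
loss_b = nn.BCELoss()(torch.sigmoid(logits), targets)  # explicit sigmoid + BCE
print(loss_a.item(), loss_b.item())  # the two values match up to floating point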

Computing the prediction accuracy

In [21]:

def binary_accuracy(preds, y):  # compute the accuracy
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    # round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    # torch.round rounds to the nearest integer, so the sigmoid outputs become 0 or 1

    correct = (rounded_preds == y).float()  # convert into float for division
    acc = correct.sum() / len(correct)
    return acc
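A quick sanity check of binary_accuracy with illustrative values (torch is already imported above): three of the four logits below round to the correct label after the sigmoid, so the accuracy is 0.75.

preds = torch.tensor([2.0, -1.0, 0.5, -3.0])   # raw logits
labels = torch.tensor([1.0, 1.0, 1.0, 0.0])
print(binary_accuracy(preds, labels))          # tensor(0.7500)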

In [22]:

def train(model, iterator, optimizer, criterion):

    epoch_loss = 0
    epoch_acc = 0
    total_len = 0
    model.train()  # switch to training mode
    # This call is important: it distinguishes training from evaluation mode.
    # Layers such as dropout and normalization behave differently during training
    # and must not be active at test time.

    for batch in iterator:  # iterator is train_iterator
        optimizer.zero_grad()  # clear the gradients so they do not accumulate across batches

        predictions = model(batch.text).squeeze(1)
        # batch.text is the `text` argument of the forward function above;
        # squeeze(1) removes the extra dimension so the shape matches batch.label

        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        # the accuracy is computed at every iteration

        loss.backward()   # back-propagation
        optimizer.step()  # gradient descent step

        epoch_loss += loss.item() * len(batch.label)
        # The loss is already averaged over the batch, so multiply by len(batch.label)
        # to get the total loss of this batch; summing gives the loss over all examples.

        epoch_acc += acc.item() * len(batch.label)
        # (acc.item(): accuracy of one batch) * batch size = number of correct predictions;
        # summing gives the number of correct predictions over all training examples.

        total_len += len(batch.label)
        # total number of examples seen by train_iterator; this should be 17500

    return epoch_loss / total_len, epoch_acc / total_len
    # epoch_loss / total_len : average loss over all batches of train_iterator
    # epoch_acc / total_len  : average accuracy over all batches of train_iterator

In [23]:

def evaluate(model, iterator, criterion):

    epoch_loss = 0
    epoch_acc = 0
    total_len = 0

    model.eval()
    # switch to evaluation mode: freezes dropout and other training-only layers

    with torch.no_grad():
        for batch in iterator:
            # iterator is valid_iterator

            # no back-propagation and no gradient descent here
            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item() * len(batch.label)
            epoch_acc += acc.item() * len(batch.label)
            total_len += len(batch.label)
    model.train()  # switch back to training mode

    return epoch_loss / total_len, epoch_acc / total_len

In [24]:

import time

def epoch_time(start_time, end_time):  # measure how long each epoch takes
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Step 8: Run the model and look at the results

In [25]:

# Trained on a Kaggle GPU; this took about 2 minutes.
N_EPOCHS = 20

best_valid_loss = float('inf')  # positive infinity

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    # average loss and accuracy on the training set for this epoch
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    # average loss and accuracy on the validation set, using the parameters just trained

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:  # save the model whenever the validation loss improves
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'wordavg-model.pt')

    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
Epoch: 01 | Epoch Time: 0m 5s
Train Loss: 0.684 | Train Acc: 58.78%
Val. Loss: 0.617 | Val. Acc: 72.51%
Epoch: 02 | Epoch Time: 0m 5s
Train Loss: 0.642 | Train Acc: 72.62%
Val. Loss: 0.504 | Val. Acc: 76.65%
Epoch: 03 | Epoch Time: 0m 5s
Train Loss: 0.569 | Train Acc: 78.81%
Val. Loss: 0.439 | Val. Acc: 81.07%
Epoch: 04 | Epoch Time: 0m 5s
Train Loss: 0.497 | Train Acc: 82.97%
Val. Loss: 0.404 | Val. Acc: 84.03%
Epoch: 05 | Epoch Time: 0m 5s
Train Loss: 0.435 | Train Acc: 85.95%
Val. Loss: 0.400 | Val. Acc: 85.69%
Epoch: 06 | Epoch Time: 0m 5s
Train Loss: 0.388 | Train Acc: 87.73%
Val. Loss: 0.412 | Val. Acc: 86.80%
Epoch: 07 | Epoch Time: 0m 5s
Train Loss: 0.349 | Train Acc: 88.83%
Val. Loss: 0.425 | Val. Acc: 87.64%
Epoch: 08 | Epoch Time: 0m 5s
Train Loss: 0.319 | Train Acc: 89.84%
Val. Loss: 0.446 | Val. Acc: 87.83%
Epoch: 09 | Epoch Time: 0m 5s
Train Loss: 0.293 | Train Acc: 90.54%
Val. Loss: 0.464 | Val. Acc: 88.25%
Epoch: 10 | Epoch Time: 0m 5s
Train Loss: 0.272 | Train Acc: 91.19%
Val. Loss: 0.480 | Val. Acc: 88.68%
Epoch: 11 | Epoch Time: 0m 5s
Train Loss: 0.254 | Train Acc: 91.82%
Val. Loss: 0.498 | Val. Acc: 88.87%
Epoch: 12 | Epoch Time: 0m 5s
Train Loss: 0.238 | Train Acc: 92.53%
Val. Loss: 0.517 | Val. Acc: 89.01%
Epoch: 13 | Epoch Time: 0m 5s
Train Loss: 0.222 | Train Acc: 93.03%
Val. Loss: 0.532 | Val. Acc: 89.25%
Epoch: 14 | Epoch Time: 0m 5s
Train Loss: 0.210 | Train Acc: 93.47%
Val. Loss: 0.547 | Val. Acc: 89.44%
Epoch: 15 | Epoch Time: 0m 5s
Train Loss: 0.198 | Train Acc: 93.95%
Val. Loss: 0.564 | Val. Acc: 89.49%
Epoch: 16 | Epoch Time: 0m 5s
Train Loss: 0.186 | Train Acc: 94.31%
Val. Loss: 0.582 | Val. Acc: 89.68%
Epoch: 17 | Epoch Time: 0m 5s
Train Loss: 0.175 | Train Acc: 94.74%
Val. Loss: 0.596 | Val. Acc: 89.69%
Epoch: 18 | Epoch Time: 0m 5s
Train Loss: 0.166 | Train Acc: 95.09%
Val. Loss: 0.615 | Val. Acc: 89.95%
Epoch: 19 | Epoch Time: 0m 5s
Train Loss: 0.156 | Train Acc: 95.36%
Val. Loss: 0.631 | Val. Acc: 89.91%
Epoch: 20 | Epoch Time: 0m 5s
Train Loss: 0.147 | Train Acc: 95.75%
Val. Loss: 0.647 | Val. Acc: 90.07%

Step 9: Make predictions

In [26]:

!ls
__notebook_source__.ipynb  wordavg-model.pt

In [55]:

# To download the model file from Kaggle to your local machine, run the code below and click the printed link.
from IPython.display import HTML
import pandas as pd
import numpy as np

def create_download_link(title="Download model file", filename="CNN-model.pt"):
    html = '<a href={filename}>{title}</a>'
    html = html.format(title=title, filename=filename)
    return HTML(html)

# create a link to download the saved model file
create_download_link(filename='wordavg-model.pt')

Out[55]:

Download model file

In [1]:

model.load_state_dict(torch.load("wordavg-model.pt"))
# use the saved parameters to make predictions
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-1-f795a3e78d6a> in <module>
----> 1 model.load_state_dict(torch.load("wordavg-model.pt"))
2 # use the saved parameters to make predictions

NameError: name 'model' is not defined

(Again, this appears to be the result of running the cell in a fresh kernel; with the cells run in order, the saved parameters load correctly.)

In [28]:

import spacy  # tokenizer, similar to NLTK
nlp = spacy.load('en')

def predict_sentiment(sentence):  # e.g. sentence = "I love This film bad"
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]  # tokenize
    # tokenized = ['I', 'love', 'This', 'film', 'bad']
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    # indices of the sentence's words in the 25002-word vocabulary

    tensor = torch.LongTensor(indexed).to(device)  # shape: [seq_len]
    # the word indices must be a LongTensor

    tensor = tensor.unsqueeze(1)
    # the model expects a batch dimension, so unsqueeze to [seq_len, batch_size=1]

    prediction = torch.sigmoid(model(tensor))
    # squash the raw score into (0, 1) with a sigmoid so it can be read as the probability of "positive"

    return prediction.item()

In [29]:

predict_sentiment("I love This film bad")

Out[29]:

0.9373546242713928

In [30]:

predict_sentiment("This film is great")

Out[30]:

1.0

The RNN model

  • Next we try replacing the model with a recurrent neural network (RNN). RNNs are often used to encode a sequence:

    h_t = RNN(x_t, h_{t-1})

  • We use the last hidden state h_T to represent the whole sentence.

  • We then pass h_T through a linear transformation f and use the result to predict the sentiment of the sentence. (A small shape check follows below.)
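Before reading the model code, a minimal shape check (illustrative only, using the same hyperparameters as below: emb_dim=100, hidden_dim=256, 2 layers, bidirectional) may help with the LSTM dimensions.

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2, bidirectional=True)
x = torch.randn(50, 64, 100)   # [seq_len, batch_size, emb_dim]
output, (hidden, cell) = lstm(x)
print(output.shape)   # torch.Size([50, 64, 512])  -> hid_dim * 2 directions
print(hidden.shape)   # torch.Size([4, 64, 256])   -> num_layers * 2 directions
# hidden[-2] and hidden[-1] are the final forward and backward states of the top layer;
# concatenating them gives the [batch_size, 512] sentence vector used by the model below.
print(torch.cat((hidden[-2], hidden[-1]), dim=1).shape)  # torch.Size([64, 512])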


In [32]:

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers, bidirectional, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                           bidirectional=bidirectional, dropout=dropout)
        # embedding_dim: dimension of each word vector
        # hidden_dim:    dimension of the hidden state
        # num_layers:    depth of the network (number of stacked LSTM layers)
        # bidirectional: whether the RNN is bidirectional
        # It helps to first understand the LSTM input/output shapes; see the shape check above.

        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        # hidden_dim is multiplied by 2 because the network is bidirectional and the two
        # directions are concatenated; this has nothing to do with n_layers.

        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text.shape = [seq_len, batch_size]
        embedded = self.dropout(self.embedding(text))  # [seq_len, batch_size, emb_dim]
        output, (hidden, cell) = self.rnn(embedded)
        # output = [seq_len, batch_size, hid_dim * num_directions]
        # hidden = [num_layers * num_directions, batch_size, hid_dim]
        # cell   = [num_layers * num_directions, batch_size, hid_dim]
        # With num_layers = 2 and a bidirectional network there are 2 * 2 = 4 layers of hidden states.

        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        # and apply dropout
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        # hidden = [batch_size, hid_dim * num_directions]
        # The last forward and backward hidden states of the top layer are concatenated as the sentence
        # representation; since only the final time step is needed, this no longer depends on seq_len.

        return self.fc(hidden.squeeze(0))  # a final fully-connected layer: [batch_size, output_dim]


In [36]:

INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
            N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)
model

Out[36]:

RNN(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (rnn): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.5)
)

In [34]:

print(f'The model has {count_parameters(model):,} trainable parameters')
# roughly twice as many parameters as the word averaging model
The model has 4,810,857 trainable parameters

In [37]:

# Same initialization as before
model.embedding.weight.data.copy_(pretrained_embeddings)
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)
tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.1419,  0.0282,  0.2185,  ..., -0.1100, -0.1250,  0.0282],
        [-0.3326, -0.9215,  0.9239,  ...,  0.5057, -1.2898,  0.1782],
        [-0.8304,  0.3732,  0.0726,  ..., -0.0122,  0.2313, -0.2783]])

Training the RNN model

In [38]:

optimizer = optim.Adam(model.parameters())
model = model.to(device)

In [39]:

# Same as before; on a Kaggle GPU this took about 40 minutes.
N_EPOCHS = 20
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'lstm-model.pt')

    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
Epoch: 01 | Epoch Time: 2m 1s
Train Loss: 0.667 | Train Acc: 59.09%
Val. Loss: 0.633 | Val. Acc: 64.67%
Epoch: 02 | Epoch Time: 2m 1s
Train Loss: 0.663 | Train Acc: 60.33%
Val. Loss: 0.669 | Val. Acc: 69.21%
Epoch: 03 | Epoch Time: 2m 2s
Train Loss: 0.650 | Train Acc: 61.06%
Val. Loss: 0.579 | Val. Acc: 70.55%
Epoch: 04 | Epoch Time: 2m 2s
Train Loss: 0.493 | Train Acc: 77.43%
Val. Loss: 0.382 | Val. Acc: 83.43%
Epoch: 05 | Epoch Time: 2m 2s
Train Loss: 0.394 | Train Acc: 83.71%
Val. Loss: 0.338 | Val. Acc: 85.97%
Epoch: 06 | Epoch Time: 2m 3s
Train Loss: 0.338 | Train Acc: 86.26%
Val. Loss: 0.309 | Val. Acc: 87.21%
Epoch: 07 | Epoch Time: 2m 2s
Train Loss: 0.292 | Train Acc: 88.37%
Val. Loss: 0.295 | Val. Acc: 88.73%
Epoch: 08 | Epoch Time: 2m 3s
Train Loss: 0.252 | Train Acc: 90.26%
Val. Loss: 0.300 | Val. Acc: 89.31%
Epoch: 09 | Epoch Time: 2m 2s
Train Loss: 0.246 | Train Acc: 90.51%
Val. Loss: 0.282 | Val. Acc: 88.76%
Epoch: 10 | Epoch Time: 2m 3s
Train Loss: 0.205 | Train Acc: 92.37%
Val. Loss: 0.295 | Val. Acc: 88.31%
Epoch: 11 | Epoch Time: 2m 1s
Train Loss: 0.203 | Train Acc: 92.46%
Val. Loss: 0.289 | Val. Acc: 89.25%
Epoch: 12 | Epoch Time: 2m 3s
Train Loss: 0.178 | Train Acc: 93.58%
Val. Loss: 0.301 | Val. Acc: 89.41%
Epoch: 13 | Epoch Time: 2m 3s
Train Loss: 0.158 | Train Acc: 94.43%
Val. Loss: 0.301 | Val. Acc: 89.51%
Epoch: 14 | Epoch Time: 2m 2s
Train Loss: 0.158 | Train Acc: 94.63%
Val. Loss: 0.289 | Val. Acc: 89.95%
Epoch: 15 | Epoch Time: 2m 2s
Train Loss: 0.142 | Train Acc: 95.00%
Val. Loss: 0.314 | Val. Acc: 89.59%
Epoch: 16 | Epoch Time: 2m 2s
Train Loss: 0.123 | Train Acc: 95.62%
Val. Loss: 0.329 | Val. Acc: 89.99%
Epoch: 17 | Epoch Time: 2m 4s
Train Loss: 0.107 | Train Acc: 96.16%
Val. Loss: 0.325 | Val. Acc: 89.75%
Epoch: 18 | Epoch Time: 2m 4s
Train Loss: 0.100 | Train Acc: 96.66%
Val. Loss: 0.341 | Val. Acc: 89.49%
Epoch: 19 | Epoch Time: 2m 3s
Train Loss: 0.096 | Train Acc: 96.63%
Val. Loss: 0.340 | Val. Acc: 89.79%
Epoch: 20 | Epoch Time: 2m 3s
Train Loss: 0.080 | Train Acc: 97.31%
Val. Loss: 0.380 | Val. Acc: 89.83%

The bidirectional LSTM reaches a noticeably lower validation loss than the word averaging model, with a similar validation accuracy of roughly 90%.

Finally, we compute the metric we actually care about, the test loss and accuracy, using the parameters that gave us the best validation loss.

In [40]:

# Download the model file to your local machine
from IPython.display import HTML
import pandas as pd
import numpy as np

def create_download_link(title="Download model file", filename="wordavg-model.pt"):
    html = '<a href={filename}>{title}</a>'
    html = html.format(title=title, filename=filename)
    return HTML(html)

# create a link to download the saved model file
create_download_link(filename='lstm-model.pt')

Out[40]:

Download model file

In [41]:

model.load_state_dict(torch.load('lstm-model.pt'))
test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
Test Loss: 0.304 | Test Acc: 88.11%

In [44]:

predict_sentiment("I feel This film bad")

Out[44]:

0.3637591600418091

In [43]:

predict_sentiment("This film is great")

Out[43]:

0.9947803020477295

The CNN model

In [45]:

class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters,
                 filter_sizes, output_dim, dropout, pad_idx):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels=1, out_channels=n_filters,
                      kernel_size=(fs, embedding_dim))
            for fs in filter_sizes
        ])
        # in_channels:  number of input channels; for text this is 1
        # out_channels: number of output channels (filters)
        # fs:           how many words each sliding window covers
        # `for fs in filter_sizes` builds several convolution branches whose outputs are concatenated at the end

        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        text = text.permute(1, 0)         # [batch_size, sent_len]
        embedded = self.embedding(text)   # [batch_size, sent_len, emb_dim]
        embedded = embedded.unsqueeze(1)  # [batch_size, 1, sent_len, emb_dim]
        # unsqueeze adds the channel dimension that nn.Conv2d expects
        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
        # each conved element = [batch_size, n_filters, sent_len - filter_size + 1];
        # there is one element per filter size

        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        # max-pool over the third dimension of each conved element
        # pooled_n = [batch_size, n_filters]

        cat = self.dropout(torch.cat(pooled, dim=1))
        # cat = [batch_size, n_filters * len(filter_sizes)]
        # the len(filter_sizes) convolution branches are concatenated and fed to the fully-connected layer

        return self.fc(cat)
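A small shape check (illustrative sizes only, not from the notebook) for a single convolution branch, assuming batch_size=64, sent_len=50, emb_dim=100, n_filters=100 and filter size fs=3:

import torch
import torch.nn as nn
import torch.nn.functional as F

embedded = torch.randn(64, 1, 50, 100)   # [batch_size, 1, sent_len, emb_dim]
conv = nn.Conv2d(in_channels=1, out_channels=100, kernel_size=(3, 100))
conved = F.relu(conv(embedded)).squeeze(3)                 # [64, 100, 50 - 3 + 1] = [64, 100, 48]
pooled = F.max_pool1d(conved, conved.shape[2]).squeeze(2)  # max over all positions -> [64, 100]
print(conved.shape, pooled.shape)
# with FILTER_SIZES = [3, 4, 5], three such [64, 100] tensors are concatenated into [64, 300]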

In [47]:

# Same as before
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
N_FILTERS = 100
FILTER_SIZES = [3, 4, 5]
OUTPUT_DIM = 1
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]


model = CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT, PAD_IDX)
model.embedding.weight.data.copy_(pretrained_embeddings)
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
model = model.to(device)
print(f'The model has {count_parameters(model):,} trainable parameters')
# about the same number of parameters as the word averaging model
The model has 2,620,801 trainable parameters

In [48]:

# Same as before; this took about 8 minutes.
optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)

N_EPOCHS = 20

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'CNN-model.pt')

    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
Epoch: 01 | Epoch Time: 0m 19s
Train Loss: 0.652 | Train Acc: 61.81%
Val. Loss: 0.527 | Val. Acc: 76.20%
Epoch: 02 | Epoch Time: 0m 19s
Train Loss: 0.427 | Train Acc: 80.66%
Val. Loss: 0.358 | Val. Acc: 84.36%
Epoch: 03 | Epoch Time: 0m 19s
Train Loss: 0.304 | Train Acc: 87.14%
Val. Loss: 0.318 | Val. Acc: 86.45%
Epoch: 04 | Epoch Time: 0m 19s
Train Loss: 0.215 | Train Acc: 91.42%
Val. Loss: 0.313 | Val. Acc: 86.92%
Epoch: 05 | Epoch Time: 0m 19s
Train Loss: 0.156 | Train Acc: 94.18%
Val. Loss: 0.326 | Val. Acc: 87.01%
Epoch: 06 | Epoch Time: 0m 19s
Train Loss: 0.105 | Train Acc: 96.33%
Val. Loss: 0.344 | Val. Acc: 87.16%
Epoch: 07 | Epoch Time: 0m 19s
Train Loss: 0.075 | Train Acc: 97.61%
Val. Loss: 0.372 | Val. Acc: 87.28%
Epoch: 08 | Epoch Time: 0m 19s
Train Loss: 0.052 | Train Acc: 98.39%
Val. Loss: 0.403 | Val. Acc: 87.21%
Epoch: 09 | Epoch Time: 0m 19s
Train Loss: 0.041 | Train Acc: 98.64%
Val. Loss: 0.433 | Val. Acc: 87.09%
Epoch: 10 | Epoch Time: 0m 19s
Train Loss: 0.031 | Train Acc: 99.10%
Val. Loss: 0.462 | Val. Acc: 87.01%
Epoch: 11 | Epoch Time: 0m 19s
Train Loss: 0.023 | Train Acc: 99.29%
Val. Loss: 0.495 | Val. Acc: 86.93%
Epoch: 12 | Epoch Time: 0m 19s
Train Loss: 0.021 | Train Acc: 99.34%
Val. Loss: 0.530 | Val. Acc: 86.84%
Epoch: 13 | Epoch Time: 0m 19s
Train Loss: 0.015 | Train Acc: 99.60%
Val. Loss: 0.559 | Val. Acc: 86.73%
Epoch: 14 | Epoch Time: 0m 19s
Train Loss: 0.013 | Train Acc: 99.69%
Val. Loss: 0.597 | Val. Acc: 86.48%
Epoch: 15 | Epoch Time: 0m 19s
Train Loss: 0.012 | Train Acc: 99.70%
Val. Loss: 0.608 | Val. Acc: 86.63%
Epoch: 16 | Epoch Time: 0m 19s
Train Loss: 0.009 | Train Acc: 99.76%
Val. Loss: 0.640 | Val. Acc: 86.77%
Epoch: 17 | Epoch Time: 0m 19s
Train Loss: 0.010 | Train Acc: 99.73%
Val. Loss: 0.674 | Val. Acc: 86.51%
Epoch: 18 | Epoch Time: 0m 19s
Train Loss: 0.012 | Train Acc: 99.63%
Val. Loss: 0.704 | Val. Acc: 86.71%
Epoch: 19 | Epoch Time: 0m 19s
Train Loss: 0.010 | Train Acc: 99.65%
Val. Loss: 0.757 | Val. Acc: 86.44%
Epoch: 20 | Epoch Time: 0m 20s
Train Loss: 0.006 | Train Acc: 99.80%
Val. Loss: 0.756 | Val. Acc: 86.55%

In [49]:

# The results above show overfitting; feel free to tune the hyperparameters yourself.
model.load_state_dict(torch.load('CNN-model.pt'))
test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
Test Loss: 0.339 | Test Acc: 85.68%

In [50]:

predict_sentiment("I feel This film bad")

Out[50]:

0.6535547375679016

In [52]:

predict_sentiment("This film is great well")
# An extra word ("well") is appended; without it this raises an error, because FILTER_SIZES = [3,4,5]
# includes a filter of size 5, so the input sentence cannot be shorter than 5 tokens.

Out[52]:

0.9950380921363831

In [54]:

# Download the model file from Kaggle to your local machine
from IPython.display import HTML
import pandas as pd
import numpy as np

def create_download_link(title="Download model file", filename="CNN-model.pt"):
    html = '<a href={filename}>{title}</a>'
    html = html.format(title=title, filename=filename)
    return HTML(html)

# create a link to download the saved model file
create_download_link(filename='CNN-model.pt')

Out[54]:

Download model file
