酒店评价情感分类与CNN模型

酒店评价情感分类与CNN模型

参考了https://github.com/bentrevett/pytorch-sentiment-analysis

我们会用PyTorch模型来做情感分析(检测一段文字的情感是正面的还是负面的)。我们会使用ChnSentiCorp_htl数据集,即酒店评论数据集。

数据下载链接:https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv

模型从简单到复杂,我们会依次构建:

  • Word Averaging模型
  • RNN/LSTM模型
  • CNN模型

准备数据

  • 首先让我们加载数据,来看看这一批酒店评价数据长得怎样

In [1]:

1
2
3
4
5
6
7
8
import pandas as pd
import numpy as np
path = "ChnSentiCorp_htl_all.csv"
pd_all = pd.read_csv(path)

print('评论数目(总体):%d' % pd_all.shape[0])
print('评论数目(正向):%d' % pd_all[pd_all.label==1].shape[0])
print('评论数目(负向):%d' % pd_all[pd_all.label==0].shape[0])
1
2
3
评论数目(总体):7766
评论数目(正向):5322
评论数目(负向):2444

In [2]:

1
pd_all.sample(5)

Out[2]:

label review
914 1 地点看上去不错,在北京西客站对面,但出行十分不便,周边没有地铁,门口出租车倒是挺多,但就是不…
7655 0 酒店位置较偏僻,环境清净,交通也方便,但酒店及周边就餐选择不多;浴场海水中有水草,水亦太浅,…
3424 1 酒店给人感觉很温欣,服务员也挺有礼貌,房间内的舒适度也非常不错,离开李公递也很近,下次来苏州…
4854 1 离故宫不太远,走路大概10分钟不到点,环境还好,有一点非常不好的是窗帘就只有一层,早上很早就…
5852 0 宾馆背面就是省道,交通是方便的,停车场很大也很方便,但晚上尤其半夜路过的汽车声音很响,拖拉机…

In [3]:

1
2
3
4
5
import pkuseg

seg = pkuseg.pkuseg() # 以默认配置加载模型
text = seg.cut('我爱北京天安门') # 进行分词
print(text)
1
['我', '爱', '北京', '天安门']

下面我们先手工把数据分成train, dev, test三个部分

In [4]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
pd_all_shuf = pd_all.sample(frac=1)

# 总共有多少ins
total_num_ins = pd_all_shuf.shape[0]
pd_train = pd_all_shuf.iloc[:int(total_num_ins*0.8)]
pd_dev = pd_all_shuf.iloc[int(total_num_ins*0.8):int(total_num_ins*0.9)]
pd_test = pd_all_shuf.iloc[int(total_num_ins*0.9):]

# text, label
train_text = [seg.cut(str(text)) for text in pd_train.review.tolist()]
dev_text = [seg.cut(str(text)) for text in pd_dev.review.tolist()]
test_text = [seg.cut(str(text)) for text in pd_test.review.tolist()]
train_label = pd_train.label.tolist()
dev_label = pd_dev.label.tolist()
test_label = pd_test.label.tolist()

In [6]:

1
train_label[0]

Out[6]:

1
0

我们从训练数据构造出一个由单词到index的单词表

In [7]:

1
2
3
4
5
6
7
8
9
10
11
12
from collections import Counter
def build_vocab(sents, max_words=50000):
word_counts = Counter()
for sent in sents:
for word in sent:
word_counts[word] += 1
itos = [w for w, c in word_counts.most_common(max_words)]
itos = ["UNK", "PAD"] + itos
stoi = {w:i for i, w in enumerate(itos)}
return itos, stoi

itos, stoi = build_vocab(train_text)

查看一下比较高频的单词

In [8]:

1
itos[:10]

Out[8]:

1
['UNK', 'PAD', ',', '的', '。', '了', ',', '酒店', '是', '很']

In [10]:

1
stoi["酒店"]

Out[10]:

1
7

我们把文本中的单词都转换成index

In [12]:

1
2
3
train_idx = [[stoi.get(word, stoi.get("UNK")) for word in text] for text in train_text ]
dev_idx = [[stoi.get(word, stoi.get("UNK")) for word in text] for text in dev_text ]
test_idx = [[stoi.get(word, stoi.get("UNK")) for word in text] for text in test_text ]

把数据和label都转成batch

In [15]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
def get_minibatches(text_idx, labels, batch_size=64, sort=True):
if sort:
text_idx_and_labels = sorted(list(zip(text_idx, labels)), key=lambda x: len(x[0]))

text_idx_batches = []
label_batches = []
for i in range(0, len(text_idx), batch_size):
text_batch = [t for t, l in text_idx_and_labels[i:i+batch_size]]
label_batch = [l for t, l in text_idx_and_labels[i:i+batch_size]]
max_len = max([len(t) for t in text_batch])
text_batch_np = np.ones((len(text_batch), max_len), dtype=np.int) # batch_size * max_seq_ength
for i, t in enumerate(text_batch):
text_batch_np[i, :len(t)] = t
text_idx_batches.append(text_batch_np)
label_batches.append(np.array(label_batch))

return text_idx_batches, label_batches

train_batches, train_label_batches = get_minibatches(train_idx, train_label)
dev_batches, dev_label_batches = get_minibatches(dev_idx, dev_label)
test_batches, test_label_batches = get_minibatches(test_idx, test_label)

In [17]:

1
train_batches[20]

Out[17]:

1
2
3
4
5
6
7
array([[  80,  177,  149, ...,  191,    3,    1],
[ 49, 18, 20, ..., 53, 4, 1],
[ 7, 18, 17, ..., 702, 4, 1],
...,
[1107, 2067, 10, ..., 748, 172, 442],
[ 241, 9, 19, ..., 17, 44, 30],
[3058, 20, 6, ..., 9, 19, 98]])

In [18]:

1
train_label_batches[20]

Out[18]:

1
2
3
array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1,
1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0,
1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1])
  • 和之前一样,我们会设定random seeds使实验可以复现。

In [19]:

1
2
3
4
5
6
7
8
9
10
11
import torch
from torchtext import data
import random

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Word Averaging模型

  • 我们首先介绍一个简单的Word Averaging模型。这个模型非常简单,我们把每个单词都通过Embedding层投射成word embedding vector,然后把一句话中的所有word vector做个平均,就是整个句子的vector表示了。接下来把这个sentence vector传入一个Linear层,做分类即可。

In [32]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAVGModel(nn.Module):
def __init__(self, vocab_size, embedding_size, output_size, pad_idx, dropout_p=0.2):
super(WordAVGModel, self).__init__()
self.embed = nn.Embedding(vocab_size, embedding_size, padding_idx=pad_idx)
self.linear = nn.Linear(embedding_size, output_size)
self.dropout = nn.Dropout(dropout_p) # 这个参数经常拿来调节

def forward(self, text, mask):
# text: batch_size * max_seq_len
# mask: batch_size * max_seq_len
embedded = self.embed(text) # [batch_size, max_seq_len, embedding_size]
embedded = self.dropout(embedded)
# dropout
mask = (1. - mask.float()).unsqueeze(2) # [batch_size, seq_len, 1], 1 represents word, 0 represents padding
embedded = embedded * mask # [batch_size, seq_len, embedding_size]
# 求平均
sent_embed = embedded.sum(1) / (mask.sum(1) + 1e-9) # 防止mask.sum为0,那么不能除以零。
# dropout
return self.linear(sent_embed)

In [75]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordMaxModel(nn.Module):
def __init__(self, vocab_size, embedding_size, output_size, pad_idx, dropout_p=0.2):
super(WordMaxModel, self).__init__()
self.embed = nn.Embedding(vocab_size, embedding_size, padding_idx=pad_idx)
self.linear = nn.Linear(embedding_size, output_size)
self.dropout = nn.Dropout(dropout_p) # 这个参数经常拿来调节

def forward(self, text, mask):
# text: batch_size * max_seq_len
# mask: batch_size * max_seq_len
embedded = self.embed(text) # [batch_size, max_seq_len, embedding_size]
embedded = self.dropout(embedded)
embedded.masked_fill(mask.unsqueeze(2), -999999)
# dropout
sent_embed = torch.max(embedded, 1)[0]
# dropout
return self.linear(sent_embed)

In [76]:

1
2
3
4
5
6
7
8
9
VOCAB_SIZE = len(itos)
EMBEDDING_SIZE = 100
OUTPUT_SIZE = 1
PAD_IDX = stoi["PAD"]

model = WordMaxModel(vocab_size=VOCAB_SIZE,
embedding_size=EMBEDDING_SIZE,
output_size=OUTPUT_SIZE,
pad_idx=PAD_IDX)

In [77]:

1
VOCAB_SIZE

Out[77]:

1
24001

In [78]:

1
2
3
4
5
# model
def count_parameters(model):
return sum(p.numel() for p in model.parameters() if p.requires_grad)

count_parameters(model)

Out[78]:

1
2400201

In [79]:

1
UNK_IDX = stoi["UNK"]

训练模型

In [80]:

1
2
3
4
5
optimizer = torch.optim.Adam(model.parameters())
crit = nn.BCEWithLogitsLoss()

model = model.to(device)
# crit = crit.to(device)

计算预测的准确率

In [81]:

1
2
3
4
5
def binary_accuracy(preds, y):
rounded_preds = torch.round(torch.sigmoid(preds))
correct = (rounded_preds == y).float()
acc = correct.sum() / len(correct)
return acc

In [82]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
def train(model, text_idxs, labels, optimizer, crit):
epoch_loss, epoch_acc = 0., 0.
model.train()
total_len = 0.
for text, label in zip(text_idxs, labels):
text = torch.from_numpy(text).to(device)
label = torch.from_numpy(label).to(device)
mask = text == PAD_IDX
preds = model(text, mask).squeeze() # [batch_size, sent_length]
loss = crit(preds, label.float())
acc = binary_accuracy(preds, label)

# sgd
optimizer.zero_grad()
loss.backward()
optimizer.step()

# print("batch loss: {}".format(loss.item()))

epoch_loss += loss.item() * len(label)
epoch_acc += acc.item() * len(label)
total_len += len(label)

return epoch_loss / total_len, epoch_acc / total_len

In [83]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
def evaluate(model, text_idxs, labels, crit):
epoch_loss, epoch_acc = 0., 0.
model.eval()
total_len = 0.
for text, label in zip(text_idxs, labels):
text = torch.from_numpy(text).to(device)
label = torch.from_numpy(label).to(device)
mask = text == PAD_IDX
with torch.no_grad():
preds = model(text, mask).squeeze()
loss = crit(preds, label.float())
acc = binary_accuracy(preds, label)

epoch_loss += loss.item() * len(label)
epoch_acc += acc.item() * len(label)
total_len += len(label)
model.train()

return epoch_loss / total_len, epoch_acc / total_len

In [84]:

1
2
3
4
5
6
7
8
9
10
11
12
N_EPOCHS = 10
best_valid_acc = 0.
for epoch in range(N_EPOCHS):
train_loss, train_acc = train(model, train_batches, train_label_batches, optimizer, crit)
valid_loss, valid_acc = evaluate(model, dev_batches, dev_label_batches, crit)

if valid_acc > best_valid_acc:
best_valid_acc = valid_acc
torch.save(model.state_dict(), "wordavg-model.pth")

print("Epoch", epoch, "Train Loss", train_loss, "Train Acc", train_acc)
print("Epoch", epoch, "Valid Loss", valid_loss, "Valid Acc", valid_acc)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Epoch 0 Train Loss 0.6241851981347865 Train Acc 0.6785254346426272
Epoch 0 Valid Loss 0.7439712684126895 Valid Acc 0.396396396434752
Epoch 1 Train Loss 0.6111872896254639 Train Acc 0.6775595621377978
Epoch 1 Valid Loss 0.7044879783381213 Valid Acc 0.4761904762288318
Epoch 2 Train Loss 0.5826128212314072 Train Acc 0.7041210560206053
Epoch 2 Valid Loss 0.667004818775172 Valid Acc 0.5791505791889348
Epoch 3 Train Loss 0.5516750626769744 Train Acc 0.7293947198969736
Epoch 3 Valid Loss 0.6347547087583456 Valid Acc 0.6473616476684924
Epoch 4 Train Loss 0.5236469646921791 Train Acc 0.7512878300064392
Epoch 4 Valid Loss 0.5953507212444928 Valid Acc 0.7348777351845799
Epoch 5 Train Loss 0.4969042095152394 Train Acc 0.7707662588538313
Epoch 5 Valid Loss 0.5561252224092471 Valid Acc 0.7786357789426237
Epoch 6 Train Loss 0.46501466702892946 Train Acc 0.797005795235029
Epoch 6 Valid Loss 0.5206214915217887 Valid Acc 0.8018018021086468
Epoch 7 Train Loss 0.43595607032794303 Train Acc 0.8163232453316163
Epoch 7 Valid Loss 0.4846100037776058 Valid Acc 0.8159588161889497
Epoch 8 Train Loss 0.40671270164611334 Train Acc 0.8386992916934964
Epoch 8 Valid Loss 0.45964578196809097 Valid Acc 0.8211068213369549
Epoch 9 Train Loss 0.38044816804408105 Train Acc 0.8539922730199614
Epoch 9 Valid Loss 0.4279780917953187 Valid Acc 0.8416988419289755

In [85]:

1
model.load_state_dict(torch.load("wordavg-model.pth"))

Out[85]:

1
<All keys matched successfully>

In [86]:

1
2
3
4
5
6
7
8
9
10
def predict_sentiment(model, sentence):
model.eval()
indexed = [stoi.get(t, PAD_IDX) for t in seg.cut(sentence)]
tensor = torch.LongTensor(indexed).to(device) # seq_len
tensor = tensor.unsqueeze(0) # batch_size* seq_len
mask = tensor == PAD_IDX
# print(tensor, "\n", mask)
with torch.no_grad():
pred = torch.sigmoid(model(tensor, mask))
return pred.item()

In [88]:

1
predict_sentiment(model, "这个酒店非常脏乱差,不推荐")

Out[88]:

1
0.6831367611885071

In [90]:

1
predict_sentiment(model, "这个酒店非常好,强烈推荐!")

Out[90]:

1
0.8252924680709839

In [91]:

1
predict_sentiment(model, "房间设备太破,连喷头都是不好用,空调几乎感觉不到,虽然我开了最大另外就是设备维修不及时,洗澡用品感觉都是廉价货,味道很奇怪的洗头液等等...总体感觉服务还可以,设备招待所水平...")

Out[91]:

1
0.5120517611503601

In [92]:

1
predict_sentiment(model, "房间稍小,但清洁,非常实惠。不足之处是:双人房的洗澡用品只有一套.宾馆反馈2008年8月5日:尊敬的宾客:您好!感谢您选择入住金陵溧阳宾馆!对于酒店双人房内的洗漱用品只有一套的问题,我们已经召集酒店相关部门对此问题进行了研究和整改。努力将我们的管理与服务工作做到位,进一步关注宾客,关注细节!再次向您表示我们最衷心的感谢!期待您能再次来溧阳并入住金陵溧阳宾馆!让我们有给您提供更加优质服务的机会!顺祝您工作顺利!身体健康!金陵溧阳宾馆客务关系主任")

Out[92]:

1
0.7319579124450684

In [93]:

1
predict_sentiment(model, "该酒店对去溧阳公务或旅游的人都很适合,自助早餐很丰富,酒店内部环境和服务很好。唯一的不足是酒店大门口在晚上时太乱,各种车辆和人在门口挤成一团。补充点评2008年5月9日:房间淋浴水压不稳,一会热、一会冷,很不好调整。宾馆反馈2008年5月13日:非常感谢您选择入住金陵溧阳宾馆。您给予我们的肯定与赞赏让我们倍受鼓舞,也使我们更加自信地去做好每一天的服务工作。正是有许多像您一样的宾客给予我们不断的鼓励和赞赏,酒店的服务品质才能得以不断提升。对于酒店大门口的秩序和房间淋浴水的问题我们已做出了相应的措施。再次向您表示我们最衷心的感谢!我们期待您的再次光临!")

Out[93]:

1
0.793725311756134

In [94]:

1
predict_sentiment(model, "环境不错,室内色调很温馨,MM很满意!就是窗户收拾得太马虎了,拉开窗帘就觉得很凌乱的感觉。最不足的地方就是淋浴了,一是地方太小了,二是洗澡时水时大时小的,中间还停了几秒!!")

Out[94]:

1
0.7605408430099487

In [95]:

1
2
3
model.load_state_dict(torch.load("wordavg-model.pth"))
test_loss, test_acc = evaluate(model, test_batches, test_label_batches, crit)
print("CNN model test loss: ", test_loss, "accuracy:", test_acc)
1
CNN model test loss:  0.44893962796897346 accuracy: 0.8133848134615247

RNN模型

  • 下面我们尝试把模型换成一个

    recurrent neural network

(RNN)。RNN经常会被用来encode一个sequence

ℎ𝑡=RNN(𝑥𝑡,ℎ𝑡−1)ht=RNN(xt,ht−1)

  • 我们使用最后一个hidden state ℎ𝑇hT来表示整个句子。

  • 然后我们把ℎ𝑇hT通过一个线性变换𝑓f,然后用来预测句子的情感。

img

img

In [57]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
class RNNModel(nn.Module):
def __init__(self, vocab_size, embedding_size, output_size, pad_idx, hidden_size, dropout, avg_hidden=True):
super(RNNModel, self).__init__()
self.embed = nn.Embedding(vocab_size, embedding_size, padding_idx=pad_idx)
self.lstm = nn.LSTM(embedding_size, hidden_size, bidirectional=True, num_layers=2, batch_first=True)
self.linear = nn.Linear(hidden_size*2, output_size)

self.dropout = nn.Dropout(dropout)
self.avg_hidden = avg_hidden

def forward(self, text, mask):
embedded = self.embed(text) # [batch_size, seq_len, embedding_size] 其中包含一些pad
embedded = self.dropout(embedded)

# mask: batch_size * seq_length
seq_length = (1. - mask.float()).sum(1)
embedded = torch.nn.utils.rnn.pack_padded_sequence(
input=embedded,
lengths=seq_length,
batch_first=True,
enforce_sorted=False
) # batch_size * seq_len * ..., seq_len * batch_size * ...
output, (hidden, cell) = self.lstm(embedded)
output, seq_length = torch.nn.utils.rnn.pad_packed_sequence(
sequence=output,
batch_first=True,
padding_value=0,
total_length=mask.shape[1]
)

# output: [batch_size, seq_length, hidden_dim * num_directions]
# hidden: [num_layers * num_directions, batch_size, hidden_dim]


if self.avg_hidden:
hidden = torch.sum(output * (1. - mask.float()).unsqueeze(2), 1) / torch.sum((1. - mask.float()), 1).unsqueeze(1)
else:
# 拿最后一个hidden state作为句子的表示
# hidden: 2 * batch_size * hidden_size
hidden = torch.cat([hidden[-1], hidden[-2]], dim=1)
hidden = self.dropout(hidden.squeeze())
return self.linear(hidden)

In [58]:

1
2
3
4
5
6
model = RNNModel(vocab_size=VOCAB_SIZE, 
embedding_size=EMBEDDING_SIZE,
output_size=OUTPUT_SIZE,
pad_idx=PAD_IDX,
hidden_size=100,
dropout=0.5)

训练RNN模型

In [59]:

1
2
3
4
5
optimizer = torch.optim.Adam(model.parameters()) # L2
crit = nn.BCEWithLogitsLoss()

model = model.to(device)
crit = crit.to(device)

In [60]:

1
2
3
4
5
6
7
8
9
10
11
12
N_EPOCHS = 10
best_valid_acc = 0.
for epoch in range(N_EPOCHS):
train_loss, train_acc = train(model, train_batches, train_label_batches, optimizer, crit)
valid_loss, valid_acc = evaluate(model, dev_batches, dev_label_batches, crit)

if valid_acc > best_valid_acc:
best_valid_acc = valid_acc
torch.save(model.state_dict(), "lstm-model.pth")

print("Epoch", epoch, "Train Loss", train_loss, "Train Acc", train_acc)
print("Epoch", epoch, "Valid Loss", valid_loss, "Valid Acc", valid_acc)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Epoch 0 Train Loss 0.5140281977456996 Train Acc 0.7472633612363168
Epoch 0 Valid Loss 0.7321655497894631 Valid Acc 0.8133848134615247
Epoch 1 Train Loss 0.4205178504441526 Train Acc 0.8209916291049582
Epoch 1 Valid Loss 0.5658483397086155 Valid Acc 0.8391248392782616
Epoch 2 Train Loss 0.3576773465620036 Train Acc 0.8473921442369607
Epoch 2 Valid Loss 0.6089477152437777 Valid Acc 0.8545688548756996
Epoch 3 Train Loss 0.3190276817504391 Train Acc 0.8647778493238892
Epoch 3 Valid Loss 0.5731698980355968 Valid Acc 0.8622908625977073
Epoch 4 Train Loss 0.2850390273336434 Train Acc 0.8881197681905988
Epoch 4 Valid Loss 0.6073675444073966 Valid Acc 0.8622908625209961
Epoch 5 Train Loss 0.26827128295812463 Train Acc 0.8884417256922086
Epoch 5 Valid Loss 0.4971172449057934 Valid Acc 0.8700128701662925
Epoch 6 Train Loss 0.23699480644442233 Train Acc 0.9059884095299421
Epoch 6 Valid Loss 0.5370476412343549 Valid Acc 0.8635778636545748
Epoch 7 Train Loss 0.22414902945487483 Train Acc 0.9072762395363811
Epoch 7 Valid Loss 0.48257371315317876 Valid Acc 0.8725868726635838
Epoch 8 Train Loss 0.2119196125996435 Train Acc 0.9162910495814552
Epoch 8 Valid Loss 0.59562370292315 Valid Acc 0.8468468471536919
Epoch 9 Train Loss 0.20756761220698194 Train Acc 0.9207984546039922
Epoch 9 Valid Loss 0.6451035161122699 Valid Acc 0.8700128701662925

In [62]:

1
predict_sentiment(model, "沈阳市政府的酒店,比较大气,交通便利,出门往左就是北陵公园,环境好。")

Out[62]:

1
0.9994519352912903

In [63]:

1
predict_sentiment(model, "这个酒店非常脏乱差,不推荐!")

Out[63]:

1
0.01588270254433155

In [68]:

1
predict_sentiment(model, "这个酒店不乱,非常推荐!")

Out[68]:

1
0.04462616145610809

在test上做模型预测

In [69]:

1
2
3
model.load_state_dict(torch.load("lstm-model.pth"))
test_loss, test_acc = evaluate(model, test_batches, test_label_batches, crit)
print("CNN model test loss: ", test_loss, "accuracy:", test_acc)
1
CNN model test loss:  0.5639284941220376 accuracy: 0.8481338484406932

CNN模型

In [70]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
class CNN(nn.Module):
def __init__(self, vocab_size, embedding_size, output_size, pad_idx, num_filters, filter_sizes, dropout):
super(CNN, self).__init__()
self.filter_sizes = filter_sizes
self.embed = nn.Embedding(vocab_size, embedding_size, padding_idx=pad_idx)
self.convs = nn.ModuleList([
nn.Conv2d(in_channels=1, out_channels=num_filters,
kernel_size=(fs, embedding_size))
for fs in filter_sizes
]) # 3个CNN
# fs实际上就是n-gram的n
# self.conv = nn.Conv2d(in_channels=1, out_channels=num_filters, kernel_size=(filter_size, embedding_size))
self.linear = nn.Linear(num_filters * len(filter_sizes), output_size)
self.dropout = nn.Dropout(dropout)

def forward(self, text, mask):
embedded = self.embed(text) # [batch_size, seq_len, embedding_size]
embedded = embedded.unsqueeze(1) # # [batch_size, 1, seq_len, embedding_size]
# conved = F.relu(self.conv(embedded)) # [batch_size, num_filters, seq_len-filter_size+1, 1]
# conved = conved.squeeze(3) # [batch_size, num_filters, seq_len-filter_size+1]
conved = [
F.relu(conv(embedded)).squeeze(3) for conv in self.convs
] # [batch_size, num_filters, seq_len-filter_size+1]

# [2, 5, 1, 1]

# mask [[0, 0, 1, 1]]
# fs: 2
# [0, 0, 1]
conved = [
conv.masked_fill(mask[:, :-filter_size+1].unsqueeze(1) , -999999) for (conv, filter_size) in zip(conved, self.filter_sizes)
]
# max over time pooling
# pooled = F.max_pool1d(conved, conved.shape[2]) # [batch_size, num_filters, 1]
# pooled = pooled.squeeze(2)
pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
pooled = torch.cat(pooled, dim=1) # batch_size, 3*num_filters
pooled = self.dropout(pooled)

return self.linear(pooled)

# Conv1d? 1x1 Conv2d?

In [71]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
model = CNN(vocab_size=VOCAB_SIZE, 
embedding_size=EMBEDDING_SIZE,
output_size=OUTPUT_SIZE,
pad_idx=PAD_IDX,
num_filters=100,
filter_sizes=[3,4,5], # 3-gram, 4-gram, 5-gram
dropout=0.5)

optimizer = torch.optim.Adam(model.parameters())
crit = nn.BCEWithLogitsLoss()

model = model.to(device)
crit = crit.to(device)

N_EPOCHS = 10
best_valid_acc = 0.
for epoch in range(N_EPOCHS):
train_loss, train_acc = train(model, train_batches, train_label_batches, optimizer, crit)
valid_loss, valid_acc = evaluate(model, dev_batches, dev_label_batches, crit)

if valid_acc > best_valid_acc:
best_valid_acc = valid_acc
torch.save(model.state_dict(), "cnn-model.pth")

print("Epoch", epoch, "Train Loss", train_loss, "Train Acc", train_acc)
print("Epoch", epoch, "Valid Loss", valid_loss, "Valid Acc", valid_acc)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Epoch 0 Train Loss 0.5229088294452341 Train Acc 0.7443657437218287
Epoch 0 Valid Loss 0.39319338566087847 Valid Acc 0.8108108110409445
Epoch 1 Train Loss 0.3683148011498043 Train Acc 0.837894397939472
Epoch 1 Valid Loss 0.3534783678778964 Valid Acc 0.840411840565263
Epoch 2 Train Loss 0.3185185533801318 Train Acc 0.8644558918222794
Epoch 2 Valid Loss 0.34023444222207233 Valid Acc 0.8545688547222771
Epoch 3 Train Loss 0.27130810793366883 Train Acc 0.8889246619446233
Epoch 3 Valid Loss 0.30879392936116173 Valid Acc 0.8648648650182874
Epoch 4 Train Loss 0.24334710945314694 Train Acc 0.9034127495170637
Epoch 4 Valid Loss 0.3020249246553718 Valid Acc 0.8790218791753015
Epoch 5 Train Loss 0.2156534520195556 Train Acc 0.912105602060528
Epoch 5 Valid Loss 0.326562241774575 Valid Acc 0.8571428572962797
Epoch 6 Train Loss 0.189559489642123 Train Acc 0.9245009658725049
Epoch 6 Valid Loss 0.28917587651095644 Valid Acc 0.885456885610308
Epoch 7 Train Loss 0.16508568145445063 Train Acc 0.9356084996780425
Epoch 7 Valid Loss 0.2982815937876241 Valid Acc 0.8790218791753015
Epoch 8 Train Loss 0.14198238390007764 Train Acc 0.9452672247263362
Epoch 8 Valid Loss 0.2929042390184513 Valid Acc 0.8880308881843105
Epoch 9 Train Loss 0.11862559608529824 Train Acc 0.9552479072762395
Epoch 9 Valid Loss 0.29382622203618247 Valid Acc 0.886743886820598

In [72]:

1
2
3
model.load_state_dict(torch.load("cnn-model.pth"))
test_loss, test_acc = evaluate(model, test_batches, test_label_batches, crit)
print("CNN model test loss: ", test_loss, "accuracy:", test_acc)
1
CNN model test loss:  0.32514461861537386 accuracy: 0.8674388674388674

In [74]:

1
predict_sentiment(model, "酒店位于昆明中心区,地理位置不错,可惜酒店服务有些差,第一天晚上可能入住的客人不多,空调根本没开,打了电话问,说是中央空调要晚上统一开,结果晚上也没开,就热了一晚上,第二天有开会的入住,晚上就有了空调,不得不说酒店经济帐作的好.房间的床太硬,睡的不好.酒店的早餐就如其他人评价一样,想法的难吃.不过携程的预订价钱还不错.")

Out[74]:

1
0.893503725528717

learning representation

1
 
文章目录
  1. 1. 酒店评价情感分类与CNN模型
  • 准备数据
  • Word Averaging模型
  • 训练模型
  • RNN模型
  • 训练RNN模型
  • CNN模型
  • |