Annotated Code: Word-Level Text Generation on an English Book

First see the char-level text generation on the Churchill biography.

Here is a small example to see how an LSTM is used.

This time we work at the word level instead of the char level. The prediction task is: given the preceding words, what is the next word?

For example, given "hello from the other", predict "side".
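To make the setup concrete, here is a tiny sketch with made-up toy data (not from the notebook): slide a fixed-length window over the word stream and use the word immediately after the window as the label.

words = "hello from the other side".split()
window = 4
# pair each 4-word window with the word that follows it
pairs = [(words[i:i + window], words[i + window])
         for i in range(len(words) - window)]
print(pairs)  # [(['hello', 'from', 'the', 'other'], 'side')]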

Step one, as before: import the libraries.

Import the data and tokenize it

In [1]:

import os
import numpy as np
import nltk
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from gensim.models.word2vec import Word2Vec
Using TensorFlow backend.

In [8]:

# If you have enough compute, you can try the code below instead
# raw_text = ''
# for file in os.listdir("./input/"):  # os.listdir lists every file name under the path
#     if file.endswith(".txt"):  # keep only the files with a .txt suffix
#         raw_text += open("./input/"+file, errors='ignore').read() + '\n\n'
raw_text = open('./input/Winston_Churchil.txt').read()
# We still use the Churchill corpus to generate text
raw_text = raw_text.lower()
sentensor = nltk.data.load('tokenizers/punkt/english.pickle')
# Load the English sentence-splitting (punkt) model
sents = sentensor.tokenize(raw_text)
# .tokenize splits a piece of text into a list of sentences. For details see this blog post:
# https://blog.csdn.net/ustbbsy/article/details/80053307
print(sents[:2])
['\ufeffproject gutenberg’s real soldiers of fortune, by richard harding davis\n\nthis ebook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.', 'you may copy it, give it away or\nre-use it under the terms of the project gutenberg license included\nwith this ebook or online at www.gutenberg.org\n\n\ntitle: real soldiers of fortune\n\nauthor: richard harding davis\n\nposting date: february 22, 2009 [ebook #3029]\nlast updated: september 26, 2016\n\nlanguage: english\n\ncharacter set encoding: utf-8\n\n*** start of this project gutenberg ebook real soldiers of fortune ***\n\n\n\n\nproduced by david reed, and ronald j. wilson\n\n\n\n\n\nreal soldiers of fortune\n\n\nby richard harding davis\n\n\n\n\n\nmajor-general henry ronald douglas maciver\n\nany sunny afternoon, on fifth avenue, or at night in the _table d’hote_\nrestaurants of university place, you may meet the soldier of fortune who\nof all his brothers in arms now living is the most remarkable.']
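As a quick check of what the punkt sentence splitter does (the input string below is made up), it handles common abbreviations without cutting a sentence short:

test = "mr. davis went to new york in 1897. he met the general there."
print(sentensor.tokenize(test))
# should yield two sentences, with "mr." kept inside the first one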

In [9]:

corpus = []
for sen in sents:  # tokenize each sentence into words
    corpus.append(nltk.word_tokenize(sen))

print(len(corpus))
print(corpus[:2])
1792
[['\ufeffproject', 'gutenberg', '’', 's', 'real', 'soldiers', 'of', 'fortune', ',', 'by', 'richard', 'harding', 'davis', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.'], ['you', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www.gutenberg.org', 'title', ':', 'real', 'soldiers', 'of', 'fortune', 'author', ':', 'richard', 'harding', 'davis', 'posting', 'date', ':', 'february', '22', ',', '2009', '[', 'ebook', '#', '3029', ']', 'last', 'updated', ':', 'september', '26', ',', '2016', 'language', ':', 'english', 'character', 'set', 'encoding', ':', 'utf-8', '***', 'start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'real', 'soldiers', 'of', 'fortune', '***', 'produced', 'by', 'david', 'reed', ',', 'and', 'ronald', 'j.', 'wilson', 'real', 'soldiers', 'of', 'fortune', 'by', 'richard', 'harding', 'davis', 'major-general', 'henry', 'ronald', 'douglas', 'maciver', 'any', 'sunny', 'afternoon', ',', 'on', 'fifth', 'avenue', ',', 'or', 'at', 'night', 'in', 'the', '_table', 'd', '’', 'hote_', 'restaurants', 'of', 'university', 'place', ',', 'you', 'may', 'meet', 'the', 'soldier', 'of', 'fortune', 'who', 'of', 'all', 'his', 'brothers', 'in', 'arms', 'now', 'living', 'is', 'the', 'most', 'remarkable', '.']]
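nltk.word_tokenize splits a sentence into word tokens and keeps punctuation as separate tokens, which is why commas and periods show up as their own entries above. A quick toy check:

print(nltk.word_tokenize("you may copy it, give it away or re-use it."))
# ['you', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', '.']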

Generate word vectors with word2vec

In [45]:

w2v_model = Word2Vec(corpus, size=128, window=5, min_count=2, workers=4)
# For the Word2Vec() parameters see this blog post: https://www.cnblogs.com/pinard/p/7278324.html
# size: dimensionality of the word vectors
# window: maximum distance between the current word and a context word; the larger it is,
#         the farther away a word can be and still count as context. Default is 5.
# min_count: minimum frequency a word needs in order to get a vector; this drops very rare
#            words. Default is 5; for a small corpus you can lower it.
# workers: number of worker threads used for training.

print(w2v_model['office'][:20])
[-0.03379476 -0.22743131 -0.17660786 -0.00957653 -0.10752155 -0.14298159
0.02914934 -0.08970737 -0.15872304 -0.05246524 -0.00084796 -0.05634443
-0.1461402 0.03880814 -0.12331649 -0.06511988 -0.08555544 -0.2300725
-0.0083805 0.02204316]
/Users/yyg/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:8: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
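The DeprecationWarning above appears because indexing the Word2Vec model directly (w2v_model['office']) is deprecated in newer gensim versions; the equivalent calls go through the model's wv attribute, roughly like this:

vec = w2v_model.wv['office']                         # same 128-dim vector, no warning
print(w2v_model.wv.most_similar('office', topn=3))   # nearest neighbours in the embedding space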

Build the training set

In [46]:

raw_input = [item for sublist in corpus for item in sublist]
print(len(raw_input))  # total number of words in the raw corpus
text_stream = []
vocab = w2v_model.wv.vocab  # the vocabulary that word2vec actually built vectors for
for word in raw_input:
    if word in vocab:
        text_stream.append(word)
print(len(text_stream))
# Total word count after dropping rare words (min_count removed the low-frequency ones)
55562
51876

In [47]:

# Same procedure as in the char-level text generation
seq_length = 10
x = []
y = []
for i in range(0, len(text_stream) - seq_length):
    given = text_stream[i:i + seq_length]
    predict = text_stream[i + seq_length]
    x.append([w2v_model[word] for word in given])
    y.append(w2v_model[predict])

x = np.reshape(x, (-1, seq_length, 128))
y = np.reshape(y, (-1, 128))
print(x.shape)
print(y.shape)
/Users/yyg/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:8: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).

/Users/yyg/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:9: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
if __name__ == '__main__':
(51866, 10, 128)
(51866, 128)

Build and train the model

In [53]:

model = Sequential()
model.add(LSTM(256, input_shape=(seq_length, 128), dropout=0.2, recurrent_dropout=0.2))
# dropout is applied between the input x and the hidden state
# recurrent_dropout, as I understand it, is applied between hidden states across time steps
model.add(Dropout(0.2))  # this third one, as I understand it, is the vertical dropout between layers
model.add(Dense(128, activation='sigmoid'))
model.compile(loss='mse', optimizer='adam')
# Mean squared error loss, Adam optimizer
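As an optional sanity check (not in the original notebook), model.summary() prints the layer output shapes, which should show the LSTM producing a 256-dim vector and the final Dense layer producing the 128-dim word embedding:

model.summary()
# roughly: lstm (None, 256) -> dropout (None, 256) -> dense (None, 128)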

In [54]:

model.fit(x, y, nb_epoch=10, batch_size=4096)  # nb_epoch is deprecated; newer Keras uses epochs=10
/Users/yyg/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: UserWarning: The `nb_epoch` argument in `fit` has been renamed `epochs`.
"""Entry point for launching an IPython kernel.
Epoch 1/10
51866/51866 [==============================] - 28s 539us/step - loss: 0.3177
Epoch 2/10
51866/51866 [==============================] - 28s 542us/step - loss: 0.1405
Epoch 3/10
51866/51866 [==============================] - 29s 560us/step - loss: 0.1329
Epoch 4/10
51866/51866 [==============================] - 30s 584us/step - loss: 0.1318
Epoch 5/10
51866/51866 [==============================] - 28s 548us/step - loss: 0.1313
Epoch 6/10
51866/51866 [==============================] - 30s 574us/step - loss: 0.1309
Epoch 7/10
51866/51866 [==============================] - 30s 570us/step - loss: 0.1306
Epoch 8/10
51866/51866 [==============================] - 29s 551us/step - loss: 0.1303
Epoch 9/10
51866/51866 [==============================] - 27s 524us/step - loss: 0.1299
Epoch 10/10
51866/51866 [==============================] - 27s 512us/step - loss: 0.1296

Out[54]:

<keras.callbacks.History at 0x1a32c9a2b0>
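ModelCheckpoint is imported at the top but never used; if you want to keep the best weights seen during training, a minimal sketch looks like this (the filename is just an example):

checkpoint = ModelCheckpoint('word_lstm_best.h5', monitor='loss',
                             save_best_only=True, verbose=1)  # hypothetical output file
model.fit(x, y, epochs=10, batch_size=4096, callbacks=[checkpoint])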

Predict with the model

In [55]:

# The comments here are the same as in the char-level Churchill biography text generation
def predict_next(input_array):
    # Predict the embedding of the next word from the last seq_length word vectors
    x = np.reshape(input_array, (-1, seq_length, 128))
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    # Turn a raw string into the word vectors of its last seq_length words
    raw_input = raw_input.lower()
    input_stream = nltk.word_tokenize(raw_input)
    res = []
    for word in input_stream[(len(input_stream) - seq_length):]:
        res.append(w2v_model[word])
    return res

def y_to_word(y):
    # most_similar returns a list of (word, similarity) tuples; topn=1 keeps only the nearest word
    word = w2v_model.most_similar(positive=y, topn=1)
    return word

In [56]:

def generate_article(init, rounds=30):
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_word(predict_next(string_to_index(in_string)))
        in_string += ' ' + n[0][0]
    return in_string

In [58]:

init = 'His object in coming to New York was to engage officers for that service. He came at an  moment'
article = generate_article(init)
print(article)  # The corpus is small, so you can see it repeating itself
/Users/yyg/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:12: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
if sys.path[0] == '':
/Users/yyg/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:16: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
app.launch_new_instance()
his object in coming to new york was to engage officers for that service. he came at an  moment battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery battery
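With such a small corpus the greedy nearest-word lookup quickly locks onto a single word. One simple tweak, not part of the original notebook, is to sample among the top-k nearest words instead of always taking the single closest one:

import random

def y_to_word_sampled(y, topn=5):
    # Sample among the topn nearest words rather than greedily taking the closest one
    candidates = w2v_model.most_similar(positive=y, topn=topn)
    word, _ = random.choice(candidates)
    return word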
