Heima Programmer (黑马程序员) Technical Exchange Community

Title: [Shanghai Campus] An Introduction to the Keras Deep Learning Framework, with a Hands-On Example

Author: 梦缠绕的时候    Time: 2019-2-26 09:46
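Keras is a high-level neural network API written in Python that can run on top of TensorFlow, CNTK, or Theano as its backend. Keras was developed with a focus on fast experimentation. Being able to go from an idea to an experimental result with the least possible delay is key to doing good research.
This article uses a Kaggle project, IMDB movie review sentiment analysis, to show how to build a neural network with Keras and apply it to a real problem. It assumes a basic understanding of neural networks.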
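The article has two parts: the Keras building blocks used here (layers and data preprocessing), followed by the hands-on IMDB sentiment analysis.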
Model
Dense: fully connected layer
keras.layers.core.Dense(units, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
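To make the signature above more concrete: a Dense layer computes output = activation(dot(input, kernel) + bias). The following is a minimal pure-Python sketch of that computation, not Keras code; the helper names `dense` and `relu` and the random kernel values are illustrative assumptions.

```python
import random


def dense(inputs, kernel, bias, activation=None):
    """Compute activation(inputs . kernel + bias) for a batch of input rows."""
    outputs = []
    for row in inputs:
        out_row = []
        for j in range(len(bias)):
            total = sum(row[i] * kernel[i][j] for i in range(len(row))) + bias[j]
            out_row.append(activation(total) if activation else total)
        outputs.append(out_row)
    return outputs


def relu(v):
    return max(0.0, v)


random.seed(0)
input_dim, units = 16, 32
# Hypothetical weights; Keras would initialise these with glorot_uniform and train them.
kernel = [[random.uniform(-0.1, 0.1) for _ in range(units)] for _ in range(input_dim)]
bias = [0.0] * units  # corresponds to bias_initializer='zeros'

batch = [[1.0] * input_dim for _ in range(4)]  # 4 samples, each of shape (16,)
out = dense(batch, kernel, bias, activation=relu)
print(len(out), len(out[0]))  # 4 32
```

The batch size (4 here) passes through unchanged, which is what the `*` in the shapes `(*, 16)` and `(*, 32)` stands for.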
from keras.models import Sequential
from keras.layers import Dense

# as first layer in a sequential model:
model = Sequential()
model.add(Dense(32, input_shape=(16,)))
# now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32)

# after the first layer, you don't need to specify
# the size of the input anymore:
model.add(Dense(32))

Embedding layer
keras.layers.embeddings.Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)
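For more background, see this link: https://machinelearningmastery.c ... eep-learning-keras/
In essence this is word-to-vector: the layer turns a text into a sequence of word vectors.
For example, in the code below the input is an M*50 matrix of word indices drawn from a vocabulary of 200 distinct words, and we want every word converted to a 32-dimensional vector. The layer returns a tensor of shape (M, 50, 32).
Each sentence has 50 words, each word is a 32-dimensional vector, and there are M sentences, so the output of e has shape (M, 50, 32).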
e = Embedding(200, 32, input_length=50)
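Conceptually, the embedding layer is a lookup table with one row per word index. The sketch below imitates that lookup in plain Python to show where the (M, 50, 32) shape comes from; it is not the Keras implementation, and `table`, `embed`, and the random values are assumptions made for illustration.

```python
import random

random.seed(0)
vocab_size, embed_dim, seq_len = 200, 32, 50

# One row of 32 numbers per word index; in Keras these values are learned during training.
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]


def embed(batch):
    """Replace every word index with its row from the lookup table."""
    return [[table[idx] for idx in sentence] for sentence in batch]


M = 3  # number of sentences
batch = [[random.randrange(vocab_size) for _ in range(seq_len)] for _ in range(M)]
e_out = embed(batch)
print(len(e_out), len(e_out[0]), len(e_out[0][0]))  # 3 50 32
```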
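LSTM layer
LSTM is a special kind of recurrent neural network. http://deeplearning.net/tutorial/lstm.html
Put simply, the networks discussed so far, CNNs included, are feed-forward and do not model sequence order, yet the meaning of a word depends on its context. Take "我用着小米手机,吃着小米粥" ("I am using a Xiaomi phone while eating millet porridge"): the two occurrences of 小米 clearly do not mean the same thing. Semantic analysis has to take context into account, and that is what a recurrent neural network (RNN) is for. Or take "The film is very well made, but I don't like it." The sentence contains both a positive and a negative judgement; an LSTM, which looks at context, can learn that the part after "but" is what the speaker really wants to stress.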
keras.layers.recurrent.LSTM(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)
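To illustrate how an LSTM carries context from one word to the next, here is a minimal pure-Python sketch of the standard LSTM cell equations. It is a simplification, not the Keras implementation: it uses a plain sigmoid for the gates where the signature above defaults to 'hard_sigmoid', and all names and weights are made up for the example.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h, c, W, U, b):
    """One time step: combine the current input x with the previous state (h, c)."""
    units = len(h)

    def pre(gate, j):
        # Weighted sum of the current input and the previous hidden state, plus bias.
        return (sum(x[k] * W[gate][k][j] for k in range(len(x)))
                + sum(h[k] * U[gate][k][j] for k in range(units))
                + b[gate][j])

    new_h, new_c = [], []
    for j in range(units):
        i = sigmoid(pre("i", j))     # input gate: how much new information to store
        f = sigmoid(pre("f", j))     # forget gate: how much old memory to keep
        o = sigmoid(pre("o", j))     # output gate: how much memory to expose
        g = math.tanh(pre("g", j))   # candidate values for the memory cell
        cell = f * c[j] + i * g
        new_c.append(cell)
        new_h.append(o * math.tanh(cell))
    return new_h, new_c


def run_lstm(sequence, units, W, U, b):
    """Feed the sequence step by step; return the hidden state after every step."""
    h, c = [0.0] * units, [0.0] * units
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, W, U, b)
        outputs.append(h)
    return outputs


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
input_dim, units = 4, 3
W = {g: rand_matrix(input_dim, units) for g in "ifog"}  # input weights per gate
U = {g: rand_matrix(units, units) for g in "ifog"}      # recurrent weights per gate
b = {g: [0.0] * units for g in "ifog"}
b["f"] = [1.0] * units  # mirrors unit_forget_bias=True

sequence = [[random.uniform(-1, 1) for _ in range(input_dim)] for _ in range(5)]
outputs = run_lstm(sequence, units, W, U, b)
print(len(outputs), len(outputs[0]))  # 5 3
```

Because each step reads the state left by the previous one, feeding the same vectors in a different order generally produces a different final state, which is the sense in which the layer "considers context".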
Pooling layer
Data preprocessing
Text preprocessing
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

t1 = "i love that girl"
t2 = 'i hate u'
texts = [t1, t2]
tokenizer = Tokenizer(num_words=None)
tokenizer.fit_on_texts(texts)  # build the vocabulary; every word gets an index
print(tokenizer.word_counts)  # OrderedDict([('i', 2), ('love', 1), ('that', 1), ('girl', 1), ('hate', 1), ('u', 1)])
print(tokenizer.word_index)  # {'i': 1, 'love': 2, 'that': 3, 'girl': 4, 'hate': 5, 'u': 6}
print(tokenizer.word_docs)  # {'i': 2, 'love': 1, 'that': 1, 'girl': 1, 'u': 1, 'hate': 1}
print(tokenizer.index_docs)  # {1: 2, 2: 1, 3: 1, 4: 1, 6: 1, 5: 1}
tokenized_texts = tokenizer.texts_to_sequences(texts)
print(tokenized_texts)  # [[1, 2, 3, 4], [1, 5, 6]]  each word is replaced by its index
X_t = pad_sequences(tokenized_texts, maxlen=None)  # convert to a 2D array (a matrix): every text gets maxlen entries, missing positions are filled with 0
print(X_t)
# [[1 2 3 4]
#  [0 1 5 6]]

Sequence preprocessing
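With its default settings, pad_sequences pads at the front and, when a sequence is longer than maxlen, also truncates at the front. The pure-Python sketch below imitates that default behaviour on the two example sequences; `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


sequences = [[1, 2, 3, 4], [1, 5, 6]]
print(pad_pre(sequences))            # [[1, 2, 3, 4], [0, 1, 5, 6]]
print(pad_pre(sequences, maxlen=3))  # [[2, 3, 4], [1, 5, 6]]
```

Padding at the front keeps the real words at the end of each row, closest to the last time step that a recurrent layer processes.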
Keras in practice: IMDB movie review sentiment analysis
Dataset overview
Main steps
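Data loading
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df_train = pd.read_csv("./dataset/word2vec-nlp-tutorial/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
# df_test is used further down but never loaded in the post; it is assumed here
# to be the competition's testData.tsv from the same directory.
df_test = pd.read_csv("./dataset/word2vec-nlp-tutorial/testData.tsv", header=0, delimiter="\t", quoting=3)
df_train1 = pd.read_csv("./dataset/imdb-review-dataset/imdb_master.csv", encoding="latin-1")
df_train1 = df_train1.drop(["type", 'file'], axis=1)
df_train1.rename(columns={'label': 'sentiment',
                          'Unnamed: 0': 'id',
                          'review': 'review'},
                 inplace=True)
df_train1 = df_train1[df_train1.sentiment != 'unsup']  # drop the unlabeled reviews
df_train1['sentiment'] = df_train1['sentiment'].map({'pos': 1, 'neg': 0})
new_train = pd.concat([df_train, df_train1])  # merge both labeled sets

Data cleaning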
Use bs4 to strip the HTML markup from the reviews.
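To see what the cleaning steps in this section do to a single review, here is a standard-library-only sketch. It is a rough stand-in, not the article's code: a regular expression replaces BeautifulSoup's get_text(), and the tiny `STOPS` set is a made-up substitute for the much longer NLTK English stop-word list.

```python
import re

# Illustrative stop words only; the NLTK list used in the article is far longer.
STOPS = {"i", "the", "a", "was", "it", "this", "and", "but"}


def clean_review(raw):
    no_tags = re.sub(r"<[^>]+>", " ", raw)              # crude HTML tag removal
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)   # drop digits and punctuation
    words = letters_only.lower().split()                # lowercase and tokenize
    return " ".join(w for w in words if w not in STOPS)  # remove stop words


sample = "I LOVED this movie!<br /><br />The plot was great, 10/10."
print(clean_review(sample))  # loved movie plot great
```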
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def review_to_words(raw_review):
    review_text = BeautifulSoup(raw_review, 'lxml').get_text()  # strip HTML tags
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)  # keep letters only
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))  # a set makes membership tests fast
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

new_train['review'] = new_train['review'].apply(review_to_words)
df_test["review"] = df_test["review"].apply(review_to_words)

Building the network with Keras
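Converting text to a matrix
- Tokenizer is fitted on list(sentence) to build the vocabulary; each word is then replaced by its index in that vocabulary, which gives a matrix of integers.
- pad_sequences pads with 0 so that every row of the matrix has the same length, that is, every sentence has the same number of words.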
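The first step can be sketched in plain Python as follows. This is a simplified imitation of the Keras Tokenizer (it only lowercases and splits on whitespace, with no punctuation filtering), and the function names are hypothetical. It also shows the effect of num_words: in this sketch, as in Keras to the best of my understanding, only indices below num_words are kept when converting texts to sequences.

```python
from collections import OrderedDict


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = OrderedDict()
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index, dropping indices >= num_words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word never seen during fitting
            if num_words and idx >= num_words:
                continue  # word is too rare to be kept
            seq.append(idx)
        sequences.append(seq)
    return sequences


texts = ["i love that girl", "i hate u"]
word_index = fit_word_index(texts)
print(word_index)  # {'i': 1, 'love': 2, 'that': 3, 'girl': 4, 'hate': 5, 'u': 6}
print(to_sequences(texts, word_index))               # [[1, 2, 3, 4], [1, 5, 6]]
print(to_sequences(texts, word_index, num_words=4))  # [[1, 2, 3], [1]]
```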
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

list_classes = ["sentiment"]
y = new_train[list_classes].values
print(y.shape)
list_sentences_train = new_train["review"]
list_sentences_test = df_test["review"]

max_features = 6000  # keep only the most frequent words
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
print(len(tokenizer.word_index))

# longest review and average review length, in words
totalNumWords = [len(one_comment) for one_comment in list_tokenized_train]
print(max(totalNumWords), sum(totalNumWords) / len(totalNumWords))

maxlen = 400  # pad or truncate every review to 400 words
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
Building the model
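One step in this section's model may be less familiar than the rest: GlobalMaxPool1D collapses the time axis of the LSTM output by keeping, for each feature, its largest value over all time steps, so a (samples, timesteps, features) tensor becomes (samples, features). A minimal pure-Python sketch of that operation follows; the function name and the numbers are made up for illustration.

```python
def global_max_pool_1d(batch):
    """[samples][timesteps][features] -> [samples][features], max over time."""
    return [
        [max(step[f] for step in sample) for f in range(len(sample[0]))]
        for sample in batch
    ]


# One sample, three time steps, two features per step.
lstm_out = [[[0.1, -0.3], [0.7, 0.2], [-0.5, 0.9]]]
pooled = global_max_pool_1d(lstm_out)
print(pooled)  # [[0.7, 0.9]]
```

This is why the LSTM in the model is created with return_sequences=True: the pooling layer needs the output of every time step, not just the last one.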
from keras.layers import Input, Embedding, LSTM, GlobalMaxPool1D, Dropout, Dense
from keras.models import Model

inp = Input(shape=(maxlen, ))
print(inp.shape)  # (?, 400): every sentence has 400 words
embed_size = 128  # every word becomes a 128-dimensional vector
x = Embedding(max_features, embed_size)(inp)
print(x.shape)  # (?, 400, 128)
x = LSTM(60, return_sequences=True, name='lstm_layer')(x)
print(x.shape)
x = GlobalMaxPool1D()(x)
print(x.shape)
x = Dropout(0.1)(x)
print(x.shape)
x = Dense(50, activation="relu")(x)
print(x.shape)
x = Dropout(0.1)(x)
print(x.shape)
x = Dense(1, activation="sigmoid")(x)
print(x.shape)

model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

batch_size = 32
epochs = 2
print(X_t.shape, y.shape)
model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.2)

prediction = model.predict(X_te)
y_pred = (prediction > 0.5)  # probabilities above 0.5 count as positive

Original article: https://www.cnblogs.com/sdu20112013/p/10428471.html

Author: 不二晨    Time: 2019-2-26 15:36
Nice, thanks for sharing.



