KerasのSequentialモデルで、GloVeのPretrained Word Embeddingsを使ってみる

– Word2Vec developed by Google and GloVe, Stanford NLP Group
L co-occurrence matrix and matrix factorization

### Pretrained word embeddings
Global Vectors for Word RepresentationのサイトからはDLできないので、kaggleからDLします。
GloVe

Kaggle GloVe6B

e.g. 50 characters in first lines
$ head -n 1 glove.6B.50d.txt | cut -c-50
the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.04445

import numpy as np
from keras.preprocessing.text import Tokenizer

def create_embedding_matrix(filepath, word_index, embedding_dim):
	global vocab_size
	vocab_size = len(word_index) + 1 # Adding 1 because of reserved 0 index
	embedding_matrix = np.zeros((vocab_size, embedding_dim))

	with open(filepath) as f:
		for line in f:
			word, *vector = line.split()
			if word in word_index:
				idx = word_index[word]
				embedding_matrix[idx] = np.array(
					vector, dtype=np.float32)[:embedding_dim]

	return embedding_matrix

tokenizer = Tokenizer(num_words=5000)
embedding_dim = 50
embedding_matrix = create_embedding_matrix(
	'glove.6B.50d.txt',
	tokenizer.word_index, embedding_dim)

nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
print(nonzero_elements / vocab_size)

$ python3 glove.py
0.0
ん? 何かおかしい。。。

GlobalMaxPool1D layer

from keras.models import Sequential
from keras import layers

// 省略
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 50
maxlen = 100

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim,
							weights=[embedding_matrix],
							input_length=maxlen,
							trainable=False))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
			loss='binary_crossentropy',
			metrics=['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 100, 50) 50
_________________________________________________________________
global_max_pooling1d (Global (None, 50) 0
_________________________________________________________________
dense (Dense) (None, 10) 510
_________________________________________________________________
dense_1 (Dense) (None, 1) 11
=================================================================
Total params: 571
Trainable params: 521
Non-trainable params: 50
_________________________________________________________________
None

なんかoutputが違うな