- Word2Vec, developed by Google, and GloVe, developed by the Stanford NLP Group
  - GloVe: co-occurrence matrix and matrix factorization
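To make "co-occurrence matrix and matrix factorization" a little more concrete, here is a minimal sketch on a toy corpus. It is only an illustration of the general idea: the corpus, the window size, and the helper names `build_cooccurrence` and `factorize` are made up for this example, and a plain truncated SVD is swapped in for the factorization step. GloVe itself fits a weighted least-squares objective on log co-occurrence counts, not an SVD.

```python
import numpy as np

# Toy corpus; purely illustrative, not data from this post.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]


def build_cooccurrence(sentences, window=2):
    """Count how often each pair of words appears within `window` tokens."""
    tokens = [s.split() for s in sentences]
    vocab = sorted({w for sent in tokens for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=np.float64)
    for sent in tokens:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[word], index[sent[j]]] += 1.0
    return vocab, counts


def factorize(counts, dim=2):
    """Reduce log-scaled counts to `dim`-dimensional vectors via truncated SVD."""
    u, s, _ = np.linalg.svd(np.log1p(counts))
    return u[:, :dim] * s[:dim]


vocab, counts = build_cooccurrence(corpus)
vectors = factorize(counts, dim=2)
print(vocab)
print(vectors.shape)
```

Each row of `vectors` is a small dense vector for one word in `vocab`; words that appear in similar contexts (here `cat` and `dog`) tend to end up with similar rows.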
### Pretrained word embeddings
The vectors cannot be downloaded from the Global Vectors for Word Representation site, so I download them from Kaggle instead.
GloVe
For example, the first 50 characters of the first line:

```console
$ head -n 1 glove.6B.50d.txt | cut -c-50
the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.04445
```
```python
import numpy as np
from keras.preprocessing.text import Tokenizer


def create_embedding_matrix(filepath, word_index, embedding_dim):
    global vocab_size
    vocab_size = len(word_index) + 1  # Adding 1 because of reserved 0 index
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    with open(filepath) as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix


tokenizer = Tokenizer(num_words=5000)

embedding_dim = 50
embedding_matrix = create_embedding_matrix(
    'glove.6B.50d.txt', tokenizer.word_index, embedding_dim)

nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
print(nonzero_elements / vocab_size)
```
```console
$ python3 glove.py
0.0
```
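Hm? Something is off. A ratio of 0.0 means that no row of the embedding matrix was filled: the `Tokenizer` is never fitted on any text (`fit_on_texts` is not called), so `tokenizer.word_index` is empty and `vocab_size` is 1.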
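The effect of an empty word index can be reproduced without Keras or the real GloVe file. The sketch below is an assumption-laden stand-in, not the original script: `coverage` is a hypothetical helper that repeats the matrix-building logic, the vector file is a tiny made-up one with 3 dimensions, and the two dictionaries are hand-written. With Keras, the index would normally be populated by calling `tokenizer.fit_on_texts(...)` on the training sentences before building the matrix.

```python
import os
import tempfile

import numpy as np


def coverage(filepath, word_index, embedding_dim):
    """Fraction of embedding-matrix rows that received a pretrained vector."""
    vocab_size = len(word_index) + 1  # index 0 is reserved
    matrix = np.zeros((vocab_size, embedding_dim))
    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                matrix[word_index[word]] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]
    nonzero_rows = np.count_nonzero(np.count_nonzero(matrix, axis=1))
    return nonzero_rows / vocab_size


# Tiny stand-in for glove.6B.50d.txt (made-up 3-dimensional vectors).
with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf8") as tmp:
    tmp.write("the 0.1 0.2 0.3\n")
    tmp.write("movie 0.4 0.5 0.6\n")
    path = tmp.name

empty_index = {}  # what a Tokenizer exposes before it is fitted
fitted_index = {"the": 1, "movie": 2, "zzzunknown": 3}  # hypothetical fitted index

empty_cov = coverage(path, empty_index, 3)
fitted_cov = coverage(path, fitted_index, 3)
os.remove(path)

print(empty_cov)   # 0.0
print(fitted_cov)  # 0.5
```

With the empty index the matrix has a single all-zero row, which gives 0.0. With the populated index, two of the four rows (the reserved row 0 and the unknown word stay zero) are filled, which gives 0.5.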
GlobalMaxPool1D layer
```python
from keras.models import Sequential
from keras import layers

# ... (omitted)

vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 50
maxlen = 100

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim,
                           weights=[embedding_matrix],
                           input_length=maxlen,
                           trainable=False))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
print(model.summary())
```
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 100, 50)           50
_________________________________________________________________
global_max_pooling1d (Global (None, 50)                0
_________________________________________________________________
dense (Dense)                (None, 10)                510
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11
=================================================================
Total params: 571
Trainable params: 521
Non-trainable params: 50
_________________________________________________________________
None
```
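The output looks different from what I expected. The Embedding layer has only 50 parameters (vocab_size × embedding_dim = 1 × 50), which means `vocab_size` is 1, i.e. `tokenizer.word_index` is empty.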
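The shape change in the summary, from `(None, 100, 50)` to `(None, 50)` with 0 parameters, is what global max pooling over the sequence axis does. A rough NumPy sketch of that operation follows. It is not Keras code: the random array is only a stand-in for the Embedding layer output, and the batch size of 2 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, maxlen, embedding_dim = 2, 100, 50

# Stand-in for the Embedding layer output: (batch, maxlen, embedding_dim).
embedded = rng.normal(size=(batch_size, maxlen, embedding_dim))

# Global max pooling over time: for each of the 50 features, keep the
# largest value found across all 100 sequence positions.
pooled = embedded.max(axis=1)

print(embedded.shape)  # (2, 100, 50)
print(pooled.shape)    # (2, 50)
```

Because the operation is just a maximum, it has nothing to learn, which matches the `0` in the `Param #` column for `global_max_pooling1d`.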