Word embeddings with Keras

### What is an embedding?
In NLP, an embedding assigns a vector in some space to the building blocks of natural language: sentences, words, characters, and so on.

There are various ways to vectorize text:
– represent each word as a vector
– represent each character as a vector
– represent word/character n-grams as vectors

### One-Hot Encoding
Take a vector as long as the vocabulary, with one entry for each word in the corpus.

cities = ['London', 'Berlin', 'Berlin', 'New York', 'London']
print(cities)

$ python3 one-hot.py
['London', 'Berlin', 'Berlin', 'New York', 'London']

Label encoding

from sklearn.preprocessing import LabelEncoder

cities = ['London', 'Berlin', 'Berlin', 'New York', 'London']

encoder = LabelEncoder()
city_labels = encoder.fit_transform(cities)
print(city_labels)

$ python3 one-hot.py
[1 0 0 2 1]

Using OneHotEncoder

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

cities = ['London', 'Berlin', 'Berlin', 'New York', 'London']

encoder = LabelEncoder()
city_labels = encoder.fit_transform(cities)
encoder = OneHotEncoder(sparse=False)
city_labels = city_labels.reshape((5, 1))
array = encoder.fit_transform(city_labels)
print(array)

$ python3 one-hot.py
[[0. 1. 0.]
[1. 0. 0.]
[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]]
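
By the way, OneHotEncoder can also encode the string categories directly, so the LabelEncoder step is optional in recent scikit-learn versions. A tiny sketch with the same cities list:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

cities = np.array(['London', 'Berlin', 'Berlin', 'New York', 'London']).reshape(-1, 1)

# OneHotEncoder has accepted string categories directly since scikit-learn 0.20
encoder = OneHotEncoder()
print(encoder.fit_transform(cities).toarray())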

### Word Embeddings
Word embeddings pack more information into fewer dimensions: they map semantic meaning into a geometric space called the embedding space.
A famous example: "King – Man + Woman = Queen"
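
In Keras this is what the Embedding layer does: it turns integer word indices (which the Tokenizer below produces) into dense vectors that are learned during training. A minimal sketch with toy sizes, a vocabulary of 1000 and 8-dimensional vectors:

import numpy as np
from keras.models import Sequential
from keras import layers

model = Sequential()
# map each integer word index (0..999) to a trainable 8-dimensional vector
model.add(layers.Embedding(input_dim=1000, output_dim=8, input_length=5))
model.compile('rmsprop', 'mse')

seqs = np.array([[4, 25, 7, 0, 0]])   # one padded sequence of word indices
print(model.predict(seqs).shape)      # (1, 5, 8): each of the 5 words becomes an 8-dim vector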

from keras.preprocessing.text import Tokenizer
# (snip)
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)

X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)

vocab_size = len(tokenizer.word_index) + 1 # adding 1 because of reserved 0 index

print(sentences_train[2])
print(X_train[2])

$ python3 split.py
Of all the dishes, the salmon was the best, but all were great.
[11, 43, 1, 171, 1, 283, 3, 1, 47, 26, 43, 24, 22]

for word in ['the', 'all', 'happy', 'sad']:
    print('{}: {}'.format(word, tokenizer.word_index[word]))

the: 1
all: 43
happy: 320
sad: 450

Note the difference: scikit-learn's CountVectorizer turns each document into a vector of word counts, while Keras's Tokenizer maps each word to an integer value.
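
A toy comparison of the two, with made-up sentences:

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer

docs = ['nice food', 'nice service nice staff']

cv = CountVectorizer()
print(cv.fit_transform(docs).toarray())   # one count vector per document, e.g. [[1 1 0 0], [0 2 1 1]]

tok = Tokenizer()
tok.fit_on_texts(docs)
print(tok.texts_to_sequences(docs))       # one integer per word, e.g. [[1, 2], [1, 3, 1, 4]]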

pad_sequences

from keras.preprocessing.sequence import pad_sequences

maxlen = 100

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

print(X_train[0, :])

$ python3 split.py
raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

Wait, what...? (My guess: at this point X_train was still the sparse matrix from CountVectorizer, not the list of sequences from texts_to_sequences, so pad_sequences couldn't handle it.)

Text classification with TensorFlow and Keras

– Neural network model

> We have to multiply each input node by a weight w and add a bias b.
> It is generally common to use a rectified linear unit (ReLU) for hidden layers, a sigmoid function for the output layer in a binary classification problem, or a softmax function for the output layer of multi-class classification problems.
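
A rough NumPy sketch of what that quote describes, with made-up numbers (three input features, ten hidden units):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.2, 0.5, 0.1])       # input features
W = np.random.randn(10, 3) * 0.1    # weights of a hidden layer with 10 units
b = np.zeros(10)                    # biases

hidden = relu(W @ x + b)            # weights * input + bias, then ReLU
w_out = np.random.randn(10) * 0.1
y = sigmoid(w_out @ hidden)         # sigmoid squashes the output to (0, 1) for binary classification
print(y)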

### Keras
– Keras is a deep learning and neural networks API by Francois Chollet
$ pip3 install keras

Keras needs TensorFlow running as its backend, so install TensorFlow on Amazon Linux 2.
$ pip3 install tensorflow
$ python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
tf.Tensor(-784.01, shape=(), dtype=float32)
Looks like the install went fine.

from keras.models import Sequential
from keras import layers

# (snip)
input_dim = X_train.shape[1]

model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
				optimizer='adam',
				metrics=['accuracy'])
print(model.summary())

$ python3 split.py
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 10)                17150
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11
=================================================================
Total params: 17,161
Trainable params: 17,161
Non-trainable params: 0
_________________________________________________________________
None
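
The parameter counts work out as (inputs + 1 bias) × units; from the summary, input_dim must be 1714 here:

print((1714 + 1) * 10)   # 17150 -> parameters of the first Dense layer
print((10 + 1) * 1)      # 11    -> parameters of the output layer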

### batch size

The batch size sets how many samples the network processes before its weights are updated.

history = model.fit(X_train, y_train,
					epochs=100,
					verbose=False,
					validation_data=(X_test, y_test),
					batch_size=10)

### evaluate accuracy

loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))

$ python3 split.py
Training Accuracy: 1.0000
Testing Accuracy: 0.8040

### matplotlib
$ pip3 install matplotlib

import matplotlib.pyplot as plt

# (snip)
def plot_history(history):
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
    plt.savefig("img.png")

plot_history(history)

Whoa, this is kind of amazing.

Text classification in Python with an English dataset, part 1

#### choosing a Data Set
Sentiment Labelled Sentences Data Set
https://archive.ics.uci.edu/ml/machine-learning-databases/00331/

※ Yelp is a business-review site (like Tabelog)
※ IMDb is a review site for movies and TV shows

Download the positive/negative labelled English sentences from there.
$ ls
amazon_cells_labelled.txt imdb_labelled.txt readme.txt yelp_labelled.txt

import pandas as pd 

filepath_dict = {
	'yelp': 'data/yelp_labelled.txt',
	'amazon': 'data/amazon_cells_labelled.txt',
	'imdb': 'data/imdb_labelled.txt'
}

df_list = []
for source, filepath in filepath_dict.items():
	df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
	df['source'] = source
	df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])

$ python3 app.py
sentence Wow… Loved this place.
label 1
source yelp
Name: 0, dtype: object

From this data we predict the sentiment of each sentence: vectorize the text per vocabulary word, learn the weights, and classify.
>>> sentences = ['John likes ice cream', 'John hates chocolate.']
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=0, lowercase=False)
>>> vectorizer.fit(sentences)
CountVectorizer(lowercase=False, min_df=0)
>>> vectorizer.vocabulary_
{'John': 0, 'likes': 5, 'ice': 4, 'cream': 2, 'hates': 3, 'chocolate': 1}
>>> vectorizer.transform(sentences).toarray()
array([[1, 0, 1, 0, 1, 1],
       [1, 1, 0, 1, 0, 0]])

### Defining Baseline Model
First, split the data into a training and testing set

from sklearn.model_selection import train_test_split
import pandas as pd 

filepath_dict = {
	'yelp': 'data/yelp_labelled.txt',
	'amazon': 'data/amazon_cells_labelled.txt',
	'imdb': 'data/imdb_labelled.txt'
}

df_list = []
for source, filepath in filepath_dict.items():
	df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
	df['source'] = source
	df_list.append(df)

df = pd.concat(df_list)

df_yelp = df[df['source'] == 'yelp']
sentences = df_yelp['sentence'].values
y = df_yelp['label'].values

sentences_train, sentences_test, y_train, y_test = train_test_split(
	sentences, y, test_size=0.25, random_state=1000)

.values returns a NumPy array.

from sklearn.feature_extraction.text import CountVectorizer

# (snip)

sentences_train, sentences_test, y_train, y_test = train_test_split(
	sentences, y, test_size=0.25, random_state=1000)

vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)

X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
print(X_train)

$ python3 split.py
(0, 125) 1
(0, 145) 1
(0, 201) 1
(0, 597) 1
(0, 600) 1
(0, 710) 1
(0, 801) 2
(0, 888) 1
(0, 973) 1
(0, 1042) 1
(0, 1308) 1
(0, 1345) 1
(0, 1360) 1
(0, 1494) 2
(0, 1524) 2
(0, 1587) 1
(0, 1622) 1
(0, 1634) 1
(1, 63) 1
(1, 136) 1
(1, 597) 1
(1, 616) 1
(1, 638) 1
(1, 725) 1
(1, 1001) 1
: :
(746, 1634) 1
(747, 42) 1
(747, 654) 1
(747, 1193) 2
(747, 1237) 1
(747, 1494) 1
(747, 1520) 1
(748, 600) 1
(748, 654) 1
(748, 954) 1
(748, 1001) 1
(748, 1494) 1
(749, 14) 1
(749, 15) 1
(749, 57) 1
(749, 108) 1
(749, 347) 1
(749, 553) 1
(749, 675) 1
(749, 758) 1
(749, 801) 1
(749, 1010) 1
(749, 1105) 1
(749, 1492) 1
(749, 1634) 2

#### LogisticRegression

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)

print("Accuracy:", score)

$ python3 split.py
Accuracy: 0.796

for source in df['source'].unique():
	df_source = df[df['source'] == source]
	sentences = df_source['sentence'].values
	y = df_source['label'].values

	sentences_train, sentences_test, y_train, y_test = train_test_split(
		sentences, y, test_size=0.25, random_state=1000)

	vectorizer = CountVectorizer()
	vectorizer.fit(sentences_train)
	X_train = vectorizer.transform(sentences_train)
	X_test = vectorizer.transform(sentences_test)

	classifier = LogisticRegression()
	classifier.fit(X_train, y_train)
	score = classifier.score(X_test, y_test)
	print('Accuracy for {} data: {:.4f}'.format(source, score))

$ python3 split.py
Accuracy for yelp data: 0.7960
Accuracy for amazon data: 0.7960
Accuracy for imdb data: 0.7487

Text classification (genre classification) of English text with fastText

### Getting and preparing data
$ wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
$ ls
cooking.stackexchange.id  cooking.stackexchange.txt  readme.txt
$ head cooking.stackexchange.txt
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?

The "__label__" prefix is how fastText tells labels apart from words.

$ wc cooking.stackexchange.txt
15404 169582 1401900 cooking.stackexchange.txt
-> split into training and validation sets
$ head -n 12404 cooking.stackexchange.txt > cooking.train
$ tail -n 3000 cooking.stackexchange.txt > cooking.valid

### train_supervised
training.py

import fasttext
model = fasttext.train_supervised(input="cooking.train")

$ python3 training.py
Read 0M words
Number of words: 14543
Number of labels: 735
Floating point exception

Whyyyyyyyyyy...

Category classification with fastText

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make
$ pip3 install cython
$ pip3 install fasttext

$ pip3 install requests requests_oauthlib

Get a consumer key, consumer secret, access token, and access token secret from the Twitter developer pages.

tweet_get.py

import re
import json
import MeCab
from requests_oauthlib import OAuth1Session

CK = ""
CS = ""
AT = ""
AS = ""

API_URL = "https://api.twitter.com/1.1/search/tweets.json?tweet_mode=extended"
KEYWORD = "芸能 OR アニメ OR 漫画 OR TV OR ゲーム"
CLASS_LABEL = "__label__1"

def main():
	tweets = get_tweet()
	surfaces = get_surfaces(tweets)
	write_txt(surfaces)

def get_tweet():
    params = {'q' : KEYWORD, 'count' : 20}
    twitter = OAuth1Session(CK, CS, AT, AS)
    req = twitter.get(API_URL, params = params)
    results = []
    if req.status_code == 200:
        tweets = json.loads(req.text)
        for tweet in tweets['statuses']:
            results.append(tweet['full_text'])
        return results
    else:
        print ("Error: %d" % req.status_code)

def get_surfaces(contents):
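    # Tokenize each tweet with MeCab and collect the surface form of every node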
    results = []
    for row in contents:
        content = format_text(row)
        tagger = MeCab.Tagger('')
        tagger.parse('')
        surf = []
        node = tagger.parseToNode(content)
        while node:
            surf.append(node.surface)
            node = node.next
        results.append(surf)
    return results

def write_txt(contents):
	try:
		if(len(contents) > 0):
			fileName = CLASS_LABEL + ".txt"
			labelText = CLASS_LABEL + ","

			f = open(fileName,'a')
			for row in contents:
				spaceTokens = "".join(row);
				result = labelText + spaceTokens + "\n"
				f.write(result)
			f.close()

		print(str(len(contents))+"行を書き込み")

	except Exception as e:
		print("テキストの書き込みに失敗")
		print(e)

def format_text(text):
	text=re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)
	text=re.sub(r'@[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)
	text=re.sub(r'&[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)
	text=re.sub(';', "", text)
	text=re.sub('RT', "", text)
	text=re.sub('\n', " ", text)
	return text

if __name__ == '__main__':
	main()

Collect tweets for three categories: 1. entertainment, 2. beauty, 3. home & living.

__label__1: entertainment (エンタメ)
-> 芸能 OR アニメ OR 漫画 OR TV OR ゲーム
__label__2: beauty (美容)
-> 肌 OR ヨガ OR 骨盤 OR ウィッグ OR シェイプ
__label__3: home & living (住まい/暮らし)
-> リフォーム OR 住宅 OR 家事 OR 収納 OR 食材

Concatenate the label 1 to 3 text files:
$ cat __label__1.txt __label__2.txt __label__3.txt > model.txt
$ ls
__label__1.txt __label__3.txt model.txt
__label__2.txt fastText tweet_get.py

learning.py

import sys
import fasttext as ft

argvs = sys.argv
input_file = argvs[1]
output_file = argvs[2]

classifier = ft.supervised(input_file, output_file)

$ python3 learning.py model.txt model
raise Exception("`supervised` is not supported any more. Please use `train_supervised`. For more information please refer to https://fasttext.cc/blog/2019/06/25/blog-post.html#2-you-were-using-the-unofficial-fasttext-module")
Exception: `supervised` is not supported any more. Please use `train_supervised`.

model = ft.train_supervised(input_file)
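
One caveat (my assumption, going by the fasttext Python API): train_supervised does not write anything to output_file by itself, so the model has to be saved explicitly:

model = ft.train_supervised(input=input_file)
model.save_model(output_file + '.bin')   # otherwise the second command-line argument goes unused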

$ python3 learning.py model.txt ftmodel
Read 0M words
Number of words: 6
Number of labels: 135
Floating point exception

Hmm, that's odd. The number of labels should be 3, not 135...
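
My best guess: write_txt joins the tokens with an empty string and glues a comma onto the label, so every line ends up as one giant token and fastText reads hundreds of different "__label__1,..." strings as labels. A sketch of the fix I would try, unverified:

def write_txt(contents):
	# fastText splits on whitespace: put a space after the label and a space between tokens
	if contents:
		with open(CLASS_LABEL + ".txt", 'a') as f:
			for row in contents:
				f.write(CLASS_LABEL + " " + " ".join(row) + "\n")
		print(str(len(contents)) + "行を書き込み")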

What is TfIdf?

### TfIdf
Tf = Term Frequency
Idf = Inverse Document Frequency

Tf is how often a word appears within a document; Idf is the inverse of how many documents in the whole corpus contain that word.
With TfIdf, words that show up in many documents are effectively ignored, while words that appear many times in one particular document are treated as important.
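
A quick toy illustration with scikit-learn's TfidfVectorizer (made-up sentences, nothing to do with the corpus here):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'cats and dogs play',
]

vect = TfidfVectorizer()
X = vect.fit_transform(docs)

print(vect.vocabulary_)       # word -> column index
print(X.toarray().round(2))   # words shared by several documents get lower weights than words unique to one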

def get_vector_by_text_list(_items):
	count_vect = TfidfVectorizer(analyzer=_split_to_words)
	# count_vect = CountVectorizer(analyzer=_split_to_words)
	box = count_vect.fit_transform(_items)
	X = box.todense()
	return [X,count_vect]

MLPClassifier is short for Multi-layer Perceptron classifier, a classifier model built on the multi-layer perceptron.
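
The bare-bones API looks like this (toy numbers, only to show fit and predict):

from sklearn.neural_network import MLPClassifier

X = [[0], [1], [2], [3], [10], [11], [12], [13]]   # toy features
y = [0, 0, 0, 0, 1, 1, 1, 1]                       # toy labels

clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300)
clf.fit(X, y)
print(clf.predict([[12]]))   # predicted class for a new sample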

Text genre classification with Python, MeCab, and deep learning

$ pip3 install numpy
$ pip3 install scikit-learn
$ pip3 install pandas

scikit-learn is a machine learning library for Python.
For the news corpus, we use the livedoor news corpus.
livedoor news corpus

$ ls
text
$ cd text
$ ls
CHANGES.txt dokujo-tsushin livedoor-homme smax
README.txt it-life-hack movie-enter sports-watch
app.py kaden-channel peachy topic-news

app.py

import os, os.path
import csv

f = open('corpus.csv', 'w')
csv_writer = csv.writer(f, quotechar="'")
files = os.listdir('./')

datas = []
for filename in files:
	if os.path.isfile(filename):
		continue

	category = filename
	for file in os.listdir('./'+filename):
		path = './'+filename+'/'+file
		r = open(path, 'r')
		line_a = r.readlines()

		text = ''
		for line in line_a[2:]:
			text += line.strip()
		r.close()

		datas.append([text, category])
		print(text)
csv_writer.writerows(datas)
f.close()

$ python3 app.py

This generates a CSV file (corpus.csv) with one row per article: the article text and its category.

### How the NN works
– Break every text into morphemes, build the full vocabulary, and for each text create a vector that records which words appear in it.
– Then train a neural network that takes this vector as input and outputs the category.

nlp_tasks.py

# -*- coding: utf-8 -*-
#! /usr/bin/python3
import MeCab
from sklearn.feature_extraction.text import CountVectorizer

def _split_to_words(text):
	tagger = MeCab.Tagger('-O wakati')
	try:
		res = tagger.parse(text.strip())
	except:
		return []
	# return a list of tokens so that CountVectorizer counts words, not characters
	return res.split()

def get_vector_by_text_list(_items):
	count_vect = CountVectorizer(analyzer=_split_to_words)
	box = count_vect.fit_transform(_items)
	X = box.todense()
	return [X,count_vect]

Create a folder called models.

main.py

# -*- coding: utf-8 -*-
#! /usr/bin/python3
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import joblib  # sklearn.externals.joblib is deprecated; use the standalone joblib package
import os.path
import nlp_tasks

from sklearn.neural_network import MLPClassifier # the MLP algorithm

def train():
	classifier = MyMLPClassifier()
	classifier.train('corpus.csv')

def predict():
	classifier = MyMLPClassifier()
	classifier.load_model()
	result = classifier.predict(u"競馬の3歳馬日本一を決めるG1レース、「第88回日本ダービー」が行われ、4番人気のシャフリヤールが優勝しました。")
	print(result)

class MyMLPClassifier():
	model = None
	model_name = "mlp"

	def load_model(self):
		if os.path.exists(self.get_model_path())==False:
			raise Exception('no model file found!')
		self.model = joblib.load(self.get_model_path())
		self.classes = joblib.load(self.get_model_path('class')).tolist()
		self.vectorizer = joblib.load(self.get_model_path('vect'))
		self.le = joblib.load(self.get_model_path('le'))

	def get_model_path(self, type='model'):
		return 'models/'+self.model_name+"_"+type+".pkl"

	def get_vector(self, text):
		return self.vectorizer.transform([text])


	def train(self, csvfile):
		df = pd.read_csv(csvfile, names=('text','category'))
		X, vectorizer = nlp_tasks.get_vector_by_text_list(df["text"])

		le = LabelEncoder()
		le.fit(df['category'])
		Y = le.transform(df['category'])

		model = MLPClassifier(max_iter=300, hidden_layer_sizes=(100,),verbose=10,)
		model.fit(X, Y)

		# save models
		joblib.dump(model, self.get_model_path())
		joblib.dump(le.classes_, self.get_model_path("class"))
		joblib.dump(vectorizer, self.get_model_path("vect"))
		joblib.dump(le, self.get_model_path("le"))

		self.model = model
		self.classes = le.classes_.tolist()
		self.vectorizer = vectorizer

	def predict(self, query):
		X = self.vectorizer.transform([query])
		key = self.model.predict(X)
		return self.classes[key[0]]

if __name__ == '__main__':
	train()
	#predict()

$ ls
corpus.csv main.py models nlp_tasks.py text

$ python3 main.py
Iteration 1, loss = 4.24610757
Iteration 2, loss = 2.08110509
Iteration 3, loss = 1.70032200
Iteration 4, loss = 1.49572560
(snip)
Iteration 136, loss = 0.00257828
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.

if __name__ == '__main__':
    # train()
    predict()

$ python3 main.py
sports-watch

Seriouslyyyyyyyyyyyy?!
Sugeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

Judging the genre of a text with Python 3 and MeCab

In NLP you often check whether words from a dictionary appear in a text; that dictionary is usually called a gazetteer.

### Print words from the gazetteer that appear in the text

import MeCab

text = '今日は晴れかな?'

weather_set = set(['晴れ','天気','雨'])

mecab = MeCab.Tagger("-Ochasen")
tokens = mecab.parse(text)
token = tokens.split("\n")

for ele in token:
	element = ele.split("\t")
	if element[0] == "EOS":
		break

	surface = element[0]

	if surface in weather_set:
		print(surface)

$ python3 app.py
晴れ

### Category classification
– You need a gazetteer for each category
– Gazetteers are often built by searching the web
– A large collection of text is called a corpus
– Raising the gazetteer's coverage is key to classification accuracy

import MeCab

text = '東京駅まで電車でどれ位かかりますか?'

weather_set = set(['晴れ','天気','雨','曇り'])
navi_set = set(['渋谷','東京','電車','地図'])

mecab = MeCab.Tagger("-Ochasen")

# def classify_category(text):
tokens = mecab.parse(text)
token = tokens.split("\n")
weather_score = 0
navi_score = 0

for ele in token:
	element = ele.split("\t")
	if element[0] == "EOS":
		break

	surface = element[0]

	if surface in weather_set:
		weather_score += 1
	if surface in navi_set:
		navi_score += 1

if weather_score > navi_score:
	print("天気")
elif weather_score < navi_score:
	print("ナビゲーション")
else:
	print("ジャンル判定不能")

$ python3 app.py
ナビゲーション

I'd like to use this to automatically judge whether a Yahoo! message board has more sell-pumping or buy-pumping posts.

When 'std::bad_alloc' occurs while installing mecab-ipadic-neologd on Vagrant

The build runs out of memory, so you need to increase the VM memory in the Vagrantfile and run make again.
Adding 1 GB of swap wasn't enough, so do a vagrant halt and raise the memory to 4 GB.

Vagrantfile

config.vm.provider "virtualbox" do |vb|
    # Display the VirtualBox GUI when booting the machine
    # vb.gui = true
  
    # Customize the amount of memory on the VM:
    vb.customize ["modifyvm", :id, "--memory", 4096]
  end

$ vagrant up
$ vagrant ssh
$ free
              total        used        free      shared  buff/cache   available
Mem:        4039428      415580     3183212         544      440636     3408324
Swap:             0           0           0

$ ./mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y
$ sudo sed -i -e "s|^dicdir.*$|dicdir = /usr/local/lib/mecab/dic/mecab-ipadic-neologd|" $(mecab-config --sysconfdir)/mecabrc

$ echo すもももももももものうち | mecab
すもももももももものうち 名詞,固有名詞,一般,*,*,*,すもももももももものうち,スモモモモモモモモノウチ,スモモモモモモモモノウチ
EOS

On EC2 you can also increase the memory freely from the console, so that solves it.
Okayyyyyyyyyyyy