Text classification with Cornell's sentiment data

### dataset
We use the movie review dataset from the Cornell Natural Language Processing Group.
http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
The txt_sentoken folder contains the negative and positive reviews.

### Sentiment Analysis with Scikit-Learn
1. import libraries and dataset
2. text preprocessing
3. converting text to numbers
4. training and test sets
5. training text classification model and predicting sentiment
6. evaluating the model
7. saving and loading the model

import re
import pickle

import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# one-time downloads of the NLTK resources used below
nltk.download('stopwords')
nltk.download('wordnet')

# load the data
movie_data = load_files("txt_sentoken")
X, y = movie_data.data, movie_data.target

# clean up and lemmatize every review
lemmatizer = WordNetLemmatizer()

documents = []

for sen in range(0, len(X)):
    # remove all the special characters
    document = re.sub(r'\W', ' ', str(X[sen]))
    # remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    # remove a single character at the start (the caret must stay unescaped to act as an anchor)
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)
    # substitute multiple spaces with a single space
    document = re.sub(r'\s+', ' ', document)
    # remove the 'b' prefix left over from the bytes literal
    document = re.sub(r'^b\s+', '', document)
    # convert to lowercase
    document = document.lower()
    # lemmatization (reduce each word to its dictionary form)
    document = document.split()

    document = [lemmatizer.lemmatize(word) for word in document]
    document = ' '.join(document)

    documents.append(document)

# There are bag of words and word embeddings; here we use bag of words
# max_features keeps the 1500 most frequent words, min_df is the minimum number of documents a word must appear in, and max_df=0.7 drops words found in more than 70% of the documents
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
# fit_transform converts the documents to numeric features
X = vectorizer.fit_transform(documents).toarray()
# tfidf
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()

# training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# random forest algorithm, predicting sentiment
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

$ python3 model.py
[[180 28]
[ 30 162]]
precision recall f1-score support

0 0.86 0.87 0.86 208
1 0.85 0.84 0.85 192

accuracy 0.85 400
macro avg 0.85 0.85 0.85 400
weighted avg 0.85 0.85 0.85 400

0.855
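
As a sanity check, the figures in the report can be recomputed by hand from the confusion matrix. The sketch below hard-codes the matrix printed above and assumes the scikit-learn convention that rows are true labels and columns are predicted labels; the helper names are my own.

```python
# Confusion matrix copied from the output above:
# rows = true class (0 = negative, 1 = positive), columns = predicted class.
cm = [[180, 28],
      [30, 162]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision(cls):
    # of everything predicted as cls, how much really was cls
    predicted = cm[0][cls] + cm[1][cls]
    return cm[cls][cls] / predicted

def recall(cls):
    # of everything that really was cls, how much was found
    return cm[cls][cls] / sum(cm[cls])

print(accuracy)
for cls in (0, 1):
    print(cls, round(precision(cls), 2), round(recall(cls), 2))
```

The rounded values agree with the precision and recall columns of the classification report.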

# save model
with open('text_classifier', 'wb') as picklefile:
    pickle.dump(classifier, picklefile)

# load the model back and check that it predicts the same way
with open('text_classifier', 'rb') as training_model:
    model = pickle.load(training_model)

y_pred2 = model.predict(X_test)

print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))
print(accuracy_score(y_test, y_pred2))

So this is how you do it with scikit-learn.

Fetching Chinese RSS tech news (36Kr) with a batch job

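### Problem
– I want to make studying Chinese a habit and learn it effectively
– I tried studying in spare moments with an app (Busuu), but it never quite stuck
– Apart from basics such as vocabulary, I want to pick up real-world writing in parallel

So I decided to deliver an economic news RSS feed with a batch job.
36Kr seems to be one of the major Chinese tech media outlets, so I fetch its feed from rsshub.app and send it to Gmail every morning.
For mb_send_mail, mb_language is "uni" and the encoding is "utf-8".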

$rss = simplexml_load_file('https://rsshub.app/36kr/newsflashes');

mb_language("uni");
mb_internal_encoding("UTF-8");

$to = "hoge@gmail.com";
$date = date("m月d日");
$subject = "快讯 36氪 (" . $date.")";

$message = "";
$i =0;
foreach($rss->channel->item as $value){
	if($i < 10){
	$k = $i + 1;
	$message .= $k.".".$value->title ."\n";
	$message .= $value->link . "\n";
	}
	$i++;
}

$email = "hoge@hoge.jp";
mb_send_mail($to, $subject, $message, "From:".$email);

Not bad.
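
For comparison with the Python posts on this blog, the digest-building loop can also be sketched in Python. This is only an illustration: `SAMPLE_RSS` is a hypothetical stand-in for the feed body (the real script fetches the rsshub.app URL), and `build_digest` is a name I made up. Sending the mail is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the feed body; the PHP script above fetches
# https://rsshub.app/36kr/newsflashes instead.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item><title>First headline</title><link>https://example.com/1</link></item>
<item><title>Second headline</title><link>https://example.com/2</link></item>
<item><title>Third headline</title><link>https://example.com/3</link></item>
</channel></rss>"""

def build_digest(rss_text, limit=10):
    """Return a numbered 'title / link' list for the first `limit` items."""
    root = ET.fromstring(rss_text)
    lines = []
    for number, item in enumerate(root.iter("item"), start=1):
        if number > limit:
            break
        lines.append(f"{number}.{item.findtext('title')}")
        lines.append(item.findtext("link"))
    return "\n".join(lines) + "\n"

message = build_digest(SAMPLE_RSS, limit=2)
print(message)
```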

Trying Keras x CNN (Convolutional Neural Network)

Convnets have revolutionized image classification and computer vision by learning to extract features from images.

For sequential data such as text, Keras provides the Conv1D layer.

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
			loss='binary_crossentropy',
			metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 100, 100) 100
_________________________________________________________________
conv1d (Conv1D) (None, 96, 128) 64128
_________________________________________________________________
global_max_pooling1d (Global (None, 128) 0
_________________________________________________________________
dense (Dense) (None, 10) 1290
_________________________________________________________________
dense_1 (Dense) (None, 1) 11
=================================================================
Total params: 65,529
Trainable params: 65,529
Non-trainable params: 0
_________________________________________________________________
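
The shapes and parameter counts in this summary can be reproduced with a little arithmetic. In the sketch below, `embedding_dim = 100` and `maxlen = 100` are read off the embedding's output shape, and `vocab_size = 1` is inferred from its parameter count (100 = vocab_size * embedding_dim); none of these values appear explicitly in the code above, so treat them as assumptions.

```python
# Values inferred from the summary above (assumptions, see the text).
vocab_size, embedding_dim, maxlen = 1, 100, 100
filters, kernel_size = 128, 5

embedding_params = vocab_size * embedding_dim
# with "valid" padding (the Keras default) the window fits maxlen - kernel_size + 1 times
conv_steps = maxlen - kernel_size + 1
# one kernel_size x embedding_dim weight block per filter, plus one bias per filter
conv_params = kernel_size * embedding_dim * filters + filters
# GlobalMaxPooling1D keeps one value per filter, so Dense(10) sees 128 inputs
dense_params = filters * 10 + 10
output_params = 10 * 1 + 1

total = embedding_params + conv_params + dense_params + output_params
print(conv_steps, conv_params, total)
```

The three printed numbers correspond to the 96 in the Conv1D output shape, its 64128 parameters, and the 65,529 total.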

Trying GloVe pretrained word embeddings with a Keras Sequential model

– Word2Vec was developed by Google, GloVe by the Stanford NLP Group
  └ GloVe uses a co-occurrence matrix and matrix factorization

### Pretrained word embeddings
I could not download the file from the Global Vectors for Word Representation site, so I download it from Kaggle instead.
GloVe

Kaggle GloVe6B

e.g. the first 50 characters of the first line
$ head -n 1 glove.6B.50d.txt | cut -c-50
the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.04445

import numpy as np
from keras.preprocessing.text import Tokenizer

def create_embedding_matrix(filepath, word_index, embedding_dim):
	global vocab_size
	vocab_size = len(word_index) + 1 # Adding 1 because of reserved 0 index
	embedding_matrix = np.zeros((vocab_size, embedding_dim))

	with open(filepath) as f:
		for line in f:
			word, *vector = line.split()
			if word in word_index:
				idx = word_index[word]
				embedding_matrix[idx] = np.array(
					vector, dtype=np.float32)[:embedding_dim]

	return embedding_matrix

tokenizer = Tokenizer(num_words=5000)
embedding_dim = 50
embedding_matrix = create_embedding_matrix(
	'glove.6B.50d.txt',
	tokenizer.word_index, embedding_dim)

nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
print(nonzero_elements / vocab_size)

$ python3 glove.py
0.0
Hmm? Something is off...
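
A likely cause of the 0.0: the `Tokenizer` is created but `fit_on_texts` is never called, so `word_index` is still empty and no GloVe line can match. The sketch below mirrors `create_embedding_matrix` with plain lists instead of NumPy and compares an empty index with a filled one. The two-line "GloVe file" and the `fitted_index` dictionary are made up for illustration.

```python
import io

def create_embedding_matrix(lines, word_index, embedding_dim):
    """Same idea as the function above, with plain lists instead of NumPy."""
    vocab_size = len(word_index) + 1          # index 0 is reserved
    matrix = [[0.0] * embedding_dim for _ in range(vocab_size)]
    for line in lines:
        word, *vector = line.split()
        if word in word_index:
            matrix[word_index[word]] = [float(v) for v in vector[:embedding_dim]]
    return matrix

def coverage(matrix):
    """Fraction of rows that received a pretrained vector."""
    filled = sum(1 for row in matrix if any(v != 0.0 for v in row))
    return filled / len(matrix)

# Hypothetical file in GloVe's "word v1 v2 ..." layout (made-up numbers).
fake_glove = "the 0.1 0.2 0.3\nsalmon 0.4 0.5 0.6\n"

empty_index = {}                                    # word_index before fit_on_texts
fitted_index = {"the": 1, "salmon": 2, "zzz": 3}    # hypothetical index after fitting

empty_cov = coverage(create_embedding_matrix(io.StringIO(fake_glove), empty_index, 3))
fitted_cov = coverage(create_embedding_matrix(io.StringIO(fake_glove), fitted_index, 3))
print(empty_cov, fitted_cov)
```

If this is indeed the cause, calling `tokenizer.fit_on_texts(sentences_train)` before building the matrix should give a non-zero ratio.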

GlobalMaxPool1D layer

from keras.models import Sequential
from keras import layers

# (omitted)
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 50
maxlen = 100

model = Sequential()
model.add(layers.Embedding(vocab_size, embedding_dim,
							weights=[embedding_matrix],
							input_length=maxlen,
							trainable=False))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
			loss='binary_crossentropy',
			metrics=['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 100, 50) 50
_________________________________________________________________
global_max_pooling1d (Global (None, 50) 0
_________________________________________________________________
dense (Dense) (None, 10) 510
_________________________________________________________________
dense_1 (Dense) (None, 1) 11
=================================================================
Total params: 571
Trainable params: 521
Non-trainable params: 50
_________________________________________________________________
None

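The output is not what I expected. An embedding with only 50 parameters means vocab_size was 1, so the tokenizer's word_index was probably still empty.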

What does training a neural network mean?

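In deep learning, the parameter vector θ is updated repeatedly in order to minimize the cross-entropy that serves as the loss function.
Gradient descent (also called steepest descent) moves in the direction in which the loss falls most steeply.
Samples are drawn repeatedly from the training data, and the gradient is computed on them.
Such a set of samples is called a mini-batch.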

Predictions are made with the parameters of each layer; the computation runs from the input layer toward the output layer.
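
The loop described above can be sketched without any framework. This is a minimal illustration on a single sigmoid unit, not how Keras implements training: the toy data, the learning rate of 0.5, the batch size of 10 and the 500 steps are all arbitrary choices of mine.

```python
import math
import random

rng = random.Random(0)

# Toy training data (made up): the label is 1 when x1 + x2 > 1.
data = []
for _ in range(200):
    x = (rng.random(), rng.random())
    data.append((x, 1.0 if x[0] + x[1] > 1.0 else 0.0))

def predict(theta, x):
    # forward pass: weighted sum plus bias, squashed by a sigmoid
    z = theta[0] * x[0] + theta[1] * x[1] + theta[2]
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(theta, samples):
    eps = 1e-12  # avoids log(0)
    return -sum(y * math.log(predict(theta, x) + eps)
                + (1 - y) * math.log(1 - predict(theta, x) + eps)
                for x, y in samples) / len(samples)

theta = [0.0, 0.0, 0.0]          # parameter vector: two weights and a bias
learning_rate, batch_size = 0.5, 10
loss_before = cross_entropy(theta, data)

for step in range(500):
    batch = rng.sample(data, batch_size)      # draw a mini-batch
    grad = [0.0, 0.0, 0.0]
    for x, y in batch:
        error = predict(theta, x) - y         # d(loss)/dz for sigmoid + cross-entropy
        grad[0] += error * x[0] / batch_size
        grad[1] += error * x[1] / batch_size
        grad[2] += error / batch_size
    # step against the gradient, i.e. downhill
    theta = [t - learning_rate * g for t, g in zip(theta, grad)]

loss_after = cross_entropy(theta, data)
print(loss_before, loss_after)
```

The second printed loss should be clearly smaller than the first, which is the whole point of the repeated updates.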

Studying the theory alongside the coding seems to be the more efficient way.

Keras Embedding Layer

Keras Embedding layer parameters
– input_dim: the size of the vocabulary
– output_dim: the size of the dense vector
– input_length: the length of the sequence
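
To make the three parameters concrete, here is a hand-rolled lookup that behaves like an embedding layer's forward pass. It is a sketch with tiny made-up sizes and random weights, and `embed` is a hypothetical helper, not a Keras function.

```python
import random

rng = random.Random(0)
input_dim, output_dim, input_length = 6, 3, 4   # tiny, made-up sizes

# The layer's weights: one row of output_dim floats per vocabulary index.
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
                   for _ in range(input_dim)]

def embed(sequence):
    """Replace every integer index by its row in the table."""
    return [embedding_table[index] for index in sequence]

padded_sequence = [2, 5, 1, 0]                   # length == input_length
vectors = embed(padded_sequence)

print(len(vectors), len(vectors[0]))             # input_length, output_dim
print(input_dim * output_dim)                    # number of weights in the table
```

The same product explains the summaries in this post: an Embedding layer reporting 87,350 parameters at output_dim=50 corresponds to an input_dim of 1,747.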

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# (omitted)
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)

X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)

vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 50
maxlen = 100

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                            output_dim=embedding_dim,
                            input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
print(model.summary())

$ python3 test.py
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 100, 50) 87350
_________________________________________________________________
flatten (Flatten) (None, 5000) 0
_________________________________________________________________
dense (Dense) (None, 10) 50010
_________________________________________________________________
dense_1 (Dense) (None, 1) 11
=================================================================
Total params: 137,371
Trainable params: 137,371
Non-trainable params: 0
_________________________________________________________________
None

history = model.fit(X_train, y_train,
					epochs=20,
					verbose=False,
					validation_data=(X_test, y_test),
					batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))

plot_history(history)

ValueError: Failed to find data adapter that can handle input: (<class 'list'> containing values of types {"(<class 'list'> containing values of types {\"<class 'int'>\"})"}), <class 'numpy.ndarray'>

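Why is that...? Most likely X_train and X_test are still plain lists of variable-length sequences: pad_sequences is imported but never applied, so Keras cannot turn them into a tensor.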

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                            output_dim=embedding_dim,
                            input_length=maxlen))

model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 100, 50) 87350
_________________________________________________________________
global_max_pooling1d (Global (None, 50) 0
_________________________________________________________________
dense (Dense) (None, 10) 510
_________________________________________________________________
dense_1 (Dense) (None, 1) 11
=================================================================
Total params: 87,871
Trainable params: 87,871
Non-trainable params: 0
_________________________________________________________________

I'm getting a bit tangled up here...
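
To untangle the two summaries: the only difference is the layer after the embedding. `Flatten` keeps every value, while `GlobalMaxPool1D` keeps the maximum of each channel over all timesteps. A minimal pure-Python sketch on one made-up sample (the function names are mine):

```python
# One sample shaped (timesteps=3, channels=2); the numbers are made up.
sample = [[0.1, 0.9],
          [0.7, 0.2],
          [0.4, 0.5]]

def flatten(x):
    # keep every value: timesteps * channels outputs
    return [value for row in x for value in row]

def global_max_pool_1d(x):
    # keep only the largest value of each channel across all timesteps
    return [max(column) for column in zip(*x)]

print(flatten(sample))
print(global_max_pool_1d(sample))

# Parameters of the following Dense(10) layer for the (100, 50) embedding output
flatten_dense_params = 100 * 50 * 10 + 10
pooled_dense_params = 50 * 10 + 10
print(flatten_dense_params, pooled_dense_params)
```

The last line reproduces the 50010 and 510 parameters of the `dense` layer in the two summaries.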

Word embeddings with Keras

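### What is an embedding?
In natural language processing, embedding means "assigning a vector in some space to a building block of natural language such as a sentence, word, or character".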

There are various ways to vectorize text:
– Words, with each word represented as a vector
– Characters, with each character represented as a vector
– N-grams of words/characters, each represented as a vector

### One-Hot Encoding
Each word is represented by a vector as long as the vocabulary, with a 1 in that word's position and 0 everywhere else.

cities = ['London', 'Berlin', 'Berlin', 'New York', 'London']
print(cities)

$ python3 one-hot.py
['London', 'Berlin', 'Berlin', 'New York', 'London']

Label encoding

from sklearn.preprocessing import LabelEncoder

cities = ['London', 'Berlin', 'Berlin', 'New York', 'London']

encoder = LabelEncoder()
city_labels = encoder.fit_transform(cities)
print(city_labels)

$ python3 one-hot.py
[1 0 0 2 1]

Using OneHotEncoder

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

cities = ['London', 'Berlin', 'Berlin', 'New York', 'London']

encoder = LabelEncoder()
city_labels = encoder.fit_transform(cities)
encoder = OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2 use sparse=False
city_labels = city_labels.reshape((5, 1))
array = encoder.fit_transform(city_labels)
print(array)

$ python3 one-hot.py
[[0. 1. 0.]
[1. 0. 0.]
[1. 0. 0.]
[0. 0. 1.]
[0. 1. 0.]]

### Word Embeddings
Word embeddings pack more information into fewer dimensions.
They map semantic meaning into a geometric space called the embedding space.
A famous example: "King – Man + Woman = Queen"

from keras.preprocessing.text import Tokenizer
# (omitted)
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)

X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)

vocab_size = len(tokenizer.word_index) + 1 # adding 1 because of reserved 0 index

print(sentences_train[2])
print(X_train[2])

$ python3 split.py
Of all the dishes, the salmon was the best, but all were great.
[11, 43, 1, 171, 1, 283, 3, 1, 47, 26, 43, 24, 22]

for word in ['the', 'all', 'happy', 'sad']:
    print('{}: {}'.format(word, tokenizer.word_index[word]))

the: 1
all: 43
happy: 320
sad: 450

scikit-learn's CountVectorizer turns a sentence into a vector of word counts.
Keras's Tokenizer turns a sentence into a sequence of integer word indices.
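
The difference is easiest to see side by side. The sketch below imitates both behaviours in plain Python on two made-up sentences. It is a simplification: the real classes differ in tokenization details (for example, CountVectorizer drops one-character tokens by default).

```python
import re
from collections import Counter

sentences = ["the salmon was the best", "all were great"]   # made-up examples
tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]

# Bag-of-words style: one fixed-length count vector per sentence.
vocabulary = sorted({word for words in tokenized for word in words})
count_vectors = [[words.count(term) for term in vocabulary] for words in tokenized]

# Tokenizer style: each word becomes an integer, most frequent word first,
# starting at 1 (0 is kept free for padding).
frequency = Counter(word for words in tokenized for word in words)
word_index = {word: rank
              for rank, (word, _) in enumerate(frequency.most_common(), start=1)}
sequences = [[word_index[word] for word in words] for words in tokenized]

print(vocabulary)
print(count_vectors)
print(sequences)
```

The count vectors all have the vocabulary's length and lose word order, while the index sequences keep the order but vary in length, which is why they need padding before they go into a network.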

pad_sequences

from keras.preprocessing.sequence import pad_sequences

maxlen = 100

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

print(X_train[0, :])

$ python3 split.py
raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

What...?
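
This TypeError is what `len()` raises on a SciPy sparse matrix, so `X_train` at this point was probably still the `CountVectorizer` output rather than the list returned by `texts_to_sequences`. For reference, here is a minimal pure-Python sketch of what `pad_sequences(..., padding='post', maxlen=...)` does to such a list. `pad_post` is a hypothetical helper, not part of Keras, and the sequences are made up.

```python
def pad_post(sequences, maxlen, value=0):
    """Mimics pad_sequences(..., padding='post') with its default truncating='pre'."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                 # too long: keep the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# texts_to_sequences returns a plain list of variable-length lists like this one
sequences = [[11, 43, 1], [5], [7, 8, 9, 10, 12, 13]]
padded = pad_post(sequences, maxlen=5)
print(padded)
```

Every row now has the same length, so the result can be converted to a rectangular array.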

Text classification with TensorFlow and Keras

– Neural network model

> We have to multiply each input node by a weight w and add a bias b.
> It is generally common to use a rectified linear unit (ReLU) for hidden layers, a sigmoid function for the output layer in a binary classification problem, or a softmax function for the output layer of multi-class classification problems.
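
The quoted description can be turned into a few lines of plain Python. This is a sketch of a forward pass only, with made-up weights; `dense` is a hypothetical helper that imitates what a fully connected layer computes.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for every output node."""
    return [activation(sum(w * x for w, x in zip(node_weights, inputs)) + b)
            for node_weights, b in zip(weights, biases)]

# Made-up numbers: 3 input features -> 2 hidden nodes (ReLU) -> 1 output node (sigmoid).
x = [1.0, 0.0, 2.0]
hidden = dense(x, [[0.5, -1.0, 0.25], [-1.0, 0.3, 0.1]], [0.0, 0.1], relu)
output = dense(hidden, [[1.0, -2.0]], [-1.0], sigmoid)

print(hidden)
print(output)
```

The second hidden node has a negative weighted sum, so ReLU clips it to zero, and the sigmoid output lies between 0 and 1 so it can be read as a probability for the positive class.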

### Keras
– Keras is a deep learning and neural networks API by Francois Chollet
$ pip3 install keras

Keras needs TensorFlow running as its backend, so I install TensorFlow on Amazon Linux 2.
$ pip3 install tensorflow
$ python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
tf.Tensor(-784.01, shape=(), dtype=float32)
The installation seems to have worked.

from keras.models import Sequential
from keras import layers

# (omitted)
input_dim = X_train.shape[1]

model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
				optimizer='adam',
				metrics=['accuracy'])
print(model.summary())

$ python3 split.py
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 10) 17150
_________________________________________________________________
dense_1 (Dense) (None, 1) 11
=================================================================
Total params: 17,161
Trainable params: 17,161
Non-trainable params: 0
_________________________________________________________________
None

### batch size

history = model.fit(X_train, y_train,
					epochs=100,
					verbose=False,
					validation_data=(X_test, y_test),
					batch_size=10)

### evaluate accuracy

loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))

$ python3 split.py
Training Accuracy: 1.0000
Testing Accuracy: 0.8040

### matplotlib
$ pip3 install matplotlib

import matplotlib.pyplot as plt

# (omitted)
def plot_history(history):
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
    plt.savefig("img.png")

plot_history(history)

Whoa, this is pretty cool.

Text classification in Python with an English dataset, part 1

#### choosing a Data Set
Sentiment Labelled Sentences Data Set
https://archive.ics.uci.edu/ml/machine-learning-databases/00331/

※ Yelp is a business review site (something like Tabelog)
※ IMDb is a review site for movies, TV shows, and the like

We get the English positive/negative datasets from the URL above.
$ ls
amazon_cells_labelled.txt imdb_labelled.txt readme.txt yelp_labelled.txt

import pandas as pd 

filepath_dict = {
	'yelp': 'data/yelp_labelled.txt',
	'amazon': 'data/amazon_cells_labelled.txt',
	'imdb': 'data/imdb_labelled.txt'
}

df_list = []
for source, filepath in filepath_dict.items():
	df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
	df['source'] = source
	df_list.append(df)

df = pd.concat(df_list)
print(df.iloc[0])

$ python3 app.py
sentence Wow… Loved this place.
label 1
source yelp
Name: 0, dtype: object

With this data, we predict the sentiment of a sentence.
Each sentence is vectorized over the vocabulary, and the model learns weights to make the decision.
>>> sentences = ['John likes ice cream', 'John hates chocolate.']
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer(min_df=0, lowercase=False)
>>> vectorizer.fit(sentences)
CountVectorizer(lowercase=False, min_df=0)
>>> vectorizer.vocabulary_
{'John': 0, 'likes': 5, 'ice': 4, 'cream': 2, 'hates': 3, 'chocolate': 1}
>>> vectorizer.transform(sentences).toarray()
array([[1, 0, 1, 0, 1, 1],
       [1, 1, 0, 1, 0, 0]])
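
The vocabulary and the two rows above can be reproduced by hand, which shows what the vectorizer is doing. This is a sketch, not scikit-learn's implementation; it only assumes the default behaviour of keeping tokens of two or more word characters and sorting the vocabulary.

```python
import re

sentences = ['John likes ice cream', 'John hates chocolate.']

# Tokens of two or more word characters, as in CountVectorizer's default pattern.
tokenized = [re.findall(r"\b\w\w+\b", s) for s in sentences]

# Sorted vocabulary -> column index (uppercase sorts before lowercase).
vocabulary = {term: i for i, term in
              enumerate(sorted({t for tokens in tokenized for t in tokens}))}

# One row per sentence, one column per vocabulary term, value = occurrence count.
vectors = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]

print(vocabulary)
print(vectors)
```

Column 0 is 'John' because the vocabulary is sorted and `lowercase=False` keeps the capital letter, which sorts before the lowercase words.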

### Defining Baseline Model
First, split the data into a training and testing set

from sklearn.model_selection import train_test_split
import pandas as pd 

filepath_dict = {
	'yelp': 'data/yelp_labelled.txt',
	'amazon': 'data/amazon_cells_labelled.txt',
	'imdb': 'data/imdb_labelled.txt'
}

df_list = []
for source, filepath in filepath_dict.items():
	df = pd.read_csv(filepath, names=['sentence', 'label'], sep='\t')
	df['source'] = source
	df_list.append(df)

df = pd.concat(df_list)

df_yelp = df[df['source'] == 'yelp']
sentences = df_yelp['sentence'].values
y = df_yelp['label'].values

sentences_train, sentences_test, y_train, y_test = train_test_split(
	sentences, y, test_size=0.25, random_state=1000)

.values returns a NumPy array

from sklearn.feature_extraction.text import CountVectorizer

# (omitted)

sentences_train, sentences_test, y_train, y_test = train_test_split(
	sentences, y, test_size=0.25, random_state=1000)

vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)

X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
print(X_train)

$ python3 split.py
(0, 125) 1
(0, 145) 1
(0, 201) 1
(0, 597) 1
(0, 600) 1
(0, 710) 1
(0, 801) 2
(0, 888) 1
(0, 973) 1
(0, 1042) 1
(0, 1308) 1
(0, 1345) 1
(0, 1360) 1
(0, 1494) 2
(0, 1524) 2
(0, 1587) 1
(0, 1622) 1
(0, 1634) 1
(1, 63) 1
(1, 136) 1
(1, 597) 1
(1, 616) 1
(1, 638) 1
(1, 725) 1
(1, 1001) 1
: :
(746, 1634) 1
(747, 42) 1
(747, 654) 1
(747, 1193) 2
(747, 1237) 1
(747, 1494) 1
(747, 1520) 1
(748, 600) 1
(748, 654) 1
(748, 954) 1
(748, 1001) 1
(748, 1494) 1
(749, 14) 1
(749, 15) 1
(749, 57) 1
(749, 108) 1
(749, 347) 1
(749, 553) 1
(749, 675) 1
(749, 758) 1
(749, 801) 1
(749, 1010) 1
(749, 1105) 1
(749, 1492) 1
(749, 1634) 2

#### LogisticRegression

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)

print("Accuracy:", score)

$ python3 split.py
Accuracy: 0.796

for source in df['source'].unique():
	df_source = df[df['source'] == source]
	sentences = df_source['sentence'].values
	y = df_source['label'].values

	sentences_train, sentences_test, y_train, y_test = train_test_split(
		sentences, y, test_size=0.25, random_state=1000)

	vectorizer = CountVectorizer()
	vectorizer.fit(sentences_train)
	X_train = vectorizer.transform(sentences_train)
	X_test = vectorizer.transform(sentences_test)

	classifier = LogisticRegression()
	classifier.fit(X_train, y_train)
	score = classifier.score(X_test, y_test)
	print('Accuracy for {} data: {:.4f}'.format(source, score))

$ python3 split.py
Accuracy for yelp data: 0.7960
Accuracy for amazon data: 0.7960
Accuracy for imdb data: 0.7487