[NLP] Sentiment analysis in Python with Transformers

$ pip3 install fugashi
$ pip3 install ipadic

sentiment.py

# -*- coding: utf-8 -*-
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

# Load the Japanese sentiment model once and reuse it through a pipeline
tokenizer = AutoTokenizer.from_pretrained("daigo/bert-base-japanese-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("daigo/bert-base-japanese-sentiment")

nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(nlp("私は幸福である。"))

$ python3 sentiment.py
[{'label': 'ポジティブ', 'score': 0.9843042492866516}]

Single words and neutral sentences tend to be judged as positive.
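To see this tendency, you can run a few inputs through the same pipeline; a quick sketch (the sample sentences are my own, and I haven't listed their outputs here):

# -*- coding: utf-8 -*-
from transformers import pipeline

# Same model as above: try a single word, a neutral sentence, and a clearly negative one
nlp = pipeline("sentiment-analysis",
               model="daigo/bert-base-japanese-sentiment",
               tokenizer="daigo/bert-base-japanese-sentiment")
for text in ["会議", "今日は水曜日です。", "もうダメだ"]:
    print(text, nlp(text))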

Change the input to "もうダメだ" and rerun:
$ python3 sentiment.py
[{'label': 'ネガティブ', 'score': 0.9892264604568481}]

I see.
The downside is that it takes a fair amount of processing time.

[NLP] Searching for similar words with word2vec

$ pip3 install gensim==3.8.1

### Download the word2vec model file
$ wget http://public.shiroyagi.s3.amazonaws.com/latest-ja-word2vec-gensim-model.zip

$ unzip latest-ja-word2vec-gensim-model.zip

app.py

# -*- coding: utf-8 -*-
from gensim.models import word2vec

model = word2vec.Word2Vec.load('word2vec.gensim.model')
results = model.wv.most_similar(positive=['日本'])
for result in results:
	print(result)

$ python3 app.py
('韓国', 0.7088127732276917)
('台湾', 0.6461570262908936)
('日本国内', 0.6403890252113342)
('欧米', 0.6350583434104919)
('日本国外', 0.6200590133666992)
('台湾出身', 0.6174061894416809)
('中華圏', 0.612815260887146)
('日本の経済', 0.6088820099830627)
('日本の歴史', 0.6070738434791565)
('韓国国内', 0.6054152250289917)
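most_similar also takes negative examples, so analogy-style queries are possible; a minimal sketch (whether these particular words are in this model's vocabulary is an assumption on my part):

# -*- coding: utf-8 -*-
from gensim.models import word2vec

model = word2vec.Word2Vec.load('word2vec.gensim.model')

# Analogy query: "東京" - "日本" + "フランス"; raises KeyError for out-of-vocabulary words
try:
    for word, score in model.wv.most_similar(positive=['東京', 'フランス'], negative=['日本'], topn=5):
        print(word, score)
except KeyError as e:
    print("out of vocabulary:", e)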

Note that gensim raises an error unless you pin the version as above.

OK.
Next I'll start chaining these pieces together.

[NLP] Automatic text generation in Python

This uses transformers.
$ pip3 install transformers==4.3.3 torch==1.8.0 sentencepiece==0.1.91

# -*- coding: utf-8 -*-
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("colorfulscoop/gpt2-small-ja")
model = transformers.AutoModelForCausalLM.from_pretrained("colorfulscoop/gpt2-small-ja")

# Encode the prompt and sample three continuations
input_ids = tokenizer.encode("昔々あるところに", return_tensors="pt")
output = model.generate(input_ids, do_sample=True, top_p=0.95, top_k=50, num_return_sequences=3)

print(tokenizer.batch_decode(output))

['昔々あるところには、お祭りの女神さんが現れ、そこでお姫様の姫様', '昔々あるところに、ある。ある夏の日、彼は旅人と出会い、その目的がどう', '昔々あるところに、一億年も前には人間たちが住んでいた。いまや、それはこの']

Oh, it looks like this could be used for all sorts of things...
though I still need to think about the processing in between.
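generate() also takes a max_length cap, and batch_decode() can strip the special tokens; a minimal sketch with the same model (the parameter values below are simply ones I picked):

# -*- coding: utf-8 -*-
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("colorfulscoop/gpt2-small-ja")
model = transformers.AutoModelForCausalLM.from_pretrained("colorfulscoop/gpt2-small-ja")

input_ids = tokenizer.encode("昔々あるところに", return_tensors="pt")
# Limit the length of each sample and print the three candidates one per line
output = model.generate(input_ids, do_sample=True, max_length=60,
                        top_p=0.95, top_k=50, num_return_sequences=3)
for text in tokenizer.batch_decode(output, skip_special_tokens=True):
    print(text)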

[NLP] Sentiment analysis with Python and asari

$ pip3 install asari

# -*- coding: utf-8 -*-
from asari.api import Sonar

sonar = Sonar()
print(sonar.ping(text="広告が多すぎる♡"))

$ python3 asari.py
Traceback (most recent call last):
File "asari.py", line 1, in <module>
from asari.api import Sonar
File "/home/vagrant/dev/nlp/asari.py", line 1, in <module>
from asari.api import Sonar
ModuleNotFoundError: No module named 'asari.api'; 'asari' is not a package

----
Requirement already satisfied: joblib>=0.11 in /home/vagrant/.local/lib/python3.8/site-packages (from scikit-learn>=0.19.1->asari) (1.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/vagrant/.local/lib/python3.8/site-packages (from scikit-learn>=0.19.1->asari) (2.2.0)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.7.3->pandas>=0.22.0->asari) (1.14.0)
Installing collected packages: pytz, pandas, Janome, asari
Successfully installed Janome-0.4.1 asari-0.0.4 pandas-1.3.4 pytz-2021.3

Maybe it only works on Python 2?

Let's try again in a Python 2 environment.
$ pip install scikit-learn==0.20.4
$ pip install Janome==0.3.7

[{'text': '広', 'top_class': 'positive', 'classes': [{'class_name': 'negative', 'confidence': 0.28345109397418017}, {'class_name': 'positive', 'confidence': 0.7165489060258198}]}, {'text': '告', 'top_class': 'positive', 'classes': [{'class_name': 'negative', 'confidence': 0.3450436370027103}, {'class_name': 'positive', 'confidence': 0.6549563629972897}]}, {'text': 'が', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.9006654377630458}, {'class_name': 'positive', 'confidence': 0.09933456223695437}]}, {'text': '多', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.9364137330979464}, {'class_name': 'positive', 'confidence': 0.06358626690205357}]}, {'text': 'す', 'top_class': 'positive', 'classes': [{'class_name': 'negative', 'confidence': 0.2124141523260128}, {'class_name': 'positive', 'confidence': 0.7875858476739873}]}, {'text': 'ぎ', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.5383816766180572}, {'class_name': 'positive', 'confidence': 0.4616183233819428}]}, {'text': 'る', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.6881484923868434}, {'class_name': 'positive', 'confidence': 0.3118515076131566}]}]

# -*- coding: utf-8 -*-
from asari.api import Sonar

sonar = Sonar()
res = sonar.ping(text="広告多すぎる♡")
print(res)

$ python3 app.py
{'text': '広告多すぎる♡', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.9086981552962491}, {'class_name': 'positive', 'confidence': 0.0913018447037509}]}

I see: it works once you pin the Janome and scikit-learn versions.

[SnowNLP] Chinese NLP with Python 3

$ pip3 install snownlp

tokenization

from snownlp import SnowNLP

s = SnowNLP(u'今天是周六。')
print(s.words)

$ python3 snow.py
['今天', '是', '周六', '。']

Using the part-of-speech tags, you can see nouns, adverbs, verbs, adjectives, and so on.

print(list(s.tags))

$ python3 snow.py
[('今天', 't'), ('是', 'v'), ('周六', 't'), ('。', 'w')]

pinyin

print(s.pinyin)

$ python3 snow.py
['jin', 'tian', 'shi', 'zhou', 'liu', '。']

sentences

s = SnowNLP(u'在茂密的大森林里,一只饥饿的老虎逮住了一只狐狸。老虎张开大嘴就要把狐狸吃掉。"慢着"!狐狸虽然很害怕但还是装出一副很神气的样子说,"你知道我是谁吗?我可是玉皇大帝派来管理百兽的兽王,你要是吃了我,玉皇大帝是决不会放过你的"。')
print(s.sentences)

['在茂密的大森林里', '一只饥饿的老虎逮住了一只狐狸', '老虎张开大嘴就要把狐狸吃掉', '"慢着"', '狐狸虽然很害怕但还是装出一副很神气的样子说', '"你知道我是谁吗', '我可是玉皇大帝派来管理百兽的兽王', '你要是吃了我', '玉皇大帝是决不会放过你的"']

keyword

print(s.keywords(5))

$ python3 snow.py
['狐狸', '大', '老虎', '大帝', '皇']

summary

print(s.summary(3))

['老虎张开大嘴就要把狐狸吃掉', '我可是玉皇大帝派来管理百兽的兽王', '玉皇大帝是决不会放过你的"']

sentiment analysis

text = SnowNLP(u'这个产品很好用,这个产品不好用,这个产品是垃圾,这个也太贵了吧,超级垃圾,是个垃圾中的垃圾')
sent = text.sentences
for sen in sent:
	s = SnowNLP(sen)
	print(s.sentiments)

$ python3 snow.py
0.7853504415636449
0.5098208142944668
0.13082804652201174
0.5
0.0954842128485538
0.04125325276132508

The score ranges from 0 to 1: the closer to 1, the more positive; the closer to 0, the more negative.
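If you just want a binary label, one option is to threshold the score at 0.5 (the threshold is my choice, not something SnowNLP prescribes):

# -*- coding: utf-8 -*-
from snownlp import SnowNLP

text = SnowNLP(u'这个产品很好用,这个产品不好用,这个产品是垃圾')
for sentence in text.sentences:
    score = SnowNLP(sentence).sentiments
    print(sentence, score, 'positive' if score >= 0.5 else 'negative')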

[NLTK] customize sentiment analysis

import nltk

# Words to skip: English stop words plus proper names
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

def skip_unwanted(pos_tuple):
	word, tag = pos_tuple
	if not word.isalpha() or word in unwanted:
		return False
	if tag.startswith("NN"):
		return False
	return True

positive_words = [word for word, tag in filter(
	skip_unwanted,
	nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["pos"]))
)]
negative_words = [word for word, tag in filter(
	skip_unwanted,
	nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["neg"]))
)]

positive_fd = nltk.FreqDist(positive_words)
negative_fd = nltk.FreqDist(negative_words)

common_set = set(positive_fd).intersection(negative_fd)

for word in common_set:
	del positive_fd[word]
	del negative_fd[word]

top_100_positive = {word for word, count in positive_fd.most_common(100)}
top_100_negative = {word for word, count in negative_fd.most_common(100)}

unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

positive_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
	w for w in nltk.corpus.movie_reviews.words(categories=["pos"])
	if w.isalpha() and w not in unwanted
])

negative_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
	w for w in nltk.corpus.movie_reviews.words(categories=["neg"])
	if w.isalpha() and w not in unwanted
])
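These frequency lists can then be fed into a classifier. A rough sketch of that idea, assuming the variables above (top_100_positive in particular) are still in scope; the feature design and the choice of NaiveBayesClassifier are mine:

# -*- coding: utf-8 -*-
from random import shuffle
from statistics import mean

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

def extract_features(text):
	# Two simple features: how many of the top-100 positive words appear,
	# and the (rounded) mean VADER compound score over the sentences.
	# top_100_positive comes from the code above.
	wordcount = 0
	compound_scores = []
	for sentence in nltk.sent_tokenize(text):
		for word in nltk.word_tokenize(sentence):
			if word.casefold() in top_100_positive:
				wordcount += 1
		compound_scores.append(sia.polarity_scores(sentence)["compound"])
	return {
		"mean_compound": round(mean(compound_scores), 1),
		"wordcount": wordcount,
	}

features = [
	(extract_features(nltk.corpus.movie_reviews.raw(review)), "pos")
	for review in nltk.corpus.movie_reviews.fileids(categories=["pos"])
] + [
	(extract_features(nltk.corpus.movie_reviews.raw(review)), "neg")
	for review in nltk.corpus.movie_reviews.fileids(categories=["neg"])
]

# Shuffle, train on a quarter of the reviews, and evaluate on the rest
shuffle(features)
train_count = len(features) // 4
classifier = nltk.NaiveBayesClassifier.train(features[:train_count])
print(nltk.classify.accuracy(classifier, features[train_count:]))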

[NLTK] sentiment analysis

NLTK has a built-in pretrained sentiment analyzer, VADER (Valence Aware Dictionary and sEntiment Reasoner).

import nltk
from pprint import pprint
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
pprint(sia.polarity_scores("Wow, NLTK is really powerful!"))

$ python3 app.py
{'compound': 0.8012, 'neg': 0.0, 'neu': 0.295, 'pos': 0.705}

The compound score is a normalized aggregate of the other scores and ranges from -1 to 1.
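A commonly cited convention from the VADER authors is to treat compound >= 0.05 as positive, <= -0.05 as negative, and anything in between as neutral; a small sketch of that rule:

# -*- coding: utf-8 -*-
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

def label(text, threshold=0.05):
    # compound >= 0.05 -> positive, <= -0.05 -> negative, otherwise neutral
    compound = sia.polarity_scores(text)["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(label("Wow, NLTK is really powerful!"))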

twitter corpus

from random import shuffle

tweets = [t.replace("://", "//") for t in nltk.corpus.twitter_samples.strings()]

def is_positive(tweet: str) -> bool:
    """True if tweet has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(tweet)["compound"] > 0

shuffle(tweets)
for tweet in tweets[:10]:
    print(">", is_positive(tweet), tweet)

$ python3 app.py
> False Most Tory voters not concerned which benefits Tories will cut. Benefits don’t figure in the lives if most Tory voters. #Labour #NHS #carers
> False .@uberuk you cancelled my ice cream uber order. Everyone else in the office got it but me. 🙁
> False oh no i’m too early 🙁
> False I don’t know what I’m doing for #BlockJam at all since my schedule’s just whacked right now 🙁
> False What should i do .

BAD VS PARTY AGAIN :(((((((
> True @Shadypenguinn take care! 🙂
> True Thanks to amazing 4000 Followers on Instagram
If you´re not among them yet,
feel free to connect :-)… http//t.co/ILy03AtJ83
> False RT @mac123_m: Ed Miliband has spelt it out again. No deals with the SNP.
There’s a choice:
Vote SNP get Tories
Vote LAB and get LAB http//…
> True @gus33000 but Disk Management is same since NT4 iirc 😀
Also, what UX refinements were in zdps?
> False RT @KevinJPringle: One of many bizarre things about @Ed_Miliband’s anti-SNP stance is he doesn’t reject deal with LibDems, who imposed aust…

positive_review_ids = nltk.corpus.movie_reviews.fileids(categories=["pos"])
negative_review_ids = nltk.corpus.movie_reviews.fileids(categories=["neg"])
all_review_ids = positive_review_ids + negative_review_ids

from statistics import mean

def is_positive(review_id: str) -> bool:
	"""True if the average of all sentence compound scores is positive."""
	text = nltk.corpus.movie_reviews.raw(review_id)
	scores = [
		sia.polarity_scores(sentence)["compound"]
		for sentence in nltk.sent_tokenize(text)
	]
	return mean(scores) > 0

shuffle(all_review_ids)
correct = 0
for review_id in all_review_ids:
	if is_positive(review_id):
		if review_id in positive_review_ids:
			correct += 1
	else:
		if review_id in negative_review_ids:
			correct += 1

print(F"{correct / len(all_review_ids):.2%} correct")

It's nice that the corpora are already bundled.

[NLTK] Word frequency

$ pip3 install nltk

### Download resources
NLTK can download resources such as:
– names, stopwords, state_union, twitter_samples, movie_reviews, averaged_perceptron_tagger, vader_lexicon, punkt

import nltk

nltk.download([
	"names",
	"stopwords",
	"state_union",
	"twitter_samples",
	"movie_reviews",
	"averaged_perceptron_tagger",
	"vader_lexicon",
	"punkt",
])

State of the Union corpus

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]

To filter out stop words:

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
stopwords = nltk.corpus.stopwords.words("english")
words = [w for w in words if w.lower() not in stopwords]

word_tokenize()

text = """
For some quick analysis, creating a corpus could be overkill.
If all you need is a word list,
there are simpler ways to achieve that goal.
"""
pprint(nltk.word_tokenize(text), width=79, compact=True)

most common

fd = nltk.FreqDist(words)
pprint(fd.most_common(3))

$ python3 app.py
[('must', 1568), ('people', 1291), ('world', 1128)]

specific word

fd = nltk.FreqDist(words)
pprint(fd["America"])

$ python3 app.py
1076

### concordance
Shows where a word occurs in the text.

text = nltk.Text(nltk.corpus.state_union.words())
text.concordance("america", lines=5)

$ python3 app.py
Displaying 5 of 1079 matches:
would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom
to make complete victory certain , America will never become a party to any pl
nly in law and in justice . Here in America , we have labored long and hard to

text = nltk.Text(nltk.corpus.state_union.words())
concordance_list = text.concordance_list("america", lines=2)
for entry in concordance_list:
	print(entry.line)

$ python3 app.py
would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace

other frequency distribution

words: list[str] = nltk.word_tokenize(
"""Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.""")
text = nltk.Text(words)
fd = text.vocab()
fd.tabulate(3)

collocation

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

pprint(finder.ngram_fd.most_common(2))
# tabulate() prints its own table (and returns None), so no pprint wrapper is needed
finder.ngram_fd.tabulate(2)

$ python3 app.py
[(('the', 'United', 'States'), 294), (('the', 'American', 'people'), 185)]
('the', 'United', 'States') ('the', 'American', 'people')
                        294                          185
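Besides raw frequency counts, the collocation finders can rank n-grams by association measures such as PMI; a minimal sketch (the frequency filter and the choice of PMI are my own):

# -*- coding: utf-8 -*-
import nltk

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

# Drop rare trigrams, then rank the rest by pointwise mutual information
finder.apply_freq_filter(5)
measures = nltk.collocations.TrigramAssocMeasures()
for trigram in finder.nbest(measures.pmi, 5):
    print(trigram)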

I'm convinced now that NLTK is powerful.

【Stock Investing】Checking with Python and MeCab whether buy hype or sell hype dominates on Twitter

Many skilled traders probably want a systematic way to check whether buy hype or sell hype is more common, because that hype says a lot about the psychology of the momentum traders holding positions: naturally, they hype buying when they want the stock to go up, and hype selling furiously when they want the price to fall. It may look a bit comical from the outside, but the people doing it are completely serious, and a surprising number of people actually trade under the influence of this hype. I digress a little, but the point is that if we can analyze buy hype and sell hype systematically, we can judge the situation more calmly, so we can expect investment performance to improve substantially. In this post, I use Python to analyze buy and sell hype.
I wanted to use the Yahoo! Finance message boards, but scraping them is prohibited, so instead I look at tweets that mention the ticker code and check whether buy hype or sell hype is more common.

For sentiment analysis in NLP, the gazetteer (dictionary) is extremely important.
The PN Table built by Prof. Takamura at Tokyo Institute of Technology and the Japanese Sentiment Polarity Dictionary from the Inui-Okazaki lab at Tohoku University are well known, but neither has anything to do with buy or sell hype, so I had to build my own dictionary. After some trial and error, I decided to count a tweet as buy hype or sell hype if it contains any of the following words.

### Buy-hype gazetteer
'買い','買う','特買い','安い','全力','爆益','ノンホル','超絶','本物','成長','反転','拾う','ホールド','踏み上げ','国策','ストップ高','上がる','初動','クジラ','売り豚','売り方','ハイカラ','頑張れ','貸借','反転','応援','強い','逆日歩'

### Sell-hype gazetteer
'売り','売る','下がる','大損','ナイアガラ','ガラ','ジェットコースター','ダメ','嵌め込み','養分','退場','追証','赤字','ワラント','仕手','特売り','高い','アホルダー','イナゴ','撤退','倒産','','買い豚','買い方','信用買','逃げろ','疑義','空売り','利確','損切','振い落とし'

### Fetching tweets related to the ticker
– Fetch them with tweepy and save them as a text file

import tweepy
import datetime
import re
 
keyword = "6343 -RT"
dfile = "test.txt"
 
jsttime = datetime.timedelta(hours=9)
 
Consumer_key = ''
Consumer_secret = ''
Access_token = ''
Access_secret = ''
 
auth = tweepy.OAuthHandler(Consumer_key, Consumer_secret)
auth.set_access_token(Access_token, Access_secret)
 
api = tweepy.API(auth, wait_on_rate_limit = True)
 
q = keyword
 
tweets_data = []
 
 
for tweet in tweepy.Cursor(api.search, q=q, count=5,tweet_mode='extended').items(200):
	tweets_data.append(tweet.full_text + '\n')
 
fname = r"'" + dfile + "'"
fname = fname.replace("'", "")
 
with open(fname, "w", encoding="utf-8") as f:
    f.writelines(tweets_data)

### Scoring
– Tokenize the tweets with MeCab
– Add +1 to the buy-hype score for every word from the buy-hype gazetteer, and +1 to the sell-hype score for every word from the sell-hype gazetteer

import MeCab

dfile = "test.txt"
 
fname = r"'" + dfile + "'"
fname = fname.replace("'","")
 
mecab = MeCab.Tagger("-Owakati")
 
words = []
 
with open(fname, 'r', encoding="utf-8") as f:
 
    reader = f.readline()
 
    while reader:
 
        node = mecab.parseToNode(reader)
 
        while node:
            word_type = node.feature.split(",")[0]
            if word_type in ["名詞", "動詞", "形容詞", "副詞"]:
                words.append(node.surface)
            node = node.next
        reader = f.readline()


# buy-hype gazetteer
buying_set = set(['買い','買う','特買い','安い','全力','爆益','ノンホル','超絶','本物','成長','反転','拾う','ホールド','踏み上げ','国策','ストップ高','上がる','初動','クジラ','売り豚','売り方','ハイカラ','頑張れ','貸借','反転','応援','強い','逆日歩'])

# sell-hype gazetteer
selling_set = set(['売り','売る','下がる','大損','ナイアガラ','ガラ','ジェットコースター','ダメ','嵌め込み','養分','退場','追証','赤字','ワラント','仕手','特売り','高い','アホルダー','イナゴ','撤退','倒産','','買い豚','買い方','信用買','逃げろ','疑義','空売り','利確','損切','振い落とし'])

def classify_category(words):

	buying_score = 0
	selling_score = 0

	# Count how many tokens fall into each gazetteer
	for word in words:
		if word in buying_set:
			buying_score += 1
		if word in selling_set:
			selling_score += 1

	print("買い煽りスコア:" + str(buying_score))
	print("売り煽りスコア:" + str(selling_score))

classify_category(words)

This time I looked at Freesia Macross (6343), which hit limit-up last Friday.
$ python3 app.py
買い煽りスコア:13
売り煽りスコア:0

The result: almost nothing but buy hype.
Time to put on a short.

Text classification with the Cornell sentiment data

### dataset
This uses the movie review dataset from the Cornell Natural Language Processing Group.
http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
The negative and positive data live under the txt_sentoken folder.
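For reference, fetching and unpacking the archive might look like this on a Unix-like machine (the commands are my assumption about your environment):

$ wget http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
$ tar xzf review_polarity.tar.gz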

### Sentiment Analysis with Scikit-Learn
1. import libraries and dataset
2. text preprocessing
3. converting text to numbers
4. training and test sets
5. training text classification model and predicting sentiment
6. evaluating the model
7. saving and loading the model

import numpy as np
import re
import nltk
from sklearn.datasets import load_files
nltk.download('stopwords')
nltk.download('wordnet')
import pickle
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# load the dataset
movie_data = load_files("txt_sentoken")
X, y = movie_data.data, movie_data.target

# WordNet lemmatizer used to reduce words to their base form
stemmer = WordNetLemmatizer()

documents = []

for sen in range(0, len(X)):
	# remove all the special character
	document = re.sub(r'\W', ' ', str(X[sen]))
	# remove all the single character
	document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
	# remove all the single character from the start
	document = re.sub(r'\^[a-zA-Z]\s+', ' ', document)
	# substituting multiple spaces with single space
	document = re.sub(r'\s+', ' ', document, flags=re.I)
	# Removing prefixed 'b'
	document = re.sub(r'^b\s+', '', document)
	# converting to lowercase
	document = document.lower()
	# lemmatization (convert each word to its dictionary form)
	document = document.split()

	document = [stemmer.lemmatize(word) for word in document]
	document = ' '.join(document)

	documents.append(document)

# There are bag-of-words and word-embedding approaches; here we use bag of words
# max_features keeps the 1500 most frequent words, min_df is the minimum number of documents a term must appear in, and max_df drops terms that appear in more than 70% of documents
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
# fit_transform converts the text to numeric features
X = vectorizer.fit_transform(documents).toarray()
# tfidf
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()

# training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# random forest algorithm, predicting sentiment
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

$ python3 model.py
[[180  28]
 [ 30 162]]
              precision    recall  f1-score   support

           0       0.86      0.87      0.86       208
           1       0.85      0.84      0.85       192

    accuracy                           0.85       400
   macro avg       0.85      0.85      0.85       400
weighted avg       0.85      0.85      0.85       400

0.855

# save model
with open('text_classifier', 'wb') as picklefile:
    pickle.dump(classifier,picklefile)

import pickle
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

with open('text_classifier', 'rb') as training_model:
	model = pickle.load(training_model)

y_pred2 = model.predict(X_test)

print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))
print(accuracy_score(y_test, y_pred2))

So that's how it's done with scikit-learn.
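As a side note, scikit-learn documents TfidfVectorizer as equivalent to CountVectorizer followed by TfidfTransformer, so the two-step conversion above could be collapsed into one; a minimal sketch reusing the same parameters (documents is the preprocessed list built in model.py):

# -*- coding: utf-8 -*-
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# One-step replacement for CountVectorizer + TfidfTransformer with the same parameters
vectorizer = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7,
                             stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(documents).toarray()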