[NLP] Using a Japanese corpus with ChatterBot

$ git clone https://github.com/gunthercox/chatterbot-corpus.git
$ cp -r chatterbot-corpus/chatterbot_corpus/data/japanese /home/vagrant/.local/lib/python3.8/site-packages/chatterbot_corpus/data

from chatterbot import ChatBot
from chatterbot.trainers import ChatterBotCorpusTrainer

bot = ChatBot("チャットボット試作品")

# Train on the Japanese corpus copied into site-packages above
trainer = ChatterBotCorpusTrainer(bot)
trainer.train("chatterbot.corpus.japanese")

while True:
	try:
		bot_input = bot.get_response(input())
		print(bot_input)
	except (KeyboardInterrupt, EOFError, SystemExit):
		break

出身はどちらですか
私はすべてのソフトウェアプログラムの出所です。遠く離れた銀河。
こんにちは
こちらこそ こんにちは
宜しくお願いします
a
ところで

今日は何の日ですか?
不明な変数名。
疲れたー
わかります。
プログラミングは面白い

何だって?
ピーターパークカー。
何?
それはあなたが一度だけ生きていることを意味します。あなたはそれをどこで聞いたのですか?
いいえ
顕著ではない。
その通り。
どの言語を使いたいですか?
日本語
0と1以外のすべての数字のように、多くのもの。
何のこと?
関心度
関心度
それはすべて中央銀行の行動に依存します。
経済やん
はい、そうです。
そうなのか
ロボットは嘘をついてはいけません
よくわからん

まあ
株式市場

Hmm, it's not holding a real conversation at all...
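The canned corpus replies can be supplemented with hand-written pairs. A minimal sketch using chatterbot's ListTrainer; the example pair below is my own, not part of chatterbot-corpus:

from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer

bot = ChatBot("チャットボット試作品")

# ListTrainer learns from an alternating statement/response list
list_trainer = ListTrainer(bot)
list_trainer.train([
	"宜しくお願いします",
	"こちらこそ宜しくお願いします",
])

print(bot.get_response("宜しくお願いします"))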

[NLP] Creating summaries with sumy

$ pip3 install sumy

import MeCab

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

text = ""  # put the text to summarize here

def sumy_test(text):
    # Tokenize with MeCab (the token list is only for inspection;
    # sumy's 'japanese' Tokenizer does its own segmentation below)
    tagger = MeCab.Tagger()
    key = tagger.parse(text)
    corpus = []
    for row in key.split("\n"):
        word = row.split("\t")[0]
        if word == "EOS":
            break
        corpus.append(word)

    parser = PlaintextParser.from_string(text, Tokenizer('japanese'))

    # LexRank: graph-based summarizer that picks the most central sentences
    summarizer = LexRankSummarizer()
    summarizer.stop_words = ['']
    summary = summarizer(document=parser.document, sentences_count=2)
    return "".join(str(sentence) for sentence in summary)

print(sumy_test(text))

That was easy.
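sumy ships several summarizers behind the same interface, so swapping algorithms is a one-line change. A sketch that reuses the parser above with the LSA summarizer instead of LexRank:

from sumy.summarizers.lsa import LsaSummarizer

# Same call signature as LexRankSummarizer
summarizer = LsaSummarizer()
summarizer.stop_words = ['']
summary = summarizer(document=parser.document, sentences_count=2)
print("".join(str(sentence) for sentence in summary))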

[NLP] Sentiment analysis with Python and transformers

$ pip3 install fugashi
$ pip3 install ipadic

sentiment.py

# -*- coding: utf-8 -*-
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("daigo/bert-base-japanese-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("daigo/bert-base-japanese-sentiment")

# Pass the objects loaded above instead of downloading them again by name
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(nlp("私は幸福である。"))

$ python3 sentiment.py
[{'label': 'ポジティブ', 'score': 0.9843042492866516}]

Single words and neutral sentences tend to be judged positive.

Re-running with the input changed to "もうダメだ":
$ python3 sentiment.py
[{'label': 'ネガティブ', 'score': 0.9892264604568481}]

I see.
The one flaw is how long inference takes.
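Pipelines also accept a list of inputs, which amortizes the per-call overhead across sentences. A sketch reusing the nlp pipeline built above:

# Classify several sentences in one call
texts = ["私は幸福である。", "もうダメだ"]
for text, result in zip(texts, nlp(texts)):
    print(text, result)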

[NLP] Finding similar words with word2vec

$ pip3 install gensim==3.8.1

### Download the pretrained word2vec model
$ wget http://public.shiroyagi.s3.amazonaws.com/latest-ja-word2vec-gensim-model.zip

$ unzip latest-ja-word2vec-gensim-model.zip

app.py

# -*- coding: utf-8 -*-
from gensim.models import word2vec

model = word2vec.Word2Vec.load('word2vec.gensim.model')
results = model.wv.most_similar(positive=['日本'])
for result in results:
	print(result)

$ python3 app.py
('韓国', 0.7088127732276917)
('台湾', 0.6461570262908936)
('日本国内', 0.6403890252113342)
('欧米', 0.6350583434104919)
('日本国外', 0.6200590133666992)
('台湾出身', 0.6174061894416809)
('中華圏', 0.612815260887146)
('日本の経済', 0.6088820099830627)
('日本の歴史', 0.6070738434791565)
('韓国国内', 0.6054152250289917)

Note that gensim has to be pinned to this version, or loading the model raises an error.

OK.
Next, wiring these pieces together.
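The same model also answers pairwise similarity and analogy queries through gensim's standard similarity / most_similar API. A sketch; whether any given word exists in this model's vocabulary depends on the shiroyagi training data, and missing words raise KeyError:

# Cosine similarity between two words
print(model.wv.similarity('日本', '韓国'))

# Analogy: 東京 - 日本 + 韓国 ≈ ?
for word, score in model.wv.most_similar(positive=['東京', '韓国'], negative=['日本'], topn=3):
	print(word, score)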

[NLP] Automatic text generation in Python

We'll use transformers.
$ pip3 install transformers==4.3.3 torch==1.8.0 sentencepiece==0.1.91

# -*- coding: utf-8 -*-
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("colorfulscoop/gpt2-small-ja")
model = transformers.AutoModelForCausalLM.from_pretrained("colorfulscoop/gpt2-small-ja")

# "input_ids" avoids shadowing the built-in input()
input_ids = tokenizer.encode("昔々あるところに", return_tensors="pt")
output = model.generate(input_ids, do_sample=True, top_p=0.95, top_k=50, num_return_sequences=3)

print(tokenizer.batch_decode(output))

['昔々あるところには、お祭りの女神さんが現れ、そこでお姫様の姫様', '昔々あるところに、ある。ある夏の日、彼は旅人と出会い、その目的がどう', '昔々あるところに、一億年も前には人間たちが住んでいた。いまや、それはこの']

Whoa, it feels like this could do all sorts of things...
though the in-between processing still needs some thought.
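Output length and reproducibility are worth pinning down while experimenting. A sketch using the same model; set_seed and max_length are standard transformers options:

transformers.set_seed(42)  # fixed seed so the sampled output is reproducible

input_ids = tokenizer.encode("昔々あるところに", return_tensors="pt")
output = model.generate(
    input_ids,
    do_sample=True,
    top_p=0.95,
    top_k=50,
    max_length=60,  # cap the length of each generated sequence
    num_return_sequences=3,
)
print(tokenizer.batch_decode(output, skip_special_tokens=True))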

[NLP] Sentiment analysis with Python and asari

$ pip3 install asari

from asari.api import Sonar

sonar = Sonar()
print(sonar.ping(text="広告が多すぎる♡"))

$ python3 asari.py
Traceback (most recent call last):
  File "asari.py", line 1, in <module>
    from asari.api import Sonar
  File "/home/vagrant/dev/nlp/asari.py", line 1, in <module>
    from asari.api import Sonar
ModuleNotFoundError: No module named 'asari.api'; 'asari' is not a package

----
Requirement already satisfied: joblib>=0.11 in /home/vagrant/.local/lib/python3.8/site-packages (from scikit-learn>=0.19.1->asari) (1.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/vagrant/.local/lib/python3.8/site-packages (from scikit-learn>=0.19.1->asari) (2.2.0)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.7.3->pandas>=0.22.0->asari) (1.14.0)
Installing collected packages: pytz, pandas, Janome, asari
Successfully installed Janome-0.4.1 asari-0.0.4 pandas-1.3.4 pytz-2021.3

Does it only work on Python 2? (In hindsight, the traceback shows the script itself, asari.py, shadowing the asari package; renaming the script is what actually fixes the import.)

Trying again in a Python 2 environment.
$ pip install scikit-learn==0.20.4
$ pip install Janome==0.3.7

[{'text': '広', 'top_class': 'positive', 'classes': [{'class_name': 'negative', 'confidence': 0.28345109397418017}, {'class_name': 'positive', 'confidence': 0.7165489060258198}]}, {'text': '告', 'top_class': 'positive', 'classes': [{'class_name': 'negative', 'confidence': 0.3450436370027103}, {'class_name': 'positive', 'confidence': 0.6549563629972897}]}, {'text': 'が', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.9006654377630458}, {'class_name': 'positive', 'confidence': 0.09933456223695437}]}, {'text': '多', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.9364137330979464}, {'class_name': 'positive', 'confidence': 0.06358626690205357}]}, {'text': 'す', 'top_class': 'positive', 'classes': [{'class_name': 'negative', 'confidence': 0.2124141523260128}, {'class_name': 'positive', 'confidence': 0.7875858476739873}]}, {'text': 'ぎ', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.5383816766180572}, {'class_name': 'positive', 'confidence': 0.4616183233819428}]}, {'text': 'る', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.6881484923868434}, {'class_name': 'positive', 'confidence': 0.3118515076131566}]}]

# -*- coding: utf-8 -*-
from asari.api import Sonar

sonar = Sonar()
res = sonar.ping(text="広告多すぎる♡")
print(res)

$ python3 app.py
{'text': '広告多すぎる♡', 'top_class': 'negative', 'classes': [{'class_name': 'negative', 'confidence': 0.9086981552962491}, {'class_name': 'positive', 'confidence': 0.0913018447037509}]}

I see: it works once Janome and scikit-learn are pinned to compatible versions.
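Since Sonar appears to load its model up front, it is worth constructing it once and reusing it across texts. A minimal sketch around the same ping call; the second example sentence is my own:

# -*- coding: utf-8 -*-
from asari.api import Sonar

sonar = Sonar()  # construct once; each ping reuses the loaded model
for text in ["広告多すぎる♡", "このアプリは最高です"]:
	res = sonar.ping(text=text)
	print(res["top_class"], text)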

[SnowNLP] Chinese NLP in Python 3

$ pip3 install snownlp

tokenization

from snownlp import SnowNLP

s = SnowNLP(u'今天是周六。')
print(s.words)

$ python3 snow.py
['今天', '是', '周六', '。']

Part-of-speech tagging labels each token as a noun, adverb, verb, adjective, and so on.

print(list(s.tags))

$ python3 snow.py
[('今天', 't'), ('是', 'v'), ('周六', 't'), ('。', 'w')]

pinyin

print(s.pinyin)

$ python3 snow.py
['jin', 'tian', 'shi', 'zhou', 'liu', '。']

sentences

s = SnowNLP(u'在茂密的大森林里,一只饥饿的老虎逮住了一只狐狸。老虎张开大嘴就要把狐狸吃掉。"慢着"!狐狸虽然很害怕但还是装出一副很神气的样子说,"你知道我是谁吗?我可是玉皇大帝派来管理百兽的兽王,你要是吃了我,玉皇大帝是决不会放过你的"。')
print(s.sentences)

['在茂密的大森林里', '一只饥饿的老虎逮住了一只狐狸', '老虎张开大嘴就要把狐狸吃掉', '"慢着"', '狐狸虽然很害怕但还是装出一副很神气的样子说', '"你知道我是谁吗', '我可是玉皇大帝派来管理百兽的兽王', '你要是吃了我', '玉皇大帝是决不会放过你的"']

keyword

print(s.keywords(5))

$ python3 snow.py
['狐狸', '大', '老虎', '大帝', '皇']

summary

print(s.summary(3))

['老虎张开大嘴就要把狐狸吃掉', '我可是玉皇大帝派来管理百兽的兽王', '玉皇大帝是决不会放过你的"']

sentiment analysis

text = SnowNLP(u'这个产品很好用,这个产品不好用,这个产品是垃圾,这个也太贵了吧,超级垃圾,是个垃圾中的垃圾')
sent = text.sentences
for sen in sent:
	s = SnowNLP(sen)
	print(s.sentiments)

$ python3 snow.py
0.7853504415636449
0.5098208142944668
0.13082804652201174
0.5
0.0954842128485538
0.04125325276132508

Scores range from 0 to 1: the closer to 1, the more positive; the closer to 0, the more negative.
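Turning that score into a label is a simple threshold. A sketch; the 0.5 cutoff is my own choice, not something built into SnowNLP:

from snownlp import SnowNLP

def classify(sentence, threshold=0.5):
	# 0.5 is an arbitrary cutoff, not part of SnowNLP itself
	score = SnowNLP(sentence).sentiments
	return ("positive" if score > threshold else "negative"), score

print(classify(u'这个产品很好用'))
print(classify(u'超级垃圾'))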

[NLTK] Customizing sentiment analysis

import nltk

unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

def skip_unwanted(pos_tuple):
	word, tag = pos_tuple
	if not word.isalpha() or word in unwanted:
		return False
	if tag.startswith("NN"):
		return False
	return True

positive_words = [word for word, tag in filter(
	skip_unwanted,
	nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["pos"]))
)]
negative_words = [word for word, tag in filter(
	skip_unwanted,
	nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["neg"]))
)]

positive_fd = nltk.FreqDist(positive_words)
negative_fd = nltk.FreqDist(negative_words)

common_set = set(positive_fd).intersection(negative_fd)

for word in common_set:
	del positive_fd[word]
	del negative_fd[word]

top_100_positive = {word for word, count in positive_fd.most_common(100)}
top_100_negative = {word for word, count in negative_fd.most_common(100)}

positive_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
	w for w in nltk.corpus.movie_reviews.words(categories=["pos"])
	if w.isalpha() and w not in unwanted
])

negative_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
	w for w in nltk.corpus.movie_reviews.words(categories=["neg"])
	if w.isalpha() and w not in unwanted
])
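From here, the usual next step is to turn these statistics into features and train a classifier. A sketch of that step using nltk's NaiveBayesClassifier; the feature set (count of top-100 positive words plus VADER sentence scores, as in the next section) is one plausible choice, not the only one:

from random import shuffle
from statistics import mean

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

def extract_features(text):
	# Count top-100 positive words and average VADER scores per sentence
	features = {}
	wordcount = 0
	compound_scores, positive_scores = [], []
	for sentence in nltk.sent_tokenize(text):
		for word in nltk.word_tokenize(sentence):
			if word.lower() in top_100_positive:
				wordcount += 1
		scores = sia.polarity_scores(sentence)
		compound_scores.append(scores["compound"])
		positive_scores.append(scores["pos"])
	features["mean_compound"] = mean(compound_scores) + 1  # shifted to stay positive
	features["mean_positive"] = mean(positive_scores)
	features["wordcount"] = wordcount
	return features

features = [
	(extract_features(nltk.corpus.movie_reviews.raw(review)), "pos")
	for review in nltk.corpus.movie_reviews.fileids(categories=["pos"])
] + [
	(extract_features(nltk.corpus.movie_reviews.raw(review)), "neg")
	for review in nltk.corpus.movie_reviews.fileids(categories=["neg"])
]

shuffle(features)
classifier = nltk.classify.NaiveBayesClassifier.train(features[:1500])
print(nltk.classify.accuracy(classifier, features[1500:]))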

[NLTK] sentiment analysis

NLTK has a built-in pretrained sentiment analyzer, VADER (Valence Aware Dictionary and sEntiment Reasoner).

import nltk
from pprint import pprint
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
pprint(sia.polarity_scores("Wow, NLTK is really powerful!"))

$ python3 app.py
{'compound': 0.8012, 'neg': 0.0, 'neu': 0.295, 'pos': 0.705}

compound is a normalized overall score ranging from -1 (most negative) to 1 (most positive).

twitter corpus

from random import shuffle

tweets = [t.replace("://", "//") for t in nltk.corpus.twitter_samples.strings()]

def is_positive(tweet: str) -> bool:
    """True if tweet has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(tweet)["compound"] > 0

shuffle(tweets)
for tweet in tweets[:10]:
    print(">", is_positive(tweet), tweet)

$ python3 app.py
> False Most Tory voters not concerned which benefits Tories will cut. Benefits don't figure in the lives if most Tory voters. #Labour #NHS #carers
> False .@uberuk you cancelled my ice cream uber order. Everyone else in the office got it but me. 🙁
> False oh no i'm too early 🙁
> False I don't know what I'm doing for #BlockJam at all since my schedule's just whacked right now 🙁
> False What should i do .

BAD VS PARTY AGAIN :(((((((
> True @Shadypenguinn take care! 🙂
> True Thanks to amazing 4000 Followers on Instagram
If you're not among them yet,
feel free to connect :-)… http//t.co/ILy03AtJ83
> False RT @mac123_m: Ed Miliband has spelt it out again. No deals with the SNP.
There's a choice:
Vote SNP get Tories
Vote LAB and get LAB http//…
> True @gus33000 but Disk Management is same since NT4 iirc 😀
Also, what UX refinements were in zdps?
> False RT @KevinJPringle: One of many bizarre things about @Ed_Miliband's anti-SNP stance is he doesn't reject deal with LibDems, who imposed aust…

from statistics import mean

positive_review_ids = nltk.corpus.movie_reviews.fileids(categories=["pos"])
negative_review_ids = nltk.corpus.movie_reviews.fileids(categories=["neg"])
all_review_ids = positive_review_ids + negative_review_ids

def is_positive(review_id: str) -> bool:
	"""True if the average of all sentence compound scores is positive."""
	text = nltk.corpus.movie_reviews.raw(review_id)
	scores = [
		sia.polarity_scores(sentence)["compound"]
		for sentence in nltk.sent_tokenize(text)
	]
	return mean(scores) > 0

shuffle(all_review_ids)
correct = 0
for review_id in all_review_ids:
	if is_positive(review_id):
		if review_id in positive_review_ids:
			correct += 1
	else:
		if review_id in negative_review_ids:
			correct += 1

print(f"{correct / len(all_review_ids):.2%} correct")

It's nice that the corpora come bundled.

[NLTK] Word frequency

$ pip3 install nltk

### Downloading resources
NLTK downloads the corpora and models it needs on demand:
– names, stopwords, state_union, twitter_samples, movie_reviews, averaged_perceptron_tagger, vader_lexicon, punkt

import nltk
from pprint import pprint  # used by the snippets below

nltk.download([
	"names",
	"stopwords",
	"state_union",
	"twitter_samples",
	"movie_reviews",
	"averaged_perceptron_tagger",
	"vader_lexicon",
	"punkt",
])

State of the Union corpus

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]

Filtering out stop words:

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
stopwords = nltk.corpus.stopwords.words("english")
words = [w for w in words if w.lower() not in stopwords]

word_tokenize()

text = """
For some quick analysis, creating a corpus could be overkill.
If all you need is a word list,
there are simpler ways to achieve that goal.
"""
pprint(nltk.word_tokenize(text), width=79, compact=True)

most common

fd = nltk.FreqDist(words)
pprint(fd.most_common(3))

$ python3 app.py
[('must', 1568), ('people', 1291), ('world', 1128)]

specific word

fd = nltk.FreqDist(words)
pprint(fd["America"])

$ python3 app.py
1076

### concordance
Shows where a word occurs, with its surrounding context.

text = nltk.Text(nltk.corpus.state_union.words())
text.concordance("america", lines=5)

$ python3 app.py
Displaying 5 of 1079 matches:
would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom
to make complete victory certain , America will never become a party to any pl
nly in law and in justice . Here in America , we have labored long and hard to

text = nltk.Text(nltk.corpus.state_union.words())
concordance_list = text.concordance_list("america", lines=2)
for entry in concordance_list:
	print(entry.line)

$ python3 app.py
would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace

other frequency distribution

words: list[str] = nltk.word_tokenize(
"""Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.""")
text = nltk.Text(words)
fd = text.vocab()
fd.tabulate(3)

collocation

words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

pprint(finder.ngram_fd.most_common(2))
finder.ngram_fd.tabulate(2)  # tabulate() prints its table directly and returns None

$ python3 app.py
[(('the', 'United', 'States'), 294), (('the', 'American', 'people'), 185)]
('the', 'United', 'States') ('the', 'American', 'people')
294 185
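
Bigram collocations work the same way through BigramCollocationFinder (the same class used in the customization section above). A quick sketch reusing words:

finder = nltk.collocations.BigramCollocationFinder.from_words(words)
pprint(finder.ngram_fd.most_common(2))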

NLTK is clearly powerful.